This document describes the implementation of "MARIA: A multimodal transformer model for incomplete healthcare data" (MARIA) in PyTorch. MARIA is a multimodal learning framework specifically designed for the analysis of biomedical tabular data, with a focus on addressing missing values across multiple data modalities without the need for any imputation strategy.
Building upon NAIM, MARIA extends the attention-based missing value handling to the multimodal setting, where each modality is processed by a dedicated modality-specific network. The framework supports three fusion strategies to combine information from different modalities:
- Early Fusion: features from all modalities are concatenated before being fed into a single model.
- Joint Fusion: each modality is processed by a dedicated network, and their intermediate representations are combined through a shared network.
- Late Fusion: each modality is processed independently and the final predictions are combined at decision level.
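The three strategies can be sketched in plain Python (illustrative stand-ins, not the repository's actual modules; `model`, `encoders`, `head`, and `models` are hypothetical callables, and per-modality feature vectors are plain lists):

```python
# Sketch of the three fusion strategies on toy list-based "feature vectors".
# None of these names come from the MARIA codebase; they only illustrate
# where the combination of modalities happens in each strategy.

def early_fusion(modalities, model):
    # Concatenate all modality features, then apply a single model.
    fused = [x for m in modalities for x in m]
    return model(fused)

def joint_fusion(modalities, encoders, head):
    # Encode each modality with its dedicated network, combine the
    # intermediate representations, then apply a shared head.
    reps = [enc(m) for enc, m in zip(encoders, modalities)]
    fused = [x for r in reps for x in r]
    return head(fused)

def late_fusion(modalities, models):
    # Each modality produces its own prediction; combine at decision
    # level (here: simple averaging of the per-modality scores).
    preds = [model(m) for model, m in zip(models, modalities)]
    return [sum(p) / len(preds) for p in zip(*preds)]
```

The key difference is *where* information is merged: before any model (early), at the representation level (joint), or only at the prediction level (late).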
At every epoch, features are randomly masked (MCAR - Missing Completely At Random) across modalities to prevent co-adaptations among features and to enhance the model's generalization capability, enabling robust performance even in the presence of high percentages of missing values.
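This per-epoch masking can be sketched with a minimal stand-alone helper (standard library only; `mcar_mask` is an illustrative name, not a function from this repository):

```python
import random

def mcar_mask(n_samples, n_features, missing_rate, seed=None):
    """Boolean mask where True marks a value to hide (MCAR).

    Each entry is masked independently with probability `missing_rate`,
    regardless of its feature, sample, or observed value -- the defining
    property of Missing Completely At Random.
    """
    rng = random.Random(seed)
    return [[rng.random() < missing_rate for _ in range(n_features)]
            for _ in range(n_samples)]
```

Drawing a fresh mask at every epoch means the model rarely sees the same missingness pattern twice, which discourages co-adaptation among features.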
We used Python 3.9 for the development of the code. To install the required packages, it is sufficient to run the following command:

```shell
pip install -r requirements.txt
```

Then install a version of PyTorch compatible with the available device; we used `torch==1.13.0`.
The execution of the code relies heavily on Facebook's Hydra library.
Specifically, through a multitude of configuration files that define every aspect of the experiment, it is possible to conduct the desired experiment without modifying the code.
These configuration files have a hierarchical structure through which they are composed into a single configuration file that serves as input to the program.
More specifically, `main.py` loads `config.yaml`, from which the configuration file tree begins.
The framework supports five pipeline types:
| Pipeline | Description |
|---|---|
| `simple` | Single modality classification using all available data |
| `missing` | Single modality classification with missing data generation |
| `multimodal_early_fusion` | Early fusion multimodal classification with missing data generation |
| `multimodal_joint_fusion` | Joint fusion multimodal classification with missing data generation |
| `multimodal_late_fusion` | Late fusion multimodal classification with missing data generation |
The experiments presented in the paper use two biomedical datasets, each comprising multiple modalities:
| Dataset | Task | Modalities |
|---|---|---|
| ADNI | Diagnosis (CN, EMCI, LMCI, AD) | Assessment, Biospecimen, Image Analysis, Subject Characteristics |
| ADNI | Prognosis | Assessment, Biospecimen, Image Analysis, Subject Characteristics |
| AI4Covid | Death prediction | Blood, History, Personal, State |
| AI4Covid | Prognosis | Blood, History, Personal, State |
The experiment configuration files for each dataset, task, and fusion strategy are available in the ./confs/experiment/ folder.
For example, to run the ADNI diagnosis experiment with joint fusion using NAIM:

```shell
python main.py experiment=ADNI_diagnosis_joint_NAIM
```

Similarly, for AI4Covid death prediction with early fusion:

```shell
python main.py experiment=AI4Covid_early_death
```

The available experiment configurations follow the naming convention `<Dataset>_<task>_<fusion>_<model>.yaml` for joint fusion experiments and `<Dataset>_<fusion>_<task>.yaml` for early and late fusion experiments.
These experiments generate different percentages of missing values (MCAR) in the training and testing sets.
Specifically, the percentages used are indicated by `missing_percentages` in each experiment configuration file (e.g., `[0.0, 0.05, 0.1, 0.3, 0.5, 0.75]`).
For single-modality experiments with missing data generation, the `classification_with_missing_generation.yaml` configuration can be used:

```shell
python main.py experiment=classification_with_missing_generation experiment/databases@db=ADNI_diagnosis_assessment
```

For each experiment, this code produces a folder named `<experiment-name>/<experiment-subname>` which contains everything generated by the code.
In particular, the following folders and files are present:
- `cross_validation`: this folder contains a folder for each training fold, named as a composition of test and validation folds (`<test-fold>_<val-fold>`), reporting the information on the train, validation, and test sets in 3 separate CSV files.
- `preprocessing`: this folder contains all the preprocessing information, divided into 3 main folders:
  - `numerical_preprocessing`: for each percentage of missing values considered, a CSV file for each fold reporting the preprocessing parameters of the numerical features.
  - `categorical_preprocessing`: for each percentage of missing values considered, a CSV file for each fold reporting the preprocessing parameters of the categorical features.
  - `imputer`: for each percentage of missing values considered, CSV files for each fold with information on the imputation strategy applied to handle missing values, and a PKL file containing the imputer fitted on the training data of the fold.
- `saved_models`: this folder contains, for each percentage of missing values considered, a folder with the model's name that includes, for each fold, a CSV file with the model's parameters and a PKL or PTH file containing the trained model.
- `predictions`: this folder contains, for each percentage of missing values considered, a folder reporting the predictions obtained on the training and validation sets and, separately, those on the test set.
- `results`: this folder reports, for each percentage of missing values considered, the performance on the train, validation, and test sets separately. Specifically, for each set, two folders named `balanced` and `unbalanced` contain the performance, presented in 3 separate files with increasing levels of averaging:
  - `all_test_performance.csv`: performance evaluated for each fold and each class.
  - `classes_average_performance.csv`: average performance over the folds for each class.
  - `set_average_performance.csv`: average performance across folds and classes.
- `config.yaml`: the configuration file used as input for the experiment.
- `<experiment-name>.log`: the log file of the experiment.
NOTE: If an experiment is interrupted, voluntarily or not, it can be resumed from where it stopped by setting the `continue_experiment` parameter to `True` in the experiment configuration file.
As mentioned above, the experiment configuration file is created at the time of code execution starting from the
config.yaml file, in which the configuration file for the experiment to be performed is declared,
along with the device to use and the system paths configuration.
```yaml
device: cuda # cpu, cuda, or mps

defaults:
  - _self_
  - experiment: multimodal_joint_fusion_classification_with_missing_generation # Experiment to perform
  - experiment/paths/system@: local # System paths configuration
```

The possible options for the `experiment` parameter are the experiment configuration files contained in the ./confs/experiment/ folder.
The main experiment types are:
- `classification`: single modality classification
- `classification_with_missing_generation`: single modality classification with MCAR missing data generation
- `multimodal_early_fusion_classification_with_missing_generation`: early fusion multimodal classification
- `multimodal_joint_fusion_classification_with_missing_generation`: joint fusion multimodal classification
- `multimodal_late_fusion_classification_with_missing_generation`: late fusion multimodal classification
To prepare a dataset for the analysis with this code, it is sufficient to prepare a configuration file, specific for the
dataset, similar to those already provided in the folder ./confs/experiment/databases.
The path to the data must be specified in the `path` parameter of the dataset's configuration file.
Thanks to Hydra's interpolation functionality, the path can be composed using the `${data_path}` interpolation key.
Once the dataset configuration file is prepared, it is important that it is placed in the ./confs/experiment/databases folder.
In particular, it is important that the dataset configuration file is structured as follows:
```yaml
_target_: CMC_utils.datasets.ClassificationDataset # DO NOT CHANGE
_convert_: all # DO NOT CHANGE

name: <dataset-name> # Name of the dataset
db_type: tabular # DO NOT CHANGE
classes: ["<class-1-name>", ..., "<class-n-name>"] # List of the classes
label_type: multiclass # multiclass or binary
task: classification # DO NOT CHANGE

path: ${data_path}/<relative-path-to-file> # Relative path to the file

columns: # Dictionary containing feature names as keys and their types as values
  <ID-name>: id # Name of the ID column, if present
  <feature-1-name>: <feature-type> # int, float or category
  <feature-2-name>: <feature-type> # int, float or category
  # Other features to be inserted
  <label-name>: target # Name of the target column

pandas_load_kwargs:
  na_values: [ "?" ]
  header: 0
  index_col: 0

dataset_class: # DO NOT CHANGE
  _target_: CMC_utils.datasets.SupervisedTabularDatasetTorch # DO NOT CHANGE
  _convert_: all # DO NOT CHANGE
```

In the `columns` definition, the `id` and `target` feature types can be used to define the ID and classes columns, respectively.
For multimodal experiments, multiple dataset configuration files must be prepared (one per modality) and referenced in the experiment configuration file using the `databases@dbs.<index>` key.
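As an illustrative sketch of how per-modality dataset configurations might be wired into an experiment's defaults list (the group path follows the `databases@dbs.<index>` convention; the modality config names are placeholders, not actual file names from the repository):

```yaml
defaults:
  - databases@dbs.0: <modality-1-config>
  - databases@dbs.1: <modality-2-config>
```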
The experiment configuration file defines the specifics for conducting the desired pipeline. It begins with some general information, such as the name of the experiment, the pipeline to be executed, the seed for randomness control, training verbosity, and the percentages of missing values to be tested.
```yaml
experiment_name: ${db_name}_${seed}_multimodal_joint_fusion_classification_with_missing_generation
pipeline: multimodal_joint_fusion
seed: 42
verbose: 1
continue_experiment: False
missing_percentages: [0.0, 0.05, 0.1, 0.25, 0.5]
```

Then, all other necessary configuration files for the different parts of the experiment are declared. The possible options for each part are listed in the table below.
To modify some of the models' hyperparameters, it is possible to edit the `ml_params` and `dl_params` files.
For the ML models it is possible to define the number of estimators (`n_estimators`); for the DL models it is possible to define the maximum number of epochs (`max_epochs`), the number of warm-up epochs (`min_epochs`), the batch size (`batch_size`), the early stopping's (`early_stopping_patience`) and the scheduler's (`scheduler_patience`) patience and their tolerance for performance improvement (`performance_tolerance`), and the device to use for training (`device`).
It is also possible to define the learning rates to be tested (`learning_rates`); however, for compatibility with some of the competitors available in the models list, the initial learning rate (`init_learning_rate`) and the final learning rate (`end_learning_rate`) must also be defined.
```yaml
n_estimators: 100 # Number of estimators for the ML models

max_epochs: 1500 # Maximum number of epochs
min_epochs: 50 # Warm-up number of epochs
batch_size: 32 # Batch size
init_learning_rate: 1e-3 # Initial learning rate
end_learning_rate: 1e-8 # Final learning rate
learning_rates: [1e-3, 1e-4, 1e-5, 1e-6, 1e-7] # Learning rates for the scheduler
early_stopping_patience: 50 # Patience for the early stopping
scheduler_patience: 25 # Patience for the scheduler
performance_tolerance: 1e-3 # Tolerance for the performance improvement
device: cuda # cpu, cuda, or mps; device to use for training
```

For any questions, please contact camillomaria.caruso@unicampus.it and valerio.guarrasi@unicampus.it.
```bibtex
@article{caruso2025maria,
  title={MARIA: A multimodal transformer model for incomplete healthcare data},
  author={Caruso, Camillo Maria and Soda, Paolo and Guarrasi, Valerio},
  journal={Computers in Biology and Medicine},
  volume={196},
  pages={110843},
  year={2025},
  publisher={Elsevier},
  doi={10.1016/j.compbiomed.2025.110843},
}
```