cosbidev/MARIA

MARIA

  1. Installation
  2. Usage
    1. Reproducing the experiments
    2. Train & Test on your dataset
      1. Experiment declaration
      2. Dataset preparation
      3. Experiment configuration
  3. Contact
  4. Citation

This document describes the PyTorch implementation of "MARIA: A multimodal transformer model for incomplete healthcare data" (MARIA). MARIA is a multimodal learning framework designed for the analysis of biomedical tabular data. It handles missing values across multiple data modalities without requiring any imputation strategy.

Building upon NAIM, MARIA extends the attention-based missing value handling to the multimodal setting, where each modality is processed by a dedicated modality-specific network. The framework supports three fusion strategies to combine information from different modalities:

  • Early Fusion: features from all modalities are concatenated before being fed into a single model.
  • Joint Fusion: each modality is processed by a dedicated network, and their intermediate representations are combined through a shared network.
  • Late Fusion: each modality is processed independently and the final predictions are combined at decision level.
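The difference between early and late fusion can be illustrated with a minimal, framework-free sketch. It assumes that early fusion concatenates per-modality feature vectors in a fixed order and that late fusion averages per-modality class probabilities. Averaging is only one possible decision-level rule and is not confirmed to be the one used in MARIA. The function names are hypothetical and do not come from the repository.

```python
from typing import Dict, List, Sequence


def early_fusion_input(modalities: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the feature vectors of all modalities in sorted-name order."""
    fused: List[float] = []
    for name in sorted(modalities):
        fused.extend(modalities[name])
    return fused


def late_fusion_average(per_modality_probs: Dict[str, Sequence[float]]) -> List[float]:
    """Average class probabilities predicted independently for each modality.

    Assumption: plain averaging; other decision-level rules are possible.
    """
    vectors = list(per_modality_probs.values())
    n_classes = len(vectors[0])
    return [sum(v[c] for v in vectors) / len(vectors) for c in range(n_classes)]


# Hypothetical feature values for two AI4Covid modalities.
features = {"blood": [1.0, 2.0], "history": [3.0]}
fused_features = early_fusion_input(features)

# Hypothetical per-modality class probabilities for a binary task.
fused_probs = late_fusion_average({"blood": [0.2, 0.8], "history": [0.6, 0.4]})
```

Joint fusion sits between the two: the per-modality networks produce intermediate representations that a shared network combines, so it cannot be reduced to a simple list operation like the ones above.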

At every epoch, features are randomly masked (MCAR - Missing Completely At Random) across modalities to prevent co-adaptations among features and to enhance the model's generalization capability, enabling robust performance even in the presence of high percentages of missing values.


Installation

The code was developed with Python 3.9. To install the required packages, run the following command:

pip install -r requirements.txt

and install a version of PyTorch compatible with the device available. We used torch==1.13.0.


Usage

The execution of the code relies heavily on Facebook's Hydra library. A set of configuration files defines every aspect of an experiment, so the desired experiment can be run without modifying the code. These files are organized hierarchically and composed into a single configuration that serves as input to the program. More specifically, main.py loads config.yaml, which is the root of the configuration tree.

The framework supports five pipeline types:

| Pipeline | Description |
|---|---|
| simple | Single modality classification using all available data |
| missing | Single modality classification with missing data generation |
| multimodal_early_fusion | Early fusion multimodal classification with missing data generation |
| multimodal_joint_fusion | Joint fusion multimodal classification with missing data generation |
| multimodal_late_fusion | Late fusion multimodal classification with missing data generation |

Reproducing the experiments

The experiments presented in the paper use two biomedical datasets, each comprising multiple modalities:

| Dataset | Task | Modalities |
|---|---|---|
| ADNI | Diagnosis (CN, EMCI, LMCI, AD) | Assessment, Biospecimen, Image Analysis, Subject Characteristics |
| ADNI | Prognosis | Assessment, Biospecimen, Image Analysis, Subject Characteristics |
| AI4Covid | Death prediction | Blood, History, Personal, State |
| AI4Covid | Prognosis | Blood, History, Personal, State |

The experiment configuration files for each dataset, task, and fusion strategy are available in the ./confs/experiment/ folder. For example, to run the ADNI diagnosis experiment with joint fusion using NAIM:

python main.py experiment=ADNI_diagnosis_joint_NAIM

Similarly, for AI4Covid death prediction with early fusion:

python main.py experiment=AI4Covid_early_death

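The available experiment configurations follow a dataset-specific naming convention: <Dataset>_<task>_<fusion>.yaml for ADNI and <Dataset>_<fusion>_<task>.yaml for AI4Covid. Joint fusion configurations additionally end with a _<model> suffix (e.g., ADNI_diagnosis_joint_NAIM.yaml and AI4Covid_joint_death_NAIM.yaml):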

| Dataset | Task | Fusion | Config file |
|---|---|---|---|
| ADNI | Diagnosis | Early | ADNI_diagnosis_early.yaml |
| ADNI | Diagnosis | Joint (NAIM) | ADNI_diagnosis_joint_NAIM.yaml |
| ADNI | Diagnosis | Joint (FTTransformer) | ADNI_diagnosis_joint_FTTransformer.yaml |
| ADNI | Diagnosis | Joint (MLP) | ADNI_diagnosis_joint_MLP.yaml |
| ADNI | Diagnosis | Joint (TabNet) | ADNI_diagnosis_joint_TabNet.yaml |
| ADNI | Diagnosis | Joint (TabTransformer) | ADNI_diagnosis_joint_TabTransformer.yaml |
| ADNI | Diagnosis | Late | ADNI_diagnosis_late.yaml |
| ADNI | Prognosis | Early | ADNI_prognosis_early.yaml |
| ADNI | Prognosis | Joint (NAIM) | ADNI_prognosis_joint_NAIM.yaml |
| ADNI | Prognosis | Joint (FTTransformer) | ADNI_prognosis_joint_FTTransformer.yaml |
| ADNI | Prognosis | Joint (MLP) | ADNI_prognosis_joint_MLP.yaml |
| ADNI | Prognosis | Joint (TabNet) | ADNI_prognosis_joint_TabNet.yaml |
| ADNI | Prognosis | Joint (TabTransformer) | ADNI_prognosis_joint_TabTransformer.yaml |
| ADNI | Prognosis | Late | ADNI_prognosis_late.yaml |
| AI4Covid | Death | Early | AI4Covid_early_death.yaml |
| AI4Covid | Death | Joint (NAIM) | AI4Covid_joint_death_NAIM.yaml |
| AI4Covid | Death | Joint (FTTransformer) | AI4Covid_joint_death_FTTransformer.yaml |
| AI4Covid | Death | Joint (MLP) | AI4Covid_joint_death_MLP.yaml |
| AI4Covid | Death | Joint (TabNet) | AI4Covid_joint_death_TabNet.yaml |
| AI4Covid | Death | Joint (TabTransformer) | AI4Covid_joint_death_TabTransformer.yaml |
| AI4Covid | Death | Late | AI4Covid_late_death.yaml |
| AI4Covid | Prognosis | Early | AI4Covid_early_prognosis.yaml |
| AI4Covid | Prognosis | Joint (NAIM) | AI4Covid_joint_prognosis_NAIM.yaml |
| AI4Covid | Prognosis | Joint (FTTransformer) | AI4Covid_joint_prognosis_FTTransformer.yaml |
| AI4Covid | Prognosis | Joint (MLP) | AI4Covid_joint_prognosis_MLP.yaml |
| AI4Covid | Prognosis | Joint (TabNet) | AI4Covid_joint_prognosis_TabNet.yaml |
| AI4Covid | Prognosis | Joint (TabTransformer) | AI4Covid_joint_prognosis_TabTransformer.yaml |
| AI4Covid | Prognosis | Late | AI4Covid_late_prognosis.yaml |
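When scripting several runs, the file names listed in the table can be generated rather than typed by hand. The sketch below mirrors the pattern visible in the table: the task comes before the fusion strategy for ADNI and after it for AI4Covid, with a model suffix only for joint fusion. experiment_config_name is a hypothetical helper, not part of the repository.

```python
from typing import Optional


def experiment_config_name(dataset: str, task: str, fusion: str,
                           model: Optional[str] = None) -> str:
    """Build an experiment config name (without .yaml) matching the table."""
    if fusion == "joint" and model is None:
        raise ValueError("joint fusion configs need a model suffix, e.g. 'NAIM'")
    if fusion != "joint" and model is not None:
        raise ValueError("only joint fusion configs carry a model suffix")

    if dataset == "ADNI":
        parts = [dataset, task, fusion]      # ADNI_<task>_<fusion>
    elif dataset == "AI4Covid":
        parts = [dataset, fusion, task]      # AI4Covid_<fusion>_<task>
    else:
        raise ValueError(f"no naming rule known for dataset {dataset!r}")

    if model is not None:
        parts.append(model)
    return "_".join(parts)


name = experiment_config_name("ADNI", "diagnosis", "joint", "NAIM")
command = f"python main.py experiment={name}"
```

The resulting command string matches the invocation form shown earlier in this section.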

These experiments generate different percentages of missing values (MCAR) in the training and testing sets. Specifically, the percentages used are indicated by missing_percentages in each experiment configuration file (e.g., [0.0, 0.05, 0.1, 0.3, 0.5, 0.75]).
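To make the MCAR generation concrete, the following sketch removes a fixed fraction of cells uniformly at random over the whole table, using only the standard library. It is an illustration under stated assumptions and not the repository's implementation: generate_mcar is a hypothetical name, missing entries are represented by None, and values that are already missing are not treated specially.

```python
import random
from typing import List, Optional, Sequence


def generate_mcar(rows: Sequence[Sequence[float]], fraction: float,
                  seed: int = 42) -> List[List[Optional[float]]]:
    """Return a copy of `rows` with `fraction` of all cells set to None.

    Cells are drawn uniformly at random, independently of their values,
    which is what "Missing Completely At Random" means.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    masked: List[List[Optional[float]]] = [list(row) for row in rows]
    cells = [(i, j) for i, row in enumerate(masked) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


# Hypothetical 4 x 5 table, masked at each percentage from the example above.
table = [[1.0, 2.0, 3.0, 4.0, 5.0] for _ in range(4)]
missing_percentages = [0.0, 0.05, 0.1, 0.3, 0.5, 0.75]
masked_tables = {p: generate_mcar(table, p) for p in missing_percentages}
```

The original table is left untouched, so the same data can be masked at each percentage in turn.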

For single-modality experiments with missing data generation, the classification_with_missing_generation.yaml configuration can be used:

python main.py experiment=classification_with_missing_generation experiment/databases@db=ADNI_diagnosis_assessment

For each experiment, this code produces a folder named <experiment-name>/<experiment-subname> which contains everything generated by the code. In particular, the following folders and files are present:

  1. cross_validation: this folder contains a folder for each training fold, indicated as a composition of test and validation folds <test-fold>_<val-fold>, reporting the information on the train, validation and test sets in 3 separate csv files.
  2. preprocessing: this folder contains all the preprocessing information divided into 3 main folders:
    1. numerical_preprocessing: in this folder, for each percentage of missing values considered, there is a csv file for each fold reporting the information on the preprocessing params of numerical features.
    2. categorical_preprocessing: in this folder, for each percentage of missing values considered, there is a csv file for each fold reporting the information on the preprocessing params of categorical features.
    3. imputer: in this folder, for each percentage of missing values considered, there are csv files for each fold with information on the imputation strategy applied to handle missing values and a pkl file containing the imputer fitted on the training data of the fold.
  3. saved_models: this folder contains, for each percentage of missing values considered, a folder with the model's name that includes, for each fold, a csv file with the model's parameters and a pkl or pth file containing the trained model.
  4. predictions: this folder contains, for each percentage of missing values considered, a folder that reports the predictions obtained from the training and validation sets and separately those of the test set.
  5. results: this folder reports, for each percentage of missing values considered, the performance on the train, validation, and test sets separately. For each set, two folders named balanced and unbalanced contain the performance in 3 separate files with increasing levels of averaging:
    1. all_test_performance.csv: performance evaluated for each fold and each class.
    2. classes_average_performance.csv: average performance of the folds for each class.
    3. set_average_performance.csv: average performance across folds and classes.
  6. config.yaml: this file contains the configuration file used as input for the experiment.
  7. <experiment-name>.log: this is the log file of the experiment.
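To inspect the outputs of a finished run, the summary files can be collected programmatically. The sketch below assumes only that the performance files live somewhere under the results folder of the experiment directory. It deliberately does not assume how the missing-percentage, set, and balanced/unbalanced folders are nested. find_result_files is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import List, Union


def find_result_files(experiment_dir: Union[str, Path],
                      filename: str = "set_average_performance.csv") -> List[Path]:
    """Recursively list every `filename` under <experiment_dir>/results."""
    results_dir = Path(experiment_dir) / "results"
    return sorted(results_dir.rglob(filename))
```

Each returned path can then be read with any CSV reader. The folder names along the path indicate which missing percentage and which set the file refers to.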

NOTE: If an experiment is interrupted, intentionally or not, it can be resumed from where it stopped by setting the continue_experiment parameter to True in the experiment configuration file.

Train & Test on your dataset

Experiment declaration

As mentioned above, the experiment configuration is composed at run time starting from the config.yaml file, which declares the configuration file of the experiment to be performed, the device to use, and the system paths configuration.

device: cuda # cpu, cuda, or mps

defaults:
  - _self_
  - experiment: multimodal_joint_fusion_classification_with_missing_generation # Experiment to perform
  - experiment/paths/system@: local # System paths configuration

The possible options for the experiment parameter are the experiment configuration files contained in the ./confs/experiment/ folder.

Dataset preparation

To prepare a dataset for analysis with this code, create a dataset-specific configuration file similar to those already provided in the ./confs/experiment/databases folder, and place it in that same folder. The path to the data must be specified in the path parameter of the dataset's configuration file. Thanks to Hydra's interpolation functionality, the path can be composed using the ${data_path} interpolation key. The dataset configuration file must be structured as follows:

_target_: CMC_utils.datasets.ClassificationDataset # DO NOT CHANGE
_convert_: all # DO NOT CHANGE

name: <dataset-name> # Name of the dataset
db_type: tabular # DO NOT CHANGE

classes: ["<class-1-name>", ..., "<class-n-name>"] # List of the classes
label_type: multiclass # multiclass or binary

task: classification # DO NOT CHANGE

path: ${data_path}/<relative-path-to-file> # Relative path to the file

columns: # Dictionary containing features names as keys and their types as values
  <ID-name>:        id             # Name of the ID column if present
  <feature-1-name>: <feature-type> # int, float or category
  <feature-2-name>: <feature-type> # int, float or category
  # Other features to be inserted
  <label-name>:     target         # Name of the target column

pandas_load_kwargs:
  na_values: [ "?" ]
  header: 0
  index_col: 0

dataset_class: # DO NOT CHANGE
  _target_: CMC_utils.datasets.SupervisedTabularDatasetTorch # DO NOT CHANGE
  _convert_: all # DO NOT CHANGE

In the columns definition, the id and target feature types mark the ID column and the label column, respectively.

For multimodal experiments, multiple dataset configuration files must be prepared (one per modality) and referenced in the experiment configuration file using the databases@dbs.<index> key.

Experiment configuration

The experiment configuration file defines the specifics for conducting the desired pipeline. It begins with some general information, such as the name of the experiment, the pipeline to be executed, the seed for randomness control, training verbosity, and the percentages of missing values to be tested.

experiment_name: ${db_name}_${seed}_multimodal_joint_fusion_classification_with_missing_generation
pipeline: multimodal_joint_fusion

seed: 42
verbose: 1
continue_experiment: False

missing_percentages: [0.0, 0.05, 0.1, 0.25, 0.5]

Then, all other necessary configuration files for the different parts of the experiment are declared. The possible options for each part are listed in the table below.

| Params | Keys | Options |
|---|---|---|
| Dataset | databases@db or databases@dbs.<index> | ADNI_diagnosis_assessment, ADNI_diagnosis_biospecimen, ADNI_diagnosis_imageanalysis, ADNI_diagnosis_subjectcharacteristics, ADNI_prognosis_assessment, ADNI_prognosis_biospecimen, ADNI_prognosis_imageanalysis, ADNI_prognosis_subjectcharacteristics, AI4Covid_blood_death, AI4Covid_blood_prognosis, AI4Covid_history_death, AI4Covid_history_prognosis, AI4Covid_personal_death, AI4Covid_personal_prognosis, AI4Covid_state_death, AI4Covid_state_prognosis |
| Cross Validation | cross_validation@test_cv, cross_validation@val_cv | bootstrap, holdout, kfold, leave_one_out, predefined, stratifiedkfold |
| Numerical Preprocessing | preprocessing/numerical | normalize, standardize |
| Categorical Preprocessing | preprocessing/categorical | categoricalencode, onehotencode |
| Imputation Strategy | preprocessing/imputer | simple, knn, iterative, noimputation |
| Missing Generation | preprocessing/missing_generation | MCAR_global, MCAR_feature, MCAR_sample, no_generation |
| Model | model | naim, multimodal_learner, adaboost, custom_mlp, dt, fttransformer, histgradientboostingtree, mlp_sklearn, rf, svm, tabnet, tabtransformer, xgboost |
| Metrics | metric@train.set_metrics.<metric-name>, metric@performance_metrics.<metric-name> | auc, accuracy, recall, precision, f1_score, mcc, gmean |

To change some of the model hyperparameters, edit the ml_params and dl_params files. For the ML models, the number of estimators (n_estimators) can be set. For the DL models, the following can be set: the maximum number of epochs (max_epochs), the number of warm-up epochs (min_epochs), the batch size (batch_size), the patience of the early stopping (early_stopping_patience) and of the scheduler (scheduler_patience), their tolerance for performance improvement (performance_tolerance), and the device to use for training (device). The learning rates to be tested (learning_rates) can also be defined. For compatibility with some of the competitors available in the models list, the initial learning rate (init_learning_rate) and the final learning rate (end_learning_rate) must be defined as well.

n_estimators: 100 # Number of estimators for the ML models
max_epochs: 1500 # Maximum number of epochs
min_epochs: 50 # Warm-up number of epochs
batch_size: 32 # Batch size
init_learning_rate: 1e-3 # Initial learning rate
end_learning_rate: 1e-8 # Final learning rate
learning_rates: [1e-3, 1e-4, 1e-5, 1e-6, 1e-7] # Learning rates for the scheduler
early_stopping_patience: 50 # Patience for the early stopping
scheduler_patience: 25 # Patience for the scheduler
performance_tolerance: 1e-3 # Tolerance for the performance improvement
device: cuda # cpu or cuda or mps, device to use for training
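The interplay between min_epochs, early_stopping_patience, and performance_tolerance can be sketched as follows. This is one common interpretation and not necessarily how the repository implements it: a validation loss counts as an improvement only if it beats the best value by more than the tolerance, and training can stop only after the warm-up epochs have elapsed. The EarlyStopping class and its method names are hypothetical.

```python
class EarlyStopping:
    """Patience-based early stopping with a warm-up phase (illustrative)."""

    def __init__(self, patience: int = 50, min_epochs: int = 50,
                 tolerance: float = 1e-3) -> None:
        self.patience = patience
        self.min_epochs = min_epochs
        self.tolerance = tolerance
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record the loss of a completed epoch (1-indexed); True means stop."""
        if val_loss < self.best - self.tolerance:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        if epoch < self.min_epochs:
            return False  # never stop during warm-up
        return self.epochs_without_improvement >= self.patience


# Hypothetical validation losses with small values to keep the trace short.
stopper = EarlyStopping(patience=2, min_epochs=3, tolerance=0.01)
losses = [1.0, 0.9, 0.895, 0.894, 0.893]
decisions = []
for epoch, loss in enumerate(losses, start=1):
    stop = stopper.step(epoch, loss)
    decisions.append(stop)
    if stop:
        break
```

In this trace the losses after the second epoch improve by less than the tolerance, so the patience counter reaches 2 at the fourth epoch and training stops there.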

Contact

For any questions, please contact camillomaria.caruso@unicampus.it and valerio.guarrasi@unicampus.it.


Citation

@article{caruso2025maria,
  title={MARIA: A multimodal transformer model for incomplete healthcare data},
  author={Caruso, Camillo Maria and Soda, Paolo and Guarrasi, Valerio},
  journal={Computers in Biology and Medicine},
  volume={196},
  pages={110843},
  year={2025},
  publisher={Elsevier},
  doi={10.1016/j.compbiomed.2025.110843},
} 
