This is the repository for the article "A Sobering Look at Tabular Data Generation via Probabilistic Circuits", where we revisit advances in deep generative models for tabular data generation through the lens of hierarchical mixture models; in particular, probabilistic circuits.
We suggest creating a Python virtual environment with Python 3.10. After activating the environment, install all the requirements by running:
make install
All datasets are publicly available.
To download the datasets used in our article, run:
python datasets_scripts/tabdiff_data_download.py
Then process the datasets with:
python datasets_scripts/tabdiff_process_dataset.py
These scripts are taken from the TabDiff repository.
Due to some legacy code, you may also need to create a placeholder W&B constants file at src/private_constants.py. If you do not intend to use W&B, the following template is sufficient:
ENTITY="placeholder"
PROJECT="TabPC"
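As a convenience, the placeholder file can also be created programmatically. The following is a minimal sketch, not part of the repository: the helper name write_placeholder_constants and its repo_root parameter are hypothetical, and it assumes that the two constants above are all the file needs to contain.

```python
from pathlib import Path


def write_placeholder_constants(repo_root: str = ".") -> Path:
    """Create src/private_constants.py with placeholder W&B values.

    Hypothetical helper: assumes ENTITY and PROJECT are the only
    constants the codebase reads from this file.
    """
    target = Path(repo_root) / "src" / "private_constants.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Never overwrite an existing file, which may hold real W&B settings.
    if not target.exists():
        target.write_text('ENTITY="placeholder"\nPROJECT="TabPC"\n')
    return target
```

Calling write_placeholder_constants() from the repository root leaves any existing file untouched.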
To replicate the main experiment (i.e. training our PCs with tuned hyperparameters on each of the datasets and computing the desired metrics on the generated data), run:
python scripts/experiment_main.py --train-models all --sample --evaluate --update-tables
This script has several command-line arguments to enable or disable parts of the experiment pipeline:
- --train-models or -t, followed by a selection of the models to train. Valid choices are "all", "tab_pc", "shallow_mixture", "fully_factorized", and "none".
- --sample or -s. Runs sampling for each of the trained models in the ARTICLE_MODELS_FOLDER directory.
- --evaluate or -e. Computes metrics on the generated data of each of our model types (TabPC, ShallowMixture, FullyFactorizedPreprocessed, FullyFactorizedRandom). Generated data should be stored in artifacts/generated_data/{method}.
- --evaluate-baselines or -b. Computes metrics on the generated data of each baseline model. Generated data should be stored in artifacts/generated_data/{method}. This data must be generated outside this repository, e.g. by using the TabSyn repo. For full comparability, generate one dataset from each of 5 different trained models for each baseline (which, unfortunately, that repository cannot do out of the box).
- --update-tables. Updates the LaTeX tables stored in article_material/article_tables. It also computes the ranked versions of the result CSV files, stored by default in artifacts/ranked_results.
Results for this experiment are stored by default in article_material/our_results (as CSVs) and article_material/article_tables (as the corresponding LaTeX tables, formatted as in the paper).
To replicate the BPD vs C2ST experiment discussed in Section 4.2, run:
python scripts/experiment_likelihoods.py --train --compute-likelihoods --sample --evaluate --scrape
This script similarly has several command-line arguments to enable or disable parts of the experiment pipeline:
- --train or -t. Enables the training of PCs on the specified hyperparameter grid (see below for more details about this grid and how to change it).
- --compute-likelihoods or -l. Enables the computation of dataset split likelihoods under the trained PCs in the folder containing the models. These likelihood values are stored in JSON files under the model directory.
- --sample or -s. Enables sampling from the trained PCs in the folder containing the models. These samples are stored under the model directory.
- --evaluate or -e. Enables the evaluation of generated samples. By default, only C2ST (XGB) is evaluated. Further metrics can be computed by uncommenting lines in the METRICS dictionary.
- --scrape or -r. Enables the scraping of results into consolidated CSV files, one for the C2ST values and one for the likelihood values.
- --models-folder. Sets a custom directory in which to store models. Default: artifacts/ll_models.
- --metrics-folder. Sets a custom directory in which to store computed metrics. Default: artifacts/ll_metrics_results.
- --results-folder. Sets a custom directory in which to store consolidated CSV result files. Default: article_material/ll_results.
- --plots-folder. Sets a custom directory in which to store optionally generated plots. Default: article_material/ll_results/plots.
- --seed. Sets a custom seed used for the training of each of the PCs. Default: 0 (so the training should be deterministic; for random training, set this to None).
- --dataset. Selects which dataset to run the PC training on. Default: "all" (other options are the named datasets used in the paper, e.g. "adult", "beijing", ...).
- --do-plots. Enables the generation of plots similar to those found in the paper. Requires scraped CSVs stored in --results-folder.
In this file, there is also a grid of specified hyperparameters to train models for. To modify this grid, change the entries in the lists LEARNING_RATES, BATCH_SIZES, and NUM_UNITS.
By default, these are set to:
LEARNING_RATES = [0.1, 0.25, 0.5]
BATCH_SIZES = [64, 256, 512]
NUM_UNITS = [128, 512, 2048]
For the News dataset, there is a manual check that decreases the number of units from 2048 to 1024. This is due to memory constraints on the GPUs we used.
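To make the size of this grid concrete, the sketch below enumerates the configurations it implies. It assumes the script trains one PC per combination of the three lists (a full Cartesian product), which the text above does not state explicitly. The function grid_for, the constant NEWS_MAX_UNITS, and the dataset key "news" are hypothetical names chosen for illustration.

```python
from itertools import product

LEARNING_RATES = [0.1, 0.25, 0.5]
BATCH_SIZES = [64, 256, 512]
NUM_UNITS = [128, 512, 2048]

# Assumption: the News dataset is capped at 1024 units (memory constraints).
NEWS_MAX_UNITS = 1024


def grid_for(dataset: str) -> list[dict]:
    """Return one config dict per (lr, batch size, units) combination."""
    configs = []
    for lr, batch_size, num_units in product(LEARNING_RATES, BATCH_SIZES, NUM_UNITS):
        if dataset == "news":  # hypothetical dataset key
            num_units = min(num_units, NEWS_MAX_UNITS)
        configs.append({"lr": lr, "batch_size": batch_size, "num_units": num_units})
    return configs


if __name__ == "__main__":
    print(len(grid_for("adult")))  # 3 * 3 * 3 combinations
```

Under this assumption, every dataset yields 27 configurations, and for News the largest setting uses 1024 units instead of 2048.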
To replicate the conditional sampling experiment also discussed in Section 4.2, run:
python scripts/experiment_conditional.py --train --sample --evaluate --impute --scrape --overwrite
Again, this script has several command-line arguments for customising the experiment pipeline:
- --train or -t. Enables the training of PCs.
- --sample or -s. Enables conditional sampling for each trained model at each of the specified conditioning percentages (see below for more details). This step also saves the masks required by --impute.
- --evaluate or -e. Enables the evaluation of generated samples using the specified metrics of Shape (+Trend), wNMIS, and C2ST (XGB) (see below for details on how to modify this).
- --impute or -i. Enables the {mean, mode} imputation of samples as a simple baseline. The script must be, or have been, run with --sample, since imputation requires the masks saved at that step.
- --scrape or -r. Enables the scraping of metrics results into consolidated CSV files. These are saved to --results-folder.
- --overwrite or -o. Enables the overwriting of existing samples (otherwise generation is skipped if a sample is already stored).
- --do-plots. Enables the plotting of metric results in a format similar to that in the paper.
- --uncond-batch-size. Sets a custom batch size for unconditional sampling (i.e. with 0% conditioning). Default: None, which enables automatic selection based on GPU memory. Also accepts integer batch sizes.
- --cond-batch-size. Sets a custom batch size for conditional sampling (i.e. with a non-zero conditioning percentage). Default: 10 (an arbitrary choice found to work on our GPU).
- --use-train-set-for-conditioning. Uses the training set as the conditioning set (which is much larger than the test set but less principled, since the model could simply have memorised it). If not set, the test set is used for conditioning.
- --models-folder. Sets a custom directory in which to store models. Default: artifacts/cond_sampling_models.
- --metrics-folder. Sets a custom directory in which to store computed metrics. Default: artifacts/cond_sampling_metrics_results.
- --results-folder. Sets a custom directory in which to store consolidated CSV result files. Default: article_material/cond_sampling_results.
- --plots-folder. Sets a custom directory in which to store optionally generated plots. Default: article_material/cond_sampling_results/plots.
- --seed. Sets a custom seed used for the training of each of the PCs. Default: 0 (so the training should be deterministic; for random training, set this to None).
- --dataset. Selects which dataset to run the PC training on. Default: "all" (other options are the named datasets used in the paper, e.g. "adult", "beijing", ...).
The metrics computed for the datasets are stored in the METRICS dictionary in the script. Additional metrics can be added to this dictionary if desired; see scripts/experiment_main.py for examples.
The conditioning percentages are set in the list COND_PERCENTAGES. This runs from 0% to 100% in increments of 10%. Modify if other conditioning amounts are desired.
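For readers unfamiliar with the {mean, mode} imputation baseline enabled by --impute, the following sketch illustrates the general idea using only the standard library. It is not the repository's implementation. The function impute_mean_mode, the row-as-dict representation, the boolean "observed" masks, and the choice to compute fill values from a separate reference table are all assumptions made for illustration.

```python
from statistics import mean, mode


def impute_mean_mode(reference_rows, rows, observed, numeric_cols):
    """Fill unobserved cells with a per-column mean (numeric) or mode (categorical).

    reference_rows: rows used to compute the fill values (assumed: training data)
    rows:           rows to impute
    observed:       one dict per row, mapping column -> True if the cell is kept
    numeric_cols:   set of column names treated as numeric
    """
    fill = {}
    for col in reference_rows[0]:
        values = [r[col] for r in reference_rows]
        fill[col] = mean(values) if col in numeric_cols else mode(values)
    # Keep observed cells as they are; replace every other cell by the fill value.
    return [
        {col: (row[col] if obs[col] else fill[col]) for col in row}
        for row, obs in zip(rows, observed)
    ]


if __name__ == "__main__":
    reference = [
        {"age": 20, "job": "a"},
        {"age": 40, "job": "b"},
        {"age": 30, "job": "b"},
    ]
    rows = [{"age": 99, "job": "a"}]
    observed = [{"age": False, "job": True}]
    print(impute_mean_mode(reference, rows, observed, {"age"}))
```

Such a baseline ignores all dependencies between columns, which is what makes it a useful lower bar for conditional sampling.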
To run a specific experiment (i.e. combination of model and dataset), run one of the Python files containing experiment configurations in the train_scripts folder. Typically, each script has a --path option allowing for the specification of a given dataset by its path. One can also set the folder in which to store the trained model via the --target-folder option (each file otherwise has its own default path).
- To train an FF model, use train_scripts/fully_factorized.py.
- To train an SM model, use train_scripts/shallow_mixture.py.
- To train a PC with our tuned hyperparameters, use train_scripts/pc_sota.py.
- To train a PC with desired hyperparameters, use train_scripts/pc_lls.py and the command-line arguments --num-units, --batch-size, --lr.
- To train a PC with desired hyperparameters and pre-processing components (e.g. as we do in Section D.2), use train_scripts/pc_ablation.py and the command-line arguments --num-units, --batch-size, --lr for the hyperparameters, and --dequantize-all-floats, --handle-inflated-values, --quantile-normalizer for the pre-processing.
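When launching many such runs, it can help to assemble the command lines programmatically. The sketch below builds an argument list from the options mentioned above. The helper build_train_command and the dataset path data/adult are hypothetical, and the sketch assumes the chosen script accepts --path and --target-folder, which the text above describes as typical rather than guaranteed.

```python
import sys


def build_train_command(script, dataset_path, target_folder=None, **hyperparams):
    """Return an argv list for one of the train_scripts entry points.

    Keyword arguments become flags, e.g. num_units=512 -> --num-units 512.
    """
    cmd = [sys.executable, script, "--path", dataset_path]
    if target_folder is not None:
        cmd += ["--target-folder", target_folder]
    for name, value in hyperparams.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        "train_scripts/pc_lls.py",
        "data/adult",  # hypothetical dataset path
        num_units=512,
        batch_size=256,
        lr=0.25,
    )
    print(" ".join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True) from the repository root.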
As a simple baseline, we also include train_scripts/copy_data.py which simply copies data (as a sanity check to be able to achieve 'perfect' performance). There is also a test script train_scripts/test.py.
We have several additional features which require some cleaning and reorganisation before they are added to this repository. These include:
- A script for running and collecting the results of the ablation experiment, as we do in Section D.2.
- A script for processing a user-provided dataset into the form required by the codebase.
- A script for replicating the FF (trained) vs FF (random) experiment, as we discuss in Section 2.2.
- A script for generating critical difference diagrams (CDDs), as we report in Section E.2.
Additionally, we plan to create a notebook which demonstrates some of the key features of the codebase, such as training and sampling from TabPC.
Finally, we also plan to upload our final trained model checkpoints and all generated data. This repo will be updated with the link once this is complete.
If you use this repository, please don't forget to cite the corresponding paper:
@misc{scassola2026soberinglooktabulardata,
title={A Sobering Look at Tabular Data Generation via Probabilistic Circuits},
author={Davide Scassola and Dylan Ponsford and Adrián Javaloy and Sebastiano Saccani and Luca Bortolussi and Henry Gouk and Antonio Vergari},
year={2026},
eprint={2603.23016},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.23016},
}