Adding an example with CTGAN #107

lotif · 2025-12-10T21:31:30Z

PR Type

Feature

Short Description

Clickup Ticket(s):
https://app.clickup.com/t/868gh35b7
https://app.clickup.com/t/868gh35cp

Adding SDV as a dependency
Adding a CTGAN example with train, synthesize and evaluate scripts
Addressing a small bug on the alpha precision script

Tests Added

NA

coderabbitai · 2025-12-10T21:36:34Z

📝 Walkthrough

Walkthrough

This PR introduces a complete CTGAN single-table training and synthesis example workflow under the examples/gan/ directory. It adds training, synthesis, and evaluation scripts; a configuration file defining directories and model parameters; utility functions for metadata handling and table name resolution; and comprehensive README documentation. The project dependency is updated to include "sdv>=1.18.0". Additionally, the .gitignore is simplified with a broader pattern for results directories, and two CLI arguments in the Alpha Precision evaluation script are made optional with default values.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

examples/gan/train.py: Verify CTGAN initialization, metadata computation, and model persistence logic
examples/gan/synthesize.py: Review conditional training trigger and model loading/sampling flow
examples/gan/evaluate.py: Validate metric computation order, configuration dependencies, and results JSON structure
examples/gan/utils.py: Examine metadata override logic (discrete size thresholds, type mapping) and single-table constraint enforcement
examples/gan/config.yaml: Confirm parameter values align with script expectations
src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py: Assess impact of making --dataname and --model optional on downstream path construction and existing workflows
pyproject.toml: Verify sdv version constraint compatibility with existing dependencies

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Adding an example with CTGAN' accurately summarizes the main change: introducing a new CTGAN example with associated training, synthesis, and evaluation functionality.
Description check	✅ Passed	The PR description follows the template structure with PR Type, Short Description, and Tests Added sections, providing relevant ticket links and clearly listing the three main changes made.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch marcelo/ctgan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

examples/gan/README.md (1)
90-90: Fix heading formatting.

Remove extra space after hash in the heading.
-###  Additional Metrics
+### Additional Metrics
examples/gan/utils.py (1)
9-27: Consider using next(iter()) for cleaner iteration.

The function correctly enforces single-table constraint and extracts the table name. However, line 27 can be simplified per Ruff suggestion.
-    return list(dataset_meta["tables"].keys())[0]
+    return next(iter(dataset_meta["tables"].keys()))
This avoids creating an intermediate list for a single element extraction.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40d09cf and e8e23cc.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (9)

.gitignore (1 hunks)
examples/gan/README.md (1 hunks)
examples/gan/config.yaml (1 hunks)
examples/gan/evaluate.py (1 hunks)
examples/gan/synthesize.py (1 hunks)
examples/gan/train.py (1 hunks)
examples/gan/utils.py (1 hunks)
pyproject.toml (1 hunks)
src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

examples/gan/evaluate.py (5)

examples/gan/utils.py (1)

get_table_name (9-27)

src/midst_toolkit/evaluation/quality/correlation_matrix_difference.py (1)

CorrelationMatrixDifference (7-77)

src/midst_toolkit/evaluation/quality/kolmogorov_smirnov_total_variation.py (1)

KolmogorovSmirnovAndTotalVariation (7-88)

src/midst_toolkit/evaluation/quality/mutual_information_difference.py (1)

MutualInformationDifference (7-102)

src/midst_toolkit/common/logger.py (1)

dump (218-249)

examples/gan/synthesize.py (2)

examples/gan/train.py (1)

main (15-51)

examples/gan/utils.py (1)

get_table_name (9-27)

examples/gan/train.py (2)

examples/gan/utils.py (2)

get_metadata (30-69)

get_table_name (9-27)

examples/gan/synthesize.py (1)

main (14-43)

🪛 LanguageTool

examples/gan/README.md

[locale-violation] ~5-~5: In American English, ‘afterward’ is the preferred variant. ‘Afterwards’ is more commonly used in British English and other dialects.
Context: ...library and then synthesizing some data afterwards. ## Downloading data First, we need ...

(AFTERWARDS_US)

[grammar] ~92-~92: Ensure spelling is correct
Context: ... Additional Metrics The calculation of assitional metrics are set up in the evaluate.py...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)

examples/gan/README.md

90-90: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

🪛 Ruff (0.14.8)

examples/gan/utils.py

27-27: Prefer next(iter(dataset_meta["tables"].keys())) over single element slice

Replace with next(iter(dataset_meta["tables"].keys()))

(RUF015)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: integration-tests
GitHub Check: unit-tests
GitHub Check: run-code-check

🔇 Additional comments (7)

.gitignore (1)

33-34: Good consolidation of results directory patterns.

Replacing multiple explicit examples/*/results/ patterns with the broader examples/**/*results/ pattern improves maintainability while correctly matching results directories at any nesting level under the examples directory.

src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py (1)

56-74: LGTM! Good usability improvement.

Making these arguments optional with sensible defaults improves the CLI experience without breaking existing functionality.

examples/gan/config.yaml (1)

1-11: LGTM! Clean configuration.

The configuration structure is clear and the default values are reasonable.

examples/gan/evaluate.py (1)

1-79: LGTM! Comprehensive evaluation pipeline.

The evaluation script correctly:

Loads real and synthetic data

Extracts column metadata from meta_info.json

Computes three different quality metrics with preprocessing

Normalizes the MI difference score for better interpretability (line 66)

Saves results in structured JSON format

The consistent use of do_preprocess=True across all metrics ensures fair comparison.

examples/gan/utils.py (1)

30-69: LGTM! Well-designed metadata handling.

The get_metadata function correctly:

Drops ID columns (line 45) to avoid using identifiers in training

Detects metadata from the cleaned dataframe

Applies domain dictionary overrides when provided, with a reasonable heuristic (size < 1000) for discrete→categorical classification

Removes primary keys to prevent issues with synthesis

The logic is sound and handles edge cases appropriately.

examples/gan/train.py (1)

1-55: LGTM! Well-structured training pipeline.

The training script is clean and follows good patterns:

Proper use of Hydra for configuration

Clear logging at each step

Creates output directory before saving

Proper error handling via assertion in utilities

The CTGANSynthesizer API usage aligns with SDV 1.18.0+ documentation. All parameters (metadata, epochs, verbose) and methods (fit, save) are correctly used.

pyproject.toml (1)

29-29: SDV version 1.18.0 exists with no known vulnerabilities.

Version 1.18.0 is available on PyPI and has no documented security vulnerabilities in public databases. The dependency specification is valid.

examples/gan/README.md

examples/gan/synthesize.py

lotif · 2025-12-11T17:13:11Z

src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py

    parser.add_argument(
        "--dataname",
-        required=True,
+        required=False,


Setting those to False so I can run this without passing those parameters in. They are not necessary when the other parameters are being passed in and there is already validation for that.

lotif added 10 commits November 26, 2025 16:27

CTGAN Training and synthesizing

7f2c4a6

small change

0165cb0

Fixes and new params

a8021cc

Adding evaluation instructions

23c946e

Using SVD instead of CTGAN directly

52aeb70

Adding KS and TVD evaluation

a5c7ee5

Adding Correlation Matrix Difference, and Mutual Information Difference.

1fca01d

Adding preprocessing and one additional score

fcd0546

Merge branch 'main' into marcelo/ctgan

f40f2c1

Removing wrongly commited file

e8e23cc

coderabbitai bot reviewed Dec 10, 2025

View reviewed changes

examples/gan/README.md Outdated Show resolved Hide resolved

examples/gan/README.md Show resolved Hide resolved

examples/gan/README.md Outdated Show resolved Hide resolved

examples/gan/synthesize.py Show resolved Hide resolved

Coderabbit CR

0bafdf5

lotif commented Dec 11, 2025

View reviewed changes

lotif requested review from ElahehBassak, amrit110, bzamanlooy, emersodb, fatemetkl, masi-sh and sarakodeiri December 11, 2025 17:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding an example with CTGAN #107

Adding an example with CTGAN #107

Uh oh!

lotif commented Dec 10, 2025

Uh oh!

coderabbitai bot commented Dec 10, 2025

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lotif Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Adding an example with CTGAN #107

Are you sure you want to change the base?

Adding an example with CTGAN #107

Uh oh!

Conversation

lotif commented Dec 10, 2025

PR Type

Short Description

Tests Added

Uh oh!

coderabbitai bot commented Dec 10, 2025

Walkthrough

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lotif Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants