Skip to content

Conversation

@lotif
Copy link
Collaborator

@lotif lotif commented Dec 10, 2025

PR Type

Feature

Short Description

Clickup Ticket(s):
https://app.clickup.com/t/868gh35b7
https://app.clickup.com/t/868gh35cp

  • Adding SDV as a dependency
  • Adding a CTGAN example with train, synthesize and evaluate scripts
  • Addressing a small bug on the alpha precision script

Tests Added

NA

@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

📝 Walkthrough

Walkthrough

This PR introduces a complete CTGAN single-table training and synthesis example workflow under the examples/gan/ directory. It adds training, synthesis, and evaluation scripts; a configuration file defining directories and model parameters; utility functions for metadata handling and table name resolution; and comprehensive README documentation. The project dependency is updated to include "sdv>=1.18.0". Additionally, the .gitignore is simplified with a broader pattern for results directories, and two CLI arguments in the Alpha Precision evaluation script are made optional with default values.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • examples/gan/train.py: Verify CTGAN initialization, metadata computation, and model persistence logic
  • examples/gan/synthesize.py: Review conditional training trigger and model loading/sampling flow
  • examples/gan/evaluate.py: Validate metric computation order, configuration dependencies, and results JSON structure
  • examples/gan/utils.py: Examine metadata override logic (discrete size thresholds, type mapping) and single-table constraint enforcement
  • examples/gan/config.yaml: Confirm parameter values align with script expectations
  • src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py: Assess impact of making --dataname and --model optional on downstream path construction and existing workflows
  • pyproject.toml: Verify sdv version constraint compatibility with existing dependencies

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Adding an example with CTGAN' accurately summarizes the main change: introducing a new CTGAN example with associated training, synthesis, and evaluation functionality.
Description check ✅ Passed The PR description follows the template structure with PR Type, Short Description, and Tests Added sections, providing relevant ticket links and clearly listing the three main changes made.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch marcelo/ctgan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
examples/gan/README.md (1)

90-90: Fix heading formatting.

Remove extra space after hash in the heading.

-###  Additional Metrics
+### Additional Metrics
examples/gan/utils.py (1)

9-27: Consider using next(iter()) for cleaner iteration.

The function correctly enforces single-table constraint and extracts the table name. However, line 27 can be simplified per Ruff suggestion.

-    return list(dataset_meta["tables"].keys())[0]
+    return next(iter(dataset_meta["tables"].keys()))

This avoids creating an intermediate list for a single element extraction.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40d09cf and e8e23cc.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • .gitignore (1 hunks)
  • examples/gan/README.md (1 hunks)
  • examples/gan/config.yaml (1 hunks)
  • examples/gan/evaluate.py (1 hunks)
  • examples/gan/synthesize.py (1 hunks)
  • examples/gan/train.py (1 hunks)
  • examples/gan/utils.py (1 hunks)
  • pyproject.toml (1 hunks)
  • src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/gan/evaluate.py (5)
examples/gan/utils.py (1)
  • get_table_name (9-27)
src/midst_toolkit/evaluation/quality/correlation_matrix_difference.py (1)
  • CorrelationMatrixDifference (7-77)
src/midst_toolkit/evaluation/quality/kolmogorov_smirnov_total_variation.py (1)
  • KolmogorovSmirnovAndTotalVariation (7-88)
src/midst_toolkit/evaluation/quality/mutual_information_difference.py (1)
  • MutualInformationDifference (7-102)
src/midst_toolkit/common/logger.py (1)
  • dump (218-249)
examples/gan/synthesize.py (2)
examples/gan/train.py (1)
  • main (15-51)
examples/gan/utils.py (1)
  • get_table_name (9-27)
examples/gan/train.py (2)
examples/gan/utils.py (2)
  • get_metadata (30-69)
  • get_table_name (9-27)
examples/gan/synthesize.py (1)
  • main (14-43)
🪛 LanguageTool
examples/gan/README.md

[locale-violation] ~5-~5: In American English, ‘afterward’ is the preferred variant. ‘Afterwards’ is more commonly used in British English and other dialects.
Context: ...library and then synthesizing some data afterwards. ## Downloading data First, we need ...

(AFTERWARDS_US)


[grammar] ~92-~92: Ensure spelling is correct
Context: ... Additional Metrics The calculation of assitional metrics are set up in the evaluate.py...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
examples/gan/README.md

90-90: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

🪛 Ruff (0.14.8)
examples/gan/utils.py

27-27: Prefer next(iter(dataset_meta["tables"].keys())) over single element slice

Replace with next(iter(dataset_meta["tables"].keys()))

(RUF015)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-tests
  • GitHub Check: unit-tests
  • GitHub Check: run-code-check
🔇 Additional comments (7)
.gitignore (1)

33-34: Good consolidation of results directory patterns.

Replacing multiple explicit examples/*/results/ patterns with the broader examples/**/*results/ pattern improves maintainability while correctly matching results directories at any nesting level under the examples directory.

src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py (1)

56-74: LGTM! Good usability improvement.

Making these arguments optional with sensible defaults improves the CLI experience without breaking existing functionality.

examples/gan/config.yaml (1)

1-11: LGTM! Clean configuration.

The configuration structure is clear and the default values are reasonable.

examples/gan/evaluate.py (1)

1-79: LGTM! Comprehensive evaluation pipeline.

The evaluation script correctly:

  • Loads real and synthetic data
  • Extracts column metadata from meta_info.json
  • Computes three different quality metrics with preprocessing
  • Normalizes the MI difference score for better interpretability (line 66)
  • Saves results in structured JSON format

The consistent use of do_preprocess=True across all metrics ensures fair comparison.

examples/gan/utils.py (1)

30-69: LGTM! Well-designed metadata handling.

The get_metadata function correctly:

  • Drops ID columns (line 45) to avoid using identifiers in training
  • Detects metadata from the cleaned dataframe
  • Applies domain dictionary overrides when provided, with a reasonable heuristic (size < 1000) for discrete→categorical classification
  • Removes primary keys to prevent issues with synthesis

The logic is sound and handles edge cases appropriately.

examples/gan/train.py (1)

1-55: LGTM! Well-structured training pipeline.

The training script is clean and follows good patterns:

  • Proper use of Hydra for configuration
  • Clear logging at each step
  • Creates output directory before saving
  • Proper error handling via assertion in utilities

The CTGANSynthesizer API usage aligns with SDV 1.18.0+ documentation. All parameters (metadata, epochs, verbose) and methods (fit, save) are correctly used.

pyproject.toml (1)

29-29: SDV version 1.18.0 exists with no known vulnerabilities.

Version 1.18.0 is available on PyPI and has no documented security vulnerabilities in public databases. The dependency specification is valid.

parser.add_argument(
"--dataname",
required=True,
required=False,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting those to False so I can run this without passing those parameters in. They are not necessary when the other parameters are being passed in and there is already validation for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants