Adding an example with CTGAN #107
base: main
📝 Walkthrough
This PR introduces a complete CTGAN single-table training and synthesis example workflow under the examples/gan/ directory. It adds training, synthesis, and evaluation scripts; a configuration file defining directories and model parameters; utility functions for metadata handling and table name resolution; and comprehensive README documentation. The project dependency is updated to include "sdv>=1.18.0". Additionally, the .gitignore is simplified with a broader pattern for results directories, and two CLI arguments in the Alpha Precision evaluation script are made optional with default values.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (2)
examples/gan/README.md (1)
90-90: Fix heading formatting. Remove the extra space after the hash in the heading.
-###  Additional Metrics
+### Additional Metrics
examples/gan/utils.py (1)
9-27: Consider using next(iter()) for cleaner iteration. The function correctly enforces the single-table constraint and extracts the table name. However, line 27 can be simplified per the Ruff suggestion.
- return list(dataset_meta["tables"].keys())[0]
+ return next(iter(dataset_meta["tables"].keys()))
This avoids creating an intermediate list for a single element extraction.
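As a minimal sketch of the suggestion above (not the PR's actual code): the function body, the assertion message, and the sample metadata are assumptions for illustration; only the `dataset_meta["tables"]` lookup and the single-table check come from the review. Iterating a dict yields its keys, so `next(iter(tables))` is equivalent to the `.keys()` form Ruff proposes.

```python
def get_table_name(dataset_meta: dict) -> str:
    """Return the name of the only table described in the metadata."""
    tables = dataset_meta["tables"]
    # Enforce the single-table constraint before extracting the name.
    # The message wording is an assumption, not taken from the PR.
    assert len(tables) == 1, f"Expected exactly one table, found {len(tables)}"
    # Take the first (and only) key without building an intermediate list.
    return next(iter(tables))


# Hypothetical metadata with a single table called "trans".
meta = {"tables": {"trans": {}}}
print(get_table_name(meta))  # prints "trans"
```

With more than one table in the metadata, the assertion fails instead of silently returning whichever key happens to come first.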
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
- .gitignore (1 hunks)
- examples/gan/README.md (1 hunks)
- examples/gan/config.yaml (1 hunks)
- examples/gan/evaluate.py (1 hunks)
- examples/gan/synthesize.py (1 hunks)
- examples/gan/train.py (1 hunks)
- examples/gan/utils.py (1 hunks)
- pyproject.toml (1 hunks)
- src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/gan/evaluate.py (5)
- examples/gan/utils.py (1): get_table_name (9-27)
- src/midst_toolkit/evaluation/quality/correlation_matrix_difference.py (1): CorrelationMatrixDifference (7-77)
- src/midst_toolkit/evaluation/quality/kolmogorov_smirnov_total_variation.py (1): KolmogorovSmirnovAndTotalVariation (7-88)
- src/midst_toolkit/evaluation/quality/mutual_information_difference.py (1): MutualInformationDifference (7-102)
- src/midst_toolkit/common/logger.py (1): dump (218-249)
examples/gan/synthesize.py (2)
- examples/gan/train.py (1): main (15-51)
- examples/gan/utils.py (1): get_table_name (9-27)
examples/gan/train.py (2)
- examples/gan/utils.py (2): get_metadata (30-69), get_table_name (9-27)
- examples/gan/synthesize.py (1): main (14-43)
🪛 LanguageTool
examples/gan/README.md
[locale-violation] ~5-~5: In American English, ‘afterward’ is the preferred variant. ‘Afterwards’ is more commonly used in British English and other dialects.
Context: ...library and then synthesizing some data afterwards. ## Downloading data First, we need ...
(AFTERWARDS_US)
[grammar] ~92-~92: Ensure spelling is correct
Context: ... Additional Metrics The calculation of assitional metrics are set up in the evaluate.py...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/gan/README.md
90-90: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
🪛 Ruff (0.14.8)
examples/gan/utils.py
27-27: Prefer next(iter(dataset_meta["tables"].keys())) over single element slice
Replace with next(iter(dataset_meta["tables"].keys()))
(RUF015)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-tests
- GitHub Check: unit-tests
- GitHub Check: run-code-check
🔇 Additional comments (7)
.gitignore (1)
33-34: Good consolidation of results directory patterns. Replacing multiple explicit examples/*/results/ patterns with the broader examples/**/*results/ pattern improves maintainability while correctly matching results directories at any nesting level under the examples directory.
src/midst_toolkit/evaluation/quality/scripts/midst_alpha_precision_eval.py (1)
56-74: LGTM! Good usability improvement. Making these arguments optional with sensible defaults improves the CLI experience without breaking existing functionality.
examples/gan/config.yaml (1)
1-11: LGTM! Clean configuration. The configuration structure is clear and the default values are reasonable.
examples/gan/evaluate.py (1)
1-79: LGTM! Comprehensive evaluation pipeline. The evaluation script correctly:
- Loads real and synthetic data
- Extracts column metadata from meta_info.json
- Computes three different quality metrics with preprocessing
- Normalizes the MI difference score for better interpretability (line 66)
- Saves results in structured JSON format
The consistent use of do_preprocess=True across all metrics ensures fair comparison.
examples/gan/utils.py (1)
30-69: LGTM! Well-designed metadata handling. The get_metadata function correctly:
- Drops ID columns (line 45) to avoid using identifiers in training
- Detects metadata from the cleaned dataframe
- Applies domain dictionary overrides when provided, with a reasonable heuristic (size < 1000) for discrete→categorical classification
- Removes primary keys to prevent issues with synthesis
The logic is sound and handles edge cases appropriately.
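To make the size heuristic above concrete, here is a small, hedged sketch. The domain dictionary layout (`{"column": {"size": ..., "type": ...}}`), the function name `categorical_overrides`, and the constant name are assumptions for illustration; the review only confirms that discrete columns with size < 1000 are treated as categorical. Columns that do not qualify are left out, so whatever type detection chose for them stays in place.

```python
CATEGORICAL_MAX_SIZE = 1000  # threshold named in the review; the constant name is ours


def categorical_overrides(domain: dict) -> dict:
    """Return {column: "categorical"} for small discrete columns.

    Assumes each domain entry looks like {"size": int, "type": str};
    that layout is a guess, not confirmed by the PR.
    """
    overrides = {}
    for column, info in domain.items():
        if info["type"] == "discrete" and info["size"] < CATEGORICAL_MAX_SIZE:
            overrides[column] = "categorical"
    return overrides


# Hypothetical domain dictionary with three columns.
domain = {
    "type": {"size": 3, "type": "discrete"},           # small discrete: overridden
    "account_id": {"size": 4500, "type": "discrete"},  # too many values: untouched
    "amount": {"size": 30000, "type": "continuous"},   # not discrete: untouched
}
print(categorical_overrides(domain))  # prints {'type': 'categorical'}
```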
examples/gan/train.py (1)
1-55: LGTM! Well-structured training pipeline. The training script is clean and follows good patterns:
- Proper use of Hydra for configuration
- Clear logging at each step
- Creates output directory before saving
- Proper error handling via assertion in utilities
The CTGANSynthesizer API usage aligns with SDV 1.18.0+ documentation. All parameters (metadata, epochs, verbose) and methods (fit, save) are correctly used.
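For readers unfamiliar with the SDV calls mentioned here, the following is a rough sketch of a fit, save, load, and sample round trip. It is not the code in train.py or synthesize.py: the function name, its arguments, and the use of `Metadata.detect_from_dataframe` are assumptions, and the real scripts build their metadata through `get_metadata` and read their settings from the Hydra config. The `sdv` imports are deferred into the function because training needs that package installed.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd


def train_and_sample(
    data: pd.DataFrame,
    table_name: str,
    model_path: str,
    epochs: int,
    num_rows: int,
) -> pd.DataFrame:
    """Fit CTGAN on one table, save it, reload it, and draw synthetic rows."""
    # Deferred imports: requires sdv>=1.18.0 at call time.
    from sdv.metadata import Metadata
    from sdv.single_table import CTGANSynthesizer

    # Assumption: metadata is auto-detected; the PR applies extra overrides.
    metadata = Metadata.detect_from_dataframe(data=data, table_name=table_name)

    synthesizer = CTGANSynthesizer(metadata, epochs=epochs, verbose=True)
    synthesizer.fit(data)
    synthesizer.save(filepath=model_path)

    # A separate synthesis step would start here by loading the saved model.
    loaded = CTGANSynthesizer.load(filepath=model_path)
    return loaded.sample(num_rows=num_rows)
```

Splitting the save and load steps mirrors the PR's separation into a training script and a synthesis script, so a trained model can be reused without refitting.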
pyproject.toml (1)
29-29: SDV version 1.18.0 exists with no known vulnerabilities. Version 1.18.0 is available on PyPI and has no documented security vulnerabilities in public databases. The dependency specification is valid.
  parser.add_argument(
      "--dataname",
-     required=True,
+     required=False,
Setting those to False so I can run this without passing those parameters in. They are not necessary when the other parameters are being passed in and there is already validation for that.
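A minimal sketch of that pattern, assuming a `None` default and a hypothetical `--real_data_path` flag standing in for "the other parameters" (neither is confirmed by the diff; only `--dataname` and `required=False` are): the argument becomes optional at parse time, and a separate check still rejects a call that supplies nothing usable.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Alpha Precision evaluation (sketch)")
    # Optional now; the None default is an assumption.
    parser.add_argument("--dataname", required=False, default=None)
    # Hypothetical flag representing the alternative inputs.
    parser.add_argument("--real_data_path", required=False, default=None)
    return parser


def validate(args: argparse.Namespace) -> None:
    """Keep a safety net once the arguments are no longer required."""
    if args.dataname is None and args.real_data_path is None:
        raise ValueError("Pass either --dataname or --real_data_path")


args = build_parser().parse_args(["--real_data_path", "real.csv"])
validate(args)  # passes: one of the two inputs was supplied
```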
PR Type
Feature
Short Description
Clickup Ticket(s):
https://app.clickup.com/t/868gh35b7
https://app.clickup.com/t/868gh35cp
Tests Added
NA