feat: P1.7 data simulation agent for small-sample augmentation by Mola-maker · Pull Request #7 · Mola-maker/-MathModel

Mola-maker · 2026-04-20T14:06:16Z

Adds phase P1.7, inserted between P1.5 (data cleaning) and P2 (modeling), to augment cleaned CSVs whose row count falls below MIN_ROWS_FOR_MODELING (30).

Augmentation method

Gaussian-perturbation bootstrap:

- bootstrap-sample rows (with replacement) from the cleaned dataframe
- for each numeric column, add N(0, sigma * col_std) noise, where sigma = 0.05 is relative to the column's standard deviation
- non-numeric columns are copied verbatim
- tag each row with a _sim_origin column ("real" | "simulated")
- the per-column Kolmogorov-Smirnov statistic is logged as a quality signal
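
The steps above can be sketched as follows. This is a minimal illustration assuming pandas and numpy, not the code in agents/data_simulation.py: the function name `augment`, the `n_new` and `seed` parameters, and the choice to skip constant columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd

SIM_ORIGIN_COL = "_sim_origin"  # column name taken from the PR description


def augment(df: pd.DataFrame, n_new: int, sigma: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Return the real rows plus n_new perturbed bootstrap rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Bootstrap: draw row positions with replacement from the cleaned data.
    idx = rng.integers(0, len(df), size=n_new)
    sim = df.iloc[idx].reset_index(drop=True).copy()

    # Perturb numeric columns only; non-numeric columns stay verbatim.
    for col in df.select_dtypes(include="number").columns:
        col_std = df[col].std()  # standard deviation of the real data
        if pd.isna(col_std) or col_std == 0:
            continue  # assumption: leave constant or single-row columns untouched
        noise = rng.normal(0.0, sigma * col_std, size=n_new)
        sim[col] = sim[col].astype(float) + noise

    # Tag provenance so downstream phases can tell the rows apart.
    real = df.copy()
    real[SIM_ORIGIN_COL] = "real"
    sim[SIM_ORIGIN_COL] = "simulated"
    return pd.concat([real, sim], ignore_index=True)
```

Calling `augment(df, n_new=10)` on a 20-row frame would yield 30 rows, with the first 20 unchanged and tagged "real". How many rows the actual agent generates is not stated in this description.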

Integration

- agents/data_simulation.py: new DataSimulationAgent (pure stats, no LLM; scipy optional with numpy fallback)
- main.py: p1_7 closure + PhaseSpec(on_error="skip", record_experience) inserted between P1.5 and P2; --start choices extended
- agents/experience_recorder.py: _PHASE_PROMPTS["P1.7"] prompt + _extract_phase_context branch + _PHASE_NAME entry
- ui/server.py: PHASE_META, PHASE_ORDER, PHASE_COMPLETE_MAP updated
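
The "scipy optional with numpy fallback" note suggests a pattern like the one below for the per-column Kolmogorov-Smirnov statistic. This is a sketch under that assumption; the function name `ks_statistic` is hypothetical and the agent's actual fallback may differ.

```python
import numpy as np

try:
    from scipy.stats import ks_2samp  # used when scipy is installed
except ImportError:
    ks_2samp = None


def ks_statistic(real, simulated) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(real, dtype=float))
    b = np.sort(np.asarray(simulated, dtype=float))
    if ks_2samp is not None:
        return float(ks_2samp(a, b).statistic)

    # numpy fallback: evaluate both right-continuous ECDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

A value near 0 means the simulated column closely tracks the real one; a value of 1 means the two samples do not overlap at all.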

Safety

- Never overwrites cleaned_*.csv; writes augmented_*.csv alongside
- on_error="skip": missing data / read failures don't halt pipeline
- _sim_origin column name safe against existing fabrication regex
- ctx.data_simulation.simulated_files lists absolute output paths for any future validator whitelist
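
One way to honor the no-overwrite rule is to derive the output name from the input name, as sketched here. The helper `augmented_path` is hypothetical and only illustrates the cleaned_*.csv to augmented_*.csv naming described above.

```python
from pathlib import Path

CLEANED_PREFIX = "cleaned_"
AUGMENTED_PREFIX = "augmented_"


def augmented_path(cleaned: Path) -> Path:
    """Map cleaned_<name>.csv to an absolute augmented_<name>.csv in the same directory."""
    name = cleaned.name
    if not name.startswith(CLEANED_PREFIX):
        # Refuse anything else so the output can never collide with the input.
        raise ValueError(f"expected a {CLEANED_PREFIX}*.csv file, got {name!r}")
    return cleaned.with_name(AUGMENTED_PREFIX + name[len(CLEANED_PREFIX):]).resolve()
```

Because the prefix always changes, the returned path differs from the input path, and resolving it yields the kind of absolute path that simulated_files is described as holding.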

Smoke-tested with 3 synthetic CSVs (20-row → augment, 100-row → skip sufficient, text-only → skip no-numeric). All KS stats < 0.1.
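
The three smoke-test outcomes correspond to a gate along these lines. This is a sketch assuming pandas; the function name `simulation_decision`, the returned strings, and the order of the two checks are assumptions, while the threshold comes from the description.

```python
import pandas as pd

MIN_ROWS_FOR_MODELING = 30  # threshold stated in the PR description


def simulation_decision(df: pd.DataFrame) -> str:
    """Decide whether a cleaned dataframe should be augmented (hypothetical helper)."""
    if len(df) >= MIN_ROWS_FOR_MODELING:
        return "skip: sufficient"  # enough real rows already
    if df.select_dtypes(include="number").shape[1] == 0:
        return "skip: no-numeric"  # nothing to perturb
    return "augment"
```

With this gate, a 20-row numeric frame is augmented, a 100-row frame is skipped as sufficient, and a text-only frame is skipped for lack of numeric columns.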

Mola-maker merged commit 9a9af84 into main Apr 20, 2026
0 of 3 checks passed

Mola-maker merged 1 commit into main from feat/p1.7-data-simulation
