feat: P1.7 data simulation agent for small-sample augmentation #7

Merged
Mola-maker merged 1 commit into main from feat/p1.7-data-simulation on Apr 20, 2026

Conversation

@Mola-maker
Owner

Adds phase P1.7, inserted between P1.5 (data cleaning) and P2 (modeling), to augment cleaned CSVs whose row count falls below MIN_ROWS_FOR_MODELING (30).

Augmentation method

Gaussian-perturbation bootstrap:

  • bootstrap-sample rows from the cleaned dataframe
  • for each numeric column, add N(0, sigma * col_std) noise (sigma = 0.05, relative to the column's standard deviation)
  • non-numeric columns are copied verbatim
  • tag each row with _sim_origin column ("real" | "simulated")
  • per-column Kolmogorov-Smirnov stat is logged as quality signal
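
The steps above can be sketched roughly as follows. This is an illustration, not the code in agents/data_simulation.py: the function name `augment`, its parameters, and the seed handling are hypothetical, and it assumes that the second argument of N(0, ·) is a standard deviation and that col_std is the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd


def augment(df: pd.DataFrame, n_new: int, sigma: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Gaussian-perturbation bootstrap (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    # Bootstrap: draw n_new row positions with replacement.
    picks = rng.integers(0, len(df), size=n_new)
    sim = df.iloc[picks].reset_index(drop=True).copy()
    # Jitter each numeric column with noise scaled to that column's std.
    # Assumption: N(0, s) is read as mean 0, standard deviation s.
    for col in df.select_dtypes(include="number").columns:
        col_std = float(df[col].std(ddof=0))
        noise = rng.normal(0.0, sigma * col_std, size=n_new)
        sim[col] = sim[col].astype(float) + noise
    # Non-numeric columns were copied verbatim by the row selection above.
    real = df.copy()
    real["_sim_origin"] = "real"
    sim["_sim_origin"] = "simulated"
    return pd.concat([real, sim], ignore_index=True)
```

Calling `augment(df, n_new=10)` on a 20-row frame would return 30 rows, with the original rows first and unchanged apart from the added _sim_origin tag.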

Integration

  • agents/data_simulation.py: new DataSimulationAgent (pure stats, no LLM; scipy optional with numpy fallback)
  • main.py: p1_7 closure + PhaseSpec(on_error="skip", record_experience) inserted between P1.5 and P2; --start choices extended
  • agents/experience_recorder.py: _PHASE_PROMPTS["P1.7"] prompt + _extract_phase_context branch + _PHASE_NAME entry
  • ui/server.py: PHASE_META, PHASE_ORDER, PHASE_COMPLETE_MAP updated
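
One way the "scipy optional with numpy fallback" arrangement could look for the KS statistic is sketched below. The function name `ks_statistic` is hypothetical, and the sketch assumes non-empty inputs without NaN values; the agent's real fallback may differ.

```python
import numpy as np

try:
    from scipy.stats import ks_2samp  # used only when SciPy is installed
except ImportError:
    ks_2samp = None


def ks_statistic(real, simulated) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(real, dtype=float))
    b = np.sort(np.asarray(simulated, dtype=float))
    if ks_2samp is not None:
        return float(ks_2samp(a, b).statistic)
    # NumPy fallback: evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

Both branches compute the same quantity, so the logged value should not depend on whether SciPy is present.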

Safety

  • Never overwrites cleaned_*.csv; writes augmented_*.csv alongside
  • on_error="skip": missing data / read failures don't halt pipeline
  • _sim_origin column name safe against existing fabrication regex
  • ctx.data_simulation.simulated_files lists absolute output paths for any future validator whitelist
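
The no-overwrite rule could be enforced by deriving the output name from the input name, as in this minimal sketch. The helper `augmented_path` is hypothetical and assumes every input file name starts with the `cleaned_` prefix.

```python
from pathlib import Path


def augmented_path(cleaned: Path) -> Path:
    """Map cleaned_<name>.csv to augmented_<name>.csv in the same directory."""
    prefix = "cleaned_"
    if not cleaned.name.startswith(prefix):
        raise ValueError(f"expected a cleaned_*.csv file, got {cleaned.name}")
    # Only the file name changes, so the cleaned file is never the write target.
    return cleaned.with_name("augmented_" + cleaned.name[len(prefix):])
```

Calling `.resolve()` on the result would give the absolute form that ctx.data_simulation.simulated_files is described as holding.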

Smoke-tested with 3 synthetic CSVs (20-row → augment, 100-row → skip sufficient, text-only → skip no-numeric). All KS stats < 0.1.
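
The three smoke-test outcomes suggest a gating step along these lines. The function name `decide`, the returned labels, and the order of the two skip checks are assumptions made for illustration; only the threshold of 30 comes from the description.

```python
import pandas as pd

MIN_ROWS_FOR_MODELING = 30  # threshold quoted in the PR description


def decide(df: pd.DataFrame) -> str:
    """Return a hypothetical outcome label for one cleaned dataframe."""
    # Enough rows already: nothing to augment.
    if len(df) >= MIN_ROWS_FOR_MODELING:
        return "skip_sufficient"
    # No numeric columns: there is nothing to perturb.
    if df.select_dtypes(include="number").shape[1] == 0:
        return "skip_no_numeric"
    return "augment"
```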

@Mola-maker Mola-maker merged commit 9a9af84 into main Apr 20, 2026
0 of 3 checks passed