refactor: Overhaul to seed datasets #167

mikeknep · 2025-12-30T21:10:47Z

This PR makes some significant, breaking changes to how we work with seed datasets.

The two most important new base classes are:

SeedDatasetConfig: config-side, provides pointers to datasets (HF, local file, dataframe, NMP File)
SeedDatasetReader: engine-side, reading seed datasets using duckdb

The most significant structural change is that both seed column resolution (fetching column names) and overall builder validation are moved to the "back-end", i.e. to engine. By doing this we can use the new SeedDatasetReader objects to get the column names consistently regardless of seed type (using duckdb). I made a single compile_data_designer_config function so that there is a single "entrypoint" to go from builder -> fully resolved and validated DataDesignerConfig. This new compiler could do more of both these things ("deferred column resolution" and overall validation) in the future if needed.

Moving validation to engine does mean that we no longer have a DataDesignerConfigBuilder.validate method. Validation happens when you call preview or create from the interface (and in NMP, validation will happen server-side in those endpoints). If this is too unpleasant and we want to keep (some form of) that validate method on the builder, we could move things around a bit so that "partial" validation can be done config-side, but it would not be able to do full validation because the builder will never know the seed columns.

Note: I've not yet updated docs or notebooks; I figure I'll wait until the implementation is reviewed to minimize rework there.

mikeknep · 2025-12-30T21:18:55Z

src/data_designer/config/seed.py

    """

-    dataset: str
+    config: SeedDatasetConfigT


I was trying to think of a better name for this field, since elsewhere it's kind of weird to see seed_config.config. "source"? "pointer"? "location"? any suggestions?

mikeknep · 2025-12-30T21:22:41Z

src/data_designer/config/seed_dataset.py

+    model_config = ConfigDict(arbitrary_types_allowed=True)
+
+    df: pd.DataFrame = Field(exclude=True, default_factory=lambda: pd.DataFrame())


I just thought of something here, haven't tried it yet and it might be too weird even if it works, but: I think that the way duckdb works, instead of passing an actual dataframe object here, we could potentially ask users to set "the name of the variable holding the dataframe" and pass it through as a string. I think duckdb would just work with something like that. It would allow us to simplify this object, make it truly/fully serializable, and get rid of some of the warnings when de/serializing the config builder. Example:

my_df = pd.DataFrame(...) builder.with_seed_dataset(DataFrameSeedConfig(df_var_name="my_df"))

mikeknep requested review from andreatgretel, johnnygreco and nabinchha December 30, 2025 21:10

mikeknep force-pushed the mknepper/refactor/seed-datasets branch from bc02d82 to 86683fe Compare December 30, 2025 21:11

Refactor seed datasets

0043b3b

mikeknep force-pushed the mknepper/refactor/seed-datasets branch from 86683fe to 0043b3b Compare December 30, 2025 21:14

mikeknep commented Dec 30, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: Overhaul to seed datasets #167

refactor: Overhaul to seed datasets #167

Uh oh!

mikeknep commented Dec 30, 2025

Uh oh!

mikeknep Dec 30, 2025 •

edited

Loading

Uh oh!

mikeknep Dec 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		model_config = ConfigDict(arbitrary_types_allowed=True)

		df: pd.DataFrame = Field(exclude=True, default_factory=lambda: pd.DataFrame())

refactor: Overhaul to seed datasets #167

Are you sure you want to change the base?

refactor: Overhaul to seed datasets #167

Uh oh!

Conversation

mikeknep commented Dec 30, 2025

Uh oh!

mikeknep Dec 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikeknep Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mikeknep Dec 30, 2025 •

edited

Loading