Skip to content

Conversation

@mikeknep
Copy link
Contributor

This PR makes some significant, breaking changes to how we work with seed datasets.

The two most important new base classes are:

  • SeedDatasetConfig: config-side, provides pointers to datasets (HF, local file, dataframe, NMP File)
  • SeedDatasetReader: engine-side, reading seed datasets using duckdb

The most significant structural change is that both seed column resolution (fetching column names) and overall builder validation are moved to the "back-end", i.e. to engine. By doing this we can use the new SeedDatasetReader objects to get the column names consistently regardless of seed type (using duckdb). I made a single compile_data_designer_config function so that there is a single "entrypoint" to go from builder -> fully resolved and validated DataDesignerConfig. This new compiler could do more of both these things ("deferred column resolution" and overall validation) in the future if needed.

Moving validation to engine does mean that we no longer have a DataDesignerConfigBuilder.validate method. Validation happens when you call preview or create from the interface (and in NMP, validation will happen server-side in those endpoints). If this is too unpleasant and we want to keep (some form of) that validate method on the builder, we could move things around a bit so that "partial" validation can be done config-side, but it would not be able to do full validation because the builder will never know the seed columns.

Note: I've not yet updated docs or notebooks; I figure I'll wait until the implementation is reviewed to minimize rework there.

@mikeknep mikeknep force-pushed the mknepper/refactor/seed-datasets branch from bc02d82 to 86683fe Compare December 30, 2025 21:11
@mikeknep mikeknep force-pushed the mknepper/refactor/seed-datasets branch from 86683fe to 0043b3b Compare December 30, 2025 21:14
"""

dataset: str
config: SeedDatasetConfigT
Copy link
Contributor Author

@mikeknep mikeknep Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to think of a better name for this field, since elsewhere it's kind of weird to see seed_config.config. "source"? "pointer"? "location"? any suggestions?

Comment on lines +54 to +56
model_config = ConfigDict(arbitrary_types_allowed=True)

df: pd.DataFrame = Field(exclude=True, default_factory=lambda: pd.DataFrame())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought of something here, haven't tried it yet and it might be too weird even if it works, but: I think that the way duckdb works, instead of passing an actual dataframe object here, we could potentially ask users to set "the name of the variable holding the dataframe" and pass it through as a string. I think duckdb would just work with something like that. It would allow us to simplify this object, make it truly/fully serializable, and get rid of some of the warnings when de/serializing the config builder. Example:

my_df = pd.DataFrame(...)

builder.with_seed_dataset(DataFrameSeedConfig(df_var_name="my_df"))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants