refactor: Overhaul to seed datasets #167
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR makes some significant, breaking changes to how we work with seed datasets.
The two most important new base classes are:
SeedDatasetConfig: config-side, provides pointers to datasets (HF, local file, dataframe, NMP File)SeedDatasetReader: engine-side, reading seed datasets using duckdbThe most significant structural change is that both seed column resolution (fetching column names) and overall builder validation are moved to the "back-end", i.e. to
engine. By doing this we can use the newSeedDatasetReaderobjects to get the column names consistently regardless of seed type (using duckdb). I made a singlecompile_data_designer_configfunction so that there is a single "entrypoint" to go from builder -> fully resolved and validatedDataDesignerConfig. This new compiler could do more of both these things ("deferred column resolution" and overall validation) in the future if needed.Moving validation to
enginedoes mean that we no longer have aDataDesignerConfigBuilder.validatemethod. Validation happens when you callprevieworcreatefrom the interface (and in NMP, validation will happen server-side in those endpoints). If this is too unpleasant and we want to keep (some form of) thatvalidatemethod on the builder, we could move things around a bit so that "partial" validation can be done config-side, but it would not be able to do full validation because the builder will never know the seed columns.Note: I've not yet updated docs or notebooks; I figure I'll wait until the implementation is reviewed to minimize rework there.