Introduce cardinality limit on One Hot encodings #833

Merged
klemens-floege merged 12 commits into main from klemens/cap-max-feats-after
Mar 23, 2026

@klemens-floege (Contributor) commented Mar 22, 2026

The new cardinality limit is disabled by default.

@klemens-floege klemens-floege requested a review from a team as a code owner March 22, 2026 12:50
@klemens-floege klemens-floege requested review from anuragg1209 and removed request for a team March 22, 2026 12:50
@klemens-floege klemens-floege requested review from LennartPurucker and bejaeger and removed request for anuragg1209 March 22, 2026 12:50
@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces two valuable features: capping the maximum number of features after preprocessing and adding a cardinality limit for one-hot encoding. While the intent is good, the current implementation has a few critical issues that result in the incorrect handling of feature modalities. Specifically, high-cardinality categorical features lose their categorical status after being filtered from one-hot encoding, and a new subsampling step incorrectly converts all features to numerical. The associated tests are not comprehensive enough to detect these bugs. I have provided detailed feedback and suggestions for fixes in the review comments.
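
To make the first issue concrete, here is a minimal sketch of how a cardinality-limited one-hot encoder could keep track of feature modalities. It is not the code from this PR: the function name `one_hot_with_cardinality_limit`, the parameter `max_onehot_cardinality`, and the returned index list are all hypothetical, and the sketch uses plain NumPy rather than the TabPFN preprocessing classes. It also assumes there are no missing values. With the limit left at `None` (mirroring "disabled by default"), every categorical column is expanded; otherwise, columns with more unique values than the limit are passed through unchanged and reported as still categorical.

```python
from __future__ import annotations

import numpy as np


def one_hot_with_cardinality_limit(
    X: np.ndarray,
    categorical_idx: list[int],
    max_onehot_cardinality: int | None = None,  # hypothetical name; None means no limit
) -> tuple[np.ndarray, list[int]]:
    """Return the encoded matrix and the output column indices that stay categorical."""
    cat_set = set(categorical_idx)
    out_cols: list[np.ndarray] = []
    still_categorical: list[int] = []

    for j in range(X.shape[1]):
        col = X[:, j]
        if j not in cat_set:
            # Numerical columns are copied through untouched.
            out_cols.append(col.astype(float))
            continue

        categories = np.unique(col)
        over_limit = (
            max_onehot_cardinality is not None
            and len(categories) > max_onehot_cardinality
        )
        if over_limit:
            # Skip one-hot encoding, but record the output position so the
            # column keeps its categorical status downstream.
            still_categorical.append(len(out_cols))
            out_cols.append(col.astype(float))
        else:
            # One indicator column per observed category.
            for c in categories:
                out_cols.append((col == c).astype(float))

    return np.column_stack(out_cols), still_categorical


# Column 0 has 2 categories, column 1 has 4, column 2 is numerical.
X = np.array([[0, 0, 1.5], [1, 1, 2.5], [0, 2, 3.5], [1, 3, 4.5]])
encoded, cat_idx = one_hot_with_cardinality_limit(X, [0, 1], max_onehot_cardinality=2)
print(encoded.shape, cat_idx)  # (4, 4) [2]
```

In this sketch the high-cardinality column ends up at output index 2 and is still listed as categorical, which is the property the review says was lost. A test along these lines would check the returned categorical indices, not just the output shape.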

Comment thread src/tabpfn/preprocessing/steps/encode_categorical_features_step.py Outdated
Comment thread src/tabpfn/preprocessing/pipeline_factory.py Outdated
Comment thread tests/test_preprocessing/test_encode_categorical_features_step.py
Comment thread tests/test_preprocessing/test_reshape_feature_distribution_step.py Outdated
@bejaeger (Collaborator) left a comment

Thanks!
I think the "max cardinality one-hot-encoding" is a good addition, but I think we should remove the logic for the second resampling for now.

Comment thread src/tabpfn/preprocessing/steps/encode_categorical_features_step.py Outdated
Comment thread src/tabpfn/preprocessing/pipeline_factory.py Outdated
@klemens-floege klemens-floege requested a review from bejaeger March 23, 2026 10:27
@bejaeger (Collaborator) left a comment

Thanks! Just one nit.

@klemens-floege klemens-floege changed the title from "WIP: Cap max features after preprocessing + introduce cardinality limit on One Hot encodings" to "Introduce cardinality limit on One Hot encodings" Mar 23, 2026
Comment thread src/tabpfn/preprocessing/configs.py
@klemens-floege klemens-floege added this pull request to the merge queue Mar 23, 2026
Merged via the queue into main with commit 5ad24a5 Mar 23, 2026
12 checks passed