Introduce cardinality limit on One Hot encodings #833
Code Review
This pull request introduces two valuable features: capping the maximum number of features after preprocessing and adding a cardinality limit for one-hot encoding. While the intent is good, the current implementation has a few critical issues that result in the incorrect handling of feature modalities. Specifically, high-cardinality categorical features lose their categorical status after being filtered from one-hot encoding, and a new subsampling step incorrectly converts all features to numerical. The associated tests are not comprehensive enough to detect these bugs. I have provided detailed feedback and suggestions for fixes in the review comments.
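To illustrate the first issue, here is a minimal sketch of how a cardinality limit could route categorical columns without dropping their categorical status. It is not the code from this pull request: `split_by_cardinality`, `max_onehot_cardinality`, and the list-of-columns data layout are hypothetical names and assumptions chosen for the example.

```python
from typing import List, Optional, Sequence, Tuple


def split_by_cardinality(
    columns: Sequence[Sequence[object]],
    categorical_indices: Sequence[int],
    max_onehot_cardinality: Optional[int],
) -> Tuple[List[int], List[int]]:
    """Split categorical columns into (one-hot encoded, kept categorical).

    Hypothetical helper: columns whose number of distinct values exceeds
    the limit are skipped by one-hot encoding but are still reported as
    categorical, so later steps do not treat them as numerical.
    """
    to_onehot: List[int] = []
    kept_categorical: List[int] = []
    for idx in categorical_indices:
        n_unique = len(set(columns[idx]))
        if max_onehot_cardinality is None or n_unique <= max_onehot_cardinality:
            to_onehot.append(idx)
        else:
            # Too many categories for one-hot encoding: keep the column
            # in the categorical set instead of silently dropping it.
            kept_categorical.append(idx)
    return to_onehot, kept_categorical


# Toy data, one inner list per feature column.
columns = [
    ["a", "b", "a", "b"],  # categorical, 2 distinct values
    [0.1, 0.5, 0.9, 0.3],  # numerical, not listed as categorical
    ["u", "v", "w", "x"],  # categorical, 4 distinct values
]

to_onehot, kept_categorical = split_by_cardinality(
    columns, categorical_indices=[0, 2], max_onehot_cardinality=3
)
```

With a limit of 3, column 0 is one-hot encoded and column 2 stays categorical. Passing `None` as the limit models the disabled case, where every categorical column is one-hot encoded.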
bejaeger left a comment
Thanks!
I think the "max cardinality one-hot-encoding" is a good addition, but I think we should remove the logic for the second resampling for now.
Disabled by default.