Fix Greenacre correction for subset MCA (#206) by MaxHalford · Pull Request #236 · MaxHalford/prince

MaxHalford · 2026-06-14T20:48:00Z

Summary

When one_hot_columns_to_drop is used with correction="greenacre", the existing closed-form correction is mis-calibrated: it assumes uniform row sums in the indicator matrix, and dropping categories breaks that assumption. Reported in "Correction for the explained inertia in subset MCA" (#206).
This PR adds a subset-MCA branch ported from R's ca::mjca(lambda='adjusted', subsetcat=...): column marginals are computed on the full Burt matrix (so dropping categories doesn't redistribute mass), and per-dimension inertia is taken from the eigendecomposition of S_null restricted to the active categories, with the total inertia coming from S_e (independence within each variable's block) on the same subset.
The new path triggers only when one_hot_columns_to_drop is not None and correction="greenacre". The existing non-subset Greenacre and Benzécri paths are unchanged.
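To illustrate the broken assumption mentioned in the summary, here is a minimal sketch (not code from this PR) showing that every row of a full indicator matrix sums to the number of variables J, while dropping a category column leaves some rows short. The toy data, the `indicator` helper, and the `var0__red`-style labels are all hypothetical.

```python
import numpy as np

# Toy data: 4 observations, J = 2 categorical variables (hypothetical values).
data = np.array([
    ["red", "oak"],
    ["white", "steel"],
    ["red", "steel"],
    ["white", "oak"],
])


def indicator(columns):
    """One-hot encode each column; return the matrix and its column labels."""
    blocks, labels = [], []
    for j in range(columns.shape[1]):
        levels, codes = np.unique(columns[:, j], return_inverse=True)
        # Row i of the block has a single 1 in the position of its level.
        blocks.append(np.eye(len(levels), dtype=int)[codes.ravel()])
        labels += [f"var{j}__{level}" for level in levels]
    return np.hstack(blocks), labels


Z, labels = indicator(data)
full_row_sums = Z.sum(axis=1)  # every row sums to J = 2

# Drop one category column, as a subset MCA would.
keep = [i for i, name in enumerate(labels) if name != "var1__steel"]
subset_row_sums = Z[:, keep].sum(axis=1)  # rows that had "steel" now sum to 1

print(full_row_sums, subset_row_sums)
```

Any formula that treats the row sums as a constant J is therefore off for the rows that lost a category, which is the situation the new branch handles.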

Test plan

New test_subset_greenacre_matches_ca_mjca compares prince.MCA against R ca::mjca on the burgundy wines dataset — eigenvalues and percentages match to ~1e-8.
Existing test_abdi_2007_correction (non-subset Greenacre/Benzécri doctest) still passes.
Full tests/test_mca.py passes (60/60).

🤖 Generated with Claude Code

When `one_hot_columns_to_drop` is set, the closed-form Benzécri/Greenacre formula no longer applies: it assumes uniform row sums in the indicator matrix, which subsetting violates. Switch to Greenacre's subset-MCA adjustment (CA in Practice, ch. 21), ported from R's `ca::mjca(lambda='adjusted', subsetcat=...)`. The fix only triggers when `one_hot_columns_to_drop is not None` and `correction='greenacre'`; the existing non-subset path is unchanged. Fixes #206.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Build the indicator matrix directly as a scipy CSC sparse matrix (factorize columns, build COO from offsets) instead of going through pd.get_dummies. Densify into one contiguous float ndarray right before CA.fit so sklearn's check_array sees a single block instead of iterating over J per-column ExtensionArrays. The subset-Greenacre Burt matmul now also benefits from sparse Zᵀ Z.

Benchmarks (n_components=5, sklearn engine, 3 runs / best):

1k x 10 x 5   : 23ms -> 4ms (-83%)
10k x 20 x 10 : 62ms -> 45ms (-27%)
50k x 30 x 10 : 413ms -> 319ms (-23%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
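The construction described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in mca.py: it uses `np.unique(..., return_inverse=True)` as a stand-in for factorizing each column, and the `sparse_indicator` helper and random test data are hypothetical.

```python
import numpy as np
from scipy import sparse


def sparse_indicator(columns):
    """Build a one-hot indicator matrix in CSC format from per-column codes."""
    n_rows, n_vars = columns.shape
    codes, n_levels = [], []
    for j in range(n_vars):
        # Stand-in for factorizing: integer code of each row's level.
        levels, col_codes = np.unique(columns[:, j], return_inverse=True)
        codes.append(col_codes.ravel())
        n_levels.append(len(levels))
    # Starting column of each variable's block in the indicator matrix.
    offsets = np.concatenate(([0], np.cumsum(n_levels)[:-1]))
    rows = np.tile(np.arange(n_rows), n_vars)
    cols = np.concatenate([c + o for c, o in zip(codes, offsets)])
    vals = np.ones(rows.size)
    shape = (n_rows, int(np.sum(n_levels)))
    return sparse.coo_matrix((vals, (rows, cols)), shape=shape).tocsc()


rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 4))  # 4 variables, up to 3 levels each

Z = sparse_indicator(X)
burt = (Z.T @ Z).toarray()  # Burt matrix from a sparse product
Z_dense = np.ascontiguousarray(Z.toarray(), dtype=float)  # single dense block
```

Each row contributes exactly one nonzero per variable, so the COO triplets can be written down without ever materializing per-column dummy frames.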

Resolves ty unresolved-attribute warnings for ``full_one_hot_sp.T`` and ``np.repeat(..., n_levels)``: both are only used when ``self.one_hot`` is true, so move the block inside that branch instead of carrying ``None`` sentinels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

FactoMineR has no eigenvalue correction; the existing test_abdi_2007_correction relies on hardcoded numbers from the paper. Add a live rpy2 cross-check against ca::mjca(lambda='adjusted'), which implements the same closed-form correction prince does for the non-subset path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
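For reference, the closed-form correction for the non-subset path can be sketched as below. This follows the formulas as commonly stated (for example in Abdi & Valentin, 2007) and is not taken from prince's source; the function name and arguments are hypothetical. The input is assumed to be the principal inertias of the indicator matrix, with K variables and J categories in total.

```python
import numpy as np


def closed_form_correction(indicator_eigenvalues, n_variables, n_categories):
    """Benzécri-corrected eigenvalues plus Benzécri and Greenacre percentages."""
    lam = np.asarray(indicator_eigenvalues, dtype=float)
    K, J = n_variables, n_categories

    # Benzécri: keep eigenvalues above 1/K, rescale, and square; zero otherwise.
    above = lam > 1.0 / K
    corrected = np.where(above, (K / (K - 1) * (lam - 1.0 / K)) ** 2, 0.0)

    # Benzécri percentages use the sum of corrected eigenvalues as the total.
    benzecri_pct = corrected / corrected.sum()

    # Greenacre percentages use the average off-diagonal inertia of the
    # Burt matrix as the total instead.
    avg_inertia = K / (K - 1) * (np.sum(lam**2) - (J - K) / K**2)
    greenacre_pct = corrected / avg_inertia

    return corrected, benzecri_pct, greenacre_pct


# Two binary variables (K=2, J=4): indicator eigenvalues sum to (J - K) / K = 1.
corrected, benzecri_pct, greenacre_pct = closed_form_correction([0.8, 0.2], 2, 4)
```

With these toy inputs the first corrected eigenvalue is approximately 0.36 and the second is 0. The term (J - K) / K**2 is where a constant row sum of K enters the formula, which is why it no longer holds once categories are dropped.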

scipy.linalg.svd (and fbpca) cap at min(M, N) singular values, so for a matrix with fewer than n_components rows or columns (rank r < n_components) the existing slice ``s = s[:n_components]`` silently returned fewer than the requested number of components, causing downstream shape mismatches (e.g. MFA with n_components=3 on a 2-column numerical group).

Normalise all engines to return exactly n_components by padding U/s/V with zeros for the missing tail (sklearn's randomized_svd already does this implicitly with noisy padding). The extra components have zero singular values, so their contribution to coordinates and inertias is zero, matching the rank-deficient mathematical answer.

With the scipy engine no longer crashing, switch the TestMFACategorical fixture to engine="scipy" so tight (atol=1e-4) comparisons against FactoMineR don't flake on randomized-SVD precision noise on the last component (CI flake on PR #236).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
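The zero-padding idea from this commit can be sketched like this. It is an illustration only: it uses `numpy.linalg.svd` as a stand-in for the engines named above, and the `padded_svd` helper is hypothetical rather than prince's function.

```python
import numpy as np


def padded_svd(X, n_components):
    """Truncated SVD that always returns exactly n_components components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :n_components], s[:n_components], Vt[:n_components, :]

    # The SVD yields at most min(M, N) components; pad the tail with zeros.
    missing = n_components - s.shape[0]
    if missing > 0:
        U = np.hstack([U, np.zeros((U.shape[0], missing))])
        s = np.concatenate([s, np.zeros(missing)])
        Vt = np.vstack([Vt, np.zeros((missing, Vt.shape[1]))])
    return U, s, Vt


# A 2-column matrix with 3 requested components, as in the MFA example above.
X = np.random.default_rng(0).normal(size=(6, 2))
U, s, Vt = padded_svd(X, n_components=3)
print(U.shape, s.shape, Vt.shape)
```

Because the padded singular value is zero, the product `U @ np.diag(s) @ Vt` still reconstructs X, and the extra component adds nothing to coordinates or inertia.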

MaxHalford and others added 5 commits June 14, 2026 22:47

Avoid double one-hot encoding in MCA.fit

c50d983

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Document notation in _subset_greenacre_quantities

93e4608

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update mca.py

8435eda

MaxHalford mentioned this pull request Jun 14, 2026

Correction for the explained inertia in subset MCA #206

Open

MaxHalford and others added 3 commits June 14, 2026 23:10

MaxHalford wants to merge 8 commits into master from fix/subset-mca-greenacre-correction
