Opt/step5 attempt2 by mschwoer · Pull Request #91 · MannLabs/directlfq

mschwoer · 2026-06-26T07:51:22Z

supersedes: #88
The individual commits should now be easier to understand.
All unit tests pass at every commit.

Original description:

Replaces the per-protein pandas interface with numpy on the hot path: the work producer yields array-slice tuples instead of building/pickling ~14k MultiIndex DataFrames; the worker computes on numpy and returns a 4-tuple. Adds numpy helpers _cut_peptide_values / _normalize_protein_values; ProtvalCutter / NormalizationManagerProtein retained for other callers. Includes the Fortran-order summed_pepint and lexsort tie-break fixes for bit-exactness.

Estimated speedup: estimate stage 16.2 s → 9.6 s single-core (~1.7×), and restores multi-core scaling (8-core estimate ~10.5 s → ~2.4 s).

Bit-identical protein + ion output (max_abs_diff=0); suite passes.

🤖 Generated with Claude Code

--- # Conversation that produced these changes --- ## User prompt "I think pr 88 needs to be redone from scratch as it is quite confusing. Distill all key points of the implementation into a markdown file together with an implementation plan that allows for step-by-step review of logically sound (and tested commits)" "also, an upfront refactoring could make my life easier" Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tuple Preparatory step for the numpy hot-path migration (step 5, commit 1/5). Still 100% DataFrame-based; no reduction order or memory layout changes. - Rename get_input_specification_tuplelist_idx__df__num_samples_quadratic__min_nonan -> get_protein_workitems (pure rename, body unchanged). - calculate_peptide_and_protein_intensities now returns the 4-tuple (protein_profile, protein_name, ion_names, shifted_values) instead of (protein_profile, shifted_df); metadata is extracted from the shifted DataFrame's MultiIndex inside the worker. - The two assembly functions read the tuple fields directly instead of re-deriving protein_name/ion_names from the result DataFrame's index. This lands the assembly functions in their final shape so the later numpy flip is a small, focused diff that does not touch them. optbench identical, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… 2/5) Lift the numpy body of NormalizationManagerProtein._normalize_quadratic_and_linear into a module-level normalize_protein_ion_values(peptide_values, num_samples_quadratic) that reproduces *both* dispatch branches of NormalizationManager._run_normalization: - n_ions <= k: quadratic-only, rows in original (ion-name) order (mirrors _normalize_complete_input_quadratic / normalize_ion_profiles); - n_ions > k: quadratic+linear (the unchanged step-4 numpy code). The class method becomes a thin delegator. Since the class only reaches it on the n > k branch, NormalizationManagerProtein output is byte-identical; the n <= k branch is exercised only by the (upcoming) hot path. A single source removes the risk of two divergent copies of this order-sensitive float math. Pinned against the base-class .loc pipeline it reimplements (via a plain NormalizationManager, not the delegating subclass). optbench identical, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Numpy replacement for ProtvalCutter: keep <= maximum ions sorted by NaN count asc then summed intensity desc. Uses np.lexsort((neg_summed, nan_counts)) so the stable last-key-primary sort reproduces ProtvalCutter's sorted() tie-break -- ions tied on both keys keep their original ion-name order (Gotcha 2). Strangler pattern: introduced beside the live ProtvalCutter and proven by direct A/B equivalence tests (ion order incl. a full tie, and a within-limit no-op), the strongest possible check. Not yet wired into the worker. optbench identical, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…mit 4/5) Replace the per-protein pandas interface end to end with numpy. The producer ships contiguous array slices, the worker computes on arrays and returns arrays, the assembly functions (already in final form since commit 1) are untouched. - get_protein_workitems is now a generator yielding the 6-tuple (idx, protein_name, ion_names, peptide_values, num_samples_quadratic, min_nonan) over find_nameswitch_indices slices -- no per-protein DataFrame is built. - calculate_peptide_and_protein_intensities takes the 6-tuple, uses _cut_peptide_values, lfqnorm.normalize_protein_ion_values, the single-sample guard, and the Fortran-order summed_pepint (Gotcha 1: np.asfortranarray matches the former column-major DataFrame sum, which only bites proteins with >100 ions). - The two profile helpers take a numpy array (rows = ions, columns = samples). - Pool wrappers: generic parameter rename only. - visualizations.py passes .to_numpy() to the now-array profile helper. ProtvalCutter / get_subdf / get_normed_dfs are dead but kept this commit (removal is commit 5). Estimate stage 14.8s -> 9.4s single-core; 4-core 4.1s, bit-identical. optbench identical=True, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… 5/5) Now that the hot path is numpy, the DataFrame machinery is dead and removed: - Delete ProtvalCutter (+ its @njit _get_num_nas_in_row and the now-unused numba import), get_subdf, get_normed_dfs, and the redundant pandas import. - Prune __all__: drop get_normed_dfs, ProtvalCutter, and the long-dead OrphanIonRemover / OrphanIonsForDeletionSelector / IonCheckedForOrphan (which were only ever listed, never defined). - De-circularize the cut tests: ProtvalCutter is gone, so the A/B equivalence assertions become hardcoded expectations (same inputs, expected order/values). The two ProtvalCutter-specific tests are dropped (their intent is covered). - Benchmark probes: add protvalcutter_reference_sorted_index / protvalcutter_reference_cut to optbench.py (verbatim copy of the old sort/cut) and point order/sum/layout probes at them; update prof/jit/bench_estimate/pickle probes off the renamed/removed producer functions. The only NormalizationManagerProtein consumer, visualizations.py, is unaffected (it was updated for the array profile helper in commit 4). optbench identical=True, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mschwoer and others added 6 commits June 17, 2026 18:07

mschwoer changed the base branch from main to opt/step4-numpy-normalize-quadratic-linear June 26, 2026 07:51

mschwoer mentioned this pull request Jun 26, 2026

Step 5: eliminate per-protein DataFrames (ship numpy slices) #88

Closed

mschwoer requested a review from ammarcsj June 26, 2026 07:59

pre-commit

f4f4873

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Opt/step5 attempt2#91

Opt/step5 attempt2#91
mschwoer wants to merge 7 commits into
opt/step4-numpy-normalize-quadratic-linearfrom
opt/step5-attempt2

mschwoer commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mschwoer commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mschwoer commented Jun 26, 2026 •

edited

Loading