Opt/step5 attempt2#91
Open
mschwoer wants to merge 7 commits into
Open
Conversation
--- # Conversation that produced these changes --- ## User prompt "I think pr 88 needs to be redone from scratch as it is quite confusing. Distill all key points of the implementation into a markdown file together with an implementation plan that allows for step-by-step review of logically sound (and tested commits)" "also, an upfront refactoring could make my life easier" Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tuple Preparatory step for the numpy hot-path migration (step 5, commit 1/5). Still 100% DataFrame-based; no reduction order or memory layout changes. - Rename get_input_specification_tuplelist_idx__df__num_samples_quadratic__min_nonan -> get_protein_workitems (pure rename, body unchanged). - calculate_peptide_and_protein_intensities now returns the 4-tuple (protein_profile, protein_name, ion_names, shifted_values) instead of (protein_profile, shifted_df); metadata is extracted from the shifted DataFrame's MultiIndex inside the worker. - The two assembly functions read the tuple fields directly instead of re-deriving protein_name/ion_names from the result DataFrame's index. This lands the assembly functions in their final shape so the later numpy flip is a small, focused diff that does not touch them. optbench identical, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… 2/5) Lift the numpy body of NormalizationManagerProtein._normalize_quadratic_and_linear into a module-level normalize_protein_ion_values(peptide_values, num_samples_quadratic) that reproduces *both* dispatch branches of NormalizationManager._run_normalization: - n_ions <= k: quadratic-only, rows in original (ion-name) order (mirrors _normalize_complete_input_quadratic / normalize_ion_profiles); - n_ions > k: quadratic+linear (the unchanged step-4 numpy code). The class method becomes a thin delegator. Since the class only reaches it on the n > k branch, NormalizationManagerProtein output is byte-identical; the n <= k branch is exercised only by the (upcoming) hot path. A single source removes the risk of two divergent copies of this order-sensitive float math. Pinned against the base-class .loc pipeline it reimplements (via a plain NormalizationManager, not the delegating subclass). optbench identical, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Numpy replacement for ProtvalCutter: keep <= maximum ions sorted by NaN count asc then summed intensity desc. Uses np.lexsort((neg_summed, nan_counts)) so the stable last-key-primary sort reproduces ProtvalCutter's sorted() tie-break -- ions tied on both keys keep their original ion-name order (Gotcha 2). Strangler pattern: introduced beside the live ProtvalCutter and proven by direct A/B equivalence tests (ion order incl. a full tie, and a within-limit no-op), the strongest possible check. Not yet wired into the worker. optbench identical, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mit 4/5) Replace the per-protein pandas interface end to end with numpy. The producer ships contiguous array slices, the worker computes on arrays and returns arrays, the assembly functions (already in final form since commit 1) are untouched. - get_protein_workitems is now a generator yielding the 6-tuple (idx, protein_name, ion_names, peptide_values, num_samples_quadratic, min_nonan) over find_nameswitch_indices slices -- no per-protein DataFrame is built. - calculate_peptide_and_protein_intensities takes the 6-tuple, uses _cut_peptide_values, lfqnorm.normalize_protein_ion_values, the single-sample guard, and the Fortran-order summed_pepint (Gotcha 1: np.asfortranarray matches the former column-major DataFrame sum, which only bites proteins with >100 ions). - The two profile helpers take a numpy array (rows = ions, columns = samples). - Pool wrappers: generic parameter rename only. - visualizations.py passes .to_numpy() to the now-array profile helper. ProtvalCutter / get_subdf / get_normed_dfs are dead but kept this commit (removal is commit 5). Estimate stage 14.8s -> 9.4s single-core; 4-core 4.1s, bit-identical. optbench identical=True, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… 5/5) Now that the hot path is numpy, the DataFrame machinery is dead and removed: - Delete ProtvalCutter (+ its @njit _get_num_nas_in_row and the now-unused numba import), get_subdf, get_normed_dfs, and the redundant pandas import. - Prune __all__: drop get_normed_dfs, ProtvalCutter, and the long-dead OrphanIonRemover / OrphanIonsForDeletionSelector / IonCheckedForOrphan (which were only ever listed, never defined). - De-circularize the cut tests: ProtvalCutter is gone, so the A/B equivalence assertions become hardcoded expectations (same inputs, expected order/values). The two ProtvalCutter-specific tests are dropped (their intent is covered). - Benchmark probes: add protvalcutter_reference_sorted_index / protvalcutter_reference_cut to optbench.py (verbatim copy of the old sort/cut) and point order/sum/layout probes at them; update prof/jit/bench_estimate/pickle probes off the renamed/removed producer functions. The only NormalizationManagerProtein consumer, visualizations.py, is unaffected (it was updated for the array profile helper in commit 4). optbench identical=True, pytest green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
supersedes: #88
The individual commits should now be easier to understand.
All unit tests pass at every commit.
Original description:
Replaces the per-protein pandas interface with numpy on the hot path: the work producer yields array-slice tuples instead of building/pickling ~14k MultiIndex DataFrames; the worker computes on numpy and returns a 4-tuple. Adds numpy helpers _cut_peptide_values / _normalize_protein_values; ProtvalCutter / NormalizationManagerProtein retained for other callers. Includes the Fortran-order summed_pepint and lexsort tie-break fixes for bit-exactness.
Estimated speedup: estimate stage 16.2 s → 9.6 s single-core (~1.7×), and restores multi-core scaling (8-core estimate ~10.5 s → ~2.4 s).
Bit-identical protein + ion output (max_abs_diff=0); suite passes.
🤖 Generated with Claude Code