Skip to content

Opt/step5 attempt2#91

Open
mschwoer wants to merge 7 commits into
opt/step4-numpy-normalize-quadratic-linearfrom
opt/step5-attempt2
Open

Opt/step5 attempt2#91
mschwoer wants to merge 7 commits into
opt/step4-numpy-normalize-quadratic-linearfrom
opt/step5-attempt2

Conversation

@mschwoer

@mschwoer mschwoer commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

supersedes: #88
The individual commits should now be easier to understand.
All unit tests pass at every commit.

Original description:

Replaces the per-protein pandas interface with numpy on the hot path: the work producer yields array-slice tuples instead of building/pickling ~14k MultiIndex DataFrames; the worker computes on numpy and returns a 4-tuple. Adds numpy helpers _cut_peptide_values / _normalize_protein_values; ProtvalCutter / NormalizationManagerProtein retained for other callers. Includes the Fortran-order summed_pepint and lexsort tie-break fixes for bit-exactness.

Estimated speedup: estimate stage 16.2 s → 9.6 s single-core (~1.7×), and restores multi-core scaling (8-core estimate ~10.5 s → ~2.4 s).

Bit-identical protein + ion output (max_abs_diff=0); suite passes.

🤖 Generated with Claude Code

mschwoer and others added 6 commits June 17, 2026 18:07
---
# Conversation that produced these changes
---

## User prompt

"I think pr 88 needs to be redone from scratch as it is quite confusing. Distill all key points of the implementation into a markdown file together with an implementation plan that allows for step-by-step review of logically sound (and tested commits)"

"also,  an upfront refactoring could make my life easier"

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tuple

Preparatory step for the numpy hot-path migration (step 5, commit 1/5).
Still 100% DataFrame-based; no reduction order or memory layout changes.

- Rename get_input_specification_tuplelist_idx__df__num_samples_quadratic__min_nonan
  -> get_protein_workitems (pure rename, body unchanged).
- calculate_peptide_and_protein_intensities now returns the 4-tuple
  (protein_profile, protein_name, ion_names, shifted_values) instead of
  (protein_profile, shifted_df); metadata is extracted from the shifted
  DataFrame's MultiIndex inside the worker.
- The two assembly functions read the tuple fields directly instead of
  re-deriving protein_name/ion_names from the result DataFrame's index.

This lands the assembly functions in their final shape so the later numpy
flip is a small, focused diff that does not touch them. optbench identical,
pytest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… 2/5)

Lift the numpy body of NormalizationManagerProtein._normalize_quadratic_and_linear
into a module-level normalize_protein_ion_values(peptide_values,
num_samples_quadratic) that reproduces *both* dispatch branches of
NormalizationManager._run_normalization:

- n_ions <= k: quadratic-only, rows in original (ion-name) order
  (mirrors _normalize_complete_input_quadratic / normalize_ion_profiles);
- n_ions  > k: quadratic+linear (the unchanged step-4 numpy code).

The class method becomes a thin delegator. Since the class only reaches it on
the n > k branch, NormalizationManagerProtein output is byte-identical; the
n <= k branch is exercised only by the (upcoming) hot path. A single source
removes the risk of two divergent copies of this order-sensitive float math.

Pinned against the base-class .loc pipeline it reimplements (via a plain
NormalizationManager, not the delegating subclass). optbench identical,
pytest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Numpy replacement for ProtvalCutter: keep <= maximum ions sorted by NaN count
asc then summed intensity desc. Uses np.lexsort((neg_summed, nan_counts)) so the
stable last-key-primary sort reproduces ProtvalCutter's sorted() tie-break --
ions tied on both keys keep their original ion-name order (Gotcha 2).

Strangler pattern: introduced beside the live ProtvalCutter and proven by direct
A/B equivalence tests (ion order incl. a full tie, and a within-limit no-op), the
strongest possible check. Not yet wired into the worker. optbench identical,
pytest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mit 4/5)

Replace the per-protein pandas interface end to end with numpy. The producer
ships contiguous array slices, the worker computes on arrays and returns arrays,
the assembly functions (already in final form since commit 1) are untouched.

- get_protein_workitems is now a generator yielding the 6-tuple
  (idx, protein_name, ion_names, peptide_values, num_samples_quadratic, min_nonan)
  over find_nameswitch_indices slices -- no per-protein DataFrame is built.
- calculate_peptide_and_protein_intensities takes the 6-tuple, uses
  _cut_peptide_values, lfqnorm.normalize_protein_ion_values, the single-sample
  guard, and the Fortran-order summed_pepint (Gotcha 1: np.asfortranarray matches
  the former column-major DataFrame sum, which only bites proteins with >100 ions).
- The two profile helpers take a numpy array (rows = ions, columns = samples).
- Pool wrappers: generic parameter rename only.
- visualizations.py passes .to_numpy() to the now-array profile helper.

ProtvalCutter / get_subdf / get_normed_dfs are dead but kept this commit (removal
is commit 5). Estimate stage 14.8s -> 9.4s single-core; 4-core 4.1s, bit-identical.
optbench identical=True, pytest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… 5/5)

Now that the hot path is numpy, the DataFrame machinery is dead and removed:

- Delete ProtvalCutter (+ its @njit _get_num_nas_in_row and the now-unused
  numba import), get_subdf, get_normed_dfs, and the redundant pandas import.
- Prune __all__: drop get_normed_dfs, ProtvalCutter, and the long-dead
  OrphanIonRemover / OrphanIonsForDeletionSelector / IonCheckedForOrphan
  (which were only ever listed, never defined).
- De-circularize the cut tests: ProtvalCutter is gone, so the A/B equivalence
  assertions become hardcoded expectations (same inputs, expected order/values).
  The two ProtvalCutter-specific tests are dropped (their intent is covered).
- Benchmark probes: add protvalcutter_reference_sorted_index /
  protvalcutter_reference_cut to optbench.py (verbatim copy of the old sort/cut)
  and point order/sum/layout probes at them; update prof/jit/bench_estimate/pickle
  probes off the renamed/removed producer functions.

The only NormalizationManagerProtein consumer, visualizations.py, is unaffected
(it was updated for the array profile helper in commit 4). optbench identical=True,
pytest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mschwoer mschwoer changed the base branch from main to opt/step4-numpy-normalize-quadratic-linear June 26, 2026 07:51
@mschwoer mschwoer requested a review from ammarcsj June 26, 2026 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant