[r][cpp] add element-wise matrix addition by erliuu · Pull Request #336 · bnprks/BPCells

erliuu · 2026-04-24T03:16:36Z

Hi @bnprks, here are the matrix addition changes we discussed.
Resolves outstanding to-do in NEWS.md and Issue #296.

Description

Prior to this change, combining two matrices required intermediate materialization in memory with dgCMatrix, resulting in high memory overhead for very large matrices. This pull request adds a lazy element-wise addition of two IterableMatrix objects through the existing + operator.

Adds a MatrixAddition class that iterates column-by-column and sums via a dense accumulator mirroring SparseMultiply.
Overrides colSums/rowSums/vecMultiplyRight/vecMultiplyLeft via distributive/linearity properties to bypass accumulation
Follows MatrixMultiply subsetting and rbind2/cbind2 merge_dimnames() conventions.

Behavior

Matrix size requirements for element-wise addition are enforced (identical dimensions)
Type coercion: mismatched matrix types promote to double, equal types preserved.
Transpose: both matrices must share internal transpose states (users are pointed to transpose_storage_order())

Tests

Added one MatrixMath.Add test to align with existing arithmetic-loader conventions
Added basic addition, type coercion, transpose, dimnames, and dim-mismatch tests

Scope

Element-wise subtraction is not in this PR. Users can still subtract explicitly via mat1 + (mat2 * -1). To discuss if subtraction should be implemented as a separate function.

bnprks · 2026-04-24T04:42:11Z

Initially this looks pretty good! I think you mostly found all the useful helper functions, though I'll do a closer pass to look for any minor things that are missing.

One discussion I'd like to have: what are your thoughts on the main two algorithm options for doing the addition? As I see it, options are:

Calculate per-column sums as dense vectors (what you've done so far)
Assuming non-zero values within a column are stored in order by row, do something like a 2-way merge to create a sparse vector of the per-column sum (list of row idx + sum value).
- There's an OrderRows helper class that can guarantee sorted order by buffering + sorting as needed. Most inputs will start with row ordering, so it's usually just the cost of buffering, sometimes the cost of sorting (e.g. if there is a re-ordering matrix selection applied).

I'm not 100% sure which one makes more sense for this use-case, but I'd like to make sure that we've done some napkin math to think through the efficiency comparison between the algorithms.

erliuu · 2026-04-24T21:29:25Z

Thanks for the suggestions!

Assuming sorted, the 2-way merge method does seem like a more feasible method (especially helpful with OrderRows). My understanding is that method 1 time scales based on the dimensions of the matrix (re-zeroes and sweeps n-rows buffer for each column). While method 2 only scales with the total number of non-zero elements. In the un-sorted case, OrderRows just uses radix sort which scales linearly.

Both cases put method 2 at O(n) while method 1 scales at O(cols * rows + n), which should be strict asymptotic improvement. Please let me know if I made any incorrect assumptions here, or else I'm happy to draft up an implementation for it.

bnprks · 2026-04-27T15:06:13Z

Yeah, that matches with my rough analysis, so I think it's worth implementing the 2-way merge approach if you're up for it. For the sparse matrices we work on, the non-zero fraction tends to be in the range of 0.01-0.05 for RNA (see fig S2 e+f in this manuscript), so I often approximate n = 0.05 * cols * rows. That means we're often talking about a difference in constants in practice, though I still think it's likely to come out ahead for method 2 (more compact memory accesses, and I think possible to write mostly branch-free). In the event that matrices become very sparse, method 2 is likely to get a larger advantage as you point out with the asymptotic analysis. If you save a copy of method 1, you can always run a little performance comparison between the two to double-check on a test dataset.

erliuu · 2026-04-28T23:41:44Z

@bnprks I've added an implementation for 2-way merge and a couple of tests to account for the merge behaviour. I just noticed you mentioned implementing this branch-free (which this is not). I'm not the most familiar with it, but happy to take a deeper look.

I'll run some quick performance comparison on the current branch then proceed with improvements.

bnprks · 2026-04-29T15:56:43Z

Hi @erliuu, thanks, looks pretty good! Branch-free is ultimately a bit of reliance on the compiler, but the two tricks that seem to convince it most of the time is using ternary operators (cond ? true_val : false_val) and some boolean math on indices. For an example of this style, see this bit of MatrixAccumulators.h, which is "compacting" adjacent values that have matching row+col indices. Notice it still has one branch in the for loop, but that is highly predictable since it is not dependent on the data in the loop. (The main target is getting rid of unpredictable branches). It's not a perfect analogue for what you're doing here, but hopefully it gives some idea. (Another example is InsertionIterator::nextInsertion()) Happy to chat further if needed -- it's usually not that complicated to transform the if statements like you wrote to be branch free, but it might take some getting used to at first

I'll also see if I can dig up the instructions I made once on how to double-check what assembly code is generated for a function, since it's the real way to double-check if your code is compiled as branch free.

Other than that, there's just a couple testing details I'll need to double check to make sure that all the various edge cases about dimnames and subsetting are being exercised in testing. I think your code is correct since it's using all the right helpers, just trying to remember if there's a standard test that covers these details. There is at least one catch-all test "Generic methods work" which should be added to every time a new operator is added.

bnprks · 2026-04-29T16:29:04Z

Found the link for checking generated assembly: https://github.com/bnprks/BPCells/wiki/Tools-and--Development-workflows#checking-generated-assembly-of-c-functions

bnprks

I just got a chance to review everything -- looks really good overall! I have a couple small comments/suggestions, including a note about where to focus on writing things branch-free.

One other small suggestion I mentioned elsewhere is to add matrix addition to the "Generic methods work" test, which just requires adding one line with some way to use addition without modifying a matrix result (e.g. 0.5 * (mi + mi)).

…rix.R, add denseMultiply properties

…-matrix addition setMethod location

…ching

erliuu · 2026-05-05T22:38:18Z

Hey @bnprks,

Thanks for the feedback and suggestions. I've implemented your suggested changes. I just want to check that my branch-minimized implementation aligns with your comments about that section (and for my own understanding):

Make this section handle values in bulk while left and right both have data, up to max load (yes a branch for the loop, but will come up true 100s of times in a row so easy to predict)

My understanding is that I shouldn't be trying to make the whole check branchfree, but reduce unpredictable branches in the inner loop. This meant that I could keep the inner loop itself branched as its still highly predictable (one exit branch per max_load_size iterations).

while (loaded < max_load_size && l_idx < l_cap && r_idx < r_cap) {

Use index-boolean addition and ternary operators to remove the need for if/else inside the loop.

Claude gave me a good tip with the index checks and addition (<= and >= instead of additional == check) and I worked something like this out with ternary operators.

bool l_pick = lr_v <= rr_v;
bool r_pick = lr_v >= rr_v;
uint32_t out_row = l_pick ? lr_v : rr_v;
T out_val = l_pick ? (r_pick ? lv_v + rv_v : lv_v) : rv_v;
bool emit = (lr_v != rr_v) | (out_val != T(0));
row_buffer[loaded] = out_row;
val_buffer[loaded] = out_val;
loaded += emit;
l_idx += l_pick;
r_idx += r_pick;

Leave the calls to refill outside the loop

I popped the refill outside the inner-loop of the merge case. Is it worth it to move the refills out of all three cases (drain + merge) at the end of the outer loop? It would make the code cleaner, but my understanding is that both refill-checks would be conducted every outer loop, instead of only one side for the drain cases (Likely negligible in practice though).

erliuu · 2026-05-05T23:50:34Z

I followed your instructions to generate the assembly for 000000000088eb20 W BPCells::MatrixAddition<unsigned int>::load() and pulled what I thought was the relevant code for review with the python wrapper:

142	                while (loaded < max_load_size && l_idx < l_cap && r_idx < r_cap) {
   0x000000000088efe7 <+1223>:	mov    DWORD PTR [r15+0x148],ecx
   0x000000000088efee <+1230>:	cmp    ecx,DWORD PTR [r15+0x14c]
   0x000000000088eff5 <+1237>:	jae    88f050 <BPCells::MatrixAddition<unsigned int>::load()+1328>
   0x000000000088eff7 <+1239>:	cmp    eax,DWORD PTR [r15+0x154]
   0x000000000088effe <+1246>:	jae    88f0b8 <BPCells::MatrixAddition<unsigned int>::load()+1432>
   0x000000000088f004 <+1252>:	cmp    DWORD PTR [r15+0x15c],edx
   0x000000000088f00b <+1259>:	jbe    88f050 <BPCells::MatrixAddition<unsigned int>::load()+1328>
   0x000000000088f00d <+1261>:	add    rax,r8
   0x000000000088f010 <+1264>:	add    rdx,r14
143	                    uint32_t lr_v = lr[l_idx];
   0x000000000088f013 <+1267>:	mov    edi,DWORD PTR [r9+rax*4]
144	                    uint32_t rr_v = rr[r_idx];
   0x000000000088f017 <+1271>:	mov    esi,DWORD PTR [r13+rdx*4+0x0]
145	                    T lv_v = lv[l_idx];
   0x000000000088f01c <+1276>:	mov    ebp,DWORD PTR [rbx+rax*4]
146	                    T rv_v = rv[r_idx];
   0x000000000088f01f <+1279>:	mov    r10d,DWORD PTR [r12+rdx*4]
147	
148	                    bool l_pick = lr_v <= rr_v;
   0x000000000088f023 <+1283>:	cmp    edi,esi
   0x000000000088f025 <+1285>:	mov    r11d,esi
   0x000000000088f028 <+1288>:	cmovbe r11d,edi
   0x000000000088f02c <+1292>:	setbe  al
149	                    bool r_pick = lr_v >= rr_v;
   0x000000000088f02f <+1295>:	setae  dl
150	                    uint32_t out_row = l_pick ? lr_v : rr_v;
151	                    T out_val = l_pick ? (r_pick ? lv_v + rv_v : lv_v) : rv_v;
   0x000000000088f032 <+1298>:	ja     88ef93 <BPCells::MatrixAddition<unsigned int>::load()+1139>
   0x000000000088f038 <+1304>:	jae    88ef90 <BPCells::MatrixAddition<unsigned int>::load()+1136>
   0x000000000088f03e <+1310>:	mov    r10d,ebp
   0x000000000088f041 <+1313>:	jmp    88ef93 <BPCells::MatrixAddition<unsigned int>::load()+1139>
   0x000000000088f046 <+1318>:	cs nop WORD PTR [rax+rax*1+0x0]

I can't say I know how to read assembly well, but my understanding is that we are trying to minimize the j* conditional jumps in the main body of the inner-loop (merge-case). Lines 143-150 appear to show a successful branchfree implementation of the changes, but Line 151 seems to show that the output value ternary still compiled branched. Please let me know if I am looking at this correctly, and what next steps might be. Thanks!

bnprks

Nice job on the adjustments, just two small one-line suggestions on my end.

My understanding is that I shouldn't be trying to make the whole check branchfree, but reduce unpredictable branches in the inner loop. This meant that I could keep the inner loop itself branched as its still highly predictable (one exit branch per max_load_size iterations).

Yep, exactly right

Is it worth it to move the refills out of all three cases (drain + merge) at the end of the outer loop?

Because it's out of the hot loop I think it probably doesn't matter and you can do whatever makes the code easiest to read/think about in your judgement

Lines 143-150 appear to show a successful branchfree implementation of the changes, but Line 151 seems to show that the output value ternary still compiled branched. Please let me know if I am looking at this correctly, and what next steps might be.

I agree with your analysis that we only hit partially branch free. I checked a few tricks on my end and also wasn't successful getting the compiler to spit out branch-free code for out_val. (Tricks were 1. using clang instead of gcc 2. Trying out_val = l_pick * lv_v + r_pick * rv_v). It seems my intuition here was a bit off for what would produce branch-free code

I'd recommend taking the partial success and moving on, as further wrestling with the compiler would require (1) running actual timing tests/profiling to get ground-truth performance data (2) trying some more variants of how to write the inner loop (maybe first merge values into the output buffer without adding, then do the addition/zero dropping in a second loop? Maybe there are other slight tweaks possible?). You can develop intuitions over time for what might work and what might not, but at the end of the day there's always an element of guess and check.

I don't think that this matrix addition is likely to be a common bottleneck, hence why I think it might be easiest to just move on. If you're having fun with this as a chance to learn more about inspecting assembly/profiling code then go for it, otherwise I think this code is basically ready to merge

Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>

erliuu added 4 commits April 23, 2026 19:25

[cpp] Add MatrixAddition iterator for element-wise matrix addition

c577427

[r] Add R MatrixAddition binding and support

fbda462

[r][cpp] Add regression and functional tests for MatrixAddition

87c0532

[r] Minor docs update and variable renaming for code polishing

c59b38f

erliuu added 2 commits April 28, 2026 13:48

[cpp] Rewrite MatrixAddition to use 2-way merge method

4e2fe51

[r] Add MatrixAddition merge-specific regression tests

4f0854f

bnprks reviewed May 1, 2026

View reviewed changes

Comment thread r/R/transforms.R Outdated

Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h

Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h Outdated

Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h

erliuu added 4 commits May 4, 2026 16:49

[r][cpp] Relocate + operator definition next to MatrixAddition in mat…

1dd7f02

…rix.R, add denseMultiply properties

[r] Add subtraction via MatrixAddition, add comment indicating matrix…

307504d

…-matrix addition setMethod location

[cpp] Refactor MatrixAddition merge-case to reduce unpredictable bran…

47c5a34

…ching

[r] Add matrix subtraction and generic methods work tests

539dd07

bnprks approved these changes May 7, 2026

View reviewed changes

Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h Outdated

Comment thread r/tests/testthat/test-matrix_utils.R Outdated

[r][cpp] Apply suggestion for removal of explicit zero condition/check

42e1b50

Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>

erliuu merged commit bde5737 into bnprks:main May 7, 2026
0 of 4 checks passed

erliuu deleted the el/element-wise-addition branch May 7, 2026 04:21

erliuu mentioned this pull request May 7, 2026

Addition of 2 matrices #296

Closed

Conversation

erliuu commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Behavior

Tests

Scope

Uh oh!

bnprks commented Apr 24, 2026

Uh oh!

erliuu commented Apr 24, 2026

Uh oh!

bnprks commented Apr 27, 2026

Uh oh!

erliuu commented Apr 28, 2026

Uh oh!

bnprks commented Apr 29, 2026

Uh oh!

bnprks commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bnprks left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

erliuu commented May 5, 2026

Uh oh!

erliuu commented May 5, 2026

Uh oh!

bnprks left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

erliuu commented Apr 24, 2026 •

edited

Loading

bnprks commented Apr 29, 2026 •

edited

Loading