Skip to content

[r][cpp] add element-wise matrix addition#336

Merged
erliuu merged 11 commits into
bnprks:mainfrom
GreenleafLab:el/element-wise-addition
May 7, 2026
Merged

[r][cpp] add element-wise matrix addition#336
erliuu merged 11 commits into
bnprks:mainfrom
GreenleafLab:el/element-wise-addition

Conversation

@erliuu

@erliuu erliuu commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

Hi @bnprks, here are the matrix addition changes we discussed.
Resolves outstanding to-do in NEWS.md and Issue #296.

Description

Prior to this change, combining two matrices required intermediate materialization in memory with dgCMatrix, resulting in high memory overhead for very large matrices. This pull request adds a lazy element-wise addition of two IterableMatrix objects through the existing + operator.

  • Adds a MatrixAddition class that iterates column-by-column and sums via a dense accumulator mirroring SparseMultiply.
  • Overrides colSums/rowSums/vecMultiplyRight/vecMultiplyLeft via distributive/linearity properties to bypass accumulation
  • Follows MatrixMultiply subsetting and rbind2/cbind2 merge_dimnames() conventions.

Behavior

  • Matrix size requirements for element-wise addition are enforced (identical dimensions)
  • Type coercion: mismatched matrix types promote to double, equal types preserved.
  • Transpose: both matrices must share internal transpose states (users are pointed to transpose_storage_order())

Tests

  • Added one MatrixMath.Add test to align with existing arithmetic-loader conventions
  • Added basic addition, type coercion, transpose, dimnames, and dim-mismatch tests

Scope

  • Element-wise subtraction is not in this PR. Users can still subtract explicitly via mat1 + (mat2 * -1). To discuss if subtraction should be implemented as a separate function.

@bnprks

bnprks commented Apr 24, 2026

Copy link
Copy Markdown
Owner

Initially this looks pretty good! I think you mostly found all the useful helper functions, though I'll do a closer pass to look for any minor things that are missing.

One discussion I'd like to have: what are your thoughts on the main two algorithm options for doing the addition? As I see it, options are:

  1. Calculate per-column sums as dense vectors (what you've done so far)
  2. Assuming non-zero values within a column are stored in order by row, do something like a 2-way merge to create a sparse vector of the per-column sum (list of row idx + sum value).
    • There's an OrderRows helper class that can guarantee sorted order by buffering + sorting as needed. Most inputs will start with row ordering, so it's usually just the cost of buffering, sometimes the cost of sorting (e.g. if there is a re-ordering matrix selection applied).

I'm not 100% sure which one makes more sense for this use-case, but I'd like to make sure that we've done some napkin math to think through the efficiency comparison between the algorithms.

@erliuu

erliuu commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the suggestions!

Assuming sorted, the 2-way merge method does seem like a more feasible method (especially helpful with OrderRows). My understanding is that method 1 time scales based on the dimensions of the matrix (re-zeroes and sweeps n-rows buffer for each column). While method 2 only scales with the total number of non-zero elements. In the un-sorted case, OrderRows just uses radix sort which scales linearly.

Both cases put method 2 at O(n) while method 1 scales at O(cols * rows + n), which should be strict asymptotic improvement. Please let me know if I made any incorrect assumptions here, or else I'm happy to draft up an implementation for it.

@bnprks

bnprks commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Yeah, that matches with my rough analysis, so I think it's worth implementing the 2-way merge approach if you're up for it. For the sparse matrices we work on, the non-zero fraction tends to be in the range of 0.01-0.05 for RNA (see fig S2 e+f in this manuscript), so I often approximate n = 0.05 * cols * rows. That means we're often talking about a difference in constants in practice, though I still think it's likely to come out ahead for method 2 (more compact memory accesses, and I think possible to write mostly branch-free). In the event that matrices become very sparse, method 2 is likely to get a larger advantage as you point out with the asymptotic analysis. If you save a copy of method 1, you can always run a little performance comparison between the two to double-check on a test dataset.

@erliuu

erliuu commented Apr 28, 2026

Copy link
Copy Markdown
Collaborator Author

@bnprks I've added an implementation for 2-way merge and a couple of tests to account for the merge behaviour. I just noticed you mentioned implementing this branch-free (which this is not). I'm not the most familiar with it, but happy to take a deeper look.

I'll run some quick performance comparison on the current branch then proceed with improvements.

@bnprks

bnprks commented Apr 29, 2026

Copy link
Copy Markdown
Owner

Hi @erliuu, thanks, looks pretty good! Branch-free is ultimately a bit of reliance on the compiler, but the two tricks that seem to convince it most of the time is using ternary operators (cond ? true_val : false_val) and some boolean math on indices. For an example of this style, see this bit of MatrixAccumulators.h, which is "compacting" adjacent values that have matching row+col indices. Notice it still has one branch in the for loop, but that is highly predictable since it is not dependent on the data in the loop. (The main target is getting rid of unpredictable branches). It's not a perfect analogue for what you're doing here, but hopefully it gives some idea. (Another example is InsertionIterator::nextInsertion()) Happy to chat further if needed -- it's usually not that complicated to transform the if statements like you wrote to be branch free, but it might take some getting used to at first

I'll also see if I can dig up the instructions I made once on how to double-check what assembly code is generated for a function, since it's the real way to double-check if your code is compiled as branch free.

Other than that, there's just a couple testing details I'll need to double check to make sure that all the various edge cases about dimnames and subsetting are being exercised in testing. I think your code is correct since it's using all the right helpers, just trying to remember if there's a standard test that covers these details. There is at least one catch-all test "Generic methods work" which should be added to every time a new operator is added.

@bnprks

bnprks commented Apr 29, 2026

Copy link
Copy Markdown
Owner

@bnprks bnprks left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just got a chance to review everything -- looks really good overall! I have a couple small comments/suggestions, including a note about where to focus on writing things branch-free.

One other small suggestion I mentioned elsewhere is to add matrix addition to the "Generic methods work" test, which just requires adding one line with some way to use addition without modifying a matrix result (e.g. 0.5 * (mi + mi)).

Comment thread r/R/transforms.R Outdated
Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h
Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h Outdated
Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h
@erliuu

erliuu commented May 5, 2026

Copy link
Copy Markdown
Collaborator Author

Hey @bnprks,

Thanks for the feedback and suggestions. I've implemented your suggested changes. I just want to check that my branch-minimized implementation aligns with your comments about that section (and for my own understanding):

  1. Make this section handle values in bulk while left and right both have data, up to max load (yes a branch for the loop, but will come up true 100s of times in a row so easy to predict)

My understanding is that I shouldn't be trying to make the whole check branchfree, but reduce unpredictable branches in the inner loop. This meant that I could keep the inner loop itself branched as its still highly predictable (one exit branch per max_load_size iterations).

while (loaded < max_load_size && l_idx < l_cap && r_idx < r_cap) {
  1. Use index-boolean addition and ternary operators to remove the need for if/else inside the loop.

Claude gave me a good tip with the index checks and addition (<= and >= instead of additional == check) and I worked something like this out with ternary operators.

bool l_pick = lr_v <= rr_v;
bool r_pick = lr_v >= rr_v;
uint32_t out_row = l_pick ? lr_v : rr_v;
T out_val = l_pick ? (r_pick ? lv_v + rv_v : lv_v) : rv_v;
bool emit = (lr_v != rr_v) | (out_val != T(0));
row_buffer[loaded] = out_row;
val_buffer[loaded] = out_val;
loaded += emit;
l_idx += l_pick;
r_idx += r_pick;
  1. Leave the calls to refill outside the loop

I popped the refill outside the inner-loop of the merge case. Is it worth it to move the refills out of all three cases (drain + merge) at the end of the outer loop? It would make the code cleaner, but my understanding is that both refill-checks would be conducted every outer loop, instead of only one side for the drain cases (Likely negligible in practice though).

@erliuu

erliuu commented May 5, 2026

Copy link
Copy Markdown
Collaborator Author

I followed your instructions to generate the assembly for 000000000088eb20 W BPCells::MatrixAddition<unsigned int>::load() and pulled what I thought was the relevant code for review with the python wrapper:

142	                while (loaded < max_load_size && l_idx < l_cap && r_idx < r_cap) {
   0x000000000088efe7 <+1223>:	mov    DWORD PTR [r15+0x148],ecx
   0x000000000088efee <+1230>:	cmp    ecx,DWORD PTR [r15+0x14c]
   0x000000000088eff5 <+1237>:	jae    88f050 <BPCells::MatrixAddition<unsigned int>::load()+1328>
   0x000000000088eff7 <+1239>:	cmp    eax,DWORD PTR [r15+0x154]
   0x000000000088effe <+1246>:	jae    88f0b8 <BPCells::MatrixAddition<unsigned int>::load()+1432>
   0x000000000088f004 <+1252>:	cmp    DWORD PTR [r15+0x15c],edx
   0x000000000088f00b <+1259>:	jbe    88f050 <BPCells::MatrixAddition<unsigned int>::load()+1328>
   0x000000000088f00d <+1261>:	add    rax,r8
   0x000000000088f010 <+1264>:	add    rdx,r14
143	                    uint32_t lr_v = lr[l_idx];
   0x000000000088f013 <+1267>:	mov    edi,DWORD PTR [r9+rax*4]
144	                    uint32_t rr_v = rr[r_idx];
   0x000000000088f017 <+1271>:	mov    esi,DWORD PTR [r13+rdx*4+0x0]
145	                    T lv_v = lv[l_idx];
   0x000000000088f01c <+1276>:	mov    ebp,DWORD PTR [rbx+rax*4]
146	                    T rv_v = rv[r_idx];
   0x000000000088f01f <+1279>:	mov    r10d,DWORD PTR [r12+rdx*4]
147	
148	                    bool l_pick = lr_v <= rr_v;
   0x000000000088f023 <+1283>:	cmp    edi,esi
   0x000000000088f025 <+1285>:	mov    r11d,esi
   0x000000000088f028 <+1288>:	cmovbe r11d,edi
   0x000000000088f02c <+1292>:	setbe  al
149	                    bool r_pick = lr_v >= rr_v;
   0x000000000088f02f <+1295>:	setae  dl
150	                    uint32_t out_row = l_pick ? lr_v : rr_v;
151	                    T out_val = l_pick ? (r_pick ? lv_v + rv_v : lv_v) : rv_v;
   0x000000000088f032 <+1298>:	ja     88ef93 <BPCells::MatrixAddition<unsigned int>::load()+1139>
   0x000000000088f038 <+1304>:	jae    88ef90 <BPCells::MatrixAddition<unsigned int>::load()+1136>
   0x000000000088f03e <+1310>:	mov    r10d,ebp
   0x000000000088f041 <+1313>:	jmp    88ef93 <BPCells::MatrixAddition<unsigned int>::load()+1139>
   0x000000000088f046 <+1318>:	cs nop WORD PTR [rax+rax*1+0x0]

I can't say I know how to read assembly well, but my understanding is that we are trying to minimize the j* conditional jumps in the main body of the inner-loop (merge-case). Lines 143-150 appear to show a successful branchfree implementation of the changes, but Line 151 seems to show that the output value ternary still compiled branched. Please let me know if I am looking at this correctly, and what next steps might be. Thanks!

@bnprks bnprks left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job on the adjustments, just two small one-line suggestions on my end.

My understanding is that I shouldn't be trying to make the whole check branchfree, but reduce unpredictable branches in the inner loop. This meant that I could keep the inner loop itself branched as its still highly predictable (one exit branch per max_load_size iterations).

Yep, exactly right

Is it worth it to move the refills out of all three cases (drain + merge) at the end of the outer loop?

Because it's out of the hot loop I think it probably doesn't matter and you can do whatever makes the code easiest to read/think about in your judgement

Lines 143-150 appear to show a successful branchfree implementation of the changes, but Line 151 seems to show that the output value ternary still compiled branched. Please let me know if I am looking at this correctly, and what next steps might be.

I agree with your analysis that we only hit partially branch free. I checked a few tricks on my end and also wasn't successful getting the compiler to spit out branch-free code for out_val. (Tricks were 1. using clang instead of gcc 2. Trying out_val = l_pick * lv_v + r_pick * rv_v). It seems my intuition here was a bit off for what would produce branch-free code

I'd recommend taking the partial success and moving on, as further wrestling with the compiler would require (1) running actual timing tests/profiling to get ground-truth performance data (2) trying some more variants of how to write the inner loop (maybe first merge values into the output buffer without adding, then do the addition/zero dropping in a second loop? Maybe there are other slight tweaks possible?). You can develop intuitions over time for what might work and what might not, but at the end of the day there's always an element of guess and check.

I don't think that this matrix addition is likely to be a common bottleneck, hence why I think it might be easiest to just move on. If you're having fun with this as a chance to learn more about inspecting assembly/profiling code then go for it, otherwise I think this code is basically ready to merge

Comment thread r/src/bpcells-cpp/matrixIterators/MatrixAddition.h Outdated
Comment thread r/tests/testthat/test-matrix_utils.R Outdated
Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>
@erliuu erliuu merged commit bde5737 into bnprks:main May 7, 2026
0 of 4 checks passed
@erliuu erliuu deleted the el/element-wise-addition branch May 7, 2026 04:21
@erliuu erliuu mentioned this pull request May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants