[r][cpp] add element-wise matrix addition#336
Conversation
|
Initially this looks pretty good! I think you mostly found all the useful helper functions, though I'll do a closer pass to look for any minor things that are missing. One discussion I'd like to have: what are your thoughts on the main two algorithm options for doing the addition? As I see it, options are:
I'm not 100% sure which one makes more sense for this use-case, but I'd like to make sure that we've done some napkin math to think through the efficiency comparison between the algorithms. |
|
Thanks for the suggestions! Assuming sorted, the 2-way merge method does seem like a more feasible method (especially helpful with OrderRows). My understanding is that method 1 time scales based on the dimensions of the matrix (re-zeroes and sweeps n-rows buffer for each column). While method 2 only scales with the total number of non-zero elements. In the un-sorted case, OrderRows just uses radix sort which scales linearly. Both cases put method 2 at |
|
Yeah, that matches with my rough analysis, so I think it's worth implementing the 2-way merge approach if you're up for it. For the sparse matrices we work on, the non-zero fraction tends to be in the range of 0.01-0.05 for RNA (see fig S2 e+f in this manuscript), so I often approximate n = 0.05 * cols * rows. That means we're often talking about a difference in constants in practice, though I still think it's likely to come out ahead for method 2 (more compact memory accesses, and I think possible to write mostly branch-free). In the event that matrices become very sparse, method 2 is likely to get a larger advantage as you point out with the asymptotic analysis. If you save a copy of method 1, you can always run a little performance comparison between the two to double-check on a test dataset. |
|
@bnprks I've added an implementation for 2-way merge and a couple of tests to account for the merge behaviour. I just noticed you mentioned implementing this branch-free (which this is not). I'm not the most familiar with it, but happy to take a deeper look. I'll run some quick performance comparison on the current branch then proceed with improvements. |
|
Hi @erliuu, thanks, looks pretty good! Branch-free is ultimately a bit of reliance on the compiler, but the two tricks that seem to convince it most of the time is using ternary operators ( I'll also see if I can dig up the instructions I made once on how to double-check what assembly code is generated for a function, since it's the real way to double-check if your code is compiled as branch free. Other than that, there's just a couple testing details I'll need to double check to make sure that all the various edge cases about dimnames and subsetting are being exercised in testing. I think your code is correct since it's using all the right helpers, just trying to remember if there's a standard test that covers these details. There is at least one catch-all test |
|
Found the link for checking generated assembly: https://github.com/bnprks/BPCells/wiki/Tools-and--Development-workflows#checking-generated-assembly-of-c-functions |
bnprks
left a comment
There was a problem hiding this comment.
I just got a chance to review everything -- looks really good overall! I have a couple small comments/suggestions, including a note about where to focus on writing things branch-free.
One other small suggestion I mentioned elsewhere is to add matrix addition to the "Generic methods work" test, which just requires adding one line with some way to use addition without modifying a matrix result (e.g. 0.5 * (mi + mi)).
…rix.R, add denseMultiply properties
…-matrix addition setMethod location
|
Hey @bnprks, Thanks for the feedback and suggestions. I've implemented your suggested changes. I just want to check that my branch-minimized implementation aligns with your comments about that section (and for my own understanding):
My understanding is that I shouldn't be trying to make the whole check branchfree, but reduce unpredictable branches in the inner loop. This meant that I could keep the inner loop itself branched as its still highly predictable (one exit branch per
Claude gave me a good tip with the index checks and addition (
I popped the refill outside the inner-loop of the merge case. Is it worth it to move the refills out of all three cases (drain + merge) at the end of the outer loop? It would make the code cleaner, but my understanding is that both refill-checks would be conducted every outer loop, instead of only one side for the drain cases (Likely negligible in practice though). |
|
I followed your instructions to generate the assembly for I can't say I know how to read assembly well, but my understanding is that we are trying to minimize the |
bnprks
left a comment
There was a problem hiding this comment.
Nice job on the adjustments, just two small one-line suggestions on my end.
My understanding is that I shouldn't be trying to make the whole check branchfree, but reduce unpredictable branches in the inner loop. This meant that I could keep the inner loop itself branched as its still highly predictable (one exit branch per max_load_size iterations).
Yep, exactly right
Is it worth it to move the refills out of all three cases (drain + merge) at the end of the outer loop?
Because it's out of the hot loop I think it probably doesn't matter and you can do whatever makes the code easiest to read/think about in your judgement
Lines 143-150 appear to show a successful branchfree implementation of the changes, but Line 151 seems to show that the output value ternary still compiled branched. Please let me know if I am looking at this correctly, and what next steps might be.
I agree with your analysis that we only hit partially branch free. I checked a few tricks on my end and also wasn't successful getting the compiler to spit out branch-free code for out_val. (Tricks were 1. using clang instead of gcc 2. Trying out_val = l_pick * lv_v + r_pick * rv_v). It seems my intuition here was a bit off for what would produce branch-free code
I'd recommend taking the partial success and moving on, as further wrestling with the compiler would require (1) running actual timing tests/profiling to get ground-truth performance data (2) trying some more variants of how to write the inner loop (maybe first merge values into the output buffer without adding, then do the addition/zero dropping in a second loop? Maybe there are other slight tweaks possible?). You can develop intuitions over time for what might work and what might not, but at the end of the day there's always an element of guess and check.
I don't think that this matrix addition is likely to be a common bottleneck, hence why I think it might be easiest to just move on. If you're having fun with this as a chance to learn more about inspecting assembly/profiling code then go for it, otherwise I think this code is basically ready to merge
Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>
Hi @bnprks, here are the matrix addition changes we discussed.
Resolves outstanding to-do in NEWS.md and Issue #296.
Description
Prior to this change, combining two matrices required intermediate materialization in memory with
dgCMatrix, resulting in high memory overhead for very large matrices. This pull request adds a lazy element-wise addition of twoIterableMatrixobjects through the existing+operator.MatrixAdditionclass that iterates column-by-column and sums via a dense accumulator mirroringSparseMultiply.colSums/rowSums/vecMultiplyRight/vecMultiplyLeftvia distributive/linearity properties to bypass accumulationMatrixMultiplysubsetting andrbind2/cbind2merge_dimnames()conventions.Behavior
double, equal types preserved.transpose_storage_order())Tests
MatrixMath.Addtest to align with existing arithmetic-loader conventionsScope
mat1 + (mat2 * -1). To discuss if subtraction should be implemented as a separate function.