[r][cpp] add Bernoulli Sampling by erliuu · Pull Request #339 · bnprks/BPCells

erliuu · 2026-06-13T01:05:22Z

Been sitting on this for a while, but thought I would finally PR a useful feature I've been using downstream locally with IterableMatrices (particularly for doublet simulation).

Description

This PR adds a Bernoulli sampling function for an IterableMatrix to keep each non-zero entry independently based on provided probability prob. So for prob=0.3, we expect to keep approximately 30% of non-zeros. This should require no materialization or additional allocation.

Adds a BernoulliSample class that performs the streaming keep/drop filter based off of the FilterZeros transform.

Details

The sampling is designed to be deterministic. Keep/drop choice is a splitmix64 hash of seed, column, and row.
As the column index is taken after subset, by design, selecting the same input column multiple times will still give independent sampling not identical copies. (e.g. for doublet simulation, when MatrixSubset selects the same column more than once).
Works for uint32_t, float, and double matrices

Tests

Add functional tests in test-matrix_transforms.R: prob=1 no-op, keep percentage within tolerance, reproducibility, independent draws on duplicate subset columns, matrix-type invariance behaviour.

Happy to clarify any questions you might have, thanks!

…tests

bnprks · 2026-06-15T16:07:07Z

Hi @erliuu, I haven't had a chance to do a close read over the weekend, but from initial impressions this looks pretty good. I have two high level thoughts from an initial look:

It doesn't have to be this PR, but it might be advantageous to make a BernoulliSampleTranspose class that would allow identical sampling of a matrix after transposing the storage order (provided the same seed is used). Storage order transpose logically keeps the matrix unchanged, so users might reasonably want to be able to sample from it regardless of storage order.
For the choice of RNG, fyi I had previously considered whether we should use https://www.pcg-random.org/, which is supposed to be pretty strong+fast and has a header-only C++ library. I think splitmix64 probably can also get the job done and looks simpler + probably a small amount faster. I'll just need to find a paper or something to double-check it will provide good randomness for our use-case and we won't have to worry about weird correlations popping up. (If you have something you've found that made you choose this generator, that could be helpful to share)

erliuu · 2026-06-15T22:11:40Z

Hi @bnprks, thanks for the feedback.

That's actually a great point, I might just follow up directly in this PR. What are your thoughts on using the @transpose flag instead of a separate class. The load(), threshold, and seed logic is all the same, so if we can modify the order which the row/col feed into the hash based on the flag, this should theoretically make it storage-order invariant.
I was looking for something small, fast and also deterministic/reproducible for downstream usages, and splitmix64 came up. My brief reading of PCG seems to indicate its not stateless and might not be as trivial to add storage-order in-variance due to the sequential streaming. That said, it appears to be more robust statistically, so I'm happy to look into if you prefer.

bnprks

Hi @erliuu,

At the R level, I agree using @transpose flag is the way to go. I was more thinking at the C++ level having a second class might be convenient, but I think any option is fine that keeps the boolean check on transpose out of the hot for loop in load(). Happy to have it in this PR if it'll be fast to add in
With some further reading, I think splitmix64 seems to be fairly well regarded so no need to switch to pcg. In particular, splitmix64 seems to fare well in a benchmark by Daniel Lemire who I trust: https://github.com/lemire/testingRNG. I think the size of the state is likely to be similar between pcg and splitmix64, but it seems splitmix64 is a bit faster and as a bonus has a very simple implementation.

After taking a closer look at splitmix64, I'm not quite sure that it's being called correctly:

In this PR's splitmix64, the function is fully stateless, but in Lemire's implementation and the Java stdlib, there is a uint64_t as state. Also, the reference implementations don't do any ^ ops on the state/arguments like happens in this PR. I think that our splitmix64 is therefore slightly wrong (or at least nonstandard), and since getting a good RNG can be subtle, I think it's best to try to match how trustworthy sources do it.

Because the state manipulation is quite simple (adding a constant), I wonder if we can use something like the splitmix64_stateless(seed, offset) from the Lemire code, and set offset = col * nrow + row or offset = row * ncol + col depending on the transpose state. (It need not be this exact scheme, just something that maps each matrix entry to a unique element from the random stream)
One possible naming style suggestion: What if the R function was called sample_bernoulli() instead of bernoulli_sample()? Then it leaves room for future sample_poisson or sample_binomial functions, following the prefix style of our knn_*, plot_*, cluster_* functions. I know it reads a bit less naturally, though I've always liked how it makes alphabetical lists group nicely and easier discoverability with tab completion after typing, e.g. cluster_<tab> (optional; curious for your thoughts)

…torage order invariance, rename to sample_bernoulli

erliuu · 2026-06-24T07:30:46Z

Hi @bnprks,

Sorry for late response, hectic week so only got to this now. Thanks for the feedback!

At the R level, I agree using @transpose flag is the way to go. I was more thinking at the C++ level having a second class might be convenient, but I think any option is fine that keeps the boolean check on transpose out of the hot for loop in load(). Happy to have it in this PR if it'll be fast to add in

I was just thinking of passing it down as a bool Transpose template parameter directly in BernoulliSample instead of a second class and out of the for loop in load(). I thought the second class would just duplicate a lot of the code, but let me know if you think its actually better implementation/runtime-wise than what I have now.

In this PR's splitmix64, the function is fully stateless, but in Lemire's implementation and the Java stdlib, there is a uint64_t as state. Also, the reference implementations don't do any ^ ops on the state/arguments like happens in this PR. I think that our splitmix64 is therefore slightly wrong (or at least nonstandard), and since getting a good RNG can be subtle, I think it's best to try to match how trustworthy sources do it.

Because the state manipulation is quite simple (adding a constant), I wonder if we can use something like the splitmix64_stateless(seed, offset) from the Lemire code, and set offset = col * nrow + row or offset = row * ncol + col depending on the transpose state. (It need not be this exact scheme, just something that maps each matrix entry to a unique element from the random stream)

Great points. I've swapped the splitmix64 to match Lemire's implementation with the stateless offset as suggested. I decided to go with your suggestion on the offset formula of ncol/nrow and col/row.

One possible naming style suggestion: What if the R function was called sample_bernoulli() instead of bernoulli_sample()? Then it leaves room for future sample_poisson or sample_binomial functions, following the prefix style of our knn_, plot_, cluster_* functions. I know it reads a bit less naturally, though I've always liked how it makes alphabetical lists group nicely and easier discoverability with tab completion after typing, e.g. cluster_ (optional; curious for your thoughts)

I totally agree, looks better and leaves room for future features. I've made the swap.

This seems like a slightly weird subtest, as my recollection is that our subset op explicitly throws an error if you try to duplicate columns/rows using it.

I was keeping that there for a downstream case in MarchR doublet simulation, but you're right its not really needed here. Removed.

As per overall comments, might want to add a reproducibility check when using transpose_storage_order()

Thanks, added the check!

[r][cpp] Add functionality for IterableMatrix Bernoulli sampling and …

68a5773

…tests

erliuu requested a review from bnprks June 13, 2026 01:05

bnprks reviewed Jun 17, 2026

View reviewed changes

Comment thread r/tests/testthat/test-matrix_transforms.R Outdated

Comment thread r/tests/testthat/test-matrix_transforms.R

[r][cpp] Match splitmix64 to standard implementation, add transpose s…

9e6b6fe

…torage order invariance, rename to sample_bernoulli

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[r][cpp] add Bernoulli Sampling#339

[r][cpp] add Bernoulli Sampling#339
erliuu wants to merge 2 commits into
mainfrom
el/bernoulli-sampling

erliuu commented Jun 13, 2026

Uh oh!

bnprks commented Jun 15, 2026

Uh oh!

erliuu commented Jun 15, 2026

Uh oh!

bnprks left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

erliuu commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erliuu commented Jun 13, 2026

Description

Details

Tests

Uh oh!

bnprks commented Jun 15, 2026

Uh oh!

erliuu commented Jun 15, 2026

Uh oh!

bnprks left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

erliuu commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bnprks left a comment •

edited

Loading