Skip to content

[r][cpp] add Bernoulli Sampling#339

Open
erliuu wants to merge 2 commits into
mainfrom
el/bernoulli-sampling
Open

[r][cpp] add Bernoulli Sampling#339
erliuu wants to merge 2 commits into
mainfrom
el/bernoulli-sampling

Conversation

@erliuu

@erliuu erliuu commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Hi @bnprks,

Been sitting on this for a while, but thought I would finally PR a useful feature I've been using downstream locally with IterableMatrices (particularly for doublet simulation).

Description

This PR adds a Bernoulli sampling function for an IterableMatrix to keep each non-zero entry independently based on provided probability prob. So for prob=0.3, we expect to keep approximately 30% of non-zeros. This should require no materialization or additional allocation.

  • Adds a BernoulliSample class that performs the streaming keep/drop filter based off of the FilterZeros transform.

Details

  • The sampling is designed to be deterministic. Keep/drop choice is a splitmix64 hash of seed, column, and row.
  • As the column index is taken after subset, by design, selecting the same input column multiple times will still give independent sampling not identical copies. (e.g. for doublet simulation, when MatrixSubset selects the same column more than once).
  • Works for uint32_t, float, and double matrices

Tests

  • Add functional tests in test-matrix_transforms.R: prob=1 no-op, keep percentage within tolerance, reproducibility, independent draws on duplicate subset columns, matrix-type invariance behaviour.

Happy to clarify any questions you might have, thanks!

@erliuu erliuu requested a review from bnprks June 13, 2026 01:05
@bnprks

bnprks commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Hi @erliuu, I haven't had a chance to do a close read over the weekend, but from initial impressions this looks pretty good. I have two high level thoughts from an initial look:

  1. It doesn't have to be this PR, but it might be advantageous to make a BernoulliSampleTranspose class that would allow identical sampling of a matrix after transposing the storage order (provided the same seed is used). Storage order transpose logically keeps the matrix unchanged, so users might reasonably want to be able to sample from it regardless of storage order.
  2. For the choice of RNG, fyi I had previously considered whether we should use https://www.pcg-random.org/, which is supposed to be pretty strong+fast and has a header-only C++ library. I think splitmix64 probably can also get the job done and looks simpler + probably a small amount faster. I'll just need to find a paper or something to double-check it will provide good randomness for our use-case and we won't have to worry about weird correlations popping up. (If you have something you've found that made you choose this generator, that could be helpful to share)

@erliuu

erliuu commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator Author

Hi @bnprks, thanks for the feedback.

  1. That's actually a great point, I might just follow up directly in this PR. What are your thoughts on using the @transpose flag instead of a separate class. The load(), threshold, and seed logic is all the same, so if we can modify the order which the row/col feed into the hash based on the flag, this should theoretically make it storage-order invariant.
  2. I was looking for something small, fast and also deterministic/reproducible for downstream usages, and splitmix64 came up. My brief reading of PCG seems to indicate its not stateless and might not be as trivial to add storage-order in-variance due to the sequential streaming. That said, it appears to be more robust statistically, so I'm happy to look into if you prefer.

@bnprks bnprks left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @erliuu,

  1. At the R level, I agree using @transpose flag is the way to go. I was more thinking at the C++ level having a second class might be convenient, but I think any option is fine that keeps the boolean check on transpose out of the hot for loop in load(). Happy to have it in this PR if it'll be fast to add in
  2. With some further reading, I think splitmix64 seems to be fairly well regarded so no need to switch to pcg. In particular, splitmix64 seems to fare well in a benchmark by Daniel Lemire who I trust: https://github.com/lemire/testingRNG. I think the size of the state is likely to be similar between pcg and splitmix64, but it seems splitmix64 is a bit faster and as a bonus has a very simple implementation.

After taking a closer look at splitmix64, I'm not quite sure that it's being called correctly:

  1. In this PR's splitmix64, the function is fully stateless, but in Lemire's implementation and the Java stdlib, there is a uint64_t as state. Also, the reference implementations don't do any ^ ops on the state/arguments like happens in this PR. I think that our splitmix64 is therefore slightly wrong (or at least nonstandard), and since getting a good RNG can be subtle, I think it's best to try to match how trustworthy sources do it.

    Because the state manipulation is quite simple (adding a constant), I wonder if we can use something like the splitmix64_stateless(seed, offset) from the Lemire code, and set offset = col * nrow + row or offset = row * ncol + col depending on the transpose state. (It need not be this exact scheme, just something that maps each matrix entry to a unique element from the random stream)

  2. One possible naming style suggestion: What if the R function was called sample_bernoulli() instead of bernoulli_sample()? Then it leaves room for future sample_poisson or sample_binomial functions, following the prefix style of our knn_*, plot_*, cluster_* functions. I know it reads a bit less naturally, though I've always liked how it makes alphabetical lists group nicely and easier discoverability with tab completion after typing, e.g. cluster_<tab> (optional; curious for your thoughts)

Comment thread r/tests/testthat/test-matrix_transforms.R Outdated
Comment thread r/tests/testthat/test-matrix_transforms.R
…torage order invariance, rename to sample_bernoulli
@erliuu

erliuu commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator Author

Hi @bnprks,

Sorry for late response, hectic week so only got to this now. Thanks for the feedback!

At the R level, I agree using @transpose flag is the way to go. I was more thinking at the C++ level having a second class might be convenient, but I think any option is fine that keeps the boolean check on transpose out of the hot for loop in load(). Happy to have it in this PR if it'll be fast to add in

I was just thinking of passing it down as a bool Transpose template parameter directly in BernoulliSample instead of a second class and out of the for loop in load(). I thought the second class would just duplicate a lot of the code, but let me know if you think its actually better implementation/runtime-wise than what I have now.

In this PR's splitmix64, the function is fully stateless, but in Lemire's implementation and the Java stdlib, there is a uint64_t as state. Also, the reference implementations don't do any ^ ops on the state/arguments like happens in this PR. I think that our splitmix64 is therefore slightly wrong (or at least nonstandard), and since getting a good RNG can be subtle, I think it's best to try to match how trustworthy sources do it.

Because the state manipulation is quite simple (adding a constant), I wonder if we can use something like the splitmix64_stateless(seed, offset) from the Lemire code, and set offset = col * nrow + row or offset = row * ncol + col depending on the transpose state. (It need not be this exact scheme, just something that maps each matrix entry to a unique element from the random stream)

Great points. I've swapped the splitmix64 to match Lemire's implementation with the stateless offset as suggested. I decided to go with your suggestion on the offset formula of ncol/nrow and col/row.

One possible naming style suggestion: What if the R function was called sample_bernoulli() instead of bernoulli_sample()? Then it leaves room for future sample_poisson or sample_binomial functions, following the prefix style of our knn_, plot_, cluster_* functions. I know it reads a bit less naturally, though I've always liked how it makes alphabetical lists group nicely and easier discoverability with tab completion after typing, e.g. cluster_ (optional; curious for your thoughts)

I totally agree, looks better and leaves room for future features. I've made the swap.

This seems like a slightly weird subtest, as my recollection is that our subset op explicitly throws an error if you try to duplicate columns/rows using it.

I was keeping that there for a downstream case in MarchR doublet simulation, but you're right its not really needed here. Removed.

As per overall comments, might want to add a reproducibility check when using transpose_storage_order()

Thanks, added the check!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants