[r][cpp] add Bernoulli Sampling#339
Conversation
|
Hi @erliuu, I haven't had a chance to do a close read over the weekend, but from initial impressions this looks pretty good. I have two high level thoughts from an initial look:
|
|
Hi @bnprks, thanks for the feedback.
|
There was a problem hiding this comment.
Hi @erliuu,
- At the R level, I agree using
@transposeflag is the way to go. I was more thinking at the C++ level having a second class might be convenient, but I think any option is fine that keeps the boolean check on transpose out of the hotforloop inload(). Happy to have it in this PR if it'll be fast to add in - With some further reading, I think splitmix64 seems to be fairly well regarded so no need to switch to pcg. In particular, splitmix64 seems to fare well in a benchmark by Daniel Lemire who I trust: https://github.com/lemire/testingRNG. I think the size of the state is likely to be similar between pcg and splitmix64, but it seems splitmix64 is a bit faster and as a bonus has a very simple implementation.
After taking a closer look at splitmix64, I'm not quite sure that it's being called correctly:
-
In this PR's
splitmix64, the function is fully stateless, but in Lemire's implementation and the Java stdlib, there is auint64_tas state. Also, the reference implementations don't do any^ops on the state/arguments like happens in this PR. I think that oursplitmix64is therefore slightly wrong (or at least nonstandard), and since getting a good RNG can be subtle, I think it's best to try to match how trustworthy sources do it.Because the state manipulation is quite simple (adding a constant), I wonder if we can use something like the
splitmix64_stateless(seed, offset)from the Lemire code, and setoffset = col * nrow + roworoffset = row * ncol + coldepending on the transpose state. (It need not be this exact scheme, just something that maps each matrix entry to a unique element from the random stream) -
One possible naming style suggestion: What if the R function was called
sample_bernoulli()instead ofbernoulli_sample()? Then it leaves room for futuresample_poissonorsample_binomialfunctions, following the prefix style of ourknn_*,plot_*,cluster_*functions. I know it reads a bit less naturally, though I've always liked how it makes alphabetical lists group nicely and easier discoverability with tab completion after typing, e.g.cluster_<tab>(optional; curious for your thoughts)
…torage order invariance, rename to sample_bernoulli
|
Hi @bnprks, Sorry for late response, hectic week so only got to this now. Thanks for the feedback!
I was just thinking of passing it down as a
Great points. I've swapped the splitmix64 to match Lemire's implementation with the stateless offset as suggested. I decided to go with your suggestion on the
I totally agree, looks better and leaves room for future features. I've made the swap.
I was keeping that there for a downstream case in MarchR doublet simulation, but you're right its not really needed here. Removed.
Thanks, added the check! |
Hi @bnprks,
Been sitting on this for a while, but thought I would finally PR a useful feature I've been using downstream locally with IterableMatrices (particularly for doublet simulation).
Description
This PR adds a Bernoulli sampling function for an
IterableMatrixto keep each non-zero entry independently based on provided probabilityprob. So forprob=0.3, we expect to keep approximately 30% of non-zeros. This should require no materialization or additional allocation.BernoulliSampleclass that performs the streaming keep/drop filter based off of theFilterZerostransform.Details
splitmix64hash of seed, column, and row.MatrixSubsetselects the same column more than once).uint32_t,float, anddoublematricesTests
test-matrix_transforms.R:prob=1no-op, keep percentage within tolerance, reproducibility, independent draws on duplicate subset columns, matrix-type invariance behaviour.Happy to clarify any questions you might have, thanks!