Add bloom filter folding to automatically size SBBF filters#9628

Open
adriangb wants to merge 27 commits into apache:main from pydantic:bloom-filter-folding
Conversation

adriangb (Contributor) commented Mar 30, 2026

Summary

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

Changes

  • BloomFilterProperties.ndv changed from u64 to Option<u64> — when None (new default), the filter is sized based on max_row_group_row_count; when Some(n), the explicit NDV is used
  • DEFAULT_BLOOM_FILTER_NDV redefined to DEFAULT_MAX_ROW_GROUP_ROW_COUNT as u64 (was hardcoded 1_000_000)
  • Added Sbbf::fold_to_target_fpp() and supporting methods (num_folds_for_target_fpp, fold_n, num_blocks) with comprehensive documentation
  • flush_bloom_filter() in both ColumnValueEncoderImpl and ByteArrayEncoder now folds the filter before returning it
  • New create_bloom_filter() helper in encoder.rs centralizes bloom filter construction logic

How folding works

The SBBF fold operation merges adjacent block pairs (block[2i] | block[2i+1]) via bitwise OR, halving the filter size. This differs from standard Bloom filter folding (which merges halves at distance m/2) because SBBF uses multiplicative hashing for block selection:

block_index = ((hash >> 32) * num_blocks) >> 32

When num_blocks is halved, the new index becomes floor(original_index / 2), so adjacent blocks map to the same position.
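This adjacency property is easy to check with a small self-contained sketch (`block_index` here is an illustrative standalone function, not the crate's internal code):

```rust
/// Illustrative sketch of SBBF block selection via multiplicative hashing.
fn block_index(hash: u64, num_blocks: u64) -> u64 {
    ((hash >> 32) * num_blocks) >> 32
}

fn main() {
    let num_blocks = 64u64;
    // For any hash, halving num_blocks maps block i to floor(i / 2),
    // which is why OR-ing adjacent block pairs preserves membership.
    for i in 0..1_000u64 {
        let hash = i.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        let before = block_index(hash, num_blocks);
        let after = block_index(hash, num_blocks / 2);
        assert_eq!(after, before / 2);
    }
    println!("fold index property holds");
}
```

This also shows why SBBF must fold adjacent pairs rather than halves: the multiplicative scheme computes `floor(x * n / 2^32)`, so halving `n` is exactly an integer division of the old index by 2.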

The number of safe folds is determined analytically from the average per-block fill rate: after k folds, expected fill is 1 - (1-f)^(2^k), giving FPP = fill^8. This requires only a single popcount scan over the blocks (no scratch allocation), then O(log N) floating-point ops to find the optimal fold count. The actual fold is then performed in a single pass.
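The estimate can be sketched as a minimal standalone model of the math above (`predicted_fpp` and `num_folds_for_target` are illustrative names, not the crate's API):

```rust
/// Predicted FPP after `folds` folds, given the average per-block fill rate.
fn predicted_fpp(fill: f64, folds: u32) -> f64 {
    // Folding k times ORs 2^k original blocks together, so the expected
    // fraction of set bits becomes 1 - (1 - f)^(2^k).
    let merged_fill = 1.0 - (1.0 - fill).powi(1 << folds);
    // An SBBF lookup tests 8 bits (one per 32-bit word in the block).
    merged_fill.powi(8)
}

/// Largest number of folds whose predicted FPP stays within the target.
/// Returns 0 (no folding) if even one fold would overshoot.
fn num_folds_for_target(fill: f64, num_blocks: usize, target_fpp: f64) -> u32 {
    let max_folds = num_blocks.trailing_zeros(); // block count is a power of two
    (0..=max_folds)
        .take_while(|&k| predicted_fpp(fill, k) <= target_fpp)
        .last()
        .unwrap_or(0)
}

fn main() {
    // A filter at 1% average fill can fold 6 times before a 1% FPP
    // target would be exceeded.
    println!("safe folds: {}", num_folds_for_target(0.01, 1 << 16, 0.01));
}
```

Only the fill rate requires touching the data (one popcount pass); the loop over candidate fold counts is pure floating-point arithmetic.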

Benchmarks

Filter sized for 1M NDV, varying actual distinct values inserted. Measured on Apple M3 Pro.

Fold overhead (fold_to_target_fpp only):

| Actual NDV | Time    | Throughput   |
|-----------:|--------:|-------------:|
| 1,000      | 37.8 µs | 866 Melem/s  |
| 10,000     | 31.9 µs | 1.03 Gelem/s |
| 100,000    | 32.2 µs | 1.02 Gelem/s |

End-to-end (insert + fold) vs insert-only:

| Actual NDV | Insert only | Insert + fold | Fold overhead |
|-----------:|------------:|--------------:|--------------:|
| 1,000      | 14.7 µs     | 49.2 µs       | 34.5 µs (70%) |
| 10,000     | 30.2 µs     | 56.0 µs       | 25.8 µs (46%) |
| 100,000    | 156.3 µs    | 183.5 µs      | 27.2 µs (15%) |

The fold cost is dominated by the popcount scan over the initial (large) filter. For the common case (100K values into a 1M-NDV filter), folding adds only ~15% overhead to the total insert+fold time.

References

Sailhan & Stehr, "Folding and Unfolding Bloom Filters", IEEE iThings 2012.

Liang, "Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore"

Breaking changes

  • BloomFilterProperties.ndv: u64 → Option<u64> (direct struct construction must be updated)

Test plan

  • All existing bloom filter unit tests pass
  • All existing integration tests (sync + async reader roundtrips) pass
  • New unit tests: fold correctness, no false negatives after folding, FPP target respected, minimum size guard
  • New unit tests: folded filter is bit-identical to a fresh filter of the same size (proves correctness via two lemmas about SBBF hashing)
  • New unit tests: multi-step folding, folded FPP matches fresh FPP empirically, fold size matches optimal fixed-size filter
  • New integration test: i32_column_bloom_filter_fixed_ndv — roundtrip with both overestimated and underestimated NDV
  • Full cargo test -p parquet passes

🤖 Generated with Claude Code

Instead of requiring users to guess NDV (number of distinct values) upfront,
bloom filters now support a folding mode: allocate a conservatively large
filter (sized for worst-case NDV = max row group rows), insert all values
during writing, then fold down at flush time to meet a target FPP.

When NDV is not explicitly set (the new default), folding mode activates
automatically. Setting NDV explicitly preserves the existing fixed-size
behavior for backward compatibility.

Key changes:
- BloomFilterProperties.ndv is now Option<u64> (None = folding mode)
- Added BloomFilterProperties.max_bytes for explicit initial size control
- Default FPP changed from 0.05 to 0.01
- Added Sbbf::fold_to_target_fpp() which merges adjacent block pairs
- Both ColumnValueEncoderImpl and ByteArrayEncoder fold at flush time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot added the parquet (Changes to the parquet crate) label Mar 30, 2026
adriangb and others added 5 commits March 30, 2026 06:41
- Revert DEFAULT_BLOOM_FILTER_FPP back to 0.05 (no behavior change)
- Add comprehensive docstrings on Sbbf, fold_once, estimated_fpp_after_fold,
  and fold_to_target_fpp explaining the mathematical basis, SBBF adaptation
  (adjacent pairs vs halves), FPP estimation, and correctness guarantees
- Add citation to Sailhan & Stehr "Folding and Unfolding Bloom Filters"
  (IEEE iThings 2012, doi:10.1109/GreenCom.2012.16)
- Keep module-level docs short, pointing to struct/method docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
etseidl commented Mar 30, 2026

Hey @adriangb, cool idea. What motivated this if you don't mind me asking? Are any other Parquet implementations doing this?

adriangb commented Mar 30, 2026

@viirya or @jimexist since you've worked on our bloom filters before, any interest in reviewing?

adriangb (Author) replied:

> Hey @adriangb, cool idea. What motivated this if you don't mind me asking? Are any other Parquet implementations doing this?

My motivation was that, looking at our data, this is a consistent problem: we have high-cardinality data (trace ids) that, when packed into 1M-row row groups, saturates the bloom filters (making them useless) but also wastes a ton of space in small files. In looking for a solution I came across this neat trick.

I don't know if other Parquet implementations use this, but TimescaleDB does (linked above).

Co-authored-by: emkornfield <emkornfield@gmail.com>
```rust
assert!(
    len >= 2,
    "Cannot fold a bloom filter with fewer than 2 blocks"
);
```
Member:

`assert!(len % 2 == 0)`?

I think fold_once can only work if len is not odd.

emkornfield (Contributor) commented Mar 30, 2026:

I think it should work fine with odd values, as long as we are sure that the last value doesn't do an out-of-bounds index (i.e. the last block is not modified in the odd case). But I think we probably truncate too much for odd values.

```rust
    let block_fill = set_bits as f64 / 256.0;
    total_fpp += block_fill.powi(8);
}
total_fpp / half as f64
```
Contributor:

why is cast needed here, can this be avoided by setting the type explicitly on total_fpp?


```rust
///
/// ## Why adjacent pairs (not halves)?
///
/// Standard Bloom filter folding merges the two halves (`B[i] | B[i + m/2]`) because
```
Contributor:

nit: as an explanation it might pay to reverse this; I'm not sure readers would commonly be aware of bloom filter folding. It might be better to explain the halves approach first and then indicate why this differs from the linked paper.

```rust
    .bloom_filter_properties(descr.path())
    .map(|props| Sbbf::new_with_ndv_fpp(props.ndv, props.fpp))
    .transpose()?;
let (bloom_filter, bloom_filter_target_fpp) =
```
Member:

Is the bloom filter creation logic the same as encoder.rs? Maybe we can extract fn create_bloom_filter?

viirya (Member) left a comment:

Do we have e2e tests that cover this folding mode behavior already?

adriangb (Author) replied:

> Do we have e2e tests that cover this folding mode behavior already?

I can add them. Where would you recommend? I'm not all that familiar with the test structure here.

adriangb and others added 2 commits March 30, 2026 17:31
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya commented Mar 31, 2026

> > Do we have e2e tests that cover this folding mode behavior already?
>
> I can add them. Where would you recommend? I'm not all that familiar with the test structure here.

Okay. Actually, after this PR the existing integration roundtrip tests will cover the folding path automatically because they don't set NDV. So the old fixed-size mode will no longer be covered by these roundtrip tests. Seems we should add roundtrip tests for fixed-size mode.

Arrow writer roundtrip tests are in parquet/src/arrow/arrow_writer/mod.rs, like i32_column_bloom_filter, i32_column_bloom_filter_at_end, etc.

Arrow reader roundtrip tests, like test_get_row_group_column_bloom_filter_with_length in parquet/tests/arrow_reader/bloom_filter/sync.rs, only call set_bloom_filter_enabled(true).

adriangb (Author) replied:

I added a test for the legacy path. Should we deprecate it? I think the intent is better captured by the new path. One may want to create exact size bloom filters, but I don't think setting the NDV and FPP is the right way to do that (a setting for specifying the size directly would be better).


viirya commented Mar 31, 2026

> I added a test for the legacy path. Should we deprecate it? I think the intent is better captured by the new path. One may want to create exact size bloom filters, but I don't think setting the NDV and FPP is the right way to do that (a setting for specifying the size directly would be better).

Yea, I think we can deprecate the old behavior and maybe remove it after few releases.

adriangb (Author) replied:

> Yea, I think we can deprecate the old behavior and maybe remove it after few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?


viirya commented Mar 31, 2026

> > Yea, I think we can deprecate the old behavior and maybe remove it after few releases.
>
> Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

I think we can do it in this PR.

adriangb (Author) replied:

> > > Yea, I think we can deprecate the old behavior and maybe remove it after few releases.
> >
> > Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?
>
> I think we can do it in this PR.

If we want to deprecate the existing NDV I think we're better off re-interpreting it to mean "maximum ndv" or "initial ndv". That way existing users who are setting the ndv also benefit from folding. This means there will be no way to disable folding but I also don't see any reason anyone would want to do that beyond requiring a fixed-size bloom filter (in which case relying on a combination of fpp + ndv giving you a fixed size was probably a bad choice to begin with given I don't think we made any such API promise, and they should open an issue requesting an explicit API for this).

Thus the only changes vs. main now are:

  1. Adding the folding on write.
  2. Default max ndv is derived from the max rows per row group instead of being hardcoded.

@adriangb adriangb requested review from emkornfield and viirya March 31, 2026 20:43
```rust
    (SMALL_SIZE as i32 + 1..SMALL_SIZE as i32 + 10).collect(),
);

// NDV smaller than actual distinct values — tests the underestimate path
```
Member:

The array has only 7 distinct values, so "NDV smaller than actual distinct values" seems incorrect?


alamb commented Apr 1, 2026

Checking this one out

SMALL_SIZE=7 so the array has 7 distinct values; ndv=10 was larger
than actual, not smaller. Use ndv=3 to correctly test the path where
NDV < actual distinct values.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
alamb (Contributor) left a comment:

This is very cool @adriangb

I reviewed this one carefully, not algorithmically but empirically with testing and I have convinced myself that this folding algorithm is correct -- specifically that it produces bit for bit identical filters.

I made a fuzz test here

My two concerns:

  1. Performance (my testing shows this folding operation is very expensive) -- if we want to enable this behavior by default we should probably optimize the calculation and folding
  2. API change -- sadly this looks like a public API change so will have to wait to the next major release

```rust
/// If you do set this value explicitly it is probably best to set it for each column
/// individually via [`WriterPropertiesBuilder::set_column_bloom_filter_ndv`] rather than globally,
/// since different columns may have different numbers of distinct values.
pub ndv: Option<u64>,
```
Contributor:

since this is a public struct, I think this is a breaking API change: https://docs.rs/parquet/58.0.0/parquet/file/properties/struct.BloomFilterProperties.html

```rust
/// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
/// even larger than just storing the whole value. You are also expected to set `ndv` if it can
/// be known in advance to greatly reduce space usage.
/// This value also serves as the target FPP for bloom filter folding: after all values
```
Contributor:

This is a really cool feature -- to make bloom filter creation dynamically adjust the bloom filter size based on the data. I really like that there is a way to go back to the old explicit behavior too -- and solves a problem for using bloom filters in Parquet.

As mentioned below this would be a breaking API change so would have to wait for the next release

Given this is a pub struct with pub fields, I am not sure there is any way to avoid a breaking API change 🤔

One thing that could potentially improve the API would be to make a builder:

```rust
struct BloomFilterPropertiesBuilder {
    ...
}
```

And change the fields of BloomFilterProperties to not be public 🤔

(we could do this as a follow on PR)

adriangb (Author):

I kind of went down this path and then backed out to minimize the diff / change. I would propose we merge this and then refactor into BloomFilterPropertiesBuilder to ease future changes.

Contributor:

Makes sense to me

```rust
let half = len / 2;
for i in 0..half {
    for j in 0..8 {
        self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];
```
Contributor:

FYI this code seems very expensive in profiling (profiler screenshot omitted).

It also consumes a huge amount of time even in release builds (second profiler screenshot omitted).

I wonder if there is some way to make this faster (unchecked access / rust iterators, etc)

adriangb (Author):

I will see what I can do

adriangb (Author):

I've gotten Claude to do some benchmark driven optimization. It should be ~ 35% faster now.

@alamb added the api-change (Changes to the arrow API) and next-major-release (the PR has API changes and is waiting on the next major version) labels Apr 1, 2026
adriangb and others added 5 commits April 1, 2026 10:01
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Rewrite fold_to_target_fpp to combine the fold and FPP estimation into
a single pass using chunks_exact(2) and a fresh Vec, eliminating aliasing
that prevented auto-vectorization of the [u32; 8] OR loop. Fix fold_once
similarly by copying blocks to locals before the inner loop. Remove the
now-inlined estimated_fpp_after_fold. Up to ~20% faster on realistic
filter sizes.

Add a criterion benchmark (parquet/benches/bloom_filter.rs) measuring
fold_to_target_fpp and insert+fold end-to-end for varying NDVs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fold in-place instead of allocating a new Vec per fold level. Uses a
  two-pass approach (read-only FPP estimate, then in-place fold) so
  the existing Vec allocation is reused across all fold iterations.

- Add BitOr/BitOrAssign impls and Block::count_ones() method. Separating
  the OR from the popcount allows the compiler to emit cnt.16b (128-bit
  SIMD popcount) instead of cnt.8b (per-word scalar), and to process
  FPP math for 2 blocks in parallel via fmul.2d/ucvtf.2d.

- Assembly confirms: zero __rust_alloc/__rust_dealloc calls, 4x loop
  unrolling, and full NEON vectorization of the hot path.

Benchmarks vs. pre-optimization baseline (1M NDV filter):
  fold_to_target_fpp/100K values: -31%
  insert_and_fold/1K values:      -31%

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
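The OR-then-popcount separation described in this commit can be sketched in isolation (`Block` here is a hypothetical standalone type mirroring the 256-bit SBBF block, not the crate's code):

```rust
/// Hypothetical standalone model of a 256-bit SBBF block (8 x u32 words).
#[derive(Clone, Copy)]
struct Block([u32; 8]);

impl std::ops::BitOr for Block {
    type Output = Block;
    // OR the two blocks word by word; a simple fixed-length loop like
    // this is easy for the compiler to vectorize.
    fn bitor(self, rhs: Block) -> Block {
        let mut out = [0u32; 8];
        for j in 0..8 {
            out[j] = self.0[j] | rhs.0[j];
        }
        Block(out)
    }
}

impl Block {
    // Popcount the whole block in a separate pass, leaving the OR loop
    // free of data dependencies on the running count.
    fn count_ones(&self) -> u32 {
        self.0.iter().map(|w| w.count_ones()).sum()
    }
}

fn main() {
    let a = Block([0xF0F0_F0F0; 8]);
    let b = Block([0x0F0F_0F0F; 8]);
    let merged = a | b;
    // Complementary nibble patterns fill the whole 256-bit block.
    assert_eq!(merged.count_ones(), 256);
    println!("merged fill: {}", merged.count_ones() as f64 / 256.0);
}
```

Keeping the OR and the popcount as separate loops is what the commit credits for the vectorized-popcount codegen, since each loop then has a single, regular access pattern.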
```rust
    "Cannot fold a bloom filter with fewer than 2 blocks"
);
assert!(
    len % 2 == 0,
```
Contributor:

It isn't clear to me that this will be the general case (I would need to reread the surrounding code and formulation), but are we sure this could never be hit in practice? It doesn't seem expensive to handle the odd case.

adriangb (Author):

Folding is only applied on the write path and we never generate odd block counts, so I think it's worth the defensive assertion instead of handling via a dead code path.

Contributor:

Makes sense. If block counts are never odd, I would expect an assert somewhere above this function (e.g. on initialization) that the block count is never odd (and we can comment there that this is required for folding).

Contributor:

The number of bytes used is always a power of 2 and a multiple of block size, so it follows that the number of blocks will also be a power of 2. But it looks like this code is no longer used.

Contributor:

My main point is that it's not obvious from this patch that this is the case. I'm not sure we made use of the power-of-2 sizing before, and if we start doing so, we should ideally document that with an assertion or unit test where we set the initial number of blocks, so it doesn't regress in the future.

```rust
/// Separating the OR from the popcount lets the compiler emit vectorized popcount
/// instructions (e.g., `cnt.16b` on ARM NEON) instead of per-word scalar popcount.
fn estimated_fpp_after_fold(&self) -> f64 {
    let half = self.0.len() as f64 / 2.0;
```
Contributor:

nit: move this to above where it is used.

```rust
fn estimated_fpp_after_fold(&self) -> f64 {
    let half = self.0.len() as f64 / 2.0;
    let mut total_fpp: f64 = 0.0;
    for pair in self.0.chunks_exact(2) {
```
Contributor:

nit: either have a remainder block here, or assert len() is evenly divisible by 2?

```rust
while self.0.len() >= 2 {
    // Pass 1: Read-only FPP estimate. If the fold would exceed the
    // target FPP, stop before mutating the filter.
    if self.estimated_fpp_after_fold() > target_fpp {
```
Contributor:

for benchmarking, how much is the estimated fpp an issue vs the actual folding? Not necessarily for this PR, but it seems like it could be possible to binary search for the target number of folds, and then do all folding in a single pass (one could probably just do all folding in a single pass anyways if we didn't want do the binary search).

…move half

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```rust
    "Cannot fold a bloom filter with an odd number of blocks"
);
let half = len / 2;
for i in 0..half {
```
Contributor:

For the estimate function we use chunks_exact; could we maybe converge on a single approach? (I wonder if the actual iteration should be factored out to a higher-level function that takes a functor to process the individual blocks.)

```diff
 // where k = number of hash functions, m = number of bits, n = number of distinct values
 #[inline]
-fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
+pub(crate) fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
```
Contributor:

nit: looks like this might not be needed? (at least searching through github UI isn't turning up a new call).

adriangb and others added 5 commits April 1, 2026 16:25
Instead of alternating between FPP estimation and folding at each level
(reading data twice per level), determine the number of safe folds upfront
via a tree reduction on a scratch buffer, then fold all levels in one pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ld_n

Remove fold_once and estimated_fpp_after_fold in favor of two composable
private methods that are independently testable. Tests now use fold_n(1)
directly instead of a test-only helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tch buffer

Replace the tree-reduction approach (O(N/2) scratch allocation + simulated
merges) with an analytical estimate based on average per-block fill rate.
After k folds, expected fill is 1-(1-f)^(2^k), giving FPP = fill^8.

This is ~2x faster for the fold-only benchmark: the only data pass is a
vectorizable popcount scan, followed by O(log N) floating-point ops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

  • api-change — Changes to the arrow API
  • next-major-release — the PR has API changes and is waiting on the next major version
  • parquet — Changes to the parquet crate

7 participants