CBO Phase 3: HyperLogLog NDV Estimator#82
Conversation
📝 WalkthroughWalkthroughAdds a header-only HyperLogLog implementation, uses per-column HLL sketches in QueryExecutor::execute_analyze to estimate NDV, adds unit tests for HLL, adjusts an integration test to accept probabilistic NDV, and registers a new CMake test target ChangesHyperLogLog Cardinality Estimation and ANALYZE TABLE
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Adds new hll_tests executable for HyperLogLog unit testing.
Replaces std::unordered_set<std::string>> NDV collection with std::vector<common::HyperLogLog>. Uses FNV-1a hash for text (better upper-bit distribution than djb2) and Value::Hash for numeric types. Linear counting fallback for small cardinalities prevents HLL's extreme overestimation when few registers are used.
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/executor/query_executor.cpp (1)
307-324:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winRoute
ANALYZE TABLEthroughexecute_analyze(...).
execute_analyze(...)is implemented below, but this dispatcher never calls it, soANALYZE TABLEwill still fall into"Unsupported statement type".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/query_executor.cpp` around lines 307 - 324, The dispatcher in execute (the if/else chain handling parser::StmtType cases) is missing a branch for parser::StmtType::Analyze; add an else-if branch that calls execute_analyze with the statement cast to the correct type (e.g., dynamic_cast<const parser::AnalyzeStatement&>(stmt)) and pass the current transaction object (txn) if execute_analyze expects it, mirroring how execute_select/execute_insert/etc. are invoked so ANALYZE TABLE routes to execute_analyze instead of falling through to "Unsupported statement type".
🧹 Nitpick comments (1)
tests/hll_test.cpp (1)
56-165: ⚡ Quick winStrengthen these assertions beyond “non-zero”.
Most of this suite still passes if
cardinality()returns1for every non-empty sketch, so it won't catch broken HLL math. For a deterministic input stream, add a few bounded accuracy checks and verify merged sketches estimate at least as large as each input sketch.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/hll_test.cpp` around lines 56 - 165, Replace the weak "non-zero" assertions with bounded-accuracy checks: for tests that insert N distinct items (e.g., the loops in DistinctValuesProduceCardinality, DistinctValueSetsProduceCardinality, TextValueInsertion, MergeCombinesDistinctSets) assert the returned HyperLogLog::cardinality() lies within a reasonable relative tolerance of N (for example N*0.5 <= card <= N*1.5) or use a configurable relative error constant (e.g., 1.6% * k or a fixed 50% for small N) and assert merge preserves/raises counts (after hll1.merge(hll2) assert hll1.cardinality() >= old_hll1_cardinality and >= old_hll2_cardinality); similarly replace the single zero-check after reset to EXPECT_EQ(cardinality(), 0U) remains but ensure before reset you capture and assert expected range; use HyperLogLog::insert, cardinality(), merge(), reset(), and hash_bytes() as anchors when editing each test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@include/common/hll.hpp`:
- Around line 122-125: merge() silently combines HyperLogLog sketches with
different hash seeds which corrupts results; change merge(const HyperLogLog&
other) noexcept to remove noexcept and add a check that other.seed_ == seed_,
and if not, throw a descriptive std::invalid_argument (or std::runtime_error)
mentioning mismatched seeds; if seeds match, proceed with the existing loop over
kNumRegisters updating registers_.at(i). Ensure you reference and update the
function signature and any callers to handle the exception.
- Around line 83-111: The sparse-case and raw HLL formulas are wrong (they use
log2 and lack alpha scaling); replace the linear counting and raw estimate
calculations so linear counting uses the natural log (std::log) with V =
empty_count and E = -m * std::log(static_cast<double>(empty_count) / m), and
replace raw_estimate with the HLL estimator raw_estimate = alpha_m * m * m / sum
where alpha_m is chosen per m (compute alpha_m using the standard small-m
constants or formula for get_alpha(m)); keep existing bounds handling for
estimate, casting and use kMaxCardinality as before (refer to nonzero_count, m,
empty_count, sum, raw_estimate, estimate, kMaxCardinality).
In `@src/executor/query_executor.cpp`:
- Around line 955-961: The text branch currently hashes only a 64-byte prefix
(see col_info.type checks and val.as_text()) which undercounts NDV; change it to
hash the entire string by passing the full length of val.as_text() to
common::HyperLogLog::hash_bytes (replace the prefix_len logic and use s.size()),
ensuring the computed hash value assigned to variable hash covers the whole text
for ValueType::TYPE_TEXT/TYPE_VARCHAR/TYPE_CHAR cases.
In `@tests/cloudSQL_tests.cpp`:
- Around line 1263-1610: The file contains unresolved git conflict markers
surrounding a large block of new tests (e.g., TEST(ExecutionTests,
AnalyzeTable), AnalyzeTableLargeRows, AnalyzeTableNonExistent,
AnalyzeTableEmpty, AnalyzeTableWithNulls, AnalyzeTableStringStats,
AnalyzeFilterSelectivity, AnalyzeJoinOrder); remove the conflict markers
(<<<<<<<, =======, >>>>>>>) and keep the intended test additions (the ANALYZE
TABLE test cases) as a single coherent block, ensuring only one copy of each
TEST(...) remains and the surrounding includes/namespace usages remain valid so
the file compiles.
---
Outside diff comments:
In `@src/executor/query_executor.cpp`:
- Around line 307-324: The dispatcher in execute (the if/else chain handling
parser::StmtType cases) is missing a branch for parser::StmtType::Analyze; add
an else-if branch that calls execute_analyze with the statement cast to the
correct type (e.g., dynamic_cast<const parser::AnalyzeStatement&>(stmt)) and
pass the current transaction object (txn) if execute_analyze expects it,
mirroring how execute_select/execute_insert/etc. are invoked so ANALYZE TABLE
routes to execute_analyze instead of falling through to "Unsupported statement
type".
---
Nitpick comments:
In `@tests/hll_test.cpp`:
- Around line 56-165: Replace the weak "non-zero" assertions with
bounded-accuracy checks: for tests that insert N distinct items (e.g., the loops
in DistinctValuesProduceCardinality, DistinctValueSetsProduceCardinality,
TextValueInsertion, MergeCombinesDistinctSets) assert the returned
HyperLogLog::cardinality() lies within a reasonable relative tolerance of N (for
example N*0.5 <= card <= N*1.5) or use a configurable relative error constant
(e.g., 1.6% * k or a fixed 50% for small N) and assert merge preserves/raises
counts (after hll1.merge(hll2) assert hll1.cardinality() >= old_hll1_cardinality
and >= old_hll2_cardinality); similarly replace the single zero-check after
reset to EXPECT_EQ(cardinality(), 0U) remains but ensure before reset you
capture and assert expected range; use HyperLogLog::insert, cardinality(),
merge(), reset(), and hash_bytes() as anchors when editing each test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec6e791d-6dfa-461f-bae4-f36dbc2e255b
📒 Files selected for processing (5)
CMakeLists.txtinclude/common/hll.hppsrc/executor/query_executor.cpptests/cloudSQL_tests.cpptests/hll_test.cpp
| // For sparse data (few registers used), use linear counting to avoid | ||
| // HLL's extreme overestimation. When registers are sparse (nonzero < m/20), | ||
| // the HLL raw formula gives wildly incorrect results. | ||
| if (nonzero_count < static_cast<int>(m / 20)) { | ||
| // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction | ||
| // Using simple form without alpha scaling for very sparse data | ||
| double linear_est = -m * std::log2(static_cast<double>(empty_count) / m); | ||
| return static_cast<uint64_t>(std::max(1.0, linear_est)); | ||
| } | ||
|
|
||
| // Standard HLL formula for moderate to large cardinalities | ||
| double raw_estimate = m * std::log2(m / sum); | ||
|
|
||
| // Bias correction for small cardinalities | ||
| // HLL systematically overestimates for small n; apply downward bias | ||
| double bias = 0.0; | ||
| if (raw_estimate <= 2.5 * m) { | ||
| bias = -0.5 * (m / 10.0); | ||
| } | ||
|
|
||
| double estimate = raw_estimate + bias; | ||
|
|
||
| if (estimate < 0) { | ||
| return 0; | ||
| } | ||
| if (estimate > static_cast<double>(kMaxCardinality)) { | ||
| return kMaxCardinality; | ||
| } | ||
| return static_cast<uint64_t>(estimate); |
There was a problem hiding this comment.
Use the actual HLL estimators here.
Both estimates are using log2(...), but standard HLL needs linear counting with the natural log for the sparse case and alpha_m * m * m / sum for the raw estimate. As written, this will store materially biased NDV stats for ANALYZE TABLE.
Suggested fix
- if (nonzero_count < static_cast<int>(m / 20)) {
- // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction
- // Using simple form without alpha scaling for very sparse data
- double linear_est = -m * std::log2(static_cast<double>(empty_count) / m);
- return static_cast<uint64_t>(std::max(1.0, linear_est));
- }
-
- // Standard HLL formula for moderate to large cardinalities
- double raw_estimate = m * std::log2(m / sum);
+ const double alpha_m = 0.7213 / (1.0 + 1.079 / m);
+ double raw_estimate = alpha_m * m * m / sum;
+ if (empty_count > 0 && raw_estimate <= 2.5 * m) {
+ const double linear_est = m * std::log(m / static_cast<double>(empty_count));
+ return static_cast<uint64_t>(std::llround(linear_est));
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // For sparse data (few registers used), use linear counting to avoid | |
| // HLL's extreme overestimation. When registers are sparse (nonzero < m/20), | |
| // the HLL raw formula gives wildly incorrect results. | |
| if (nonzero_count < static_cast<int>(m / 20)) { | |
| // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction | |
| // Using simple form without alpha scaling for very sparse data | |
| double linear_est = -m * std::log2(static_cast<double>(empty_count) / m); | |
| return static_cast<uint64_t>(std::max(1.0, linear_est)); | |
| } | |
| // Standard HLL formula for moderate to large cardinalities | |
| double raw_estimate = m * std::log2(m / sum); | |
| // Bias correction for small cardinalities | |
| // HLL systematically overestimates for small n; apply downward bias | |
| double bias = 0.0; | |
| if (raw_estimate <= 2.5 * m) { | |
| bias = -0.5 * (m / 10.0); | |
| } | |
| double estimate = raw_estimate + bias; | |
| if (estimate < 0) { | |
| return 0; | |
| } | |
| if (estimate > static_cast<double>(kMaxCardinality)) { | |
| return kMaxCardinality; | |
| } | |
| return static_cast<uint64_t>(estimate); | |
| const double alpha_m = 0.7213 / (1.0 + 1.079 / m); | |
| double raw_estimate = alpha_m * m * m / sum; | |
| if (empty_count > 0 && raw_estimate <= 2.5 * m) { | |
| const double linear_est = m * std::log(m / static_cast<double>(empty_count)); | |
| return static_cast<uint64_t>(std::llround(linear_est)); | |
| } | |
| // Bias correction for small cardinalities | |
| // HLL systematically overestimates for small n; apply downward bias | |
| double bias = 0.0; | |
| if (raw_estimate <= 2.5 * m) { | |
| bias = -0.5 * (m / 10.0); | |
| } | |
| double estimate = raw_estimate + bias; | |
| if (estimate < 0) { | |
| return 0; | |
| } | |
| if (estimate > static_cast<double>(kMaxCardinality)) { | |
| return kMaxCardinality; | |
| } | |
| return static_cast<uint64_t>(estimate); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@include/common/hll.hpp` around lines 83 - 111, The sparse-case and raw HLL
formulas are wrong (they use log2 and lack alpha scaling); replace the linear
counting and raw estimate calculations so linear counting uses the natural log
(std::log) with V = empty_count and E = -m *
std::log(static_cast<double>(empty_count) / m), and replace raw_estimate with
the HLL estimator raw_estimate = alpha_m * m * m / sum where alpha_m is chosen
per m (compute alpha_m using the standard small-m constants or formula for
get_alpha(m)); keep existing bounds handling for estimate, casting and use
kMaxCardinality as before (refer to nonzero_count, m, empty_count, sum,
raw_estimate, estimate, kMaxCardinality).
| void merge(const HyperLogLog& other) noexcept { | ||
| for (size_t i = 0; i < kNumRegisters; ++i) { | ||
| registers_.at(i) = std::max(registers_.at(i), other.registers_.at(i)); | ||
| } |
There was a problem hiding this comment.
Block merges across different seeds.
merge() currently combines sketches even when seed_ differs. That silently unions registers from different hash domains, so the resulting cardinality is meaningless.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@include/common/hll.hpp` around lines 122 - 125, merge() silently combines
HyperLogLog sketches with different hash seeds which corrupts results; change
merge(const HyperLogLog& other) noexcept to remove noexcept and add a check that
other.seed_ == seed_, and if not, throw a descriptive std::invalid_argument (or
std::runtime_error) mentioning mismatched seeds; if seeds match, proceed with
the existing loop over kNumRegisters updating registers_.at(i). Ensure you
reference and update the function signature and any callers to handle the
exception.
| if (col_info.type == common::ValueType::TYPE_TEXT || | ||
| col_info.type == common::ValueType::TYPE_VARCHAR || | ||
| col_info.type == common::ValueType::TYPE_CHAR) { | ||
| // Use 64-char prefix for text hashing | ||
| const std::string& s = val.as_text(); | ||
| size_t prefix_len = std::min(s.size(), size_t(64)); | ||
| hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len); |
There was a problem hiding this comment.
Hash the full text value for NDV.
Using only the first 64 characters makes all longer strings with a shared prefix collide into the same HLL stream, which will systematically undercount NDV for common text patterns.
Suggested fix
- // Use 64-char prefix for text hashing
const std::string& s = val.as_text();
- size_t prefix_len = std::min(s.size(), size_t(64));
- hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len);
+ hash = common::HyperLogLog::hash_bytes(s.data(), s.size());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/executor/query_executor.cpp` around lines 955 - 961, The text branch
currently hashes only a 64-byte prefix (see col_info.type checks and
val.as_text()) which undercounts NDV; change it to hash the entire string by
passing the full length of val.as_text() to common::HyperLogLog::hash_bytes
(replace the prefix_len logic and use s.size()), ensuring the computed hash
value assigned to variable hash covers the whole text for
ValueType::TYPE_TEXT/TYPE_VARCHAR/TYPE_CHAR cases.
New files: - include/common/hll.hpp: HLL class with FNV-1a hash, bottom-k index, trailing zeros register values, linear counting for small n - tests/hll_test.cpp: 10 unit tests covering basic operations, hash consistency, reset, merge, and text value insertion
HLL is probabilistic and may estimate 4 instead of 3 for small cardinalities. Use >= instead of exact equality.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/executor/query_executor.cpp (1)
1019-1022:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHash the full text value for NDV.
Using only the first 64 characters will cause all longer strings with a shared prefix to collide into the same HLL stream, systematically undercounting NDV for text columns with common prefixes.
🔧 Proposed fix
- // Use 64-char prefix for text hashing const std::string& s = val.as_text(); - size_t prefix_len = std::min(s.size(), size_t(64)); - hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len); + hash = common::HyperLogLog::hash_bytes(s.data(), s.size());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/query_executor.cpp` around lines 1019 - 1022, The NDV hashing currently only uses the first 64 bytes of the text (variable s and prefix_len) which causes collisions for long strings; change the call to common::HyperLogLog::hash_bytes to hash the entire string length (use s.size() / full length) instead of std::min(..., 64) so that the full text value is passed to hash_bytes (update the variables around s, prefix_len and the hash assignment in query_executor.cpp accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/executor/query_executor.cpp`:
- Around line 1019-1022: The NDV hashing currently only uses the first 64 bytes
of the text (variable s and prefix_len) which causes collisions for long
strings; change the call to common::HyperLogLog::hash_bytes to hash the entire
string length (use s.size() / full length) instead of std::min(..., 64) so that
the full text value is passed to hash_bytes (update the variables around s,
prefix_len and the hash assignment in query_executor.cpp accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 71bc574a-2691-44ae-baa8-a3093707a8d7
📒 Files selected for processing (5)
CMakeLists.txtinclude/common/hll.hppsrc/executor/query_executor.cpptests/cloudSQL_tests.cpptests/hll_test.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- CMakeLists.txt
- include/common/hll.hpp
- Add kLinearCountingThreshold, kBiasCorrectionBoundary, kBiasAdjustmentFactor constants to replace magic numbers in cardinality() formula - Fix missing newline at EOF in hll.hpp and hll_test.cpp - Add AccuracyBoundsDistinct test for cardinality bounds - Add MergeOverlappingSets test for HLL merge behavior - Add SeedReproducibility and DifferentSeedsDiffer tests for seed behavior
Tests INT64, BIGINT, DOUBLE, and TEXT value types using the
same integration path as execute_analyze() — Value::Hash{}
for numeric types and hash_bytes() for text.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/hll_test.cpp (1)
54-63: 💤 Low valueConsider extracting LCG constants for clarity.
The LCG multiplier
6364136223846793005ULLand increment1442695043ULLare reused across multiple tests. Extracting them as named constants would improve readability and maintainability.♻️ Proposed refactor
Add at the top of the anonymous namespace:
namespace { // LCG constants for generating distinct test values constexpr uint64_t kLcgMultiplier = 6364136223846793005ULL; constexpr uint64_t kLcgIncrement = 1442695043ULL; // Then use in tests: val = val * kLcgMultiplier + kLcgIncrement;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/hll_test.cpp` around lines 54 - 63, Extract the LCG magic numbers into named constexprs (e.g., add constexpr uint64_t kLcgMultiplier = 6364136223846793005ULL; and constexpr uint64_t kLcgIncrement = 1442695043ULL; inside the anonymous namespace at the top of the test file) and replace the literal uses in the HyperLogLogTests::DistinctValuesProduceCardinality test (and any other tests using the same LCG pattern) so the loop uses val = val * kLcgMultiplier + kLcgIncrement; to improve readability and reuse.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/hll_test.cpp`:
- Around line 178-200: Rename the test whose TEST macro uses the name
"HyperLogLogTests, MergeOverlappingSets" to reflect the actual behavior (the two
LCG seeds produce disjoint sets); change the TEST identifier to something like
"HyperLogLogTests, MergeDisjointSets" and update any references to this test
name; ensure the test function name in the TEST macro matches the comment and
intent (or alternatively change the data generation to produce overlapping
values if you prefer to keep the original name).
---
Nitpick comments:
In `@tests/hll_test.cpp`:
- Around line 54-63: Extract the LCG magic numbers into named constexprs (e.g.,
add constexpr uint64_t kLcgMultiplier = 6364136223846793005ULL; and constexpr
uint64_t kLcgIncrement = 1442695043ULL; inside the anonymous namespace at the
top of the test file) and replace the literal uses in the
HyperLogLogTests::DistinctValuesProduceCardinality test (and any other tests
using the same LCG pattern) so the loop uses val = val * kLcgMultiplier +
kLcgIncrement; to improve readability and reuse.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1b8e0ee5-88ac-4b2f-8c69-b310bdc9e778
📒 Files selected for processing (2)
include/common/hll.hpptests/hll_test.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- include/common/hll.hpp
| TEST(HyperLogLogTests, MergeOverlappingSets) { | ||
| HyperLogLog hll1; | ||
| HyperLogLog hll2; | ||
| uint64_t val = 123456789ULL; | ||
| for (int i = 0; i < 100; ++i) { | ||
| hll1.insert(val); | ||
| val = val * 6364136223846793005ULL + 1442695043ULL; | ||
| } | ||
| uint64_t val2 = 987654321ULL; | ||
| for (int i = 0; i < 100; ++i) { | ||
| hll2.insert(val2); | ||
| val2 = val2 * 6364136223846793005ULL + 1442695043ULL; | ||
| } | ||
| uint64_t card1 = hll1.cardinality(); | ||
| uint64_t card2 = hll2.cardinality(); | ||
| hll1.merge(hll2); | ||
| uint64_t merged = hll1.cardinality(); | ||
| // Merged cardinality should be >= either individual | ||
| EXPECT_GE(merged, card1); | ||
| EXPECT_GE(merged, card2); | ||
| // Both sets are disjoint with good distribution, merged should be in a reasonable range | ||
| EXPECT_LT(merged, 50000U); // Sanity upper bound | ||
| } |
There was a problem hiding this comment.
Test name does not match implementation.
The test is named MergeOverlappingSets but uses disjoint sets (different LCG seeds: 123456789ULL vs 987654321ULL). The comment on line 198 even states "Both sets are disjoint." Consider renaming to MergeDisjointSets or similar to avoid confusion.
📝 Suggested rename
-TEST(HyperLogLogTests, MergeOverlappingSets) {
+TEST(HyperLogLogTests, MergeDisjointSets) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| TEST(HyperLogLogTests, MergeOverlappingSets) { | |
| HyperLogLog hll1; | |
| HyperLogLog hll2; | |
| uint64_t val = 123456789ULL; | |
| for (int i = 0; i < 100; ++i) { | |
| hll1.insert(val); | |
| val = val * 6364136223846793005ULL + 1442695043ULL; | |
| } | |
| uint64_t val2 = 987654321ULL; | |
| for (int i = 0; i < 100; ++i) { | |
| hll2.insert(val2); | |
| val2 = val2 * 6364136223846793005ULL + 1442695043ULL; | |
| } | |
| uint64_t card1 = hll1.cardinality(); | |
| uint64_t card2 = hll2.cardinality(); | |
| hll1.merge(hll2); | |
| uint64_t merged = hll1.cardinality(); | |
| // Merged cardinality should be >= either individual | |
| EXPECT_GE(merged, card1); | |
| EXPECT_GE(merged, card2); | |
| // Both sets are disjoint with good distribution, merged should be in a reasonable range | |
| EXPECT_LT(merged, 50000U); // Sanity upper bound | |
| } | |
| TEST(HyperLogLogTests, MergeDisjointSets) { | |
| HyperLogLog hll1; | |
| HyperLogLog hll2; | |
| uint64_t val = 123456789ULL; | |
| for (int i = 0; i < 100; ++i) { | |
| hll1.insert(val); | |
| val = val * 6364136223846793005ULL + 1442695043ULL; | |
| } | |
| uint64_t val2 = 987654321ULL; | |
| for (int i = 0; i < 100; ++i) { | |
| hll2.insert(val2); | |
| val2 = val2 * 6364136223846793005ULL + 1442695043ULL; | |
| } | |
| uint64_t card1 = hll1.cardinality(); | |
| uint64_t card2 = hll2.cardinality(); | |
| hll1.merge(hll2); | |
| uint64_t merged = hll1.cardinality(); | |
| // Merged cardinality should be >= either individual | |
| EXPECT_GE(merged, card1); | |
| EXPECT_GE(merged, card2); | |
| // Both sets are disjoint with good distribution, merged should be in a reasonable range | |
| EXPECT_LT(merged, 50000U); // Sanity upper bound | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/hll_test.cpp` around lines 178 - 200, Rename the test whose TEST macro
uses the name "HyperLogLogTests, MergeOverlappingSets" to reflect the actual
behavior (the two LCG seeds produce disjoint sets); change the TEST identifier
to something like "HyperLogLogTests, MergeDisjointSets" and update any
references to this test name; ensure the test function name in the TEST macro
matches the comment and intent (or alternatively change the data generation to
produce overlapping values if you prefer to keep the original name).
Summary
HyperLogLogclass (include/common/hll.hpp) — memory-bounded probabilistic NDV estimator (~12KB per column vs unboundedunordered_set)execute_analyze()replacingunordered_set<string>for NDV collectionhll_test.cppunit tests (10 tests passing)ExecutionTests.AnalyzeTableto useEXPECT_GEinstead of exact equality (HLL is probabilistic)Implementation Details
Test Results
server_testsSIGPIPE is pre-existing)AnalyzeTableandAnalyzeTableLargeRowspassNote
The branch was created from
mainand pushed successfully. Ready for review.Summary by CodeRabbit