CBO Phase 3: HyperLogLog NDV Estimator by poyrazK · Pull Request #82 · poyrazK/cloudSQL

poyrazK · 2026-05-12T15:21:43Z

Summary

Add HyperLogLog class (include/common/hll.hpp) — memory-bounded probabilistic NDV estimator (~12KB per column vs unbounded unordered_set)
Integrate HLL into execute_analyze() replacing unordered_set<string> for NDV collection
Add hll_test.cpp unit tests (10 tests passing)
Fix ExecutionTests.AnalyzeTable to use EXPECT_GE instead of exact equality (HLL is probabilistic)

Implementation Details

Hash: FNV-1a (replaces djb2 which had poor upper-bit distribution)
Index: bottom 11 bits of hash → register index (0-2047)
Register value: trailing zeros in upper 53 bits + 1
Small cardinality fix: linear counting fallback when nonzero registers < m/20

Test Results

37/38 tests pass (server_tests SIGPIPE is pre-existing)
HLL tests: 10/10 pass
AnalyzeTable and AnalyzeTableLargeRows pass

Note

The branch was created from main and pushed successfully. Ready for review.

Summary by CodeRabbit

New Features
- Added a fixed-size HyperLogLog distinct-count estimator and switched ANALYZE TABLE to use memory-bounded probabilistic sketches for NDV estimation.
Tests
- Extensive unit and integration tests for the sketch (accuracy bounds, merge, seed behavior, reset, hashing) and updated an existing NDV assertion to a probabilistic lower bound.
Chores
- Registered a new test executable in the test suite.

coderabbitai · 2026-05-12T15:21:56Z

📝 Walkthrough

Walkthrough

Adds a header-only HyperLogLog implementation, uses per-column HLL sketches in QueryExecutor::execute_analyze to estimate NDV, adds unit tests for HLL, adjusts an integration test to accept probabilistic NDV, and registers a new CMake test target hll_tests.

Changes

HyperLogLog Cardinality Estimation and ANALYZE TABLE

Layer / File(s)	Summary
HyperLogLog Library Implementation `include/common/hll.hpp`	New `cloudsql::common::HyperLogLog` header: 2048-register sketch, `insert()`, `cardinality()` with linear-counting and bias adjustment, `reset()`, `merge()`, and `hash_bytes()` (FNV-1a).
QueryExecutor::execute_analyze Integration `src/executor/query_executor.cpp`	Replaces per-column `unordered_set` NDV tracking with `std::vector<common::HyperLogLog>`; hashes values by type during table scan and inserts into HLL; derives column NDV via `HyperLogLog::cardinality()` before updating catalog stats.
HyperLogLog Unit & Integration Tests `tests/hll_test.cpp`	New GoogleTest suite covering empty/non-empty cardinality, insert behavior, distinct-value streams, `hash_bytes` determinism/difference, `reset()`, `merge()`, seed reproducibility/difference, accuracy sanity bounds, and value-type insertion coverage.
Integration test adjustment `tests/cloudSQL_tests.cpp`	Updates `ExecutionTests.AnalyzeTable` to assert a probabilistic lower bound (`EXPECT_GE`) for text-column NDV instead of exact equality.
CMake Test Target Registration `CMakeLists.txt`	Adds `add_cloudsql_test(hll_tests tests/hll_test.cpp)` to BUILD_TESTS to build/register the new test executable.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Replace unordered_set NDV with memory-bounded HyperLogLog estimator #77: Proposes replacing per-column NDV unordered_set tracking with a HyperLogLog estimator — this PR implements that change.

Possibly related PRs

poyrazK/cloudSQL#2: Adds GoogleTest/CMake test registration macros used to register the new hll_tests target.
poyrazK/cloudSQL#81: Modifies CMakeLists.txt's BUILD_TESTS test registrations (similar test-target additions).
poyrazK/cloudSQL#75: Also modifies QueryExecutor::execute_analyze(...); overlaps with analyze/NDV changes.

Poem

🐰 I hopped in with registers bright and small,
Counting uniques without storing them all,
I hash each byte with FNV's tune,
Merge and reset beneath the moon,
ANALYZE now hums an estimate for all.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'CBO Phase 3: HyperLogLog NDV Estimator' directly and clearly summarizes the main change: adding a HyperLogLog implementation for number of distinct values (NDV) estimation as part of the query optimizer cost-based optimization (CBO) initiative, which is confirmed by all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 91.67% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/hll-ndv

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Adds new hll_tests executable for HyperLogLog unit testing.

Replaces std::unordered_set<std::string>> NDV collection with std::vector<common::HyperLogLog>. Uses FNV-1a hash for text (better upper-bit distribution than djb2) and Value::Hash for numeric types. Linear counting fallback for small cardinalities prevents HLL's extreme overestimation when few registers are used.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/executor/query_executor.cpp (1)

307-324: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Route ANALYZE TABLE through execute_analyze(...).

execute_analyze(...) is implemented below, but this dispatcher never calls it, so ANALYZE TABLE will still fall into "Unsupported statement type".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 307 - 324, The dispatcher in
execute (the if/else chain handling parser::StmtType cases) is missing a branch
for parser::StmtType::Analyze; add an else-if branch that calls execute_analyze
with the statement cast to the correct type (e.g., dynamic_cast<const
parser::AnalyzeStatement&>(stmt)) and pass the current transaction object (txn)
if execute_analyze expects it, mirroring how execute_select/execute_insert/etc.
are invoked so ANALYZE TABLE routes to execute_analyze instead of falling
through to "Unsupported statement type".

🧹 Nitpick comments (1)

tests/hll_test.cpp (1)

56-165: ⚡ Quick win

Strengthen these assertions beyond “non-zero”.

Most of this suite still passes if cardinality() returns 1 for every non-empty sketch, so it won't catch broken HLL math. For a deterministic input stream, add a few bounded accuracy checks and verify merged sketches estimate at least as large as each input sketch.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/hll_test.cpp` around lines 56 - 165, Replace the weak "non-zero"
assertions with bounded-accuracy checks: for tests that insert N distinct items
(e.g., the loops in DistinctValuesProduceCardinality,
DistinctValueSetsProduceCardinality, TextValueInsertion,
MergeCombinesDistinctSets) assert the returned HyperLogLog::cardinality() lies
within a reasonable relative tolerance of N (for example N*0.5 <= card <= N*1.5)
or use a configurable relative error constant (e.g., 1.6% * k or a fixed 50% for
small N) and assert merge preserves/raises counts (after hll1.merge(hll2) assert
hll1.cardinality() >= old_hll1_cardinality and >= old_hll2_cardinality);
similarly replace the single zero-check after reset to EXPECT_EQ(cardinality(),
0U) remains but ensure before reset you capture and assert expected range; use
HyperLogLog::insert, cardinality(), merge(), reset(), and hash_bytes() as
anchors when editing each test.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/common/hll.hpp`:
- Around line 122-125: merge() silently combines HyperLogLog sketches with
different hash seeds which corrupts results; change merge(const HyperLogLog&
other) noexcept to remove noexcept and add a check that other.seed_ == seed_,
and if not, throw a descriptive std::invalid_argument (or std::runtime_error)
mentioning mismatched seeds; if seeds match, proceed with the existing loop over
kNumRegisters updating registers_.at(i). Ensure you reference and update the
function signature and any callers to handle the exception.
- Around line 83-111: The sparse-case and raw HLL formulas are wrong (they use
log2 and lack alpha scaling); replace the linear counting and raw estimate
calculations so linear counting uses the natural log (std::log) with V =
empty_count and E = -m * std::log(static_cast<double>(empty_count) / m), and
replace raw_estimate with the HLL estimator raw_estimate = alpha_m * m * m / sum
where alpha_m is chosen per m (compute alpha_m using the standard small-m
constants or formula for get_alpha(m)); keep existing bounds handling for
estimate, casting and use kMaxCardinality as before (refer to nonzero_count, m,
empty_count, sum, raw_estimate, estimate, kMaxCardinality).

In `@src/executor/query_executor.cpp`:
- Around line 955-961: The text branch currently hashes only a 64-byte prefix
(see col_info.type checks and val.as_text()) which undercounts NDV; change it to
hash the entire string by passing the full length of val.as_text() to
common::HyperLogLog::hash_bytes (replace the prefix_len logic and use s.size()),
ensuring the computed hash value assigned to variable hash covers the whole text
for ValueType::TYPE_TEXT/TYPE_VARCHAR/TYPE_CHAR cases.

In `@tests/cloudSQL_tests.cpp`:
- Around line 1263-1610: The file contains unresolved git conflict markers
surrounding a large block of new tests (e.g., TEST(ExecutionTests,
AnalyzeTable), AnalyzeTableLargeRows, AnalyzeTableNonExistent,
AnalyzeTableEmpty, AnalyzeTableWithNulls, AnalyzeTableStringStats,
AnalyzeFilterSelectivity, AnalyzeJoinOrder); remove the conflict markers
(<<<<<<<, =======, >>>>>>>) and keep the intended test additions (the ANALYZE
TABLE test cases) as a single coherent block, ensuring only one copy of each
TEST(...) remains and the surrounding includes/namespace usages remain valid so
the file compiles.

---

Outside diff comments:
In `@src/executor/query_executor.cpp`:
- Around line 307-324: The dispatcher in execute (the if/else chain handling
parser::StmtType cases) is missing a branch for parser::StmtType::Analyze; add
an else-if branch that calls execute_analyze with the statement cast to the
correct type (e.g., dynamic_cast<const parser::AnalyzeStatement&>(stmt)) and
pass the current transaction object (txn) if execute_analyze expects it,
mirroring how execute_select/execute_insert/etc. are invoked so ANALYZE TABLE
routes to execute_analyze instead of falling through to "Unsupported statement
type".

---

Nitpick comments:
In `@tests/hll_test.cpp`:
- Around line 56-165: Replace the weak "non-zero" assertions with
bounded-accuracy checks: for tests that insert N distinct items (e.g., the loops
in DistinctValuesProduceCardinality, DistinctValueSetsProduceCardinality,
TextValueInsertion, MergeCombinesDistinctSets) assert the returned
HyperLogLog::cardinality() lies within a reasonable relative tolerance of N (for
example N*0.5 <= card <= N*1.5) or use a configurable relative error constant
(e.g., 1.6% * k or a fixed 50% for small N) and assert merge preserves/raises
counts (after hll1.merge(hll2) assert hll1.cardinality() >= old_hll1_cardinality
and >= old_hll2_cardinality); similarly replace the single zero-check after
reset to EXPECT_EQ(cardinality(), 0U) remains but ensure before reset you
capture and assert expected range; use HyperLogLog::insert, cardinality(),
merge(), reset(), and hash_bytes() as anchors when editing each test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec6e791d-6dfa-461f-bae4-f36dbc2e255b

📥 Commits

Reviewing files that changed from the base of the PR and between 18d5d03 and e788277.

📒 Files selected for processing (5)

CMakeLists.txt
include/common/hll.hpp
src/executor/query_executor.cpp
tests/cloudSQL_tests.cpp
tests/hll_test.cpp

coderabbitai · 2026-05-12T15:26:10Z

+        // For sparse data (few registers used), use linear counting to avoid
+        // HLL's extreme overestimation. When registers are sparse (nonzero < m/20),
+        // the HLL raw formula gives wildly incorrect results.
+        if (nonzero_count < static_cast<int>(m / 20)) {
+            // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction
+            // Using simple form without alpha scaling for very sparse data
+            double linear_est = -m * std::log2(static_cast<double>(empty_count) / m);
+            return static_cast<uint64_t>(std::max(1.0, linear_est));
+        }
+
+        // Standard HLL formula for moderate to large cardinalities
+        double raw_estimate = m * std::log2(m / sum);
+
+        // Bias correction for small cardinalities
+        // HLL systematically overestimates for small n; apply downward bias
+        double bias = 0.0;
+        if (raw_estimate <= 2.5 * m) {
+            bias = -0.5 * (m / 10.0);
+        }
+
+        double estimate = raw_estimate + bias;
+
+        if (estimate < 0) {
+            return 0;
+        }
+        if (estimate > static_cast<double>(kMaxCardinality)) {
+            return kMaxCardinality;
+        }
+        return static_cast<uint64_t>(estimate);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the actual HLL estimators here.

Both estimates are using log2(...), but standard HLL needs linear counting with the natural log for the sparse case and alpha_m * m * m / sum for the raw estimate. As written, this will store materially biased NDV stats for ANALYZE TABLE.

Suggested fix

- if (nonzero_count < static_cast<int>(m / 20)) { - // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction - // Using simple form without alpha scaling for very sparse data - double linear_est = -m * std::log2(static_cast<double>(empty_count) / m); - return static_cast<uint64_t>(std::max(1.0, linear_est)); - } - - // Standard HLL formula for moderate to large cardinalities - double raw_estimate = m * std::log2(m / sum); + const double alpha_m = 0.7213 / (1.0 + 1.079 / m); + double raw_estimate = alpha_m * m * m / sum; + if (empty_count > 0 && raw_estimate <= 2.5 * m) { + const double linear_est = m * std::log(m / static_cast<double>(empty_count)); + return static_cast<uint64_t>(std::llround(linear_est)); + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// For sparse data (few registers used), use linear counting to avoid

// HLL's extreme overestimation. When registers are sparse (nonzero < m/20),

// the HLL raw formula gives wildly incorrect results.

if (nonzero_count < static_cast<int>(m / 20)) {

// Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction

// Using simple form without alpha scaling for very sparse data

double linear_est = -m * std::log2(static_cast<double>(empty_count) / m);

return static_cast<uint64_t>(std::max(1.0, linear_est));

}

// Standard HLL formula for moderate to large cardinalities

double raw_estimate = m * std::log2(m / sum);

// Bias correction for small cardinalities

// HLL systematically overestimates for small n; apply downward bias

double bias = 0.0;

if (raw_estimate <= 2.5 * m) {

bias = -0.5 * (m / 10.0);

}

double estimate = raw_estimate + bias;

if (estimate < 0) {

return 0;

}

if (estimate > static_cast<double>(kMaxCardinality)) {

return kMaxCardinality;

}

return static_cast<uint64_t>(estimate);

const double alpha_m = 0.7213 / (1.0 + 1.079 / m);

double raw_estimate = alpha_m * m * m / sum;

if (empty_count > 0 && raw_estimate <= 2.5 * m) {

const double linear_est = m * std::log(m / static_cast<double>(empty_count));

return static_cast<uint64_t>(std::llround(linear_est));

}

// Bias correction for small cardinalities

// HLL systematically overestimates for small n; apply downward bias

double bias = 0.0;

if (raw_estimate <= 2.5 * m) {

bias = -0.5 * (m / 10.0);

}

double estimate = raw_estimate + bias;

if (estimate < 0) {

return 0;

}

if (estimate > static_cast<double>(kMaxCardinality)) {

return kMaxCardinality;

}

return static_cast<uint64_t>(estimate);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/common/hll.hpp` around lines 83 - 111, The sparse-case and raw HLL formulas are wrong (they use log2 and lack alpha scaling); replace the linear counting and raw estimate calculations so linear counting uses the natural log (std::log) with V = empty_count and E = -m * std::log(static_cast<double>(empty_count) / m), and replace raw_estimate with the HLL estimator raw_estimate = alpha_m * m * m / sum where alpha_m is chosen per m (compute alpha_m using the standard small-m constants or formula for get_alpha(m)); keep existing bounds handling for estimate, casting and use kMaxCardinality as before (refer to nonzero_count, m, empty_count, sum, raw_estimate, estimate, kMaxCardinality).

coderabbitai · 2026-05-12T15:26:10Z

+    void merge(const HyperLogLog& other) noexcept {
+        for (size_t i = 0; i < kNumRegisters; ++i) {
+            registers_.at(i) = std::max(registers_.at(i), other.registers_.at(i));
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Block merges across different seeds.

merge() currently combines sketches even when seed_ differs. That silently unions registers from different hash domains, so the resulting cardinality is meaningless.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/common/hll.hpp` around lines 122 - 125, merge() silently combines HyperLogLog sketches with different hash seeds which corrupts results; change merge(const HyperLogLog& other) noexcept to remove noexcept and add a check that other.seed_ == seed_, and if not, throw a descriptive std::invalid_argument (or std::runtime_error) mentioning mismatched seeds; if seeds match, proceed with the existing loop over kNumRegisters updating registers_.at(i). Ensure you reference and update the function signature and any callers to handle the exception.

coderabbitai · 2026-05-12T15:26:10Z

+                if (col_info.type == common::ValueType::TYPE_TEXT ||
+                    col_info.type == common::ValueType::TYPE_VARCHAR ||
+                    col_info.type == common::ValueType::TYPE_CHAR) {
+                    // Use 64-char prefix for text hashing
+                    const std::string& s = val.as_text();
+                    size_t prefix_len = std::min(s.size(), size_t(64));
+                    hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hash the full text value for NDV.

Using only the first 64 characters makes all longer strings with a shared prefix collide into the same HLL stream, which will systematically undercount NDV for common text patterns.

Suggested fix

- // Use 64-char prefix for text hashing const std::string& s = val.as_text(); - size_t prefix_len = std::min(s.size(), size_t(64)); - hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len); + hash = common::HyperLogLog::hash_bytes(s.data(), s.size());

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/query_executor.cpp` around lines 955 - 961, The text branch currently hashes only a 64-byte prefix (see col_info.type checks and val.as_text()) which undercounts NDV; change it to hash the entire string by passing the full length of val.as_text() to common::HyperLogLog::hash_bytes (replace the prefix_len logic and use s.size()), ensuring the computed hash value assigned to variable hash covers the whole text for ValueType::TYPE_TEXT/TYPE_VARCHAR/TYPE_CHAR cases.

New files: - include/common/hll.hpp: HLL class with FNV-1a hash, bottom-k index, trailing zeros register values, linear counting for small n - tests/hll_test.cpp: 10 unit tests covering basic operations, hash consistency, reset, merge, and text value insertion

HLL is probabilistic and may estimate 4 instead of 3 for small cardinalities. Use >= instead of exact equality.

coderabbitai

♻️ Duplicate comments (1)

src/executor/query_executor.cpp (1)

1019-1022: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hash the full text value for NDV.

Using only the first 64 characters will cause all longer strings with a shared prefix to collide into the same HLL stream, systematically undercounting NDV for text columns with common prefixes.

🔧 Proposed fix

-                    // Use 64-char prefix for text hashing
                     const std::string& s = val.as_text();
-                    size_t prefix_len = std::min(s.size(), size_t(64));
-                    hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len);
+                    hash = common::HyperLogLog::hash_bytes(s.data(), s.size());

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1019 - 1022, The NDV hashing
currently only uses the first 64 bytes of the text (variable s and prefix_len)
which causes collisions for long strings; change the call to
common::HyperLogLog::hash_bytes to hash the entire string length (use s.size() /
full length) instead of std::min(..., 64) so that the full text value is passed
to hash_bytes (update the variables around s, prefix_len and the hash assignment
in query_executor.cpp accordingly).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/executor/query_executor.cpp`:
- Around line 1019-1022: The NDV hashing currently only uses the first 64 bytes
of the text (variable s and prefix_len) which causes collisions for long
strings; change the call to common::HyperLogLog::hash_bytes to hash the entire
string length (use s.size() / full length) instead of std::min(..., 64) so that
the full text value is passed to hash_bytes (update the variables around s,
prefix_len and the hash assignment in query_executor.cpp accordingly).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71bc574a-2691-44ae-baa8-a3093707a8d7

📥 Commits

Reviewing files that changed from the base of the PR and between e788277 and 6482641.

📒 Files selected for processing (5)

CMakeLists.txt
include/common/hll.hpp
src/executor/query_executor.cpp
tests/cloudSQL_tests.cpp
tests/hll_test.cpp

🚧 Files skipped from review as they are similar to previous changes (2)

CMakeLists.txt
include/common/hll.hpp

- Add kLinearCountingThreshold, kBiasCorrectionBoundary, kBiasAdjustmentFactor constants to replace magic numbers in cardinality() formula - Fix missing newline at EOF in hll.hpp and hll_test.cpp - Add AccuracyBoundsDistinct test for cardinality bounds - Add MergeOverlappingSets test for HLL merge behavior - Add SeedReproducibility and DifferentSeedsDiffer tests for seed behavior

Tests INT64, BIGINT, DOUBLE, and TEXT value types using the same integration path as execute_analyze() — Value::Hash{} for numeric types and hash_bytes() for text.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/hll_test.cpp (1)

54-63: 💤 Low value

Consider extracting LCG constants for clarity.

The LCG multiplier 6364136223846793005ULL and increment 1442695043ULL are reused across multiple tests. Extracting them as named constants would improve readability and maintainability.

♻️ Proposed refactor

Add at the top of the anonymous namespace:

namespace {

// LCG constants for generating distinct test values
constexpr uint64_t kLcgMultiplier = 6364136223846793005ULL;
constexpr uint64_t kLcgIncrement = 1442695043ULL;

// Then use in tests:
val = val * kLcgMultiplier + kLcgIncrement;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/hll_test.cpp` around lines 54 - 63, Extract the LCG magic numbers into
named constexprs (e.g., add constexpr uint64_t kLcgMultiplier =
6364136223846793005ULL; and constexpr uint64_t kLcgIncrement = 1442695043ULL;
inside the anonymous namespace at the top of the test file) and replace the
literal uses in the HyperLogLogTests::DistinctValuesProduceCardinality test (and
any other tests using the same LCG pattern) so the loop uses val = val *
kLcgMultiplier + kLcgIncrement; to improve readability and reuse.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/hll_test.cpp`:
- Around line 178-200: Rename the test whose TEST macro uses the name
"HyperLogLogTests, MergeOverlappingSets" to reflect the actual behavior (the two
LCG seeds produce disjoint sets); change the TEST identifier to something like
"HyperLogLogTests, MergeDisjointSets" and update any references to this test
name; ensure the test function name in the TEST macro matches the comment and
intent (or alternatively change the data generation to produce overlapping
values if you prefer to keep the original name).

---

Nitpick comments:
In `@tests/hll_test.cpp`:
- Around line 54-63: Extract the LCG magic numbers into named constexprs (e.g.,
add constexpr uint64_t kLcgMultiplier = 6364136223846793005ULL; and constexpr
uint64_t kLcgIncrement = 1442695043ULL; inside the anonymous namespace at the
top of the test file) and replace the literal uses in the
HyperLogLogTests::DistinctValuesProduceCardinality test (and any other tests
using the same LCG pattern) so the loop uses val = val * kLcgMultiplier +
kLcgIncrement; to improve readability and reuse.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b8e0ee5-88ac-4b2f-8c69-b310bdc9e778

📥 Commits

Reviewing files that changed from the base of the PR and between 6482641 and 72dfcc3.

📒 Files selected for processing (2)

include/common/hll.hpp
tests/hll_test.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

include/common/hll.hpp

coderabbitai · 2026-05-14T13:18:54Z

+TEST(HyperLogLogTests, MergeOverlappingSets) {
+    HyperLogLog hll1;
+    HyperLogLog hll2;
+    uint64_t val = 123456789ULL;
+    for (int i = 0; i < 100; ++i) {
+        hll1.insert(val);
+        val = val * 6364136223846793005ULL + 1442695043ULL;
+    }
+    uint64_t val2 = 987654321ULL;
+    for (int i = 0; i < 100; ++i) {
+        hll2.insert(val2);
+        val2 = val2 * 6364136223846793005ULL + 1442695043ULL;
+    }
+    uint64_t card1 = hll1.cardinality();
+    uint64_t card2 = hll2.cardinality();
+    hll1.merge(hll2);
+    uint64_t merged = hll1.cardinality();
+    // Merged cardinality should be >= either individual
+    EXPECT_GE(merged, card1);
+    EXPECT_GE(merged, card2);
+    // Both sets are disjoint with good distribution, merged should be in a reasonable range
+    EXPECT_LT(merged, 50000U);  // Sanity upper bound
+}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test name does not match implementation.

The test is named MergeOverlappingSets but uses disjoint sets (different LCG seeds: 123456789ULL vs 987654321ULL). The comment on line 198 even states "Both sets are disjoint." Consider renaming to MergeDisjointSets or similar to avoid confusion.

📝 Suggested rename

-TEST(HyperLogLogTests, MergeOverlappingSets) { +TEST(HyperLogLogTests, MergeDisjointSets) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

TEST(HyperLogLogTests, MergeOverlappingSets) {

HyperLogLog hll1;

HyperLogLog hll2;

uint64_t val = 123456789ULL;

for (int i = 0; i < 100; ++i) {

hll1.insert(val);

val = val * 6364136223846793005ULL + 1442695043ULL;

}

uint64_t val2 = 987654321ULL;

for (int i = 0; i < 100; ++i) {

hll2.insert(val2);

val2 = val2 * 6364136223846793005ULL + 1442695043ULL;

}

uint64_t card1 = hll1.cardinality();

uint64_t card2 = hll2.cardinality();

hll1.merge(hll2);

uint64_t merged = hll1.cardinality();

// Merged cardinality should be >= either individual

EXPECT_GE(merged, card1);

EXPECT_GE(merged, card2);

// Both sets are disjoint with good distribution, merged should be in a reasonable range

EXPECT_LT(merged, 50000U); // Sanity upper bound

}

TEST(HyperLogLogTests, MergeDisjointSets) {

HyperLogLog hll1;

HyperLogLog hll2;

uint64_t val = 123456789ULL;

for (int i = 0; i < 100; ++i) {

hll1.insert(val);

val = val * 6364136223846793005ULL + 1442695043ULL;

}

uint64_t val2 = 987654321ULL;

for (int i = 0; i < 100; ++i) {

hll2.insert(val2);

val2 = val2 * 6364136223846793005ULL + 1442695043ULL;

}

uint64_t card1 = hll1.cardinality();

uint64_t card2 = hll2.cardinality();

hll1.merge(hll2);

uint64_t merged = hll1.cardinality();

// Merged cardinality should be >= either individual

EXPECT_GE(merged, card1);

EXPECT_GE(merged, card2);

// Both sets are disjoint with good distribution, merged should be in a reasonable range

EXPECT_LT(merged, 50000U); // Sanity upper bound

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/hll_test.cpp` around lines 178 - 200, Rename the test whose TEST macro uses the name "HyperLogLogTests, MergeOverlappingSets" to reflect the actual behavior (the two LCG seeds produce disjoint sets); change the TEST identifier to something like "HyperLogLogTests, MergeDisjointSets" and update any references to this test name; ensure the test function name in the TEST macro matches the comment and intent (or alternatively change the data generation to produce overlapping values if you prefer to keep the original name).

poyrazK

It's okay to merge

poyrazK added 2 commits May 12, 2026 18:25

Add hll_tests target to CMakeLists.txt

a558bb7

Adds new hll_tests executable for HyperLogLog unit testing.

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

poyrazK added 2 commits May 12, 2026 18:27

Fix AnalyzeTable test to use EXPECT_GE for HLL NDV estimate

1c31535

HLL is probabilistic and may estimate 4 instead of 3 for small cardinalities. Use >= instead of exact equality.

poyrazK force-pushed the feature/hll-ndv branch from e788277 to 1c31535 Compare May 12, 2026 15:30

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

poyrazK added 2 commits May 13, 2026 15:41

Add ValueTypeColumnCoverage test for HLL

8cca192

Tests INT64, BIGINT, DOUBLE, and TEXT value types using the same integration path as execute_analyze() — Value::Hash{} for numeric types and hash_bytes() for text.

poyrazK force-pushed the feature/hll-ndv branch from 6482641 to 8cca192 Compare May 14, 2026 13:15

style: automated clang-format fixes

72dfcc3

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

poyrazK commented May 14, 2026

View reviewed changes

poyrazK merged commit f2fae18 into main May 14, 2026
1 check passed

Conversation

poyrazK commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation Details

Test Results

Note

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

poyrazK left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poyrazK commented May 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading