CNDB-15669: Fully off-heap memtable by blambov · Pull Request #2308 · datastax/cassandra

blambov · 2026-04-07T11:50:34Z

What is the issue

https://github.com/riptano/cndb/issues/15669
https://github.com/riptano/cndb/issues/10302

What does this PR fix and why was it fixed

Implementation of the fully off-heap, tombstone-aware memtable.

The first commit is CNDB-10302 as reviewed in #2005, adding tombstone support. The second refactors some of the access interfaces to combine the cursor position into a single long for efficiency and extra flexibility, which the third commit uses to lift some restrictions in the kinds of ranges that the tries could support. The fifth commit extends the memtable trie all the way to individual cells, and the sixth makes it possible to store data in trie cells. When used with offheap_objects allocation type, this memtable is fully off-heap, with ~100KiB of on-heap presence irrespective of data size.

Each commit should compile and pass tests, and comes with documentation in the included markdown files.

github-actions · 2026-04-07T11:50:52Z

sonarqubecloud · 2026-04-27T14:56:16Z

Quality Gate passed

Issues
83 New issues
0 Accepted issues

Measures
0 Security Hotspots
82.4% Coverage on New Code
4.5% Duplication on New Code

See analysis details on SonarQube Cloud

sonarqubecloud · 2026-05-26T14:57:00Z

Quality Gate passed

Issues
84 New issues
0 Accepted issues

Measures
0 Security Hotspots
83.4% Coverage on New Code
4.3% Duplication on New Code

See analysis details on SonarQube Cloud

lesnik2u · 2026-05-28T10:56:09Z

+            else
+            {
+                // We are making a copy of another PartitionData object.
+                buffer.putLongOrdered(inBufferPos + PARTITIONDATA_OFFSET_ROW_COUNT, partitionData.rowCountIncludingStatic());


partitionData.rowCountIncludingStatic() returns an int, so writing it with putLongOrdered writes 8 bytes. I presume this won't be a problem for us (ds/ibm), but I've heard that OSS is sometimes run on big-endian platforms. If so, the int promoted to a 64-bit long will place the row count at offset 4 (overwriting the tombstone count) and write 0 at offset 0.
Correct me if I am wrong please.

This is definitely a mistake as all other uses of PARTITIONDATA_OFFSET_ROW_COUNT use putIntOrdered. Fixed.
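The width mismatch discussed above can be reproduced outside the memtable. The following is a minimal sketch, not the PR's code: it uses java.nio.ByteBuffer as a stand-in for UnsafeBuffer, and the class name, the two offsets and the write order (tombstone count first) are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class RowCountWidthDemo
{
    // Hypothetical layout: a 4-byte row count directly followed by a 4-byte tombstone count.
    static final int OFFSET_ROW_COUNT = 0;
    static final int OFFSET_TOMBSTONE_COUNT = 4;

    /** Returns {rowCount, tombstoneCount} as read back after both writes. */
    static int[] writeCounts(ByteOrder order, boolean useLongWrite, int rowCount, int tombstoneCount)
    {
        ByteBuffer buffer = ByteBuffer.allocate(8).order(order);
        buffer.putInt(OFFSET_TOMBSTONE_COUNT, tombstoneCount);
        if (useLongWrite)
            buffer.putLong(OFFSET_ROW_COUNT, rowCount); // int widened to long: 8 bytes written
        else
            buffer.putInt(OFFSET_ROW_COUNT, rowCount);  // 4 bytes written, neighbour untouched
        return new int[]{ buffer.getInt(OFFSET_ROW_COUNT), buffer.getInt(OFFSET_TOMBSTONE_COUNT) };
    }

    public static void main(String[] args)
    {
        int[] longWrite = writeCounts(ByteOrder.BIG_ENDIAN, true, 7, 3);
        int[] intWrite = writeCounts(ByteOrder.BIG_ENDIAN, false, 7, 3);
        System.out.println("long write: rows=" + longWrite[0] + " tombstones=" + longWrite[1]);
        System.out.println("int write:  rows=" + intWrite[0] + " tombstones=" + intWrite[1]);
    }
}
```

With big-endian order the long write reads back as rows=0, tombstones=7, which matches the reviewer's description. In this sketch the little-endian long write keeps the row count but zeroes the neighbouring field, because the tombstone count was written first; whether the real code is affected in that way depends on its write order, which the thread does not state.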

lesnik2u · 2026-05-28T10:58:07Z

                try
                {
-                    SHARD_COUNT = Integer.valueOf(shardCount);
+                    SHARD_COUNT = Integer.parseInt(shardCount);


SHARD_COUNT is mutable via JMX, but it only changes for new memtables

We should add a log message or javadoc warning to clarify that this change only takes effect for newly created memtables.

Actually, this number is never used.

Changed the config object to defer to AbstractShardedMemtable for the shard count and added comments for the lock fairness.

lesnik2u · 2026-05-28T11:23:26Z

-        final InMemoryTrie<Object> data;
+        final InMemoryDeletionAwareTrie<Object, TrieTombstoneMarker> data;

        RegularAndStaticColumns columns;


Shouldn't we mark both `RegularAndStaticColumns columns;` and `EncodingStats stats;` as volatile? They are mutated in put() under a write lock but read concurrently by reader threads via TrieMemtable.columns() and TrieMemtable.encodingStats() without locks.

I don't know why these were missed when all the other fields were marked volatile. Changed in all TrieMemtable variants.

lesnik2u · 2026-05-28T11:38:36Z

+        if (estimatedAverageRowSize == null || currentOperations.get() > estimatedAverageRowSize.operations * 1.5)
+            estimatedAverageRowSize = new MemtableAverageRowSize(this, mergedTrie);


estimatedAverageRowSize is not thread safe. IIUC, if there are multiple concurrent read loads, they can simultaneously trigger and compute MemtableAverageRowSize, traversing the entire trie.

I think we should guarantee that only one thread recalculates this estimate?

Well, this was meant to be done in such a way that it is quicker to perform twice than to synchronize.

I broke the 100-row limit that was supposed to do that. Removed the special trie code to bring it back.
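The trade-off described in this reply (tolerating a duplicate computation instead of synchronizing, provided each computation is cheap) can be sketched as follows. This is an illustration under stated assumptions, not the PR's MemtableAverageRowSize: the class, the long[] standing in for the trie, the volatile field and the way the 100-row cap is applied are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RacyEstimateDemo
{
    // The thread mentions a 100-row limit; how it is applied here is an assumption.
    static final int ROW_LIMIT = 100;

    /** Immutable snapshot, so a racing reader sees either the old or the new estimate in full. */
    static final class Estimate
    {
        final long operations;
        final long averageRowSize;

        Estimate(long operations, long averageRowSize)
        {
            this.operations = operations;
            this.averageRowSize = averageRowSize;
        }
    }

    private final long[] rowSizes; // stand-in for the data a real implementation would walk
    private final AtomicLong currentOperations = new AtomicLong();
    private volatile Estimate estimate;

    RacyEstimateDemo(long[] rowSizes)
    {
        this.rowSizes = rowSizes;
    }

    void recordOperation()
    {
        currentOperations.incrementAndGet();
    }

    long estimatedAverageRowSize()
    {
        Estimate e = estimate;
        // Two threads may both pass this check and both recompute; the last write wins.
        if (e == null || currentOperations.get() > e.operations * 1.5)
        {
            e = compute();
            estimate = e;
        }
        return e.averageRowSize;
    }

    private Estimate compute()
    {
        long operations = currentOperations.get();
        int sampled = Math.min(rowSizes.length, ROW_LIMIT); // bounded work per recomputation
        long total = 0;
        for (int i = 0; i < sampled; i++)
            total += rowSizes[i];
        return new Estimate(operations, sampled == 0 ? 0 : total / sampled);
    }

    public static void main(String[] args)
    {
        RacyEstimateDemo demo = new RacyEstimateDemo(new long[]{ 10, 20, 30 });
        System.out.println(demo.estimatedAverageRowSize());
    }
}
```

The race is only acceptable while the per-call cost stays bounded; once the cap is lost, every racing reader pays for a full traversal, which is the situation the reviewer flagged.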

lesnik2u · 2026-05-28T11:44:19Z

+        }
+
+        @Override
+        public String dumpContent(UnsafeBuffer buffer, int inBufferPos, int offsetBits)


Can a cell be EmbeddedNoTTL here? If so, this could fail, IIUC.

Not fail but rather put data behind the "ttl" and "ldt" labels; as this is only for debugging and the data is still visible, this is okay.

Added comment.

lesnik2u · 2026-05-28T14:24:53Z

+    static long positionForSkippingBranch(long encodedBranchPosition)
+    {
+        return encodedBranchPosition + (1L << TRANSITION_SHIFT);
+    }
+
+    /// Returns true if the given `currPosition` as returned by `advance`, `advanceMultiple` or `skipTo` is the result


Under normal circumstances, this increments the transition byte. However, if the current transition byte is 0xFF (the last transition in forward direction, or 0x00 in reverse which is encoded as ~0x00 = 0xFF), adding 1 carries over to bit 28 (the 9th transition bit).
So, IIUC, when skipTo subsequently decodes this position via Cursor.incomingTransition(encodedSkipPosition), the carry-over causes issues:

Forward Direction: The decoding masks the transition with & 0xFF (line 183), discarding bit 28. The returned skipTransition is evaluated as 0 instead of 0x100 (256).
Reverse Direction: Since DIRECTION_BIT is set, incomingTransition flips all bits. The carry-over bit 28 is flipped from 1 to 0, and bits 20–27 are flipped from 0 to 0xFF (255).
Because the skip transition is parsed as 0 (forward) or 255 (reverse) instead of 256, the cursor thinks it is looking for a sibling transition >= 0 (or <= 255 in reverse order) rather than ascending:

If the parent is a SPLIT node: With assertions enabled, this immediately triggers a crash in nextValidSplitTransition via: assert midIndex != direction.select(0, SPLIT_START_LEVEL_LIMIT - 1) (which fails because 3 != 3 is false). With assertions disabled, it descends into the first non-null child of mid (e.g. 0xC0), which has already been visited, causing duplicate emissions and/or an infinite loop.
If the parent is a SPARSE node: advanceToSparseTransition stops at the first sibling in the order word (which was already processed), repeating the subtree.

Maybe adjust Cursor.incomingTransition to keep the 9th bit (the 0x100 carry-over) and bypass direction XORing for this bit?

Added a new method specifically for skipTo implementations.
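The forward-direction half of the carry-over problem can be shown with plain bit arithmetic. The sketch below is not the PR's Cursor code: TRANSITION_SHIFT = 20 is inferred from the reviewer's "bits 20–27" remark, the reverse direction and DIRECTION_BIT are not modelled, and the body of incomingTransitionWithOverflow is a guess at what an overflow-aware decoder might look like (only the method name appears in the commit list).

```java
final class SkipPositionOverflowDemo
{
    // Assumed from the review comment: the transition byte occupies bits 20..27,
    // so a carry out of it lands in bit 28.
    static final int TRANSITION_SHIFT = 20;

    static long positionForSkippingBranch(long encodedBranchPosition)
    {
        return encodedBranchPosition + (1L << TRANSITION_SHIFT);
    }

    /** Decoder that keeps only 8 bits, so the carry in bit 28 is discarded. */
    static int incomingTransition(long encodedPosition)
    {
        return (int) (encodedPosition >>> TRANSITION_SHIFT) & 0xFF;
    }

    /** Hypothetical decoder that keeps a 9th bit, so 0x100 means "past the last child". */
    static int incomingTransitionWithOverflow(long encodedPosition)
    {
        return (int) (encodedPosition >>> TRANSITION_SHIFT) & 0x1FF;
    }

    public static void main(String[] args)
    {
        long lastChild = 0xFFL << TRANSITION_SHIFT;
        long skip = positionForSkippingBranch(lastChild);
        System.out.println(incomingTransition(skip));             // prints 0
        System.out.println(incomingTransitionWithOverflow(skip)); // prints 256
    }
}
```

For any transition below 0xFF the two decoders agree; they differ only when the skipped branch was the last possible child, which is the case where the cursor has to ascend rather than look for a sibling.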

lesnik2u · 2026-05-28T14:26:26Z

+        /// @inheritDoc
+        /// Range tries may have two content values. Handle this possibility here.
+        @Override
+        S processPrefix(int node, int depth, int transition)
+        {
+            S content1 = processPrefixEntry(node, depth, transition, PREFIX_CONTENT_OFFSET);
+            S content2 = processPrefixEntry(node, depth, transition, PREFIX_ALTERNATE_OFFSET);
+            assert (content1 == null) || (content2 == null) : "Prefix node with incompatible content pair " + content1 + " and " + content2;
+            // It's not okay to have two backtracks either, but this is not trivial to check.
+            return content1 == null ? content2 : content1;
+        }


If a prefix node contains two return-path boundaries (where shouldPresentOnTheReturnPath evaluates to true for both), processPrefix will push two backtracking entries at the same depth (depth - 1)?

Maybe some safety check here?

The "It's not okay to have two backtracks either, but this is not trivial to check." comment is specifically for this case. It's a bit of a pain to check it.

Do you insist on doing so?

I think a comment is enough

lesnik2u · 2026-05-28T14:29:36Z

+    private static RangesCursor tailCopyOf(RangesCursor copyFrom, Direction newDirection)
+    {
+        assert !Cursor.isOnReturnPath(copyFrom.currentPosition)
+            : "Cannot take tail of a position " + Cursor.toString(copyFrom.currentPosition) + " on the return path.";
+        boolean directionMatches = newDirection == copyFrom.direction();
+
+        // Calculate the span of boundaries that are still active for the tail, not including any matching return path
+        // (the latter has the same effect as the set being open-ended at this tail).
+        int startInclusive = copyFrom.currentIdx;
+        int endExclusive = startInclusive;
+        while (endExclusive < copyFrom.endIdx &&
+               Cursor.compare(copyFrom.nextPositions[endExclusive],
+                              copyFrom.currentPosition | ON_RETURN_PATH_BIT) < 0)
+             ++endExclusive;
+
+        // We can only drop an even number of boundaries on either size. Expand the indexes to make them even.
+        int arrayStart = startInclusive & ~1;
+        int arrayEnd = ((endExclusive + 1) & ~1);
+        // Note: if endExclusive == startInclusive, arrayEnd - arrayStart is 0 if branch is not included, 2 if included
+        // (i.e. startInclusive is odd).
+
+        final long depthDiff = Cursor.depthCorrectionValue(copyFrom.currentPosition);
+        ByteSource.Peekable[] sources = new ByteSource.Peekable[arrayEnd - arrayStart];
+        final long[] nextPositions = new long[arrayEnd - arrayStart];
+
+        int newStartIdx;
+
+        // Duplicate all selected boundaries, adjust depths and reverse the order if the direction doesn't match.
+        if (directionMatches)
+        {
+            for (int i = startInclusive; i < endExclusive; ++i)
+            {
+                sources[i - arrayStart] = copyFrom.duplicateSource(i);
+                nextPositions[i - arrayStart] = copyFrom.nextPositions[i] - depthDiff;
+            }
+            newStartIdx = startInclusive - arrayStart;
+        }
+        else
+        {
+            for (int i = startInclusive; i < endExclusive; ++i)
+            {
+                int destIndex = arrayEnd - 1 - i;
+                sources[destIndex] = copyFrom.duplicateSource(i);
+                nextPositions[destIndex] = (copyFrom.nextPositions[i] - depthDiff) ^ TRANSITION_MASK;
+                if (sources[destIndex].peek() == ByteSource.END_OF_STREAM)
+                    nextPositions[destIndex] ^= ON_RETURN_PATH_BIT;
+            }
+            newStartIdx = arrayEnd - endExclusive;
+        }
+
+        // Determine the state the root needs to present.
+        boolean startIsContained = (newStartIdx & 1) != 0;
+        RangeState rootState = startIsContained ? newDirection.select(RangeState.START, RangeState.END)
+                                                : RangeState.NOT_CONTAINED;
+        long rootPosition = Cursor.rootPosition(newDirection);
+
+        // Add an onReturnPath root position for open-ended sets.
+        int last = nextPositions.length - 1;
+        if (last > 0 && sources[last] == null)
+        {
+            sources[last] = ByteSource.EMPTY;
+            nextPositions[last] = rootPosition | ON_RETURN_PATH_BIT;
+        }
+
+        return new RangesCursor(copyFrom.byteComparableVersion,
+                                copyFrom.endsAfterMask,
+                                nextPositions,
+                                sources,
+                                newStartIdx,
+                                nextPositions.length,
+                                rootPosition,
+                                rootState);
+    }


We allocate sources and nextPositions each time, maybe we could reuse?

We have to duplicate sources so we can't use parent's array. nextPositions must also be distinct because the child cursor modifies them.

lesnik2u · 2026-05-28T14:30:43Z

+    /// Allocate a new position in the object array. Used by the memory allocation strategy to allocate a content spot
+    /// when it runs out of recycled positions.
+    private int allocateNewObject()
+    {
+        int index = reservedCount++;
+        int leadBit = getBufferIdx(index, CONTENTS_START_SHIFT, CONTENTS_START_SIZE);
+        AtomicReferenceArray<T> array = contentArrays[leadBit];
+        if (array == null)
+        {
+            assert inBufferOffset(index, leadBit, CONTENTS_START_SIZE) == 0 : "Error in content arrays configuration.";
+            contentArrays[leadBit] = new AtomicReferenceArray<>(CONTENTS_START_SIZE << leadBit);
+        }
+        return index;
+    }


Is it possible for us to go over 2^28 allocations? If so, we could throw TrieSpaceExhaustedException with an explicit check.

It's actually 2^29 (the largest array is 2^28) and we can't fill it, because the cell space is limited to 2GiB and it takes 4 bytes to store an int.

Doesn't hurt to have a check, though.

If we are using bytes + specials assigned by pojo (which we don't currently do), we have 2GiB / 32 possible special indexes and it is actually possible to exhaust them. Added check.

lesnik2u · 2026-05-28T14:31:34Z

+    @Override
+    public ColumnData clone(Cloner cloner)
+    {
+        return null;
+    }


Shouldn't we clone the wrapped column here? Why null?

Interesting omission. Fixed.

lesnik2u · 2026-05-28T14:54:37Z

+    @Override
+    public T content()
+    {
+        T content = data.content();
+        if (content == null)
+            return null;
+        if (deletions == null)
+            return content;
+
+        E applicableDeletions = atDeletions ? deletions.content() : null;
+        if (applicableDeletions == null)
+        {
+            applicableDeletions = deletions.precedingState();
+            if (applicableDeletions == null)
+                return content;
+        }
+
+        return resolver.apply(applicableDeletions, content);
+    }


In RangesApplyCursor.content() we have an exhaustion guard check. Shouldn't we do the same here?

Well, adding this guard would seriously complicate other code.

Instead went for officially allowing state and precedingState to be called in exhausted state, and also refactored some of the definition of TrieSetCursor and its verification to ensure it's in better agreement with RangeCursor.

Implements a row-level trie memtable that uses deletion-aware tries to store deletions separately from live data, together with the associated TrieBackedPartition and TriePartitionUpdate.

Refactors trie hierarchy to support multiple trie types:
- plain
- range, which stores range boundaries and is able to answer questions about the range that applies to every point in the trie
- deletion aware, which combines a data part and a deletion range trie

Every trie type supports suitable operations, including merging and intersection that make sense for the type of trie. In particular, deletion-aware tries apply range branches to delete data during merges.

Adds a new method to UnfilteredRowIterator that is implemented by the new trie-backed partitions to ask them to stop issuing tombstones. This is done on filtering (i.e. conversion from UnfilteredRowIterator to RowIterator) where tombstones have already done their job and are no longer needed.

Adds JMH tests of tombstones that demonstrate tombstone-independent performance on memtable queries.

# Conflicts:
# test/burn/org/apache/cassandra/index/sai/LongVectorTest.java

This functionality has two main applications:
- it allows reverse walks that present prefix content in the correct byte-comparable order (i.e. prefixes after children)
- it makes it possible to have full control over what is and isn't included in trie ranges (e.g. making it possible to have a branch set and nested ranges)

plpesvc-ds · 2026-06-23T14:41:48Z

❌ Build ds-cassandra-pr-gate/PR-2308 rejected by Butler

5 regressions found
See build details here

Found 5 new test failures

Test	Explanation	Runs	Upstream
o.a.c.distributed.test.hostreplacement.FailedBootstrapTest.roleSetupDoesNotProduceUnavailables	REGRESSION	🔵🔴	0 / 30
o.a.c.index.sai.cql.VectorCompaction100dTest.testZeroOrOneToManyCompaction[version=ec enableNVQ=false] ()	NEW	🔴⚪	0 / 30
o.a.c.index.sai.cql.VectorKeyRestrictedOnPartitionTest.partitionRestrictedWidePartitionBqCompressedTest[dc false] (compression)	REGRESSION	🔴⚪	0 / 30
o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[ed false] ()	NEW	🔴⚪	0 / 30
o.a.c.index.sai.disk.RowAwareSkinnyPrimaryKeyMapTest.testFloorSkinny[version=ca] (compression)	REGRESSION	🔵🔴	0 / 30

Found 6 known test failures

blambov force-pushed the CNDB-15669 branch from fbca7ea to 1ffd142 Compare April 7, 2026 12:09

lesnik2u self-requested a review April 8, 2026 13:47

blambov force-pushed the CNDB-15669 branch from bdb5b8c to 0f133d7 Compare April 9, 2026 11:19

michaelsembwever force-pushed the main-5.0 branch from 8f66239 to c00cfa8 Compare April 16, 2026 09:29

blambov force-pushed the CNDB-15669 branch 2 times, most recently from 23772ed to 5b58b4d Compare April 27, 2026 13:50

datastax deleted a comment from cassci-bot May 14, 2026

blambov force-pushed the CNDB-15669 branch 3 times, most recently from a3edf99 to 81e7231 Compare May 14, 2026 13:19

lesnik2u reviewed May 28, 2026

View reviewed changes

blambov force-pushed the CNDB-15669 branch 2 times, most recently from f22cd48 to bc643e7 Compare June 3, 2026 14:10

blambov force-pushed the CNDB-15669 branch 2 times, most recently from 85abd58 to dee7251 Compare June 16, 2026 12:33

blambov force-pushed the CNDB-15669 branch from 02b66fe to 1d85975 Compare June 19, 2026 12:50

blambov and others added 10 commits June 22, 2026 14:39

Remove duplicate Row/CellWithSourceTable

2b61cce

Change trie interfaces to combine depth and incoming character

cf9c79d

in a combined `encodedState` returned by advancing methods. This saves megamorphic calls to `incomingTransition` and can be augmented by further information at no cost.

Test fixes

c039486

Copy TrieBackedPartition, TriePartitionUpdate, TriePartritionUpdater …

658108a

…and TrieMemtable to Stage3 version Remove duplicate configuration object and add tests for stage 3

Implements cell-level trie

1656bc7

This change extends the coverage of the memtable trie to the cell level, defining mappings of trie branches to and from the legacy concepts of complex columns and rows.

Test fixes

40fe3e7

Permit in-memory tries to store bytes in the trie structure

aaced0b

This makes it possible to have completely off-heap trie memtable, where cell data is stored inside the trie structure if it is small enough to fit, or placed in natively-allocated memory and referenced by memory address.

Implement InMemoryRangeCursor.getNearestContent directly

7b6fb32

blambov added 27 commits June 22, 2026 14:42

Unused imports

ce3e75c

Fix score column injection

b216cc9

Fix double partition deletion error

7266549

Improve row and tombstone counting

7f5d5f3

Test fixes

3f7359c

Review fixes

37017bc

Add subselection support in TrieBackedRow.filter

456b847

Rework row filtering to avoid moving deletion boundaries Implement generalized range intersection and use it to perform row filtering Drop mapping merge

Test fixes

759c003

Improve data size tracking and fix tests

3d5ac94

Test fixes

c992a45

Test and fix DeletionAwareTrie.prefixedBySeparately

2e1ee27

Implement a direct getCellForKey for SAI

13ca044

Improve guarding of buffers

6ffec44

Add opOrder safety for some iterations over memtable data that aren't in a opOrder protected group. Make sure all TrieMemtable buffers are copied, including for on-heap tries that can be overwritten. Add facility to overwrite data and a simple test.

Fixups

4319d86

Fix concurrency issue in DeletionAwareTrie mutation

21c3b78

when deletionsAtFixedPoints is not true

Improve DeletionAwareTrie.tailTrie

e181f12

Review

657b2de

Precalculate minLocalDeletionTime

e0ac557

Review

fb9a800

Add some in-memory trie internals tests

4dc180d

Clear references in recycled content ids

7768bd7

Explain sparse child deletion

07d41a7

Add Cursor.incomingTransitionWithOverflow for skipTo implementations

4729678

taking into account overflow positions constructed by positionForSkippingBranch

Review

8acfd43

Allow state and precedingState to be called exhausted

dac29b3

Change TrieSetCursor to be fully compatible with RangeCursor and implement a separate nonNullState method.

Fix fullPartitionDelete

7ae4922

Fix ant check

a677f13

blambov force-pushed the CNDB-15669 branch from 1d85975 to a677f13 Compare June 22, 2026 11:42

Drop the incomingPositionWithOverflow method

6803619

Change incomingPosition to support overflows and improve its implementation.
