32 commits
1651188
mt add vector
dor-forer Dec 14, 2025
4b4807f
improve serialzor
dor-forer Dec 14, 2025
2bd5423
Changes
dor-forer Dec 15, 2025
1720d54
batch process
dor-forer Dec 15, 2025
5e1ddc7
Remove rocksdb lock
dor-forer Dec 15, 2025
9603aeb
entry point
dor-forer Dec 16, 2025
e5747db
entrypoint declare
dor-forer Dec 16, 2025
fa6e5f3
really add it
dor-forer Dec 16, 2025
e4ff62f
decalre for real
dor-forer Dec 16, 2025
ca07b4a
fix
dor-forer Dec 17, 2025
1f42ade
Remove entry point from job
dor-forer Dec 17, 2025
8589334
Merge branch 'disk-poc' of https://github.com/RedisAI/VectorSimilarit…
dor-forer Dec 17, 2025
013b258
entry point
dor-forer Dec 17, 2025
1cfd8af
fix the bad recall
dor-forer Dec 18, 2025
99f903a
wait in the serialzer
dor-forer Dec 18, 2025
37299d5
batchless
dor-forer Dec 21, 2025
ed5b84f
fix unit_test
dor-forer Dec 21, 2025
42bfed1
diskWriteBatchThreshold in serialize
dor-forer Dec 21, 2025
4b274a8
remove the sparse
dor-forer Dec 21, 2025
256c7c7
Remove unused functions
dor-forer Dec 22, 2025
a94f222
remove more old functions
dor-forer Dec 22, 2025
119ef4c
Critical Race Condition
dor-forer Dec 22, 2025
aef11f3
old bm
dor-forer Dec 22, 2025
9b1838b
Merge branch 'disk-poc' of https://github.com/RedisAI/VectorSimilarit…
dor-forer Dec 22, 2025
c8e7f12
Remove old
dor-forer Dec 22, 2025
a1a1228
Segment-local flushing
dor-forer Dec 22, 2025
4ccb195
guard enrty point
dor-forer Dec 22, 2025
f0de127
pr
dor-forer Dec 23, 2025
b842d1e
char padding[64]; remove
dor-forer Dec 23, 2025
17c2f99
Mt explainations
dor-forer Dec 23, 2025
3996df5
fix the test to use setDiskWriteBatchThreshold
dor-forer Dec 23, 2025
a55d543
print every 5 seconds
dor-forer Dec 24, 2025
240 changes: 240 additions & 0 deletions docs/DISK_HNSW_MT_ARCHITECTURE.md
@@ -0,0 +1,240 @@
# HNSWDisk Multi-Threading Architecture

## Overview

This document describes the architectural changes introduced in the `dorer-disk-poc-add-delete-mt` branch compared to the original `disk-poc` branch. The focus is on multi-threading, synchronization, concurrent disk writes, and performance enhancements.

## Key Architectural Changes

### 1. Insertion Mode

**Previous single-threaded approach:** Vectors were accumulated in batches before being written to disk, requiring complex coordination between threads.


**Current approach:** Each insert job is self-contained and can write directly to disk upon completion, optimized for workloads where disk writes are cheap but neighbor searching (reads from disk) is the bottleneck.

```
┌──────────────────────────────────────────────────────────────┐
│ HNSWDiskSingleInsertJob │
├──────────────────────────────────────────────────────────────┤
│ - vectorId │
│ - elementMaxLevel │
│ - rawVectorData (copied into job - no external references) │
│ - processedVectorData (quantized vector for distance calc) │
└──────────────────────────────────────────────────────────────┘
```

### 2. Segmented Neighbor Cache

To reduce lock contention in multi-threaded scenarios, the neighbor changes cache is partitioned into **64 independent segments**:

```cpp
static constexpr size_t NUM_CACHE_SEGMENTS = 64; // Power of 2 for efficient modulo

struct alignas(64) CacheSegment {
    std::shared_mutex guard;                                 // Per-segment lock
    std::unordered_map<uint64_t, std::vector<idType>> cache; // Neighbor lists
    std::unordered_set<uint64_t> dirty;                      // Nodes needing disk write
    std::unordered_set<uint64_t> newNodes;                   // Nodes never written to disk
};
```
Note:
`NUM_CACHE_SEGMENTS` can be increased for finer partitioning of the cache (less lock contention),
at the cost of higher RAM usage.

**Key benefits:**
- Threads accessing different segments proceed in parallel.
- Cache-line alignment (`alignas(64)`) prevents false sharing.
- Hash-based segment distribution spreads nodes evenly across segments.

### 3. Lock Hierarchy

| Lock | Type | Protects | Notes |
|------|------|----------|------|
| `indexDataGuard` | `shared_mutex` | `entrypointNode`, `maxLevel`, `idToMetaData`, `labelToIdMap` | Metadata access during graph construction |
| `vectorsGuard` | `shared_mutex` | Vectors container (prevents resize during access) | Vectors access during graph construction |
| `rawVectorsGuard` | `shared_mutex` | `rawVectorsInRAM` map | Raw vectors access during graph construction |
| `stagedUpdatesGuard` | `shared_mutex` | Staged graph updates for deletions | |
| `diskWriteGuard` | `mutex` | Serializes global flush operations | |
| `cacheSegments_[i].guard` | `shared_mutex` | Per-segment cache, dirty set, newNodes | |

### 4. Atomic Variables

```cpp
std::atomic<size_t> curElementCount; // Thread-safe element counting
std::atomic<size_t> totalDirtyCount_{0}; // Fast threshold check without locking
std::atomic<size_t> pendingSingleInsertJobs_{0}; // Track pending async jobs
```
Note:
Additional pieces of shared state could be converted to atomics to further reduce locking overhead.

## Concurrency Patterns

### Swap-and-Flush Pattern (Critical for Preventing Lost Updates)

The `writeDirtyNodesToDisk` and `flushDirtyNodesToDisk` methods implement a three-phase approach:

```
Phase 1: Read cache data under SHARED lock (concurrent writers allowed)
- Read neighbor lists
- Build RocksDB WriteBatch

Phase 2: Clear dirty flags under EXCLUSIVE lock (brief, per-segment)
- Atomically swap dirty set contents
- Release lock immediately

Phase 3: Write to disk (NO locks held)
- Any new inserts after Phase 2 will re-add to dirty set
- On failure: re-add nodes to dirty set for retry
```

This prevents the "Lost Update" race condition where a concurrent insert's changes could be lost.

### Atomic Check-and-Add for Neighbor Lists

```cpp
bool tryAddNeighborToCacheIfCapacity(nodeId, level, newNeighborId, maxCapacity) {
    // Under exclusive lock:
    // 1. Check if already present (avoid duplicates)
    // 2. Check capacity
    // 3. If space available: add neighbor, mark dirty
    // 4. If full: return false (caller must use heuristic)
}
```

This atomic operation prevents race conditions when multiple threads add neighbors to the same node.

### Read-Copy-Update Pattern for Cache Access

```cpp
// Read path (getNeighborsFromCache):
1. Try shared lock → read from cache
2. If miss: read from disk (no lock held during I/O) - rocksdb is thread-safe
3. Acquire exclusive lock → insert to cache (double-check)

// Write path (addNeighborToCache):
1. Acquire exclusive lock
2. Load from disk if needed (release/re-acquire around I/O) - rocksdb is thread-safe
3. Add neighbor, mark dirty
```

## Disk Write Batching

Controlled by `diskWriteBatchThreshold`:

| Value | Behavior |
|-------|----------|
| `0` | Flush after every insert (no batching) |
| `>0` | Accumulate until `totalDirtyCount_ >= threshold`, then flush |

```cpp
if (diskWriteBatchThreshold == 0) {
    writeDirtyNodesToDisk(modifiedNodes, rawVectorData, vectorId);
} else if (totalDirtyCount_ >= diskWriteBatchThreshold) {
    flushDirtyNodesToDisk(); // Flush ALL dirty nodes
}
```

## Job Queue Integration

### Async Processing Flow

```
addVector()
├── Single-threaded path (no job queue):
│ └── executeGraphInsertionCore() [inline]
└── Multi-threaded path (with job queue):
├── Create HNSWDiskSingleInsertJob (copies vector data)
├── Submit via SubmitJobsToQueue callback
└── Worker thread executes:
└── executeSingleInsertJob()
└── executeGraphInsertionCore()
```

### Job Structure

```cpp
struct HNSWDiskSingleInsertJob : public AsyncJob {
    idType vectorId;
    size_t elementMaxLevel;
    std::string rawVectorData;       // Copied - no external references
    std::string processedVectorData; // Quantized data for distance calc
};
```

Copying vector data into the job eliminates race conditions with the caller's buffer.

## Data Flow During Insert

```
1. addVector()
├── Atomically allocate vectorId (curElementCount.fetch_add)
├── Store raw vector in rawVectorsInRAM (for other jobs to access)
├── Preprocess vector (quantization)
├── Store processed vector in vectors container
└── Store metadata (label, topLevel)

2. executeGraphInsertionCore()
├── insertElementToGraph()
│ ├── greedySearchLevel() [levels > element_max_level]
│ └── For each level:
│ ├── searchLayer() → find neighbors
│ └── mutuallyConnectNewElement()
│ ├── setNeighborsInCache() [new node's neighbors]
│ └── For each neighbor:
│ ├── tryAddNeighborToCacheIfCapacity()
│ └── or revisitNeighborConnections() [if full]
└── Handle disk write (based on threshold)

3. Disk Write (when triggered)
├── writeDirtyNodesToDisk() [per-insert path]
└── or flushDirtyNodesToDisk() [batch flush path]
```

## Performance Optimizations

### 1. Lock-Free Hot Paths

```cpp
// Fast deleted check without acquiring indexDataGuard
template <Flags FLAG>
bool isMarkedAsUnsafe(idType internalId) const {
    return __atomic_load_n(&idToMetaData[internalId].flags, 0) & FLAG;
}
```

Used in `processCandidate()` to avoid lock contention during search.

### 2. Atomic Counters for Fast Threshold Checks

```cpp
// No locking needed to check if flush is required
if (totalDirtyCount_.load(std::memory_order_relaxed) >= diskWriteBatchThreshold) {
    flushDirtyNodesToDisk();
}
```

### 3. newNodes Tracking

Nodes created in the current batch are tracked in `cacheSegment.newNodes`:
- Avoids disk lookups for vectors that haven't been written yet
- Cleared after successful flush to disk

### 4. Raw Vectors in RAM

Raw vectors are kept in `rawVectorsInRAM` until flushed to disk:
- Allows concurrent jobs to access vectors before disk write
- Eliminates redundant disk reads during graph construction
- Protected by `rawVectorsGuard` (shared_mutex)

## Thread Safety Summary

| Operation | Thread Safety | Notes |
|-----------|---------------|-------|
| `addVector()` | ✅ Safe | Atomic ID allocation, locked metadata access |
| `topKQuery()` | ✅ Safe | Read-only with lock-free deleted checks |
| Cache read | ✅ Safe | Shared lock per segment |
| Cache write | ✅ Safe | Exclusive lock per segment |
| Disk flush | ✅ Safe | Swap-and-flush pattern |