@dor-forer
Collaborator

Describe the changes in the pull request

This PR introduces multi-threaded vector insertion support and batchless mode for the HNSW Disk index. The main changes include:

  1. Multi-threaded Insert Jobs: Added HNSWDiskInsertJob and HNSWDiskSingleInsertJob structures to support parallel vector insertions via a job queue system. Jobs are self-contained and hold copies of vector data to avoid race conditions.
  2. Batchless Mode with Segmented Cache: Replaced the batch-based insertion approach with a batchless mode. Introduced a segmented neighbor cache (NUM_CACHE_SEGMENTS = 64) with per-segment locks to reduce lock contention in multi-threaded scenarios. Each segment maintains its own cache, dirty set, and new nodes tracking.
  3. Thread Safety Improvements:
  • Changed curElementCount to std::atomic<size_t> for lock-free reads
  • Added multiple shared mutexes (stagedUpdatesGuard, vectorsGuard, rawVectorsGuard) for better read concurrency
  • Added lock-free versions of common operations (isMarkedAsUnsafe) for hot paths
  • Fixed critical race conditions in neighbor filtering and metadata access
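
The segmented cache described in item 2 can be pictured roughly as follows. This is a minimal sketch of the general technique (lock striping), not the code in hnsw_disk.h: only NUM_CACHE_SEGMENTS = 64 and the "cache, dirty set, new nodes" trio come from the description above. The names SegmentedNeighborCache, CacheSegment, setNeighbors, getNeighbors, dirtyCount, the key layout, and the modulo segment selection are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using idType = std::uint32_t;                  // hypothetical id type
constexpr std::size_t NUM_CACHE_SEGMENTS = 64; // value taken from the PR description

// One independent slice of the cache, protected by its own lock.
struct CacheSegment {
    mutable std::shared_mutex guard;
    std::unordered_map<std::uint64_t, std::vector<idType>> cache; // key -> neighbor list
    std::unordered_set<std::uint64_t> dirty;    // keys modified since the last flush
    std::unordered_set<std::uint64_t> newNodes; // keys never written to disk
};

// Hypothetical wrapper: writers on different segments do not block each other.
class SegmentedNeighborCache {
public:
    void setNeighbors(idType id, std::size_t level, std::vector<idType> neighbors,
                      bool isNew = false) {
        const std::uint64_t key = makeKey(id, level);
        CacheSegment &seg = segments_[key % NUM_CACHE_SEGMENTS];
        std::unique_lock<std::shared_mutex> lock(seg.guard); // exclusive, this segment only
        seg.cache[key] = std::move(neighbors);
        seg.dirty.insert(key);
        if (isNew) seg.newNodes.insert(key);
    }

    // Returns false on a cache miss; a real index would then read from disk.
    bool getNeighbors(idType id, std::size_t level, std::vector<idType> &out) const {
        const std::uint64_t key = makeKey(id, level);
        const CacheSegment &seg = segments_[key % NUM_CACHE_SEGMENTS];
        std::shared_lock<std::shared_mutex> lock(seg.guard); // readers share the lock
        auto it = seg.cache.find(key);
        if (it == seg.cache.end()) return false;
        out = it->second;
        return true;
    }

    std::size_t dirtyCount() const {
        std::size_t total = 0;
        for (const CacheSegment &seg : segments_) {
            std::shared_lock<std::shared_mutex> lock(seg.guard);
            total += seg.dirty.size();
        }
        return total;
    }

private:
    // Assumed key layout: level in the high 32 bits, node id in the low 32 bits.
    static std::uint64_t makeKey(idType id, std::size_t level) {
        return (static_cast<std::uint64_t>(level) << 32) | id;
    }
    std::array<CacheSegment, NUM_CACHE_SEGMENTS> segments_;
};
```

With this layout, two threads updating nodes that map to different segments take different locks, which is the contention reduction the description refers to. A flush would walk the segments one at a time and drain each dirty set under that segment's lock.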

Which issues this PR fixes

  1. #...
  2. MOD...

Main objects this PR modified

  1. src/VecSim/algorithms/hnsw/hnsw_disk.h - Core HNSW Disk index with MT support and segmented cache
  2. src/VecSim/algorithms/hnsw/hnsw_disk_serializer.h - Updated serialization for atomic fields and removed legacy batch state
  3. src/VecSim/spaces/computer/preprocessors.h - Added 4-bit scalar quantization preprocessor
  4. tests/unit/test_hnsw_disk.cpp - Updated tests for batchless mode

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

@dor-forer changed the title from "Dorer disk poc add delete mt" to "Disk poc add multi-threaded" on Dec 22, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces multi-threaded vector insertion support and transitions the HNSW Disk index from batch-based to batchless mode. The main objectives are to enable parallel vector insertions via a job queue system and reduce lock contention through a segmented neighbor cache architecture.

Key changes:

  • Added HNSWDiskInsertJob and HNSWDiskSingleInsertJob structures for parallel vector insertions with self-contained vector data to avoid race conditions
  • Replaced batch-based insertion with batchless mode using a 64-segment neighbor cache with per-segment locks for reduced contention
  • Changed curElementCount to std::atomic<size_t> and added multiple shared mutexes (stagedUpdatesGuard, vectorsGuard, rawVectorsGuard) for improved concurrency
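
To make the first and third bullets concrete, here is a small sketch of the two ideas: a job that owns a copy of its vector data, and an atomic element counter that worker threads bump without taking a lock. It is not the PR's HNSWDiskSingleInsertJob; InsertJobSketch, CounterSketch, reserveId, and runJobs are hypothetical names, and the graph insertion itself is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

using labelType = std::size_t; // hypothetical alias

// A self-contained job: it copies the caller's bytes, so the caller may
// reuse or free its buffer while the job is still queued.
struct InsertJobSketch {
    labelType label;
    std::vector<char> blob; // owned copy of the vector data
    InsertJobSketch(labelType l, const void *data, std::size_t bytes)
        : label(l),
          blob(static_cast<const char *>(data), static_cast<const char *>(data) + bytes) {}
};

// Stand-in for the index: only the atomic counter is modelled here.
class CounterSketch {
public:
    // Atomically hands out the next internal id.
    std::size_t reserveId() { return curElementCount.fetch_add(1); }
    // Lock-free read of the current element count.
    std::size_t size() const { return curElementCount.load(); }

private:
    std::atomic<std::size_t> curElementCount{0};
};

// Drains the job list on numThreads workers and returns the final count.
inline std::size_t runJobs(CounterSketch &index, const std::vector<InsertJobSketch> &jobs,
                           unsigned numThreads) {
    std::atomic<std::size_t> next{0}; // shared cursor into the job list
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&index, &jobs, &next] {
            for (std::size_t i = next.fetch_add(1); i < jobs.size(); i = next.fetch_add(1)) {
                (void)jobs[i];     // a real job would insert jobs[i].blob into the graph
                index.reserveId(); // each job claims exactly one id
            }
        });
    }
    for (std::thread &w : workers) w.join();
    return index.size();
}
```

Because each job holds its own bytes, overwriting the source buffer after enqueueing does not affect queued jobs, and because ids come from a single fetch_add, no two jobs can claim the same slot.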

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 17 comments.

| File | Description |
| --- | --- |
| tests/utils/mock_thread_pool.h | Added isIdle() helper to check if all jobs are complete |
| tests/unit/test_quantized_hnsw_disk.cpp | Removed flushBatch() calls, updated comments for batchless mode |
| tests/unit/test_hnsw_disk.cpp | Updated tests to reflect new indexSize() behavior (active elements only) and removed batch flushing |
| tests/benchmark/run_files/bm_hnsw_disk_single_fp32.cpp | Changed index file path to use .zip extension |
| tests/benchmark/data/scripts/hnsw_disk_serializer.cpp | Added multi-threading support with new parameters and progress reporting |
| tests/benchmark/data/scripts/CMakeLists.txt | Included mock_thread_pool source files and headers for serializer |
| tests/benchmark/bm_vecsim_index.h | Added comment clarifying job queue is not set by default |
| tests/benchmark/bm_initialization/bm_hnsw_disk_initialize_fp32.h | Added new async AddLabel benchmark with multi-threaded configuration |
| src/VecSim/vec_sim_common.h | Added three new job types for disk insert operations |
| src/VecSim/algorithms/hnsw/hnsw_disk_serializer.h | Updated serialization to handle atomic curElementCount and removed legacy batch state |
| src/VecSim/algorithms/hnsw/hnsw_disk.h | Core implementation: segmented cache, lock-free operations, batchless insertion, and MT job execution |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 26 comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.



@dor-forer changed the title from "Disk poc add multi-threaded" to "disk-poc add multi-threaded" on Dec 23, 2025
2 participants