Skip to content

Add On-Disk Graph Index with Delete Tracking and Streaming N:1 Compaction #580

@dian-lun-lin

Description

@dian-lun-lin

Problem

OnDiskGraphIndex is immutable. Today, to delete nodes or perform compaction, clients must:

  1. Load each on-disk graph into an in-memory OnHeapGraphIndex.
  2. Apply deletions.
  3. Rewrite a new on-disk graph from the fully materialized heap index.

For large graphs (millions of nodes), this approach requires loading the full graph structure, which frequently leads to OOM and does not scale.

Proposal

Introduce an OnDiskGraphIndexCompactor that performs streaming N:1 compaction over multiple on-disk indexes without loading them into memory.

The compactor provides:

  1. Delete Tracking
    Track live/dead nodes using an in-memory FixedBitSet, allowing deletion without rewriting the index or constructing an OnHeapGraphIndex.

  2. Streaming N:1 Compaction that performs streaming compaction
    Write a new OnDiskGraphIndex via a streaming writer (no full materialization of the graph). Because the existing writer does not support streaming individual nodes, this proposal includes implementing a new CompactorWriter.

  3. Custom Ordinal Remapping
    Allow clients to pass a custom OridnalMapper that remaps ordinals from each source index to ordinals in the compacted output index.

Compactor Internal State

List<ImmutableGraphIndex> sources;                    // Source graph indexes to compact
Map<ImmutableGraphIndex, FixedBitSet> livenodes;      // Tracks which ordinals are live 
Map<ImmutableGraphIndex, OrdinalMapper> remappers;    // Maps old ordinals -> new ordinals

API Usage

// Create a compactor from multiple immutable graph indexes
OnDiskGraphIndexCompactor compactor =
    new OnDiskGraphIndexCompactor(List.of(source1, source2, source3));

// Mark which nodes are still live in each source
compactor.setLiveNodes(source1, liveNodes1); // FixedBitSet
compactor.setLiveNodes(source2, liveNodes2);

// Supply remapping functions for each source
compactor.setRemapper(source1, remapper1);
compactor.setRemapper(source2, remapper2);

// Execute streaming compaction
OnDiskGraphIndex result = compactor.compact(outputPath, numNeighbors);

Implementation of Compaction

  1. Traverse each node in each index
    Iterate through all ordinals across all sources and skip deleted ones.

  2. Perform search across all sources
    For each node, run a multi-source search to retrieve top-K candidates from all indexes.

  3. Apply pruning
    Apply standard HNSW pruning rules to maintain neighbor limits and graph quality.

  4. Apply ordinal remapping
    Convert both node ordinal and neighbor ordinals using the per-source remappers.

  5. Stream the node to I/O
    Use the new CompactorWriter to write each node sequentially to disk.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions