
@ashkrisk ashkrisk commented Dec 1, 2025

MultiFileDataSource uses SiftLoader.readFvecs to read the base and query vectors, and SiftLoader.readIvecs to read the provided ground truth. readIvecs is currently quite inefficient because it does not buffer its input, which disproportionately increases the time taken to load the Dataset.

This is not so important for Bench, where the time taken to load the dataset is insignificant compared to the time taken to build the index. However, this becomes quite important when running short-lived programs with pre-created graphs, especially during rapid prototyping.

This PR addresses this by adding a BufferedInputStream, similar to the current implementation of readFvecs.
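For readers unfamiliar with the pattern, here is a minimal sketch of the buffered approach. It is not the actual diff: the class name `IvecsSketch`, the method signature, and the `List<int[]>` return type are hypothetical and chosen only for illustration. It assumes the usual ivecs layout, where each record is a little-endian int32 count followed by that many little-endian int32 values.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the SiftLoader code from this PR.
public class IvecsSketch {
    static List<int[]> readIvecs(Path path) throws IOException {
        List<int[]> rows = new ArrayList<>();
        // BufferedInputStream serves the many 4-byte reads from an in-memory
        // buffer instead of going to the underlying file stream each time.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(path)))) {
            while (true) {
                int dim;
                try {
                    // DataInputStream reads big-endian, so swap the bytes
                    // to interpret the little-endian record header.
                    dim = Integer.reverseBytes(in.readInt());
                } catch (EOFException e) {
                    break; // end of file reached while reading a header
                }
                int[] row = new int[dim];
                for (int i = 0; i < dim; i++) {
                    row[i] = Integer.reverseBytes(in.readInt());
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

The only part relevant to this change is the `BufferedInputStream` wrapper; the parsing loop is otherwise the same as an unbuffered reader would use.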

Some numbers from my machine, based on a dataset with ~2M base vectors and ~50K query vectors, illustrate the difference:

| File | Size | Contents | Time |
|---|---|---|---|
| base.fvecs | 964M | ~2M 128D fvecs | 2.6s |
| query.fvecs | 25M | ~50K 128D fvecs | 0.12s |
| gt.ivecs | 58M | ~50K 300D ivecs | 10.3s (unbuffered), 0.34s (buffered) |

Without buffering, reading the ground truth takes ~4x longer than reading the actual base vectors. With buffering, the ground truth is no longer the bottleneck.

Collaborator

@marianotepper marianotepper left a comment


LGTM
