Add vector embedding support and hybrid search to ToolStore#3808
Add vector embedding support and hybrid search to ToolStore#3808aponcedeleonch wants to merge 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
c06effa to
5d93afe
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3808 +/- ##
==========================================
+ Coverage 66.84% 66.91% +0.07%
==========================================
Files 439 441 +2
Lines 43509 43660 +151
==========================================
+ Hits 29083 29216 +133
- Misses 12175 12182 +7
- Partials 2251 2262 +11 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
631f1d6 to
85a4c84
Compare
jerm-dro
left a comment
There was a problem hiding this comment.
The diff is quite large due to the conflicts with the base branch. The structure looks good, but I will take a closer look when the base branch is merged.
5d93afe to
0fa9e53
Compare
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
0fa9e53 to
d17bbb0
Compare
d17bbb0 to
53657a3
Compare
53657a3 to
39ad0e2
Compare
Extend the SQLite-backed ToolStore with optional embedding-based semantic search. When an EmbeddingClient is provided, Search runs FTS5 and cosine similarity in parallel via errgroup, merges and deduplicates results by keeping the lower distance score, and caps output at 4 results. Key additions: - EmbeddingClient interface in internal/types (Embed, EmbedBatch, Dimension, Close) - FakeEmbeddingClient using SHA-256 seeded RNG with L2-normalized vectors - Pure Go cosine similarity/distance in internal/similarity package - Binary encode/decode helpers for embedding BLOB storage - normalizeBM25 rescaled to [0, 2) range to align with cosine distance - Comprehensive unit tests, concurrency tests, and benchmarks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
39ad0e2 to
09908e0
Compare
| AND t.name IN (SELECT value FROM json_each(?)) | ||
| ORDER BY rank` | ||
| ORDER BY rank | ||
| LIMIT ?` |
There was a problem hiding this comment.
Why limit? What happens if you have a hybrid search and a row's rank is low and below the limit, but the cosine similarity is high?
| queryStr := `SELECT name, description, embedding | ||
| FROM llm_capabilities | ||
| WHERE embedding IS NOT NULL | ||
| AND name IN (SELECT value FROM json_each(?))` | ||
|
|
||
| rows, err := s.db.QueryContext(ctx, queryStr, string(allowedJSON)) |
There was a problem hiding this comment.
Blocker, question: Do we really need to query the DB twice?
Why can't we:
- Load the allowed tools, their rank, and embeddings.
- Compute the cosine similarity for each returned row
- Compute the merged score
- Return the top K results based on score.
| // Dimension returns the dimensionality of the embeddings produced. | ||
| Dimension() int |
There was a problem hiding this comment.
suggestion: delete - you can just gather this from the returned embedding?
| // EmbeddingClient generates vector embeddings from text. | ||
| // It is defined in the internal/types package and aliased here so that | ||
| // external consumers continue to use optimizer.EmbeddingClient. | ||
| type EmbeddingClient = types.EmbeddingClient |
There was a problem hiding this comment.
Question, blocking: why does this need to be re-exported?
There was a problem hiding this comment.
Suggestion: You may be able to avoid re-exporting things by making the entrypoint to this package, NewSQLiteToolStore, accept the config object. When that function is called, it can freely reference things in the internal packages to construct objects.
| // SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc. | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| package optimizer |
There was a problem hiding this comment.
suggestion: move this embedding related code to the similarity package it seems very related, and the actual CosineSimilarity function is an implementation detail.
Closes: #3732
Extend the SQLite-backed ToolStore with optional embedding-based semantic search. When an EmbeddingClient is provided, Search runs FTS5 and cosine similarity in parallel via errgroup, merges and deduplicates results by keeping the lower distance score, and caps output at 4 results.
Key additions:
Large PR Justification