Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425
Merged
nicoloboschi merged 5 commits intoJun 30, 2026
Conversation
The consolidator creates and updates observations without populating the search_vector tsvector column. Under the native text search extension, this means observations are invisible to BM25 full-text retrieval - the BM25 arm returns 0 candidates regardless of query content. Four code paths write observation text to memory_units: 1. _dedup_reconcile_create (merge into existing twin) 2. _dedup_reconcile_update (drift-merge into different twin) 3. _execute_update_action (LLM rewrite of existing observation) 4. _create_observation_directly (new observation INSERT) None populated search_vector. This patch adds conditional tsvector generation gated on config.text_search_extension == 'native', matching the existing pattern in ops_postgresql.insert_facts_batch. Non-native backends (pg_textsearch, pgroonga, pg_search) continue to leave search_vector NULL as they index base text columns directly. The INSERT path (Site 4) splits the existing else branch into an explicit elif/else to avoid applying native tsvector logic to backends that don't use it. Fixes: observations invisible to BM25 retrieval arm.
Add test for observation creation with native search vector
Contributor
Author
|
Deployment note: Existing observations created before this fix still have UPDATE memory_units
SET search_vector = to_tsvector('english'::regconfig, COALESCE(text, ''))
WHERE fact_type = 'observation' AND search_vector IS NULL;Suggest including this in the release notes. Note: the For external PG instances this is a manual step; for the default baked-in pg0 deployment it could potentially be an Alembic migration, but that's a separate decision for maintainers. |
…ations The writer fix only populates search_vector for observations created or updated after deploy. Observations already written under the native backend keep a NULL search_vector and stay invisible to BM25 until re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the native tsvector column type and scoped to fact_type='observation' with a NULL search_vector (idempotent). Matches the writer's text-only tsvector and the configured native language.
post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The consolidator creates and updates observations in
memory_unitswithout populating thesearch_vectorcolumn. Under thenativetext search extension backend, this makes all observations invisible to BM25 full-text retrieval.This is a known gap documented in the source code itself:
This PR fills that gap.
The raw fact batch insert path (
ops_postgresql.insert_facts_batch) correctly populatessearch_vectorwithto_tsvector(). The consolidation pipeline - which produces the observation layer that recall surfaces - does not. The BM25 retrieval arm consistently returns 0 candidates for observations:After the fix:
Root cause
Four code paths in
consolidator.pywrite observation text without populatingsearch_vector:_dedup_reconcile_create_dedup_reconcile_update_execute_update_action_create_observation_directlyFix
Add conditional
search_vectorpopulation to all four sites, gated onconfig.text_search_extension == "native". This matches the existing pattern inops_postgresql.insert_facts_batch.For UPDATE paths (Sites 1-3), a
search_vector_clausestring is conditionally appended to the SET clause:For the INSERT path (Site 4), the existing
elsebranch (which bundled native + pg_textsearch + pgroonga + pg_search) is split intoelif "native"(with tsvector) andelse(without), ensuring non-native backends continue to leavesearch_vectorNULL as they index base text columns directly.Backend gating rationale
The fix is restricted to
text_search_extension == "native"because:tokenize()- already handles search_vectorto_tsvector()populationtextcolumn directly via their own extension mechanisms -search_vectoris a dummy column that should remain NULLtsvector content: observation text only
The tsvector is built from
COALESCE(text, '')- the observation's own text content only. Entity names, source fact text, and temporal metadata are intentionally excluded because:BM25's role is keyword matching on the observation's surface text - the one signal the other three arms don't cover. Enriching the tsvector with entity names or source context would create "contextual leakage" where non-central details pull general observations into narrow keyword matches.
Regression risk
None. The fix only fires when
text_search_extension == "native"- all other backends are unaffected. The tsvector is built from the observation's own text content using the configured language dictionary (config.text_search_extension_native_language), matching the existing pattern ininsert_facts_batch. No new parameters are added to any SQL query.Testing
Added
test_create_observation_populates_search_vector_nativeintests/test_consolidation.py. Asserts that observations created via consolidation havesearch_vector IS NOT NULLunder the native text search backend (Site 4 - the INSERT path, highest frequency).The UPDATE path (Site 3) is verified to work via live deployment testing but not unit-tested - triggering an UPDATE deterministically requires a real LLM making consolidation decisions (the mock LLM always creates fresh observations). Sites 1-2 (dedup reconcile) are similarly impractical to trigger without controlling embedding similarity thresholds.
Additionally verified on a live deployment with ~139 observations:
search_vector IS NULL, BM25 returns 0