Skip to content

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425

Merged
nicoloboschi merged 5 commits into
vectorize-io:mainfrom
qxxaa:fix(consolidation)-observation-search-vector
Jun 30, 2026
Merged

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425
nicoloboschi merged 5 commits into
vectorize-io:mainfrom
qxxaa:fix(consolidation)-observation-search-vector

Conversation

@qxxaa

@qxxaa qxxaa commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Problem

The consolidator creates and updates observations in memory_units without populating the search_vector column. Under the native text search extension backend, this makes all observations invisible to BM25 full-text retrieval.

This is a known gap documented in the source code itself:

# Native: the migration p4q5r6s7t8u9 dropped the GENERATED expression on
# search_vector to allow per-deployment language configuration; the
# batch insert path in ops_postgresql.insert_facts_batch now populates
# it via to_tsvector($lang, ...). This single-observation INSERT does
# not, so observations under the native backend currently land with
# NULL search_vector and are not BM25-searchable until reflected/
# re-ingested. Tracking a separate fix for that gap.

This PR fills that gap.

The raw fact batch insert path (ops_postgresql.insert_facts_batch) correctly populates search_vector with to_tsvector(). The consolidation pipeline - which produces the observation layer that recall surfaces - does not. The BM25 retrieval arm consistently returns 0 candidates for observations:

Parallel retrieval: semantic=12, bm25=0, graph=95, temporal=1

After the fix:

Parallel retrieval: semantic=12, bm25=3, graph=95, temporal=1

Root cause

Four code paths in consolidator.py write observation text without populating search_vector:

Function Operation Trigger
_dedup_reconcile_create UPDATE New observation merges into near-twin
_dedup_reconcile_update UPDATE Updated observation drifts into different twin
_execute_update_action UPDATE LLM rewrites existing observation
_create_observation_directly INSERT New observation created

Fix

Add conditional search_vector population to all four sites, gated on config.text_search_extension == "native". This matches the existing pattern in ops_postgresql.insert_facts_batch.

For UPDATE paths (Sites 1-3), a search_vector_clause string is conditionally appended to the SET clause:

search_vector_clause = (
    f",\n            search_vector = to_tsvector('{config.text_search_extension_native_language}'::regconfig, COALESCE($1, ''))"
    if config.text_search_extension == "native"
    else ""
)

For the INSERT path (Site 4), the existing else branch (which bundled native + pg_textsearch + pgroonga + pg_search) is split into elif "native" (with tsvector) and else (without), ensuring non-native backends continue to leave search_vector NULL as they index base text columns directly.

Backend gating rationale

The fix is restricted to text_search_extension == "native" because:

  • vchord: Has its own INSERT branch with tokenize() - already handles search_vector
  • native: Uses PostgreSQL's built-in tsvector/GIN for BM25 - needs explicit to_tsvector() population
  • pg_textsearch / pgroonga / pg_search: Index the raw text column directly via their own extension mechanisms - search_vector is a dummy column that should remain NULL

tsvector content: observation text only

The tsvector is built from COALESCE(text, '') - the observation's own text content only. Entity names, source fact text, and temporal metadata are intentionally excluded because:

  • Semantic retrieval already handles meaning-based discovery (embeddings)
  • Graph retrieval already handles entity-based discovery (entity links)
  • Temporal retrieval already handles time-based discovery (occurred_start/end, mentioned_at)

BM25's role is keyword matching on the observation's surface text - the one signal the other three arms don't cover. Enriching the tsvector with entity names or source context would create "contextual leakage" where non-central details pull general observations into narrow keyword matches.

Regression risk

None. The fix only fires when text_search_extension == "native" - all other backends are unaffected. The tsvector is built from the observation's own text content using the configured language dictionary (config.text_search_extension_native_language), matching the existing pattern in insert_facts_batch. No new parameters are added to any SQL query.

Testing

Added test_create_observation_populates_search_vector_native in tests/test_consolidation.py. Asserts that observations created via consolidation have search_vector IS NOT NULL under the native text search backend (Site 4 - the INSERT path, highest frequency).

The UPDATE path (Site 3) is verified to work via live deployment testing but not unit-tested - triggering an UPDATE deterministically requires a real LLM making consolidation decisions (the mock LLM always creates fresh observations). Sites 1-2 (dedup reconcile) are similarly impractical to trigger without controlling embedding similarity thresholds.

Additionally verified on a live deployment with ~139 observations:

  • Pre-patch: all observations have search_vector IS NULL, BM25 returns 0
  • Post-patch: new observations created by consolidation have populated tsvectors
  • BM25 arm returns relevant candidates (3+ observations matching keyword queries)
  • Semantic, graph, and temporal arms unaffected

qxxaa added 2 commits June 26, 2026 16:24
The consolidator creates and updates observations without populating the
search_vector tsvector column. Under the native text search extension,
this means observations are invisible to BM25 full-text retrieval - the
BM25 arm returns 0 candidates regardless of query content.

Four code paths write observation text to memory_units:
1. _dedup_reconcile_create (merge into existing twin)
2. _dedup_reconcile_update (drift-merge into different twin)
3. _execute_update_action (LLM rewrite of existing observation)
4. _create_observation_directly (new observation INSERT)

None populated search_vector. This patch adds conditional tsvector
generation gated on config.text_search_extension == 'native', matching
the existing pattern in ops_postgresql.insert_facts_batch. Non-native
backends (pg_textsearch, pgroonga, pg_search) continue to leave
search_vector NULL as they index base text columns directly.

The INSERT path (Site 4) splits the existing else branch into an
explicit elif/else to avoid applying native tsvector logic to backends
that don't use it.

Fixes: observations invisible to BM25 retrieval arm.
Add test for observation creation with native search vector
@qxxaa

qxxaa commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Deployment note: Existing observations created before this fix still have search_vector IS NULL. A one-time backfill is needed for full BM25 coverage:

UPDATE memory_units
SET search_vector = to_tsvector('english'::regconfig, COALESCE(text, ''))
WHERE fact_type = 'observation' AND search_vector IS NULL;

Suggest including this in the release notes. Note: the regconfig language should match the deployment's HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE setting. This only applies to deployments using text_search_extension = "native" and the query can be expensive on large tables, so operators should schedule it during low traffic.

For external PG instances this is a manual step; for the default baked-in pg0 deployment it could potentially be an Alembic migration, but that's a separate decision for maintainers.

qxxaa and others added 3 commits June 26, 2026 16:46
…ations

The writer fix only populates search_vector for observations created or
updated after deploy. Observations already written under the native
backend keep a NULL search_vector and stay invisible to BM25 until
re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the
native tsvector column type and scoped to fact_type='observation' with a
NULL search_vector (idempotent). Matches the writer's text-only tsvector
and the configured native language.
post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked
up from the contributor's local hookspath and committed by mistake. They
are unrelated to this change; the project's real .githooks/pre-commit is
left intact.
@nicoloboschi nicoloboschi merged commit 21176f8 into vectorize-io:main Jun 30, 2026
87 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants