Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator by qxxaa · Pull Request #2425 · vectorize-io/hindsight

qxxaa · 2026-06-26T15:57:34Z

Problem

The consolidator creates and updates observations in memory_units without populating the search_vector column. Under the native text search extension backend, this makes all observations invisible to BM25 full-text retrieval.

This is a known gap documented in the source code itself:

# Native: the migration p4q5r6s7t8u9 dropped the GENERATED expression on
# search_vector to allow per-deployment language configuration; the
# batch insert path in ops_postgresql.insert_facts_batch now populates
# it via to_tsvector($lang, ...). This single-observation INSERT does
# not, so observations under the native backend currently land with
# NULL search_vector and are not BM25-searchable until reflected/
# re-ingested. Tracking a separate fix for that gap.

This PR fills that gap.

The raw fact batch insert path (ops_postgresql.insert_facts_batch) correctly populates search_vector with to_tsvector(). The consolidation pipeline - which produces the observation layer that recall surfaces - does not. The BM25 retrieval arm consistently returns 0 candidates for observations:

Parallel retrieval: semantic=12, bm25=0, graph=95, temporal=1

After the fix:

Parallel retrieval: semantic=12, bm25=3, graph=95, temporal=1

Root cause

Four code paths in consolidator.py write observation text without populating search_vector:

Function	Operation	Trigger
`_dedup_reconcile_create`	UPDATE	New observation merges into near-twin
`_dedup_reconcile_update`	UPDATE	Updated observation drifts into different twin
`_execute_update_action`	UPDATE	LLM rewrites existing observation
`_create_observation_directly`	INSERT	New observation created

Fix

Add conditional search_vector population to all four sites, gated on config.text_search_extension == "native". This matches the existing pattern in ops_postgresql.insert_facts_batch.

For UPDATE paths (Sites 1-3), a search_vector_clause string is conditionally appended to the SET clause:

search_vector_clause = (
    f",\n            search_vector = to_tsvector('{config.text_search_extension_native_language}'::regconfig, COALESCE($1, ''))"
    if config.text_search_extension == "native"
    else ""
)

For the INSERT path (Site 4), the existing else branch (which bundled native + pg_textsearch + pgroonga + pg_search) is split into elif "native" (with tsvector) and else (without), ensuring non-native backends continue to leave search_vector NULL as they index base text columns directly.

Backend gating rationale

The fix is restricted to text_search_extension == "native" because:

vchord: Has its own INSERT branch with tokenize() - already handles search_vector
native: Uses PostgreSQL's built-in tsvector/GIN for BM25 - needs explicit to_tsvector() population
pg_textsearch / pgroonga / pg_search: Index the raw text column directly via their own extension mechanisms - search_vector is a dummy column that should remain NULL

tsvector content: observation text only

The tsvector is built from COALESCE(text, '') - the observation's own text content only. Entity names, source fact text, and temporal metadata are intentionally excluded because:

Semantic retrieval already handles meaning-based discovery (embeddings)
Graph retrieval already handles entity-based discovery (entity links)
Temporal retrieval already handles time-based discovery (occurred_start/end, mentioned_at)

BM25's role is keyword matching on the observation's surface text - the one signal the other three arms don't cover. Enriching the tsvector with entity names or source context would create "contextual leakage" where non-central details pull general observations into narrow keyword matches.

Regression risk

None. The fix only fires when text_search_extension == "native" - all other backends are unaffected. The tsvector is built from the observation's own text content using the configured language dictionary (config.text_search_extension_native_language), matching the existing pattern in insert_facts_batch. No new parameters are added to any SQL query.

Testing

Added test_create_observation_populates_search_vector_native in tests/test_consolidation.py. Asserts that observations created via consolidation have search_vector IS NOT NULL under the native text search backend (Site 4 - the INSERT path, highest frequency).

The UPDATE path (Site 3) is verified to work via live deployment testing but not unit-tested - triggering an UPDATE deterministically requires a real LLM making consolidation decisions (the mock LLM always creates fresh observations). Sites 1-2 (dedup reconcile) are similarly impractical to trigger without controlling embedding similarity thresholds.

Additionally verified on a live deployment with ~139 observations:

Pre-patch: all observations have search_vector IS NULL, BM25 returns 0
Post-patch: new observations created by consolidation have populated tsvectors
BM25 arm returns relevant candidates (3+ observations matching keyword queries)
Semantic, graph, and temporal arms unaffected

The consolidator creates and updates observations without populating the search_vector tsvector column. Under the native text search extension, this means observations are invisible to BM25 full-text retrieval - the BM25 arm returns 0 candidates regardless of query content. Four code paths write observation text to memory_units: 1. _dedup_reconcile_create (merge into existing twin) 2. _dedup_reconcile_update (drift-merge into different twin) 3. _execute_update_action (LLM rewrite of existing observation) 4. _create_observation_directly (new observation INSERT) None populated search_vector. This patch adds conditional tsvector generation gated on config.text_search_extension == 'native', matching the existing pattern in ops_postgresql.insert_facts_batch. Non-native backends (pg_textsearch, pgroonga, pg_search) continue to leave search_vector NULL as they index base text columns directly. The INSERT path (Site 4) splits the existing else branch into an explicit elif/else to avoid applying native tsvector logic to backends that don't use it. Fixes: observations invisible to BM25 retrieval arm.

Add test for observation creation with native search vector

qxxaa · 2026-06-26T16:04:06Z

Deployment note: Existing observations created before this fix still have search_vector IS NULL. A one-time backfill is needed for full BM25 coverage:

UPDATE memory_units
SET search_vector = to_tsvector('english'::regconfig, COALESCE(text, ''))
WHERE fact_type = 'observation' AND search_vector IS NULL;

Suggest including this in the release notes. Note: the regconfig language should match the deployment's HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE setting. This only applies to deployments using text_search_extension = "native" and the query can be expensive on large tables, so operators should schedule it during low traffic.

For external PG instances this is a manual step; for the default baked-in pg0 deployment it could potentially be an Alembic migration, but that's a separate decision for maintainers.

…ations The writer fix only populates search_vector for observations created or updated after deploy. Observations already written under the native backend keep a NULL search_vector and stay invisible to BM25 until re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the native tsvector column type and scoped to fact_type='observation' with a NULL search_vector (idempotent). Matches the writer's text-only tsvector and the configured native language.

post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact.

qxxaa added 2 commits June 26, 2026 16:24

Implement test for search vector population in observations

8a891cc

Add test for observation creation with native search vector

qxxaa and others added 3 commits June 26, 2026 16:46

style: run lint

b960307

chore: remove accidentally committed git-lfs hooks

986fd5c

post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact.

nicoloboschi merged commit 21176f8 into vectorize-io:main Jun 30, 2026
87 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425
nicoloboschi merged 5 commits into
vectorize-io:mainfrom
qxxaa:fix(consolidation)-observation-search-vector

qxxaa commented Jun 26, 2026

Uh oh!

qxxaa commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

qxxaa commented Jun 26, 2026

Problem

Root cause

Fix

Backend gating rationale

tsvector content: observation text only

Regression risk

Testing

Uh oh!

qxxaa commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants