getSegmentCoOccurrence: SQL LIMIT applies before variant-to-word folding, so plural-variant noise can crowd out a genuine two-word match

## Summary

`getSegmentCoOccurrence`'s SQL counts `DISTINCT segment` (the raw variant string) toward the
co-occurrence threshold and applies `ORDER BY`/`LIMIT` before the JS-side caller folds variants back
to distinct original words. A name whose "2+ matching segments" are actually plural-variant pairs of
one real word (e.g. both `service` and `services` present as segments) is indistinguishable from a
genuine two-different-word match at the SQL layer, so it can occupy a `LIMIT` slot ahead of — and
potentially crowd out — a real match on repos with many candidate rows.

## Root cause

```ts
// src/db/queries.ts:486-501
getSegmentCoOccurrence(segments, minSegments, limit) {
  ...
  SELECT name, COUNT(DISTINCT segment) AS matches
  FROM name_segment_vocab
  WHERE segment IN (${placeholders})
  GROUP BY name
  HAVING matches >= ?
  ORDER BY matches DESC, length(name) ASC
  LIMIT ?
  ...
}
```

```ts
// src/index.ts:941-943 (getSegmentMatches, the caller)
for (const hit of this.queries.getSegmentCoOccurrence(variants, 2, 24)) {
  const matched = this.wordsMatchingName(hit.name, variantToWord);
  if (matched.size >= 2) candidates.push({ name: hit.name, matchedWords: matched });
}
```

The `variants` list passed into the SQL call already includes plural-fold variants from
`segmentLookupVariants` (see the separate `-es` bug, #1145), so `service` and `services` are two
distinct entries in the `segment IN (...)` list. If a name's vocab rows include both, SQL's
`COUNT(DISTINCT segment)` sees 2 "matches" even though they're the same underlying word. The
`matched.size >= 2` check, which folds back to real words via `wordsMatchingName`, only runs on rows
the SQL `LIMIT` already let through. It can exclude a false candidate from the final output, but it
can't rescue a genuine 2-different-word match that got pushed past position `limit` (24) by enough of
these inflated single-word rows ranking equal or higher.

The maintainer's own comment one line above the caller acknowledges the variant-vs-word gap for
inclusion but not for the ranking/truncation order:

```ts
// src/index.ts:938-939
// Tier A: co-occurrence. minSegments=2 counts VARIANTS, so fold a name's
// matched variants back to distinct words before trusting the coverage.
```
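
To make the mechanism concrete, here is a toy in-memory model of the two stages. It is not the project's code and not a repro against the real database: `coOccurrence` only approximates the SQL's grouping, ordering, and `LIMIT`, and the rows, names, `variantToWord` contents, and the small limits are all made up for illustration.

```typescript
// Toy in-memory model of the SQL + JS pipeline. All data and names are hypothetical.
type VocabRow = { name: string; segment: string };
type Hit = { name: string; matches: number };

// Approximates the query: GROUP BY name, COUNT(DISTINCT segment),
// HAVING matches >= minSegments, ORDER BY matches DESC, length(name) ASC, LIMIT.
function coOccurrence(rows: VocabRow[], variants: string[], minSegments: number, limit: number): Hit[] {
  const wanted = new Set(variants);
  const segmentsByName = new Map<string, Set<string>>();
  for (const row of rows) {
    if (!wanted.has(row.segment)) continue;
    const segs = segmentsByName.get(row.name) ?? new Set<string>();
    segs.add(row.segment);
    segmentsByName.set(row.name, segs);
  }
  const hits: Hit[] = [];
  for (const [name, segs] of segmentsByName) {
    if (segs.size >= minSegments) hits.push({ name, matches: segs.size });
  }
  hits.sort((a, b) => b.matches - a.matches || a.name.length - b.name.length);
  return hits.slice(0, limit);
}

// Hypothetical variant -> original word mapping for the query "user service".
const variantToWord = new Map<string, string>([
  ["service", "service"],
  ["services", "service"],
  ["user", "user"],
]);

// Three names that contain only plural variants of one word, plus one genuine match.
const rows: VocabRow[] = [];
for (const name of ["servicesServiceA", "servicesServiceB", "servicesServiceC"]) {
  rows.push({ name, segment: "service" }, { name, segment: "services" });
}
rows.push(
  { name: "userServiceFactory", segment: "user" },
  { name: "userServiceFactory", segment: "service" },
);

// JS-side fold: keep only hits whose matched variants map to 2+ distinct words.
function survivingNames(limit: number): string[] {
  const variants = [...variantToWord.keys()];
  return coOccurrence(rows, variants, 2, limit)
    .filter((hit) => {
      const words = new Set<string>();
      for (const row of rows) {
        const word = variantToWord.get(row.segment);
        if (row.name === hit.name && word !== undefined) words.add(word);
      }
      return words.size >= 2;
    })
    .map((hit) => hit.name);
}

console.log(survivingNames(3)); // all three slots go to inflated rows
console.log(survivingNames(4)); // one extra slot lets the genuine match through
```

In this model every row has `matches = 2`, so the shorter inflated names sort first. With a limit of 3 the fold rejects all three hits and returns nothing; with a limit of 4 `userServiceFactory` survives. The same shape would apply at `limit = 24` if a real corpus had that many inflated rows ranking equal or higher.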

## Impact

I have not constructed a concrete repro that shows the truncation actually dropping a real match.
Building one needs a corpus with 24+ qualifying rows where enough are plural-variant false positives,
which I didn't have time to fabricate this session. This is a traced-but-unexecuted finding: the
mechanism is verified directly from the code (the SQL doesn't fold variants, and `LIMIT` applies
before the JS fold), but I have not observed the truncation happening in practice, so treat the
real-world frequency as unknown rather than demonstrated.

## Suggested fix

Either fold variants back to their original word inside the SQL (this would need the variant→word
mapping passed into the query, e.g. as a `CASE` expression or a temp mapping table), or raise `limit`
enough that the JS-side fold can still find real matches even with some inflated rows ahead of them.
I don't have enough context on the intended precision/cost tradeoff to recommend which.
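
As a rough sketch of the `CASE` option, the query text and bind parameters could be generated from the variant→word map so that `COUNT(DISTINCT ...)` counts folded words instead of raw variants. `buildFoldedCoOccurrenceSql` is a hypothetical helper, not something in the repo. It assumes the existing `name_segment_vocab` table and the SQLite-style `HAVING` on an alias that the current query already relies on, and I have not run it against the real database.

```typescript
// Hypothetical helper: builds a co-occurrence query that folds each variant to
// its original word before counting, so ORDER BY / LIMIT see real-word coverage.
function buildFoldedCoOccurrenceSql(
  variantToWord: Map<string, string>,
  minWords: number,
  limit: number,
): { sql: string; params: (string | number)[] } {
  const entries = [...variantToWord.entries()];
  if (entries.length === 0) {
    // An empty CASE / IN () would be invalid SQL, so refuse early.
    throw new Error("variantToWord must contain at least one entry");
  }

  const whenClauses = entries.map(() => "WHEN ? THEN ?").join(" ");
  const placeholders = entries.map(() => "?").join(", ");

  const sql = [
    `SELECT name, COUNT(DISTINCT CASE segment ${whenClauses} END) AS matches`,
    "FROM name_segment_vocab",
    `WHERE segment IN (${placeholders})`,
    "GROUP BY name",
    "HAVING matches >= ?",
    "ORDER BY matches DESC, length(name) ASC",
    "LIMIT ?",
  ].join("\n");

  // Bind order must follow placeholder order: CASE pairs, then IN list, then thresholds.
  const params: (string | number)[] = [];
  for (const [variant, word] of entries) params.push(variant, word);
  for (const [variant] of entries) params.push(variant);
  params.push(minWords, limit);

  return { sql, params };
}

// Example with a made-up mapping for the query "user service".
const folded = buildFoldedCoOccurrenceSql(
  new Map([
    ["service", "service"],
    ["services", "service"],
    ["user", "user"],
  ]),
  2,
  24,
);
console.log(folded.sql);
console.log(folded.params);
```

This uses three bind parameters per variant, so a long variant list could run into the database's bound-parameter limit; a temp mapping table joined against `name_segment_vocab` would avoid that at the cost of an extra write per query. Which is preferable depends on the same precision/cost tradeoff noted above.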

## Verification / scope

- Checked all 29 currently-open PRs' changed-file lists for src/db/queries.ts — 1 touches it (#1005,
  chunked resolved-reference deletes, hunk at line ~1731 — unrelated, already confirmed for #1141).
- Checked issues/PRs for "getSegmentCoOccurrence", "segment vocab limit" — nothing.
- Re-checked immediately before filing — no new overlap.
- Read directly, not executed end-to-end (see Impact hedge above).

## Environment

Found on `main` (tip `e699ee9`, v1.2.0).

