getSegmentCoOccurrence: SQL LIMIT applies before variant-to-word folding, so plural-variant noise can crowd out a genuine two-word match #1146

Description

@inth3shadows

Summary

getSegmentCoOccurrence's SQL counts DISTINCT segment (the raw variant string) toward the
co-occurrence threshold and applies ORDER BY/LIMIT before the JS-side caller folds variants back
to distinct original words. At the SQL layer, a name whose "2+ matching segments" are really two
plural variants of one word (e.g. both service and services present as segments) is
indistinguishable from a genuine match on two different words. On repos with many candidate rows,
such a name can take a LIMIT slot ahead of a real match and potentially crowd it out.

Root cause

// src/db/queries.ts:486-501
getSegmentCoOccurrence(segments, minSegments, limit) {
  ...
  SELECT name, COUNT(DISTINCT segment) AS matches
  FROM name_segment_vocab
  WHERE segment IN (${placeholders})
  GROUP BY name
  HAVING matches >= ?
  ORDER BY matches DESC, length(name) ASC
  LIMIT ?
  ...
}
// src/index.ts:941-943 (getSegmentMatches, the caller)
for (const hit of this.queries.getSegmentCoOccurrence(variants, 2, 24)) {
  const matched = this.wordsMatchingName(hit.name, variantToWord);
  if (matched.size >= 2) candidates.push({ name: hit.name, matchedWords: matched });
}

variants passed into the SQL call already includes plural-fold variants from
segmentLookupVariants (see the separate -es bug, #1145), so service and services are two
distinct entries in the segment IN (...) list. If a name's vocab rows include both, SQL's
COUNT(DISTINCT segment) reports 2 "matches" even though they're the same underlying word. The
matched.size >= 2 check in the caller, which folds variants back to real words via
wordsMatchingName, only runs on rows the SQL LIMIT already let through. It can exclude a false
candidate from the final output, but it can't rescue a genuine two-different-word match that was
pushed past the limit (24) by enough of these inflated single-word rows ranking equal or higher.
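
To make the mechanism concrete, here is a minimal in-memory sketch of the ordering problem. It does not exercise the real query or the real index: coOccurrence and wordsMatching are hypothetical stand-ins that mimic the SQL (COUNT(DISTINCT segment), HAVING, ORDER BY, LIMIT) and the JS-side fold, and the vocab contents are fabricated purely for illustration.

```typescript
// Hypothetical stand-in for name_segment_vocab: name -> its segments.
type Vocab = Map<string, string[]>;
type Hit = { name: string; matches: number };

// Mimics the SQL: count distinct matching VARIANTS per name, filter by
// minSegments, order by matches DESC then name length ASC, then truncate.
function coOccurrence(vocab: Vocab, variants: string[], minSegments: number, limit: number): Hit[] {
  const wanted = new Set(variants);
  const rows: Hit[] = [];
  vocab.forEach((segments, name) => {
    const matches = new Set(segments.filter((s) => wanted.has(s))).size;
    if (matches >= minSegments) rows.push({ name, matches });
  });
  rows.sort((a, b) => b.matches - a.matches || a.name.length - b.name.length);
  return rows.slice(0, limit);
}

// Mimics the JS-side fold: map a name's matched variants back to distinct words.
function wordsMatching(segments: string[], variantToWord: Map<string, string>): Set<string> {
  const words = new Set<string>();
  for (const s of segments) {
    const word = variantToWord.get(s);
    if (word !== undefined) words.add(word);
  }
  return words;
}

// Two real query words, each with a plural variant.
const variantToWord = new Map<string, string>([
  ["service", "service"], ["services", "service"],
  ["user", "user"], ["users", "user"],
]);
const variants = Array.from(variantToWord.keys());

const vocab: Vocab = new Map();
// 24 short names holding both variants of ONE word: 2 variant matches, 1 real word.
for (let i = 0; i < 24; i++) vocab.set(`s${i}`, ["service", "services"]);
// One genuine two-word match; its longer name sorts it after the noise rows.
vocab.set("userServiceRegistryFactory", ["user", "service", "registry", "factory"]);

// Run the truncated query, then apply the fold only to the rows that survived.
function foldedCandidates(limit: number): string[] {
  return coOccurrence(vocab, variants, 2, limit)
    .filter((hit) => wordsMatching(vocab.get(hit.name) ?? [], variantToWord).size >= 2)
    .map((hit) => hit.name);
}

const truncated = foldedCandidates(24); // genuine match never reaches the fold
const widened = foldedCandidates(25);   // one extra slot lets it through
console.log(truncated, widened);
```

In this toy corpus every row ties on matches = 2, so the length(name) tie-breaker decides the order and the genuine match lands in position 25. Whether real repositories ever produce enough such rows is exactly the open question noted under Impact.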

The maintainer's own comment one line above the caller acknowledges the variant-vs-word gap for
inclusion but not for the ranking/truncation order:

// src/index.ts:938-939
// Tier A: co-occurrence. minSegments=2 counts VARIANTS, so fold a name's
// matched variants back to distinct words before trusting the coverage.

Impact

I have not constructed a concrete repro that demonstrates the truncation actually dropping a real
match. Building one needs a corpus with 24+ qualifying rows, enough of them plural-variant
false positives, which I didn't have time to fabricate this session. This is a traced-but-unexecuted
finding: the mechanism is verified directly from the code (SQL doesn't fold variants, LIMIT applies
before the JS fold), but I have not observed the truncation happening in practice, so treat the
real-world frequency as unknown rather than demonstrated.

Suggested fix

Either fold variants back to their original word inside the SQL (this would need the variant→word
mapping passed into the query, e.g. as a CASE expression or a temp mapping table), or raise limit
enough that the JS-side fold can still find real matches even with some inflated rows ahead of
them. I don't have enough context on the intended precision/cost tradeoff to recommend which.
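
A rough sketch of the first option, assuming the driver binds positional ? parameters in order of appearance. buildFoldedCoOccurrenceQuery is a hypothetical helper, not existing code in the repo; it only builds the SQL text and parameter list, so that COUNT(DISTINCT ...) sees the folded word rather than the raw variant.

```typescript
type FoldedQuery = { sql: string; params: (string | number)[] };

// Hypothetical helper: counts distinct WORDS per name by mapping each
// variant to its original word with a CASE expression before counting.
function buildFoldedCoOccurrenceQuery(
  variantToWord: Map<string, string>,
  minWords: number,
  limit: number,
): FoldedQuery {
  const entries = Array.from(variantToWord.entries());
  if (entries.length === 0) throw new Error("variantToWord must not be empty");

  // One "WHEN variant THEN word" arm per variant, all bound as parameters.
  const whenClauses = entries.map(() => "WHEN ? THEN ?").join(" ");
  const placeholders = entries.map(() => "?").join(", ");

  const sql = [
    `SELECT name, COUNT(DISTINCT CASE segment ${whenClauses} END) AS matches`,
    "FROM name_segment_vocab",
    `WHERE segment IN (${placeholders})`,
    "GROUP BY name",
    "HAVING matches >= ?",
    "ORDER BY matches DESC, length(name) ASC",
    "LIMIT ?",
  ].join("\n");

  // Parameter order follows placeholder order: CASE arms, IN list, HAVING, LIMIT.
  const params: (string | number)[] = [];
  for (const [variant, word] of entries) params.push(variant, word);
  for (const [variant] of entries) params.push(variant);
  params.push(minWords, limit);

  return { sql, params };
}

const query = buildFoldedCoOccurrenceQuery(
  new Map<string, string>([
    ["service", "service"],
    ["services", "service"],
    ["user", "user"],
  ]),
  2,
  24,
);
console.log(query.sql, query.params);
```

With this shape, a name holding only service and services counts as one match and fails the HAVING filter before ORDER BY/LIMIT run. The cost is three bound parameters per variant instead of one, which may matter if the driver caps the number of parameters per statement.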

Verification / scope

Environment

Found on main (tip e699ee9, v1.2.0).
