Summary
getSegmentCoOccurrence's SQL counts COUNT(DISTINCT segment), where segment is the raw variant
string, toward the co-occurrence threshold, and applies ORDER BY/LIMIT before the JS-side caller
folds variants back to distinct original words. A name whose "2+ matching segments" are actually a
plural-variant pair of one real word (e.g. both service and services present as segments) is
indistinguishable from a genuine two-different-word match at the SQL layer. It can therefore occupy
a LIMIT slot ahead of a real match, and potentially crowd it out, on repos with many candidate rows.
Root cause
// src/db/queries.ts:486-501
getSegmentCoOccurrence(segments, minSegments, limit) {
...
SELECT name, COUNT(DISTINCT segment) AS matches
FROM name_segment_vocab
WHERE segment IN (${placeholders})
GROUP BY name
HAVING matches >= ?
ORDER BY matches DESC, length(name) ASC
LIMIT ?
...
}
// src/index.ts:941-943 (getSegmentMatches, the caller)
for (const hit of this.queries.getSegmentCoOccurrence(variants, 2, 24)) {
const matched = this.wordsMatchingName(hit.name, variantToWord);
if (matched.size >= 2) candidates.push({ name: hit.name, matchedWords: matched });
}
The variants list passed into the SQL call already includes plural-fold variants from
segmentLookupVariants (see the separate -es bug, #1145), so service and services are two distinct
entries in the segment IN (...) list. If a name's vocab rows include both, COUNT(DISTINCT segment)
reports 2 matches even though they are the same underlying word. The matched.size >= 2 check in the
caller, which folds variants back to real words via wordsMatchingName, only runs on rows the SQL
LIMIT already let through. It can exclude a false candidate from the final output, but it can't
rescue a genuine two-different-word match that was pushed past the limit (24) by enough of these
inflated single-word rows ranking equal or higher. An inflated row with matches = 2 ties a genuine
two-word match on the first sort key and wins the tie whenever its name is shorter, because the
second sort key is length(name) ASC.
The maintainer's own comment one line above the caller acknowledges the variant-vs-word gap for
inclusion but not for the ranking/truncation order:
// src/index.ts:938-939
// Tier A: co-occurrence. minSegments=2 counts VARIANTS, so fold a name's
// matched variants back to distinct words before trusting the coverage.
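The crowd-out mechanism described in this section can be sketched with a toy in-memory model. This is not a repro against the real code path: it does not touch SQLite or the project's classes. coOccurrence, foldToWords, the name servicesService0..23, and orderServiceRequestHandler are all hypothetical stand-ins, and the model returns each name's matched segments alongside the count purely so the fold can be shown in one place.

```typescript
// Toy in-memory model of the SQL query and its caller. Hypothetical names
// throughout; this is not the project's code and does not touch SQLite.
type VocabRow = { name: string; segment: string };

// Mirrors the SQL: WHERE segment IN (...), GROUP BY name, COUNT(DISTINCT segment),
// HAVING matches >= ?, ORDER BY matches DESC, length(name) ASC, LIMIT ?.
function coOccurrence(rows: VocabRow[], variants: string[], minSegments: number, limit: number) {
  const wanted = new Set(variants);
  const segmentsByName = new Map<string, Set<string>>();
  for (const row of rows) {
    if (!wanted.has(row.segment)) continue;
    const segs = segmentsByName.get(row.name) ?? new Set<string>();
    segs.add(row.segment);
    segmentsByName.set(row.name, segs);
  }
  return Array.from(segmentsByName, ([name, segs]) => ({ name, segs, matches: segs.size }))
    .filter((hit) => hit.matches >= minSegments)
    .sort((a, b) => b.matches - a.matches || a.name.length - b.name.length)
    .slice(0, limit);
}

// Stand-in for the caller's fold: map matched variants back to distinct words.
function foldToWords(segs: Set<string>, variantToWord: Map<string, string>): Set<string> {
  const words = new Set<string>();
  segs.forEach((seg) => {
    const word = variantToWord.get(seg);
    if (word !== undefined) words.add(word);
  });
  return words;
}

const variantToWord = new Map<string, string>([
  ["service", "service"], ["services", "service"],
  ["handler", "handler"], ["handlers", "handler"],
]);
const variants = Array.from(variantToWord.keys());

// 24 names that match only the plural pair of ONE word, plus one genuine
// two-word match whose name is longer and therefore sorts after them.
const rows: VocabRow[] = [];
for (let i = 0; i < 24; i++) {
  rows.push({ name: `servicesService${i}`, segment: "service" });
  rows.push({ name: `servicesService${i}`, segment: "services" });
}
rows.push({ name: "orderServiceRequestHandler", segment: "service" });
rows.push({ name: "orderServiceRequestHandler", segment: "handler" });

// Current order of operations: LIMIT first, fold afterwards.
const limitThenFold = coOccurrence(rows, variants, 2, 24)
  .filter((hit) => foldToWords(hit.segs, variantToWord).size >= 2)
  .map((hit) => hit.name);

// For contrast: fold first, then truncate to the same limit.
const foldThenLimit = coOccurrence(rows, variants, 2, rows.length)
  .filter((hit) => foldToWords(hit.segs, variantToWord).size >= 2)
  .slice(0, 24)
  .map((hit) => hit.name);
```

In this model limitThenFold ends up empty while foldThenLimit contains orderServiceRequestHandler. That illustrates the ordering problem only; it says nothing about how often real corpora contain 24 or more such inflated rows.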
Impact
I have not constructed a concrete repro that demonstrates the truncation actually dropping a real
match. Building one needs a corpus with 24+ qualifying rows where enough are plural-variant false
positives, which I didn't have time to fabricate this session. This is a traced-but-unexecuted
finding: the mechanism is verified directly from the code (SQL doesn't fold variants, LIMIT applies
before the JS fold), but I have not observed the truncation happening in practice, so treat the
real-world frequency as unknown rather than demonstrated.
Suggested fix
Two options. Either fold variants back to their original word inside the SQL (this would need the
variant→word mapping passed into the query, e.g. as a CASE expression or a temp mapping table), or
raise limit enough to give the JS-side fold room to still find real matches even with some inflated
rows ahead of them. The second is a mitigation rather than a fix, since enough inflated rows can
still exceed any fixed limit. I don't have enough context on the intended precision/cost tradeoff to
recommend which.
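A minimal sketch of the CASE-expression option, assuming the caller can hand over its variant→word map. buildFoldedQuery, its return shape, and the parameter order are hypothetical; only the table name, column names, and clause structure are taken from the snippet under Root cause. It just builds the SQL text and the positional parameter list, so it makes no assumption about which SQLite driver runs it.

```typescript
// Hypothetical helper: builds a variant-folding version of the query shown
// in this report. Not the project's API; names and shapes are assumptions.
function buildFoldedQuery(
  variantToWord: Map<string, string>,
  minWords: number,
  limit: number,
): { sql: string; params: (string | number)[] } {
  const pairs = Array.from(variantToWord.entries());
  if (pairs.length === 0) {
    throw new Error("variantToWord must contain at least one variant");
  }

  // One "WHEN variant THEN word" arm per variant; unmatched segments fall
  // through to NULL, which COUNT(DISTINCT ...) ignores.
  const whens = pairs.map(() => "WHEN ? THEN ?").join(" ");
  const placeholders = pairs.map(() => "?").join(", ");

  const sql = [
    `SELECT name, COUNT(DISTINCT CASE segment ${whens} END) AS matches`,
    "FROM name_segment_vocab",
    `WHERE segment IN (${placeholders})`,
    "GROUP BY name",
    "HAVING matches >= ?",
    "ORDER BY matches DESC, length(name) ASC",
    "LIMIT ?",
  ].join("\n");

  // Positional parameters must follow the order the "?" marks appear in the
  // SQL text: CASE arms first, then the IN list, then HAVING and LIMIT.
  const params: (string | number)[] = [];
  pairs.forEach(([variant, word]) => params.push(variant, word));
  pairs.forEach(([variant]) => params.push(variant));
  params.push(minWords, limit);

  return { sql, params };
}

// Usage with an illustrative mapping.
const folded = buildFoldedQuery(
  new Map<string, string>([
    ["service", "service"],
    ["services", "service"],
    ["handler", "handler"],
  ]),
  2,
  24,
);
```

With this shape, matches counts distinct words rather than distinct variants, so a name that only has service and services gets matches = 1 and is excluded by HAVING before ORDER BY/LIMIT run. The cost is two extra bound parameters per variant, which may matter if the driver or SQLite build caps the number of bound parameters.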
Verification / scope
Environment
Found on main (tip e699ee9, v1.2.0).
Verification / scope
chunked resolved-reference deletes, hunk at line ~1731 — unrelated, already confirmed for name_segment_vocab silently drifts from renamed nodes — updateNode() never calls insertNameSegments(), and the only backfill never re-triggers #1141).