fix: prevent silent node loss from qualified_name collisions#586
Open
CodeBlackwell wants to merge 2 commits into
Open
fix: prevent silent node loss from qualified_name collisions#586CodeBlackwell wants to merge 2 commits into
CodeBlackwell wants to merge 2 commits into
Conversation
qualified_name is the graph's stable identity key, built as file::name or file::parent.name. It is not unique per symbol: same-named symbols in one file (e.g. methods of different object literals, or repeated generated helpers) produce the same key. upsert_node inserts with ON CONFLICT(qualified_name) DO UPDATE, so colliding nodes collapsed last-writer-wins and earlier definitions were silently dropped. Deduplicate per file in the batch store functions, which already hold the whole file's node list. The first occurrence keeps its bare key so existing edges (which reference the same parser-computed key) still resolve to it; later collisions are suffixed with :L<line_start>, which is always unique. Unique nodes keep byte-identical keys, so embeddings and incremental updates are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surfaces same-file qualified_name collisions that were previously invisible. Adds GraphStore.find_disambiguated_nodes() (matches the :L<line> marker the batch store appends), threads a disambiguated_nodes list into the full_build and incremental_update results, logs it, and prints a one-line summary after build/update. Counts are unchanged: now that collisions are kept rather than dropped, the reported node total and the stored COUNT(*) already agree. Adds tests/test_collisions.py covering no-drop, the bare-key-first invariant, edge-resolution safety, and the disambiguation report. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #585.
Problem
qualified_nameis the graph's stable identity key, built asfile::nameorfile::parent.name. It is not unique per symbol: same-named symbols in one file (methods of different object literals, repeated generated helpers) produce the same key.upsert_nodeinserts withON CONFLICT(qualified_name) DO UPDATE, so colliding rows collapse last writer wins and earlier definitions are silently dropped. A build then reports more nodes than the database stores, and caller/callee analysis on those names points at a single merged node.Changes
Commit 1 — disambiguate colliding qualified_names. Deduplicate per file in
store_file_nodes_edgesandstore_file_batch, which already hold the whole file's node list. The first occurrence keeps its bare key so existing edges still resolve to it; later collisions are suffixed with:L<line_start>, which is always unique because two definitions cannot share a start line. Unique nodes keep byte identical keys, so embeddings and incremental updates are unaffected.Commit 2 — report disambiguated names. Adds
GraphStore.find_disambiguated_nodes(), adisambiguated_nodesfield in thefull_buildandincremental_updateresults, a log line, and a one line CLI summary after build/update. Counts are not rewritten: now that collisions are kept rather than dropped, the reported node total and the storedCOUNT(*)already agree.Verification
New
tests/test_collisions.pycovers: both same-named methods persist, three module scope duplicates persist, an edge to the bare key still resolves to the first definition, the disambiguation report, and that unique names keep bare keys. Full suite green (1389 passed), ruff and mypy clean.Caveat
A collided node's identity shifts if its own
line_startmoves. This affects only ambiguous duplicates, never the unique majority. Open to a different marker convention if you prefer (see #585).🤖 Generated with Claude Code