
maint(brain): add orphan node pruning script#92

Open
rajkripal wants to merge 2 commits into main from maint/prune-orphan-nodes

Conversation

@rajkripal

Owner

159 active nodes (4.8% of 3,295) have no edges in derivation_edges.
These accumulate from extraction runs that don't link new nodes to
existing ones. Left alone they waste embedding space and dilute retrieval
quality over time.

scripts/prune_orphans.py queries for nodes where id appears in
neither parent_id nor child_id of any edge and prints a report:
total count, breakdown by domain and node_type, and 5 sample nodes.
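The orphan query and report described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the `nodes` table name, its `decayed` column meaning "inactive", and the helper names `find_orphans` and `build_report` are assumptions made for illustration. Only `derivation_edges`, `parent_id`, `child_id`, `domain`, and `node_type` come from the description.

```python
import sqlite3
from collections import Counter

# Assumed schema: nodes(id, domain, node_type, decayed) and
# derivation_edges(parent_id, child_id). The `nodes` table name is hypothetical.
ORPHAN_SQL = """
SELECT n.id, n.domain, n.node_type
FROM nodes AS n
WHERE COALESCE(n.decayed, 0) = 0          -- "active" is assumed to mean decayed = 0
  AND NOT EXISTS (
      SELECT 1
      FROM derivation_edges AS e
      WHERE e.parent_id = n.id OR e.child_id = n.id
  )
ORDER BY n.id
"""


def find_orphans(conn: sqlite3.Connection) -> list:
    """Return (id, domain, node_type) rows for active nodes with no edges."""
    return conn.execute(ORPHAN_SQL).fetchall()


def build_report(orphans: list, sample_size: int = 5) -> dict:
    """Summarize orphans: total, per-domain and per-type counts, and a sample."""
    return {
        "total": len(orphans),
        "by_domain": Counter(row[1] for row in orphans),
        "by_type": Counter(row[2] for row in orphans),
        "samples": orphans[:sample_size],
    }
```

`NOT EXISTS` is used here instead of `NOT IN` so that a NULL `parent_id` or `child_id` in an edge row cannot silently make the result empty.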

Pass --execute to set decayed=1 on all orphan nodes (a reversible soft
prune, not a delete). With no flag, the script defaults to --dry-run and
only prints the report. The DB path is read from
core.config.get_db_path(), with a fallback to data/graph.db relative to
the cashew root.
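A rough sketch of the flag handling, path fallback, and soft prune, under stated assumptions: `get_db_path()` is taken to need no arguments, the fallback is triggered only by a failed import, and the `nodes` table name plus the helpers `resolve_db_path`, `parse_args`, and `mark_decayed` are hypothetical names for illustration.

```python
import argparse
import sqlite3
from pathlib import Path


def resolve_db_path(root: Path) -> Path:
    """Prefer core.config.get_db_path(); fall back to <root>/data/graph.db."""
    try:
        from core.config import get_db_path  # signature assumed: no arguments
    except ImportError:
        return root / "data" / "graph.db"
    return Path(get_db_path())


def parse_args(argv=None) -> argparse.Namespace:
    """--execute and --dry-run are mutually exclusive; no flag means dry-run."""
    parser = argparse.ArgumentParser(description="Report or prune orphan nodes")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--execute", action="store_true",
                      help="set decayed=1 on all orphan nodes")
    mode.add_argument("--dry-run", action="store_true",
                      help="print the report only (default)")
    return parser.parse_args(argv)


def mark_decayed(conn: sqlite3.Connection, orphan_ids, execute: bool) -> int:
    """Set decayed=1 on the given ids; do nothing unless execute is True."""
    if not execute:
        return 0
    with conn:  # commits on success, rolls back on error
        cur = conn.executemany(
            "UPDATE nodes SET decayed = 1 WHERE id = ?",  # `nodes` is an assumed name
            [(node_id,) for node_id in orphan_ids],
        )
    return cur.rowcount
```

Because the prune only flips a flag, it can be undone with the inverse update (`SET decayed = 0`) on the same ids, provided those ids were recorded from the report.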

Current orphan distribution: bunny 96, raj 62, user 1. Top types: fact
43, observation 36, commitment 35, derived 22.

rajkripal and others added 2 commits June 9, 2026 12:05
Move extractor_state from <data_dir>/extractor_state/<name>.json into an
extractor_state table in brain.db so that state lifecycle matches DB
lifecycle. Wiping the DB now also resets extractor progress, eliminating
silent skip-all-reprocessing on reingest.

- ExtractorRegistry.__init__ accepts optional db_path; creates
  extractor_state table (name TEXT PK, state_json TEXT, updated_at TEXT)
- _save_state / _load_state use DB when db_path is set, JSON otherwise
- Backwards-compat shim: on first load, if DB row is missing, import from
  legacy JSON file and migrate into DB
- CLI (cmd_ingest) passes db_path to registry
- Four new tests: DB persistence, DB lifecycle matches DB wipe, JSON
  fallback, legacy shim import
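For illustration, the persistence rules in the list above can be sketched as a small standalone class. This is not the `ExtractorRegistry` code itself: `StateStore` and its method names are hypothetical, and returning an empty dict for missing state is an assumption. The table columns and the legacy `<data_dir>/extractor_state/<name>.json` location follow the commit message.

```python
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


class StateStore:
    """Hypothetical stand-in for the state handling described above."""

    def __init__(self, data_dir, db_path=None):
        self.state_dir = Path(data_dir) / "extractor_state"  # legacy JSON location
        self.db_path = db_path
        if db_path is not None:
            self._run(
                "CREATE TABLE IF NOT EXISTS extractor_state "
                "(name TEXT PRIMARY KEY, state_json TEXT, updated_at TEXT)"
            )

    def _run(self, sql, params=()):
        # Open, commit, and close per call so no connection outlives the DB file.
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:
                return conn.execute(sql, params).fetchall()
        finally:
            conn.close()

    def save(self, name, state):
        payload = json.dumps(state)
        if self.db_path is None:  # JSON fallback when no DB is configured
            self.state_dir.mkdir(parents=True, exist_ok=True)
            (self.state_dir / f"{name}.json").write_text(payload)
            return
        now = datetime.now(timezone.utc).isoformat()
        self._run(
            "INSERT OR REPLACE INTO extractor_state (name, state_json, updated_at) "
            "VALUES (?, ?, ?)",
            (name, payload, now),
        )

    def load(self, name):
        legacy = self.state_dir / f"{name}.json"
        if self.db_path is None:
            return json.loads(legacy.read_text()) if legacy.exists() else {}
        rows = self._run(
            "SELECT state_json FROM extractor_state WHERE name = ?", (name,)
        )
        if rows:
            return json.loads(rows[0][0])
        if legacy.exists():  # backwards-compat shim: import, then migrate into the DB
            state = json.loads(legacy.read_text())
            self.save(name, state)
            return state
        return {}  # assumed default for "no state yet"
```

With state stored in the same file as the graph, deleting the DB removes the `extractor_state` rows too, so the next load starts from empty state unless a legacy JSON file is still on disk.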

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

159 active nodes (4.8%) have no edges in derivation_edges. These
accumulate from extraction runs that don't link new nodes to existing
ones. Left alone they waste embedding space and dilute retrieval.

`scripts/prune_orphans.py` reports orphan count, domain/type breakdown,
and 5 sample nodes. Pass `--execute` to set decayed=1 on all orphans.
Defaults to dry-run; reads DB path from core.config or falls back to
data/graph.db relative to cashew root.