membench: add AttrScore-style free-form judge and runner skeleton#44

Open
rajkripal wants to merge 3 commits into main from feat/membench-free-form-judge

@rajkripal

Owner

bench/free_form_judge.py implements an AttrScore-style citation-precision judge. Given a question, system answer, retrieved context, and optional gold sketch, it calls claude-haiku-4-5-20251001 and returns a score of 1.0 (fully supported), 0.5 (partially), or 0.0 (not supported / hallucinated). Includes batch_judge for bulk runs.
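For reviewers, a minimal sketch of the scoring contract described above. It is not the code in bench/free_form_judge.py: `call_model` is a hypothetical stand-in for `claude_p` (whose signature is not shown in this PR text), the prompt wording is invented for illustration, and falling back to 0.0 on an unparseable reply is an assumption.

```python
from typing import Callable, Optional

# The three scores named in the description.
VALID_SCORES = (0.0, 0.5, 1.0)


def build_judge_prompt(question: str, answer: str, context: str,
                       gold_sketch: Optional[str] = None) -> str:
    """Assemble a judge prompt (illustrative wording, not the PR's prompt)."""
    parts = [
        "Rate how well the ANSWER is supported by the CONTEXT.",
        "Reply with exactly one number on the first line:",
        "1.0 = fully supported, 0.5 = partially supported, 0.0 = not supported.",
        f"QUESTION: {question}",
        f"ANSWER: {answer}",
        f"CONTEXT: {context}",
    ]
    if gold_sketch:
        # The gold sketch is optional, so it is only added when present.
        parts.append(f"Gold sketch (reference only): {gold_sketch}")
    return "\n".join(parts)


def parse_score(reply: str) -> float:
    """Map the model reply to 1.0 / 0.5 / 0.0; anything else becomes 0.0 (assumption)."""
    tokens = reply.strip().split()
    if not tokens:
        return 0.0
    try:
        value = float(tokens[0].rstrip(".,;:"))
    except ValueError:
        return 0.0
    return value if value in VALID_SCORES else 0.0


def judge(question: str, answer: str, context: str,
          gold_sketch: Optional[str],
          call_model: Callable[[str], str]) -> float:
    """Score one answer; call_model is a hypothetical prompt -> reply callable."""
    prompt = build_judge_prompt(question, answer, context, gold_sketch)
    return parse_score(call_model(prompt))


def batch_judge(items: list[dict], call_model: Callable[[str], str]) -> list[float]:
    """Score many items sequentially (hypothetical signature)."""
    return [
        judge(item["question"], item["answer"], item["context"],
              item.get("gold_sketch"), call_model)
        for item in items
    ]
```

Passing the model call in as a parameter keeps the sketch testable with a stub; the actual judge may well call `claude_p` directly.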

bench/run_free_form.py is the runner skeleton for the free-form synthesis track. It loads questions from a JSON file (schema with qid, corpus, question, gold_sketch, authoring_note), retrieves context from a cashew DB via generate_session_context, generates a 2-3 sentence answer, judges it, and writes JSONL results. Supports --smoke (first 3 questions), --questions-file, --db, and --output.

The questions JSON (papers/locomo-run/membench-free-form-questions.json) doesn't exist yet -- blocked on Raj filling in gold_sketch fields from the draft at papers/locomo-run/membench-questions-draft.md. The runner exits with a clear error if the file is missing.
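A rough sketch of the CLI surface and the missing-file behavior, for anyone preparing the questions file. The four flags and the five schema fields come from this description; everything else is an assumption: that the JSON is a top-level list of objects, the `--db` and `--output` defaults, the error wording, and the helper names `build_parser` and `load_questions`.

```python
import argparse
import json
import sys
from pathlib import Path

# Fields named in the PR description.
REQUIRED_FIELDS = ("qid", "corpus", "question", "gold_sketch", "authoring_note")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="free-form track runner (sketch)")
    parser.add_argument(
        "--questions-file",
        default="papers/locomo-run/membench-free-form-questions.json",
    )
    parser.add_argument("--db", default=None)                # default is an assumption
    parser.add_argument("--output", default="results.jsonl")  # default is an assumption
    parser.add_argument("--smoke", action="store_true",
                        help="run only the first 3 questions")
    return parser


def load_questions(path: str, smoke: bool = False) -> list[dict]:
    """Load and validate questions; exit with a readable message on problems."""
    file = Path(path)
    if not file.is_file():
        sys.exit(f"questions file not found: {file}")
    # Assumes a top-level JSON list of question objects.
    questions = json.loads(file.read_text(encoding="utf-8"))
    for index, question in enumerate(questions):
        missing = [name for name in REQUIRED_FIELDS if name not in question]
        if missing:
            sys.exit(f"question {index} is missing fields: {', '.join(missing)}")
    return questions[:3] if smoke else questions
```

Each entry would then look like `{"qid": ..., "corpus": ..., "question": ..., "gold_sketch": ..., "authoring_note": ...}`.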

Both files import claude_p from the locomo adapter, following the same pattern as cashew_membench.py.

rajkripal added 3 commits May 9, 2026 21:32
The cluster-merge LLM synthesis step was free to drop date, weekday,
and relative-time phrases ("on March 5", "last Tuesday", "two weeks
ago") when consolidating near-duplicate nodes. Critic diff on
LoCoMo conv-26 variant B traced 435 dropped-content cases to this
path and matched it to a -9.6 F1 regression on temporal-reasoning
questions (cat-2).

Two changes:

1. Strengthen the synthesis prompt: explicitly instruct the model
   that temporal anchors are load-bearing and must not be dropped
   for brevity.

2. Add a deterministic guard: collect every temporal-anchor token
   from the source snippets, and if the model returns a synthesis
   that contains none of them while the inputs had at least one,
   fall back to the longest source. Bleached-but-fluent merges
   never replace anchored facts.

Anchor regex covers: ISO dates, slash dates, month-day-year forms,
weekdays, bare 19xx/20xx years, and relative phrases like "in
3 days", "two weeks ago", "last Tuesday".

Behavior when sources have no temporal content: unchanged (LLM
output accepted as-is).
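
The guard described above can be sketched roughly as follows. This is an illustration, not the committed code: the regex is a simplified approximation of the listed anchor forms, the names `ANCHOR_RE`, `find_anchors`, and `guard_synthesis` are hypothetical, and "contains none of them" is interpreted here as a case-insensitive substring check.

```python
import re

_MONTH = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?"
          r"|jul(?:y)?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?"
          r"|nov(?:ember)?|dec(?:ember)?)")
_WEEKDAY = r"(?:mon|tues|wednes|thurs|fri|satur|sun)day"
_UNIT = r"(?:day|week|month|year)s?"
_NUM = r"(?:\d+|a|an|one|two|three|four|five|six|seven|eight|nine|ten)"

# Simplified approximation of the anchor forms listed in the commit message.
ANCHOR_RE = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"                                            # ISO date
    r"|\d{1,2}/\d{1,2}(?:/\d{2,4})?"                                # slash date
    rf"|{_MONTH}\.?\s+\d{{1,2}}(?:st|nd|rd|th)?(?:,?\s+\d{{4}})?"   # March 5, 2024
    rf"|(?:last|next|this)\s+(?:{_WEEKDAY}|{_UNIT})"                # last Tuesday
    rf"|{_WEEKDAY}"                                                 # bare weekday
    rf"|in\s+{_NUM}\s+{_UNIT}"                                      # in 3 days
    rf"|{_NUM}\s+{_UNIT}\s+ago"                                     # two weeks ago
    r"|(?:19|20)\d{2}"                                              # bare year
    r")\b",
    re.IGNORECASE,
)


def find_anchors(text: str) -> set[str]:
    """Return the lowercased temporal-anchor tokens found in text."""
    return {match.lower() for match in ANCHOR_RE.findall(text)}


def guard_synthesis(sources: list[str], synthesis: str) -> str:
    """Keep the synthesis unless it dropped every temporal anchor from the sources."""
    anchors: set[str] = set()
    for source in sources:
        anchors |= find_anchors(source)
    if not anchors:
        # No temporal content in the inputs: accept the LLM output as-is.
        return synthesis
    lowered = synthesis.lower()
    if any(anchor in lowered for anchor in anchors):
        return synthesis
    # Every anchor was dropped: fall back to the longest source snippet.
    return max(sources, key=len)
```

For example, merging "Met Ana on March 5 at the park." with "Met Ana at the park." into "Met Ana at the park." would be rejected, and the first snippet returned instead.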

`bench/free_form_judge.py` implements a citation-precision judge using
`claude-haiku-4-5-20251001`. Given a question, system answer, retrieved
context, and optional gold sketch, it scores 1.0/0.5/0.0 based on
whether the answer is fully/partially/not supported by context.
Includes `batch_judge` for bulk evaluation.

`bench/run_free_form.py` is the runner skeleton for the free-form track.
Loads questions from JSON, retrieves cashew context via
`generate_session_context`, generates a 2-3 sentence answer, judges
it, and writes JSONL results. Supports `--smoke` (first 3 questions),
`--questions-file`, `--db`, and `--output` flags.
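
The retrieve, generate, judge, write loop can be pictured as below. This is a sketch under assumptions: `retrieve`, `generate`, and `judge` are hypothetical callables standing in for `generate_session_context`, the answer-generation call, and the judge (their real signatures are not shown here), and the record fields are illustrative rather than the runner's actual output schema.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_free_form(
    questions: list[dict],
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    judge: Callable[[str, str, str, Optional[str]], float],
    output_path: str,
) -> int:
    """Run every question through the pipeline and write one JSON object per line."""
    written = 0
    with Path(output_path).open("w", encoding="utf-8") as out:
        for question in questions:
            text = question["question"]
            context = retrieve(text)            # stand-in for cashew retrieval
            answer = generate(text, context)    # stand-in for the 2-3 sentence answer
            score = judge(text, answer, context, question.get("gold_sketch"))
            record = {
                "qid": question["qid"],
                "corpus": question.get("corpus"),
                "question": text,
                "answer": answer,
                "score": score,
            }
            # JSONL: one self-contained JSON document per line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Writing each record as soon as it is judged means a partial run still leaves usable results on disk.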

The questions JSON file (`membench-free-form-questions.json`) does not
exist yet -- Raj must fill in the `gold_sketch` fields before the runner
can be used.