membench: add AttrScore-style free-form judge and runner skeleton by rajkripal · Pull Request #44 · rajkripal/cashew

rajkripal · 2026-05-14T19:04:51Z

bench/free_form_judge.py implements an AttrScore-style citation-precision judge. Given a question, system answer, retrieved context, and optional gold sketch, it calls claude-haiku-4-5-20251001 and returns a score of 1.0 (fully supported), 0.5 (partially), or 0.0 (not supported / hallucinated). Includes batch_judge for bulk runs.
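For reviewers who want a feel for the scoring contract, here is a minimal sketch of the three-level mapping. It is not the code in `bench/free_form_judge.py`: the verdict labels, the prompt wording, the item-dict shape for `batch_judge`, and the injected `llm` callable (a stand-in for `claude_p`, whose signature is not shown in this PR) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical verdict labels; the PR only states the three score values.
SCORES = {"SUPPORTED": 1.0, "PARTIAL": 0.5, "UNSUPPORTED": 0.0}


def build_prompt(question: str, answer: str, context: str,
                 gold_sketch: Optional[str] = None) -> str:
    """Assemble an illustrative judge prompt (wording is an assumption)."""
    parts = [
        "Decide whether the ANSWER is supported by the CONTEXT.",
        "Reply with exactly one word: SUPPORTED, PARTIAL, or UNSUPPORTED.",
        f"QUESTION: {question}",
        f"ANSWER: {answer}",
        f"CONTEXT: {context}",
    ]
    if gold_sketch:
        parts.append(f"GOLD SKETCH: {gold_sketch}")
    return "\n".join(parts)


def judge(question: str, answer: str, context: str,
          gold_sketch: Optional[str] = None, *,
          llm: Callable[[str], str]) -> float:
    """Return 1.0, 0.5, or 0.0 from the model's one-word verdict."""
    reply = llm(build_prompt(question, answer, context, gold_sketch)).upper()
    # Check UNSUPPORTED first because "SUPPORTED" is a substring of it.
    for label in ("UNSUPPORTED", "PARTIAL", "SUPPORTED"):
        if label in reply:
            return SCORES[label]
    # Assumption: an unparseable verdict is scored as not supported.
    return 0.0


def batch_judge(items: List[Dict[str, str]], *,
                llm: Callable[[str], str]) -> List[float]:
    """Judge many items; each dict holds question, answer, context, and
    optionally gold_sketch (assumed shape)."""
    return [judge(**item, llm=llm) for item in items]
```

Passing the model call in as a parameter keeps the sketch testable without network access; the real module calls the model directly.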

bench/run_free_form.py is the runner skeleton for the free-form synthesis track. It loads questions from a JSON file (schema with qid, corpus, question, gold_sketch, authoring_note), retrieves context from a cashew DB via generate_session_context, generates a 2-3 sentence answer, judges it, and writes JSONL results. Supports --smoke (first 3 questions), --questions-file, --db, and --output.
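The flag surface described above could look roughly like the following. This is a sketch, not the runner's actual parser: the default values and the `select_questions` helper are hypothetical, and only the four flag names and the "first 3 questions" smoke behavior come from this PR.

```python
import argparse
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Free-form synthesis track runner (illustrative sketch)."
    )
    # The path is the one named in this PR; using it as the default is an assumption.
    parser.add_argument(
        "--questions-file",
        default="papers/locomo-run/membench-free-form-questions.json",
    )
    # Hypothetical defaults: the PR does not say what --db and --output fall back to.
    parser.add_argument("--db", default=None)
    parser.add_argument("--output", default="free_form_results.jsonl")
    parser.add_argument(
        "--smoke", action="store_true",
        help="run only the first 3 questions",
    )
    return parser


def select_questions(questions: List[dict], smoke: bool) -> List[dict]:
    """Apply the --smoke cutoff: keep the first 3 questions."""
    return questions[:3] if smoke else questions


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    return build_parser().parse_args(argv)
```

For example, `parse_args(["--smoke", "--db", "cashew.db"])` yields a namespace with `smoke=True` and `db="cashew.db"`; the database filename here is made up.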

The questions JSON (papers/locomo-run/membench-free-form-questions.json) doesn't exist yet -- blocked on Raj filling in gold_sketch fields from the draft at papers/locomo-run/membench-questions-draft.md. The runner exits with a clear error if the file is missing.
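A sketch of what the fail-fast load might look like, assuming the questions file is a top-level JSON array of objects carrying the five schema keys listed earlier. The function name, the error wording, and the per-entry key check are illustrative and not taken from the runner.

```python
import json
import sys
from pathlib import Path
from typing import List

# Schema keys as listed in the PR description.
REQUIRED_KEYS = ("qid", "corpus", "question", "gold_sketch", "authoring_note")


def load_questions(path: str) -> List[dict]:
    """Load the questions file, exiting with a readable message on problems."""
    questions_path = Path(path)
    if not questions_path.exists():
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"error: questions file not found: {questions_path}")

    # Assumption: the file is a JSON array of question objects.
    questions = json.loads(questions_path.read_text(encoding="utf-8"))
    for index, question in enumerate(questions):
        missing = [key for key in REQUIRED_KEYS if key not in question]
        if missing:
            sys.exit(f"error: question #{index} is missing keys: {missing}")
    return questions
```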

Both files import claude_p from the locomo adapter, same pattern as cashew_membench.py.

The cluster-merge LLM synthesis step was free to drop date, weekday, and relative-time phrases ("on March 5", "last Tuesday", "two weeks ago") when consolidating near-duplicate nodes. Critic diff on LoCoMo conv-26 variant B traced 435 dropped-content cases to this path and matched it to a -9.6 F1 regression on temporal-reasoning questions (cat-2).

Two changes:

1. Strengthen the synthesis prompt: explicitly instruct the model that temporal anchors are load-bearing and must not be dropped for brevity.
2. Add a deterministic guard: collect every temporal-anchor token from the source snippets, and if the model returns a synthesis that contains none of them while the inputs had at least one, fall back to the longest source. Bleached-but-fluent merges never replace anchored facts.

Anchor regex covers: ISO dates, slash dates, month-day-year forms, weekdays, bare 19xx/20xx years, and relative phrases like "in 3 days", "two weeks ago", "last Tuesday".

Behavior when sources have no temporal content: unchanged (LLM output accepted as-is).
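The deterministic guard can be sketched as below. The regex is a simplified approximation written for this note, not the pattern shipped in the change, and the names `ANCHOR_RE`, `find_anchors`, and `guard_synthesis` are hypothetical. Containment is checked by case-insensitive substring match, which is also an assumption about how "contains none of them" is implemented.

```python
import re
from typing import List, Set

_MONTH = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
          r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)")
_WEEKDAY = r"(?:mon|tues|wednes|thurs|fri|satur|sun)day"
_NUM = r"(?:\d+|a|an|one|two|three|four|five|six|seven|eight|nine|ten)"
_UNIT = r"(?:day|week|month|year)s?"

ANCHOR_RE = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"                                            # ISO dates
    r"|\d{1,2}/\d{1,2}(?:/\d{2,4})?"                                # slash dates
    rf"|{_MONTH}\.?\s+\d{{1,2}}(?:st|nd|rd|th)?(?:,?\s+\d{{4}})?"   # March 5, 2024
    rf"|(?:last|next|this)\s+(?:{_WEEKDAY}|{_UNIT})"                # last Tuesday
    rf"|{_WEEKDAY}"                                                 # bare weekdays
    rf"|in\s+{_NUM}\s+{_UNIT}"                                      # in 3 days
    rf"|{_NUM}\s+{_UNIT}\s+ago"                                     # two weeks ago
    r"|(?:19|20)\d{2}"                                              # bare years
    r")\b",
    re.IGNORECASE,
)


def find_anchors(text: str) -> Set[str]:
    """Return the lowercased temporal-anchor tokens found in text."""
    return {match.group(0).lower() for match in ANCHOR_RE.finditer(text)}


def guard_synthesis(sources: List[str], synthesis: str) -> str:
    """Accept the LLM synthesis unless it dropped every source anchor."""
    source_anchors: Set[str] = set()
    for source in sources:
        source_anchors |= find_anchors(source)

    # No temporal content in the inputs: behavior unchanged.
    if not source_anchors:
        return synthesis

    lowered = synthesis.lower()
    if any(anchor in lowered for anchor in source_anchors):
        return synthesis

    # Every anchor was bleached out: fall back to the longest source.
    return max(sources, key=len)
```

With sources "Alice visited Paris on March 5." and "Alice went to Paris last Tuesday with Bob and stayed two weeks.", a synthesis of "Alice visited Paris with Bob." is rejected in favor of the second (longest) source, while a synthesis that keeps "March 5" passes through untouched.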

`bench/free_form_judge.py` implements a citation-precision judge using `claude-haiku-4-5-20251001`. Given a question, system answer, retrieved context, and optional gold sketch, it scores 1.0/0.5/0.0 according to whether the answer is fully, partially, or not supported by the context. It includes `batch_judge` for bulk evaluation.

`bench/run_free_form.py` is the runner skeleton for the free-form track. It loads questions from JSON, retrieves cashew context via `generate_session_context`, generates a 2-3 sentence answer, judges it, and writes JSONL results. It supports the `--smoke` (first 3 questions), `--questions-file`, `--db`, and `--output` flags.

The questions JSON file (`membench-free-form-questions.json`) does not exist yet; Raj needs to fill in the `gold_sketch` fields before the runner can be used.

rajkripal added 3 commits May 9, 2026 21:32

membench: add 50 free-form synthesis questions draft

59ea62a

rajkripal wants to merge 3 commits into main from feat/membench-free-form-judge
