membench: add AttrScore-style free-form judge and runner skeleton #44
Open
rajkripal wants to merge 3 commits into
The cluster-merge LLM synthesis step was free to drop date, weekday,
and relative-time phrases ("on March 5", "last Tuesday", "two weeks
ago") when consolidating near-duplicate nodes. Critic diff on
LoCoMo conv-26 variant B traced 435 dropped-content cases to this
path and matched it to a -9.6 F1 regression on temporal-reasoning
questions (cat-2).
Two changes:
1. Strengthen the synthesis prompt: explicitly instruct the model
that temporal anchors are load-bearing and must not be dropped
for brevity.
2. Add a deterministic guard: collect every temporal-anchor token
from the source snippets, and if the model returns a synthesis
that contains none of them while the inputs had at least one,
fall back to the longest source. Bleached-but-fluent merges
never replace anchored facts.
Anchor regex covers: ISO dates, slash dates, month-day-year forms,
weekdays, bare 19xx/20xx years, and relative phrases like "in
3 days", "two weeks ago", "last Tuesday".
Behavior when sources have no temporal content: unchanged (LLM
output accepted as-is).
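
The guard described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the names `ANCHOR_RE`, `temporal_anchors`, and `guard_synthesis` are hypothetical, and the regex is only an approximation of the anchor classes listed in the message (for instance, the slash-date branch will also match fractions such as "1/2").

```python
import re

# Hypothetical building blocks; the real pattern in the PR may differ.
_MONTH = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
          r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)")
_WEEKDAY = r"(?:mon|tues|wednes|thurs|fri|satur|sun)day"
_NUM = r"(?:\d+|a|an|one|two|three|four|five|six|seven|eight|nine|ten)"
_UNIT = r"(?:day|week|month|year)s?"

ANCHOR_RE = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"                                          # ISO date
    r"|\d{1,2}/\d{1,2}(?:/\d{2,4})?"                              # slash date
    rf"|{_MONTH}\.?\s+\d{{1,2}}(?:st|nd|rd|th)?(?:,?\s+\d{{4}})?"  # March 5, 2023
    rf"|(?:last|next|this)\s+(?:{_WEEKDAY}|{_UNIT})"              # last Tuesday
    rf"|{_WEEKDAY}"                                               # bare weekday
    rf"|in\s+{_NUM}\s+{_UNIT}"                                    # in 3 days
    rf"|{_NUM}\s+{_UNIT}\s+ago"                                   # two weeks ago
    r"|(?:19|20)\d{2}"                                            # bare year
    r")\b",
    re.IGNORECASE,
)


def temporal_anchors(text):
    """Return the lowercased temporal-anchor tokens found in text."""
    return {m.group(0).lower() for m in ANCHOR_RE.finditer(text)}


def guard_synthesis(sources, synthesis):
    """Accept the synthesis unless it dropped every source anchor."""
    source_anchors = set()
    for snippet in sources:
        source_anchors |= temporal_anchors(snippet)
    if not source_anchors:
        return synthesis  # no temporal content in inputs: behavior unchanged
    lowered = synthesis.lower()
    if any(anchor in lowered for anchor in source_anchors):
        return synthesis  # at least one anchor survived the merge
    return max(sources, key=len)  # bleached merge: fall back to longest source
```

The check is deliberately lenient: one surviving anchor is enough to accept the merge, which matches the "contains none of them" wording but does not catch a synthesis that keeps one date and drops another.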
`bench/free_form_judge.py` implements a citation-precision judge using `claude-haiku-4-5-20251001`. Given a question, system answer, retrieved context, and optional gold sketch, it scores the answer 1.0, 0.5, or 0.0 according to whether it is fully, partially, or not supported by the context. Includes `batch_judge` for bulk evaluation.

`bench/run_free_form.py` is the runner skeleton for the free-form track. It loads questions from JSON, retrieves cashew context via `generate_session_context`, generates a 2-3 sentence answer, judges it, and writes JSONL results. Supports `--smoke` (first 3 questions), `--questions-file`, `--db`, and `--output` flags.

The questions JSON file (`membench-free-form-questions.json`) does not exist yet -- Raj needs to fill in the `gold_sketch` fields before the runner can be used.
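
For reviewers, the three-level scoring can be illustrated with a small sketch. Everything here is an assumption for illustration: the prompt wording, the `ask` callable standing in for the model call, the `parse_score` helper, and the choice to score an unparseable reply as 0.0 are not taken from `bench/free_form_judge.py`, and the `batch_judge` signature shown is a guess.

```python
import re

# Only these three scores are valid outputs of the judge.
_SCORE_RE = re.compile(r"\b(1\.0|0\.5|0\.0)\b")


def build_prompt(question, answer, context, gold_sketch=None):
    """Assemble a judge prompt (hypothetical wording)."""
    parts = [
        "Score how well the ANSWER is supported by the CONTEXT.",
        "Reply with 1.0 (fully), 0.5 (partially) or 0.0 (not supported).",
        f"QUESTION: {question}",
        f"ANSWER: {answer}",
        f"CONTEXT: {context}",
    ]
    if gold_sketch:
        parts.append(f"GOLD SKETCH: {gold_sketch}")
    return "\n".join(parts)


def parse_score(reply):
    """Return the first valid score in the reply, else 0.0 (assumed default)."""
    match = _SCORE_RE.search(reply)
    return float(match.group(1)) if match else 0.0


def batch_judge(items, ask):
    """Judge each item dict; `ask` maps a prompt string to the model reply."""
    return [
        parse_score(ask(build_prompt(
            item["question"], item["answer"], item["context"],
            item.get("gold_sketch"),
        )))
        for item in items
    ]
```

Passing the model call in as `ask` keeps the sketch testable offline with a stub; the actual module calls the model directly.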
`bench/free_form_judge.py` implements an AttrScore-style citation-precision judge. Given a question, system answer, retrieved context, and optional gold sketch, it calls `claude-haiku-4-5-20251001` and returns a score of 1.0 (fully supported), 0.5 (partially), or 0.0 (not supported / hallucinated). Includes `batch_judge` for bulk runs.

`bench/run_free_form.py` is the runner skeleton for the free-form synthesis track. It loads questions from a JSON file (schema with `qid`, `corpus`, `question`, `gold_sketch`, `authoring_note`), retrieves context from a cashew DB via `generate_session_context`, generates a 2-3 sentence answer, judges it, and writes JSONL results. Supports `--smoke` (first 3 questions), `--questions-file`, `--db`, and `--output`.

The questions JSON (`papers/locomo-run/membench-free-form-questions.json`) doesn't exist yet -- blocked on Raj filling in `gold_sketch` fields from the draft at `papers/locomo-run/membench-questions-draft.md`. The runner exits with a clear error if the file is missing.

Both files import `claude_p` from the locomo adapter, same pattern as `cashew_membench.py`.
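
As a rough illustration of the runner's I/O, the sketch below covers only flag parsing, question loading, and JSONL output. It assumes the questions file is a JSON array of objects with the five schema keys; the helper names, the `--db` and `--output` defaults, and the error wording are hypothetical, and the retrieval, generation, and judging steps are left out because their signatures are not shown in this PR.

```python
import argparse
import json
import sys
from pathlib import Path

REQUIRED_KEYS = {"qid", "corpus", "question", "gold_sketch", "authoring_note"}
DEFAULT_QUESTIONS = "papers/locomo-run/membench-free-form-questions.json"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="free-form track runner (sketch)")
    parser.add_argument("--questions-file", default=DEFAULT_QUESTIONS)
    parser.add_argument("--db", default=None)  # assumed default
    parser.add_argument("--output", default="free_form_results.jsonl")  # assumed
    parser.add_argument("--smoke", action="store_true",
                        help="run only the first 3 questions")
    return parser.parse_args(argv)


def load_questions(path, smoke=False):
    """Load and validate questions; exit with a clear message on problems."""
    path = Path(path)
    if not path.is_file():
        sys.exit(f"questions file not found: {path}")
    questions = json.loads(path.read_text(encoding="utf-8"))
    for index, question in enumerate(questions):
        missing = REQUIRED_KEYS - set(question)
        if missing:
            sys.exit(f"question {index} is missing keys: {sorted(missing)}")
    return questions[:3] if smoke else questions


def write_results(rows, output):
    """Write one JSON object per line (JSONL)."""
    with open(output, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
```

Validating the schema at load time means a half-filled `gold_sketch` pass fails before any model calls are made.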