fix(eval): deterministic accuracy-benchmark sampling with model_settings opt-in#1722

Open
JimStenstrom wants to merge 1 commit into
jundot:mainfrom
JimStenstrom:fix/606-eval-per-model-temperature

@JimStenstrom JimStenstrom commented Jun 7, 2026

Contributor

Fixes #606.

Supersedes this PR's original approach (per-model temperature winning by default), which would make saved benchmark scores non-deterministic. This adopts the design @jundot proposed on #1254:

"Accuracy sampling: the default moved to model_settings, which makes scores non-deterministic when a model has a non-zero temperature. Would you be open to defaulting sampling_profile to deterministic (with model_settings as opt-in) so saved scores stay reproducible?"

Summary

Why

eval/base.py force-set temperature=0.0 (and neutral penalties) and accuracy_benchmark.py never read ms.temperature, so temperature-sensitive benchmarks scored a model far from its real configuration. But simply honoring temperature by default makes saved scores depend on per-model settings — exactly the non-determinism @jundot flagged. The sampling_profile toggle resolves both: reproducible by default, real-world sampling on request.

Changes

  • omlx/admin/accuracy_benchmark.py: sampling_profile: Literal["deterministic", "model_settings"] = "deterministic" on AccuracyBenchmarkRequest; the per-model sampling read is gated on model_settings and now includes temperature.
  • omlx/eval/base.py: _eval_single uses setdefault for temperature/presence_penalty/repetition_penalty (were force-set) so the opt-in profile can supply them; max_tokens stays benchmark-forced (a model's small limit must not truncate answers).
  • Dashboard: a "Sampling" toggle cloned from the existing Thinking Mode control (_bench_accuracy.html + dashboard.js); i18n keys added to en.json and propagated to all locales via scripts/normalize_i18n.py. No new tailwind classes (reuses the toggle's classes) → no CSS rebuild.
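The request-side gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in omlx/admin/accuracy_benchmark.py: the class names, the `build_sampling_kwargs` helper, and the exact list of forwarded fields are assumptions, and a plain dataclass stands in for whatever model type the real `AccuracyBenchmarkRequest` uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal, Optional, get_args

SamplingProfile = Literal["deterministic", "model_settings"]


@dataclass
class ModelSettingsSketch:
    """Hypothetical stand-in for the per-model settings object (`ms`)."""
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    top_k: Optional[int] = None
    presence_penalty: Optional[float] = None
    repetition_penalty: Optional[float] = None


@dataclass
class AccuracyBenchmarkRequestSketch:
    """Hypothetical stand-in for AccuracyBenchmarkRequest."""
    sampling_profile: str = "deterministic"  # omitted field -> deterministic

    def __post_init__(self) -> None:
        # Reject anything outside the two allowed profile names.
        if self.sampling_profile not in get_args(SamplingProfile):
            raise ValueError(f"invalid sampling_profile: {self.sampling_profile!r}")


def build_sampling_kwargs(
    request: AccuracyBenchmarkRequestSketch, ms: ModelSettingsSketch
) -> Dict[str, Any]:
    """Forward per-model sampling only when the caller opted in."""
    if request.sampling_profile != "model_settings":
        return {}  # deterministic: nothing leaks from the model's settings
    # Assumed field list; the PR names temperature, top_p and the penalties.
    fields = ("temperature", "top_p", "top_k", "presence_penalty", "repetition_penalty")
    return {
        name: getattr(ms, name) for name in fields if getattr(ms, name) is not None
    }
```

With a model configured at temperature 0.7 and top_p 0.9, the default request yields an empty dict, while `sampling_profile="model_settings"` yields both values.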

Behavior / compatibility

  • The default, deterministic, is score-identical to current main: main already force-sets temperature=0, top_p/top_k have no effect under greedy decoding, and penalties were already neutralized in _eval_single. API callers that omit the field get deterministic.
  • ModelSettings is untouched — no new persisted field, no migration.

Sibling sweep

  • Both sampling paths covered: the request gate (accuracy_benchmark.py) and the eval defaults (eval/base.py) — setdefault applied to all three forced sampling params for consistency, not just temperature.
  • _eval_single is shared by every benchmark subclass (mmlu/gsm8k/humaneval/mbpp/livecodebench); the setdefault change applies uniformly to all of them.
  • The performance benchmark (omlx/admin/benchmark.py) reads per-model settings only for experimental-feature metadata, not sampling — intentionally out of scope (it measures throughput, not accuracy, so determinism doesn't apply).
  • UI reachable + localized (toggle + 8 locales). enable_thinking stays orthogonal (its own field, merged into chat_template_kwargs regardless of profile).
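The force-set versus setdefault distinction in _eval_single can be illustrated with a small sketch. The function name `resolve_eval_kwargs`, the `BENCHMARK_MAX_TOKENS` value, and the specific neutral penalty values (0.0 for presence, 1.0 for repetition) are assumptions chosen for illustration; only the pattern (setdefault for the three sampling params, forced max_tokens) comes from the description above.

```python
from typing import Any, Dict

BENCHMARK_MAX_TOKENS = 2048  # hypothetical benchmark-chosen limit


def resolve_eval_kwargs(
    sampling_kwargs: Dict[str, Any], max_tokens: int = BENCHMARK_MAX_TOKENS
) -> Dict[str, Any]:
    """Merge caller sampling with the benchmark's neutral defaults."""
    kwargs = dict(sampling_kwargs)  # copy so the caller's dict is not mutated

    # setdefault: fill in greedy/neutral values only when the caller
    # (the model_settings profile) did not supply them.
    kwargs.setdefault("temperature", 0.0)
    kwargs.setdefault("presence_penalty", 0.0)    # assumed neutral value
    kwargs.setdefault("repetition_penalty", 1.0)  # assumed neutral value

    # max_tokens stays forced: a model's small limit must not truncate answers.
    kwargs["max_tokens"] = max_tokens
    return kwargs
```

Under the deterministic profile the input is empty, so every call resolves to the same greedy configuration; under model_settings a supplied temperature survives, but a supplied max_tokens is still overwritten.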

Tests

python -m pytest tests/test_accuracy_benchmark.py tests/test_eval.py -q   # 93 passed
  • Request: sampling_profile default is deterministic; model_settings accepted; invalid value rejected.
  • Gating: deterministic forwards no sampling (sampling_kwargs == {}); model_settings forwards the model's temperature/top_p.
  • _eval_single: greedy defaults when empty, caller sampling overrides them, max_tokens always benchmark-controlled.
  • Full smoke (pytest -m "not slow and not integration"): 5938 passed, 23 skipped.
  • Verified both gates fail when reverted (deterministic test sees leaked sampling; override test sees forced 0.0).

…ngs opt-in

The accuracy benchmark ran every model at temperature 0, ignoring a model's
configured sampling. Making per-model temperature win by default would make
saved scores non-deterministic. Instead add a sampling_profile choice — default
"deterministic" (reproducible greedy) with "model_settings" as an explicit
opt-in that honors the model's temperature/top_p/penalties.

- accuracy_benchmark.py: sampling_profile on the request (Literal, default
  "deterministic"); read per-model sampling only under "model_settings",
  including temperature.
- eval/base.py: setdefault the benchmark's neutral temperature/presence/
  repetition values so the opt-in profile can override them; max_tokens stays
  benchmark-controlled.
- Dashboard: a Sampling toggle (Deterministic | Model settings), i18n across
  all locales.
- Tests: profile default/validation, deterministic-vs-model_settings gating,
  and _eval_single setdefault behavior.

Fixes: 606
@JimStenstrom JimStenstrom force-pushed the fix/606-eval-per-model-temperature branch from 52ccd86 to 72cebde Compare June 19, 2026 17:09
@JimStenstrom JimStenstrom changed the title fix(eval): honor per-model temperature in accuracy benchmark fix(eval): deterministic accuracy-benchmark sampling with model_settings opt-in Jun 19, 2026
Development

Successfully merging this pull request may close these issues.

Issue : Gemma 4 26b a4b, bf16 - not matching official Livecode bechmark in omlx 0.34