fix(eval): deterministic accuracy-benchmark sampling with model_settings opt-in by JimStenstrom · Pull Request #1722 · jundot/omlx

JimStenstrom · 2026-06-07T16:15:17Z

Fixes #606.

Supersedes this PR's original approach (per-model temperature winning by default), which would make saved benchmark scores non-deterministic. This adopts the design @jundot proposed on #1254:

"Accuracy sampling: the default moved to model_settings, which makes scores non-deterministic when a model has a non-zero temperature. Would you be open to defaulting sampling_profile to deterministic (with model_settings as opt-in) so saved scores stay reproducible?"

Summary

Add sampling_profile to the accuracy-benchmark request: deterministic (default, reproducible greedy) or model_settings (opt-in, honors the model's configured sampling).
#606 (Gemma 4 26b a4b, bf16 - not matching official LiveCodeBench benchmark in omlx 0.34): the root cause, the benchmark ignoring a model's temperature (Gemma 4 26B scoring ~47% vs ~77% on LiveCodeBench), is fixed via the opt-in, while default runs stay reproducible.

Why

eval/base.py force-set temperature=0.0 (and neutral penalties) and accuracy_benchmark.py never read ms.temperature, so temperature-sensitive benchmarks scored a model far from its real configuration. But simply honoring temperature by default makes saved scores depend on per-model settings — exactly the non-determinism @jundot flagged. The sampling_profile toggle resolves both: reproducible by default, real-world sampling on request.

Changes

omlx/admin/accuracy_benchmark.py: sampling_profile: Literal["deterministic", "model_settings"] = "deterministic" on AccuracyBenchmarkRequest; the per-model sampling read is gated on model_settings and now includes temperature.
omlx/eval/base.py: _eval_single uses setdefault for temperature/presence_penalty/repetition_penalty (were force-set) so the opt-in profile can supply them; max_tokens stays benchmark-forced (a model's small limit must not truncate answers).
Dashboard: a "Sampling" toggle cloned from the existing Thinking Mode control (_bench_accuracy.html + dashboard.js); i18n keys added to en.json and propagated to all locales via scripts/normalize_i18n.py. No new tailwind classes (reuses the toggle's classes) → no CSS rebuild.
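
The request-side gate can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in omlx/admin/accuracy_benchmark.py: `ModelSettingsStub`, `validate_profile`, and `build_sampling_kwargs` are hypothetical names, and the real request class may validate the field differently.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

SamplingProfile = Literal["deterministic", "model_settings"]


@dataclass
class ModelSettingsStub:
    """Hypothetical stand-in for a model's configured sampling values."""
    temperature: Optional[float] = None
    top_p: Optional[float] = None


def validate_profile(value: str = "deterministic") -> str:
    """Reject anything outside the two allowed profiles."""
    if value not in get_args(SamplingProfile):
        raise ValueError(f"invalid sampling_profile: {value!r}")
    return value


def build_sampling_kwargs(profile: str, ms: ModelSettingsStub) -> dict:
    """Forward per-model sampling only under the opt-in profile."""
    if validate_profile(profile) == "deterministic":
        return {}  # nothing forwarded; the eval's greedy defaults apply
    # Opt-in: forward only the values the model actually configures.
    return {k: v for k, v in vars(ms).items() if v is not None}
```

With a stub configured at temperature 0.7, the deterministic profile forwards an empty dict, while model_settings forwards the temperature and any other configured value.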

Behavior / compatibility

Default deterministic is score-identical to today (main already force-set temp=0; top_p/top_k are vestigial at greedy and penalties were already neutralized in _eval_single). API callers that omit the field get deterministic.
ModelSettings is untouched — no new persisted field, no migration.
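
Why the default stays score-identical can be seen in a small sketch of the setdefault merge. The helper name `apply_eval_defaults` is hypothetical, and the neutral penalty values used here (presence_penalty 0.0, repetition_penalty 1.0) are assumptions rather than values taken from eval/base.py.

```python
def apply_eval_defaults(sampling_kwargs: dict, benchmark_max_tokens: int) -> dict:
    """Merge caller-supplied sampling with greedy defaults (hypothetical helper)."""
    kwargs = dict(sampling_kwargs)  # copy so the caller's dict is not mutated
    # setdefault: the opt-in profile may supply these; otherwise stay greedy.
    kwargs.setdefault("temperature", 0.0)
    kwargs.setdefault("presence_penalty", 0.0)    # assumed neutral value
    kwargs.setdefault("repetition_penalty", 1.0)  # assumed neutral value
    # Forced: a model's small limit must not truncate benchmark answers.
    kwargs["max_tokens"] = benchmark_max_tokens
    return kwargs
```

An empty input yields the same values a force-set would have produced, which is the deterministic case; a caller-supplied temperature survives the merge, while a caller-supplied max_tokens is always overwritten.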

Sibling sweep

Both sampling paths covered: the request gate (accuracy_benchmark.py) and the eval defaults (eval/base.py) — setdefault applied to all three forced sampling params for consistency, not just temperature.
_eval_single is shared by every benchmark subclass (mmlu/gsm8k/humaneval/mbpp/livecodebench); the setdefault change applies uniformly to all of them.
The performance benchmark (omlx/admin/benchmark.py) reads per-model settings only for experimental-feature metadata, not sampling — intentionally out of scope (it measures throughput, not accuracy, so determinism doesn't apply).
UI reachable + localized (toggle + 8 locales). enable_thinking stays orthogonal (its own field, merged into chat_template_kwargs regardless of profile).
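
The orthogonality of enable_thinking can be illustrated as follows. This is a sketch only: `build_request_kwargs` is a hypothetical helper, and treating enable_thinking as an optional boolean is an assumption about the request shape.

```python
from typing import Optional


def build_request_kwargs(
    profile: str,
    model_sampling: dict,
    enable_thinking: Optional[bool] = None,
) -> dict:
    """Combine profile-gated sampling with the independent thinking flag."""
    # Sampling is forwarded only under the opt-in profile.
    kwargs = dict(model_sampling) if profile == "model_settings" else {}
    # enable_thinking is merged the same way under either profile.
    if enable_thinking is not None:
        kwargs["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return kwargs
```

Switching the profile changes only the sampling keys; the chat_template_kwargs entry is identical in both cases.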

Tests

python -m pytest tests/test_accuracy_benchmark.py tests/test_eval.py -q   # 93 passed

Request: sampling_profile default is deterministic; model_settings accepted; invalid value rejected.
Gating: deterministic forwards no sampling (sampling_kwargs == {}); model_settings forwards the model's temperature/top_p.
_eval_single: greedy defaults when empty, caller sampling overrides them, max_tokens always benchmark-controlled.
Full smoke (pytest -m "not slow and not integration"): 5938 passed, 23 skipped.
Verified both gates fail when reverted (deterministic test sees leaked sampling; override test sees forced 0.0).

…ngs opt-in

The accuracy benchmark ran every model at temperature 0, ignoring a model's configured sampling. Making per-model temperature win by default would make saved scores non-deterministic. Instead add a sampling_profile choice: default "deterministic" (reproducible greedy) with "model_settings" as an explicit opt-in that honors the model's temperature/top_p/penalties.

- accuracy_benchmark.py: sampling_profile on the request (Literal, default "deterministic"); read per-model sampling only under "model_settings", including temperature.
- eval/base.py: setdefault the benchmark's neutral temperature/presence/repetition values so the opt-in profile can override them; max_tokens stays benchmark-controlled.
- Dashboard: a Sampling toggle (Deterministic | Model settings), i18n across all locales.
- Tests: profile default/validation, deterministic-vs-model_settings gating, and _eval_single setdefault behavior.

Fixes: 606

JimStenstrom force-pushed the fix/606-eval-per-model-temperature branch from 52ccd86 to 72cebde on June 19, 2026 17:09

JimStenstrom changed the title ~~fix(eval): honor per-model temperature in accuracy benchmark~~ fix(eval): deterministic accuracy-benchmark sampling with model_settings opt-in Jun 19, 2026

JimStenstrom wants to merge 1 commit into jundot:main from JimStenstrom:fix/606-eval-per-model-temperature
