fix(eval): deterministic accuracy-benchmark sampling with model_settings opt-in #1722
Open
JimStenstrom wants to merge 1 commit into
…ngs opt-in

The accuracy benchmark ran every model at temperature 0, ignoring a model's configured sampling. Making per-model temperature win by default would make saved scores non-deterministic. Instead add a sampling_profile choice: default "deterministic" (reproducible greedy) with "model_settings" as an explicit opt-in that honors the model's temperature/top_p/penalties.

- accuracy_benchmark.py: sampling_profile on the request (Literal, default "deterministic"); read per-model sampling only under "model_settings", including temperature.
- eval/base.py: setdefault the benchmark's neutral temperature/presence/repetition values so the opt-in profile can override them; max_tokens stays benchmark-controlled.
- Dashboard: a Sampling toggle (Deterministic | Model settings), i18n across all locales.
- Tests: profile default/validation, deterministic-vs-model_settings gating, and _eval_single setdefault behavior.

Fixes: 606
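The profile gating described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `accuracy_benchmark.py`: `ModelSamplingSettings` and `build_sampling_kwargs` are hypothetical names, and the assumption that unset values are represented as `None` is mine.

```python
from dataclasses import asdict, dataclass
from typing import Literal, Optional, get_args

SamplingProfile = Literal["deterministic", "model_settings"]


@dataclass
class ModelSamplingSettings:
    """Hypothetical stand-in for a model's configured sampling values."""

    temperature: Optional[float] = None
    top_p: Optional[float] = None
    presence_penalty: Optional[float] = None
    repetition_penalty: Optional[float] = None


def build_sampling_kwargs(profile: str, ms: ModelSamplingSettings) -> dict:
    """Return the sampling overrides to forward to the benchmark run."""
    if profile not in get_args(SamplingProfile):
        raise ValueError(f"unknown sampling_profile: {profile!r}")
    if profile == "deterministic":
        # Forward nothing, so the benchmark's own greedy defaults apply.
        return {}
    # "model_settings": forward only the values the model actually configures.
    return {key: value for key, value in asdict(ms).items() if value is not None}
```

Under this sketch, a caller that omits the profile and falls back to `"deterministic"` always gets an empty override dict, which is what keeps saved scores independent of per-model settings.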
Force-pushed from 52ccd86 to 72cebde.
Fixes #606.
Supersedes this PR's original approach (per-model temperature winning by default), which would make saved benchmark scores non-deterministic. This adopts the design @jundot proposed on #1254:
Summary
- Add `sampling_profile` to the accuracy-benchmark request: `deterministic` (default, reproducible greedy) or `model_settings` (opt-in, honors the model's configured sampling).

Why
`eval/base.py` force-set `temperature=0.0` (and neutral penalties) and `accuracy_benchmark.py` never read `ms.temperature`, so temperature-sensitive benchmarks scored a model far from its real configuration. But simply honoring temperature by default makes saved scores depend on per-model settings, which is exactly the non-determinism @jundot flagged. The `sampling_profile` toggle resolves both: reproducible by default, real-world sampling on request.

Changes
- `omlx/admin/accuracy_benchmark.py`: `sampling_profile: Literal["deterministic", "model_settings"] = "deterministic"` on `AccuracyBenchmarkRequest`; the per-model sampling read is gated on `model_settings` and now includes `temperature`.
- `omlx/eval/base.py`: `_eval_single` uses `setdefault` for `temperature`/`presence_penalty`/`repetition_penalty` (previously force-set) so the opt-in profile can supply them; `max_tokens` stays benchmark-forced (a model's small limit must not truncate answers).
- Dashboard: a Sampling toggle (Deterministic | Model settings) in `_bench_accuracy.html` + `dashboard.js`; i18n keys added to `en.json` and propagated to all locales via `scripts/normalize_i18n.py`. No new Tailwind classes (reuses the toggle's classes), so no CSS rebuild.

Behavior / compatibility
- `deterministic` is score-identical to today (main already force-set temp=0; top_p/top_k have no effect under greedy decoding, and penalties were already neutralized in `_eval_single`).
- API callers that omit the field get `deterministic`.
- `ModelSettings` is untouched: no new persisted field, no migration.

Sibling sweep
- The per-model sampling read (`accuracy_benchmark.py`) and the eval defaults (`eval/base.py`): `setdefault` applied to all three forced sampling params for consistency, not just temperature.
- `_eval_single` is shared by every benchmark subclass (mmlu/gsm8k/humaneval/mbpp/livecodebench); the `setdefault` change applies uniformly to all of them.
- The throughput benchmark (`omlx/admin/benchmark.py`) reads per-model settings only for experimental-feature metadata, not sampling. It is intentionally out of scope (it measures throughput, not accuracy, so determinism doesn't apply).
- `enable_thinking` stays orthogonal (its own field, merged into `chat_template_kwargs` regardless of profile).

Tests
- `sampling_profile` default is `deterministic`; `model_settings` accepted; invalid value rejected.
- `deterministic` forwards no sampling (`sampling_kwargs == {}`); `model_settings` forwards the model's `temperature`/`top_p`.
- `_eval_single`: greedy defaults when empty, caller sampling overrides them, `max_tokens` always benchmark-controlled.
- Test run (`pytest -m "not slow and not integration"`): 5938 passed, 23 skipped.
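The `_eval_single` behavior these tests cover (defaults when empty, caller overrides, forced `max_tokens`) can be illustrated with a small sketch. It is not the code in `eval/base.py`: `resolve_eval_kwargs` is a hypothetical helper, and the neutral values (`0.0` presence penalty, `1.0` repetition penalty) and the `2048` token limit are assumptions chosen for illustration.

```python
# Assumed benchmark-controlled limit; the real value lives in the benchmark code.
BENCHMARK_MAX_TOKENS = 2048


def resolve_eval_kwargs(sampling_kwargs: dict,
                        max_tokens: int = BENCHMARK_MAX_TOKENS) -> dict:
    """Merge caller sampling with the benchmark's greedy defaults."""
    kwargs = dict(sampling_kwargs)  # copy so the caller's dict is not mutated

    # setdefault only fills a key when the caller did not supply it, so the
    # "model_settings" profile can override each of these three values.
    kwargs.setdefault("temperature", 0.0)         # greedy decoding
    kwargs.setdefault("presence_penalty", 0.0)    # assumed neutral value
    kwargs.setdefault("repetition_penalty", 1.0)  # assumed neutral value

    # max_tokens is assigned, not defaulted: a model's small limit must not
    # truncate benchmark answers.
    kwargs["max_tokens"] = max_tokens
    return kwargs
```

The design point is the difference between `setdefault` and plain assignment: the three sampling values yield to whatever the caller passes, while `max_tokens` always wins regardless of caller input.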