chore(eval): add rstest-best-practices eval and report #56
Force-pushed from 73d6aa2 to 5be7ab8.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5be7ab879a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from 5be7ab8 to 203262f.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 203262f168
Ran the 10-eval suite from evals/evals.json against the current SKILL.md (with_skill) and a no-skill baseline (without_skill), 1 sample per cell on Sonnet 4.6.

Aggregate: with_skill 74/75 (98.7%), without_skill 65/75 (86.7%), a gain of 12 percentage points. The skill also cuts mean tokens by ~17% and mean wall time by ~26%.

The largest gap is in browser mode (eval 5, react-dropdown-browser-mode: 8/8 vs 3/8). Without the skill, the model writes JSDOM-style tests (querySelector + dispatchEvent KeyboardEvent + document.activeElement) in a real-Chromium fixture, missing every benefit of @rstest/browser, the Locator API, and expect.element web-first retry.

Also tightens the Test-writing section to prefer await expect(fn()).rejects.toThrow() / .resolves.toEqual() over try/catch + expect.fail or .catch(e => e) patterns. This was surfaced by the only with_skill failure (eval 2, fetch-with-retry).
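To illustrate why the rejects form is the safer default, here is a minimal framework-free sketch. It does not use rstest itself: fetchWithRetry, tryCatchCheck, and rejectsCheck are hypothetical names invented for this example, and rejectsCheck is only a hand-rolled stand-in for what await expect(fn()).rejects.toThrow() checks.

```typescript
// Hypothetical function under test (not from the PR): rejects only when asked to.
async function fetchWithRetry(shouldFail: boolean): Promise<string> {
  if (shouldFail) throw new Error("retries exhausted");
  return "ok";
}

// The discouraged shape: try/catch where the happy path has no explicit failure.
// If the call resolves, the catch block never runs and the check passes vacuously.
async function tryCatchCheck(call: () => Promise<unknown>): Promise<boolean> {
  try {
    await call();
    // An expect.fail(...) belongs here and is easy to forget.
  } catch (e) {
    return e instanceof Error && /retries exhausted/.test(e.message);
  }
  return true; // nothing was thrown, yet the check still "passes"
}

// Stand-in for `await expect(call()).rejects.toThrow(pattern)`:
// a promise that resolves is itself treated as a failure.
function rejectsCheck(
  call: () => Promise<unknown>,
  pattern: RegExp,
): Promise<boolean> {
  return call().then(
    () => false,
    (e: unknown) => e instanceof Error && pattern.test(e.message),
  );
}
```

With a call that never rejects, tryCatchCheck still reports success while rejectsCheck reports failure, which is the gap the SKILL.md change is meant to close. In an actual test file the second helper would be replaced by the assertion form named above, for example await expect(fetchWithRetry(true)).rejects.toThrow(/retries exhausted/).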
Force-pushed from 203262f to 77c7296.
@codex review
Codex Review: Didn't find any major issues. Can't wait for the next one!
Mirrors #54.
What to look at
- skills/rstest-best-practices/SKILL.md: +2 lines under Test-writing. Prefer await expect(...).rejects.toThrow() / .resolves.toEqual() over try/catch + expect.fail. Surfaced by eval 2 (fetch-with-retry), the only with_skill failure.
- report.md: marks the recommendation as bundled here. Numbers (74/75 with_skill vs 65/75 baseline, +12pp) are not re-run.
- evals.json: schema follows chore(eval): add migrate-to-rstest eval and report #54: fixture_root / runs_root / runner_instructions / notes / evals[]. /tmp paths are defaults; the runner agent picks any OS scratch dir per runner_instructions.
Test plan