chore(eval): add rstest-best-practices eval and report by fi3ework · Pull Request #56 · rstackjs/agent-skills

fi3ework · 2026-04-29T07:32:03Z

Mirrors #54.

What to look at

Production change: skills/rstest-best-practices/SKILL.md gains 2 lines under Test-writing: prefer await expect(...).rejects.toThrow() / .resolves.toEqual() over try/catch + expect.fail. Surfaced by eval 2 (fetch-with-retry), the only with_skill failure.
The report describes the pre-commit (131-line) baseline run, not the post-fix skill. "[Done in this commit]" notes in report.md mark the recommendation bundled here. The numbers (74/75 with_skill vs 65/75 baseline, +12pp) were not re-run against the post-fix skill.
The evals.json schema follows #54 (chore(eval): add migrate-to-rstest eval and report): fixture_root / runs_root / runner_instructions / notes / evals[]. The /tmp paths are defaults; the runner agent picks any OS scratch dir per runner_instructions.
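
As a rough illustration of the evals.json layout named above, the sketch below types the five top-level fields and checks that a parsed object contains them. Only the field names come from this PR description. The field types, the EvalCase members, and the missingKeys helper are assumptions made for illustration, not the repository's actual schema.

```typescript
// Assumed shape: only the top-level field names are taken from the PR description.
interface EvalCase {
  id: number; // hypothetical member
  name: string; // hypothetical member, e.g. "fetch-with-retry"
}

interface EvalsFile {
  fixture_root: string; // assumed to be a path, e.g. a /tmp default
  runs_root: string; // assumed to be a path
  runner_instructions: string; // assumed to be free text
  notes: string; // assumed to be free text
  evals: EvalCase[];
}

const REQUIRED_KEYS = [
  'fixture_root',
  'runs_root',
  'runner_instructions',
  'notes',
  'evals',
] as const;

// Hypothetical helper: list the required top-level keys absent from a parsed value.
function missingKeys(value: unknown): string[] {
  if (typeof value !== 'object' || value === null) {
    return [...REQUIRED_KEYS];
  }
  const record = value as Record<string, unknown>;
  return REQUIRED_KEYS.filter((key) => !(key in record));
}

// Example value that satisfies the assumed shape.
const example: EvalsFile = {
  fixture_root: '/tmp/fixtures',
  runs_root: '/tmp/runs',
  runner_instructions: 'Pick any OS scratch dir.',
  notes: '',
  evals: [{ id: 2, name: 'fetch-with-retry' }],
};
```

A runner could call missingKeys(JSON.parse(text)) before starting a run and stop early if the returned list is non-empty.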

Test plan

CI green

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5be7ab879a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 203262f168

Ran the 10-eval suite from evals/evals.json against the current SKILL.md (with_skill) and a no-skill baseline (without_skill), 1 sample per cell on Sonnet 4.6. Aggregate: with_skill 74/75 (98.7%), without_skill 65/75 (86.7%), a gain of 12 percentage points. The skill also cuts mean tokens by ~17% and mean wall time by ~26%.

The largest gap is in browser mode (eval 5, react-dropdown-browser-mode: 8/8 vs 3/8). Without the skill, the model writes JSDOM-style tests (querySelector + dispatchEvent KeyboardEvent + document.activeElement) in a real-Chromium fixture, missing every benefit of @rstest/browser, the Locator API, and expect.element web-first retry.

Also tightens the Test-writing section to prefer await expect(fn()).rejects.toThrow() / .resolves.toEqual() over try/catch + expect.fail or .catch(e => e) patterns. This was surfaced by the only with_skill failure (eval 2, fetch-with-retry).
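
To make the Test-writing recommendation concrete, here is a minimal sketch contrasting the two styles in an Rstest test file. It assumes the test and expect exports from @rstest/core. The fetchWithRetry function and the Fetcher type are hypothetical stand-ins written for this example, not the fixture used in eval 2.

```typescript
import { expect, test } from '@rstest/core';

type Fetcher = () => Promise<string>;

// Hypothetical code under test: call `fetcher` up to `attempts` times
// and rethrow the last error if every attempt fails.
async function fetchWithRetry(fetcher: Fetcher, attempts: number): Promise<string> {
  let lastError: unknown = new Error('attempts must be at least 1');
  for (let i = 0; i < attempts; i++) {
    try {
      return await fetcher();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}

// Discouraged style: the failure from expect.fail lands in the same catch
// block as the real rejection, so a missing rejection surfaces as a
// confusing message mismatch.
test('rejects after exhausting retries (try/catch style)', async () => {
  const alwaysFails: Fetcher = async () => {
    throw new Error('network down');
  };
  try {
    await fetchWithRetry(alwaysFails, 3);
    expect.fail('expected fetchWithRetry to reject');
  } catch (error) {
    expect((error as Error).message).toBe('network down');
  }
});

// Preferred style: the assertion itself awaits the rejection.
test('rejects after exhausting retries', async () => {
  let calls = 0;
  const alwaysFails: Fetcher = async () => {
    calls += 1;
    throw new Error('network down');
  };
  await expect(fetchWithRetry(alwaysFails, 3)).rejects.toThrow('network down');
  expect(calls).toBe(3);
});

// Preferred style for the success path.
test('resolves once an attempt succeeds', async () => {
  let calls = 0;
  const flaky: Fetcher = async () => {
    calls += 1;
    if (calls < 2) throw new Error('transient');
    return 'ok';
  };
  await expect(fetchWithRetry(flaky, 3)).resolves.toEqual('ok');
  expect(calls).toBe(2);
});
```

Note that the promise is passed to expect directly and the whole expression is awaited. Omitting the await would let the test finish before the assertion settles.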

fi3ework · 2026-04-29T09:33:43Z

@codex review

chatgpt-codex-connector · 2026-04-29T09:36:13Z

Codex Review: Didn't find any major issues. Can't wait for the next one!

fi3ework force-pushed the chore/rstest-best-practices-eval branch 2 times, most recently from 73d6aa2 to 5be7ab8 (April 29, 2026 07:44)

chatgpt-codex-connector Bot reviewed Apr 29, 2026

Comment thread skills-test/rstest-best-practices/report.md (outdated)

fi3ework force-pushed the chore/rstest-best-practices-eval branch from 5be7ab8 to 203262f (April 29, 2026 09:01)

chatgpt-codex-connector Bot reviewed Apr 29, 2026

Comment thread skills-test/rstest-best-practices/evals/evals.json (outdated)

fi3ework force-pushed the chore/rstest-best-practices-eval branch from 203262f to 77c7296 (April 29, 2026 09:23)

SoonIter enabled auto-merge (squash) on April 29, 2026 10:02

SoonIter approved these changes Apr 29, 2026

SoonIter merged commit 5beac9c into main Apr 29, 2026
4 checks passed

SoonIter deleted the chore/rstest-best-practices-eval branch on April 29, 2026 10:02
