Skip to content

fix(skills): guard one-time setup with try/catch (best-effort instead of fatal)#171

Open
lxistired wants to merge 1 commit into
OpenCoworkAI:mainfrom
lxistired:fix/skills-missing-dir-enoent
Open

fix(skills): guard one-time setup with try/catch (best-effort instead of fatal)#171
lxistired wants to merge 1 commit into
OpenCoworkAI:mainfrom
lxistired:fix/skills-missing-dir-enoent

Conversation

@lxistired

Copy link
Copy Markdown

Problem

Skills directory setup runs once per ClaudeAgentRunner instance and currently lets any filesystem error abort chat startup. On fresh-install machines — Windows especially — we see ENOENT/EPERM races during the initial mkdir/symlink/copy work. A single broken built-in skill or a partially-created ~/.claude config dir prevents the user from sending any message at all.

Repro: install on a Windows box that has never run any Claude-related app, the very first chat fails because ~/.claude doesn't exist yet and one of the per-skill copy ops races against the parent dir creation.

Fix

Three layers of defensive try/catch plus one defensive parent mkdir:

  1. Outer try around the whole one-time setup block. mkdir/readdir failures no longer bubble up — they get logged and the chat continues (without skills, but functional).
  2. Per-skill try/catch in the built-in copy loop. One bad skill directory doesn't skip the rest.
  3. Individual try/catches around syncUserSkillsToAppDir and syncConfiguredSkillsToRuntimeDir — they're independent failure modes (one walks ~/.claude/skills, the other walks the configured global skills dir) and shouldn't take each other down.
  4. Defensive path.dirname(userSkillPath) mkdir before each per-skill copy/symlink — Windows fresh-install races and reports the parent missing even after the runtime skills dir was created a few statements above.

All failures go through logWarn (already imported), no behavior change on healthy machines.

Files

  • src/main/claude/agent-runner.ts — wrap one-time setup block (effective change ~29 lines via git diff -w; the larger raw diff is just indentation churn)

Test plan

  • Fresh Windows install (no ~/.claude) — chat starts, skills set up, no crash
  • Skills setup with one corrupted built-in skill — other skills still register
  • Same skills work as before on machines where setup already succeeded (no behavior change)
  • ~/.claude/skills doesn't exist — syncUserSkillsToAppDir early-returns (existing behavior preserved)

Skills directory setup runs once per runner instance and currently lets
any filesystem error abort the chat startup. On fresh-install machines
(Windows especially) we see ENOENT/EPERM races during mkdir/symlink/copy
ops — a single broken built-in skill or a partially-created
~/.claude config dir blocks the user from sending any message.

Wrap with three layers of defensive try/catch:

1. Outer try around the whole one-time setup block (so mkdir/readdir
   failures don't propagate up).
2. Per-skill try/catch in the built-in copy loop (so one bad skill
   directory doesn't skip the rest).
3. Individual try/catches around syncUserSkillsToAppDir and
   syncConfiguredSkillsToRuntimeDir (independent failure modes).

Also added a defensive parent mkdir before each per-skill copy/symlink —
Windows fresh-install occasionally races and reports the parent
missing even though we created the runtime skills dir a few lines up.

Failures are logged via logWarn (non-fatal) and the chat continues
without skills support.
@hqhq1025 hqhq1025 added bot-rerun Temporary label for rerunning bot automation and removed bot-rerun Temporary label for rerunning bot automation labels Apr 30, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Transient setup failures become sticky for the lifetime of the runner. The new non-fatal outer catch means mkdirSync/readdirSync races now fall through, but _skillsSetupDone was already latched before the try, so later queries will skip setup entirely until some unrelated invalidation path runs. That leaves the Windows race described in the PR as a permanent loss of built-in/user skills for the rest of the chat session instead of recovering on the next message, evidence src/main/claude/agent-runner.ts:1496 with setup latch context at src/main/claude/agent-runner.ts:1408.

Summary

Review mode: initial

  • 1 finding.
  • No targeted regression test was found under src/tests/ for retrying skills setup after a transient failure.

Testing

  • Not run (automation)

Open Cowork Bot

} catch (e) {
logWarn('[ClaudeAgentRunner] syncConfiguredSkillsToRuntimeDir failed:', e);
}
} catch (setupErr) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This catch makes skills setup non-fatal, but _skillsSetupDone was already set to true earlier in the same block. If mkdirSync/readdirSync hits the transient Windows race described in the PR, every later query skips setup entirely until an explicit invalidateSkillsSetup() call, so the failure becomes sticky for the lifetime of this runner instead of recovering on the next message.

Suggested fix:

} catch (setupErr) {
  this._skillsSetupDone = false;
  logWarn('[ClaudeAgentRunner] Skills setup failed (non-fatal):', setupErr);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants