fix(skills): guard one-time setup with try/catch (best-effort instead of fatal)#171
Open
lxistired wants to merge 1 commit into
Open
fix(skills): guard one-time setup with try/catch (best-effort instead of fatal)#171lxistired wants to merge 1 commit into
lxistired wants to merge 1 commit into
Conversation
Skills directory setup runs once per runner instance and currently lets any filesystem error abort the chat startup. On fresh-install machines (Windows especially) we see ENOENT/EPERM races during mkdir/symlink/copy ops — a single broken built-in skill or a partially-created ~/.claude config dir blocks the user from sending any message. Wrap with three layers of defensive try/catch: 1. Outer try around the whole one-time setup block (so mkdir/readdir failures don't propagate up). 2. Per-skill try/catch in the built-in copy loop (so one bad skill directory doesn't skip the rest). 3. Individual try/catches around syncUserSkillsToAppDir and syncConfiguredSkillsToRuntimeDir (independent failure modes). Also added a defensive parent mkdir before each per-skill copy/symlink — Windows fresh-install occasionally races and reports the parent missing even though we created the runtime skills dir a few lines up. Failures are logged via logWarn (non-fatal) and the chat continues without skills support.
There was a problem hiding this comment.
Findings
- [Major] Transient setup failures become sticky for the lifetime of the runner. The new non-fatal outer catch means
mkdirSync/readdirSyncraces now fall through, but_skillsSetupDonewas already latched before thetry, so later queries will skip setup entirely until some unrelated invalidation path runs. That leaves the Windows race described in the PR as a permanent loss of built-in/user skills for the rest of the chat session instead of recovering on the next message, evidencesrc/main/claude/agent-runner.ts:1496with setup latch context atsrc/main/claude/agent-runner.ts:1408.
Summary
Review mode: initial
- 1 finding.
- No targeted regression test was found under
src/tests/for retrying skills setup after a transient failure.
Testing
- Not run (automation)
Open Cowork Bot
| } catch (e) { | ||
| logWarn('[ClaudeAgentRunner] syncConfiguredSkillsToRuntimeDir failed:', e); | ||
| } | ||
| } catch (setupErr) { |
There was a problem hiding this comment.
[MAJOR] This catch makes skills setup non-fatal, but _skillsSetupDone was already set to true earlier in the same block. If mkdirSync/readdirSync hits the transient Windows race described in the PR, every later query skips setup entirely until an explicit invalidateSkillsSetup() call, so the failure becomes sticky for the lifetime of this runner instead of recovering on the next message.
Suggested fix:
} catch (setupErr) {
this._skillsSetupDone = false;
logWarn('[ClaudeAgentRunner] Skills setup failed (non-fatal):', setupErr);
}
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Skills directory setup runs once per
ClaudeAgentRunnerinstance and currently lets any filesystem error abort chat startup. On fresh-install machines — Windows especially — we see ENOENT/EPERM races during the initialmkdir/symlink/copywork. A single broken built-in skill or a partially-created~/.claudeconfig dir prevents the user from sending any message at all.Repro: install on a Windows box that has never run any Claude-related app, the very first chat fails because
~/.claudedoesn't exist yet and one of the per-skill copy ops races against the parent dir creation.Fix
Three layers of defensive
try/catchplus one defensive parentmkdir:syncUserSkillsToAppDirandsyncConfiguredSkillsToRuntimeDir— they're independent failure modes (one walks~/.claude/skills, the other walks the configured global skills dir) and shouldn't take each other down.path.dirname(userSkillPath)mkdir before each per-skill copy/symlink — Windows fresh-install races and reports the parent missing even after the runtime skills dir was created a few statements above.All failures go through
logWarn(already imported), no behavior change on healthy machines.Files
src/main/claude/agent-runner.ts— wrap one-time setup block (effective change ~29 lines viagit diff -w; the larger raw diff is just indentation churn)Test plan
~/.claude) — chat starts, skills set up, no crash~/.claude/skillsdoesn't exist —syncUserSkillsToAppDirearly-returns (existing behavior preserved)