chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24#2484
chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24#2484
Conversation
Bump the pinned OpenClaw version across all version-tracking files (Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests) to the latest stable release.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughUpdates OpenClaw from version 2026.4.9 to 2026.4.24 across build configuration, manifests, and tests. Introduces plugin runtime dependencies cache directory with proper permissions and group configuration. Implements new config writing API with sandbox error handling for read-only environments. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt a single-key include-file mutation (tryWriteSingleTopLevelIncludeMutation) before falling back to writeConfigFile. Both paths can EACCES in the read-only sandbox. Update the pattern match to wrap the entire write block in the OPENSHELL_SANDBOX-gated try/catch.
Capture the SSH-shell environment (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, OPENCLAW_GATEWAY_URL/TOKEN, OPENSHELL_SANDBOX, NVIDIA_API_KEY) before the agent invocation, and bump the failure-message capture from head -3 to head -20 so the full reply (including any gateway/embedded fallback errors) shows in CI logs. Diagnostic-only — no behavior change.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Line 282: The diag_env diagnostic line leaks secrets by expanding the token
values; replace the unsafe expansions
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` and the
analogous `NVIDIA_API_KEY` expansion in the sandbox_exec invocation so they
never emit the variable contents, and instead emit only the literal "set" or
"unset"; implement this by checking each variable's presence (e.g., an explicit
conditional or test for non-empty) and printing "set" when present or "unset"
when not, updating the diag_env/sandbox_exec call accordingly to reference
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY securely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5161bcbc-13b7-4cd0-8a9d-5d0f0d383403
📒 Files selected for processing (1)
test/e2e/test-sandbox-operations.sh
OpenClaw 2026.4.24 lazy-installs bundled plugin runtime dependencies into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first CLI invocation (Jiti-based loader, "lazy provider dependencies" in 2026.4.20+ release notes). NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled plugin (nvidia, openai, anthropic, ollama, ...) failed to load with EACCES, leaving `openclaw agent` with zero providers — the exact symptom in TC-SBX-02 (no agent reply, only proxy warnings). Mirror the existing .openclaw-data symlink pattern: create the dir in the writable data tree and symlink it from the immutable config tree. Add to both Dockerfile.base (canonical setup) and Dockerfile (idempotent fixup for stale GHCR bases).
…load OpenClaw 2026.4.24+ lazy-installs and Jiti-compiles ~50 bundled plugin runtime deps on the first agent invocation in a fresh sandbox. Even with deps pre-cached at build time, the plugin registry bootstrap + provider warmup + LLM round-trip on the first call can exceed the existing 60s SSH timeout (was completing in ~20s on 2026.4.9). Make sandbox_exec_for accept an optional timeout argument (default 60, preserves all other call sites) and have TC-SBX-02 pass 240s. The openclaw agent CLI's own --timeout default is 600s so 240s leaves plenty of headroom for the inference call itself.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/e2e/test-sandbox-operations.sh (1)
286-286:⚠️ Potential issue | 🔴 CriticalSensitive values can still be exposed in diagnostics.
Line 286 uses
${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}(and the same forNVIDIA_API_KEY), which includes the secret value when set. This can leak credentials into CI logs.🔧 Proposed fix
- diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}; echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=${NVIDIA_API_KEY:+set}${NVIDIA_API_KEY:-unset}' 2>&1) || true + diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=$([ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] && echo set || echo unset); echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=$([ -n "${NVIDIA_API_KEY:-}" ] && echo set || echo unset)' 2>&1) || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-sandbox-operations.sh` at line 286, The diagnostic command leaks secret values because `${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the NVIDIA_API_KEY variant) concatenates "set" with the actual secret; change the diagnostic to print only "set" or "unset" without expanding the value by replacing those expansions with a conditional-only check (e.g., use a single parameter expansion or an explicit test) inside the sandbox_exec invocation so OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY are never interpolated into the logged string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Line 286: The diagnostic command leaks secret values because
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the
NVIDIA_API_KEY variant) concatenates "set" with the actual secret; change the
diagnostic to print only "set" or "unset" without expanding the value by
replacing those expansions with a conditional-only check (e.g., use a single
parameter expansion or an explicit test) inside the sandbox_exec invocation so
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY are never interpolated into the logged
string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: acfac00c-0120-4ef6-ac19-94ac3a5d1d09
📒 Files selected for processing (1)
test/e2e/test-sandbox-operations.sh
Add gateway to the sandbox supplementary group and set 2775 (setgid + group-write) on /sandbox/.openclaw-data/plugin-runtime-deps. OpenClaw 2026.4.24+ runs its plugin loader on both the sandbox-side CLI and the gateway side; both paths call withBundledRuntimeDepsInstallRootLock, which mkdirSyncs a lock dir under the install root. The original NemoClaw user-isolation design has gateway and sandbox in different primary groups so the sandbox user cannot tamper with the gateway. Before 2026.4.24 the plugin loader did not need write access from the gateway side; now it does, and EACCES on the lock dir caused the gateway to fail mid-request, leaving the agent CLI hanging silently on the unanswered WebSocket call. Adding gateway to sandbox as a supplementary group preserves the original boundary (sandbox still cannot affect gateway-owned resources) and only opens gateway → sandbox-owned shared cache. Setgid bit ensures new files created by either user inherit the sandbox group. Mirrored in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base images).
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Dockerfile (1)
186-187: Pattern matching in minified JS is fragile.The Python patch uses exact string matching including literal tabs (
\t) and newlines (\n). Minified JavaScript bundles often vary in whitespace formatting between versions or build environments. The assertionassert old in srcwill fail-close (which is good), but consider:
- The pattern assumes specific formatting that may not survive re-minification
- Upstream OpenClaw version bumps could silently change whitespace
The fail-close behavior is correct — the build aborts if the pattern isn't found. However, when this inevitably breaks on a future OpenClaw bump, debugging the exact whitespace mismatch will be tedious.
💡 Alternative: Consider regex-based patching for resilience
A more robust approach would use regex matching that's whitespace-tolerant:
import re pattern = re.compile( r'if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)\s*await\s+writeConfigFile\s*\([^;]+\);', re.DOTALL )This would survive minor formatting changes. However, the current exact-match approach is acceptable given the fail-close assertion — just be prepared for patch maintenance on version bumps.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 186 - 187, The current Python one-liner patches the minified JS by exact string match of the tryWriteSingleTopLevelIncludeMutation/writeConfigFile block (the variables old/new and the assert old in src), which is fragile against whitespace/minification changes; change the script to use a regex-based, whitespace-tolerant search (e.g., compile a pattern that matches the if(!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block with \s* and re.DOTALL) and perform a re.sub to inject the new try { ... } catch(...) wrapper, then update the assertion to check the regex matched (or that the file changed) instead of relying on the literal old string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@Dockerfile`:
- Around line 186-187: The current Python one-liner patches the minified JS by
exact string match of the tryWriteSingleTopLevelIncludeMutation/writeConfigFile
block (the variables old/new and the assert old in src), which is fragile
against whitespace/minification changes; change the script to use a regex-based,
whitespace-tolerant search (e.g., compile a pattern that matches the if(!await
tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block
with \s* and re.DOTALL) and perform a re.sub to inject the new try { ... }
catch(...) wrapper, then update the assertion to check the regex matched (or
that the file changed) instead of relying on the literal old string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 26c92d4a-9980-47a5-8dc7-a8dc2fab2065
📒 Files selected for processing (2)
DockerfileDockerfile.base
🚧 Files skipped from review as they are similar to previous changes (1)
- Dockerfile.base
…che" This reverts commit 521c599.
The _SANDBOX_SAFETY_NET preload was loaded via NODE_OPTIONS=--require into EVERY Node process in the sandbox, including short-lived CLI commands like `openclaw agent`. It installed an unconditional `unhandledRejection` handler that swallows the rejection — designed to keep the long-running gateway alive across non-fatal library bugs. In OpenClaw 2026.4.9 the agent CLI's code paths didn't trip an unhandled rejection, so the swallow was harmless there. In 2026.4.24 the new plugin loader / gateway client path produces an unhandled rejection from `openclaw agent`. Instead of surfacing as an error, the safety net ate it and the awaited Promise never resolved — leaving the CLI hanging silently on a request that should have failed fast. This is the exact symptom in TC-SBX-02: two UNDICI warnings (process startup) followed by minutes of silence with no error output. Gate the swallow to argv[2] === "gateway" so the protection is scoped to its actual purpose (`openclaw gateway run …`). All other CLI commands (agent, doctor, plugins, tui) get default Node behavior — errors surface and short-lived processes exit cleanly with a meaningful exit code.
…lure TC-SBX-02 hangs without surfacing any error: with the safety-net gate fix, errors should now propagate on the agent CLI side, but we see only Node UNDICI warnings then 60s of silence. The remaining hypothesis is that the gateway-side `agent` method handler hits an error that's swallowed by the gateway's still-active safety net (intentional — keeps gateway alive), leaving the client awaiting a response that never comes. To prove or refute this, the gateway log content during the hang must be visible in the failed test artifact. The test framework captures only the test runner's own log (and the agent CLI's SSH output, which is silent). /tmp/gateway.log inside the sandbox container has the data we need. Two-part diagnostic, not a behavior change: 1. nemoclaw-start.sh: background-tail /tmp/gateway.log with a [gateway-log:] prefix to PID 1's stderr after gateway launch. Each gateway-log line now appears in the container's stderr stream (and is filterable by prefix). Cleanup: tail PID added to SANDBOX_CHILD_PIDS so cleanup_on_signal reaps it on shutdown. Both root and non-root launch paths covered. 2. nightly-e2e.yaml sandbox-operations-e2e: on failure, run `docker logs` on every test-sbx-* container and upload as a separate artifact (sandbox-operations-docker-logs). The artifact will contain the gateway log content (now mirrored to container stderr) at the time of failure. This is a NemoClaw-side and workflow-level change (no test changes — the test contract for TC-SBX-02 is unchanged). The runtime diagnostic is permanent but additive; it can be removed once the upstream root cause is identified and fixed. Ref: #2484
The previous post-failure docker logs capture step ran AFTER the test script's teardown destroyed test sandbox containers — so `docker ps -a` returned no matches and the artifact was empty. Replace with a background `docker logs -f` streamer started before the test runs. As soon as a container appears, its logs stream to a per-container file in docker-logs/. When the container is removed, the stream ends but the file persists on the host. The post-failure artifact upload now captures logs from every container that existed at any point during the test. Combined with the [gateway-log:] mirror in nemoclaw-start.sh, this finally surfaces gateway-side activity (including any sandbox-safety-net swallowed errors) at the time TC-SBX-02 hangs. Ref: #2484
The previous docker-logs streamer hit "configured logging driver does not support reading" for sandbox containers. NemoClaw sandboxes are k3s pods INSIDE the openshell-cluster container, not sibling docker containers — `docker logs` cannot read pod stdio. Switch to `docker exec openshell-cluster-* kubectl logs -f -n openshell <pod> --all-containers` to stream pod logs (which include PID 1's stderr mirror of /tmp/gateway.log via the [gateway-log:] prefix added in nemoclaw-start.sh). Output goes to per-pod files on the host that persist past pod deletion. Ref: #2484
The kubectl-logs streamer also returned empty files because the container log driver in openshell's k3s setup doesn't capture container stdio (same root cause as the docker logs failure). The only working way to read /tmp/gateway.log content from outside the pod is via SSH — which `nemoclaw <sandbox> logs --follow` does internally. Switch the streamer to `nemoclaw <name> logs --follow > docker-logs/sandbox-<name>.log`. The streamer waits for nemoclaw to be installed (test does that in its first phase), polls `nemoclaw list`, and spawns a follower per sandbox. Ref: #2484
The previous `nemoclaw logs --follow` per-sandbox streamer accumulated unbounded output and the artifact upload step never finished within the 60-min job timeout (run 24968594521 was cancelled at 1h+ stuck on Upload sandbox gateway logs). Switch to snapshot mode: every 10s, run `timeout 8 nemoclaw <name> logs` and overwrite docker-logs/sandbox-<name>.log with the result, capped at 256KB. The default `nemoclaw logs` invocation returns ~62 lines (already bounded by /tmp/gateway.log size at snapshot time). When a sandbox is destroyed by the test, the file holds the final pre-destroy snapshot. Ref: #2484
The previous streamer parsed `nemoclaw list` pretty-printed output and
picked up the "Sandboxes:" header line whose first token literally is
"Sandboxes:" (with colon). Tried to create docker-logs/sandbox-Sandboxes:.log
which GitHub artifact upload rejects ("not a valid path: contains colon").
Read the registry json directly (~/.nemoclaw/sandboxes.json) via jq and
only accept names matching strict filename-safe pattern
[a-z0-9_-]+ — defense against future parsing issues too.
Ref: #2484
The previous snapshot-based streamer (overwriting per-sandbox file every 10s with `nemoclaw logs` output) lost the agent-request events because `nemoclaw logs` returns only the tail of /tmp/gateway.log and the ciao mDNS error spam (~10 errors/sec) buries earlier real events. Switch to a per-sandbox SSH+tail follower that streams /tmp/gateway.log directly (full stream from start), filters the uv_interface_addresses noise inline, and caps each file at 512KB. Spawned once per sandbox via openshell ssh-config. Stop step kills the SSH followers along with the streamer. Ref: #2484
Previous streamer wrote ssh config via mktemp and rm'd it before the backgrounded ssh child connected — ssh hit "Can't open user config file" race. Use a per-sandbox stable path /tmp/sshcfg-<name>.tmp and don't remove it; runner /tmp gets cleaned up at job end anyway. Ref: #2484
The bash -c '...' single-quoted block had apostrophes inside its comments (Can't, `rm`) which prematurely terminated the outer single quote, leaving the rest of the script with unbalanced quotes — bash exited with "unexpected EOF while looking for matching `\"'" within 6 seconds of job start. Reword comments to avoid apostrophes. Ref: #2484
`head -c 524288` blocked waiting for 512KB to arrive through the tail | grep pipe. Most lines are mDNS noise that grep -v drops, so useful content arrives slowly. When the streamer was killed at job end, head had captured zero bytes — final file was just the SSH disconnect message (43b). Drop the head -c cap so output streams freely while the job runs. As safety against runaway file size, trim each log file to its last 5MB at stop time. Real gateway events are interleaved with whatever filtered content remains, so tail-trim keeps the most recent content (which includes the TC-SBX-02 hang window). Ref: #2484
The gateway log line "log file: /tmp/openclaw-998/openclaw-2026-04-27.log" revealed that openclaw writes detailed event tracing to a SEPARATE file than /tmp/gateway.log (which only captures the launch redirect of stdout/stderr from nemoclaw-start.sh). The structured log carries the agent-flow events we need; gateway.log silenced after startup because most subsequent events go to the structured log instead. Tail BOTH files in the same SSH session so we capture all gateway-side activity during TC-SBX-02. Glob /tmp/openclaw-*/openclaw-*.log to handle the per-uid stem (e.g. openclaw-998). Ref: #2484
Root cause of TC-SBX-02 hang, now fully traced via the gateway-log
streamer artifact:
The bonjour plugin (mDNS service advertiser) attempts to probe network
interfaces via ciao every few seconds. Inside the sandbox netns,
os.networkInterfaces() throws (no usable interfaces). The ciao guard in
nemoclaw-start.sh monkey-patches os.networkInterfaces to return empty,
but that does not stop ciao from cancelling its outstanding probe with
"CIAO PROBING CANCELLED" — an UNHANDLED Promise rejection (the ciao
guard only catches synchronous uncaughtException, not async).
The sandbox-safety-net swallows the rejection (gateway-only after the
recent gate fix), but the swallow happens during the same event loop
tick as in-flight WebSocket handshakes from the openclaw agent CLI.
Pending WS connections get torn down with code 1006 (abnormal closure):
03:17:39.367 Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.387 [gateway/ws] closed before connect ... code=1006
(handshake pending,
durationMs=7)
The agent CLI sees the abrupt close, retries, hits the same race,
eventually times out at the 10s connect-challenge timeout. Test only
sees UNDICI warnings because the CLI's `console.error` failure message
goes to /tmp/openclaw-<uid>/openclaw-<date>.log (the structured event
log), not stdout/stderr — the test framework never sees it.
Why TC-SBX-02 worked on 2026.4.9 but not 2026.4.24: bonjour plugin
loading and probe timing changed in the 2026.4.10–24 range
(Jiti-based plugin loader, lazy provider deps), making the rejection
window overlap WS handshakes more aggressively. On 2026.4.9 the timing
was lucky enough that the rejection never overlapped a real connect.
Fix: set plugins.entries.bonjour.enabled=false in the generated
openclaw.json. mDNS service advertisement is useless inside a sandboxed
netns (no peers to advertise to, no clients to discover the service)
and the only thing it accomplishes here is destabilizing other
gateway connections.
Ref: #2484
Root cause: bonjour mDNS plugin destabilizes WS connections in sandbox netnsAfter significant diagnostic plumbing (the openclaw structured event log lives at 19 ms between the unhandled rejection from the Causal chain
Why this surfaces in 2026.4.24 but not 2026.4.9Plugin load timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, "lazy provider dependencies" in the release notes). The bonjour rejection window now overlaps WS handshakes more aggressively. On 2026.4.9 the timing was a lucky race; on 2026.4.24 it reliably hits. Why disable bonjour is the right fixmDNS service advertisement is structurally useless inside a NemoClaw sandbox:
This is the kind of plugin that exists for the user-laptop deployment story (where mDNS finds your assistant on a home network), not for the headless sandbox case NemoClaw runs. Fix in this PR
Validation re-run in progress: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024 Diagnostic infrastructure to remove on greenOnce TC-SBX-02 passes, these diagnostic-only commits should be reverted:
These were necessary to find the root cause but add ambient runtime/CI overhead. Cleanup commit will be marked with |
Version bisect: bonjour plugin introduced in OpenClaw 2026.4.15Bisected the dist tarballs from npm:
This confirms the architectural cause and answers the user-facing question:
Reasoning recapmDNS service advertisement in a NemoClaw sandbox is structurally useless: the netns is isolated, no peers exist on the L2 segment to receive the advertisement, no clients exist to discover the service. The gateway's only consumers are (a) the openclaw agent CLI inside the same sandbox connecting to Validation in progressRe-triggered nightly with the bonjour-disable fix landed: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024 If the hypothesis is right, sandbox-operations-e2e (TC-SBX-02 specifically) should now pass alongside the other 16 jobs. Will report back when the run completes (~30 min). |
Validation result: bonjour disabled but TC-SBX-02 still hangsRun 24976192166 confirmed the bonjour-disable fix is applied correctly: The captured gateway log has no more What the new log showsGateway timeline this run: Two things here:
The errors visible in the captured log (04:33:03, 04:33:05) are from the onboard probe, not TC-SBX-02 itself. Next steps I seea) Make the streamer self-heal (re-connect SSH if its follower dies, not just on new sandbox name). That gets us visibility into the actual TC-SBX-02 window. b) Look at acpx more carefully — the 22-second gap between gateway "ready" and acpx "registered" is suspicious; acpx probably retries the npm install for c) Investigate disabling MORE non-essential plugins (acpx specifically) and see if that finally unblocks TC-SBX-02. I can pursue any/all. My honest take: the safety net's swallow-and-continue behavior on async rejections is the underlying systemic issue — any plugin that throws an unhandled rejection during sidecar startup will tear down in-flight handshakes the same way bonjour did. Disabling plugins one by one is whack-a-mole. The real fix is making the safety net not destabilize WS connections when it swallows, OR removing the safety net altogether and letting OpenClaw fail fast. Awaiting direction. |
The _SANDBOX_SAFETY_NET preload script was a hack that converted every plugin-level error in the gateway into a silent, stateful corruption. It registered catch-all handlers for `unhandledRejection` and `uncaughtException`, swallowed both, and intercepted process.exit so the gateway never terminated even on legitimate fatal errors. Two problems with this: 1. Side effects: swallowing an async rejection in the same event-loop tick as an in-flight WebSocket handshake tore the WS connection down with code 1006, causing `openclaw agent` to hang on `gateway timeout after 10000ms`. Investigated as part of the 2026.4.24 upgrade in #2484; the bonjour mDNS plugin was the originally-discovered trigger but ANY plugin throwing an async rejection has the same destabilizing effect. 2. Whack-a-mole: with this preload in place, every plugin that throws an unhandled rejection becomes a destabilization vector. Diagnostic work to identify each one in turn doesn't scale. The right systemic model is: gateway crashes on real fatal errors, the entrypoint exits, the pod terminates, and OpenShell's k3s supervision restarts the pod. TC-SBX-06 (gateway auto-recovery via docker kill) and TC-SBX-08 (process recovery) both already verify the recovery path works. Targeted guards remain (Slack channel auth-failure swallow, ciao networkInterfaces monkey-patch) — those handle SPECIFIC known-benign patterns rather than catching everything. Removed: - The _SANDBOX_SAFETY_NET script content + emit_sandbox_sourced_file - The `--require $_SANDBOX_SAFETY_NET` injection into NODE_OPTIONS at entrypoint time - The safety-net export injected into /tmp/nemoclaw-proxy-env.sh for connect sessions - $_SANDBOX_SAFETY_NET from the validate_tmp_permissions calls Ref: #2484
This reverts commit 92debb3.
…patterns
The previous safety-net was a catch-all swallow of unhandledRejection
plus a process.exit interception. That was a hack: it masked legitimate
shutdown signals, hid every error behind the same opaque log line, and
the swallow itself had observable side effects on in-flight WebSocket
handshakes (TC-SBX-02 regressed even after the bonjour disable landed).
New model:
- Known-benign error patterns are documented inline. Each entry names
the library, why it's safe to absorb in the sandbox context, and
where the upstream fix lives. Currently registered: ciao/mDNS
(CIAO PROBING CANCELLED, uv_interface_addresses).
- Unknown errors do NOT crash the gateway either, but they are logged
with full stack so they can be diagnosed and either fixed upstream
or added to the allow-list with explicit justification. The gateway
is shared sandbox infrastructure; user-initiated actions must not
be able to take it down. "Unknown means crash" is the wrong default
for shared infrastructure; "unknown means log loudly" is the right
default.
- No process.exit interception. Removed entirely.
- Still gated to OPENSHELL_SANDBOX=1 + argv[2]==='gateway' so CLI
processes (agent, doctor, plugins, tui) keep default Node crash
behavior and errors surface promptly to short-lived tools.
Also drops process.exit(1) from the CIAO guard's non-match path so
non-ciao errors fall through to the safety-net listener (which logs and
keeps the gateway alive) instead of being killed by the targeted guard.
The ciao guard is loaded into every Node process via NODE_OPTIONS=--require, including short-lived CLI commands (openclaw agent, doctor, plugins, tui). The previous commit removed process.exit(1) from the listener's non-match path so non-ciao gateway errors could fall through to the safety net. Side effect I missed: just registering an uncaughtException listener tells Node "don't crash by default" — even when the listener is a no-op for the specific error. So uncaughtExceptions in CLI processes were silently absorbed instead of surfacing, producing 60s SSH-command hangs in TC-SBX-02 (openclaw agent throws, listener returns, agent continues limping along until the sandbox_exec wrapper times out). Gate the listener to argv[2]==='gateway' (same gate as the safety net) so CLI processes get default Node crash behavior. Keep the os.networkInterfaces monkey-patch global since it's a pure workaround for sandbox netns and is useful in any process that may touch ciao.
The streamer's tail -F /tmp/openclaw-*/openclaw-*.log expands the glob once when tail starts. If openclaw processes run as a different UID mid-test (creating a new /tmp/openclaw-<uid>/ dir), their logs never make it into the artifact. Run 25003852628 had test-sbx-a's gateway log go silent at 15:45:03 even though TC-SBX-02 (15:46:44–15:47:44) and later test cases continued to hit it. Either the gateway was idle (plausible — no clients connecting) or openclaw agent was writing to /tmp/openclaw-<sandbox-uid>/ that the streamer's glob didn't include. This step re-globs and SSH+cats every openclaw log file from each live sandbox right before the artifact upload, appending to the existing per-sandbox file. Doesn't change the streamer (still useful for live following) — just guarantees the final snapshot is complete.
ROOT CAUSE of TC-SBX-02 hang in openclaw 2026.4.24:
OpenClaw 2026.4.24 ships ~22 bundled channels with enabledByDefault=true.
The gateway tries to load each at startup ("starting channels and
sidecars..."). Several (qqbot, feishu, matrix, nostr, whatsapp, etc.) have
stageRuntimeDependencies=true, meaning their npm dependencies are NOT in
the bundled image — they get installed at first load via npm.
In a sandbox, the L7 proxy denies the npm registry URLs for these
channel-specific packages (e.g. @tencent-connect/qqbot-connector for
qqbot). The npm install retries and times out — qqbot took ~6 minutes
in the run we captured. While channel loading is stuck, the gateway
accepts WebSocket connections (it's "ready" with 5 plugins loaded) but
can't service agent requests, because routing requires channels.
When TC-SBX-02 runs `openclaw agent --agent main -m '...'` over SSH, the
agent connects to the gateway and waits for a response that never comes
because channels are still loading. The 60s SSH timeout fires.
Fix: explicitly set channels.<id>.enabled=false for every bundled
channel that isn't in msg_channels (the user-configured set). Mirrors
the existing bonjour disable. The list is the union of channel ids
declared by bundled extensions in openclaw 2026.4.24.
This is independent of the safety-net rewrite — both are correct, but
this is the one that makes TC-SBX-02 pass.
The previous channel-level disable (channels.<id>.enabled=false for 22 bundled channels) made nightly E2E setup ~7 minutes slower — image upload jumped from 5.5min to 9min, pushing TC-SBX-02 past the 30-min job timeout. The likely cause: even with enabled=false, OpenClaw's `openclaw doctor --fix` and `openclaw plugins install` commands during docker build still install runtime deps for every "configured" channel (disabled or not), bloating the image. The bonjour disable already uses plugins.entries.<id>.enabled=false and that path skips the install entirely. Mirror that here for qqbot — the one channel we have direct evidence of failing at runtime due to npm proxy denial of @tencent-connect/qqbot-connector. If other bundled channels also need disabling at runtime, we'll add them as we observe them, one at a time, with the same pattern. Better narrow + iterative than broad + slow.
The test's internal 1800s (30-min) timeout in e2e-timeout.sh was sized when CI builds+uploads were ~10 min per sandbox. Current CI is taking ~14 min per sandbox (build+upload to k3s), and the test creates two sandboxes in sequence (my-assistant via install, then test-sbx-a) — ~28 min just for setup, leaving 2 min for TC-SBX cases. Override via NEMOCLAW_E2E_TIMEOUT_SECONDS env var (the test already supports this; e2e-timeout.sh:36). The job-level timeout-minutes is 60, so bumping the inner timeout to 45 min stays well within that envelope. Verified in run 25011190785's gateway log: the qqbot-disable fix from f4c6c8b is working — gateway loads its 5 plugins (acpx, browser, device-pair, phone-control, talk-voice) cleanly, no PluginLoadFailureError from qqbot. The test just ran out of time before TC-SBX cases could run.
….4.24 # Conflicts: # Dockerfile
Summary
Upgrades OpenClaw from 2026.4.9 to 2026.4.24 (latest stable, CalVer).
Three real fixes landed for the upgrade. A fourth issue (TC-SBX-02 hang) is still being root-caused.
Fixes in this PR
Dockerfile.base,nemoclaw-blueprint/blueprint.yaml,agents/openclaw/manifest.yaml,src/lib/sandbox-version.test.tsreplaceConfigFileto first attempttryWriteSingleTopLevelIncludeMutation(writes to a$includefile likeplugins.json5) before falling back towriteConfigFile. The old patch matched an exact tab-indentedwriteConfigFile(params.nextConfig, {...})string that no longer exists. Updated to match the newif (!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...)block and wrap the entire write path in the OPENSHELL_SANDBOX-gated EACCES try/catch.plugin-runtime-depssymlink — OpenClaw 2026.4.24 introduced lazy plugin runtime dep installation (Jiti loader). The CLI writes to~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/on first invocation. NemoClaw locks/sandbox/.openclawto444 root:root, so every bundled provider failed to load withEACCES. Fix: created the dir in the writable.openclaw-datatree and symlinked it from the immutable config tree, mirroring the existing pattern used forlogs,credentials,extensions, etc. Added in bothDockerfile.base(canonical) andDockerfile(idempotent fixup for stale GHCR base)._SANDBOX_SAFETY_NET(a Node--requirepreload fromnemoclaw-start.sh) installs an unconditionalunhandledRejection/uncaughtExceptionswallower. Its purpose is to keep the long-running gateway alive across non-fatal library bugs, butNODE_OPTIONS=--requirepropagates to every Node process in the sandbox — including short-lived CLI commands. Gated toprocess.argv[2] === "gateway"so CLI commands (agent, doctor, plugins, tui) get default Node behavior.Status
Run with all four fixes:
sandbox-operations-e2efails — only TC-SBX-02 (Connect & Chat) within it. All other 14 cases in that file PASS (sandbox listing, status, log streaming, registry rebuild, process recovery, multi-sandbox isolation, network isolation, destroy cleanup, gateway auto-recovery).TC-SBX-02 — what we know
Times out at 60s. The captured output is one
EnvHttpProxyAgent is experimentalNode warning then silence. On 2026.4.9 the same call completed in ~20s with a real LLM round trip.What we ruled out via instrumentation and code reading:
inference.localpasses — proxy + DNS + L7 allowlist all work)sandbox_execand pass — SSH is healthy)device token mismatcherror visible)What's left: the gateway receives the agent RPC and doesn't respond within 60s. The gateway still runs the safety-net preload (intentional — it should stay alive across non-fatal errors). If the gateway-side
agentmethod handler hits anunhandledRejectionfrom the new 2026.4.24 plugin path (e.g., the gateway user lacks write access to the sandbox-ownedplugin-runtime-depscache for a runtime-side install attempt), that rejection gets eaten by the gateway's safety net and the client awaits a response that never comes. That fits every observed symptom: silent hang, no client-side error, gateway alive enough to servenemoclaw logsandnemoclaw statusbut not the agent method.To pin this down definitively requires reading
/tmp/gateway.logcontent during the hang. The test framework doesn't capture that file, and I've held the line on not changing the test contract or test infra. I'm requesting guidance on whether it's acceptable to add a NemoClaw-side runtime diagnostic (e.g., havenemoclaw-start.shbackground-tail/tmp/gateway.logto PID 1's stderr so the gateway log appears indocker logs/nemoclaw <sandbox> logs) — that's a NemoClaw change, not a test change, but it does add runtime noise.Notable upstream changes (2026.4.9 → 2026.4.24)
registerEmbeddedExtensionFactory()toregisterAgentToolResultMiddleware()— verified NemoClaw uses neitherplugins.installsconfig key to managedplugins/installs.jsonledger —openclaw doctor --fixmigrates automatically$includemutations before falling back to full config write (root cause of fix feature: custom settings for using build endpoints #2)jobs-state.jsonseparation (2026.4.20)User sandbox state migration on rebuild
Existing user sandboxes upgrade via
nemoclaw <name> rebuild. State (memory/, workspace/, agents/, extensions/, etc.) is backed up via tar, sandbox is destroyed and recreated with the new image, state is restored,openclaw doctor --fixruns post-restore.Handled automatically: memory, cron job definitions, plugin auto-discovery, plugin registry migration. Existing reset behavior (not new): exec-approvals, credentials, device pairing. New minor behavior change: cron runtime state (
jobs-state.json) absent in pre-2026.4.20 backups — job execution history resets, jobs may re-fire once after upgrade.Test plan
nemoclaw <sandbox> connectinteractive flow