chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24 by ericksoa · Pull Request #2484 · NVIDIA/NemoClaw

ericksoa · 2026-04-25T22:32:22Z

Summary

Upgrades OpenClaw from 2026.4.9 to 2026.4.24 (latest stable, CalVer).

Three real fixes landed for the upgrade. A fourth issue (TC-SBX-02 hang) is still being root-caused.

Fixes in this PR

1. Version bumps: Dockerfile.base, nemoclaw-blueprint/blueprint.yaml, agents/openclaw/manifest.yaml, src/lib/sandbox-version.test.ts
2. Patch 4 updated: OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt tryWriteSingleTopLevelIncludeMutation (writes to a $include file like plugins.json5) before falling back to writeConfigFile. The old patch matched an exact tab-indented writeConfigFile(params.nextConfig, {...}) string that no longer exists. Updated to match the new if (!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block and wrap the entire write path in the OPENSHELL_SANDBOX-gated EACCES try/catch.
3. plugin-runtime-deps symlink: OpenClaw 2026.4.24 introduced lazy plugin runtime dep installation (Jiti loader). The CLI writes to ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first invocation. NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled provider failed to load with EACCES. Fix: created the dir in the writable .openclaw-data tree and symlinked it from the immutable config tree, mirroring the existing pattern used for logs, credentials, extensions, etc. Added in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base).
4. Sandbox safety-net gated to gateway processes only: _SANDBOX_SAFETY_NET (a Node --require preload from nemoclaw-start.sh) installs an unconditional unhandledRejection/uncaughtException swallower. Its purpose is to keep the long-running gateway alive across non-fatal library bugs, but NODE_OPTIONS=--require propagates to every Node process in the sandbox, including short-lived CLI commands. Gated to process.argv[2] === "gateway" so CLI commands (agent, doctor, plugins, tui) get default Node behavior.

Status

Run with all four fixes:

✅ 14 of 15 nightly E2E jobs PASS: cloud, messaging-providers, inference-routing, network-policy, sandbox-survival, snapshot-commands, deployment-services, diagnostics, shields-config, skip-permissions, token-rotation, hermes, rebuild-hermes, rebuild-openclaw, upgrade-stale-sandbox.
❌ sandbox-operations-e2e fails — only TC-SBX-02 (Connect & Chat) within it. All other 14 cases in that file PASS (sandbox listing, status, log streaming, registry rebuild, process recovery, multi-sandbox isolation, network isolation, destroy cleanup, gateway auto-recovery).

TC-SBX-02 — what we know

sandbox_exec "openclaw agent --agent main -m 'Say exactly: HELLO_E2E' --session-id e2e-test"

Times out at 60s. The captured output is a single Node warning ("EnvHttpProxyAgent is experimental"), then silence. On 2026.4.9 the same call completed in ~20s with a real LLM round trip.

What we ruled out via instrumentation and code reading:

Plugin EACCES (fixed via fix #3)
Client-side rejection swallow (fixed via fix #4; it would now surface as a Node stack trace via the ciao guard's uncaughtException fallthrough. We don't see one, so no rejection is happening on the CLI side)
Inference path / network policy (cloud-e2e curl-to-inference.local passes — proxy + DNS + L7 allowlist all work)
SSH (TC-SBX-04, TC-SBX-08, TC-SBX-11 use sandbox_exec and pass — SSH is healthy)
Auth / device pairing (no device token mismatch error visible)
Connect handshake timeout (10s default — would surface as error if it tripped)

What's left: the gateway receives the agent RPC and doesn't respond within 60s. The gateway still runs the safety-net preload (intentional — it should stay alive across non-fatal errors). If the gateway-side agent method handler hits an unhandledRejection from the new 2026.4.24 plugin path (e.g., the gateway user lacks write access to the sandbox-owned plugin-runtime-deps cache for a runtime-side install attempt), that rejection gets eaten by the gateway's safety net and the client awaits a response that never comes. That fits every observed symptom: silent hang, no client-side error, gateway alive enough to serve nemoclaw logs and nemoclaw status but not the agent method.

Pinning this down definitively requires reading the contents of /tmp/gateway.log during the hang. The test framework doesn't capture that file, and I've held the line on not changing the test contract or test infra. I'm requesting guidance on whether it's acceptable to add a NemoClaw-side runtime diagnostic (e.g., have nemoclaw-start.sh background-tail /tmp/gateway.log to PID 1's stderr so the gateway log appears in docker logs / nemoclaw <sandbox> logs). That's a NemoClaw change, not a test change, but it does add runtime noise.

Notable upstream changes (2026.4.9 → 2026.4.24)

Google Meet bundled plugin, DeepSeek V4 Flash/Pro, realtime voice loops (Talk/Voice Call/Google Meet), Gemini Live, browser automation improvements
Lighter startup: static model catalogs, manifest-backed model rows, lazy provider dependencies (the new plugin-runtime-deps mechanism, root cause of fix #3)
Breaking: Plugin SDK tool-result transforms migrated from registerEmbeddedExtensionFactory() to registerAgentToolResultMiddleware() — verified NemoClaw uses neither
Breaking: Plugin registry migrated from plugins.installs config key to managed plugins/installs.json ledger — openclaw doctor --fix migrates automatically
Config writes restructured to use single-file $include mutations before falling back to full config write (root cause of fix #2)
CVE-2026-41349, CVE-2026-22181 fixes; exec-approvals chat enablement (2026.4.22); cron jobs-state.json separation (2026.4.20)

User sandbox state migration on rebuild

Existing user sandboxes upgrade via nemoclaw <name> rebuild. State (memory/, workspace/, agents/, extensions/, etc.) is backed up via tar, sandbox is destroyed and recreated with the new image, state is restored, openclaw doctor --fix runs post-restore.

Handled automatically: memory, cron job definitions, plugin auto-discovery, plugin registry migration. Existing reset behavior (not new): exec-approvals, credentials, device pairing. New minor behavior change: cron runtime state (jobs-state.json) absent in pre-2026.4.20 backups — job execution history resets, jobs may re-fire once after upgrade.

Test plan

CI lint, typecheck, unit tests pass
Docker base image and sandbox image build with all four dist patches applied
14/15 nightly E2E jobs pass cleanly
TC-SBX-02 — pending root-cause confirmation (see Status above)
Manual smoke test via nemoclaw <sandbox> connect interactive flow
Rebuild test: existing 2026.4.9 sandbox → rebuild → verify state preserved (rebuild-openclaw-e2e covers this)

Bump the pinned OpenClaw version across all version-tracking files (Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests) to the latest stable release.

copy-pr-bot · 2026-04-25T22:32:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

coderabbitai · 2026-04-25T22:32:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Updates OpenClaw from version 2026.4.9 to 2026.4.24 across build configuration, manifests, and tests. Introduces plugin runtime dependencies cache directory with proper permissions and group configuration. Implements new config writing API with sandbox error handling for read-only environments.

Changes

Cohort / File(s)	Summary
Version Upgrades `Dockerfile.base`, `agents/openclaw/manifest.yaml`, `nemoclaw-blueprint/blueprint.yaml`	Bump OpenClaw version from 2026.4.9 to 2026.4.24 across build configuration and manifest declarations.
Dockerfile Configuration `Dockerfile`	Implements new OpenClaw 2026.4.24+ config writing via `tryWriteSingleTopLevelIncludeMutation` with `writeConfigFile` fallback. Adds error handling for `EACCES` in sandboxes. Creates `/sandbox/.openclaw-data/plugin-runtime-deps` directory with group-write permissions (setgid/2775) to allow gateway user write access.
Test Updates `src/lib/sandbox-version.test.ts`	Update test fixtures and assertions to expect OpenClaw version 2026.4.24 across mocked agent definitions, version comparisons, and staleness warnings.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through configs, the version's bumped high,
From point-nine to point-twenty-four in the sky!
Plugin deps find a cache with a gateway's new right,
Sandboxes protected from permission-denied plight.
A safer, stronger OpenClaw, shiny and bright! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24' accurately reflects the primary change across the changeset—upgrading the OpenClaw version and updating all related version references.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt a single-key include-file mutation (tryWriteSingleTopLevelIncludeMutation) before falling back to writeConfigFile. Both paths can EACCES in the read-only sandbox. Update the pattern match to wrap the entire write block in the OPENSHELL_SANDBOX-gated try/catch.

olegshilov

lgtm

Capture the SSH-shell environment (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, OPENCLAW_GATEWAY_URL/TOKEN, OPENSHELL_SANDBOX, NVIDIA_API_KEY) before the agent invocation, and bump the failure-message capture from head -3 to head -20 so the full reply (including any gateway/embedded fallback errors) shows in CI logs. Diagnostic-only — no behavior change.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Line 282: The diag_env diagnostic line leaks secrets by expanding the token
values; replace the unsafe expansions
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` and the
analogous `NVIDIA_API_KEY` expansion in the sandbox_exec invocation so they
never emit the variable contents, and instead emit only the literal "set" or
"unset"; implement this by checking each variable's presence (e.g., an explicit
conditional or test for non-empty) and printing "set" when present or "unset"
when not, updating the diag_env/sandbox_exec call accordingly to reference
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY securely.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5161bcbc-13b7-4cd0-8a9d-5d0f0d383403

📥 Commits

Reviewing files that changed from the base of the PR and between 5dcb0a9 and 2aacc51.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

OpenClaw 2026.4.24 lazy-installs bundled plugin runtime dependencies into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first CLI invocation (Jiti-based loader, "lazy provider dependencies" in 2026.4.20+ release notes). NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled plugin (nvidia, openai, anthropic, ollama, ...) failed to load with EACCES, leaving `openclaw agent` with zero providers — the exact symptom in TC-SBX-02 (no agent reply, only proxy warnings). Mirror the existing .openclaw-data symlink pattern: create the dir in the writable data tree and symlink it from the immutable config tree. Add to both Dockerfile.base (canonical setup) and Dockerfile (idempotent fixup for stale GHCR bases).
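
The symlink pattern described above can be sketched in Python. This is an illustration only: the real setup lives in Dockerfile.base and Dockerfile, and the function name and the temporary directories used here are hypothetical stand-ins for /sandbox/.openclaw and /sandbox/.openclaw-data.

```python
import os
import tempfile


def link_into_data_tree(config_dir: str, data_dir: str, name: str) -> str:
    """Create data_dir/name and expose it as config_dir/name via a symlink.

    Idempotent: a second call leaves an existing, correct symlink untouched.
    """
    target = os.path.join(data_dir, name)
    link = os.path.join(config_dir, name)
    os.makedirs(target, exist_ok=True)
    if os.path.islink(link) and os.readlink(link) == target:
        return link  # already set up, e.g. by a newer base image
    os.symlink(target, link)  # raises FileExistsError if something else is in the way
    return link


# Hypothetical stand-ins for the immutable config tree and the writable data tree.
root = tempfile.mkdtemp()
config_dir = os.path.join(root, ".openclaw")
data_dir = os.path.join(root, ".openclaw-data")
os.makedirs(config_dir)

link = link_into_data_tree(config_dir, data_dir, "plugin-runtime-deps")
link_into_data_tree(config_dir, data_dir, "plugin-runtime-deps")  # second call is a no-op
```

Writes through the link land in the data tree, which is why the config tree itself can stay read-only once the link exists.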

…load OpenClaw 2026.4.24+ lazy-installs and Jiti-compiles ~50 bundled plugin runtime deps on the first agent invocation in a fresh sandbox. Even with deps pre-cached at build time, the plugin registry bootstrap + provider warmup + LLM round-trip on the first call can exceed the existing 60s SSH timeout (was completing in ~20s on 2026.4.9). Make sandbox_exec_for accept an optional timeout argument (default 60, preserves all other call sites) and have TC-SBX-02 pass 240s. The openclaw agent CLI's own --timeout default is 600s so 240s leaves plenty of headroom for the inference call itself.

coderabbitai

♻️ Duplicate comments (1)

test/e2e/test-sandbox-operations.sh (1)

286-286: ⚠️ Potential issue | 🔴 Critical

Sensitive values can still be exposed in diagnostics.

Line 286 uses ${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset} (and the same for NVIDIA_API_KEY), which includes the secret value when set. This can leak credentials into CI logs.

🔧 Proposed fix

-  diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}; echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=${NVIDIA_API_KEY:+set}${NVIDIA_API_KEY:-unset}' 2>&1) || true
+  diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=$([ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] && echo set || echo unset); echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=$([ -n "${NVIDIA_API_KEY:-}" ] && echo set || echo unset)' 2>&1) || true
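
To see why the original expansion leaks, the two shell parameter expansions can be emulated in a few lines of Python. This is a model of the shell semantics for illustration, not code from the test script; the helper names are hypothetical, and None stands for an unset variable.

```python
def expand_plus(value, word):
    """Emulate ${VAR:+word}: word if VAR is set and non-empty, else empty."""
    return word if value else ""


def expand_minus(value, word):
    """Emulate ${VAR:-word}: VAR's value if set and non-empty, else word."""
    return value if value else word


def leaky_status(value):
    # ${VAR:+set}${VAR:-unset}: the second expansion yields the value itself.
    return expand_plus(value, "set") + expand_minus(value, "unset")


def safe_status(value):
    # $([ -n "${VAR:-}" ] && echo set || echo unset): the value is only tested.
    return "set" if value else "unset"
```

With a set variable, leaky_status returns the literal "set" followed by the secret, while safe_status returns only "set". Both agree on "unset" when the variable is missing or empty.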

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` at line 286, The diagnostic command
leaks secret values because
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the
NVIDIA_API_KEY variant) concatenates "set" with the actual secret; change the
diagnostic to print only "set" or "unset" without expanding the value by
replacing those expansions with a conditional-only check (e.g., use a single
parameter expansion or an explicit test) inside the sandbox_exec invocation so
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY are never interpolated into the logged
string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: acfac00c-0120-4ef6-ac19-94ac3a5d1d09

📥 Commits

Reviewing files that changed from the base of the PR and between e1f1be8 and 1e512b1.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

Reverts 2aacc51 and 1e512b1. The test contract (run openclaw agent via SSH and assert the reply contains the expected token) stays as-is. Real fix belongs in NemoClaw, not the test harness.

Add gateway to the sandbox supplementary group and set 2775 (setgid + group-write) on /sandbox/.openclaw-data/plugin-runtime-deps. OpenClaw 2026.4.24+ runs its plugin loader on both the sandbox-side CLI and the gateway side; both paths call withBundledRuntimeDepsInstallRootLock, which mkdirSyncs a lock dir under the install root. The original NemoClaw user-isolation design has gateway and sandbox in different primary groups so the sandbox user cannot tamper with the gateway. Before 2026.4.24 the plugin loader did not need write access from the gateway side; now it does, and EACCES on the lock dir caused the gateway to fail mid-request, leaving the agent CLI hanging silently on the unanswered WebSocket call. Adding gateway to sandbox as a supplementary group preserves the original boundary (sandbox still cannot affect gateway-owned resources) and only opens gateway → sandbox-owned shared cache. Setgid bit ensures new files created by either user inherit the sandbox group. Mirrored in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base images).
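
The permission reasoning in this commit message can be checked with a simplified model of the POSIX write check. This is a sketch under stated assumptions: the numeric uid/gid values are hypothetical, and root, ACLs, and capabilities are ignored.

```python
import stat


def can_write(mode, owner_uid, owner_gid, uid, gids):
    """Simplified POSIX check: the first matching class (owner, group, other) decides."""
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


MODE = 0o2775                            # setgid + rwxrwxr-x, as in the commit message
SANDBOX_UID, SANDBOX_GID = 1000, 1000    # hypothetical ids
GATEWAY_UID, GATEWAY_GID = 998, 998      # hypothetical ids

# Before: gateway is only in its own primary group, so it falls into "other".
before = can_write(MODE, SANDBOX_UID, SANDBOX_GID, GATEWAY_UID, {GATEWAY_GID})
# After: sandbox is added as a supplementary group, so the group-write bit applies.
after = can_write(MODE, SANDBOX_UID, SANDBOX_GID, GATEWAY_UID, {GATEWAY_GID, SANDBOX_GID})
# The setgid bit makes new entries inherit the directory's group.
has_setgid = bool(MODE & stat.S_ISGID)
```

The model also shows the asymmetry the message relies on: group membership is per user, so adding gateway to the sandbox group grants nothing to the sandbox user on gateway-owned paths.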

coderabbitai

🧹 Nitpick comments (1)

Dockerfile (1)
186-187: Pattern matching in minified JS is fragile.

The Python patch uses exact string matching including literal tabs (\t) and newlines (\n). Minified JavaScript bundles often vary in whitespace formatting between versions or build environments. The assertion assert old in src will fail-close (which is good), but consider:

The pattern assumes specific formatting that may not survive re-minification

Upstream OpenClaw version bumps could silently change whitespace

The fail-close behavior is correct — the build aborts if the pattern isn't found. However, when this inevitably breaks on a future OpenClaw bump, debugging the exact whitespace mismatch will be tedious.
💡 Alternative: Consider regex-based patching for resilience

A more robust approach would use regex matching that's whitespace-tolerant:
import re
pattern = re.compile(
    r'if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)\s*await\s+writeConfigFile\s*\([^;]+\);',
    re.DOTALL
)
This would survive minor formatting changes. However, the current exact-match approach is acceptable given the fail-close assertion — just be prepared for patch maintenance on version bumps.
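
A minimal sketch of the whitespace-tolerant approach, kept fail-closed like the exact-match patch. Assumptions: neither call's argument list contains nested parentheses, and the catch body shown is illustrative rather than the text NemoClaw actually injects. The function name is hypothetical.

```python
import re

# Whitespace-tolerant match for the 2026.4.24 write block.
# Assumption: no nested parentheses inside either argument list.
PATTERN = re.compile(
    r'if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\([^()]*\)\s*\)'
    r'\s*await\s+writeConfigFile\s*\([^()]*\)\s*;'
)


def wrap_write_block(src: str) -> str:
    """Wrap the matched write block in a sandbox-gated EACCES try/catch."""
    def repl(match):
        # Illustrative catch body: swallow EACCES only when OPENSHELL_SANDBOX is set.
        return (
            'try { ' + match.group(0) + ' } catch (e) { '
            'if (!(process.env.OPENSHELL_SANDBOX && e && e.code === "EACCES")) throw e; }'
        )

    patched, count = PATTERN.subn(repl, src, count=1)
    if count != 1:
        # Fail closed: abort rather than ship an unpatched bundle.
        raise SystemExit("patch target not found; upstream write block changed")
    return patched
```

Using a replacement function instead of a replacement string avoids having to escape backslashes and group references in the injected JavaScript.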
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 186 - 187, The current Python one-liner patches the
minified JS by exact string match of the
tryWriteSingleTopLevelIncludeMutation/writeConfigFile block (the variables
old/new and the assert old in src), which is fragile against
whitespace/minification changes; change the script to use a regex-based,
whitespace-tolerant search (e.g., compile a pattern that matches the if(!await
tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block
with \s* and re.DOTALL) and perform a re.sub to inject the new try { ... }
catch(...) wrapper, then update the assertion to check the regex matched (or
that the file changed) instead of relying on the literal old string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26c92d4a-9980-47a5-8dc7-a8dc2fab2065

📥 Commits

Reviewing files that changed from the base of the PR and between 1e512b1 and 521c599.

📒 Files selected for processing (2)

Dockerfile
Dockerfile.base

🚧 Files skipped from review as they are similar to previous changes (1)

Dockerfile.base

…che" This reverts commit 521c599.

The _SANDBOX_SAFETY_NET preload was loaded via NODE_OPTIONS=--require into EVERY Node process in the sandbox, including short-lived CLI commands like `openclaw agent`. It installed an unconditional `unhandledRejection` handler that swallows the rejection — designed to keep the long-running gateway alive across non-fatal library bugs. In OpenClaw 2026.4.9 the agent CLI's code paths didn't trip an unhandled rejection, so the swallow was harmless there. In 2026.4.24 the new plugin loader / gateway client path produces an unhandled rejection from `openclaw agent`. Instead of surfacing as an error, the safety net ate it and the awaited Promise never resolved — leaving the CLI hanging silently on a request that should have failed fast. This is the exact symptom in TC-SBX-02: two UNDICI warnings (process startup) followed by minutes of silence with no error output. Gate the swallow to argv[2] === "gateway" so the protection is scoped to its actual purpose (`openclaw gateway run …`). All other CLI commands (agent, doctor, plugins, tui) get default Node behavior — errors surface and short-lived processes exit cleanly with a meaningful exit code.
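
The gate itself is a positional check on the argument vector. The sketch below models it in Python on a Node-style argv list; the real check runs inside the Node preload, and the binary path and command lines shown are hypothetical.

```python
def safety_net_enabled(argv):
    """Model of the gate process.argv[2] === "gateway" on a Node-style argv list.

    Node layout: argv[0] is the node binary, argv[1] the script path,
    argv[2] the first CLI argument.
    """
    return len(argv) > 2 and argv[2] == "gateway"


# Hypothetical command lines, written in Node's argv layout.
gateway = ["node", "/usr/bin/openclaw", "gateway", "run"]
agent = ["node", "/usr/bin/openclaw", "agent", "--agent", "main"]
```

Because the check is positional, an invocation that placed a flag before the subcommand would not match. Whether OpenClaw accepts such invocations is not stated in this PR.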

…lure

TC-SBX-02 hangs without surfacing any error: with the safety-net gate fix, errors should now propagate on the agent CLI side, but we see only Node UNDICI warnings then 60s of silence. The remaining hypothesis is that the gateway-side `agent` method handler hits an error that's swallowed by the gateway's still-active safety net (intentional, keeps gateway alive), leaving the client awaiting a response that never comes.

To prove or refute this, the gateway log content during the hang must be visible in the failed test artifact. The test framework captures only the test runner's own log (and the agent CLI's SSH output, which is silent). /tmp/gateway.log inside the sandbox container has the data we need.

Two-part diagnostic, not a behavior change:

1. nemoclaw-start.sh: background-tail /tmp/gateway.log with a [gateway-log:] prefix to PID 1's stderr after gateway launch. Each gateway-log line now appears in the container's stderr stream (and is filterable by prefix). Cleanup: tail PID added to SANDBOX_CHILD_PIDS so cleanup_on_signal reaps it on shutdown. Both root and non-root launch paths covered.
2. nightly-e2e.yaml sandbox-operations-e2e: on failure, run `docker logs` on every test-sbx-* container and upload as a separate artifact (sandbox-operations-docker-logs). The artifact will contain the gateway log content (now mirrored to container stderr) at the time of failure.

This is a NemoClaw-side and workflow-level change (no test changes; the test contract for TC-SBX-02 is unchanged). The runtime diagnostic is permanent but additive; it can be removed once the upstream root cause is identified and fixed.

Ref: #2484

The previous post-failure docker logs capture step ran AFTER the test script's teardown destroyed test sandbox containers — so `docker ps -a` returned no matches and the artifact was empty. Replace with a background `docker logs -f` streamer started before the test runs. As soon as a container appears, its logs stream to a per-container file in docker-logs/. When the container is removed, the stream ends but the file persists on the host. The post-failure artifact upload now captures logs from every container that existed at any point during the test. Combined with the [gateway-log:] mirror in nemoclaw-start.sh, this finally surfaces gateway-side activity (including any sandbox-safety-net swallowed errors) at the time TC-SBX-02 hangs. Ref: #2484

The previous docker-logs streamer hit "configured logging driver does not support reading" for sandbox containers. NemoClaw sandboxes are k3s pods INSIDE the openshell-cluster container, not sibling docker containers — `docker logs` cannot read pod stdio. Switch to `docker exec openshell-cluster-* kubectl logs -f -n openshell <pod> --all-containers` to stream pod logs (which include PID 1's stderr mirror of /tmp/gateway.log via the [gateway-log:] prefix added in nemoclaw-start.sh). Output goes to per-pod files on the host that persist past pod deletion. Ref: #2484

The kubectl-logs streamer also returned empty files because the container log driver in openshell's k3s setup doesn't capture container stdio (same root cause as the docker logs failure). The only working way to read /tmp/gateway.log content from outside the pod is via SSH — which `nemoclaw <sandbox> logs --follow` does internally. Switch the streamer to `nemoclaw <name> logs --follow > docker-logs/sandbox-<name>.log`. The streamer waits for nemoclaw to be installed (test does that in its first phase), polls `nemoclaw list`, and spawns a follower per sandbox. Ref: #2484

The previous `nemoclaw logs --follow` per-sandbox streamer accumulated unbounded output and the artifact upload step never finished within the 60-min job timeout (run 24968594521 was cancelled at 1h+ stuck on Upload sandbox gateway logs). Switch to snapshot mode: every 10s, run `timeout 8 nemoclaw <name> logs` and overwrite docker-logs/sandbox-<name>.log with the result, capped at 256KB. The default `nemoclaw logs` invocation returns ~62 lines (already bounded by /tmp/gateway.log size at snapshot time). When a sandbox is destroyed by the test, the file holds the final pre-destroy snapshot. Ref: #2484

The previous streamer parsed `nemoclaw list` pretty-printed output and picked up the "Sandboxes:" header line whose first token literally is "Sandboxes:" (with colon). Tried to create docker-logs/sandbox-Sandboxes:.log which GitHub artifact upload rejects ("not a valid path: contains colon"). Read the registry json directly (~/.nemoclaw/sandboxes.json) via jq and only accept names matching strict filename-safe pattern [a-z0-9_-]+ — defense against future parsing issues too. Ref: #2484
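
The name filter can be sketched as follows. The workflow itself uses jq; this Python version only illustrates the strict pattern, and it takes candidate names as a plain list because the layout of ~/.nemoclaw/sandboxes.json is not shown in this PR. The function name is hypothetical.

```python
import re

# Strict filename-safe pattern from the commit message.
SAFE_NAME = re.compile(r"[a-z0-9_-]+")


def filter_safe_names(candidates):
    """Keep only names that match the pattern in full; drop everything else."""
    return [name for name in candidates if SAFE_NAME.fullmatch(name)]


candidates = ["Sandboxes:", "test-sbx-1", "my_box", "../etc", ""]
safe = filter_safe_names(candidates)
```

fullmatch matters here: a plain search would accept "Sandboxes:" because part of it matches, which is the class of bug the strict pattern is meant to rule out.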

The previous snapshot-based streamer (overwriting per-sandbox file every 10s with `nemoclaw logs` output) lost the agent-request events because `nemoclaw logs` returns only the tail of /tmp/gateway.log and the ciao mDNS error spam (~10 errors/sec) buries earlier real events. Switch to a per-sandbox SSH+tail follower that streams /tmp/gateway.log directly (full stream from start), filters the uv_interface_addresses noise inline, and caps each file at 512KB. Spawned once per sandbox via openshell ssh-config. Stop step kills the SSH followers along with the streamer. Ref: #2484

Previous streamer wrote the ssh config via mktemp and rm'd it before the backgrounded ssh child connected, so ssh raced against the removal and failed with "Can't open user config file". Use a per-sandbox stable path /tmp/sshcfg-<name>.tmp and don't remove it; runner /tmp gets cleaned up at job end anyway. Ref: #2484

The bash -c '...' single-quoted block had apostrophes inside its comments (Can't, `rm`) which prematurely terminated the outer single quote, leaving the rest of the script with unbalanced quotes — bash exited with "unexpected EOF while looking for matching `\"'" within 6 seconds of job start. Reword comments to avoid apostrophes. Ref: #2484

`head -c 524288` blocked waiting for 512KB to arrive through the tail | grep pipe. Most lines are mDNS noise that grep -v drops, so useful content arrives slowly. When the streamer was killed at job end, head had captured zero bytes — final file was just the SSH disconnect message (43b). Drop the head -c cap so output streams freely while the job runs. As safety against runaway file size, trim each log file to its last 5MB at stop time. Real gateway events are interleaved with whatever filtered content remains, so tail-trim keeps the most recent content (which includes the TC-SBX-02 hang window). Ref: #2484
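
The stop-time trim can be sketched like this. It is an illustration of the "keep the last N bytes" idea, not the workflow's actual shell step; the function name is hypothetical, and the trim cuts at a byte offset, so the first retained line may be partial.

```python
import os


def trim_to_tail(path: str, max_bytes: int = 5 * 1024 * 1024) -> int:
    """Rewrite path so it holds at most its last max_bytes bytes; return the new size."""
    size = os.path.getsize(path)
    if size <= max_bytes:
        return size  # nothing to trim
    with open(path, "rb") as f:
        f.seek(size - max_bytes)
        tail = f.read()
    with open(path, "wb") as f:
        f.write(tail)
    return len(tail)
```

Trimming after the fact, rather than capping inside the pipe, lets output stream freely during the job while still bounding the uploaded artifact.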

The gateway log line "log file: /tmp/openclaw-998/openclaw-2026-04-27.log" revealed that openclaw writes detailed event tracing to a SEPARATE file from /tmp/gateway.log (which only captures the launch redirect of stdout/stderr from nemoclaw-start.sh). The structured log carries the agent-flow events we need; gateway.log goes silent after startup because most subsequent events go to the structured log instead. Tail BOTH files in the same SSH session so we capture all gateway-side activity during TC-SBX-02. Glob /tmp/openclaw-*/openclaw-*.log to handle the per-uid stem (e.g. openclaw-998). Ref: #2484

Root cause of TC-SBX-02 hang, now fully traced via the gateway-log streamer artifact:

The bonjour plugin (mDNS service advertiser) attempts to probe network interfaces via ciao every few seconds. Inside the sandbox netns, os.networkInterfaces() throws (no usable interfaces). The ciao guard in nemoclaw-start.sh monkey-patches os.networkInterfaces to return empty, but that does not stop ciao from cancelling its outstanding probe with "CIAO PROBING CANCELLED", an UNHANDLED Promise rejection (the ciao guard only catches synchronous uncaughtException, not async). The sandbox-safety-net swallows the rejection (gateway-only after the recent gate fix), but the swallow happens during the same event loop tick as in-flight WebSocket handshakes from the openclaw agent CLI. Pending WS connections get torn down with code 1006 (abnormal closure):

03:17:39.367 Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.387 [gateway/ws] closed before connect ... code=1006 (handshake pending, durationMs=7)

The agent CLI sees the abrupt close, retries, hits the same race, eventually times out at the 10s connect-challenge timeout. Test only sees UNDICI warnings because the CLI's `console.error` failure message goes to /tmp/openclaw-<uid>/openclaw-<date>.log (the structured event log), not stdout/stderr, so the test framework never sees it.

Why TC-SBX-02 worked on 2026.4.9 but not 2026.4.24: bonjour plugin loading and probe timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, lazy provider deps), making the rejection window overlap WS handshakes more aggressively. On 2026.4.9 the timing was lucky enough that the rejection never overlapped a real connect.

Fix: set plugins.entries.bonjour.enabled=false in the generated openclaw.json. mDNS service advertisement is useless inside a sandboxed netns (no peers to advertise to, no clients to discover the service) and the only thing it accomplishes here is destabilizing other gateway connections.

Ref: #2484
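
The config change amounts to setting one nested key. A minimal sketch, assuming the generated openclaw.json is a plain JSON object; the starting config and the function name below are hypothetical, and the real file has more keys.

```python
import json


def disable_plugin(config: dict, name: str) -> dict:
    """Set plugins.entries.<name>.enabled to False, creating parents as needed."""
    entries = config.setdefault("plugins", {}).setdefault("entries", {})
    entries.setdefault(name, {})["enabled"] = False
    return config


# Hypothetical starting config for illustration.
config = {"plugins": {"entries": {"nvidia": {"enabled": True}}}}
disable_plugin(config, "bonjour")
print(json.dumps(config, indent=2))
```

setdefault keeps any existing plugin entries intact, so only the bonjour entry changes.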

ericksoa · 2026-04-27T03:34:32Z

Root cause: bonjour mDNS plugin destabilizes WS connections in sandbox netns

After significant diagnostic plumbing (the openclaw structured event log lives at /tmp/openclaw-<uid>/openclaw-<date>.log, not /tmp/gateway.log), the gateway-log streamer artifact (workflow sandbox-operations-docker-logs) finally captured the failure window for TC-SBX-02. Smoking-gun timeline from the gateway log:

03:17:39.354  [plugins] bonjour: restarting advertiser (service stuck in probing)
03:17:39.367  [openclaw] Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.370  wrote stability bundle (rejection logged)
03:17:39.387  [gateway/ws] closed before connect conn=... code=1006 reason=n/a
              (handshake pending, durationMs=7)

20 ms between the unhandled rejection from the bonjour plugin (03:17:39.367) and the abrupt WebSocket close (03:17:39.387).
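The gap between any two entries in the timeline above can be recomputed straight from the log timestamps. A minimal Python sketch, assuming the HH:MM:SS.mmm format shown in the excerpt; the helper name `gap_ms` is illustrative and not part of NemoClaw or OpenClaw:

```python
from datetime import datetime, timedelta

def gap_ms(earlier: str, later: str) -> int:
    """Return the gap in whole milliseconds between two HH:MM:SS.mmm stamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    # Integer division by a 1 ms timedelta avoids float rounding.
    return delta // timedelta(milliseconds=1)

rejection = "03:17:39.367"  # Unhandled promise rejection: CIAO PROBING CANCELLED
ws_close = "03:17:39.387"   # [gateway/ws] closed before connect ... code=1006
print(gap_ms(rejection, ws_close), "ms")
```

The same helper works for the other pairs in the timeline, for example from the advertiser restart to the WS close.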

Causal chain

The bonjour plugin (mDNS service advertiser) attempts to probe network interfaces every few seconds
The sandbox netns has no usable interfaces → os.networkInterfaces() throws
NemoClaw's ciao guard (in nemoclaw-start.sh) monkey-patches os.networkInterfaces to return empty on failure — BUT that doesn't stop ciao from cancelling its in-flight probe with "CIAO PROBING CANCELLED", which surfaces as an unhandled Promise rejection
The ciao guard only catches synchronous uncaughtException, not async unhandledRejection
The sandbox-safety-net catches the rejection (gateway-only after the earlier gate fix in this PR), but the swallow happens during the same event loop tick as in-flight WebSocket handshakes
Pending WS connections from the openclaw agent CLI get torn down with code 1006 (abnormal closure)
The agent CLI retries, hits the same race, eventually times out at the 10s connect-challenge timeout
The CLI's console.error failure message goes to the openclaw structured log, NOT stdout/stderr — that's why the test only ever saw the two UNDICI warnings followed by 60s of silence

Why this surfaces in 2026.4.24 but not 2026.4.9

Plugin load timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, "lazy provider dependencies" in the release notes). The bonjour rejection window now overlaps WS handshakes more aggressively. On 2026.4.9 the timing was a lucky race; on 2026.4.24 it reliably hits.

Why disabling bonjour is the right fix

mDNS service advertisement is structurally useless inside a NemoClaw sandbox:

The sandbox netns is isolated — there are no peers on the network to advertise the gateway to
The only way the gateway is reached from outside the sandbox is via the openshell SSH tunnel (nemoclaw <sandbox> connect), which doesn't use mDNS discovery
Internal-to-sandbox callers (the agent CLI, the configure-guard) connect to 127.0.0.1:18789 directly via the openclaw config, not via mDNS lookup
Continuing to load bonjour produces nothing useful and actively destabilizes the gateway every few seconds

This is the kind of plugin that exists for the user-laptop deployment story (where mDNS finds your assistant on a home network), not for the headless sandbox case NemoClaw runs.

Fix in this PR

plugins.entries.bonjour.enabled = false in the generated openclaw.json. Single line in the Dockerfile's Python config generator. Doesn't affect the user-laptop NemoClaw flow (different config path).
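A minimal sketch of what that one-line change amounts to, assuming the generator builds the config as a nested Python dict before serializing it to openclaw.json. The `disable_plugin` helper and the starting config are illustrative assumptions; only the `plugins.entries.bonjour.enabled` key path comes from this PR.

```python
import json

def disable_plugin(config: dict, plugin_id: str) -> dict:
    """Set plugins.entries.<plugin_id>.enabled = false, creating parents as needed."""
    entries = config.setdefault("plugins", {}).setdefault("entries", {})
    entries.setdefault(plugin_id, {})["enabled"] = False
    return config

# Hypothetical starting point; the real generator's config has many more keys.
config = {"plugins": {"entries": {}}}
disable_plugin(config, "bonjour")
print(json.dumps(config, indent=2))
```

Using `setdefault` keeps the helper safe to call whether or not the `plugins` or `entries` objects already exist, and it leaves any other settings on the same plugin entry untouched.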

Validation re-run in progress: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024

Diagnostic infrastructure to remove on green

Once TC-SBX-02 passes, these diagnostic-only commits should be reverted:

[gateway-log:] mirror in nemoclaw-start.sh (PID 1 stderr tail of /tmp/gateway.log)
Start gateway log streamer (background) and related steps in .github/workflows/nightly-e2e.yaml

These were necessary to find the root cause but add ambient runtime/CI overhead. Cleanup commit will be marked with revert(diag): ….

ericksoa · 2026-04-27T03:37:16Z

Version bisect: bonjour plugin introduced in OpenClaw 2026.4.15

Bisected the dist tarballs from npm:

Version	bonjour plugin
2026.4.9	NOT PRESENT
2026.4.10	NOT PRESENT
2026.4.12	NOT PRESENT
2026.4.13	NOT PRESENT
2026.4.14	NOT PUBLISHED
2026.4.15	PRESENT (`bonjour-discovery-*.js`, `extensions/bonjour/`)
2026.4.20	PRESENT
2026.4.22	PRESENT
2026.4.24	PRESENT

This confirms the architectural cause and answers the user-facing question:

The bonjour-disable fix is needed for any OpenClaw bump from < 2026.4.15 to >= 2026.4.15. It's not specific to 2026.4.24; pinning to 2026.4.13 or 2026.4.10 (still bumping past 2026.4.9) wouldn't trip it. The first version that ships bonjour is the one where TC-SBX-02 starts hanging on every run.
The min_openclaw_version floor in nemoclaw-blueprint/blueprint.yaml should be at or above 2026.4.15 if we want this fix to be load-bearing for all builds. (We're already at 2026.4.24 — no change needed there. But if anyone tries to bump backwards, the fix becomes irrelevant; if anyone bumps forward, it remains applied as long as bonjour is still a bundled plugin in OpenClaw.)
Removal criteria for the fix: drop the plugins.entries.bonjour.enabled = false line if upstream OpenClaw fixes the bonjour plugin to handle netns-restricted environments without throwing async unhandledRejections, OR if upstream removes bonjour from the bundled-plugin set entirely.
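The version boundary from the bisect can be expressed as a small check. This is a sketch only: the function names are hypothetical, and it assumes OpenClaw versions are always three dot-separated integers as in the table above. Note that comparing the raw strings would be wrong, because "2026.4.9" sorts after "2026.4.15" lexicographically.

```python
from typing import Tuple

def parse_calver(version: str) -> Tuple[int, ...]:
    """Split a CalVer string such as '2026.4.15' into comparable integers."""
    return tuple(int(part) for part in version.split("."))

# First published version whose dist tarball contains the bonjour plugin (from the bisect).
BONJOUR_FIRST_SHIPPED = parse_calver("2026.4.15")

def needs_bonjour_disable(version: str) -> bool:
    """True when the given OpenClaw version bundles bonjour."""
    return parse_calver(version) >= BONJOUR_FIRST_SHIPPED

for v in ("2026.4.9", "2026.4.13", "2026.4.15", "2026.4.24"):
    print(v, needs_bonjour_disable(v))
```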

Reasoning recap

mDNS service advertisement in a NemoClaw sandbox is structurally useless: the netns is isolated, no peers exist on the L2 segment to receive the advertisement, no clients exist to discover the service. The gateway's only consumers are (a) the openclaw agent CLI inside the same sandbox connecting to 127.0.0.1:18789 directly, and (b) the openshell SSH tunnel that nemoclaw <sandbox> connect opens — neither uses mDNS. Continuing to load bonjour produces zero useful behavior and actively destabilizes other gateway connections by triggering CIAO PROBING CANCELLED rejections in the same event-loop tick as in-flight WS handshakes.

Validation in progress

Re-triggered nightly with the bonjour-disable fix landed: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024

If the hypothesis is right, sandbox-operations-e2e (TC-SBX-02 specifically) should now pass alongside the other 16 jobs. Will report back when the run completes (~30 min).

ericksoa · 2026-04-27T04:47:51Z

Validation result: bonjour disabled but TC-SBX-02 still hangs

Run 24976192166 confirmed the bonjour-disable fix is applied correctly:

Before: [gateway] ready (6 plugins: acpx, bonjour, browser, device-pair, phone-control, talk-voice)
After:  [gateway] ready (5 plugins: acpx,          browser, device-pair, phone-control, talk-voice)

The captured gateway log has no more CIAO PROBING CANCELLED rejections — that part of the chain is genuinely gone. But TC-SBX-02 still hangs with the same 60s SSH timeout. So bonjour was a contributing trigger, not the sole cause.

What the new log shows

Gateway timeline this run:

04:32:51.799  [gateway] ready (5 plugins, no bonjour)
04:32:51.810  [gateway] starting channels and sidecars...
04:33:03.810  [openclaw] Failed to start CLI: gateway timeout after 10000ms
04:33:05.807  [gateway/ws] closed before connect durationMs=3 code=1006
04:33:13.202  [plugins] embedded acpx runtime backend registered  (← 22s gap)

Two things here:

WebSocket still gets dropped (1006, 3ms after handshake start) during the "starting channels and sidecars" window even without bonjour. Some other plugin/sidecar in startup is also disrupting handshakes.
The streamer stopped capturing at 04:33:13. TC-SBX-02 ran at 04:34:55 — 1m42s later. We have no visibility into the actual TC-SBX-02 hang window because either (a) the gateway logs nothing during that window, or (b) the SSH-based streamer got disconnected before the test reached TC-SBX-02 and my polling logic did not re-establish.

The errors visible in the captured log (04:33:03, 04:33:05) are from the onboard probe, not TC-SBX-02 itself.

Next steps I see

a) Make the streamer self-heal (re-connect SSH if its follower dies, not just on new sandbox name). That gets us visibility into the actual TC-SBX-02 window.

b) Look at acpx more carefully — the 22-second gap between gateway "ready" and acpx "registered" is suspicious; acpx probably retries the npm install for @zed-industries/codex-acp (which the L7 proxy denies with 403) and that retry might be throwing async rejections similar to bonjour.

c) Investigate disabling MORE non-essential plugins (acpx specifically) and see if that finally unblocks TC-SBX-02.

I can pursue any/all. My honest take: the safety net's swallow-and-continue behavior on async rejections is the underlying systemic issue — any plugin that throws an unhandled rejection during sidecar startup will tear down in-flight handshakes the same way bonjour did. Disabling plugins one by one is whack-a-mole. The real fix is making the safety net not destabilize WS connections when it swallows, OR removing the safety net altogether and letting OpenClaw fail fast.

Awaiting direction.

The _SANDBOX_SAFETY_NET preload script was a hack that converted every plugin-level error in the gateway into a silent, stateful corruption. It registered catch-all handlers for `unhandledRejection` and `uncaughtException`, swallowed both, and intercepted process.exit so the gateway never terminated even on legitimate fatal errors.

Two problems with this:

1. Side effects: swallowing an async rejection in the same event-loop tick as an in-flight WebSocket handshake tore the WS connection down with code 1006, causing `openclaw agent` to hang on `gateway timeout after 10000ms`. Investigated as part of the 2026.4.24 upgrade in #2484; the bonjour mDNS plugin was the originally-discovered trigger but ANY plugin throwing an async rejection has the same destabilizing effect.
2. Whack-a-mole: with this preload in place, every plugin that throws an unhandled rejection becomes a destabilization vector. Diagnostic work to identify each one in turn doesn't scale.

The right systemic model is: gateway crashes on real fatal errors, the entrypoint exits, the pod terminates, and OpenShell's k3s supervision restarts the pod. TC-SBX-06 (gateway auto-recovery via docker kill) and TC-SBX-08 (process recovery) both already verify the recovery path works.

Targeted guards remain (Slack channel auth-failure swallow, ciao networkInterfaces monkey-patch). Those handle SPECIFIC known-benign patterns rather than catching everything.

Removed:

- The _SANDBOX_SAFETY_NET script content + emit_sandbox_sourced_file
- The `--require $_SANDBOX_SAFETY_NET` injection into NODE_OPTIONS at entrypoint time
- The safety-net export injected into /tmp/nemoclaw-proxy-env.sh for connect sessions
- $_SANDBOX_SAFETY_NET from the validate_tmp_permissions calls

Ref: #2484

This reverts commit 92debb3.

…patterns

The previous safety-net was a catch-all swallow of unhandledRejection plus a process.exit interception. That was a hack: it masked legitimate shutdown signals, hid every error behind the same opaque log line, and the swallow itself had observable side effects on in-flight WebSocket handshakes (TC-SBX-02 regressed even after the bonjour disable landed).

New model:

- Known-benign error patterns are documented inline. Each entry names the library, why it's safe to absorb in the sandbox context, and where the upstream fix lives. Currently registered: ciao/mDNS (CIAO PROBING CANCELLED, uv_interface_addresses).
- Unknown errors do NOT crash the gateway either, but they are logged with full stack so they can be diagnosed and either fixed upstream or added to the allow-list with explicit justification. The gateway is shared sandbox infrastructure; user-initiated actions must not be able to take it down. "Unknown means crash" is the wrong default for shared infrastructure; "unknown means log loudly" is the right default.
- No process.exit interception. Removed entirely.
- Still gated to OPENSHELL_SANDBOX=1 + argv[2]==='gateway' so CLI processes (agent, doctor, plugins, tui) keep default Node crash behavior and errors surface promptly to short-lived tools.

Also drops process.exit(1) from the CIAO guard's non-match path so non-ciao errors fall through to the safety-net listener (which logs and keeps the gateway alive) instead of being killed by the targeted guard.
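The decision logic of the new model can be sketched independently of Node. The real safety net is a Node preload script; the Python below only illustrates the allow-list classification, and every name in it (`BenignPattern`, `KNOWN_BENIGN`, `classify`) is hypothetical. The two match strings come from the commit message; the `reason` texts are paraphrases, not the inline documentation from the actual script.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenignPattern:
    needle: str   # substring looked for in the error message
    library: str  # library that raises it
    reason: str   # why absorbing it is safe in the sandbox (paraphrased)

KNOWN_BENIGN = (
    BenignPattern("CIAO PROBING CANCELLED", "ciao/mDNS",
                  "mDNS probe cancelled; nothing to advertise to in the sandbox netns"),
    BenignPattern("uv_interface_addresses", "ciao/mDNS",
                  "interface enumeration fails in the sandbox netns"),
)

def classify(message: str) -> str:
    """Return 'absorb' for allow-listed errors and 'log-loudly' for all others.

    Neither outcome terminates the gateway; the difference is only how much
    is logged (a short note versus the full stack).
    """
    for pattern in KNOWN_BENIGN:
        if pattern.needle in message:
            return "absorb"
    return "log-loudly"

print(classify("Unhandled promise rejection: CIAO PROBING CANCELLED"))
print(classify("TypeError: cannot read properties of undefined"))
```

Keeping the patterns in one table with a reason per entry is what makes each addition to the allow-list reviewable, which is the point of "explicit justification" in the commit message.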

The ciao guard is loaded into every Node process via NODE_OPTIONS=--require, including short-lived CLI commands (openclaw agent, doctor, plugins, tui). The previous commit removed process.exit(1) from the listener's non-match path so non-ciao gateway errors could fall through to the safety net. Side effect I missed: just registering an uncaughtException listener tells Node "don't crash by default" — even when the listener is a no-op for the specific error. So uncaughtExceptions in CLI processes were silently absorbed instead of surfacing, producing 60s SSH-command hangs in TC-SBX-02 (openclaw agent throws, listener returns, agent continues limping along until the sandbox_exec wrapper times out). Gate the listener to argv[2]==='gateway' (same gate as the safety net) so CLI processes get default Node crash behavior. Keep the os.networkInterfaces monkey-patch global since it's a pure workaround for sandbox netns and is useful in any process that may touch ciao.

The streamer's tail -F /tmp/openclaw-*/openclaw-*.log expands the glob once when tail starts. If openclaw processes run as a different UID mid-test (creating a new /tmp/openclaw-<uid>/ dir), their logs never make it into the artifact. Run 25003852628 had test-sbx-a's gateway log go silent at 15:45:03 even though TC-SBX-02 (15:46:44–15:47:44) and later test cases continued to hit it. Either the gateway was idle (plausible — no clients connecting) or openclaw agent was writing to /tmp/openclaw-<sandbox-uid>/ that the streamer's glob didn't include. This step re-globs and SSH+cats every openclaw log file from each live sandbox right before the artifact upload, appending to the existing per-sandbox file. Doesn't change the streamer (still useful for live following) — just guarantees the final snapshot is complete.
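The difference between expanding the glob once and re-expanding it before upload can be reproduced in a few lines. This is a sketch in Python rather than the workflow's shell, using a temporary directory in place of /tmp; `snapshot_logs` and the file names are made up for illustration.

```python
import glob
import os
import tempfile
from typing import List

def snapshot_logs(pattern: str) -> List[str]:
    """Expand the glob at call time, so directories created later are included."""
    return sorted(glob.glob(pattern))

with tempfile.TemporaryDirectory() as root:
    pattern = os.path.join(root, "openclaw-*", "openclaw-*.log")

    # One log directory exists when the streamer starts.
    os.makedirs(os.path.join(root, "openclaw-1000"))
    open(os.path.join(root, "openclaw-1000", "openclaw-a.log"), "w").close()
    at_start = snapshot_logs(pattern)   # what a one-time expansion sees

    # A process running as a different UID creates a new directory mid-test.
    os.makedirs(os.path.join(root, "openclaw-2000"))
    open(os.path.join(root, "openclaw-2000", "openclaw-b.log"), "w").close()
    at_upload = snapshot_logs(pattern)  # re-glob right before the artifact upload

    print(len(at_start), "file(s) at start,", len(at_upload), "file(s) at upload")
```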

ROOT CAUSE of TC-SBX-02 hang in openclaw 2026.4.24:

OpenClaw 2026.4.24 ships ~22 bundled channels with enabledByDefault=true. The gateway tries to load each at startup ("starting channels and sidecars..."). Several (qqbot, feishu, matrix, nostr, whatsapp, etc.) have stageRuntimeDependencies=true, meaning their npm dependencies are NOT in the bundled image; they get installed at first load via npm.

In a sandbox, the L7 proxy denies the npm registry URLs for these channel-specific packages (e.g. @tencent-connect/qqbot-connector for qqbot). The npm install retries and times out; qqbot took ~6 minutes in the run we captured. While channel loading is stuck, the gateway accepts WebSocket connections (it's "ready" with 5 plugins loaded) but can't service agent requests, because routing requires channels.

When TC-SBX-02 runs `openclaw agent --agent main -m '...'` over SSH, the agent connects to the gateway and waits for a response that never comes because channels are still loading. The 60s SSH timeout fires.

Fix: explicitly set channels.<id>.enabled=false for every bundled channel that isn't in msg_channels (the user-configured set). Mirrors the existing bonjour disable. The list is the union of channel ids declared by bundled extensions in openclaw 2026.4.24.

This is independent of the safety-net rewrite. Both are correct, but this is the one that makes TC-SBX-02 pass.
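A minimal sketch of the channel-disable loop described in this commit, assuming the generator holds the config as a nested dict and `msg_channels` as a collection of channel ids. The helper name is hypothetical, and `BUNDLED_CHANNELS` is deliberately partial: it contains only the five ids named in the commit message, not the full set of ~22.

```python
import json

# Partial, illustrative list: only the channel ids named in the commit message.
BUNDLED_CHANNELS = ["qqbot", "feishu", "matrix", "nostr", "whatsapp"]

def disable_unconfigured_channels(config, msg_channels, bundled=BUNDLED_CHANNELS):
    """Set channels.<id>.enabled = false for bundled channels the user did not configure."""
    channels = config.setdefault("channels", {})
    for channel_id in bundled:
        if channel_id not in msg_channels:
            channels.setdefault(channel_id, {})["enabled"] = False
    return config

# Hypothetical user-configured set: only matrix is wanted.
result = disable_unconfigured_channels({}, msg_channels={"matrix"})
print(json.dumps(result, indent=2))
```

Channels that appear in `msg_channels` are left alone rather than forced to enabled, so whatever the generator already wrote for them is preserved.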

The previous channel-level disable (channels.<id>.enabled=false for 22 bundled channels) made nightly E2E setup ~7 minutes slower — image upload jumped from 5.5min to 9min, pushing TC-SBX-02 past the 30-min job timeout. The likely cause: even with enabled=false, OpenClaw's `openclaw doctor --fix` and `openclaw plugins install` commands during docker build still install runtime deps for every "configured" channel (disabled or not), bloating the image. The bonjour disable already uses plugins.entries.<id>.enabled=false and that path skips the install entirely. Mirror that here for qqbot — the one channel we have direct evidence of failing at runtime due to npm proxy denial of @tencent-connect/qqbot-connector. If other bundled channels also need disabling at runtime, we'll add them as we observe them, one at a time, with the same pattern. Better narrow + iterative than broad + slow.

The test's internal 1800s (30-min) timeout in e2e-timeout.sh was sized when CI builds+uploads were ~10 min per sandbox. Current CI is taking ~14 min per sandbox (build+upload to k3s), and the test creates two sandboxes in sequence (my-assistant via install, then test-sbx-a) — ~28 min just for setup, leaving 2 min for TC-SBX cases. Override via NEMOCLAW_E2E_TIMEOUT_SECONDS env var (the test already supports this; e2e-timeout.sh:36). The job-level timeout-minutes is 60, so bumping the inner timeout to 45 min stays well within that envelope. Verified in run 25011190785's gateway log: the qqbot-disable fix from f4c6c8b is working — gateway loads its 5 plugins (acpx, browser, device-pair, phone-control, talk-voice) cleanly, no PluginLoadFailureError from qqbot. The test just ran out of time before TC-SBX cases could run.
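The time budget in this commit message works out as follows. The figures (about 14 minutes per sandbox, two sandboxes, 1800 s default, 45-minute override, 60-minute job limit) are taken from the text above; the function name is illustrative only.

```python
JOB_TIMEOUT_MIN = 60   # job-level timeout-minutes
PER_SANDBOX_MIN = 14   # observed build + upload time per sandbox
SANDBOXES = 2          # my-assistant, then test-sbx-a

def minutes_left_for_tests(inner_timeout_s: int) -> float:
    """Minutes remaining for the TC-SBX cases once both sandboxes are set up."""
    return inner_timeout_s / 60 - PER_SANDBOX_MIN * SANDBOXES

for label, seconds in (("default", 1800), ("override", 45 * 60)):
    print(label, seconds, "s ->", minutes_left_for_tests(seconds), "min for test cases")

# The override must still fit inside the job-level timeout.
print("fits in job timeout:", 45 <= JOB_TIMEOUT_MIN)
```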

….4.24 # Conflicts: # Dockerfile

chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24

f3b0dbe

Bump the pinned OpenClaw version across all version-tracking files (Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests) to the latest stable release.

ericksoa added 2 commits April 25, 2026 15:32

Merge branch 'main' into upgrade/openclaw-2026.4.24

c1fe5f4

ericksoa marked this pull request as ready for review April 25, 2026 23:23

olegshilov approved these changes Apr 25, 2026

coderabbitai Bot reviewed Apr 26, 2026

Comment thread test/e2e/test-sandbox-operations.sh Outdated

ericksoa added 2 commits April 26, 2026 09:11

coderabbitai Bot reviewed Apr 26, 2026

ericksoa added 2 commits April 26, 2026 10:49

revert: undo test/e2e/test-sandbox-operations.sh timeout + diagnostics

935a9b4

Reverts 2aacc51 and 1e512b1. The test contract (run openclaw agent via SSH and assert the reply contains the expected token) stays as-is. Real fix belongs in NemoClaw, not the test harness.

coderabbitai Bot reviewed Apr 26, 2026

ericksoa added 14 commits April 26, 2026 11:29

Revert "fix: give gateway user write access to plugin-runtime-deps cache"

e685857

This reverts commit 521c599.

ericksoa added 4 commits April 26, 2026 21:59

Revert "fix: remove sandbox-safety-net Node preload entirely"

23d10c5

This reverts commit 92debb3.

wscurran added the Docker (Support for Docker containerization) and dependencies (Pull requests that update a dependency file) labels Apr 27, 2026

ericksoa added 5 commits April 27, 2026 09:04

Merge remote-tracking branch 'origin/main' into upgrade/openclaw-2026.4.24

0ccad30

# Conflicts: # Dockerfile

chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24 #2484
ericksoa wants to merge 31 commits into main from upgrade/openclaw-2026.4.24
