
feat(vm): add openshell-vm microVM gateway backend (opt-in via NEMOCLAW_GATEWAY_BACKEND=vm) #1791

Draft
ericksoa wants to merge 47 commits into main from feat/openshell-vm-backend

Conversation

ericksoa (Contributor) commented on Apr 11, 2026:

Summary

Adds openshell-vm (a libkrun microVM) as an opt-in alternative gateway backend. Docker remains the default. Users enable the VM backend by setting NEMOCLAW_GATEWAY_BACKEND=vm; there are no new CLI flags and no changes to documented behavior.

Also bumps the OpenShell version pin from 0.0.26 to 0.0.32, picking up seccomp hardening, Landlock fixes, SSRF protection, deny rules in the policy schema, and standalone binary publishing.

All VM backend code is exercised nightly by the vm-e2e CI job (30/30 passing on ubuntu-latest with KVM). The Docker path is unchanged and all existing E2E jobs continue to pass.

How it works

detectGatewayBackend() checks, in order:

  1. NEMOCLAW_GATEWAY_BACKEND env var — "vm" or "docker" overrides everything
  2. Docker available → use Docker (default)
  3. openshell-vm binary in PATH → use VM
  4. Neither → "unknown" (onboard fails with guidance)
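
A minimal sketch of this priority order as a pure function (names and signature are illustrative; the real detectGatewayBackend() in src/lib/platform.ts probes the live system rather than taking flags):

```typescript
// Illustrative sketch of the backend selection order described above.
type Backend = "docker" | "vm" | "unknown";

function selectBackend(
  envOverride: string | undefined, // NEMOCLAW_GATEWAY_BACKEND
  dockerAvailable: boolean,
  vmBinaryInPath: boolean
): Backend {
  // 1. Explicit env override wins over everything
  if (envOverride === "vm" || envOverride === "docker") return envOverride;
  // 2. Docker is the default when available
  if (dockerAvailable) return "docker";
  // 3. Fall back to the VM backend if openshell-vm is installed
  if (vmBinaryInPath) return "vm";
  // 4. Neither: onboarding fails with guidance
  return "unknown";
}

console.log(selectBackend("vm", true, false)); // override beats Docker: "vm"
console.log(selectBackend(undefined, true, true)); // "docker"
```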

When VM is selected, onboard.ts spawns openshell-vm --name nemoclaw --mem 4096 as a detached process, tracks its PID, polls gRPC health, and manages the full lifecycle (start, health check, resume, cleanup). The sandbox image is built with docker build on the host, exported via docker save, written into the VM rootfs via virtio-fs, and imported into the VM's containerd via ctr images import.
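
The health-poll step can be sketched as a bounded retry loop (a sketch with illustrative names; the real implementation polls the gateway's gRPC endpoint and, per a later commit, uses 60 attempts at 3 s intervals):

```typescript
// Retry an injected async health check up to `attempts` times, pausing
// `delayMs` between tries. Returns true as soon as the check passes.
async function pollHealthy(
  check: () => Promise<boolean>,
  attempts: number,
  delayMs: number
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await check()) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Example with a fake check that succeeds on the third attempt.
let calls = 0;
pollHealthy(async () => ++calls >= 3, 5, 10).then((ok) =>
  console.log(ok, calls) // logs: true 3
);
```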

Benefits of the VM backend

| | Docker (default) | VM (NEMOCLAW_GATEWAY_BACKEND=vm) |
| --- | --- | --- |
| Isolation | Sandboxes share the host kernel via Docker containers | Sandboxes run inside a hardware-isolated microVM (KVM); no shared kernel, no Docker socket |
| Dependencies | Requires Docker daemon (or Podman with Docker compat) | Requires only /dev/kvm and the openshell-vm binary; no Docker daemon for the gateway |
| Memory | ~8 GiB for Docker daemon sidecar + k3s | ~4 GiB for the microVM (configurable via --mem) |
| Startup | ~60 s (Docker daemon + k3s bootstrap) | ~6 s VM boot + ~60 s k3s bootstrap inside the VM |
| K8s deployment | Requires Docker-in-Docker (DinD sidecar, 3 volumes, init container) | Single container with KVM access, no sidecar |
| GPU inference | Works via host inference server + L7 proxy | Same: GPU inference is routed through inference.local (OpenShell L7 proxy → host), so the gateway never touches the GPU directly |
| macOS | Docker Desktop or Colima | Hypervisor.framework (no Docker needed); not yet tested with this integration |

Tradeoffs and limitations

  • KVM required: The VM backend needs /dev/kvm, which means bare-metal or nested virtualization. Not available in all cloud VMs or container runtimes.
  • Host Docker still needed for sandbox image build: docker build + docker save run on the host to create the sandbox image. This is standard CI usage, not Docker-in-Docker, but it means Docker isn't fully eliminated.
  • Two rootfs workarounds (both self-disabling): see "Workarounds" section below.
  • macOS untested: openshell-vm supports macOS ARM64 via Hypervisor.framework but this NemoClaw integration hasn't been tested on Mac yet.
  • Not the default: Docker is the established, documented, tested-in-production path. The VM backend is new and should bake before becoming a default.

OpenShell version bump: 0.0.26 → 0.0.32

This PR bumps max_openshell_version and the install pin from 0.0.26 to 0.0.32. Key improvements in the new range:

| Version | What it brings |
| --- | --- |
| v0.0.28 | VM kernel source fix for CONFIG_POSIX_MQUEUE (d8cf7951); runtime not yet rebuilt, shim still needed |
| v0.0.29 | Seccomp hardening, Landlock path fixes, symlink resolution |
| v0.0.30 | ComputeDriver refactor, deny rules in network policy schema, streaming fix |
| v0.0.31 | Exclude vm-dev tag from version glob, header allowlist for inference |
| v0.0.32 | Standalone openshell-sandbox binaries published, system CA cert support |

Workarounds (both self-disabling)

1. mqueue runc shim (still needed)

The vm-dev kernel runtime artifacts have not been rebuilt since CONFIG_POSIX_MQUEUE=y was added to the source (d8cf7951, 2026-04-10). Without mqueue support, every container inside the VM fails with error mounting "mqueue" to rootfs: no such device.

Fix: Write a containerd config that routes runc through a shim script. The shim edits each container's config.json to use tmpfs instead of mqueue.

Self-disables: The init script tests mount -t mqueue at boot. If the kernel supports it, the shim is never installed. Once the vm-dev runtime is rebuilt with the kernel fix, this becomes a no-op.
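
For illustration, the shim's core edit can be modeled as a pure transform over the OCI spec's mounts array (hypothetical helper in TypeScript; the actual shim is a shell script that edits config.json with sed, swapping mqueue for tmpfs as described above):

```typescript
// Illustrative transform: rewrite mqueue mount entries in an OCI runtime
// spec to tmpfs, so runc never asks the kernel for the missing filesystem.
interface OciMount {
  destination: string;
  type: string;
  source?: string;
  options?: string[];
}

function mqueueToTmpfs(mounts: OciMount[]): OciMount[] {
  return mounts.map((m) =>
    m.type === "mqueue"
      ? { ...m, type: "tmpfs", source: "tmpfs", options: ["nosuid", "nodev", "noexec"] }
      : m
  );
}

const mounts: OciMount[] = [
  { destination: "/proc", type: "proc", source: "proc" },
  { destination: "/dev/mqueue", type: "mqueue", source: "mqueue" },
];
console.log(mqueueToTmpfs(mounts)[1].type); // "tmpfs"
```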

2. Supervisor glibc extraction (still needed)

The openshell-sandbox supervisor binary is built against glibc 2.39 (Ubuntu 24.04) but sandbox containers use Ubuntu 22.04 (glibc 2.35). It crashes with GLIBC_2.38 not found.

Fix: Extract the compatible binary from the Docker gateway image at onboard time (~7s, once).

Self-disables: The extraction only runs when Docker is available; once upstream fixes the build target, it simply overwrites the binary with an equivalent one.

What changed (17 files, ~+1500/-40)

| File | Purpose |
| --- | --- |
| src/lib/platform.ts | detectGatewayBackend(): Docker preferred, VM via env override |
| src/lib/onboard.ts | VM gateway lifecycle: spawn, PID tracking, health poll, rootfs patches (mqueue shim + glibc), image import, resume/recovery, DNS proxy skip for VM |
| src/lib/openshell.ts | isOpenshellVmAvailable(), getInstalledOpenshellVmVersion() |
| src/lib/onboard-session.ts | gatewayBackend field in session (persists choice across resume) |
| src/nemoclaw.ts | VM-aware cleanup (stopVmGateway), pruneKnownHostsEntries import |
| nemoclaw-blueprint/blueprint.yaml | gateway_backends: [docker, vm], max version bump to 0.0.32 |
| schemas/blueprint.schema.json | Schema update for gateway_backends |
| scripts/install-openshell.sh | Version pin bump to 0.0.32 |
| scripts/brev-launchable-ci-cpu.sh | Default OpenShell version bump to v0.0.32 |
| .github/workflows/nightly-e2e.yaml | vm-e2e job (KVM, NEMOCLAW_GATEWAY_BACKEND=vm, diagnostics) |
| test/e2e/test-vm-backend-e2e.sh | Full VM E2E: install → onboard → inference → resume → reset (30 checks) |
| test/openshell-vm.test.ts | Unit tests for VM detection, session persistence, lifecycle |
| test/platform.test.ts | Unit tests for detectGatewayBackend() priority and overrides |

E2E test phases (30/30 passing)

| Phase | Description |
| --- | --- |
| 0 | Prerequisites (Linux, KVM, API key, network) |
| 1 | Install openshell-vm binary + runtime |
| 2 | Install NemoClaw with NEMOCLAW_GATEWAY_BACKEND=vm |
| 3 | Post-install verification (registry, list, status, no Docker container) |
| 4 | Live inference through VM sandbox (PONG test) |
| 5 | Resume after openshell-vm kill |
| 6 | Reset: destroy and clean-slate re-onboard |
| 7 | Final cleanup |

Test plan

  • npm test — all tests pass
  • Pre-commit and pre-push hooks pass
  • vm-e2e: 30/30 on GitHub Actions ubuntu-latest (KVM)
  • cloud-e2e: passes (Docker path not regressed)
  • sandbox-survival-e2e: passes
  • hermes-e2e: passes
  • messaging-providers-e2e: passes
  • skip-permissions-e2e: passes
  • security-configuration-hardening test: passes (Docker manifest unchanged)
  • Manual testing on Mac with openshell-vm

Refs: NVIDIA/OpenShell#611

OpenShell v0.0.26 ships openshell-vm, a standalone binary that boots a
hardware-isolated microVM via libkrun. This commit lays the groundwork
for NemoClaw to support both the existing Docker/k3s gateway and the new
microVM gateway as selectable backends.

Phase 1 changes:
- Bump min_openshell_version from 0.0.24 to 0.0.26 across blueprint,
  install script, onboard preflight, CI scripts, E2E tests, and docs
- Add gateway_backends field to blueprint.yaml schema (docker, vm)
- Add isOpenshellVmAvailable() and getInstalledOpenshellVmVersion() to
  openshell.ts for detecting the openshell-vm binary
- Add detectGatewayBackend() to platform.ts with NEMOCLAW_GATEWAY_BACKEND
  env var override, auto-detection preferring VM when available and
  falling back to Docker, and mandatory Docker for GPU workloads
- Add gatewayBackend field to onboard session schema for persisting the
  selected backend across resume cycles
- Add tests for all new functions

The VM backend requires no Docker daemon and provides faster boot, but
has no NVIDIA GPU passthrough (libkrun lacks PCI/VFIO support), so the
Docker backend remains mandatory for local inference on GPU workstations.

Refs: NVIDIA/OpenShell#611
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai Bot commented Apr 11, 2026

Important: review skipped because this PR is a draft. To trigger a single review, invoke the @coderabbitai review command.

🚥 Pre-merge checks: 2 passed, 1 warning

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.91%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding openshell-vm as a microVM gateway backend with environment-variable opt-in, which aligns with the comprehensive set of changes across configuration, code, tests, and workflows. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |



github-actions Bot commented:

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-1791/

ericksoa added 27 commits April 11, 2026 14:15
Phase 4 of the dual-backend feature (PR #1791):

- New test/e2e/test-vm-backend-e2e.sh: full E2E journey for the VM
  backend — install openshell-vm from release assets, onboard with
  NEMOCLAW_GATEWAY_BACKEND=vm, verify sandbox creation, live inference
  through the microVM gateway, resume after openshell-vm kill, and
  reset to clean slate.

- New vm-e2e job in nightly-e2e.yaml: runs on ubuntu-latest (has
  /dev/kvm), installs openshell-vm, executes the VM backend E2E test.

- New vm-backend test suite in brev-e2e.test.ts: allows running the
  VM backend E2E on ephemeral Brev instances via TEST_SUITE=vm-backend.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm binary is published on the vm-dev pre-release tag
(not v0.0.26) with gnu libc (not musl). Also downloads the VM runtime
(kernel + rootfs) needed by libkrun, and installs zstd for
decompression.

Asset corrections:
- Tag: vm-dev (not v0.0.26)
- Binary: openshell-vm-*-unknown-linux-gnu.tar.gz (not musl)
- Checksums: vm-binary-checksums-sha256.txt
- Runtime: vm-runtime-linux-*.tar.zst → ~/.local/share/openshell-vm/

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When detectGatewayBackend() returns "vm", the onboard and runtime
recovery flows now use openshell-vm instead of Docker:

- startGatewayWithOptions() detects the backend and delegates to
  startVmGatewayProcess() for the VM path, which spawns openshell-vm
  as a detached background process with PID tracking
- VM lifecycle helpers: writeVmPidFile, readVmPid, isVmProcessAlive,
  stopVmGateway (SIGTERM→SIGKILL), isVmGatewayHealthy
- destroyGateway() checks session.gatewayBackend and stops the VM
  process instead of Docker volume cleanup when backend is "vm"
- recoverGatewayRuntime() reads session.gatewayBackend to choose
  VM vs Docker recovery path
- recoverNamedGatewayRuntime() in nemoclaw.ts skips Docker-specific
  gateway select commands for VM backend
- cleanupGatewayAfterLastSandbox() stops VM process instead of
  Docker cleanup when backend is "vm"
- Gateway backend is saved to onboard session on step completion
  so resume cycles know which backend to use
- Resume flow checks VM health via isVmGatewayHealthy() instead of
  Docker gateway state when session records VM backend

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
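
The SIGTERM-then-SIGKILL stop sequence mentioned above can be sketched with injected callbacks (illustrative only; the real stopVmGateway operates on the PID tracked in the session):

```typescript
// Send SIGTERM, wait up to `graceMs` for the process to exit, then escalate
// to SIGKILL. kill/isAlive are injected so the control flow is visible.
async function stopVm(
  kill: (sig: "SIGTERM" | "SIGKILL") => void,
  isAlive: () => boolean,
  graceMs: number
): Promise<"terminated" | "killed"> {
  kill("SIGTERM");
  const deadline = Date.now() + graceMs;
  while (Date.now() < deadline) {
    if (!isAlive()) return "terminated"; // exited gracefully
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
  kill("SIGKILL"); // grace period exhausted
  return "killed";
}
```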
Reading the openshell-vm source (NVIDIA/OpenShell crates/openshell-vm)
revealed three bugs in the Phase 2 implementation:

1. openshell-vm was spawned with no arguments. It needs `--name nemoclaw`
   so it extracts the rootfs to the correct instance directory and
   registers gateway metadata under the right identity.

2. openshell-vm prefixes instance names: gateway_name("nemoclaw")
   produces "openshell-vm-nemoclaw". All OPENSHELL_GATEWAY env vars
   and `openshell gateway select` calls for the VM path now use
   VM_GATEWAY_NAME ("openshell-vm-nemoclaw") instead of GATEWAY_NAME.

3. Health poll was too short (15 × 2s = 30s). The VM boots k3s inside
   a microVM with its own 90s internal health check. Increased to
   60 × 3s = 180s to avoid racing the inner bootstrap. Also log the
   last 10 lines from openshell-vm.log on failure for diagnostics.

Additionally: the VM gateway listens on port 30051 (NodePort), not
8080. The openshell-vm binary handles the port mapping internally
(gvproxy host:30051 → VM:30051 → kube-proxy → pod:8080).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
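
A minimal mirror of the prefixing rule from item 2, handy for keeping OPENSHELL_GATEWAY env vars and gateway select calls consistent (the function name here is illustrative; the behavior follows gateway_name in the openshell-vm source):

```typescript
// openshell-vm registers instances under an "openshell-vm-" prefix, so the
// gateway named "nemoclaw" must be addressed as "openshell-vm-nemoclaw".
function vmGatewayName(name: string): string {
  return `openshell-vm-${name}`;
}

console.log(vmGatewayName("nemoclaw")); // "openshell-vm-nemoclaw"
```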
The GitHub-hosted ubuntu runner has /dev/kvm (crw-rw---- root:kvm) but
the runner user is not in the kvm group. openshell-vm opens /dev/kvm
directly via libkrun and fails with EACCES.

Fix by chmod 666 /dev/kvm in the KVM verification step. Also add the
user to the kvm group for completeness, though the chmod is sufficient
for the current process.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…tics

Three changes to address gvproxy crash on GitHub Actions:

1. Pass --mem 4096 to openshell-vm. The default 8GB is half the 16GB
   runner memory. With k3s pulling container images inside the VM, host
   processes (gvproxy, gvisor netstack) get starved. 4GB is enough for
   a lightweight gateway without GPU workloads.

2. Detect and use the E2E-downloaded VM runtime via OPENSHELL_VM_RUNTIME_DIR.
   The test script downloads gvproxy/libkrun/libkrunfw from the vm-dev
   release to ~/.local/share/openshell-vm/ but never tells openshell-vm
   to use it. The downloaded runtime may contain fixes not in the binary's
   embedded copy. When the downloaded runtime exists (has gvproxy), set
   OPENSHELL_VM_RUNTIME_DIR to prefer it.

3. Add VM diagnostics step to CI: openshell-vm log, gvproxy log, dmesg
   OOM check, memory stats, and VM console log. This will show the
   actual root cause if gvproxy crashes again.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm kernel has CONFIG_POSIX_MQUEUE=y but the init script
(openshell-vm-init.sh) never mounts the mqueue filesystem. When k3s
creates a pod, runc tries to mount mqueue at /dev/mqueue inside the
container namespace and gets ENODEV ("no such device") because the
host mount point doesn't exist.

Fix: run `openshell-vm prepare-rootfs` to extract the rootfs, then
patch the init script to mkdir + mount mqueue alongside the existing
devpts/shm mounts. The patch is idempotent — skipped if the init
script already contains /dev/mqueue.

Root cause found by tracing the VM console log:
  runc create failed: error mounting "mqueue" to rootfs at
  "/dev/mqueue": no such device

This should be fixed upstream in the OpenShell init script.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The previous mqueue patch ran silently when prepare-rootfs failed or
the init script wasn't found. Add verbose logging at each step so CI
output shows exactly what happened: rootfs path, init script location,
whether the string replacement matched, and any errors.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev kernel (built 2026-04-09) predates the CONFIG_POSIX_MQUEUE
addition to the kconfig (2026-04-10 d8cf7951). runc fails with ENODEV
when trying to mount -t mqueue inside container namespaces.

Previous approach (mounting mqueue in init script) didn't work because
the kernel itself lacks the filesystem type — mount returns ENODEV.

New approach: install a runc wrapper shim that strips mqueue mount
entries from the OCI config.json before calling the real runc binary.
The shim is only installed when the kernel actually lacks mqueue
support (tested by attempting mount -t mqueue). When a future kernel
rebuild includes CONFIG_POSIX_MQUEUE, the shim is not installed.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…pt breakage

The previous approach embedded a heredoc in the init script patch,
which broke the init script (VM died before exec agent started).

New approach: write runc-wrapper.sh as a separate file in
/opt/nemoclaw/ of the rootfs, then patch the init script with a
simple test-and-swap block. The wrapper is a standalone script that
strips mqueue entries from config.json via sed before calling the
real runc binary. The init script patch is minimal — just an if/fi
block that copies the wrapper over runc when the kernel lacks mqueue.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…shim

Previous approach of wrapping /usr/bin/runc didn't work because k3s
bundles its own runc in /var/lib/rancher/k3s/data/<hash>/bin/, extracted
fresh at each startup. Our host-side wrapper was never called.

New approach: write /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
in the init script, which k3s reads before generating its containerd config.
The template sets BinaryName to /opt/nemoclaw/runc-shim, which strips
mqueue mount entries from config.json before calling the k3s-bundled runc.

This routes ALL runc invocations through the shim, correctly handling the
k3s data directory extraction timing.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When the session's gatewayBackend is "vm", the standard `openshell
sandbox create --from Dockerfile` path fails because the openshell CLI
internally uses Docker put_archive to push the image into the gateway
container — which doesn't exist for VM gateways.

Fix: detect VM backend during sandbox creation and:
1. Build the image locally with Docker (same Dockerfile)
2. Export with docker save to the VM's rootfs via virtio-fs
   (host can write directly to the rootfs/tmp/ directory)
3. Import into VM containerd via openshell-vm exec + ctr images import
4. Pass --from <image-ref> instead of --from <Dockerfile> so the
   openshell CLI treats it as a pre-existing image (skips Docker push)

Falls back to the standard Dockerfile path if any step fails, so Docker
backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
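
The build, save, import, and fallback steps above can be sketched as control flow over injected step implementations (illustrative names; the real code shells out to docker build/save, writes through virtio-fs, and runs ctr images import inside the VM):

```typescript
// Illustrative control flow for VM sandbox-image creation with fallback.
interface ImportSteps {
  build: () => Promise<boolean>;      // docker build on the host
  save: () => Promise<boolean>;       // docker save into the VM rootfs (virtio-fs)
  importToVm: () => Promise<boolean>; // ctr images import inside the VM
}

async function createSandboxImage(
  steps: ImportSteps
): Promise<"image-ref" | "dockerfile-fallback"> {
  try {
    if ((await steps.build()) && (await steps.save()) && (await steps.importToVm())) {
      return "image-ref"; // pass --from <image-ref>; openshell skips the Docker push
    }
  } catch {
    // any step failure falls through to the standard Dockerfile path
  }
  return "dockerfile-fallback";
}
```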
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…erlay)

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…g issues

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Two post-sandbox fixes for VM backend:

1. DNS proxy: setup-dns-proxy.sh uses docker exec to reach kubectl
   inside the gateway container. VM backend has no Docker container.
   Skip the DNS proxy — gvproxy provides NAT networking with working
   DNS through the gateway IP (192.168.127.1).

2. Sandbox readiness: the sandbox pod may briefly flip Ready→NotReady
   during init container restarts in the VM. Add a 30s wait loop in
   setupOpenclaw before attempting sandbox connect, preventing the
   FailedPrecondition "sandbox is not ready" error.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev openshell-sandbox binary was built with cargo-zigbuild on
a host with glibc 2.39, but sandbox containers use Ubuntu 22.04
(glibc 2.35). The binary crashes at startup:
  GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox)

Fix: extract the openshell-sandbox binary from the Docker gateway
image (ghcr.io/nvidia/openshell/cluster:<version>), which was built
with rust:1.88-slim (Debian bookworm, glibc 2.36 — compatible with
Ubuntu 22.04's glibc 2.35). Replace the VM rootfs copy before boot.

The supervisor is side-loaded into sandbox pods via hostPath volume
from /opt/openshell/bin on the k3s node, so this fix propagates to
all sandbox pods automatically.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
waitForSandboxReady() used `openshell doctor exec -- kubectl` which
runs docker exec inside the gateway container. This is Docker-specific
and fails silently for VM gateways (no Docker container exists).

For VM backend, use `openshell sandbox list` + isSandboxReady() which
goes through the gRPC API and works for both Docker and VM gateways.
Docker backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…ateway

When openshell-vm is killed after a successful onboard, the session is
marked complete (resumable=false). Running `nemoclaw onboard --resume`
correctly reports "No resumable session found" — but the E2E test
expects resume to restart the gateway and reconnect the sandbox.

Fix: when --resume finds a completed session with gatewayBackend=vm
and the VM gateway is dead, restart openshell-vm and check if the
sandbox is still alive. If the sandbox reconnects after gateway
restart, exit 0 (recovery successful). If not, mark the session
resumable and fall through to the normal re-onboard path.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
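
The resume decision described above, reduced to a pure function (a sketch; names and the exact conditions are illustrative):

```typescript
// Given the session and gateway state after `nemoclaw onboard --resume`,
// decide whether recovery succeeded or a re-onboard is needed.
type ResumeAction = "recovered" | "re-onboard" | "no-session";

function resumeAction(
  hasCompletedVmSession: boolean,
  vmGatewayAlive: boolean,
  sandboxReconnectsAfterRestart: boolean
): ResumeAction {
  if (!hasCompletedVmSession) return "no-session";
  if (vmGatewayAlive) return "recovered"; // nothing to do
  // Gateway dead: restart openshell-vm, then check whether the sandbox
  // reconnects; if not, mark the session resumable and re-onboard.
  return sandboxReconnectsAfterRestart ? "recovered" : "re-onboard";
}
```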
…ng resume

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
GPU inference is routed through inference.local (OpenShell L7 proxy →
host inference server). The sandbox and gateway never need direct GPU
access — the GPU is used by the host-side Ollama/vLLM/NIM process.
The VM backend works for all scenarios including local GPU inference.

Remove the gpuRequested parameter and the conditional that forced
Docker when GPU was detected. VM is now preferred whenever openshell-vm
is available, regardless of GPU presence.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
On Apr 12, 2026, ericksoa changed the title from "feat(vm): dual-backend foundation for openshell-vm microVM gateway" to "feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30)".
Replace the two-container DinD architecture (dockerd sidecar + workspace)
with a single container using openshell-vm. Removes:

- docker:24-dind sidecar container (8Gi RAM, 2 CPU)
- Docker socket shared volume
- Docker storage volume
- Docker config init container (cgroup v2 workaround)
- docker.io apt package install
- Docker daemon wait loop

The workspace container now installs openshell-vm via the upstream
install script and sets NEMOCLAW_GATEWAY_BACKEND=vm. Requires /dev/kvm
on the k8s node (bare-metal or nested virt enabled).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa (Contributor, Author) commented:

What dropping Docker-in-Docker means in practice

The k8s manifest (k8s/nemoclaw-k8s.yaml) went from a two-container DinD architecture to a single container using openshell-vm. Here's what changes:

Security

  • No more privileged Docker daemon. The DinD sidecar ran dockerd with privileged: true, which gives the container full root access to the host kernel — it can load kernel modules, access all devices, and escape the container. The VM backend still needs privileged: true for /dev/kvm, but KVM is a much narrower capability than a full Docker daemon. The attack surface shrinks from "arbitrary container execution engine" to "hardware virtualization API."
  • No Docker socket sharing. The old manifest shared /var/run/docker.sock between containers via an emptyDir volume. Any process with access to the Docker socket can run arbitrary containers on the host. That shared volume is gone.
  • Stronger isolation. Sandboxes run inside a hardware-isolated microVM (KVM + libkrun), not inside Docker containers sharing the host kernel. A sandbox escape would need to break out of the VM's hardware boundary, not just a Linux namespace.

Resource usage

  • ~8Gi less memory requested. The DinD sidecar requested 8Gi RAM + 2 CPU just for the Docker daemon. The VM backend uses 4Gi for the microVM (configurable via --mem) and doesn't need a separate daemon process. Total pod request drops from 12Gi to 8Gi.
  • One fewer container. No Docker daemon to start, health-check, or restart. Fewer moving parts means fewer failure modes.
  • No Docker image layer storage. The DinD sidecar used an emptyDir for /var/lib/docker which could grow unbounded as images were pulled and built. The VM uses a fixed-size state disk (32 GiB sparse file, allocated on demand).

Startup time

  • No Docker daemon wait. The old flow spent up to 60 seconds waiting for dockerd to become ready (docker info poll loop). The VM boots in ~6 seconds to first health check.
  • No Docker image pulls inside Docker. DinD needed to pull the OpenShell cluster image inside the Docker daemon, which was slow (no layer cache on first run). The VM has the k3s cluster embedded in its rootfs.

Operational simplicity

  • 122 lines → 56 lines. The manifest is less than half the size.
  • No init container. The cgroup v2 Docker daemon config hack (init-docker-config) is gone.
  • No shared volumes. Three emptyDir volumes (docker-storage, docker-socket, docker-config) are gone.
  • Easier debugging. One container, one process. kubectl logs nemoclaw gives you everything.

Requirements

  • KVM on the node. The k8s node must have /dev/kvm available. Bare-metal nodes and VMs with nested virtualization enabled have this. Standard cloud VMs on GKE/EKS may not — check your node pool configuration.
  • Docker still needed on the host for building the sandbox image (docker build + docker save). This is a build-time dependency, not a runtime one. Future work could eliminate this by building inside the VM's containerd or using podman.

coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/brev-launchable-ci-cpu.sh (1)

31-31: ⚠️ Potential issue | 🟡 Minor

Stale comment: default version mismatch.

Line 31 documents the default as v0.0.20, but line 43 sets the actual default to v0.0.26. Update the comment to match:

-#   OPENSHELL_VERSION     — OpenShell CLI release tag (default: v0.0.20)
+#   OPENSHELL_VERSION     — OpenShell CLI release tag (default: v0.0.26)

Also applies to: 43-43

🧹 Nitpick comments (6)
test/e2e/test-vm-backend-e2e.sh (3)

34-34: Consider adding set -e for fail-fast behavior.

The script uses set -uo pipefail but omits -e (errexit). This means commands can fail silently without stopping the test. While the explicit pass/fail tracking handles some cases, intermediate commands (like cd, tar, install) could fail and the script would continue, potentially producing misleading results.

If intentional (to allow the test framework to track each failure), consider documenting why -e is omitted.


291-295: apt-get usage assumes Debian/Ubuntu.

While the script targets GitHub ubuntu-latest runners, users running locally on other Linux distributions (Fedora, Arch) would see failures. Consider adding a fallback or clearer error message.

Suggested defensive check
 # zstd may not be installed — install it if needed
 if ! command -v zstd >/dev/null 2>&1; then
   info "Installing zstd for runtime decompression..."
-  sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  if command -v apt-get >/dev/null 2>&1; then
+    sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  elif command -v dnf >/dev/null 2>&1; then
+    sudo dnf install -y -q zstd >/dev/null 2>&1
+  else
+    fail "zstd not found and no supported package manager available"
+    exit 1
+  fi
 fi

488-495: The pkill -f "openshell-vm" pattern may be overly broad.

This pattern matches any process with "openshell-vm" anywhere in its command line, which could unintentionally kill unrelated processes in shared environments. Consider using the PID file that onboard.ts writes (~/.nemoclaw/openshell-vm.pid) for targeted termination.

Suggested alternative using PID file
 info "Killing openshell-vm process to simulate crash..."
-# Find and kill any openshell-vm gateway process
-if pkill -f "openshell-vm" 2>/dev/null; then
+# Kill via PID file for targeted termination
+VM_PID_FILE="$HOME/.nemoclaw/openshell-vm.pid"
+if [ -f "$VM_PID_FILE" ] && kill -0 "$(cat "$VM_PID_FILE")" 2>/dev/null; then
+  kill "$(cat "$VM_PID_FILE")" 2>/dev/null
   pass "openshell-vm process killed"
   sleep 3
 else
   info "No openshell-vm process found to kill (may already be stopped)"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-vm-backend-e2e.sh` around lines 488 - 495, Replace the broad
pkill usage with a PID-file based shutdown: read the PID from
~/.nemoclaw/openshell-vm.pid if it exists, verify the PID corresponds to a
running process whose command line contains "openshell-vm" (to avoid killing
unrelated PIDs), send a targeted kill to that PID and remove the pidfile; if the
pidfile is missing or verification fails, fall back to the existing pkill -f
"openshell-vm" behavior as a last resort and log appropriate info messages. This
change should be applied where pkill -f "openshell-vm" is used in the
test/e2e/test-vm-backend-e2e.sh script.
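A sketch of the PID-file-first shutdown the prompt describes, with command-line verification before killing and `pkill` only as a last resort (helper name and exact log strings are illustrative):

```shell
# Hypothetical targeted shutdown for the VM gateway process.
kill_vm_gateway() {
  local pid_file="$HOME/.nemoclaw/openshell-vm.pid"
  if [ -f "$pid_file" ]; then
    local pid
    pid="$(cat "$pid_file")"
    # Only kill if the PID is alive AND its command line mentions openshell-vm,
    # so a recycled PID belonging to an unrelated process is never targeted.
    if kill -0 "$pid" 2>/dev/null && ps -o args= -p "$pid" 2>/dev/null | grep -q "openshell-vm"; then
      kill "$pid" && rm -f "$pid_file"
      echo "openshell-vm process killed via PID file"
      return 0
    fi
  fi
  # Last resort: original broad-match behavior.
  pkill -f "openshell-vm" 2>/dev/null \
    && echo "openshell-vm killed via pkill fallback" \
    || echo "No openshell-vm process found to kill (may already be stopped)"
}
```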
src/lib/onboard.ts (3)

2141-2235: Complex rootfs patching logic is well-documented but could benefit from extraction.

The mqueue fix (lines 2141-2235) is a significant workaround with detailed comments explaining the kernel limitation. While the documentation is excellent, this ~95-line block could be extracted into a separate helper function (e.g., patchVmRootfsForMqueue()) to improve readability and testability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2141 - 2235, Extract the mqueue workaround
block into a new helper function named patchVmRootfsForMqueue(rootfsDir, vmEnv)
and call it where the current block lives; move all logic that creates
shimDir/shimPath, writes the runc-shim, patches the init script (referencing
initScript, shimDir, shimPath, and the K3S_ARGS replacement) and the related fs
operations and logging into that function so the main flow around
spawnSync(prepResult) stays concise; ensure the helper receives rootfsDir (from
prepOutput) and returns success/failure or throws so existing console logs
(e.g., "Patched VM init..." / "Init script not found...") remain consistent and
tests can target patchVmRootfsForMqueue independently.

2815-2849: Import script approach is clever but embeds shell code in TypeScript.

The import script (lines 2817-2833) written to the rootfs is a reasonable workaround for the exec agent limitations. However, the shell script embedded as a JavaScript array is hard to maintain and test.

Consider extracting this to a separate .sh file that gets copied to the rootfs, similar to how other scripts are handled in the scripts/ directory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2815 - 2849, The embedded shell script
written to importScript makes maintenance hard; move the script text into a
standalone file (e.g., scripts/import-image.sh) and have onboard.ts copy that
file into the VM rootfs instead of constructing the array inline. Update the
code around importScript, fs.writeFileSync, and fs.chmodSync to locate the new
script (use path.join(__dirname, "..", "scripts", "import-image.sh") or
equivalent), copy it into rootfs (fs.copyFileSync or streaming copy), then set
executable bits and keep the existing run(...) and runCapture(...) calls that
reference /opt/nemoclaw/import-image.sh; ensure any template values (if needed)
are either left static or replaced before copying and adjust tests accordingly.

2245-2265: Two functions named getInstalledOpenshellVersion exist with different signatures—consider renaming for clarity.

Line 2246 correctly calls the imported version from openshell.ts with signature (binary: string, opts: CaptureOpenshellOptions). However, a local version at line 393 exists with signature (versionOutput = null) that serves a different purpose (parsing pre-captured output). While the code uses the correct function, having two identically-named functions with different signatures and purposes invites confusion during maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2245 - 2265, There are two different
functions named getInstalledOpenshellVersion: one imported from "./openshell"
(signature (binary: string, opts: CaptureOpenshellOptions)) and a local helper
that parses pre-captured output (signature (versionOutput = null)); rename the
local parser to something unambiguous like parseOpenshellVersionOutput (or
parseCapturedOpenshellVersion), update all internal call sites to that new name,
and leave the imported getInstalledOpenshellVersion unchanged so callers like
the gatewayImage construction continue to use the correct function.
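The renamed local parser could look like this sketch (the version-string format is an assumption):

```typescript
// Hypothetical rename of the local helper: parses a semver out of
// pre-captured `openshell --version` output, returning null when absent.
export function parseOpenshellVersionOutput(versionOutput: string | null = null): string | null {
  if (!versionOutput) return null;
  const match = versionOutput.match(/(\d+\.\d+\.\d+)/);
  return match ? match[1] : null;
}
```

The imported `getInstalledOpenshellVersion` from `./openshell` keeps its name and signature; only internal call sites of the local parser change.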
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/platform.ts`:
- Around line 123-135: The detectGatewayBackend function currently defaults
opts.vmAvailable to false which causes dual-capable hosts to prefer "docker"
silently; update detectGatewayBackend to either (A) perform VM detection when
opts.vmAvailable is undefined (mirror docker detection by calling an appropriate
VM detection helper, e.g., detectVMHost or similar, and set vmAvailable based on
its result) or (B) make opts.vmAvailable required and throw a clear error if
it's not provided; ensure you update the vmAvailable initialization (and any
callers) instead of hard-coding false and keep the dockerAvailable logic that
calls detectDockerHost unchanged.

In `@src/nemoclaw.ts`:
- Around line 523-543: The VM branch constructs vmGatewayName but calls
getNamedGatewayLifecycleState() with no name; update both lifecycle checks to
call getNamedGatewayLifecycleState(vmGatewayName) so the code inspects the VM
gateway (before and after), keep runOpenshell(["gateway","select",
vmGatewayName]) and setting process.env.OPENSHELL_GATEWAY as-is, and leave
startGatewayForRecovery() behavior unchanged unless it also needs a gateway name
in future.

In `@test/openshell-vm.test.ts`:
- Around line 129-164: The VM tests are reading the real PID file and process
table causing flakiness; update the helpers or tests so they don't touch real
system state: modify readVmPid to accept an optional pidFilePath (or add an
injectable function) and make isVmGatewayHealthy accept a liveness-check
function (or inject isVmProcessAlive), then in the tests call these helpers with
a temp/test PID file and a mocked liveness probe, or alternatively mock fs
(e.g., fs.readFileSync/existsSync) and process checks used by
isVmProcessAlive/isVmGatewayHealthy; adjust tests to create controlled PID
contents and to stub process liveness so assertions are deterministic.

---

Outside diff comments:
In `@scripts/brev-launchable-ci-cpu.sh`:
- Line 31: Update the stale comment that documents the default OpenShell CLI
release tag so it matches the actual default; change the commented line
referencing "OPENSHELL_VERSION — OpenShell CLI release tag (default: v0.0.20)"
to reflect the current default v0.0.26 used where OPENSHELL_VERSION is set,
ensuring the comment and the OPENSHELL_VERSION default value are consistent.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2141-2235: Extract the mqueue workaround block into a new helper
function named patchVmRootfsForMqueue(rootfsDir, vmEnv) and call it where the
current block lives; move all logic that creates shimDir/shimPath, writes the
runc-shim, patches the init script (referencing initScript, shimDir, shimPath,
and the K3S_ARGS replacement) and the related fs operations and logging into
that function so the main flow around spawnSync(prepResult) stays concise;
ensure the helper receives rootfsDir (from prepOutput) and returns
success/failure or throws so existing console logs (e.g., "Patched VM init..." /
"Init script not found...") remain consistent and tests can target
patchVmRootfsForMqueue independently.
- Around line 2815-2849: The embedded shell script written to importScript makes
maintenance hard; move the script text into a standalone file (e.g.,
scripts/import-image.sh) and have onboard.ts copy that file into the VM rootfs
instead of constructing the array inline. Update the code around importScript,
fs.writeFileSync, and fs.chmodSync to locate the new script (use
path.join(__dirname, "..", "scripts", "import-image.sh") or equivalent), copy it
into rootfs (fs.copyFileSync or streaming copy), then set executable bits and
keep the existing run(...) and runCapture(...) calls that reference
/opt/nemoclaw/import-image.sh; ensure any template values (if needed) are either
left static or replaced before copying and adjust tests accordingly.
- Around line 2245-2265: There are two different functions named
getInstalledOpenshellVersion: one imported from "./openshell" (signature
(binary: string, opts: CaptureOpenshellOptions)) and a local helper that parses
pre-captured output (signature (versionOutput = null)); rename the local parser
to something unambiguous like parseOpenshellVersionOutput (or
parseCapturedOpenshellVersion), update all internal call sites to that new name,
and leave the imported getInstalledOpenshellVersion unchanged so callers like
the gatewayImage construction continue to use the correct function.

In `@test/e2e/test-vm-backend-e2e.sh`:
- Line 34: Update the shell options line that currently reads "set -uo pipefail"
to include "-e" so the script uses "set -euo pipefail" to enable fail-fast
behavior (or, if omission is intentional, add an explicit comment above the "set
-uo pipefail" line explaining why "-e" is omitted and how failures are tracked);
modify the single-line "set -uo pipefail" occurrence in
test/e2e/test-vm-backend-e2e.sh (the shell options line) accordingly to ensure
intermediate command failures stop the script or are clearly documented.
- Around line 291-295: The current zstd install block (the if ! command -v zstd
conditional) assumes apt-get is available; replace it with a defensive install
routine that first tries apt-get, then falls back to common package managers
(dnf, yum, pacman) and finally prints a clear error/usage message if none are
present; update the block around the "Installing zstd for runtime
decompression..." message to attempt each package manager in order, and on
failure exit with an informative message telling the user to install zstd
manually.
- Around line 488-495: Replace the broad pkill usage with a PID-file based
shutdown: read the PID from ~/.nemoclaw/openshell-vm.pid if it exists, verify
the PID corresponds to a running process whose command line contains
"openshell-vm" (to avoid killing unrelated PIDs), send a targeted kill to that
PID and remove the pidfile; if the pidfile is missing or verification fails,
fall back to the existing pkill -f "openshell-vm" behavior as a last resort and
log appropriate info messages. This change should be applied where pkill -f
"openshell-vm" is used in the test/e2e/test-vm-backend-e2e.sh script.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: accc1d5d-64cf-43eb-8220-b2e5190b92f1

📥 Commits

Reviewing files that changed from the base of the PR and between fcb5fd6 and 5f2db18.

📒 Files selected for processing (18)
  • .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
  • .github/workflows/nightly-e2e.yaml
  • docs/reference/troubleshooting.md
  • k8s/nemoclaw-k8s.yaml
  • nemoclaw-blueprint/blueprint.yaml
  • schemas/blueprint.schema.json
  • scripts/brev-launchable-ci-cpu.sh
  • scripts/install-openshell.sh
  • src/lib/onboard-session.ts
  • src/lib/onboard.ts
  • src/lib/openshell.ts
  • src/lib/platform.ts
  • src/nemoclaw.ts
  • test/e2e/brev-e2e.test.ts
  • test/e2e/test-sandbox-survival.sh
  • test/e2e/test-vm-backend-e2e.sh
  • test/openshell-vm.test.ts
  • test/platform.test.ts

Comment thread src/lib/platform.ts
Comment on lines +123 to +135
function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;

  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;

⚠️ Potential issue | 🟠 Major

Don't hard-code the default VM availability to false.

When callers omit opts.vmAvailable, this helper still resolves a dual-capable host to "docker", so the new default silently stops preferring VM. Either detect VM availability here or make vmAvailable a required input for this API.

💡 One way to keep the default preference correct
+const { isOpenshellVmAvailable } = require("./openshell");
+
 function detectGatewayBackend(opts = {}) {
   const env = opts.env ?? process.env;
   const override = env.NEMOCLAW_GATEWAY_BACKEND;
   if (override === "vm" || override === "docker") return override;
 
   const vmAvailable =
     typeof opts.vmAvailable === "boolean"
       ? opts.vmAvailable
-      : false; // caller should pass actual detection result
+      : isOpenshellVmAvailable();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;

After:

const { isOpenshellVmAvailable } = require("./openshell");

function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : isOpenshellVmAvailable();
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/platform.ts` around lines 123 - 135, The detectGatewayBackend
function currently defaults opts.vmAvailable to false which causes dual-capable
hosts to prefer "docker" silently; update detectGatewayBackend to either (A)
perform VM detection when opts.vmAvailable is undefined (mirror docker detection
by calling an appropriate VM detection helper, e.g., detectVMHost or similar,
and set vmAvailable based on its result) or (B) make opts.vmAvailable required
and throw a clear error if it's not provided; ensure you update the vmAvailable
initialization (and any callers) instead of hard-coding false and keep the
dockerAvailable logic that calls detectDockerHost unchanged.
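For reference, the detection order documented in the PR description (env override → Docker → VM binary → unknown) reduces to a few lines; this standalone sketch uses an illustrative name to avoid clashing with the real `detectGatewayBackend`, and takes availability flags as explicit inputs:

```typescript
// Sketch of the documented backend-selection order; availability detection
// is injected rather than performed here.
type Backend = "vm" | "docker" | "unknown";

export function detectBackend(opts: {
  env?: Record<string, string | undefined>;
  dockerAvailable: boolean;
  vmAvailable: boolean;
}): Backend {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override; // 1. env override wins
  if (opts.dockerAvailable) return "docker";                       // 2. Docker is the default
  if (opts.vmAvailable) return "vm";                               // 3. VM binary in PATH
  return "unknown";                                                // 4. onboard fails with guidance
}
```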

Comment thread src/nemoclaw.ts
Comment on lines +523 to +543
if (session?.gatewayBackend === "vm") {
  // openshell-vm registers as "openshell-vm-nemoclaw", not "nemoclaw"
  const vmGatewayName = `openshell-vm-${NEMOCLAW_GATEWAY_NAME}`;
  runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true });
  const before = getNamedGatewayLifecycleState();
  if (before.state === "healthy_named") {
    process.env.OPENSHELL_GATEWAY = vmGatewayName;
    return { recovered: true, before, after: before, attempted: false };
  }
  try {
    await startGatewayForRecovery();
  } catch {
    /* fall through */
  }
  runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true });
  const after = getNamedGatewayLifecycleState();
  if (after.state === "healthy_named") {
    process.env.OPENSHELL_GATEWAY = vmGatewayName;
    return { recovered: true, before, after, attempted: true, via: "start" };
  }
  return { recovered: false, before, after, attempted: true };

⚠️ Potential issue | 🔴 Critical

Parameterize the lifecycle check with the VM gateway name.

This branch selects openshell-vm-nemoclaw, but getNamedGatewayLifecycleState() still queries and compares against "nemoclaw". As written, the VM path can never observe healthy_named, so recovery falls through as failed even when the VM gateway is already up.

🛠️ Suggested shape of the fix
-function getNamedGatewayLifecycleState() {
+function getNamedGatewayLifecycleState(gatewayName = NEMOCLAW_GATEWAY_NAME) {
   const status = captureOpenshell(["status"]);
-  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", "nemoclaw"]);
+  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", gatewayName]);
   const cleanStatus = stripAnsi(status.output);
   const activeGateway = getActiveGatewayName(status.output);
   const connected = /^\s*Status:\s*Connected\b/im.test(cleanStatus);
-  const named = hasNamedGateway(gatewayInfo.output);
+  const named = stripAnsi(gatewayInfo.output).includes(`Gateway: ${gatewayName}`);
   const refusing = /Connection refused|client error \(Connect\)|tcp connect error/i.test(
     cleanStatus,
   );
-  if (connected && activeGateway === "nemoclaw" && named) {
+  if (connected && activeGateway === gatewayName && named) {
     return { state: "healthy_named", status: status.output, gatewayInfo: gatewayInfo.output };
   }
   ...
 }

And in this branch:

-const before = getNamedGatewayLifecycleState();
+const before = getNamedGatewayLifecycleState(vmGatewayName);
...
-const after = getNamedGatewayLifecycleState();
+const after = getNamedGatewayLifecycleState(vmGatewayName);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 523 - 543, The VM branch constructs
vmGatewayName but calls getNamedGatewayLifecycleState() with no name; update
both lifecycle checks to call getNamedGatewayLifecycleState(vmGatewayName) so
the code inspects the VM gateway (before and after), keep
runOpenshell(["gateway","select", vmGatewayName]) and setting
process.env.OPENSHELL_GATEWAY as-is, and leave startGatewayForRecovery()
behavior unchanged unless it also needs a gateway name in future.

Comment thread test/openshell-vm.test.ts
Comment on lines +129 to +164
describe("readVmPid", () => {
  it("returns null when PID file does not exist", () => {
    // readVmPid reads from the real path (~/.nemoclaw/openshell-vm.pid)
    // but we can test the isVmProcessAlive helper with known PIDs
    expect(readVmPid()).toSatisfy((v) => v === null || typeof v === "number");
  });
});

describe("isVmProcessAlive", () => {
  it("returns false for null PID", () => {
    expect(isVmProcessAlive(null)).toBe(false);
  });

  it("returns false for PID 0", () => {
    expect(isVmProcessAlive(0)).toBe(false);
  });

  it("returns false for negative PID", () => {
    expect(isVmProcessAlive(-1)).toBe(false);
  });

  it("returns true for current process PID", () => {
    expect(isVmProcessAlive(process.pid)).toBe(true);
  });

  it("returns false for non-existent PID", () => {
    // Use a very high PID that's unlikely to exist
    expect(isVmProcessAlive(4_000_000)).toBe(false);
  });
});

describe("isVmGatewayHealthy", () => {
  it("returns false when no VM process is running", () => {
    // With no PID file, there's no VM process to check
    expect(isVmGatewayHealthy()).toBe(false);
  });

⚠️ Potential issue | 🟠 Major

Make these VM lifecycle tests hermetic.

readVmPid() and isVmGatewayHealthy() here read the real ~/.nemoclaw/openshell-vm.pid and process table, so the assertions change based on whatever is already running on the host. That makes this unit suite flaky on dev machines and shared CI workers. Inject the pid-file path / liveness probe into the helpers, or mock fs and the process checks for these cases instead. As per coding guidelines: test/**/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/openshell-vm.test.ts` around lines 129 - 164, The VM tests are reading
the real PID file and process table causing flakiness; update the helpers or
tests so they don't touch real system state: modify readVmPid to accept an
optional pidFilePath (or add an injectable function) and make isVmGatewayHealthy
accept a liveness-check function (or inject isVmProcessAlive), then in the tests
call these helpers with a temp/test PID file and a mocked liveness probe, or
alternatively mock fs (e.g., fs.readFileSync/existsSync) and process checks used
by isVmProcessAlive/isVmGatewayHealthy; adjust tests to create controlled PID
contents and to stub process liveness so assertions are deterministic.

@wscurran wscurran added enhancement New feature or request Docker Support for Docker containerization labels Apr 13, 2026
jyaunches pushed a commit that referenced this pull request Apr 14, 2026
… version field

Refactor test/e2e/brev-e2e.test.ts for issue #1390:

- Extract named helpers from the ~350-line monolithic beforeAll:
  cleanupLeftoverInstance(), createBrevInstance(), refreshAndWaitForSsh(),
  bootstrapLaunchable(), pollForSandboxReady(), writeManualRegistry().
  The beforeAll now reads as a high-level orchestration (~50 lines).

- Deduplicate brev create + error recovery: the deploy-cli and launchable
  paths shared duplicated brev refresh + waitForSsh patterns, now
  consolidated into refreshAndWaitForSsh().

- Remove phantom version: 1 from the manual registry write. The
  SandboxRegistry interface in src/lib/registry.ts has no version field;
  registerSandbox() doesn't write one either.

- bootstrapLaunchable() returns { remoteDir, needsOnboard } instead of
  mutating module-level state as a hidden side-effect.

- instanceCreated is set at call sites in beforeAll, not hidden inside
  createBrevInstance().

- Remove dead sleep() helper (defined but never called).

Pure refactoring — no behavior changes. // @ts-nocheck pragma preserved.

Note: PR #1791 (openshell-vm microVM) also touches this file — whoever
merges second will need a rebase.
ericksoa added a commit that referenced this pull request Apr 16, 2026
…nstaller detection (#1888)

## Summary

Addresses all three items from #1390, plus a bonus installer fix
discovered while running the pre-push hooks.

### Issue #1390 — Brev E2E cleanup

- **Extract helpers from the monolithic beforeAll**: The ~350-line
`beforeAll` block is now ~50 lines of high-level orchestration calling
named helpers: `cleanupLeftoverInstance()`, `createBrevInstance()`,
`refreshAndWaitForSsh()`, `bootstrapLaunchable()`,
`pollForSandboxReady()`, `writeManualRegistry()`.

- **Deduplicate brev create + error recovery**: The deploy-cli and
launchable paths shared duplicated `brev refresh` + `waitForSsh`
patterns, now consolidated into `refreshAndWaitForSsh()`.

- **Remove phantom `version: 1`** from the manual registry write. The
`SandboxRegistry` interface in `src/lib/registry.ts` has no version
field; `registerSandbox()` doesn't write one either.

Additional cleanup:
- `bootstrapLaunchable()` returns `{ remoteDir, needsOnboard }` instead
of mutating module-level state as a hidden side-effect.
- `instanceCreated` is set at call sites in `beforeAll`, not hidden
inside `createBrevInstance()`.
- Dead `sleep()` helper removed.

Pure refactoring — no behavior changes. `// @ts-nocheck` pragma
preserved.

### Installer worktree fix

`scripts/install.sh` used `-d "${repo_root}/.git"` to detect source
checkouts, but in a git worktree `.git` is a file, not a directory. This
caused `is_source_checkout()` to return false, falling through to the
GitHub clone path. Fixed by using `-e` (exists) instead of `-d` (is
directory). This resolved 12 pre-existing test failures in
`test/install-preflight.test.ts`.

## Related Issue
Closes #1390

## Note
PR #1791 also touches `test/e2e/brev-e2e.test.ts` (appends a new
`vm-backend` test case). Clean merge — whoever merges second rebases
trivially.

## Type of Change
- [x] Code change for a new feature, bug fix, or refactor.

## Testing
- [x] `npx prek run --all-files` passes (all pre-commit and pre-push
hooks green).
- [x] `npx vitest run --project cli` — 1213 passed, 0 failed (was 12
failed before installer fix).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Enhanced end-to-end test infrastructure and setup orchestration for
improved test reliability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Merge main into feat/openshell-vm-backend, keeping both:
- VM backend features (openshell-vm lifecycle, K8s manifest, E2E job)
- Main branch additions (sandbox-operations, inference-routing, snapshot,
  shields-config, rebuild, upgrade-stale-sandbox, hermes rebuild E2E jobs,
  stale gateway container detection, LOCAL_INFERENCE_PROVIDERS, version
  pinning, pruneKnownHostsEntries)
The VM backend K8s manifest used privileged: true and curl|sh, which
fails the security-configuration-hardening test. Restore main's
Docker-in-Docker manifest with proper security context and add a
comment pointing to VM backend docs.
Revert the Docker-based manifest restoration — the point of this PR is
the microVM backend. Restore the VM manifest with pod-level hardening
from main (automountServiceAccountToken, enableServiceLinks, secretKeyRef
for COMPATIBLE_API_KEY, suggested policy mode).

Update the security-configuration-hardening test to branch on backend
type: VM backend validates privileged: true for KVM, Docker backend
validates the full container-level lockdown and secure download pattern.
Swap detectGatewayBackend priority so Docker is preferred when both
runtimes are available. Users can opt into the VM backend by setting
NEMOCLAW_GATEWAY_BACKEND=vm — the env override is already wired
through detectGatewayBackend and the E2E job sets it.

Restore the Docker-based K8s manifest and security test since Docker
is the default path. All VM backend code remains intact and exercised
by the vm-e2e nightly job.
@ericksoa ericksoa changed the title feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30) feat(vm): add openshell-vm microVM gateway backend (opt-in via NEMOCLAW_GATEWAY_BACKEND=vm) Apr 20, 2026
ericksoa added 11 commits April 19, 2026 21:09
OpenShell v0.0.28+ includes CONFIG_POSIX_MQUEUE=y in the VM kernel
(commit d8cf7951), so the runc shim that replaced mqueue with tmpfs
is no longer needed. Remove ~80 lines of workaround code.

The glibc supervisor workaround stays — upstream still builds
openshell-sandbox against glibc 2.39 (Ubuntu 24.04) but sandbox
containers use Ubuntu 22.04 (glibc 2.35).

Version pin changes:
- max_openshell_version: 0.0.26 → 0.0.32
- install-openshell.sh MAX_VERSION: 0.0.26 → 0.0.32
- brev-launchable-ci-cpu.sh default: v0.0.26 → v0.0.32
- test-vm-backend-e2e.sh MIN_OPENSHELL: 0.0.26 → 0.0.32
- glibc fallback image tag: 0.0.26 → 0.0.32
The vm-dev runtime artifacts still ship without CONFIG_POSIX_MQUEUE
despite the source fix (d8cf7951). Every container inside the VM
fails with "error mounting mqueue: no such device" without the shim.

Restore the self-disabling workaround. The version bump to 0.0.32
is kept — it brings security improvements and the shim will no-op
once the vm-dev kernel is actually rebuilt.
A cancelled worktree agent extracted provider code into a new
onboard-providers.ts module and rewrote imports in onboard.ts. This
introduced a duplicate `getSandboxInferenceConfig` identifier (imported
from the new module AND defined locally) causing a runtime SyntaxError
that broke `nemoclaw --version` and all E2E jobs.

Fix: restore onboard.ts from the clean merge commit (878ef15) and
re-apply only the glibc fallback version bump (0.0.26 → 0.0.32).
The mqueue shim and all VM lifecycle code are intact.
The inference route can take a few seconds to come back after the VM
gateway restarts. Add a 3-attempt retry with 10s pauses to avoid
flaky failures on the post-resume PONG check.
When the VM gateway is killed and --resume recreates it, the new
gateway has no provider/route state from the old one. But the resume
logic was checking isInferenceRouteReady() which could return stale
metadata, causing the inference step to be skipped. The sandbox then
has no working inference.local route.

Track when the gateway is recreated during resume and force inference
re-registration in that case. Also add diagnostics to the E2E test
(SSH check, raw response, route/provider state on failure).
When the VM gateway dies and --resume recreates it, the old sandbox
pod state persists on the VM's disk. openshell sandbox list reports
it as "ready" (from recovered k8s metadata) but the pod isn't
actually functional — SSH fails with "sandbox is not ready".

Extend the gatewayRecreatedDuringResume guard to also skip the
sandbox reuse check, forcing full sandbox recreation alongside
the inference provider re-registration.
k3s bootstrap inside the microVM can fail due to internal race
conditions (e.g. cloud-controller-manager starting before required
configmaps exist). When this happens the VM process dies and gvproxy
loses its virtio-net socket.

Instead of failing immediately, retry up to 3 times (configurable
via NEMOCLAW_VM_START_ATTEMPTS). Each retry stops the dead VM
process and spawns a fresh one. Log tail is printed between
attempts for diagnostics.
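The retry loop described in this commit reduces to something like the following sketch (the injected start/stop/log functions are assumptions standing in for the real lifecycle code):

```typescript
// Sketch of the VM start retry loop: up to NEMOCLAW_VM_START_ATTEMPTS attempts,
// stopping the dead VM and printing a log tail between failures.
const DEFAULT_ATTEMPTS = 3;

export async function startVmWithRetry(
  startVm: () => Promise<boolean>, // spawns openshell-vm; resolves true when healthy
  stopDeadVm: () => void,          // kills the dead VM process, if any
  printLogTail: () => void,        // diagnostics between attempts
  env: NodeJS.ProcessEnv = process.env,
): Promise<void> {
  const attempts =
    Number.parseInt(env.NEMOCLAW_VM_START_ATTEMPTS ?? "", 10) || DEFAULT_ATTEMPTS;
  for (let i = 1; i <= attempts; i++) {
    if (await startVm()) return;
    printLogTail();
    if (i < attempts) stopDeadVm(); // clean up before spawning a fresh VM
  }
  throw new Error(`openshell-vm failed to start after ${attempts} attempts`);
}
```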
setupInference hardcodes `gateway select nemoclaw` which resets the
active gateway away from `openshell-vm-nemoclaw` when the VM backend
is in use. This causes provider creation to fail with "No gateway
metadata found for 'nemoclaw'" and leaves the sandbox with no
inference route, no providers, and broken SSH.

Add getEffectiveGatewayName() that returns the correct gateway name
based on the active backend. Use it in setupInference instead of the
hardcoded GATEWAY_NAME constant.
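Given the naming convention noted elsewhere in this PR (openshell-vm registers as "openshell-vm-nemoclaw"), the helper is essentially a one-liner; this sketch inlines the constant for self-containment:

```typescript
// Sketch of getEffectiveGatewayName(): resolves the gateway name for the
// active backend instead of the hardcoded GATEWAY_NAME constant.
const GATEWAY_NAME = "nemoclaw";

export function getEffectiveGatewayName(backend: "vm" | "docker"): string {
  // openshell-vm registers the gateway as "openshell-vm-<name>", not "<name>".
  return backend === "vm" ? `openshell-vm-${GATEWAY_NAME}` : GATEWAY_NAME;
}
```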
The VM resume had a fast-path that short-circuited the entire resume
flow: restart VM, check if sandbox metadata says "ready", wait for
inference route, exit 0. This assumed k3s state survives an unclean
VM shutdown — but it doesn't reliably. Providers, routes, and sandbox
pod readiness are all stale after VM kill.

Replace with: mark all session steps as pending and fall through to
the normal resume path, which re-runs gateway startup, inference
provider registration, and sandbox creation from scratch. This is
the same path used for Docker gateway recovery and is known to work.
