Skip to content

feat(openclaw): lifecycle progress banner + live podman readiness#772

Merged
Nikhil (shadowfax92) merged 3 commits intodevfrom
fix/podman-2
Apr 21, 2026
Merged

feat(openclaw): lifecycle progress banner + live podman readiness#772
Nikhil (shadowfax92) merged 3 commits intodevfrom
fix/podman-2

Conversation

@shadowfax92
Copy link
Copy Markdown
Contributor

Summary

  • Remove the process-local PodmanRuntime.machineReady cache so every lifecycle action re-validates live Podman machine state; cache drift no longer causes a stale "ready" bit to outlive the real machine.
  • Surface one neutral in-flight banner on the Agents page for gateway lifecycle work (setup, start, stop, restart, reconnect). Suppress destructive error UI and degraded control-plane alerts while the action is pending; show them again once the action settles.
  • Disable mutating gateway and agent controls while a gateway lifecycle action is in flight; keep read-only UI visible.

Design

Spec: packages/browseros-agent/docs/plans/2026-04-20-openclaw-lifecycle-progress-design.md.

Chose the minimal local-progress design (Option 1 in the spec). Backend fix is scoped to removing the stale readiness cache; the existing in-process lifecycle lock on OpenClawService is preserved. The UI derives pendingGatewayAction purely from React Query mutation state — no new server API surface. Progress state is per-session only; if the user reloads mid-action the banner is lost until normal status settles, which is acceptable for this iteration. start() stays idempotent (probes the existing gateway before recreating) and ensureGatewayPortAllocated() keeps the persisted port when the gateway is already reachable.

Test plan

  • bun run typecheck in apps/server and apps/agent — both clean.
  • bun test tests/api/services/openclaw/ in apps/server — 80/80 pass; includes 4 new ensureReady tests covering live-state re-check, init-when-uninitialized, start-when-stopped, external-stop detection on next call.
  • Manual: trigger OpenClaw setup → banner shows "Setting up OpenClaw..." and neither the destructive error alert nor the degraded control-plane alert appears during setup.
  • Manual: stop the Podman machine externally, then click any lifecycle action → backend re-initializes/starts the machine instead of short-circuiting on a cached "ready" bit.
  • Manual: during setup, New Agent / Create Agent / Chat / Delete buttons are disabled; after setup settles they re-enable normally.
  • Manual: if setup rejects, the destructive error banner appears once lifecyclePending clears.

@shadowfax92
Copy link
Copy Markdown
Contributor Author

Greptile (@greptileai) review

@shadowfax92
Copy link
Copy Markdown
Contributor Author

Claude (@claude) review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude finished Nikhil (@shadowfax92)'s task in 3m 18s —— View job


Code Review

  • Gather context and read changed files
  • Review openclaw-service.ts (core logic changes)
  • Review podman-runtime.ts (cache removal)
  • Review AgentsPage.tsx (UI changes)
  • Review useOpenClaw.ts (mutation state hook)
  • Review test files
  • Compile findings and post review

Summary

Solid, well-scoped PR. The core change (removing machineReady cache from PodmanRuntime) is the right fix and the new tests cover it well. The lifecycle-banner approach using React Query mutation state is clean with no new server API surface. A few issues worth addressing before merging.


Issues

1. restart() skips ensureReady() — inconsistent with start()

openclaw-service.ts:348-385start() calls this.runtime.ensureReady(logProgress) at line 285 to bring up the Podman machine if it's stopped. restart() does not. The PR description says the goal is live re-validation of machine state on every lifecycle action, but Restart silently skips this — if the user externally stops the Podman machine and then clicks Restart, the call will fail at the container level rather than recovering via ensureReady.

There's also no test asserting ensureReady is called during restart, while the start test explicitly verifies it (expect(ensureReady).toHaveBeenCalledTimes(1)).

Fix this →


2. Polling errors surface as sticky "OpenClaw action failed" alert

AgentsPage.tsx:422-424

const inlineError = lifecyclePending
  ? null
  : (error ?? statusError?.message ?? agentsError?.message ?? null)

When the user dismisses the alert, setError(null) clears only local error state. If statusError is a React Query error from the background poll, it re-populates inlineError on the next poll cycle (5 seconds), making the alert reappear immediately after dismissal. A brief network blip during the status poll would show "OpenClaw action failed" indefinitely. The alert title is also misleading for poll errors vs. user-initiated action errors.

Consider either: (a) only showing error (the local mutation error) in the inlineError banner, and having a separate subtle indicator for poll failures, or (b) tracking statusError separately with its own dismiss state.

Fix this →


3. Degraded control-plane alert fires briefly during normal reconnect after lifecycle action

AgentsPage.tsx:457-458

const showControlPlaneDegraded =
  !lifecyclePending && gatewayUiState.controlPlaneDegraded

controlPlaneDegraded is true when status === 'running' && controlPlaneStatus !== 'connected'. After a lifecycle action settles (e.g. start finishes), lifecyclePending becomes false. But the server status still transitions through 'connecting''connected' before the next poll confirms connected. In the window between lifecycle-done and the next status poll, the degraded alert briefly appears with the "Gateway Disconnected / WifiOff" icon, then disappears. The 5-second polling interval makes this visible.

controlPlaneBusy already identifies connecting/reconnecting/recovering as transient states — but it's only used to disable the Reconnect button, not to gate the degraded alert. Consider adding !gatewayUiState.controlPlaneBusy to the showControlPlaneDegraded condition to suppress the alert during transient states.

Fix this →


Minor / Nits

openclaw-service.ts:772-782 — Unnecessary duck-type cast for isHealthy

const runtime = this.runtime as {
  isHealthy?: (port: number) => Promise<boolean>
}
if (runtime.isHealthy) {
  return runtime.isHealthy(hostPort)
}

ContainerRuntime already declares isHealthy as a public method (container-runtime.ts:118). The cast exists because test mocks replace this.runtime with plain object literals that omit isHealthy. The right fix here is either: call (this.runtime as ContainerRuntime).isHealthy(hostPort) directly, or add isHealthy to the test mock interface MutableOpenClawService['runtime']. As written, this looks like dead code to readers unfamiliar with the test structure.

openclaw-service.tsREADY_TIMEOUT_MS should use shared constants

Per the browseros-agent CLAUDE.md, timeout values should come from @browseros/shared/constants/timeouts rather than be defined as module-level constants. The 30s value at line 53 should live in TIMEOUTS.

useOpenClaw.ts:234-239pendingGatewayAction priority order

restart is prioritized above stop and start. If somehow both stop and restart mutations were pending simultaneously (edge case, practically impossible due to server lock), the banner would show "Restarting..." over "Stopping...". Consider documenting the intent of this ordering, or just noting it's arbitrary since the server lock prevents simultaneous mutations.


Tests

  • The 4 new ensureReady tests in podman-runtime.test.ts are well-structured and directly cover the cache-removal regression. The FakePodmanRuntime subclass is a cleaner pattern than mock injection here. ✓
  • The concurrent start serialization test ('serializes concurrent start calls...') is a nice addition that would have caught any lifecycle-lock regression. ✓
  • Gap: No test for restart() being called when the Podman machine is stopped. This is the highest-value missing case given the PR's stated goal.

What's Working Well

  • The withLifecycleLock pattern cleanly serializes operations without needing async-mutex or similar. The await previous.catch(() => undefined) idiom correctly drains the previous operation before proceeding.
  • pendingGatewayAction derived purely from React Query mutation state is elegant — no new server API surface, survives hot reload.
  • The ensureGatewayPortAllocated idempotency fix (returning early when the current gateway is still reachable) is the right guard to prevent port churn on restart.
  • Test coverage is comprehensive for the existing behavior; 80/80 passing is a good signal.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Tests passed — 794/797

Suite Passed Failed Skipped
agent-sdk 44/44 0 0
agent 20/20 0 0
build 7/7 0 0
eval 8/8 0 0
server-agent 261/261 0 0
server-api 134/134 0 0
server-browser 3/3 0 0
server-integration 9/9 0 0
server-root 35/38 0 3
server-sdk 11/11 0 0
server-skills 31/31 0 0
server-tools 231/231 0 0

View workflow run

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR removes the process-local machineReady cache from PodmanRuntime.ensureReady() so each lifecycle call re-validates live Podman machine state, and adds a per-session in-flight progress banner on the Agents page that suppresses destructive error UI while a gateway lifecycle action is pending. Controls are disabled via the existing actionInProgress flag derived from React Query mutation state, and no new server API surface is added.

Confidence Score: 5/5

Safe to merge — no P0/P1 issues found; all changes are well-scoped, tested, and logically correct.

The backend cache removal is a clean single-concern change with 4 targeted tests verifying live re-check, init, start, and external-stop detection. The UI progress banner correctly derives state from TanStack Query mutation flags without new server API surface, and the mutual-exclusion of the actionInProgress flag ensures controls cannot be double-triggered. All remaining observations are P2 style notes.

No files require special attention.

Important Files Changed

Filename Overview
packages/browseros-agent/apps/server/src/api/services/openclaw/podman-runtime.ts Removed the stale machineReady boolean cache; ensureReady() now calls getMachineStatus() on every invocation and conditionally inits/starts. Logic is correct and fully covered by 4 new tests.
packages/browseros-agent/apps/agent/entrypoints/app/agents/useOpenClaw.ts Adds GatewayLifecycleAction type and derives pendingGatewayAction from TanStack Query mutation state; well-structured with correct mutual-exclusion guarantees via actionInProgress.
packages/browseros-agent/apps/agent/entrypoints/app/agents/AgentsPage.tsx Adds lifecycle progress banner, suppresses destructive error/control-plane-degraded alerts during pending gateway actions, disables mutating controls via actionInProgress; logic is sound.
packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts getStatus() now calls getMachineStatus() live instead of relying on a cached field; lifecycle lock is preserved; all paths look correct.
packages/browseros-agent/apps/server/tests/api/services/openclaw/podman-runtime.test.ts Four new ensureReady tests cover live re-check, init-when-uninitialized, start-when-stopped, and external-stop detection — all passing the described scenarios.
packages/browseros-agent/apps/server/tests/api/services/openclaw/openclaw-service.test.ts Adds serialization and idempotent-start tests; all runtime mocks updated to include getMachineStatus where needed. Coverage looks thorough.

Sequence Diagram

sequenceDiagram
    participant UI as AgentsPage
    participant Hook as useOpenClawMutations
    participant API as BrowserOS Server
    participant OCS as OpenClawService
    participant PR as PodmanRuntime

    UI->>Hook: handleSetup() → setupOpenClaw()
    Note over Hook: setupMutation.isPending=true<br/>pendingGatewayAction='setup'
    Hook-->>UI: lifecyclePending=true (banner shown, errors suppressed)
    Hook->>API: POST /claw/setup
    API->>OCS: setup() [withLifecycleLock]
    OCS->>PR: ensureReady() [live getMachineStatus()]
    PR->>PR: getMachineStatus() — no cache
    alt machine not initialized
        PR->>PR: initMachine()
    end
    alt machine not running
        PR->>PR: startMachine()
    end
    PR-->>OCS: machine ready
    OCS->>OCS: startGateway / waitForReady / probe
    OCS-->>API: done
    API-->>Hook: 200 OK
    Note over Hook: setupMutation.isPending=false<br/>pendingGatewayAction=null
    Hook->>API: invalidateQueries (status + agents)
    Hook-->>UI: lifecyclePending=false (banner gone, normal UI restored)
Loading

Reviews (1): Last reviewed commit: "fix: address review comments for opencla..." | Re-trigger Greptile

@shadowfax92 Nikhil (shadowfax92) merged commit 5ccdbaf into dev Apr 21, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant