fix: harden OpenClaw Podman runtime flow#770
fix: harden OpenClaw Podman runtime flow#770Nikhil (shadowfax92) wants to merge 20 commits intodevfrom
Conversation
|
Claude (@claude) review |
|
Greptile (@greptileai) review |
|
Claude finished Nikhil (@shadowfax92)'s task in 2m 57s —— View job Code Review: fix/harden OpenClaw Podman runtime flow
Overall this is a solid, well-structured hardening pass. The separation between the new Bug:
|
❌ Tests failed — 4/830 failed
Failed tests
|
Greptile SummaryThis PR hardens the OpenClaw Podman runtime by introducing a dedicated named machine ( Two P1 issues need attention before merging:
Confidence Score: 3/5Two P1 bugs in the recovery path should be fixed before merging. The two P1 issues both affect the exact failure scenarios this PR is designed to harden:
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[tryAutoStart / setup / start / restart] --> B{Podman available?}
B -- No --> Z[return uninitialized]
B -- Yes --> C[PodmanRuntime.ensureReady]
C --> D{machine initialized?}
D -- No --> E[initMachine]
E --> F[startMachine]
D -- Yes --> G{machine running?}
G -- No --> F
G -- Yes --> H[ContainerRuntime.startGateway]
F --> H
H --> I{preferred port free?}
I -- Yes --> J[podman run -p preferredPort]
I -- No --> K[allocateDistinctEphemeralPort]
K --> J
J --> L{exit 0?}
L -- No, bind conflict < 3 attempts --> K
L -- No, other error --> ERR[throw]
L -- Yes --> M[applyGatewayPort / waitForReady]
M --> N{ready?}
N -- No --> ERR
N -- Yes --> O[runControlPlaneCall probe]
O --> P[recordSuccessfulGatewayStart - save runtime-state.json]
P --> Q[status: running / connected]
Q -- restart error + machine corruption --> R[repairRuntime]
R --> R1[stopGateway best-effort]
R1 --> R2[stopMachineIfSafe best-effort]
R2 --> R3[ensureReady]
R3 --> R4[startGateway]
R4 --> R5[save repairGeneration++]
Q -- UI reset --> S[resetRuntime]
S --> S1[stopGateway NO try/catch]
S1 -- throws if machine down --> ERR2[reset fails - state unchanged]
S1 -- ok --> S2[clearState + save defaultRuntimeState]
|
| private hasFailureState(): boolean { | ||
| return ( | ||
| this.controlPlaneStatus === 'failed' || | ||
| this.controlPlaneStatus === 'recovering' || | ||
| this.lastGatewayError !== null || | ||
| this.lastError !== null || | ||
| this.lastRecoveryReason !== null | ||
| ) | ||
| } |
There was a problem hiding this comment.
hasFailureState() treats recovering as an error, masking repair progress
hasFailureState() returns true when controlPlaneStatus === 'recovering', so resolveStatus() always returns 'error' while repairRuntime is active (which sets the status to 'recovering'). The intended 'starting' branch in resolveStatus is never reached for 'recovering' because hasFailureState short-circuits it. On the client side, getOpenClawOperatorState maps status === 'error' to 'needs-attention', so the UI shows an error card with repair/reset controls during an ongoing repair — the user can re-trigger repair while one is already running.
| private hasFailureState(): boolean { | |
| return ( | |
| this.controlPlaneStatus === 'failed' || | |
| this.controlPlaneStatus === 'recovering' || | |
| this.lastGatewayError !== null || | |
| this.lastError !== null || | |
| this.lastRecoveryReason !== null | |
| ) | |
| } | |
| private hasFailureState(): boolean { | |
| return ( | |
| this.controlPlaneStatus === 'failed' || | |
| this.lastGatewayError !== null || | |
| this.lastError !== null || | |
| this.lastRecoveryReason !== null | |
| ) | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
Line: 1128-1136
Comment:
**`hasFailureState()` treats `recovering` as an error, masking repair progress**
`hasFailureState()` returns `true` when `controlPlaneStatus === 'recovering'`, so `resolveStatus()` always returns `'error'` while `repairRuntime` is active (which sets the status to `'recovering'`). The intended `'starting'` branch in `resolveStatus` is never reached for `'recovering'` because `hasFailureState` short-circuits it. On the client side, `getOpenClawOperatorState` maps `status === 'error'` to `'needs-attention'`, so the UI shows an error card with repair/reset controls during an ongoing repair — the user can re-trigger repair while one is already running.
```suggestion
private hasFailureState(): boolean {
return (
this.controlPlaneStatus === 'failed' ||
this.lastGatewayError !== null ||
this.lastError !== null ||
this.lastRecoveryReason !== null
)
}
```
How can I resolve this? If you propose a fix, please make it concise.| async resetRuntime(): Promise<void> { | ||
| try { | ||
| this.stopGatewayLogTail() | ||
| await this.runtime.stopGateway() | ||
| await this.runtime.stopMachineIfSafe() | ||
| this.controlPlaneStatus = 'disconnected' | ||
| this.tokenLoaded = false | ||
| this.lastGatewayError = null | ||
| this.lastRecoveryReason = null | ||
| this.lastError = null | ||
| this.applyGatewayPort(OPENCLAW_GATEWAY_PORT) | ||
| await this.saveRuntimeState(this.defaultRuntimeState()) | ||
| logger.info('OpenClaw runtime reset', { port: this.port }) | ||
| } catch (error) { | ||
| this.controlPlaneStatus = 'failed' | ||
| this.lastError = error instanceof Error ? error.message : String(error) | ||
| this.lastGatewayError = this.lastError | ||
| this.lastRecoveryReason = this.classifyControlPlaneError(error) | ||
| throw error | ||
| } | ||
| } |
There was a problem hiding this comment.
resetRuntime lets stopGateway throw, blocking reset when the gateway is already dead
Unlike repairRuntime, which wraps both stopGateway and stopMachineIfSafe in their own try/catch (best-effort), resetRuntime calls stopGateway() bare. If the Podman machine is down and the Podman CLI itself fails (exits non-zero for a reason other than a missing container), stopGateway throws and resetRuntime re-throws without ever clearing controlPlaneStatus, lastError, or the persisted runtime state. A user hitting "Reset" as a last resort after a machine failure will find that reset itself fails for exactly the same reason.
// Suggested fix: mirror the pattern from repairRuntime
async resetRuntime(): Promise<void> {
try {
this.stopGatewayLogTail()
try {
await this.runtime.stopGateway()
} catch {
// Best effort — gateway may already be gone
}
try {
await this.runtime.stopMachineIfSafe()
} catch {
// Best effort
}
this.controlPlaneStatus = 'disconnected'
// ... rest of the resets
}
}Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/server/src/api/services/openclaw/openclaw-service.ts
Line: 442-462
Comment:
**`resetRuntime` lets `stopGateway` throw, blocking reset when the gateway is already dead**
Unlike `repairRuntime`, which wraps both `stopGateway` and `stopMachineIfSafe` in their own `try/catch` (best-effort), `resetRuntime` calls `stopGateway()` bare. If the Podman machine is down and the Podman CLI itself fails (exits non-zero for a reason other than a missing container), `stopGateway` throws and `resetRuntime` re-throws without ever clearing `controlPlaneStatus`, `lastError`, or the persisted runtime state. A user hitting "Reset" as a last resort after a machine failure will find that reset itself fails for exactly the same reason.
```typescript
// Suggested fix: mirror the pattern from repairRuntime
async resetRuntime(): Promise<void> {
try {
this.stopGatewayLogTail()
try {
await this.runtime.stopGateway()
} catch {
// Best effort — gateway may already be gone
}
try {
await this.runtime.stopMachineIfSafe()
} catch {
// Best effort
}
this.controlPlaneStatus = 'disconnected'
// ... rest of the resets
}
}
```
How can I resolve this? If you propose a fix, please make it concise.
Summary
Design
This change keeps BrowserOS in control of a dedicated OpenClaw runtime without exposing container or VM concepts in the UI. The server now owns named-machine lifecycle, discovers and retries host-port binding safely, persists the chosen host port plus repair metadata, and uses runtime inspection/status to drive service recovery. The agent UI collapses raw runtime/control-plane details into a simpler operator state with explicit repair/reset actions while preserving setup, agent management, and Podman override flows.
Test plan
bun test apps/server/tests/api/services/openclaw/openclaw-runtime-state.test.ts apps/server/tests/api/services/openclaw/podman-runtime.test.ts apps/server/tests/api/services/openclaw/container-runtime.test.ts apps/server/tests/api/services/openclaw/openclaw-service.test.ts apps/server/tests/api/routes/openclaw.test.tsbun test apps/agent/entrypoints/app/agents/openclaw-operator-state.test.tsbun run --filter @browseros/server typecheckbun run --filter @browseros/agent typecheck