fix(onboard): retry and fallback for container reachability check #2453
Conversation
Add timeout, retry with backoff, host-side fallback, and improved diagnostics to the container reachability check during onboarding.

On Brev cloud GPU instances, Docker's `--add-host host-gateway` does not route correctly, causing a false-negative failure that blocks onboarding even though the Ollama auth proxy is healthy on the host. The sandbox uses a different networking stack (k3s CoreDNS + NodeHosts) and would succeed.

Changes:
- Add `--connect-timeout 5 --max-time 10` to container curl command
- Retry the container check up to 3 times with 2s backoff
- Collect HTTP status and host-gateway resolution diagnostics on failure
- Fall back to host-side health check before exiting fatally
- Replace misleading error messages with actionable diagnostics
- Add `diagnostic` field to `ValidationResult` interface

Fixes #2425
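The retry loop itself never appears verbatim in this conversation, so here is a minimal sketch of the mechanism described above, assuming hypothetical names (`checkContainerReachability`, the injected `sleepFn`) and the `curlimages/curl` image; the real command, helper names, and diagnostics live in `src/lib/local-inference.ts`.

```ts
import { spawnSync } from "node:child_process";

// Mirrors the PR's ValidationResult shape, including the new field.
interface ValidationResult {
  ok: boolean;
  message: string;
  diagnostic?: string;
}

type SleepFn = (seconds: number) => void;

// Hypothetical sketch: run the container-side curl with hard timeouts,
// retry up to 3 times with a 2s backoff, and collect a diagnostic on failure.
function checkContainerReachability(url: string, sleepFn: SleepFn): ValidationResult {
  const curlArgs = [
    "run", "--rm", "--add-host", "host.docker.internal:host-gateway",
    "curlimages/curl", "-sf", "--connect-timeout", "5", "--max-time", "10", url,
  ];
  for (let attempt = 1; attempt <= 3; attempt++) {
    if (spawnSync("docker", curlArgs, { encoding: "utf8" }).status === 0) {
      return { ok: true, message: "Container can reach the host proxy." };
    }
    if (attempt < 3) sleepFn(2); // backoff before the next attempt
  }
  // Retries exhausted: grab the HTTP status for an actionable diagnostic.
  const probe = spawnSync(
    "docker",
    ["run", "--rm", "--add-host", "host.docker.internal:host-gateway",
     "curlimages/curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
     "--connect-timeout", "5", "--max-time", "10", url],
    { encoding: "utf8" },
  );
  return {
    ok: false,
    message: "Container could not reach the host proxy after 3 attempts.",
    diagnostic: `last HTTP status: ${probe.stdout || "none"}`,
  };
}
```

Injecting `sleepFn` rather than calling a timer directly is what lets the new tests verify the 2s backoff without real waiting.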
📝 Walkthrough
The changes add retrying Docker reachability checks with timeouts, surface container diagnostics on failure, inject a sleep function into validation, introduce an Ollama auth-proxy module, and alter onboarding to optionally bypass local-provider validation when host/proxy health checks indicate host-side availability.
Sequence Diagram(s)
sequenceDiagram
participant CLI as "CLI / onboard"
participant Validator as "local-inference\nvalidateLocalProvider"
participant Docker as "Container (curl via docker run)"
participant Host as "Host service (/api/tags)"
participant Proxy as "ollama-proxy module\n(auth proxy)"
CLI->>Validator: request validateLocalProvider(provider)
Validator->>Docker: run docker curl (with timeouts)
Note right of Docker: retry up to 3 times\nsleepFn between attempts
Docker-->>Validator: success OR failure
alt Docker success
Validator-->>CLI: Validation OK
else Docker failure
Validator->>Docker: gather diagnostics (HTTP status, /etc/hosts)
Validator-->>CLI: Validation NOT OK + diagnostic
CLI->>Host: getLocalProviderHealthCheck(provider)
Host-->>CLI: host responds? (yes/no)
alt Host responds
CLI->>Proxy: isProxyHealthy() [for ollama]
Proxy-->>CLI: proxy healthy? (yes/no)
alt allowed
CLI->>CLI: continue onboarding (bypass)
else not allowed
CLI->>CLI: abort with diagnostic + message
end
else Host doesn't respond
CLI->>CLI: abort with diagnostic + message
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5517-5545: The current bypass logic relies on a ps check
(isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama /api/tags health
check (getLocalProviderHealthCheck + runCapture), which can miss a wedged/stale
auth proxy; instead, before deciding to continue or exit, call
ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then perform
an actual request against the proxy health endpoint (not Ollama's /api/tags)
using the same runCapture helper; replace the proxyAlive/hostResponding logic
with a probe that first invokes ensureOllamaAuthProxy(), then hits the proxy
health URL and only treats the proxy as healthy if that probe returns success
(fall back to existing diagnostic logging and process.exit(1) if the proxy
endpoint remains unhealthy).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 76683991-0e05-4c63-abbb-78954b4ed90d
📒 Files selected for processing (3)
src/lib/local-inference.test.ts, src/lib/local-inference.ts, src/lib/onboard.ts
// Before giving up, check whether the proxy is actually healthy on the host.
// The container reachability check uses Docker's --add-host host-gateway,
// which may not work on all Docker configurations (e.g., Brev, rootless Docker).
// The real sandbox uses a different networking stack (k3s CoreDNS + NodeHosts).
const proxyAlive = !isWsl() && isOllamaProxyProcess(loadPersistedProxyPid());
const hostCheck = getLocalProviderHealthCheck(provider);
const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false;

if (proxyAlive && hostResponding) {
  console.warn(`  ⚠ ${validation.message}`);
  if (validation.diagnostic) {
    console.warn(`    Diagnostic: ${validation.diagnostic}`);
  }
  console.warn(
    "  The auth proxy is healthy on the host — continuing. " +
      "The sandbox uses a different network path and may work correctly.",
  );
} else {
  console.error(`  ${validation.message}`);
  if (validation.diagnostic) {
    console.error(`    Diagnostic: ${validation.diagnostic}`);
  }
  if (process.platform === "darwin") {
    console.error(
      "  On macOS, local inference also depends on OpenShell host routing support.",
    );
  }
  process.exit(1);
}
Probe or restart the Ollama auth proxy before deciding to continue/exit.
This gate proves that Ollama itself responds, but it never proves that the auth proxy endpoint is healthy: getLocalProviderHealthCheck() only hits Ollama’s /api/tags, while proxyAlive is just a ps check. That means a wedged ollama-auth-proxy.js process can incorrectly pass the bypass, and a missing/stale PID can incorrectly fail even though ensureOllamaAuthProxy() could recover it. Please key this decision off the proxy endpoint itself, or restart/check the proxy before aborting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard.ts` around lines 5517 - 5545, The current bypass logic relies
on a ps check (isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama
/api/tags health check (getLocalProviderHealthCheck + runCapture), which can
miss a wedged/stale auth proxy; instead, before deciding to continue or exit,
call ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then
perform an actual request against the proxy health endpoint (not Ollama's
/api/tags) using the same runCapture helper; replace the
proxyAlive/hostResponding logic with a probe that first invokes
ensureOllamaAuthProxy(), then hits the proxy health URL and only treats the
proxy as healthy if that probe returns success (fall back to existing diagnostic
logging and process.exit(1) if the proxy endpoint remains unhealthy).
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
5521-5533: ⚠️ Potential issue | 🟠 Major
Probe the auth proxy endpoint, not just the Ollama process.
This still keys the bypass off "PID exists + Ollama `/api/tags` responds". A wedged `ollama-auth-proxy.js` on port `OLLAMA_PROXY_PORT` will pass this gate, and the later `ensureOllamaAuthProxy()` call also only checks the PID, so onboarding can continue with a dead proxy. Probe the proxy endpoint you actually hand to OpenShell after `ensureOllamaAuthProxy()` before downgrading this to a warning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5521 - 5533, The current bypass only checks PID + host provider health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe against the real auth proxy URL/port used by OpenShell (the same endpoint that ensureOllamaAuthProxy registers, e.g. http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat success/failure from that probe when deciding whether to downgrade to a warning; keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but require the proxy HTTP probe to succeed (using the same path used by ensureOllamaAuthProxy) before allowing the bypass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5475-5476: The health check currently interprets the string result
of runCapture() as a boolean (hostResponding = !!runCapture(...)), which can be
truthy even when curl failed; instead call runCapture and inspect its exit
status (e.g., the returned object's exitCode or status) and set hostResponding =
true only when that exit code === 0 (or use the appropriate runCapture variant
that returns an exit/status code). Update the hostCheck/hostResponding
assignment using getLocalProviderHealthCheck and runCapture so you check the
command's exit code, and apply the same change to the other occurrence at the
region around the other hostResponding usage (the block referenced at lines
~5522-5523).
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5521-5533: The current bypass only checks PID + host provider
health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy
HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe
against the real auth proxy URL/port used by OpenShell (the same endpoint that
ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8c5e1a9-256d-4d6b-b46f-d800642ac37e
📒 Files selected for processing (1)
src/lib/onboard.ts
…roxyHealthy (#767)

Extract the entire Ollama auth proxy subsystem (token persistence, PID tracking, process spawn/kill, health probing) from onboard.ts into a dedicated ollama-proxy.ts module. Add isProxyHealthy(), which performs an actual HTTP probe against the proxy endpoint (not just a PID check), addressing CodeRabbit's review feedback about wedged/stale proxy processes. Simplify the setupInference fallback to use isProxyHealthy() instead of inline PID + host check logic.

- Reduces onboard.ts by 161 lines (7910 → 7749)
- New ollama-proxy.ts: 223 lines
- Advances #767 (split onboarding provider logic)
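As a rough consumer's view of the extracted module's surface: the export names below all appear in this PR, but the exact signatures and this usage are illustrative, not the actual `setupInference()` code.

```ts
import {
  ensureOllamaAuthProxy,
  getOllamaProxyToken,
  isProxyHealthy,
} from "./ollama-proxy";

// Illustrative consumer: restart the proxy if needed, verify it over HTTP,
// then hand its token to the sandbox.
function setupOllamaProvider(): void {
  ensureOllamaAuthProxy();     // restart the proxy if the PID is stale
  if (!isProxyHealthy()) {     // HTTP probe, not just a PID check
    throw new Error("Ollama auth proxy is not responding");
  }
  const token = getOllamaProxyToken(); // handed to the sandbox container
  console.log(`proxy ready, token length: ${token?.length ?? 0}`);
}
```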
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
5360-5384: ⚠️ Potential issue | 🟠 Major
Repair or reprobe the Ollama auth proxy before aborting.
This branch exits on `isProxyHealthy()` before `ensureOllamaAuthProxy()` runs at Line 5389. A stale or missing proxy that is recoverable here still becomes a hard false-negative.
Suggested fix
+ let proxyReady = false;
  if (!validation.ok) {
    // The container reachability check uses Docker's --add-host host-gateway,
    // which may not work on all Docker configurations (e.g., Brev, rootless).
    // The real sandbox uses k3s CoreDNS + NodeHosts — a different path.
    // Probe the actual auth proxy endpoint before giving up.
-   if (!isWsl() && isProxyHealthy()) {
+   proxyReady = !isWsl() && ensureOllamaAuthProxy() && isProxyHealthy();
+   if (proxyReady) {
      console.warn(`  ⚠ ${validation.message}`);
      if (validation.diagnostic) {
        console.warn(`    Diagnostic: ${validation.diagnostic}`);
      }
@@
- if (!isWsl()) {
-   ensureOllamaAuthProxy();
+ if (!isWsl()) {
+   if (!proxyReady) ensureOllamaAuthProxy();
    const proxyToken = getOllamaProxyToken();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5360 - 5384, The code exits immediately when isProxyHealthy() is false even though a recoverable proxy could be created; modify the branch handling around isProxyHealthy() so that when the proxy appears unhealthy you call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging the fatal error and calling process.exit(1); in practice, invoke ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or returned failure, re-evaluate validation (and validation.diagnostic) and only log the error and call process.exit(1) if the reprobe/recovery still reports an unhealthy proxy.
5318-5319: ⚠️ Potential issue | 🟠 Major
Use curl exit status for the vLLM host-health bypass.
Line 5319 still converts `runCapture()` output into a boolean. A failed `curl -sf` can still leave captured text behind, which would incorrectly downgrade a real host failure into the warning-and-continue path.
Suggested fix
- const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false;
+ const hostResponding = hostCheck
+   ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+   : false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5318 - 5319, The host-health check currently sets hostResponding by coercing runCapture(...) output to boolean, which can be non-empty even when curl failed; update the logic in onboard.ts where getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you inspect the command exit status instead of the captured stdout. Call the runner in a way that returns the exit code (e.g., use the run variant or runCapture result that includes an exitCode/exitStatus) and set hostResponding = hostCheck ? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing success on the process exit code; reference getLocalProviderHealthCheck, runCapture, hostCheck, and hostResponding to locate and change the code.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
5316-5384: Extract the local-provider fallback path into a helper.
These new vLLM/Ollama branches duplicate the same diagnostic-and-bypass flow inside an already complexity-suppressed function. Pulling that decision into a small helper would keep `setupInference()` easier to maintain.
As per coding guidelines, `{src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}`: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5316 - 5384, The duplicated diagnostic-and-bypass logic in the vllm-local and ollama-local branches should be refactored into a small helper to reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit or handleLocalProviderFallback) that takes the provider name and the validation result and encapsulates the existing checks (for vllm-local: getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl + isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass message, or error + optional diagnostic + platform-specific hint + process.exit). Replace the duplicated blocks around validateLocalProvider in setupInference (or where validateLocalProvider is called) to call this helper before continuing to upsertProvider / runOpenshell.
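One possible shape for the helper this nitpick proposes, using the name suggested in the prompt; `hostHealthCheckPasses` is a hypothetical stand-in for the existing `getLocalProviderHealthCheck` + exit-code check, not actual repo code.

```ts
import { isProxyHealthy } from "./ollama-proxy";

// Stand-ins for repo helpers (isWsl exists in the PR; hostHealthCheckPasses
// is hypothetical shorthand for the vllm-local host probe).
declare function isWsl(): boolean;
declare function hostHealthCheckPasses(provider: string): boolean;

interface ValidationResult {
  ok: boolean;
  message: string;
  diagnostic?: string;
}

// Proposed helper: one shared diagnostic-and-bypass path for both providers.
function ensureLocalProviderOrExit(
  provider: "ollama-local" | "vllm-local",
  validation: ValidationResult,
): void {
  if (validation.ok) return;
  const bypassAllowed =
    provider === "ollama-local"
      ? !isWsl() && isProxyHealthy()
      : hostHealthCheckPasses(provider);
  if (bypassAllowed) {
    console.warn(`  ⚠ ${validation.message}`);
    if (validation.diagnostic) console.warn(`    Diagnostic: ${validation.diagnostic}`);
    return; // continue onboarding on the host-verified path
  }
  console.error(`  ${validation.message}`);
  if (validation.diagnostic) console.error(`    Diagnostic: ${validation.diagnostic}`);
  if (process.platform === "darwin") {
    console.error("  On macOS, local inference also depends on OpenShell host routing support.");
  }
  process.exit(1);
}
```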
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-165: The code sets ollamaProxyToken before verifying the proxy
is alive; update startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
- Around line 207-222: In isProxyHealthy(), don't early-return false when
isOllamaProxyProcess(pid) is falsy; instead store that boolean (e.g.,
hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP probe built
with loadPersistedProxyToken() and runCapture(..., { ignoreError: true }), and
then return true if the probe produced output, otherwise return hasValidPid (so
a successful HTTP probe wins even if the PID file is missing/stale). Update
references in the function to use the new hasValidPid local variable and remove
the early return.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5360-5384: The code exits immediately when isProxyHealthy() is
false even though a recoverable proxy could be created; modify the branch
handling around isProxyHealthy() so that when the proxy appears unhealthy you
call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging
the fatal error and calling process.exit(1); in practice, invoke
ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or
returned failure, re-evaluate validation (and validation.diagnostic) and only
log the error and call process.exit(1) if the reprobe/recovery still reports an
unhealthy proxy.
- Around line 5318-5319: The host-health check currently sets hostResponding by
coercing runCapture(...) output to boolean, which can be non-empty even when
curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5316-5384: The duplicated diagnostic-and-bypass logic in the
vllm-local and ollama-local branches should be refactored into a small helper to
reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit
or handleLocalProviderFallback) that takes the provider name and the validation
result and encapsulates the existing checks (for vllm-local:
getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl +
isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass
message, or error + optional diagnostic + platform-specific hint +
process.exit). Replace the duplicated blocks around validateLocalProvider in
setupInference (or where validateLocalProvider is called) to call this helper
before continuing to upsertProvider / runOpenshell.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cd2c8e18-d861-4413-b058-63a9501e1fe0
📒 Files selected for processing (3)
src/lib/ollama-proxy.ts, src/lib/onboard.ts, src/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
- src/nemoclaw.ts
export function startOllamaAuthProxy(): boolean {
  // eslint-disable-next-line @typescript-eslint/no-require-imports
  const crypto = require("crypto");
  killStaleProxy();

  const proxyToken = crypto.randomBytes(24).toString("hex");
  ollamaProxyToken = proxyToken;
  // Don't persist yet — wait until provider is confirmed in setupInference.
  // If the user backs out to a different provider, the token stays in memory
  // only and is discarded.
  const pid = spawnOllamaAuthProxy(proxyToken);
  sleepSeconds(1);
  if (!isOllamaProxyProcess(pid)) {
    console.error(`  Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`);
    console.error(`  Containers will not be able to reach Ollama without the proxy.`);
    console.error(
      `  Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`,
    );
    return false;
  }
  return true;
}
Cache the token only after the proxy is proven alive.
Line 151 and Line 185 set ollamaProxyToken before any successful liveness check. If the child exits immediately, getOllamaProxyToken() can keep returning a token that no proxy is serving, and ensureOllamaAuthProxy() currently gives callers no signal that recovery failed.
🩹 Suggested fix
export function startOllamaAuthProxy(): boolean {
  // eslint-disable-next-line @typescript-eslint/no-require-imports
const crypto = require("crypto");
killStaleProxy();
const proxyToken = crypto.randomBytes(24).toString("hex");
- ollamaProxyToken = proxyToken;
// Don't persist yet — wait until provider is confirmed in setupInference.
// If the user backs out to a different provider, the token stays in memory
// only and is discarded.
const pid = spawnOllamaAuthProxy(proxyToken);
sleepSeconds(1);
if (!isOllamaProxyProcess(pid)) {
+ ollamaProxyToken = null;
+ clearPersistedProxyPid();
console.error(` Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`);
console.error(` Containers will not be able to reach Ollama without the proxy.`);
console.error(
` Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`,
);
return false;
}
+ ollamaProxyToken = proxyToken;
return true;
}
-export function ensureOllamaAuthProxy(): void {
+export function ensureOllamaAuthProxy(): boolean {
const token = loadPersistedProxyToken();
- if (!token) return;
+ if (!token) return false;
const pid = loadPersistedProxyPid();
if (isOllamaProxyProcess(pid)) {
ollamaProxyToken = token;
- return;
+ return true;
}
killStaleProxy();
- ollamaProxyToken = token;
- spawnOllamaAuthProxy(token);
+ const restartedPid = spawnOllamaAuthProxy(token);
sleepSeconds(1);
+ if (!isOllamaProxyProcess(restartedPid)) {
+ ollamaProxyToken = null;
+ clearPersistedProxyPid();
+ return false;
+ }
+ ollamaProxyToken = token;
+ return true;
+ }
Also applies to: 172-188
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/ollama-proxy.ts` around lines 145 - 165, The code sets
ollamaProxyToken before verifying the proxy is alive; update
startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
export function isProxyHealthy(): boolean {
  // 1. PID check — fast rejection if process is gone
  const pid = loadPersistedProxyPid();
  if (!isOllamaProxyProcess(pid)) return false;

  // 2. HTTP probe — confirm the proxy actually responds, not just that
  //    the process exists (catches wedged/stale proxy processes)
  const proxyUrl = `http://127.0.0.1:${OLLAMA_PROXY_PORT}/api/tags`;
  const token = loadPersistedProxyToken();
  const probeCmd = token
    ? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5",
       "-H", `Authorization: Bearer ${token}`, proxyUrl]
    : ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl];

  const output = runCapture(probeCmd, { ignoreError: true });
  return !!output;
}
Let the HTTP probe drive the fallback decision.
This helper is now used to decide whether onboarding can continue after the container-side check fails, but Line 209-Line 210 returns false before the HTTP probe runs. A missing or stale PID file would still abort onboarding even if the proxy is healthy on 127.0.0.1:${OLLAMA_PROXY_PORT}.
🩹 Suggested fix
export function isProxyHealthy(): boolean {
- // 1. PID check — fast rejection if process is gone
- const pid = loadPersistedProxyPid();
- if (!isOllamaProxyProcess(pid)) return false;
-
- // 2. HTTP probe — confirm the proxy actually responds, not just that
- // the process exists (catches wedged/stale proxy processes)
+ // HTTP reachability is the source of truth for the onboarding fallback.
const proxyUrl = `http://127.0.0.1:${OLLAMA_PROXY_PORT}/api/tags`;
const token = loadPersistedProxyToken();
const probeCmd = token
? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5",
"-H", `Authorization: Bearer ${token}`, proxyUrl]
: ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl];
const output = runCapture(probeCmd, { ignoreError: true });
- return !!output;
+ return Boolean(output);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/ollama-proxy.ts` around lines 207 - 222, In isProxyHealthy(), don't
early-return false when isOllamaProxyProcess(pid) is falsy; instead store that
boolean (e.g., hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP
probe built with loadPersistedProxyToken() and runCapture(..., { ignoreError:
true }), and then return true if the probe produced output, otherwise return
hasValidPid (so a successful HTTP probe wins even if the PID file is
missing/stale). Update references in the function to use the new hasValidPid
local variable and remove the early return.
Summary
On Brev cloud GPU instances, `nemoclaw onboard` with Local Ollama fails at step 4/8 because Docker's `--add-host host-gateway` does not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host.
Additionally, this PR extracts the entire Ollama auth proxy subsystem from the `onboard.ts` monolith into a dedicated `ollama-proxy.ts` module, reducing `onboard.ts` by 161 lines and advancing issue #767.
Related Issue
Fixes #2425
Ref #767
Changes
- Add `--connect-timeout 5 --max-time 10` to the container reachability curl command in `src/lib/local-inference.ts` to prevent indefinite hangs
- Retry the container check up to 3 times with 2s backoff in `validateLocalProvider()`
- Add `diagnostic` field to `ValidationResult` interface
- Extract the Ollama auth proxy subsystem from `src/lib/onboard.ts` into new `src/lib/ollama-proxy.ts` — token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts)
- Add `isProxyHealthy()` in `ollama-proxy.ts` that performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processes
- Simplify the `setupInference()` fallback in `onboard.ts` to use `isProxyHealthy()` instead of inline PID + host check logic
- Add a host-side health fallback for `vllm-local` using the existing health check endpoint
- Add tests in `src/lib/local-inference.test.ts` covering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failure

Type of Change
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)

AI Disclosure
Signed-off-by: Julie Yaunches jyaunches@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes