fix(onboard): retry and fallback for container reachability check #2453
Conversation
Add timeout, retry with backoff, host-side fallback, and improved diagnostics to the container reachability check during onboarding.

On Brev cloud GPU instances, Docker's `--add-host host-gateway` does not route correctly, causing a false-negative failure that blocks onboarding even though the Ollama auth proxy is healthy on the host. The sandbox uses a different networking stack (k3s CoreDNS + NodeHosts) and would succeed.

Changes:
- Add `--connect-timeout 5 --max-time 10` to container curl command
- Retry the container check up to 3 times with 2s backoff
- Collect HTTP status and host-gateway resolution diagnostics on failure
- Fall back to host-side health check before exiting fatally
- Replace misleading error messages with actionable diagnostics
- Add `diagnostic` field to `ValidationResult` interface

Fixes #2425
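The retry loop itself never appears verbatim in this conversation, so here is a minimal sketch of the mechanism described above, assuming hypothetical names (`checkContainerReachability`, the injected `sleepFn`) and the `curlimages/curl` image; the real command, helper names, and diagnostics live in `src/lib/local-inference.ts`.

```ts
import { spawnSync } from "node:child_process";

// Mirrors the PR's ValidationResult shape, including the new field.
interface ValidationResult {
  ok: boolean;
  message: string;
  diagnostic?: string;
}

type SleepFn = (seconds: number) => void;

// Hypothetical sketch: run the container-side curl with hard timeouts,
// retry up to 3 times with a 2s backoff, and collect a diagnostic on failure.
function checkContainerReachability(url: string, sleepFn: SleepFn): ValidationResult {
  const curlArgs = [
    "run", "--rm", "--add-host", "host.docker.internal:host-gateway",
    "curlimages/curl", "-sf", "--connect-timeout", "5", "--max-time", "10", url,
  ];
  for (let attempt = 1; attempt <= 3; attempt++) {
    if (spawnSync("docker", curlArgs, { encoding: "utf8" }).status === 0) {
      return { ok: true, message: "Container can reach the host proxy." };
    }
    if (attempt < 3) sleepFn(2); // backoff before the next attempt
  }
  // Retries exhausted: grab the HTTP status for an actionable diagnostic.
  const probe = spawnSync(
    "docker",
    ["run", "--rm", "--add-host", "host.docker.internal:host-gateway",
     "curlimages/curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
     "--connect-timeout", "5", "--max-time", "10", url],
    { encoding: "utf8" },
  );
  return {
    ok: false,
    message: "Container could not reach the host proxy after 3 attempts.",
    diagnostic: `last HTTP status: ${probe.stdout || "none"}`,
  };
}
```

Injecting `sleepFn` rather than calling a timer directly is what lets the new tests verify the 2s backoff without real waiting.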
📝 Walkthrough
The changes add retrying Docker reachability checks with timeouts, surface container diagnostics on failure, inject a sleep function into validation, introduce an Ollama auth-proxy module, and alter onboarding to optionally bypass local-provider validation when host/proxy health checks indicate host-side availability.
Sequence Diagram(s)
sequenceDiagram
participant CLI as "CLI / onboard"
participant Validator as "local-inference\nvalidateLocalProvider"
participant Docker as "Container (curl via docker run)"
participant Host as "Host service (/api/tags)"
participant Proxy as "ollama-proxy module\n(auth proxy)"
CLI->>Validator: request validateLocalProvider(provider)
Validator->>Docker: run docker curl (with timeouts)
Note right of Docker: retry up to 3 times\nsleepFn between attempts
Docker-->>Validator: success OR failure
alt Docker success
Validator-->>CLI: Validation OK
else Docker failure
Validator->>Docker: gather diagnostics (HTTP status, /etc/hosts)
Validator-->>CLI: Validation NOT OK + diagnostic
CLI->>Host: getLocalProviderHealthCheck(provider)
Host-->>CLI: host responds? (yes/no)
alt Host responds
CLI->>Proxy: isProxyHealthy() [for ollama]
Proxy-->>CLI: proxy healthy? (yes/no)
alt allowed
CLI->>CLI: continue onboarding (bypass)
else not allowed
CLI->>CLI: abort with diagnostic + message
end
else Host doesn't respond
CLI->>CLI: abort with diagnostic + message
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5517-5545: The current bypass logic relies on a ps check
(isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama /api/tags health
check (getLocalProviderHealthCheck + runCapture), which can miss a wedged/stale
auth proxy; instead, before deciding to continue or exit, call
ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then perform
an actual request against the proxy health endpoint (not Ollama's /api/tags)
using the same runCapture helper; replace the proxyAlive/hostResponding logic
with a probe that first invokes ensureOllamaAuthProxy(), then hits the proxy
health URL and only treats the proxy as healthy if that probe returns success
(fall back to existing diagnostic logging and process.exit(1) if the proxy
endpoint remains unhealthy).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 76683991-0e05-4c63-abbb-78954b4ed90d
📒 Files selected for processing (3)
src/lib/local-inference.test.ts, src/lib/local-inference.ts, src/lib/onboard.ts
// Before giving up, check whether the proxy is actually healthy on the host.
// The container reachability check uses Docker's --add-host host-gateway,
// which may not work on all Docker configurations (e.g., Brev, rootless Docker).
// The real sandbox uses a different networking stack (k3s CoreDNS + NodeHosts).
const proxyAlive = !isWsl() && isOllamaProxyProcess(loadPersistedProxyPid());
const hostCheck = getLocalProviderHealthCheck(provider);
const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false;

if (proxyAlive && hostResponding) {
  console.warn(`  ⚠ ${validation.message}`);
  if (validation.diagnostic) {
    console.warn(`    Diagnostic: ${validation.diagnostic}`);
  }
  console.warn(
    "  The auth proxy is healthy on the host — continuing. " +
      "The sandbox uses a different network path and may work correctly.",
  );
} else {
  console.error(`  ${validation.message}`);
  if (validation.diagnostic) {
    console.error(`    Diagnostic: ${validation.diagnostic}`);
  }
  if (process.platform === "darwin") {
    console.error(
      "  On macOS, local inference also depends on OpenShell host routing support.",
    );
  }
  process.exit(1);
}
Probe or restart the Ollama auth proxy before deciding to continue/exit.
This gate proves that Ollama itself responds, but it never proves that the auth proxy endpoint is healthy: getLocalProviderHealthCheck() only hits Ollama’s /api/tags, while proxyAlive is just a ps check. That means a wedged ollama-auth-proxy.js process can incorrectly pass the bypass, and a missing/stale PID can incorrectly fail even though ensureOllamaAuthProxy() could recover it. Please key this decision off the proxy endpoint itself, or restart/check the proxy before aborting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard.ts` around lines 5517 - 5545, The current bypass logic relies
on a ps check (isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama
/api/tags health check (getLocalProviderHealthCheck + runCapture), which can
miss a wedged/stale auth proxy; instead, before deciding to continue or exit,
call ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then
perform an actual request against the proxy health endpoint (not Ollama's
/api/tags) using the same runCapture helper; replace the
proxyAlive/hostResponding logic with a probe that first invokes
ensureOllamaAuthProxy(), then hits the proxy health URL and only treats the
proxy as healthy if that probe returns success (fall back to existing diagnostic
logging and process.exit(1) if the proxy endpoint remains unhealthy).
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
5521-5533: ⚠️ Potential issue | 🟠 Major
Probe the auth proxy endpoint, not just the Ollama process.
This still keys the bypass off "PID exists + Ollama `/api/tags` responds". A wedged `ollama-auth-proxy.js` on port `OLLAMA_PROXY_PORT` will pass this gate, and the later `ensureOllamaAuthProxy()` call also only checks the PID, so onboarding can continue with a dead proxy. Probe the proxy endpoint you actually hand to OpenShell after `ensureOllamaAuthProxy()` before downgrading this to a warning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5521 - 5533, The current bypass only checks PID + host provider health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe against the real auth proxy URL/port used by OpenShell (the same endpoint that ensureOllamaAuthProxy registers, e.g. http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat success/failure from that probe when deciding whether to downgrade to a warning; keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but require the proxy HTTP probe to succeed (using the same path used by ensureOllamaAuthProxy) before allowing the bypass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5475-5476: The health check currently interprets the string result
of runCapture() as a boolean (hostResponding = !!runCapture(...)), which can be
truthy even when curl failed; instead call runCapture and inspect its exit
status (e.g., the returned object's exitCode or status) and set hostResponding =
true only when that exit code === 0 (or use the appropriate runCapture variant
that returns an exit/status code). Update the hostCheck/hostResponding
assignment using getLocalProviderHealthCheck and runCapture so you check the
command's exit code, and apply the same change to the other occurrence at the
region around the other hostResponding usage (the block referenced at lines
~5522-5523).
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5521-5533: The current bypass only checks PID + host provider
health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy
HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe
against the real auth proxy URL/port used by OpenShell (the same endpoint that
ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8c5e1a9-256d-4d6b-b46f-d800642ac37e
📒 Files selected for processing (1)
src/lib/onboard.ts
…roxyHealthy (#767)

Extract the entire Ollama auth proxy subsystem (token persistence, PID tracking, process spawn/kill, health probing) from onboard.ts into a dedicated ollama-proxy.ts module. Add isProxyHealthy(), which performs an actual HTTP probe against the proxy endpoint (not just a PID check), addressing CodeRabbit's review feedback about wedged/stale proxy processes. Simplify the setupInference fallback to use isProxyHealthy() instead of inline PID + host check logic.

- Reduces onboard.ts by 161 lines (7910 → 7749)
- New ollama-proxy.ts: 223 lines
- Advances #767 (split onboarding provider logic)
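As a rough consumer's view of the extracted module's surface: the export names below all appear in this PR, but the exact signatures and this usage are illustrative, not the actual `setupInference()` code.

```ts
import {
  ensureOllamaAuthProxy,
  getOllamaProxyToken,
  isProxyHealthy,
} from "./ollama-proxy";

// Illustrative consumer: restart the proxy if needed, verify it over HTTP,
// then hand its token to the sandbox.
function setupOllamaProvider(): void {
  ensureOllamaAuthProxy();     // restart the proxy if the PID is stale
  if (!isProxyHealthy()) {     // HTTP probe, not just a PID check
    throw new Error("Ollama auth proxy is not responding");
  }
  const token = getOllamaProxyToken(); // handed to the sandbox container
  console.log(`proxy ready, token length: ${token?.length ?? 0}`);
}
```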
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
5360-5384: ⚠️ Potential issue | 🟠 Major
Repair or reprobe the Ollama auth proxy before aborting.
This branch exits on `isProxyHealthy()` before `ensureOllamaAuthProxy()` runs at Line 5389. A stale or missing proxy that is recoverable here still becomes a hard false-negative.
Suggested fix
+ let proxyReady = false;
  if (!validation.ok) {
    // The container reachability check uses Docker's --add-host host-gateway,
    // which may not work on all Docker configurations (e.g., Brev, rootless).
    // The real sandbox uses k3s CoreDNS + NodeHosts — a different path.
    // Probe the actual auth proxy endpoint before giving up.
-   if (!isWsl() && isProxyHealthy()) {
+   proxyReady = !isWsl() && ensureOllamaAuthProxy() && isProxyHealthy();
+   if (proxyReady) {
      console.warn(`  ⚠ ${validation.message}`);
      if (validation.diagnostic) {
        console.warn(`    Diagnostic: ${validation.diagnostic}`);
      }
@@
- if (!isWsl()) {
-   ensureOllamaAuthProxy();
+ if (!isWsl()) {
+   if (!proxyReady) ensureOllamaAuthProxy();
    const proxyToken = getOllamaProxyToken();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5360 - 5384, The code exits immediately when isProxyHealthy() is false even though a recoverable proxy could be created; modify the branch handling around isProxyHealthy() so that when the proxy appears unhealthy you call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging the fatal error and calling process.exit(1); in practice, invoke ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or returned failure, re-evaluate validation (and validation.diagnostic) and only log the error and call process.exit(1) if the reprobe/recovery still reports an unhealthy proxy.
5318-5319: ⚠️ Potential issue | 🟠 Major
Use curl exit status for the vLLM host-health bypass.
Line 5319 still converts `runCapture()` output into a boolean. A failed `curl -sf` can still leave captured text behind, which would incorrectly downgrade a real host failure into the warning-and-continue path.
Suggested fix
- const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false;
+ const hostResponding = hostCheck
+   ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+   : false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5318 - 5319, The host-health check currently sets hostResponding by coercing runCapture(...) output to boolean, which can be non-empty even when curl failed; update the logic in onboard.ts where getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you inspect the command exit status instead of the captured stdout. Call the runner in a way that returns the exit code (e.g., use the run variant or runCapture result that includes an exitCode/exitStatus) and set hostResponding = hostCheck ? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing success on the process exit code; reference getLocalProviderHealthCheck, runCapture, hostCheck, and hostResponding to locate and change the code.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
5316-5384: Extract the local-provider fallback path into a helper.
These new vLLM/Ollama branches duplicate the same diagnostic-and-bypass flow inside an already complexity-suppressed function. Pulling that decision into a small helper would keep `setupInference()` easier to maintain.
As per coding guidelines, `{src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}`: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5316 - 5384, The duplicated diagnostic-and-bypass logic in the vllm-local and ollama-local branches should be refactored into a small helper to reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit or handleLocalProviderFallback) that takes the provider name and the validation result and encapsulates the existing checks (for vllm-local: getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl + isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass message, or error + optional diagnostic + platform-specific hint + process.exit). Replace the duplicated blocks around validateLocalProvider in setupInference (or where validateLocalProvider is called) to call this helper before continuing to upsertProvider / runOpenshell.
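One possible shape for the helper this nitpick proposes, using the name suggested in the prompt; `hostHealthCheckPasses` is a hypothetical stand-in for the existing `getLocalProviderHealthCheck` + exit-code check, not actual repo code.

```ts
import { isProxyHealthy } from "./ollama-proxy";

// Stand-ins for repo helpers (isWsl exists in the PR; hostHealthCheckPasses
// is hypothetical shorthand for the vllm-local host probe).
declare function isWsl(): boolean;
declare function hostHealthCheckPasses(provider: string): boolean;

interface ValidationResult {
  ok: boolean;
  message: string;
  diagnostic?: string;
}

// Proposed helper: one shared diagnostic-and-bypass path for both providers.
function ensureLocalProviderOrExit(
  provider: "ollama-local" | "vllm-local",
  validation: ValidationResult,
): void {
  if (validation.ok) return;
  const bypassAllowed =
    provider === "ollama-local"
      ? !isWsl() && isProxyHealthy()
      : hostHealthCheckPasses(provider);
  if (bypassAllowed) {
    console.warn(`  ⚠ ${validation.message}`);
    if (validation.diagnostic) console.warn(`    Diagnostic: ${validation.diagnostic}`);
    return; // continue onboarding on the host-verified path
  }
  console.error(`  ${validation.message}`);
  if (validation.diagnostic) console.error(`    Diagnostic: ${validation.diagnostic}`);
  if (process.platform === "darwin") {
    console.error("  On macOS, local inference also depends on OpenShell host routing support.");
  }
  process.exit(1);
}
```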
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-165: The code sets ollamaProxyToken before verifying the proxy
is alive; update startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
- Around line 207-222: In isProxyHealthy(), don't early-return false when
isOllamaProxyProcess(pid) is falsy; instead store that boolean (e.g.,
hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP probe built
with loadPersistedProxyToken() and runCapture(..., { ignoreError: true }), and
then return true if the probe produced output, otherwise return hasValidPid (so
a successful HTTP probe wins even if the PID file is missing/stale). Update
references in the function to use the new hasValidPid local variable and remove
the early return.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5360-5384: The code exits immediately when isProxyHealthy() is
false even though a recoverable proxy could be created; modify the branch
handling around isProxyHealthy() so that when the proxy appears unhealthy you
call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging
the fatal error and calling process.exit(1); in practice, invoke
ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or
returned failure, re-evaluate validation (and validation.diagnostic) and only
log the error and call process.exit(1) if the reprobe/recovery still reports an
unhealthy proxy.
- Around line 5318-5319: The host-health check currently sets hostResponding by
coercing runCapture(...) output to boolean, which can be non-empty even when
curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5316-5384: The duplicated diagnostic-and-bypass logic in the
vllm-local and ollama-local branches should be refactored into a small helper to
reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit
or handleLocalProviderFallback) that takes the provider name and the validation
result and encapsulates the existing checks (for vllm-local:
getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl +
isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass
message, or error + optional diagnostic + platform-specific hint +
process.exit). Replace the duplicated blocks around validateLocalProvider in
setupInference (or where validateLocalProvider is called) to call this helper
before continuing to upsertProvider / runOpenshell.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cd2c8e18-d861-4413-b058-63a9501e1fe0
📒 Files selected for processing (3)
src/lib/ollama-proxy.ts, src/lib/onboard.ts, src/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
- src/nemoclaw.ts
export function startOllamaAuthProxy(): boolean {
  // eslint-disable-next-line @typescript-eslint/no-require-imports
  const crypto = require("crypto");
  killStaleProxy();

  const proxyToken = crypto.randomBytes(24).toString("hex");
  ollamaProxyToken = proxyToken;
  // Don't persist yet — wait until provider is confirmed in setupInference.
  // If the user backs out to a different provider, the token stays in memory
  // only and is discarded.
  const pid = spawnOllamaAuthProxy(proxyToken);
  sleepSeconds(1);
  if (!isOllamaProxyProcess(pid)) {
    console.error(`  Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`);
    console.error(`  Containers will not be able to reach Ollama without the proxy.`);
    console.error(
      `  Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`,
    );
    return false;
  }
  return true;
}
Cache the token only after the proxy is proven alive.
Line 151 and Line 185 set ollamaProxyToken before any successful liveness check. If the child exits immediately, getOllamaProxyToken() can keep returning a token that no proxy is serving, and ensureOllamaAuthProxy() currently gives callers no signal that recovery failed.
🩹 Suggested fix
export function startOllamaAuthProxy(): boolean {
  // eslint-disable-next-line @typescript-eslint/no-require-imports
const crypto = require("crypto");
killStaleProxy();
const proxyToken = crypto.randomBytes(24).toString("hex");
- ollamaProxyToken = proxyToken;
// Don't persist yet — wait until provider is confirmed in setupInference.
// If the user backs out to a different provider, the token stays in memory
// only and is discarded.
const pid = spawnOllamaAuthProxy(proxyToken);
sleepSeconds(1);
if (!isOllamaProxyProcess(pid)) {
+ ollamaProxyToken = null;
+ clearPersistedProxyPid();
console.error(` Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`);
console.error(` Containers will not be able to reach Ollama without the proxy.`);
console.error(
` Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`,
);
return false;
}
+ ollamaProxyToken = proxyToken;
return true;
}
-export function ensureOllamaAuthProxy(): void {
+export function ensureOllamaAuthProxy(): boolean {
const token = loadPersistedProxyToken();
- if (!token) return;
+ if (!token) return false;
const pid = loadPersistedProxyPid();
if (isOllamaProxyProcess(pid)) {
ollamaProxyToken = token;
- return;
+ return true;
}
killStaleProxy();
- ollamaProxyToken = token;
- spawnOllamaAuthProxy(token);
+ const restartedPid = spawnOllamaAuthProxy(token);
sleepSeconds(1);
+ if (!isOllamaProxyProcess(restartedPid)) {
+ ollamaProxyToken = null;
+ clearPersistedProxyPid();
+ return false;
+ }
+ ollamaProxyToken = token;
+ return true;
+ }
Also applies to: 172-188
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/ollama-proxy.ts` around lines 145 - 165, The code sets
ollamaProxyToken before verifying the proxy is alive; update
startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
export function isProxyHealthy(): boolean {
  // 1. PID check — fast rejection if process is gone
  const pid = loadPersistedProxyPid();
  if (!isOllamaProxyProcess(pid)) return false;

  // 2. HTTP probe — confirm the proxy actually responds, not just that
  //    the process exists (catches wedged/stale proxy processes)
  const proxyUrl = `http://127.0.0.1:${OLLAMA_PROXY_PORT}/api/tags`;
  const token = loadPersistedProxyToken();
  const probeCmd = token
    ? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5",
       "-H", `Authorization: Bearer ${token}`, proxyUrl]
    : ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl];

  const output = runCapture(probeCmd, { ignoreError: true });
  return !!output;
}
Let the HTTP probe drive the fallback decision.
This helper is now used to decide whether onboarding can continue after the container-side check fails, but Line 209-Line 210 returns false before the HTTP probe runs. A missing or stale PID file would still abort onboarding even if the proxy is healthy on 127.0.0.1:${OLLAMA_PROXY_PORT}.
🩹 Suggested fix
export function isProxyHealthy(): boolean {
- // 1. PID check — fast rejection if process is gone
- const pid = loadPersistedProxyPid();
- if (!isOllamaProxyProcess(pid)) return false;
-
- // 2. HTTP probe — confirm the proxy actually responds, not just that
- // the process exists (catches wedged/stale proxy processes)
+ // HTTP reachability is the source of truth for the onboarding fallback.
const proxyUrl = `http://127.0.0.1:${OLLAMA_PROXY_PORT}/api/tags`;
const token = loadPersistedProxyToken();
const probeCmd = token
? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5",
"-H", `Authorization: Bearer ${token}`, proxyUrl]
: ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl];
const output = runCapture(probeCmd, { ignoreError: true });
- return !!output;
+ return Boolean(output);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/ollama-proxy.ts` around lines 207 - 222, In isProxyHealthy(), don't
early-return false when isOllamaProxyProcess(pid) is falsy; instead store that
boolean (e.g., hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP
probe built with loadPersistedProxyToken() and runCapture(..., { ignoreError:
true }), and then return true if the probe produced output, otherwise return
hasValidPid (so a successful HTTP probe wins even if the PID file is
missing/stale). Update references in the function to use the new hasValidPid
local variable and remove the early return.
Summary
On Brev cloud GPU instances, `nemoclaw onboard` with Local Ollama fails at step 4/8 because Docker's `--add-host host-gateway` does not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host.
Additionally, this PR extracts the entire Ollama auth proxy subsystem from the `onboard.ts` monolith into a dedicated `ollama-proxy.ts` module, reducing `onboard.ts` by 161 lines and advancing issue #767.
Related Issue
Fixes #2425
Ref #767
Changes
- Add `--connect-timeout 5 --max-time 10` to the container reachability curl command in `src/lib/local-inference.ts` to prevent indefinite hangs
- Retry the container check up to 3 times with 2s backoff in `validateLocalProvider()`
- Add `diagnostic` field to `ValidationResult` interface
- Extract the Ollama auth proxy subsystem from `src/lib/onboard.ts` into new `src/lib/ollama-proxy.ts` — token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts)
- Add `isProxyHealthy()` in `ollama-proxy.ts` that performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processes
- Simplify the `setupInference()` fallback in `onboard.ts` to use `isProxyHealthy()` instead of inline PID + host check logic
- Add a host-side health fallback for `vllm-local` using the existing health check endpoint
- Add tests in `src/lib/local-inference.test.ts` covering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failure

Type of Change
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)

AI Disclosure
Signed-off-by: Julie Yaunches jyaunches@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes