Skip to content

perf(onboard): add deadline-based gateway wait#2492

Open
HOYALIM wants to merge 7 commits intoNVIDIA:mainfrom
HOYALIM:issue-2001-latency-waits
Open

perf(onboard): add deadline-based gateway wait#2492
HOYALIM wants to merge 7 commits intoNVIDIA:mainfrom
HOYALIM:issue-2001-latency-waits

Conversation

@HOYALIM
Copy link
Copy Markdown

@HOYALIM HOYALIM commented Apr 26, 2026

Summary

Adds a reusable deadline-based wait primitive and applies it to the gateway startup health loop.
This is a scoped first step for the broader onboard latency work in #2001 and keeps the existing health poll environment knobs compatible.

Related Issue

Refs #2001.

Changes

  • Extend src/lib/wait.ts with synchronous waitUntil(...) using an absolute deadline, configurable interval backoff, injectable clock/sleeper hooks, and an optional attempt cap.
  • Replace the fixed gateway startup health polling loop in src/lib/onboard.ts with waitUntil(...).
  • Preserve the existing NEMOCLAW_HEALTH_POLL_COUNT and NEMOCLAW_HEALTH_POLL_INTERVAL behavior by passing the configured attempt count and interval into the new wait helper.
  • Add focused tests for immediate success, retry success, timeout behavior, interval backoff, and zero-interval attempt caps.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Additional validation run locally:

  • npm run build:cli passes
  • ./node_modules/.bin/vitest run test/wait.test.ts test/gateway-start-wait.test.ts passes
  • npm run typecheck:cli passes
  • ./node_modules/.bin/prettier --check src/lib/wait.ts test/wait.test.ts passes
  • git diff --check passes

Notes:

AI Disclosure

  • AI-assisted — tool: OpenAI Codex

Signed-off-by: Ho Lim subhoya@gmail.com

Summary by CodeRabbit

  • New Features

    • Added a configurable polling utility with controllable intervals, backoff, attempt limits, and deadline handling.
  • Refactor

    • Gateway startup health polling now uses the new retry primitive for consistent, configurable polling behavior.
  • Tests

    • Added comprehensive tests for immediate success/failure, invalid configs, deadline termination, max-attempts, backoff capping, and precise polling sequences.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a synchronous polling primitive waitUntil and refactors gateway health polling in startGatewayWithOptions to use it, converting intervals to milliseconds and driving retries via configurable options while preserving per-attempt repair/attach/select/probe steps and original failure behavior.

Changes

Cohort / File(s) Summary
Polling utility
src/lib/wait.ts
Adds exported waitUntil and WaitUntilOptions. Implements injectable now/sleep, sanitized timing/backoff, deadline and attempt validation, immediate predicate check, bounded sleeps, exponential backoff clamped by maxIntervalMs, and termination by success, deadline, or maxAttempts.
Gateway health refactor
src/lib/onboard.ts
Replaces manual for/sleep loop in startGatewayWithOptions with waitUntil, mapping intervals to initialIntervalMs/maxIntervalMs, using backoffFactor: 1, maxAttempts: healthPollCount, and an effectively unbounded deadlineMs; retains repair, metadata attach, reselect, and probe sequence and same error on failure.
Tests
test/wait.test.ts
Adds comprehensive tests for waitUntil: immediate success, expired-deadline behavior, invalid-deadline TypeError, repeated polling with exact sleep sequence, deadline-based termination (including truncated final sleep), backoff respecting maxIntervalMs, maxAttempts termination, and zero-interval progress/yielding cases.

Sequence Diagram(s)

sequenceDiagram
  participant Onboard as Onboard/startGatewayWithOptions
  participant Wait as waitUntil
  participant Repair as RepairSecrets
  participant Meta as AttachMetadata
  participant Select as SelectGateway
  participant Probe as HealthProbes

  rect rgba(200,220,255,0.5)
    Onboard->>Wait: call waitUntil(condition, options)
  end

  rect rgba(220,255,200,0.5)
    Wait->>Repair: attempt repair of missing bootstrap secrets
    Wait->>Meta: conditionally attach gateway metadata
    Wait->>Select: re-select gateway if applicable
    Wait->>Probe: run probes (status, gateway info -g <name>, gateway info)
    Probe-->>Wait: return probe results
    Wait-->>Onboard: isGatewayHealthy? (condition result)
  end

  rect rgba(255,230,200,0.5)
    alt healthy
      Onboard->>Onboard: proceed
    else not healthy & attempts remain
      Wait->>Wait: sleep (bounded/backoff) and retry
    else deadline/attempts exhausted
      Wait-->>Onboard: return false -> throw Error("Gateway failed to start")
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through sleeps and checked the door,
Timers in paw, I sniffed the floor.
Secrets mended, metadata neat,
Backoff footsteps — soft, repeat.
The gateway stirs; I tap my feet.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf(onboard): add deadline-based gateway wait' accurately describes the main change—introducing deadline-based waiting to gateway startup polling in the onboard module.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@HOYALIM HOYALIM changed the title [codex] Add deadline-based gateway wait perf(onboard): add deadline-based gateway wait Apr 26, 2026
@HOYALIM HOYALIM marked this pull request as ready for review April 27, 2026 00:02
Copilot AI review requested due to automatic review settings April 27, 2026 00:02
@HOYALIM HOYALIM marked this pull request as draft April 27, 2026 00:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a reusable, synchronous, deadline-based polling helper and applies it to the gateway startup health wait loop as part of the onboard latency work (refs #2001), while aiming to preserve existing health poll env knob behavior.

Changes:

  • Added waitUntil(...) to src/lib/wait.ts with deadline, backoff, injectable clock/sleeper hooks, and optional attempt cap.
  • Replaced the gateway startup health polling loop in src/lib/onboard.ts with the new waitUntil(...) helper.
  • Expanded test/wait.test.ts with focused unit tests for waitUntil(...) behavior (immediate success, retries, deadline, backoff, and attempt caps).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
test/wait.test.ts Adds unit coverage for the new waitUntil(...) helper.
src/lib/wait.ts Introduces the waitUntil(...) synchronous polling primitive and options.
src/lib/onboard.ts Switches gateway health polling to use waitUntil(...) with existing poll count/interval inputs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/wait.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/wait.test.ts (1)

42-172: Add a contract test for invalid deadlineMs.

Please add a test that asserts waitUntil throws TypeError for non-finite deadlineMs (e.g., NaN), so the input validation path is locked in.

✅ Suggested test
+  it("waitUntil throws when deadlineMs is non-finite", () => {
+    expect(() =>
+      waitUntil(() => false, {
+        deadlineMs: Number.NaN,
+      }),
+    ).toThrow(TypeError);
+  });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/wait.test.ts` around lines 42 - 172, Add a new test in wait.test.ts
using the existing waitUntil harness that verifies passing a non-finite
deadlineMs (e.g., NaN) causes waitUntil to throw a TypeError: call waitUntil
with a simple predicate and options including deadlineMs: NaN plus the same now
and sleep mocks used in other tests, and assert with expect(() =>
waitUntil(...)).toThrow(TypeError) so the input validation path for deadlineMs
is locked in.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3512-3541: The deadline calculation (healthDeadlineMs) only
accounts for sleep budget and can expire while each waitUntil attempt runs
multiple shell-outs (repairGatewayBootstrapSecrets, runCaptureOpenshell,
isGatewayHealthy), shortening the effective retry window; fix by extending or
removing the deadline: either compute healthDeadlineMs to also include an
allowance for per-attempt probe runtime (e.g., add healthPollCount *
probeTimeoutMs or add a conservative maxProbeRuntimeMs) so the deadline >= total
sleep budget + total probe runtime, or omit deadlineMs from the waitUntil call
to rely solely on maxAttempts/healthPollCount; update the healthDeadlineMs
computation or the waitUntil invocation accordingly (references:
healthPollIntervalMs, healthDeadlineMs, healthPollCount, waitUntil,
repairGatewayBootstrapSecrets, runCaptureOpenshell, isGatewayHealthy).

In `@src/lib/wait.ts`:
- Around line 98-99: The loop can hot-spin when intervalMs (or maxIntervalMs) is
0 and maxAttempts is unbounded; change the sleeper call and interval update to
enforce a minimum non-zero sleep (e.g., MIN_SLEEP_MS = 1) so you never call
sleeper(0). Specifically, in the block using sleeper(Math.min(intervalMs,
deadlineMs - currentMs)) and intervalMs = Math.min(maxIntervalMs, intervalMs *
backoffFactor), clamp the sleep duration with Math.max(MIN_SLEEP_MS, ...) and
ensure intervalMs is also floored to at least MIN_SLEEP_MS after applying
backoff so the loop yields CPU even when initialIntervalMs/maxIntervalMs are
zero (references: sleeper, intervalMs, maxIntervalMs, backoffFactor, deadlineMs,
currentMs).

---

Nitpick comments:
In `@test/wait.test.ts`:
- Around line 42-172: Add a new test in wait.test.ts using the existing
waitUntil harness that verifies passing a non-finite deadlineMs (e.g., NaN)
causes waitUntil to throw a TypeError: call waitUntil with a simple predicate
and options including deadlineMs: NaN plus the same now and sleep mocks used in
other tests, and assert with expect(() => waitUntil(...)).toThrow(TypeError) so
the input validation path for deadlineMs is locked in.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 88a34e7e-6332-4895-82d8-e2cfc9f03551

📥 Commits

Reviewing files that changed from the base of the PR and between 1f615e2 and 2b8bc08.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/wait.ts
  • test/wait.test.ts

Comment thread src/lib/onboard.ts
Comment thread src/lib/wait.ts Outdated
@HOYALIM HOYALIM marked this pull request as ready for review April 27, 2026 00:08
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/wait.ts`:
- Around line 94-108: The loop currently calls condition() even after the
deadline guard because the deadline check (using now() and deadlineMs) comes
after condition(); move the deadline/time check so it runs before invoking
condition() (and keep the existing attempts checks), ensuring you check
Number.isFinite(currentMs) and currentMs >= deadlineMs prior to calling
condition() to avoid the extra probe; update the loop in src/lib/wait.ts
(references: attempts, maxAttempts, deadlineMs, now(), condition()) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d13eeb32-c6e5-4ffb-bbea-dae420740d03

📥 Commits

Reviewing files that changed from the base of the PR and between 49316b4 and 26b594e.

📒 Files selected for processing (2)
  • src/lib/wait.ts
  • test/wait.test.ts

Comment thread src/lib/wait.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lib/wait.ts (1)

93-109: ⚠️ Potential issue | 🟠 Major

Honor an already-expired deadline before the first probe.

Line 95 only checks the deadline after at least one attempt, so condition() on Line 107 still runs once when deadlineMs <= now() at entry. For callers using this to drive real health probes, that is still one probe past the requested budget. Please move the deadline guard out of the attempts > 0 branch and add a regression test for an already-expired deadline.

Suggested fix
   let attempts = 0;
   for (;;) {
-    if (attempts > 0) {
-      const currentMs = now();
-      if (!Number.isFinite(currentMs) || currentMs >= deadlineMs) {
-        return false;
-      }
+    const currentMs = now();
+    if (!Number.isFinite(currentMs) || currentMs >= deadlineMs) {
+      return false;
     }
 
     if (attempts >= maxAttempts) {
       return false;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/wait.ts` around lines 93 - 109, The loop currently only checks the
deadline after the first probe (gated by attempts > 0), so move the deadline
guard to run on every iteration before invoking condition(): call const
currentMs = now(); if (!Number.isFinite(currentMs) || currentMs >= deadlineMs)
return false; (remove the attempts > 0 gate) so an already-expired deadline
causes an immediate false; also add a regression test that sets deadlineMs <=
now() and asserts the wait function returns false (and that condition() is not
invoked) to prevent regressions; refer to symbols attempts, now(), deadlineMs,
condition(), and maxAttempts when making changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/wait.ts`:
- Around line 93-109: The loop currently only checks the deadline after the
first probe (gated by attempts > 0), so move the deadline guard to run on every
iteration before invoking condition(): call const currentMs = now(); if
(!Number.isFinite(currentMs) || currentMs >= deadlineMs) return false; (remove
the attempts > 0 gate) so an already-expired deadline causes an immediate false;
also add a regression test that sets deadlineMs <= now() and asserts the wait
function returns false (and that condition() is not invoked) to prevent
regressions; refer to symbols attempts, now(), deadlineMs, condition(), and
maxAttempts when making changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ab7479bd-599b-456e-8318-8c5f61cdf8fa

📥 Commits

Reviewing files that changed from the base of the PR and between 26b594e and 38350a1.

📒 Files selected for processing (2)
  • src/lib/wait.ts
  • test/wait.test.ts

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/wait.ts`:
- Around line 95-116: The code uses currentMs captured before calling
condition(), which can be stale if condition() is slow; update the timestamp
after the condition() call and before computing remainingMs/sleepDurationMs so
the sleep budget respects deadlineMs and avoids sleeping past the deadline; in
the wait function recompute currentMs = now() (or equivalent) after condition()
returns and use that value to derive remainingMs, requestedSleepMs and to clamp
sleeper(...) (referencing currentMs, now(), deadlineMs, condition(),
remainingMs, requestedSleepMs, MIN_UNCAPPED_SLEEP_MS, hasAttemptCap, and
sleeper).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3b8e8326-56d6-4585-bb6e-fa32ebce51cf

📥 Commits

Reviewing files that changed from the base of the PR and between 38350a1 and 44901e0.

📒 Files selected for processing (2)
  • src/lib/wait.ts
  • test/wait.test.ts

Comment thread src/lib/wait.ts
@HOYALIM HOYALIM force-pushed the issue-2001-latency-waits branch from bb47782 to 5a709c4 Compare April 27, 2026 04:17
HOYALIM

This comment was marked as resolved.

@wscurran wscurran added dependencies Pull requests that update a dependency file enhancement: performance labels Apr 27, 2026
@wscurran
Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this pull request that proposes a way to improve the performance of NemoClaw's onboard process by introducing a deadline-based wait primitive.


Related open issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file enhancement: performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants