
Conversation


@tzulingk tzulingk commented Dec 16, 2025

Overview:

This PR fixes race conditions in canary health checks that caused "no responders" and "instance_id not found" errors in Kubernetes environments. The primary fix delays the start of the canary health check manager until the system is ready, eliminating the root cause: health checks firing before NATS endpoints were fully subscribed.

Problem: Health checks were starting immediately when DistributedRuntime was created, often before NATS endpoints had established their subscriptions, leading to false failures in K8s with CPU throttling and scheduling delays.

Solution: Wait for system readiness (same condition used by readiness probes) before starting canary health checks, with additional defense-in-depth measures for edge cases.

Details:

Delayed Canary Health Check Start (lib/runtime/src/distributed.rs)

  • Health check manager now waits for system readiness before starting
  • Polls health status with exponential backoff (100ms → 5s intervals)
  • Max wait time: 5 minutes (handles large model loading)
  • Logs progress every 10 seconds during the wait (see the sketch below)
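
A minimal sketch of the readiness-gated startup described above (illustrative only; the handle name drt and the exact start_health_check_manager signature in lib/runtime/src/health_check.rs are assumptions, not the merged code):

    // Sketch: gate canary health check startup on system readiness.
    let drt = distributed_runtime.clone(); // hypothetical handle to the DistributedRuntime
    let canary_wait_time = std::time::Duration::from_secs(20); // captured before the spawn
    let request_timeout = std::time::Duration::from_secs(3);

    tokio::spawn(async move {
        tracing::info!("Waiting for system readiness before starting canary health checks");

        let mut wait_time = std::time::Duration::from_millis(100);
        let max_wait = std::time::Duration::from_secs(300); // 5 minutes for large model loading
        let mut total_waited = std::time::Duration::ZERO;

        loop {
            let (healthy, endpoints) = drt.system_health().lock().get_health_status();
            if healthy {
                tracing::info!(
                    "System ready after {:?} (endpoints: {:?}), starting canary health check manager",
                    total_waited,
                    endpoints
                );
                break;
            }
            if total_waited >= max_wait {
                // Defensive fallback: proceed anyway so health checks eventually start.
                tracing::warn!("System not ready after {:?}; starting canary health checks anyway", max_wait);
                break;
            }
            tokio::time::sleep(wait_time).await;
            total_waited += wait_time;
            // Exponential backoff: 100ms doubling up to a 5s ceiling.
            wait_time = (wait_time * 2).min(std::time::Duration::from_secs(5));
        }

        // Assumed signature; see start_health_check_manager in lib/runtime/src/health_check.rs.
        if let Err(e) = start_health_check_manager(drt, canary_wait_time, request_timeout).await {
            tracing::error!("Failed to start canary health check manager: {:?}", e);
        }
    });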

Test Results:
System startup with delayed canary checks:
  • Waited 11.3s for system readiness
  • Health check manager started AFTER all endpoints were ready
  • First canary fired at T+20s (after the system was stable)

Multiple successful health checks (no failures):
  • 15+ consecutive health checks completed successfully
  • No "no responders" errors
  • No "instance_id not found" errors
  • No timeouts

Log evidence from /tmp/tzuling_backend.log:
[INFO] Waiting for system readiness before starting canary health checks
[INFO] System ready after 11.3s (endpoints: ["unload_lora", "load_lora", "list_loras", "clear_kv_blocks", "generate"]), starting canary health check manager
[INFO] Canary health check manager started (canary_wait_time: 20s, request_timeout: 3s)
[DEBUG] Health check successful for generate (repeated 15+ times)
[DEBUG] Health check completed for generate (repeated 15+ times)

Related Issues:

DIS-1185

Summary by CodeRabbit

  • Refactor
    • Enhanced health check initialization with a system readiness verification phase. Implements exponential backoff polling with a 5-minute maximum wait window before health checks commence. Improved startup logging reflects readiness status throughout the process. Health checks proceed regardless of final readiness determination to ensure continuous system operation.


@tzulingk tzulingk requested a review from a team as a code owner December 16, 2025 22:19
@github-actions github-actions bot added the fix label Dec 16, 2025

coderabbitai bot commented Dec 16, 2025

Walkthrough

Introduces a readiness wait with exponential backoff polling before starting canary health checks. Health check startup configuration is deferred until after system readiness is determined within a spawned task, reducing potential race conditions while maintaining the same observable outcome.

Changes

Cohort: Health Check Readiness Deferral
File(s): lib/runtime/src/distributed.rs
Summary: Added a readiness polling loop with exponential backoff (max 5 minutes) before canary health check startup. Health check configuration and manager startup are moved inside the spawned task and executed after readiness verification. Includes new logging for readiness state and a warning fallback if the max wait is exceeded.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Readiness polling implementation: Verify exponential backoff logic and 5-minute timeout behavior
  • Task spawning and async control flow: Confirm proper error handling and task lifecycle within spawned context
  • Configuration timing: Ensure canary_wait_time and request_timeout capture is correct and thread-safe
  • Logging coverage: Review that readiness state transitions are adequately logged for debugging
  • Race condition elimination: Validate that system endpoint subscription occurs before health checks initiate

Poem

🐰 Patience, dear checks, before you begin!
We'll wait for the system with a carrot-y grin,
Five minutes to settle with backoff so smooth—
Now race conditions? Poof! No more to prove! 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and concisely summarizes the main change: delaying canary health checks until the system is ready, which directly addresses the race condition described in the PR.
  • Description check (✅ Passed): The PR description is well-structured, comprehensive, and follows the required template with all major sections completed.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/runtime/src/distributed.rs (2)

264-296: Consider adding cancellation support to the spawned task.

The spawned task runs without a cancellation token, so it will continue executing even if the runtime shuts down during the readiness wait (up to 5 minutes). This could cause the task to attempt starting the health check manager after the system has begun shutting down.

Consider passing the cancellation token and using tokio::select!:

+            let cancel_token = drt_clone.primary_token();
             tokio::spawn(async move {
                 tracing::info!("Waiting for system readiness before starting canary health checks");
 
                 // Poll readiness with exponential backoff
                 let mut wait_time = std::time::Duration::from_millis(100);
                 let max_wait = std::time::Duration::from_secs(300); // 5 minutes for large model loading
                 let mut total_waited = std::time::Duration::ZERO;
 
                 while total_waited < max_wait {
+                    if cancel_token.is_cancelled() {
+                        tracing::info!("Canary health check startup cancelled due to shutdown");
+                        return;
+                    }
                     let (healthy, endpoints) = drt_clone.system_health().lock().get_health_status();

283-290: Progress logging may trigger inconsistently.

The condition total_waited.as_secs().is_multiple_of(10) checks for exact multiples of 10 seconds. However, with exponential backoff timing (0.1s, 0.3s, 0.7s, 1.5s, 3.1s, 6.3s, 11.3s, 16.3s, 21.3s...), the accumulated time rarely lands on exact 10-second boundaries, causing the log to trigger sporadically or not at all during the early wait period.

Consider tracking the last log time separately:

+                let mut last_log_time = std::time::Duration::ZERO;
+
                 while total_waited < max_wait {
                     // ... get health status ...

-                    // Log progress every 10 seconds
-                    if total_waited.as_secs() > 0 && total_waited.as_secs().is_multiple_of(10) {
+                    // Log progress every 10 seconds
+                    if total_waited >= last_log_time + std::time::Duration::from_secs(10) {
+                        last_log_time = total_waited;
                         tracing::debug!(
                             "Still waiting for system readiness... ({:?} elapsed, endpoints: {:?})",
                             total_waited,
                             endpoints
                         );
                     }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32eaecb and 9a50bb1.

📒 Files selected for processing (1)
  • lib/runtime/src/distributed.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/health_check.rs (1)
  • start_health_check_manager (384-395)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (.)
🔇 Additional comments (1)
lib/runtime/src/distributed.rs (1)

256-327: Overall approach looks good for fixing the race condition.

The implementation correctly delays health checks until system readiness, using exponential backoff to reduce polling overhead while still detecting readiness quickly. The defensive "proceed anyway" behavior after the 5-minute timeout ensures health checks eventually start even if readiness is never signaled.

@tzulingk tzulingk enabled auto-merge (squash) December 16, 2025 22:24
