
Conversation


@tzulingk tzulingk commented Dec 16, 2025

Overview:

This PR fixes race conditions in canary health checks that caused "no responders" and "instance_id not found" errors in Kubernetes environments. The primary fix delays the start of the canary health check manager until the system is ready, eliminating the root cause: health checks firing before NATS endpoints were fully subscribed.

Problem: Health checks were starting immediately when DistributedRuntime was created, often before NATS endpoints had established their subscriptions, leading to false failures in K8s with CPU throttling and scheduling delays.

Solution: Wait for system readiness (same condition used by readiness probes) before starting canary health checks, with additional defense-in-depth measures for edge cases.

Details:

Delayed Canary Health Check Start (lib/runtime/src/distributed.rs)

  • Health check manager now waits for system readiness before starting
  • Polls health status with exponential backoff (100ms → 5s intervals)
  • Max wait time: 5 minutes (handles large model loading)
  • Logs progress every 10 seconds during the wait (see the sketch below)
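
A minimal sketch of the readiness-gated startup described above (illustrative only; the handle name drt and the exact start_health_check_manager signature in lib/runtime/src/health_check.rs are assumptions, not the merged code):

    // Sketch: gate canary health check startup on system readiness.
    let drt = distributed_runtime.clone(); // hypothetical handle to the DistributedRuntime
    let canary_wait_time = std::time::Duration::from_secs(20); // captured before the spawn
    let request_timeout = std::time::Duration::from_secs(3);

    tokio::spawn(async move {
        tracing::info!("Waiting for system readiness before starting canary health checks");

        let mut wait_time = std::time::Duration::from_millis(100);
        let max_wait = std::time::Duration::from_secs(300); // 5 minutes for large model loading
        let mut total_waited = std::time::Duration::ZERO;

        loop {
            let (healthy, endpoints) = drt.system_health().lock().get_health_status();
            if healthy {
                tracing::info!(
                    "System ready after {:?} (endpoints: {:?}), starting canary health check manager",
                    total_waited,
                    endpoints
                );
                break;
            }
            if total_waited >= max_wait {
                // Defensive fallback: proceed anyway so health checks eventually start.
                tracing::warn!("System not ready after {:?}; starting canary health checks anyway", max_wait);
                break;
            }
            tokio::time::sleep(wait_time).await;
            total_waited += wait_time;
            // Exponential backoff: 100ms doubling up to a 5s ceiling.
            wait_time = (wait_time * 2).min(std::time::Duration::from_secs(5));
        }

        // Assumed signature; see start_health_check_manager in lib/runtime/src/health_check.rs.
        if let Err(e) = start_health_check_manager(drt, canary_wait_time, request_timeout).await {
            tracing::error!("Failed to start canary health check manager: {:?}", e);
        }
    });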

Test Results:
System startup with delayed canary checks:
  • Waited 11.3s for system readiness
  • Health check manager started AFTER all endpoints were ready
  • First canary fired at T+20s (after the system was stable)

Multiple successful health checks (no failures):
  • 15+ consecutive health checks completed successfully
  • No "no responders" errors
  • No "instance_id not found" errors
  • No timeouts

Log evidence from /tmp/tzuling_backend.log:
[INFO] Waiting for system readiness before starting canary health checks
[INFO] System ready after 11.3s (endpoints: ["unload_lora", "load_lora", "list_loras", "clear_kv_blocks", "generate"]), starting canary health check manager
[INFO] Canary health check manager started (canary_wait_time: 20s, request_timeout: 3s)
[DEBUG] Health check successful for generate (repeated 15+ times)
[DEBUG] Health check completed for generate (repeated 15+ times)

Related Issues:

DIS-1185

Summary by CodeRabbit

  • Refactor
    • Enhanced health check initialization with a system readiness verification phase. Implements exponential backoff polling with a 5-minute maximum wait window before health checks commence. Improved startup logging reflects readiness status throughout the process. Health checks proceed regardless of final readiness determination to ensure continuous system operation.


@tzulingk tzulingk requested a review from a team as a code owner December 16, 2025 22:19
@github-actions github-actions bot added the fix label Dec 16, 2025

coderabbitai bot commented Dec 16, 2025

Walkthrough

Introduces a readiness wait with exponential backoff polling before starting canary health checks. Health check startup configuration is deferred until after system readiness is determined within a spawned task, reducing potential race conditions while maintaining the same observable outcome.

Changes

Cohort: Health Check Readiness Deferral
File(s): lib/runtime/src/distributed.rs
Summary: Added a readiness polling loop with exponential backoff (max 5 minutes) before canary health check startup. Health check configuration and manager startup are moved inside the spawned task and executed after readiness verification. Includes new logging for readiness state and a warning fallback if the max wait is exceeded.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Readiness polling implementation: Verify exponential backoff logic and 5-minute timeout behavior
  • Task spawning and async control flow: Confirm proper error handling and task lifecycle within spawned context
  • Configuration timing: Ensure canary_wait_time and request_timeout capture is correct and thread-safe
  • Logging coverage: Review that readiness state transitions are adequately logged for debugging
  • Race condition elimination: Validate that system endpoint subscription occurs before health checks initiate

Poem

🐰 Patience, dear checks, before you begin!
We'll wait for the system with a carrot-y grin,
Five minutes to settle with backoff so smooth—
Now race conditions? Poof! No more to prove! 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and concisely summarizes the main change: delaying canary health checks until the system is ready, which directly addresses the race condition described in the PR.
  • Description check (✅ Passed): The PR description is well-structured, comprehensive, and follows the required template with all major sections completed.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/runtime/src/distributed.rs (2)

264-296: Consider adding cancellation support to the spawned task.

The spawned task runs without a cancellation token, so it will continue executing even if the runtime shuts down during the readiness wait (up to 5 minutes). This could cause the task to attempt starting the health check manager after the system has begun shutting down.

Consider passing the cancellation token and using tokio::select!:

+            let cancel_token = drt_clone.primary_token();
             tokio::spawn(async move {
                 tracing::info!("Waiting for system readiness before starting canary health checks");
 
                 // Poll readiness with exponential backoff
                 let mut wait_time = std::time::Duration::from_millis(100);
                 let max_wait = std::time::Duration::from_secs(300); // 5 minutes for large model loading
                 let mut total_waited = std::time::Duration::ZERO;
 
                 while total_waited < max_wait {
+                    if cancel_token.is_cancelled() {
+                        tracing::info!("Canary health check startup cancelled due to shutdown");
+                        return;
+                    }
                     let (healthy, endpoints) = drt_clone.system_health().lock().get_health_status();

283-290: Progress logging may trigger inconsistently.

The condition total_waited.as_secs().is_multiple_of(10) checks for exact multiples of 10 seconds. However, with exponential backoff timing (0.1s, 0.3s, 0.7s, 1.5s, 3.1s, 6.3s, 11.3s, 16.3s, 21.3s...), the accumulated time rarely lands on exact 10-second boundaries, causing the log to trigger sporadically or not at all during the early wait period.

Consider tracking the last log time separately:

+                let mut last_log_time = std::time::Duration::ZERO;
+
                 while total_waited < max_wait {
                     // ... get health status ...

-                    // Log progress every 10 seconds
-                    if total_waited.as_secs() > 0 && total_waited.as_secs().is_multiple_of(10) {
+                    // Log progress every 10 seconds
+                    if total_waited >= last_log_time + std::time::Duration::from_secs(10) {
+                        last_log_time = total_waited;
                         tracing::debug!(
                             "Still waiting for system readiness... ({:?} elapsed, endpoints: {:?})",
                             total_waited,
                             endpoints
                         );
                     }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32eaecb and 9a50bb1.

📒 Files selected for processing (1)
  • lib/runtime/src/distributed.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/health_check.rs (1)
  • start_health_check_manager (384-395)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (.)
🔇 Additional comments (1)
lib/runtime/src/distributed.rs (1)

256-327: Overall approach looks good for fixing the race condition.

The implementation correctly delays health checks until system readiness, using exponential backoff to reduce polling overhead while still detecting readiness quickly. The defensive "proceed anyway" behavior after the 5-minute timeout ensures health checks eventually start even if readiness is never signaled.

@tzulingk tzulingk enabled auto-merge (squash) December 16, 2025 22:24
