fix: delay canary health checks until system ready #4985
base: main
Conversation
…bscription race conditions Signed-off-by: [email protected] <[email protected]>
Walkthrough

Introduces a readiness wait with exponential backoff polling before starting canary health checks. Health check startup configuration is deferred until after system readiness is determined within a spawned task, reducing potential race conditions while maintaining the same observable outcome.
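As a rough illustration of that flow, here is a minimal sketch of a spawned task that polls readiness with exponential backoff before starting health checks. The names `system_is_ready` and `start_health_check_manager` are hypothetical stand-ins for the runtime's actual readiness query and startup call; the backoff constants mirror the ones discussed in the review below (100ms doubling up to 5s per step, capped at 5 minutes total).

```rust
use std::time::Duration;

// Hypothetical stand-ins for the runtime's readiness query and startup call.
async fn system_is_ready() -> bool { /* consult shared health state */ true }
async fn start_health_check_manager() { /* begin canary health checks */ }

fn spawn_delayed_health_checks() {
    tokio::spawn(async move {
        let mut wait_time = Duration::from_millis(100);
        let max_wait = Duration::from_secs(300); // 5 minutes for large model loading
        let mut total_waited = Duration::ZERO;

        // Poll readiness with exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at 5s.
        while total_waited < max_wait {
            if system_is_ready().await {
                break;
            }
            tokio::time::sleep(wait_time).await;
            total_waited += wait_time;
            wait_time = (wait_time * 2).min(Duration::from_secs(5));
        }

        // Defensive: proceed even if readiness was never signaled within the cap.
        start_health_check_manager().await;
    });
}
```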
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/runtime/src/distributed.rs (2)
264-296: Consider adding cancellation support to the spawned task.

The spawned task runs without a cancellation token, so it will continue executing even if the runtime shuts down during the readiness wait (up to 5 minutes). This could cause the task to attempt starting the health check manager after the system has begun shutting down.
Consider passing the cancellation token and using `tokio::select!`:

```diff
+ let cancel_token = drt_clone.primary_token();
  tokio::spawn(async move {
      tracing::info!("Waiting for system readiness before starting canary health checks");

      // Poll readiness with exponential backoff
      let mut wait_time = std::time::Duration::from_millis(100);
      let max_wait = std::time::Duration::from_secs(300); // 5 minutes for large model loading
      let mut total_waited = std::time::Duration::ZERO;

      while total_waited < max_wait {
+         if cancel_token.is_cancelled() {
+             tracing::info!("Canary health check startup cancelled due to shutdown");
+             return;
+         }
          let (healthy, endpoints) = drt_clone.system_health().lock().get_health_status();
```
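The comment above mentions `tokio::select!`; a variant that races the backoff sleep against cancellation (rather than checking `is_cancelled()` once per iteration) could look like the sketch below. This is a minimal sketch, assuming the runtime's primary token is a `tokio_util::sync::CancellationToken`; `wait_for_shutdown_or_sleep` is a hypothetical helper, not part of the codebase.

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Hypothetical helper: sleeps for `wait_time` unless the token is cancelled
// first. Returns false on cancellation so the caller can bail out without
// starting the health check manager.
async fn wait_for_shutdown_or_sleep(cancel: &CancellationToken, wait_time: Duration) -> bool {
    tokio::select! {
        _ = cancel.cancelled() => {
            tracing::info!("Canary health check startup cancelled due to shutdown");
            false
        }
        _ = tokio::time::sleep(wait_time) => true,
    }
}
```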
283-290: Progress logging may trigger inconsistently.

The condition `total_waited.as_secs().is_multiple_of(10)` checks for exact multiples of 10 seconds. However, with exponential backoff timing (0.1s, 0.3s, 0.7s, 1.5s, 3.1s, 6.3s, 11.3s, 16.3s, 21.3s...), the accumulated time rarely lands on exact 10-second boundaries, causing the log to trigger sporadically or not at all during the early wait period.

Consider tracking the last log time separately:
```diff
+ let mut last_log_time = std::time::Duration::ZERO;
+
  while total_waited < max_wait {
      // ... get health status ...

-     // Log progress every 10 seconds
-     if total_waited.as_secs() > 0 && total_waited.as_secs().is_multiple_of(10) {
+     // Log progress every 10 seconds
+     if total_waited >= last_log_time + std::time::Duration::from_secs(10) {
+         last_log_time = total_waited;
          tracing::debug!(
              "Still waiting for system readiness... ({:?} elapsed, endpoints: {:?})",
              total_waited,
              endpoints
          );
      }
```
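To see why the original condition misfires, a small standalone simulation of the backoff schedule (not code from the PR, just the arithmetic) reproduces the cumulative wait times the reviewer lists:

```rust
use std::time::Duration;

// Replays the backoff schedule (100ms doubling, capped at 5s per step) and
// prints the cumulative wait after each sleep.
fn main() {
    let mut wait = Duration::from_millis(100);
    let mut total = Duration::ZERO;
    for _ in 0..9 {
        total += wait;
        println!("total waited: {:.1}s", total.as_secs_f64());
        wait = (wait * 2).min(Duration::from_secs(5));
    }
    // Prints 0.1, 0.3, 0.7, 1.5, 3.1, 6.3, 11.3, 16.3, 21.3 — `as_secs()`
    // truncates to whole seconds, so the result is almost never an exact
    // multiple of 10 during the early wait period.
}
```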
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/runtime/src/distributed.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/health_check.rs (1)
start_health_check_manager(384-395)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: tests (.)
🔇 Additional comments (1)
lib/runtime/src/distributed.rs (1)
256-327: Overall approach looks good for fixing the race condition.

The implementation correctly delays health checks until system readiness, using exponential backoff to reduce polling overhead while still detecting readiness quickly. The defensive "proceed anyway" behavior after the 5-minute timeout ensures health checks eventually start even if readiness is never signaled.
Overview:
This PR fixes race conditions in canary health checks that caused "no responders" and "instance_id not found" errors in Kubernetes environments. The primary fix delays the start of the canary health check manager until the system is ready, eliminating the root cause: health checks fired before NATS endpoints were fully subscribed.
Problem: Health checks were starting immediately when `DistributedRuntime` was created, often before NATS endpoints had established their subscriptions, leading to false failures in K8s under CPU throttling and scheduling delays.

Solution: Wait for system readiness (the same condition used by readiness probes) before starting canary health checks, with additional defense-in-depth measures for edge cases.
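As a sketch of what "the same condition used by readiness probes" means here: both the probe handler and the canary gate read one shared health state, so the gate can never run ahead of what Kubernetes sees. The `SystemHealth` struct below is a simplified stand-in, not the runtime's real type (which, per the review diff above, also tracks registered endpoints behind `system_health().lock()`).

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-in for the runtime's shared health state.
#[derive(Default)]
struct SystemHealth {
    ready: bool,
    endpoints: Vec<String>,
}

impl SystemHealth {
    fn get_health_status(&self) -> (bool, Vec<String>) {
        (self.ready, self.endpoints.clone())
    }
}

// Served to Kubernetes as the readiness probe result.
fn readiness_probe(health: &Arc<Mutex<SystemHealth>>) -> bool {
    health.lock().unwrap().get_health_status().0
}

// The canary gate waits on exactly the same condition.
fn canary_may_start(health: &Arc<Mutex<SystemHealth>>) -> bool {
    readiness_probe(health)
}
```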
Details:

**Delayed Canary Health Check Start** (`lib/runtime/src/distributed.rs`)

Test Results:
System startup with delayed canary checks:
- Waited 11.3s for system readiness
- Health check manager started AFTER all endpoints ready
- First canary fired at T+20s (after system stable)

Multiple successful health checks (no failures):
- 15+ consecutive health checks completed successfully
- No "no responders" errors
- No "instance_id not found" errors
- No timeouts
Log evidence from `/tmp/tzuling_backend.log`:

```
[INFO] Waiting for system readiness before starting canary health checks
[INFO] System ready after 11.3s (endpoints: ["unload_lora", "load_lora", "list_loras", "clear_kv_blocks", "generate"]), starting canary health check manager
[INFO] Canary health check manager started (canary_wait_time: 20s, request_timeout: 3s)
[DEBUG] Health check successful for generate (repeated 15+ times)
[DEBUG] Health check completed for generate (repeated 15+ times)
```
Related Issues:
DIS-1185