fix: support dynamic metadata updates in Kubernetes backed discovery #4988
Conversation
Walkthrough
This PR extends the discovery system with a new Kubernetes CustomResourceDefinition (DynamoWorkerMetadata) for persisting pod metadata. It introduces a reflector-based daemon that replaces HTTP-based polling, adds pod UID tracking throughout the operator and runtime layers, disables some CI workflow steps, and implements Rust code for CRD creation and application.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Pre-merge checks
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/workflows/container-validation-backends.yml (1 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml (1 hunks)
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2 hunks)
- deploy/cloud/operator/internal/discovery/resource.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/component_common.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/component_planner_test.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/graph_test.go (11 hunks)
- lib/runtime/src/discovery/kube.rs (8 hunks)
- lib/runtime/src/discovery/kube/crd.rs (1 hunks)
- lib/runtime/src/discovery/kube/daemon.rs (3 hunks)
- lib/runtime/src/discovery/kube/utils.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
.github/workflows/container-validation-backends.yml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
lib/runtime/src/discovery/kube/crd.rs
deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml
📚 Learning: 2025-11-05T08:41:06.483Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4070
File: lib/discovery/src/systems/etcd/peer.rs:151-188
Timestamp: 2025-11-05T08:41:06.483Z
Learning: In lib/discovery/src/systems/etcd/peer.rs, the register_instance method intentionally captures the lease_id before entering the OperationExecutor closure. If the lease is revoked or fails, the operation should hard-fail rather than retry with a new lease, because the system does not track which entries were registered under which lease. Retrying with a fresh lease would create inconsistent state.
Applied to files:
lib/runtime/src/discovery/kube.rs
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml
🧬 Code graph analysis (1)
lib/runtime/src/discovery/kube.rs (4)
lib/runtime/src/discovery/kube/utils.rs (1)
- hash_pod_name (10-14)

lib/runtime/src/discovery/kube/crd.rs (3)
- apply_cr (89-115)
- build_cr (56-76)
- new (42-44)

lib/runtime/src/discovery/mod.rs (2)
- instance_id (181-186)
- instance_id (220-220)

lib/runtime/src/discovery/kube/daemon.rs (1)
- new (35-45)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4988/merge) by mohammedabdulwahhab.
lib/runtime/src/discovery/kube/daemon.rs
[error] 1-1: Trailing whitespace detected. pre-commit hook trailing-whitespace failed and modified the file; changes were applied to lib/runtime/src/discovery/kube/daemon.rs. Please review and commit the updated file. Run 'pre-commit run --all-files' locally to reproduce.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (15)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)
848-852: LGTM! POD_UID environment variable added consistently.

The POD_UID environment variable has been correctly added to both leader and worker pod templates in the test expectations, following the same pattern as POD_NAME and POD_NAMESPACE. This aligns with the PR's objective of supporting pod UID tracking for Kubernetes-backed discovery.
Also applies to: 982-986
deploy/cloud/operator/internal/dynamo/component_planner_test.go (1)
67-71: LGTM! POD_UID added to planner test expectations.

The POD_UID environment variable has been correctly added to the planner component test, consistent with the implementation in component_common.go and other component tests.
deploy/cloud/operator/internal/dynamo/component_common.go (1)
107-114: LGTM! POD_UID environment variable correctly implemented.

The POD_UID environment variable has been properly added to the common container configuration, making it available to all Dynamo components (frontend, worker, planner, etc.). The implementation follows the established pattern for POD_NAME and POD_NAMESPACE using the Kubernetes downward API.
deploy/cloud/operator/internal/dynamo/graph_test.go (1)
1414-1421: LGTM! POD_UID additions are consistent across all test cases.

The POD_UID environment variable is correctly added to all Grove Pod clique specs, following the same pattern as the existing POD_NAME and POD_NAMESPACE variables using `FieldRef` to extract `metadata.uid`. This aligns with the production code changes that expose the pod UID for CR owner references.

Also applies to: 1621-1628, 2010-2017, 2199-2206, 2362-2369, 2532-2539, 2951-2958, 3127-3134, 3290-3297, 3467-3474, 5105-5110
lib/runtime/src/discovery/kube/utils.rs (2)
30-31: Improved endpoint extraction with clearer control flow.

The refactored `extract_endpoint_info` function now has better readability:
- Early extraction of `slice_name` for consistent logging
- Direct iteration over `&slice.endpoints`
- Clear handling of missing addresses via `match` with an early `continue`

Also applies to: 45-45, 67-72
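For illustration, a minimal sketch of that control flow, assuming `k8s_openapi` EndpointSlice types; the function name, return type, and `eprintln!` logging are placeholders rather than the actual `extract_endpoint_info` in utils.rs:

```rust
use k8s_openapi::api::discovery::v1::EndpointSlice;

/// Collect (pod_name, address) pairs from an EndpointSlice, skipping
/// endpoints that carry no addresses.
fn endpoint_addresses(slice: &EndpointSlice) -> Vec<(String, String)> {
    // Pull the slice name once so every log line can reference it.
    let slice_name = slice
        .metadata
        .name
        .clone()
        .unwrap_or_else(|| "<unnamed>".to_string());

    let mut out = Vec::new();
    for endpoint in &slice.endpoints {
        // Endpoints without an address cannot be dialed; skip them early.
        let address = match endpoint.addresses.first() {
            Some(addr) => addr.clone(),
            None => {
                eprintln!("endpoint in slice {slice_name} has no addresses, skipping");
                continue;
            }
        };
        // target_ref carries the backing pod's name when present.
        if let Some(pod_name) = endpoint.target_ref.as_ref().and_then(|r| r.name.clone()) {
            out.push((pod_name, address));
        }
    }
    out
}
```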
78-116: POD_UID is now required for CR owner references.

The `PodInfo` struct correctly includes `pod_uid` as a required field. This is necessary for setting the `OwnerReference` on the `DynamoWorkerMetadata` CR, enabling automatic garbage collection when pods are deleted.

The asymmetry between `POD_UID` (required, errors on missing) and `POD_NAMESPACE` (optional, defaults to "default") is intentional and appropriate: without the UID, owner references cannot be established, but the namespace can have a sensible default.
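As a rough sketch of how that asymmetry plays out on the worker side (the struct fields and error handling here are assumptions, not the exact `PodInfo` definition in utils.rs):

```rust
use std::env;

#[derive(Debug, Clone)]
struct PodIdentity {
    pod_name: String,
    pod_namespace: String,
    pod_uid: String,
}

fn pod_identity_from_env() -> Result<PodIdentity, String> {
    // POD_NAME and POD_UID come from the downward API and have no sensible
    // fallback, so missing values are hard errors.
    let pod_name = env::var("POD_NAME").map_err(|_| "POD_NAME is not set".to_string())?;
    let pod_uid = env::var("POD_UID").map_err(|_| "POD_UID is not set".to_string())?;
    // POD_NAMESPACE can safely fall back to "default" when unset.
    let pod_namespace = env::var("POD_NAMESPACE").unwrap_or_else(|_| "default".to_string());
    Ok(PodIdentity { pod_name, pod_namespace, pod_uid })
}
```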
deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml (1)

1-64: Well-structured CRD definition for DynamoWorkerMetadata.

The new CRD is correctly defined with:
- Appropriate API group (`nvidia.com`) and version (`v1alpha1`)
- Namespaced scope (matching the Rust code's `namespaced` attribute)
- `x-kubernetes-preserve-unknown-fields: true` on the `data` field to allow flexible JSON storage
- Required `data` field in spec (aligns with `DynamoWorkerMetadataSpec` in crd.rs)
- Useful label for discovery backend identification
lib/runtime/src/discovery/kube.rs (2)
104-145: CR persistence integrated with metadata registration.

The implementation correctly:
- Holds the write lock across both the in-memory update and CR apply to prevent race conditions
- Builds the CR with pod identity for owner reference
- Uses server-side apply for idempotent operations
The explicit comment about holding the lock is appreciated - it documents the intentional design choice. While this does mean the lock is held during the network call, it ensures consistency between in-memory state and the persisted CR.
Consider whether the lock duration during network calls could become a bottleneck under high registration/unregistration load. If this becomes an issue, an alternative pattern would be to update in-memory state, release the lock, then apply the CR (accepting eventual consistency with conflict resolution).
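A simplified sketch of that locking pattern, with `apply_metadata_cr` as a stand-in for the build_cr/apply_cr pair (the names and signatures here are assumptions for illustration):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

type Metadata = HashMap<String, serde_json::Value>;

// Stand-in for building and server-side-applying the DynamoWorkerMetadata CR.
async fn apply_metadata_cr(_pod_name: &str, _data: &Metadata) -> anyhow::Result<()> {
    Ok(())
}

async fn register(
    store: Arc<RwLock<Metadata>>,
    pod_name: &str,
    key: String,
    value: serde_json::Value,
) -> anyhow::Result<()> {
    // Hold the write lock across both steps so the CR is always built from the
    // freshly updated in-memory state, at the cost of keeping the lock while
    // the network call is in flight.
    let mut guard = store.write().await;
    guard.insert(key, value);
    apply_metadata_cr(pod_name, &guard).await
}
```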
150-196: Unregister flow mirrors register with CR persistence.

The unregister implementation correctly follows the same pattern as register:
- Acquires write lock
- Updates in-memory metadata
- Builds and applies CR with updated state
This ensures the CR always reflects the current in-memory state, and other pods watching the CR will see the removal.
lib/runtime/src/discovery/kube/crd.rs (4)
26-38: Well-defined CRD spec with kube derive macro.

The `DynamoWorkerMetadataSpec` correctly uses:
- `schema = "disabled"`, since the OpenAPI schema is defined externally in the YAML CRD manifest
- `serde_json::Value` for flexible JSON storage, matching the CRD's `x-kubernetes-preserve-unknown-fields: true`
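A sketch of what such a spec can look like with the kube derive macro; the group, version, kind, and `data` field follow the review notes above, while the exact derive list is an assumption:

```rust
use kube::CustomResource;
use serde::{Deserialize, Serialize};

/// Free-form worker metadata; the OpenAPI schema lives in the YAML CRD
/// manifest, so schema generation is disabled here.
#[derive(CustomResource, Serialize, Deserialize, Clone, Debug)]
#[kube(
    group = "nvidia.com",
    version = "v1alpha1",
    kind = "DynamoWorkerMetadata",
    namespaced,
    schema = "disabled"
)]
pub struct DynamoWorkerMetadataSpec {
    /// Arbitrary JSON payload, matching the CRD's
    /// `x-kubernetes-preserve-unknown-fields: true` data field.
    pub data: serde_json::Value,
}
```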
56-76: Owner reference correctly configured for automatic garbage collection.

The `build_cr` function properly sets up the owner reference:
- `controller: Some(true)` marks the Pod as the controlling owner
- `block_owner_deletion: Some(false)` ensures pod deletion isn't blocked by CR finalizers

This means that when a pod is deleted, Kubernetes will automatically garbage collect the associated `DynamoWorkerMetadata` CR, preventing orphaned resources.
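For reference, a sketch of that owner-reference wiring using `k8s_openapi` types (the helper function itself is hypothetical):

```rust
use k8s_openapi::apimachinery::pkg::apis::meta::v1::OwnerReference;

/// Build an OwnerReference that makes the Pod the controlling owner of the
/// metadata CR without blocking pod deletion.
fn pod_owner_reference(pod_name: &str, pod_uid: &str) -> OwnerReference {
    OwnerReference {
        api_version: "v1".to_string(),
        kind: "Pod".to_string(),
        name: pod_name.to_string(),
        uid: pod_uid.to_string(),
        // The Pod controls the CR, so deleting the Pod garbage-collects the CR.
        controller: Some(true),
        // Deleting the Pod must not wait on the CR being removed first.
        block_owner_deletion: Some(false),
    }
}
```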
89-115: Server-side apply with force for reliable CR updates.

The `apply_cr` function uses:
- `PatchParams::apply(FIELD_MANAGER).force()`; the `force()` is appropriate here since each pod exclusively manages its own CR
- Server-side apply semantics that ensure idempotent create-or-update behavior

The error handling provides good context on failure, and the debug log aids troubleshooting.
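A sketch of that apply call with the kube client, reusing the `DynamoWorkerMetadata` type from the derive sketch above; the field-manager string is a placeholder for `FIELD_MANAGER`:

```rust
use kube::api::{Api, Patch, PatchParams};
use kube::Client;

async fn apply_cr_sketch(
    client: Client,
    namespace: &str,
    cr: &DynamoWorkerMetadata,
) -> anyhow::Result<()> {
    let api: Api<DynamoWorkerMetadata> = Api::namespaced(client, namespace);
    let name = cr
        .metadata
        .name
        .clone()
        .ok_or_else(|| anyhow::anyhow!("CR must have a name"))?;
    // force() is safe because each pod is the sole field manager of its own CR;
    // server-side apply makes this an idempotent create-or-update.
    let params = PatchParams::apply("dynamo-discovery").force();
    api.patch(&name, &params, &Patch::Apply(cr)).await?;
    Ok(())
}
```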
117-184: Comprehensive unit tests covering CRD functionality.

The test suite validates:
- CRD metadata (group, version, kind, plural) generated by the derive macro
- Spec creation and data storage
- CR instantiation with proper name assignment
- JSON serialization roundtrip to ensure data integrity
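Tests along those lines could look roughly like this, assuming the types from the derive sketch above are in scope; the assertions are illustrative rather than the actual cases in crd.rs:

```rust
#[cfg(test)]
mod sketch_tests {
    use super::*;
    use kube::CustomResourceExt;

    #[test]
    fn derive_macro_sets_crd_identity() {
        let crd = DynamoWorkerMetadata::crd();
        assert_eq!(crd.spec.group, "nvidia.com");
        assert_eq!(crd.spec.names.kind, "DynamoWorkerMetadata");
        assert_eq!(crd.spec.scope, "Namespaced");
    }

    #[test]
    fn spec_data_roundtrips_through_json() {
        let spec = DynamoWorkerMetadataSpec {
            // Hypothetical payload purely for the roundtrip check.
            data: serde_json::json!({"endpoint": "grpc://10.0.0.1:9000"}),
        };
        let encoded = serde_json::to_string(&spec).expect("serialize");
        let decoded: DynamoWorkerMetadataSpec =
            serde_json::from_str(&encoded).expect("deserialize");
        assert_eq!(decoded.data["endpoint"], "grpc://10.0.0.1:9000");
    }
}
```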
lib/runtime/src/discovery/kube/daemon.rs (2)
98-98: Monitor spawned reflector tasks for panics.

The JoinHandles from `tokio::spawn` are discarded. If either reflector task panics, the main loop continues running but stops receiving updates, causing the daemon to serve stale metadata indefinitely.

Consider storing the handles and monitoring them:

```rust
let ep_handle = tokio::spawn(ep_reflector_stream);
let cr_handle = tokio::spawn(cr_reflector_stream);
```

Then add to the select loop:

```diff
 loop {
     tokio::select! {
         _ = notify.notified() => {
             // existing logic...
         }
+        result = &mut ep_handle => {
+            tracing::error!("EndpointSlice reflector task terminated: {:?}", result);
+            anyhow::bail!("EndpointSlice reflector died");
+        }
+        result = &mut cr_handle => {
+            tracing::error!("DynamoWorkerMetadata reflector task terminated: {:?}", result);
+            anyhow::bail!("DynamoWorkerMetadata reflector died");
+        }
         _ = self.cancel_token.cancelled() => {
             tracing::info!("Discovery daemon received cancellation");
             break;
         }
     }
 }
```

Also applies to: 135-135
⛔ Skipped due to learnings
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4070
File: lib/discovery/src/systems/etcd/peer.rs:151-188
Timestamp: 2025-11-05T08:41:06.483Z
Learning: In lib/discovery/src/systems/etcd/peer.rs, the register_instance method intentionally captures the lease_id before entering the OperationExecutor closure. If the lease is revoked or fails, the operation should hard-fail rather than retry with a new lease, because the system does not track which entries were registered under which lease. Retrying with a fresh lease would create inconsistent state.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.
273-289: The CR naming convention is consistently enforced through the `build_cr()` function.

The correlation logic in line 275 (`cr_map.get(&pod_name)`) correctly assumes CR names equal pod names because the `build_cr()` function in lib/runtime/src/discovery/kube/crd.rs explicitly uses the pod name as the CR name (line 63: `DynamoWorkerMetadata::new(pod_name, spec)`). This naming relationship is documented in the function signature and verified by unit tests, ensuring consistency across CR creation operations.
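A toy version of that correlation step, with simplified stand-in types (the real daemon works over reflector stores rather than plain maps):

```rust
use std::collections::HashMap;

struct Endpoint {
    pod_name: String,
    address: String,
}

/// Join discovered endpoints to their metadata CRs by pod name; this only
/// works because build_cr names each CR after its pod.
fn correlate(
    endpoints: &[Endpoint],
    cr_data_by_name: &HashMap<String, serde_json::Value>,
) -> Vec<(String, serde_json::Value)> {
    endpoints
        .iter()
        .filter_map(|ep| {
            cr_data_by_name
                .get(&ep.pod_name)
                .map(|data| (ep.address.clone(), data.clone()))
        })
        .collect()
}
```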
biswapanda
left a comment
overall looks good to me.
Let's address the sequence issue w/ deregister flow.
Worker metadata in Kubernetes is now exposed in a `DynamoWorkerMetadata` CR (introduced in this PR). This PR therefore changes workers to:
closes https://linear.app/nvidia/issue/DYN-1463/change-discovery-to-k8s-by-default-in-all-deploy-examples-and-tests
closes https://linear.app/nvidia/issue/DEP-612/multi-node-using-k8s-backed-discovery