Conversation


@mohammedabdulwahhab mohammedabdulwahhab commented Dec 16, 2025

Worker metadata in Kubernetes is now exposed via a DynamoWorkerMetadata CR (introduced in this PR).

This PR therefore changes workers to:

  1. persist their metadata to this CR
  2. subscribe to changes to this CR to pick up metadata updates made by other workers

A rough sketch of both steps is shown below.
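
A non-authoritative sketch of the two steps, using kube's dynamic-object API so it stands alone; the PR itself uses a typed DynamoWorkerMetadata CR (see the walkthrough further down), and the field-manager name here is a placeholder:

```rust
use futures::TryStreamExt;
use kube::api::{Api, DynamicObject, Patch, PatchParams};
use kube::core::{ApiResource, GroupVersionKind};
use kube::runtime::{watcher, WatchStreamExt};

fn worker_metadata_api(client: kube::Client, namespace: &str) -> Api<DynamicObject> {
    let gvk = GroupVersionKind::gvk("nvidia.com", "v1alpha1", "DynamoWorkerMetadata");
    Api::namespaced_with(client, namespace, &ApiResource::from_gvk(&gvk))
}

// (1) Persist: each worker server-side-applies its own metadata into its own CR.
async fn persist(api: &Api<DynamicObject>, pod_name: &str, data: serde_json::Value) -> anyhow::Result<()> {
    let body = serde_json::json!({
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoWorkerMetadata",
        "metadata": { "name": pod_name },
        "spec": { "data": data },
    });
    // "dynamo-worker" is a placeholder field manager, not the PR's actual constant.
    api.patch(pod_name, &PatchParams::apply("dynamo-worker").force(), &Patch::Apply(&body)).await?;
    Ok(())
}

// (2) Subscribe: watch every DynamoWorkerMetadata CR in the namespace and react
// as other workers publish updates.
async fn subscribe(api: Api<DynamicObject>) -> anyhow::Result<()> {
    let mut stream = std::pin::pin!(watcher(api, watcher::Config::default()).applied_objects());
    while let Some(cr) = stream.try_next().await? {
        tracing::debug!("metadata update from {:?}", cr.metadata.name);
    }
    Ok(())
}
```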

closes https://linear.app/nvidia/issue/DYN-1463/change-discovery-to-k8s-by-default-in-all-deploy-examples-and-tests
closes https://linear.app/nvidia/issue/DEP-612/multi-node-using-k8s-backed-discovery

Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
@mohammedabdulwahhab mohammedabdulwahhab changed the title from "fix: introduce dynamoworkermetadata CR" to "fix: support dynamic metadata updates in Kubernetes backed discovery" on Dec 17, 2025
Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
@mohammedabdulwahhab mohammedabdulwahhab marked this pull request as ready for review December 17, 2025 22:24
@mohammedabdulwahhab mohammedabdulwahhab requested review from a team as code owners December 17, 2025 22:24

coderabbitai bot commented Dec 17, 2025

Walkthrough

This PR extends the discovery system with a new Kubernetes CustomResourceDefinition (DynamoWorkerMetadata) for persisting pod metadata. It introduces a reflector-based daemon replacing HTTP-based polling, adds pod UID tracking throughout the operator and runtime layers, disables some CI workflow steps, and implements Rust code for CRD creation and application.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CI/CD Workflow Changes**<br>`.github/workflows/container-validation-backends.yml` | Commented out three previously active steps (Tester, Set up Go, Check for uncommitted changes) in the operator job, leaving only the Build Container step active |
| **Kubernetes CRD Manifest**<br>`deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml` | Added new CustomResourceDefinition for DynamoWorkerMetadata with group nvidia.com, kind DynamoWorkerMetadata, v1alpha1 version, namespaced scope, schema defining spec with required data field, and Age printer column |
| **Operator RBAC & Discovery**<br>`deploy/cloud/operator/internal/discovery/resource.go` | Added new apiGroupNvidia constant and RBAC policy rule granting permissions (create, get, list, watch, update, patch, delete) for dynamoworkermetadatas resource in Nvidia API group |
| **Operator Dynamo Component & Tests**<br>`deploy/cloud/operator/internal/dynamo/component_common.go`, `component_planner_test.go`, `graph_test.go` | Added POD_UID environment variable sourced from metadata.uid to common container configuration and test pod specs across planner and graph test scenarios |
| **Operator Controller Tests**<br>`deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go` | Added POD_UID environment variable sourced from metadata.uid to LeaderWorkerSet pod templates in test paths |
| **Rust Runtime Discovery Module**<br>`lib/runtime/src/discovery/kube.rs` | Added crd module re-exports (DynamoWorkerMetadata, DynamoWorkerMetadataSpec), extended KubeDiscoveryClient with kube_client and pod_info fields, added pod UID to initialization logging, and implemented CR persistence on register/unregister operations |
| **Rust CRD Integration**<br>`lib/runtime/src/discovery/kube/crd.rs` | Introduced new DynamoWorkerMetadataSpec struct with data field, build_cr function to serialize DiscoveryMetadata into CRD, apply_cr async function for server-side CRD application, and comprehensive unit tests for serialization and CR operations |
| **Rust Discovery Daemon Refactor**<br>`lib/runtime/src/discovery/kube/daemon.rs` | Replaced HTTP-based per-pod metadata fetching with dual-reflector approach (EndpointSlice for readiness, DynamoWorkerMetadata for metadata), introduced debounce-driven snapshot generation (sketched below), removed HTTP client and caching mechanisms, updated aggregate_snapshot signature to accept separate readers for both reflectors |
| **Rust Pod Info Enhancement**<br>`lib/runtime/src/discovery/kube/utils.rs` | Added new public pod_uid field to PodInfo struct, updated from_env to read POD_UID environment variable and system_port from RuntimeConfig, adjusted extract_endpoint_info control flow for explicit handling of missing endpoint addresses |
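
A rough sketch of the debounce-driven loop described in the daemon.rs row, for readers unfamiliar with the reflector pattern. DynamoWorkerMetadata is the CR type the PR derives in crd.rs, the 100 ms debounce window is an assumed value, and the aggregation step is only indicated in a comment; none of this is copied from the PR.

```rust
use std::sync::Arc;
use std::time::Duration;

use k8s_openapi::api::discovery::v1::EndpointSlice;
use kube::runtime::reflector::Store;
use tokio::sync::Notify;
use tokio_util::sync::CancellationToken;

// `DynamoWorkerMetadata` is the typed CR generated in crd.rs (see the derive sketch
// later in this review); it is not defined in this snippet.
async fn snapshot_loop(
    ep_reader: Store<EndpointSlice>,          // readiness, fed by the EndpointSlice reflector
    cr_reader: Store<DynamoWorkerMetadata>,   // metadata, fed by the DynamoWorkerMetadata reflector
    notify: Arc<Notify>,                      // each reflector stream calls notify_one() per event
    cancel: CancellationToken,
) {
    loop {
        tokio::select! {
            _ = notify.notified() => {
                // Debounce: let a burst of watch events settle before rebuilding once.
                tokio::time::sleep(Duration::from_millis(100)).await;
                let slices = ep_reader.state();
                let workers = cr_reader.state();
                tracing::debug!(
                    "rebuilding discovery snapshot from {} EndpointSlices and {} worker CRs",
                    slices.len(),
                    workers.len()
                );
                // aggregate_snapshot(&slices, &workers) would merge readiness + metadata here.
            }
            _ = cancel.cancelled() => break,
        }
    }
}
```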

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

  • lib/runtime/src/discovery/kube/daemon.rs: Significant architectural refactor from HTTP polling to reflector-based patterns; requires careful review of the new aggregation logic and debounce mechanism
  • lib/runtime/src/discovery/kube/crd.rs: New public API surface with CRD building and application; validate serialization logic and error handling
  • lib/runtime/src/discovery/kube.rs: Integration of new CRD module and client initialization changes; ensure proper error propagation
  • Cross-layer consistency: Verify POD_UID additions are correctly propagated across operator, controller, and daemon implementations
  • Kubernetes manifest correctness: Validate CRD schema, RBAC permissions, and owner references

Poem

🐰 A pod now knows its own true name,
With UID tracked in metadata's frame,
No more HTTP calls in the night—
CRDs make discovery right,
Reflectors dance where reflectors came!

Pre-merge checks

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The PR description is concise but lacks sufficient implementation details for full evaluation. | Add more details about: (1) the structure and purpose of the new DynamoWorkerMetadata CR, (2) how metadata persistence works, (3) how the subscription/watching mechanism functions, and (4) guidance on where reviewers should focus their attention. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title "fix: support dynamic metadata updates in Kubernetes backed discovery" accurately summarizes the main change: adding DynamoWorkerMetadata CR support for dynamic metadata updates in Kubernetes-backed discovery. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54dcf3a and d5f8c77.

📒 Files selected for processing (11)
  • .github/workflows/container-validation-backends.yml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2 hunks)
  • deploy/cloud/operator/internal/discovery/resource.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/component_common.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/component_planner_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (11 hunks)
  • lib/runtime/src/discovery/kube.rs (8 hunks)
  • lib/runtime/src/discovery/kube/crd.rs (1 hunks)
  • lib/runtime/src/discovery/kube/daemon.rs (3 hunks)
  • lib/runtime/src/discovery/kube/utils.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • .github/workflows/container-validation-backends.yml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • lib/runtime/src/discovery/kube/crd.rs
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml
📚 Learning: 2025-11-05T08:41:06.483Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4070
File: lib/discovery/src/systems/etcd/peer.rs:151-188
Timestamp: 2025-11-05T08:41:06.483Z
Learning: In lib/discovery/src/systems/etcd/peer.rs, the register_instance method intentionally captures the lease_id before entering the OperationExecutor closure. If the lease is revoked or fails, the operation should hard-fail rather than retry with a new lease, because the system does not track which entries were registered under which lease. Retrying with a fresh lease would create inconsistent state.

Applied to files:

  • lib/runtime/src/discovery/kube.rs
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml
🧬 Code graph analysis (1)
lib/runtime/src/discovery/kube.rs (4)
lib/runtime/src/discovery/kube/utils.rs (1)
  • hash_pod_name (10-14)
lib/runtime/src/discovery/kube/crd.rs (3)
  • apply_cr (89-115)
  • build_cr (56-76)
  • new (42-44)
lib/runtime/src/discovery/mod.rs (2)
  • instance_id (181-186)
  • instance_id (220-220)
lib/runtime/src/discovery/kube/daemon.rs (1)
  • new (35-45)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4988/merge) by mohammedabdulwahhab.
lib/runtime/src/discovery/kube/daemon.rs

[error] 1-1: Trailing whitespace detected. pre-commit hook trailing-whitespace failed and modified the file; changes were applied to lib/runtime/src/discovery/kube/daemon.rs. Please review and commit the updated file. Run 'pre-commit run --all-files' locally to reproduce.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (15)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)

848-852: LGTM! POD_UID environment variable added consistently.

The POD_UID environment variable has been correctly added to both leader and worker pod templates in the test expectations, following the same pattern as POD_NAME and POD_NAMESPACE. This aligns with the PR's objective of supporting pod UID tracking for Kubernetes-backed discovery.

Also applies to: 982-986

deploy/cloud/operator/internal/dynamo/component_planner_test.go (1)

67-71: LGTM! POD_UID added to planner test expectations.

The POD_UID environment variable has been correctly added to the planner component test, consistent with the implementation in component_common.go and other component tests.

deploy/cloud/operator/internal/dynamo/component_common.go (1)

107-114: LGTM! POD_UID environment variable correctly implemented.

The POD_UID environment variable has been properly added to the common container configuration, making it available to all Dynamo components (frontend, worker, planner, etc.). The implementation follows the established pattern for POD_NAME and POD_NAMESPACE using Kubernetes downward API.

deploy/cloud/operator/internal/dynamo/graph_test.go (1)

1414-1421: LGTM! POD_UID additions are consistent across all test cases.

The POD_UID environment variable is correctly added to all Grove Pod clique specs, following the same pattern as the existing POD_NAME and POD_NAMESPACE variables using FieldRef to extract metadata.uid. This aligns with the production code changes that expose the pod UID for CR owner references.

Also applies to: 1621-1628, 2010-2017, 2199-2206, 2362-2369, 2532-2539, 2951-2958, 3127-3134, 3290-3297, 3467-3474, 5105-5110

lib/runtime/src/discovery/kube/utils.rs (2)

30-31: Improved endpoint extraction with clearer control flow.

The refactored extract_endpoint_info function now has better readability:

  • Early extraction of slice_name for consistent logging
  • Direct iteration over &slice.endpoints
  • Clear handling of missing addresses via match with early continue

Also applies to: 45-45, 67-72
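
A minimal illustration of that control flow, assuming k8s-openapi's EndpointSlice types and ignoring the richer data the real extract_endpoint_info returns:

```rust
use k8s_openapi::api::discovery::v1::EndpointSlice;

fn first_addresses(slice: &EndpointSlice) -> Vec<String> {
    // Extract the slice name once so every log line is consistent.
    let slice_name = slice.metadata.name.clone().unwrap_or_default();
    let mut out = Vec::new();
    for endpoint in &slice.endpoints {
        // Explicitly skip endpoints that have no address yet instead of unwrapping.
        let address = match endpoint.addresses.first() {
            Some(addr) => addr.clone(),
            None => {
                tracing::debug!("endpoint in slice {slice_name} has no address, skipping");
                continue;
            }
        };
        out.push(address);
    }
    out
}
```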


78-116: POD_UID is now required for CR owner references.

The PodInfo struct correctly includes pod_uid as a required field. This is necessary for setting the OwnerReference on the DynamoWorkerMetadata CR, enabling automatic garbage collection when pods are deleted.

The asymmetry between POD_UID (required, errors on missing) and POD_NAMESPACE (optional, defaults to "default") is intentional and appropriate - without the UID, owner references cannot be established, but namespace can have a sensible default.
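
As a reading of that behavior, a minimal from_env might look like the sketch below; only the three pod-identity fields are shown (the real PodInfo also carries system_port from RuntimeConfig), and exact field names are assumptions.

```rust
#[derive(Debug, Clone)]
pub struct PodInfo {
    pub pod_name: String,
    pub pod_namespace: String,
    pub pod_uid: String,
}

impl PodInfo {
    pub fn from_env() -> anyhow::Result<Self> {
        Ok(Self {
            pod_name: std::env::var("POD_NAME")?,
            // Namespace can fall back to a sensible default...
            pod_namespace: std::env::var("POD_NAMESPACE").unwrap_or_else(|_| "default".to_string()),
            // ...but POD_UID is required: without it the CR owner reference cannot be set.
            pod_uid: std::env::var("POD_UID")
                .map_err(|_| anyhow::anyhow!("POD_UID not set; required for owner references"))?,
        })
    }
}
```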

deploy/cloud/helm/crds/templates/nvidia.com_dynamoworkermetadatas.yaml (1)

1-64: Well-structured CRD definition for DynamoWorkerMetadata.

The new CRD is correctly defined with:

  • Appropriate API group (nvidia.com) and version (v1alpha1)
  • Namespaced scope (matching the Rust code's namespaced attribute)
  • x-kubernetes-preserve-unknown-fields: true on the data field to allow flexible JSON storage
  • Required data field in spec (aligns with DynamoWorkerMetadataSpec in crd.rs)
  • Useful label for discovery backend identification
lib/runtime/src/discovery/kube.rs (2)

104-145: CR persistence integrated with metadata registration.

The implementation correctly:

  1. Holds the write lock across both the in-memory update and CR apply to prevent race conditions
  2. Builds the CR with pod identity for owner reference
  3. Uses server-side apply for idempotent operations

The explicit comment about holding the lock is appreciated - it documents the intentional design choice. While this does mean the lock is held during the network call, it ensures consistency between in-memory state and the persisted CR.

Consider whether the lock duration during network calls could become a bottleneck under high registration/unregistration load. If this becomes an issue, an alternative pattern would be to update in-memory state, release the lock, then apply the CR (accepting eventual consistency with conflict resolution).
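
The ordering described here (in-memory update and CR apply under one write lock) can be sketched roughly as follows; build_cr and apply_cr are the crd.rs helpers named in this review, but their signatures, the map layout, and the PodInfo type are assumptions.

```rust
use std::collections::HashMap;
use tokio::sync::RwLock;

// Hypothetical register flow, not the PR's exact code.
async fn register_sketch(
    entries: &RwLock<HashMap<String, serde_json::Value>>, // in-memory metadata keyed by instance
    kube_client: kube::Client,
    pod_info: &PodInfo, // pod_name / pod_namespace / pod_uid (see the from_env sketch above)
    instance_key: String,
    metadata: serde_json::Value,
) -> anyhow::Result<()> {
    // Hold the write lock across both steps so the persisted CR never diverges
    // from the in-memory view; the trade-off is that the lock spans a network call.
    let mut map = entries.write().await;
    map.insert(instance_key, metadata);

    let cr = build_cr(pod_info, &map)?; // CR named after the pod, owner-referenced to it
    apply_cr(kube_client, &pod_info.pod_namespace, &cr).await?; // idempotent server-side apply
    Ok(())
}
```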


150-196: Unregister flow mirrors register with CR persistence.

The unregister implementation correctly follows the same pattern as register:

  1. Acquires write lock
  2. Updates in-memory metadata
  3. Builds and applies CR with updated state

This ensures the CR always reflects the current in-memory state, and other pods watching the CR will see the removal.

lib/runtime/src/discovery/kube/crd.rs (4)

26-38: Well-defined CRD spec with kube derive macro.

The DynamoWorkerMetadataSpec correctly uses:

  • schema = "disabled" since the OpenAPI schema is defined externally in the YAML CRD manifest
  • serde_json::Value for flexible JSON storage matching the CRD's x-kubernetes-preserve-unknown-fields: true
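
Based on that description, the spec and derive plausibly look like the sketch below; this is not copied from the PR, and the real crd.rs may carry extra attributes (print columns, labels) beyond what this review states.

```rust
use kube::CustomResource;
use serde::{Deserialize, Serialize};

#[derive(CustomResource, Clone, Debug, Serialize, Deserialize)]
#[kube(
    group = "nvidia.com",
    version = "v1alpha1",
    kind = "DynamoWorkerMetadata",
    namespaced,
    // The OpenAPI schema lives in the Helm-managed YAML manifest, so codegen is skipped here.
    schema = "disabled"
)]
pub struct DynamoWorkerMetadataSpec {
    /// Arbitrary JSON payload; mirrors x-kubernetes-preserve-unknown-fields in the CRD manifest.
    pub data: serde_json::Value,
}
```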

56-76: Owner reference correctly configured for automatic garbage collection.

The build_cr function properly sets up the owner reference:

  • controller: Some(true) marks the Pod as the controlling owner
  • block_owner_deletion: Some(false) ensures pod deletion isn't blocked by CR finalizers

This means when a pod is deleted, Kubernetes will automatically garbage collect the associated DynamoWorkerMetadata CR, preventing orphaned resources.
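
A self-contained sketch of such an owner reference, assuming k8s-openapi's OwnerReference type; the exact wiring inside build_cr may differ.

```rust
use k8s_openapi::apimachinery::pkg::apis::meta::v1::OwnerReference;

fn pod_owner_reference(pod_name: &str, pod_uid: &str) -> OwnerReference {
    OwnerReference {
        api_version: "v1".to_string(),
        kind: "Pod".to_string(),
        name: pod_name.to_string(),
        uid: pod_uid.to_string(),
        // The Pod is the controlling owner, so deleting it garbage-collects the CR...
        controller: Some(true),
        // ...and the CR never blocks pod deletion.
        block_owner_deletion: Some(false),
    }
}
```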


89-115: Server-side apply with force for reliable CR updates.

The apply_cr function uses:

  • PatchParams::apply(FIELD_MANAGER).force() - the force() is appropriate here since each pod exclusively manages its own CR
  • Server-side apply semantics ensure idempotent create-or-update behavior

The error handling provides good context on failure, and the debug log aids troubleshooting.
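
Sketching the described apply path with the kube client, reusing the CR type from the derive sketch above; the actual FIELD_MANAGER value is not shown in this review, so the name below is a placeholder.

```rust
use kube::api::{Api, Patch, PatchParams};

async fn apply_cr_sketch(
    client: kube::Client,
    namespace: &str,
    cr: &DynamoWorkerMetadata, // from the CustomResource derive sketched earlier
) -> anyhow::Result<()> {
    let api: Api<DynamoWorkerMetadata> = Api::namespaced(client, namespace);
    let name = cr.metadata.name.clone().unwrap_or_default();
    // force() is safe here because each pod is the sole field manager of its own CR.
    let params = PatchParams::apply("dynamo-worker-metadata").force();
    api.patch(&name, &params, &Patch::Apply(cr)).await?;
    tracing::debug!("applied DynamoWorkerMetadata/{name} in {namespace}");
    Ok(())
}
```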


117-184: Comprehensive unit tests covering CRD functionality.

The test suite validates:

  • CRD metadata (group, version, kind, plural) generated by the derive macro
  • Spec creation and data storage
  • CR instantiation with proper name assignment
  • JSON serialization roundtrip to ensure data integrity
lib/runtime/src/discovery/kube/daemon.rs (2)

98-98: Monitor spawned reflector tasks for panics.

The JoinHandles from tokio::spawn are discarded. If either reflector task panics, the main loop continues running but stops receiving updates, causing the daemon to serve stale metadata indefinitely.

Consider storing the handles and monitoring them:

let mut ep_handle = tokio::spawn(ep_reflector_stream);
let mut cr_handle = tokio::spawn(cr_reflector_stream);

Then add to the select loop:

 loop {
     tokio::select! {
         _ = notify.notified() => {
             // existing logic...
         }
+        result = &mut ep_handle => {
+            tracing::error!("EndpointSlice reflector task terminated: {:?}", result);
+            anyhow::bail!("EndpointSlice reflector died");
+        }
+        result = &mut cr_handle => {
+            tracing::error!("DynamoWorkerMetadata reflector task terminated: {:?}", result);
+            anyhow::bail!("DynamoWorkerMetadata reflector died");
+        }
         _ = self.cancel_token.cancelled() => {
             tracing::info!("Discovery daemon received cancellation");
             break;
         }
     }
 }

Also applies to: 135-135

⛔ Skipped due to learnings
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4070
File: lib/discovery/src/systems/etcd/peer.rs:151-188
Timestamp: 2025-11-05T08:41:06.483Z
Learning: In lib/discovery/src/systems/etcd/peer.rs, the register_instance method intentionally captures the lease_id before entering the OperationExecutor closure. If the lease is revoked or fails, the operation should hard-fail rather than retry with a new lease, because the system does not track which entries were registered under which lease. Retrying with a fresh lease would create inconsistent state.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.

273-289: The CR naming convention is consistently enforced through the build_cr() function.

The correlation logic in line 275 (cr_map.get(&pod_name)) correctly assumes CR names equal pod names because the build_cr() function in lib/runtime/src/discovery/kube/crd.rs explicitly uses the pod name as the CR name (line 63: DynamoWorkerMetadata::new(pod_name, spec)). This naming relationship is documented in the function signature and verified by unit tests, ensuring consistency across CR creation operations.

Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
Signed-off-by: mohammedabdulwahhab <[email protected]>
@mohammedabdulwahhab mohammedabdulwahhab force-pushed the mabdulwahhab/dynamic-metadata branch from 6bd1885 to 9c32955 on December 19, 2025 00:13

copy-pr-bot bot commented Dec 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: mohammedabdulwahhab <[email protected]>

@biswapanda biswapanda left a comment


overall looks good to me.

Let's address the sequence issue w/ deregister flow.

@mohammedabdulwahhab mohammedabdulwahhab merged commit 4106b90 into main Dec 19, 2025
37 of 38 checks passed
@mohammedabdulwahhab mohammedabdulwahhab deleted the mabdulwahhab/dynamic-metadata branch December 19, 2025 16:33