[Parity with CUDA vLLM, CUDA SGLang]: native ATOM llm-d Kubernetes Distributed Inferencing Orchestrator for Production LLM Inferencing [Functional Enablement + Guides + Nightly CI]

### Suggestion Description

Hi @chunfangamd @andyluo7,

The majority of production LLM serving is deployed through Kubernetes, where the system must route requests to the right replica, preserve KV-cache locality, scale prefill/decode workers up and down, handle multi-node MoE, and so on. llm-d is the Kubernetes orchestrator that is an [officially supported AMD project](https://rocm.blogs.amd.com/artificial-intelligence/llm-d-distributed/README.html) and a [production deployment target for AMD vLLM customers such as Oracle and others](https://blogs.oracle.com/ai-and-datascience/llm-inference-at-scale-with-llm-d-on-oci). What is the timeline for ATOM to support a Kubernetes distributed inference orchestrator?

Here are the features we would like to see:

- [ ] [parity with CUDA vLLM llm-d] Basic functionality with multi-node PD disaggregation: https://github.com/llm-d/llm-d/tree/main/guides/pd-disaggregation
- [ ] [parity with CUDA vLLM llm-d] Multi-node wide-EP: https://github.com/llm-d/llm-d/tree/main/guides/wide-ep-lws
- [ ] [parity with CUDA vLLM llm-d] Optimized multi-node production deployment guide docs: https://github.com/llm-d/llm-d/tree/main/guides

Nightly upstream CI parity:
- [ ] **ATOM llm-d equivalent of the PD disaggregation E2E Nightly CI, using the AMD first-party Pollara NIC**: https://github.com/llm-d/llm-d/actions/workflows/nightly-e2e-pd-disaggregation-ocp.yaml
- [ ] **ATOM llm-d equivalent of the Optimized baseline E2E Nightly CI, using the AMD first-party Pollara NIC**: https://github.com/llm-d/llm-d/actions/workflows/nightly-e2e-optimized-baseline-cks.yaml
- [ ] **ATOM llm-d equivalent of the Wide-EP E2E Nightly CI**: https://github.com/llm-d/llm-d/actions/workflows/e2e-wide-ep-accelerator-gke.yaml (MoE / DeepSeek-R1-class DP+EP across multiple nodes)
- [ ] **ATOM llm-d equivalent of the Precise prefix-cache aware E2E Nightly CI**: https://github.com/llm-d/llm-d/actions/workflows/nightly-e2e-precise-prefix-cache-ocp.yaml
- [ ] **ATOM llm-d equivalent of the Predicted-latency balancing E2E (OCP)**: https://github.com/llm-d/llm-d/actions/workflows/nightly-e2e-predicted-latency-ocp.yaml
- [ ] **ATOM llm-d equivalent of the Workload Variant Autoscaler (WVA) E2E Nightly CI**: https://github.com/llm-d/llm-d/blob/main/.github/workflows/nightly-e2e-wva-cks.yaml

### Operating System

_No response_

### GPU

_No response_

### ROCm Component

_No response_

[Parity with CUDA vLLM, CUDA SGLang]: native ATOM llm-d Kubernetes Distributed Inferencing Orchestrator for Production LLM Inferencing [Functional Enablement + Guides + Nightly CI] #1187
