Suggestion Description
Hi @chunfangamd @andyluo7,
The majority of production LLM serving is deployed through Kubernetes, where the system must route requests to the right replica, preserve KV-cache locality, scale prefill/decode workers up and down, handle multi-node MoE, etc. llm-d is the Kubernetes orchestrator that is an officially supported AMD project and a production deployment target for AMD vLLM customers, for example Oracle. What is the timeline for when ATOM will support a k8s distributed inferencing orchestrator?
Here are the features we would like to see:
nightly upstream CI parity
Operating System
No response
GPU
No response
ROCm Component
No response