feat: Support PD Disaggregation: DP=2(1P1D),TP=n #1364
Open
ZiyiTsang wants to merge 9 commits into
- Add examples/experimental/inference_service/pd_online_rollout.py: thin
online-rollout entry that asserts rollout.pd_disaggregation=true. Uses
the existing online_rollout.yaml + CLI overrides instead of a separate
YAML.
- Add docs/{en,zh}/tutorial/pd_disaggregation.md and link from _toc.yml.
- Update examples README with an Example 3 section pointing at the
override-based run command.
- Router: surface worker_type / bootstrap_port on /workers; cap
bootstrap_room to 63 bits to fit SGLang's signed-int64 limit.
- pyproject: remove mooncake-transfer-engine from the sglang extra.
Users now install mooncake-transfer-engine or nixl themselves when
enabling PD.
Reuse online_rollout.py directly with CLI overrides for PD; the existing config validation in InferenceEngineConfig.__post_init__ already covers the pd_online_rollout.py guard rails. Trim README and tutorial pages to the minimum: install KV transport engine, then one override-based command.
Introduce a new backend string format that encodes PD group structure directly in the rollout.backend field: sglang(P:d1t1p1|D:d1t1p1). pd_disaggregation is now auto-derived from the backend string via regex in InferenceEngineConfig.__post_init__, eliminating the need for a separate rollout.pd_disaggregation=true flag.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add pd_parallel and pd_group grammar rules to ALLOCATION_GRAMMAR
- Add pd_group and pd_parallel transformer methods
- Update from_str to return synthetic allocation for PD specs (dp_size=sum)
- Add from_str_multi for callers needing individual PD groups
- Fix OmegaConf Literal type annotation error in SGLangConfig
- Update Hydra quoting in docs for PD syntax (parentheses conflict)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- LocalScheduler.startup_timeout: 30s → 300s (model loading needs time)
- RolloutControllerV2._WORKERS_READY_TIMEOUT: 30s → 300s (match scheduler)
- Remove unused Literal import from cli_args.py
- Add 10 unit tests for sglang(P:...|D:...) PD allocation parsing
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
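The backend string format and the derived pd_disaggregation flag described in these commits can be illustrated with a small standalone parser. This is only a sketch under stated assumptions: the regular expressions, the GroupAlloc class, and both helper names are hypothetical and are not taken from areal/api/alloc_mode.py, and it assumes d, t, and p denote data-, tensor-, and pipeline-parallel sizes written in that fixed order.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical patterns for illustration only; the PR's real grammar is not shown here.
_PD_SPEC = re.compile(r"^(\w+)\(P:(\w+)\|D:(\w+)\)$")
_DIMS = re.compile(r"^d(\d+)t(\d+)p(\d+)$")


@dataclass(frozen=True)
class GroupAlloc:
    """One PD group (hypothetical container, not the PR's ModelAllocation)."""
    role: str  # "prefill" or "decode"
    dp: int
    tp: int
    pp: int


def parse_pd_backend(spec: str) -> Optional[List[GroupAlloc]]:
    """Return [prefill, decode] groups for a PD spec, or None for any other string."""
    match = _PD_SPEC.match(spec)
    if match is None:
        return None  # not a PD spec, so PD mode would stay disabled
    _backend, prefill, decode = match.groups()
    groups = []
    for role, dims in (("prefill", prefill), ("decode", decode)):
        parsed = _DIMS.match(dims)
        if parsed is None:
            raise ValueError(f"bad parallel spec for {role}: {dims!r}")
        dp, tp, pp = map(int, parsed.groups())
        groups.append(GroupAlloc(role, dp, tp, pp))
    return groups


def merged_dp_size(groups: List[GroupAlloc]) -> int:
    """Sum the per-group dp sizes, mirroring the dp_size=sum rule in the commit message."""
    return sum(group.dp for group in groups)
```

For sglang(P:d1t1p1|D:d1t1p1) the merged dp size is 1 + 1 = 2, which corresponds to the DP=2 (1P1D) layout named in the PR title.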
Ready for human review.
Motivation
Large language models suffer from low GPU utilisation during autoregressive
decoding: the decode phase is memory-bound and leaves compute idle. PD
(Prefill-Decode) disaggregation splits the single inference role into two
specialised roles, prefill and decode, to speed up rollout.
Architecture
- Gateway: queries the Router for a matched (prefill, decode) pair and dispatches the request to both concurrently, injecting bootstrap_host/port/room into the JSON body.
- Router: tracks each worker as prefill, decode, or regular. The /route_pd endpoint picks one worker from each pool.
- Data Proxy: forwards requests to the backend's native generate API. Each proxy is typed and advertises its role to the Router.
- Inference server: launched with --disaggregation-mode prefill|decode. After prefill finishes, the mooncake transfer engine pushes the KV cache to the decode server, which then continues token generation.
What's added
New backend syntax (areal/api/alloc_mode.py)
- pd_parallel and pd_group parse the syntax.
- ModelAllocation.from_str() merges groups into a synthetic allocation (dp_size = sum), stashing individual groups in alloc._pd_groups.
- ModelAllocation.from_str_multi() returns the raw per-group list.
Gateway PD dispatch (areal/experimental/inference_service/gateway/)
- streaming.py: new PDPair dataclass, query_router_pd() to fetch a matched pair from the Router, and pd_dual_dispatch() that concurrently forwards the request to both prefill and decode Data Proxies with injected bootstrap fields.
- app.py: chat_completions checks config.pd_disaggregation; when true it routes through pd_dual_dispatch instead of the normal single-worker path. Streaming is not supported in PD mode (returns 400).
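The dual-dispatch step can be sketched with asyncio.gather. This is a hedged illustration, not the code in streaming.py: PairSketch, its field names, dual_dispatch_sketch, and the injected send callable are all hypothetical, and the sketch assumes the decode side's response is the one returned to the caller.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict

Body = Dict[str, Any]
# Hypothetical transport: an async callable that posts a JSON body to an address.
Sender = Callable[[str, Body], Awaitable[Body]]


class StreamingNotSupported(Exception):
    """Raised where the gateway would answer HTTP 400 in PD mode."""


@dataclass(frozen=True)
class PairSketch:
    """Illustrative stand-in for a matched pair; field names are assumptions."""
    prefill_addr: str
    decode_addr: str
    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int


async def dual_dispatch_sketch(pair: PairSketch, body: Body, send: Sender) -> Body:
    if body.get("stream"):
        raise StreamingNotSupported("streaming is not supported in PD mode")
    # Copy the body so the caller's dict is not mutated, then add the bootstrap fields.
    payload = {
        **body,
        "bootstrap_host": pair.bootstrap_host,
        "bootstrap_port": pair.bootstrap_port,
        "bootstrap_room": pair.bootstrap_room,
    }
    # Send the same payload to both roles concurrently.
    _prefill_resp, decode_resp = await asyncio.gather(
        send(pair.prefill_addr, payload),
        send(pair.decode_addr, payload),
    )
    # Assumption: the decode worker's response carries the generated tokens.
    return decode_resp
```

Passing the transport in as a parameter keeps the sketch independent of any HTTP client and makes the concurrency easy to exercise with a fake sender.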
Router PD support (areal/experimental/inference_service/router/)
- state.py: WorkerInfo gains worker_type ("regular" | "prefill" | "decode") and bootstrap_port fields; WorkerRegistry adds get_prefill_workers() and get_decode_workers() methods.
- app.py: new POST /route_pd endpoint that picks one prefill and one decode worker, returning their addresses plus a shared bootstrap triplet.
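A minimal sketch of the pair selection and the 63-bit room cap mentioned in the commit messages follows. WorkerSketch, route_pd_sketch, and the response keys are hypothetical. Random selection is an assumption, since the description only says one worker is picked from each pool, and so is taking the bootstrap host and port from the prefill worker.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Union

ROOM_MASK = (1 << 63) - 1  # largest value a signed 64-bit integer can hold


@dataclass(frozen=True)
class WorkerSketch:
    """Illustrative worker record; not the PR's WorkerInfo."""
    host: str
    port: int
    worker_type: str = "regular"  # "regular" | "prefill" | "decode"
    bootstrap_port: int = 0


def new_bootstrap_room(rng: random.Random) -> int:
    """Draw a random room id and mask it so it always fits in a signed int64."""
    return rng.getrandbits(64) & ROOM_MASK


def route_pd_sketch(
    workers: List[WorkerSketch], rng: random.Random
) -> Dict[str, Union[str, int]]:
    prefill = [w for w in workers if w.worker_type == "prefill"]
    decode = [w for w in workers if w.worker_type == "decode"]
    if not prefill or not decode:
        raise LookupError("need at least one prefill and one decode worker")
    # Assumption: uniform random choice within each pool.
    p, d = rng.choice(prefill), rng.choice(decode)
    return {
        "prefill_addr": f"{p.host}:{p.port}",
        "decode_addr": f"{d.host}:{d.port}",
        # Assumption: the bootstrap endpoint lives on the prefill worker.
        "bootstrap_host": p.host,
        "bootstrap_port": p.bootstrap_port,
        "bootstrap_room": new_bootstrap_room(rng),
    }
```

Masking with (1 << 63) - 1 clears the top bit of the 64-bit draw, so the room id is never negative when read as a signed 64-bit integer.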
Controller PD orchestration (areal/experimental/inference_service/controller/)
- controller.py: when config.pd_disaggregation is true, the controller forks two inference server groups (prefill + decode) with appropriate --disaggregation-mode flags and bootstrap ports. Data proxies are typed (prefill/decode) and registered accordingly.
- ... --pd-disaggregation flag when PD is active.
Data proxy typing (areal/experimental/inference_service/data_proxy/)
- app.py: accepts --worker-type prefill|decode|regular and --bootstrap-port arguments. The worker type is advertised to the Router during registration.
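The two data proxy arguments could be declared roughly as follows. This sketch assumes argparse is used and invents the defaults, the registration_payload helper, and the payload keys. None of these come from the PR's app.py.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="data proxy (illustrative sketch)")
    parser.add_argument(
        "--worker-type",
        choices=("prefill", "decode", "regular"),
        default="regular",  # assumed default: non-PD behaviour
    )
    parser.add_argument("--bootstrap-port", type=int, default=None)
    return parser


def registration_payload(addr: str, argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build the hypothetical body a proxy would send when registering with the Router."""
    args = build_parser().parse_args(argv)
    payload: Dict[str, Any] = {"addr": addr, "worker_type": args.worker_type}
    if args.bootstrap_port is not None:
        payload["bootstrap_port"] = args.bootstrap_port
    return payload
```

Restricting --worker-type with choices rejects a misspelled role at startup rather than at registration time.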
Timeout defaults
- LocalScheduler.startup_timeout: 30 s → 300 s
- RolloutControllerV2._WORKERS_READY_TIMEOUT: 30 s → 300 s
CLI & config
- areal/api/cli_args.py: pd_disaggregation: bool field on InferenceEngineConfig; disaggregation_mode changed from Literal to str (OmegaConf compat).
Documentation
- docs/en/tutorial/pd_disaggregation.md and docs/zh/tutorial/pd_disaggregation.md: architecture overview, Hydra quoting rules, quickstart with sglang(P:d1t1p1|D:d1t1p1).
- examples/experimental/inference_service/README.md: PD example command
Test plan
Unit tests
- tests/test_pd_alloc_mode.py: 10 cases covering parser, allocation merging, backward compat, and edge cases.
Existing integration tests updated for PD: test_controller.py, test_controller_integration.py, test_gateway_integration.py, test_data_proxy_integration.py.
E2E test case (requires transfer engine installed)
Manual verification
Expected: valid JSON response with choices[0].message.content containing the model's answer (not an error).
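The expectation above can be checked mechanically once the response has been decoded from JSON. The helper below is a sketch: extract_answer is a hypothetical name, and the top-level "error" key is an assumption about how a failure would be reported.

```python
from typing import Any, Dict


def extract_answer(response: Dict[str, Any]) -> str:
    """Return choices[0].message.content, or raise ValueError if the reply is unusable."""
    if "error" in response:  # assumed error shape; adjust to the actual gateway
        raise ValueError(f"gateway returned an error: {response['error']!r}")
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    content = choices[0].get("message", {}).get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("choices[0].message.content is missing or empty")
    return content
```

A manual check then reduces to decoding the HTTP response body and calling extract_answer on the result.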
Known limitations
- Only the sglang backend is supported; vllm is not supported.
Scope
This PR: DP=2 (1P1D), TP=n, without weight_transfer (V2 does not support it yet)
Next PR: DP=N, TP=N (heterogeneous PD), without weight_transfer (V2 does not support it yet)
PR after that: integrate weight_transfer (after the V2 loop is done)
Test Result
I tested two configurations with the E2E test case using the mooncake transfer engine.
The chat_completion API returned exactly the same response in both.
Note
This PR implements PD (Prefill-Decode) disaggregation on top of the RolloutControllerV2 inference service architecture. The inference-side changes (gateway routing, worker type registry, bootstrap triplet propagation, and KV cache transfer coordination) are complete and have been validated via E2E inference tests.
However, RolloutControllerV2 currently lacks the InferenceEngine interface methods required for weight synchronization (init_weights_update_group, update_weights_from_distributed, etc.). Therefore, the full RL training loop, which depends on the weight-update closed loop between the training engine and the inference engine, cannot be end-to-end tested until RolloutControllerV2 completes its weight-sync path.
Once the weight-update capability lands in RolloutControllerV2 (via XCCL, AWEX, or disk mode), the PD disaggregation logic in this PR should work transparently without additional changes, since all weight-sync broadcast mechanics are orthogonal to the prefill/decode group split.
Related Issue
Details in issue #1329
Type of Change