feat: Support PD Disaggregation: DP=2(1P1D),TP=n by ZiyiTsang · Pull Request #1364 · areal-project/AReaL

ZiyiTsang · 2026-05-25T11:49:26Z

Motivation

Large language models suffer from low GPU utilisation during autoregressive
decoding: the decode phase is memory-bound and leaves compute idle. PD
(Prefill-Decode) disaggregation splits a single inference role into two
specialised roles, prefill and decode, to speed up rollout.

Architecture

Gateway receives the client request. In PD mode it asks the Router for
a matched (prefill, decode) pair and dispatches the request to both
concurrently, injecting bootstrap_host/port/room into the JSON body.
Router maintains a registry of workers typed as prefill, decode, or
regular. The /route_pd endpoint picks one worker from each pool.
Data Proxies translate between the OpenAI chat-completions API and the
backend's native generate API. Each proxy is typed and advertises its role
to the Router.
SGLang servers run with --disaggregation-mode prefill|decode. After
prefill finishes, the mooncake transfer engine pushes the KV cache to the
decode server, which then continues token generation.

What's added

New backend syntax (`areal/api/alloc_mode.py`)

sglang(P:d1t1p1|D:d1t1p1)
 │       │              │
 │       └── groups     └── each group: name:d{dp}t{tp}p{pp}
 └── backend name

Lark grammar rules pd_parallel and pd_group parse the syntax.
ModelAllocation.from_str() merges groups into a synthetic allocation
(dp_size = sum), stashing individual groups in alloc._pd_groups.
ModelAllocation.from_str_multi() returns the raw per-group list.
Fully backward-compatible: non-PD specs work unchanged.
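
To make the syntax concrete, here is a minimal sketch of how a PD backend string could be split into groups and merged into a single dp_size. It is not the PR's parser: the PR uses Lark grammar rules, while this sketch swaps in plain regular expressions, and the names `PDGroup`, `parse_pd_backend`, and `merged_dp_size` are hypothetical. It only handles the PD form, not regular (non-PD) specs.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PDGroup:
    """One group of a PD spec (hypothetical container, not the PR's class)."""
    name: str  # "P" for prefill, "D" for decode
    dp: int
    tp: int
    pp: int


# backend(group|group|...), e.g. sglang(P:d1t1p1|D:d1t1p1)
_SPEC_RE = re.compile(r"^(?P<backend>\w+)\((?P<groups>[^()]*)\)$")
# name:d{dp}t{tp}p{pp}
_GROUP_RE = re.compile(r"^(?P<name>[A-Za-z]+):d(?P<dp>\d+)t(?P<tp>\d+)p(?P<pp>\d+)$")


def parse_pd_backend(spec: str) -> tuple[str, list[PDGroup]]:
    """Return (backend name, groups) for a PD-style backend string."""
    m = _SPEC_RE.match(spec.strip())
    if m is None:
        raise ValueError(f"not a PD backend spec: {spec!r}")
    groups = []
    for part in m.group("groups").split("|"):
        g = _GROUP_RE.match(part.strip())
        if g is None:
            raise ValueError(f"malformed PD group: {part!r}")
        groups.append(PDGroup(g["name"], int(g["dp"]), int(g["tp"]), int(g["pp"])))
    return m.group("backend"), groups


def merged_dp_size(groups: list[PDGroup]) -> int:
    """Synthetic dp_size of the merged allocation: the sum over groups."""
    return sum(g.dp for g in groups)
```

With `sglang(P:d1t2p1|D:d1t1p1)` this yields two groups with tp 2 and 1, and a merged dp_size of 2.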

Gateway PD dispatch (`areal/experimental/inference_service/gateway/`)

streaming.py: new PDPair dataclass, query_router_pd() to fetch a
matched pair from the Router, and pd_dual_dispatch() that concurrently
forwards the request to both prefill and decode Data Proxies with injected
bootstrap fields.
app.py: chat_completions checks config.pd_disaggregation; when true
it routes through pd_dual_dispatch instead of the normal single-worker
path. Streaming is not supported in PD mode (returns 400).
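
The dual dispatch described above can be sketched with asyncio. This is an illustration under stated assumptions, not the code in streaming.py: the field names on `PDPairSketch`, the `send` callback, and the choice to return the decode-side response are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class PDPairSketch:
    """A matched (prefill, decode) pair; field names are assumptions."""
    prefill_addr: str
    decode_addr: str
    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int


# Hypothetical transport: posts a JSON body to an address, returns the reply.
SendFn = Callable[[str, dict[str, Any]], Awaitable[dict[str, Any]]]


async def dual_dispatch(
    pair: PDPairSketch, body: dict[str, Any], send: SendFn
) -> dict[str, Any]:
    """Send the same request to both proxies concurrently."""
    # Inject the bootstrap triplet so both sides can rendezvous for KV transfer.
    payload = {
        **body,
        "bootstrap_host": pair.bootstrap_host,
        "bootstrap_port": pair.bootstrap_port,
        "bootstrap_room": pair.bootstrap_room,
    }
    _prefill_resp, decode_resp = await asyncio.gather(
        send(pair.prefill_addr, payload),
        send(pair.decode_addr, payload),
    )
    # Assumption: the decode side carries the generated tokens.
    return decode_resp
```

Passing the transport in as `send` keeps the sketch independent of any HTTP client; a real gateway would plug in its own client call there.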

Router PD support (`areal/experimental/inference_service/router/`)

state.py: WorkerInfo gains worker_type ("regular" | "prefill" | "decode")
and bootstrap_port fields; WorkerRegistry adds get_prefill_workers()
and get_decode_workers() methods.
app.py: new POST /route_pd endpoint that picks one prefill and one
decode worker, returning their addresses plus a shared bootstrap triplet.
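
As an illustration of the routing step, the sketch below picks one worker from each pool and builds a bootstrap triplet. It is a simplified stand-in, not the Router's code: `WorkerInfoSketch` and `pick_pd_pair` are hypothetical, random selection is an assumed policy, and taking bootstrap_host and bootstrap_port from the prefill worker is also an assumption. The 63-bit cap on bootstrap_room follows the commit notes in this PR.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerInfoSketch:
    """Minimal worker record; a stand-in for the Router's WorkerInfo."""
    addr: str  # "host:port"
    worker_type: str = "regular"  # "regular" | "prefill" | "decode"
    bootstrap_port: Optional[int] = None


def pick_pd_pair(workers, rng: Optional[random.Random] = None) -> dict:
    """Pick one prefill and one decode worker plus a shared bootstrap triplet."""
    rng = rng or random.Random()
    prefill = [w for w in workers if w.worker_type == "prefill"]
    decode = [w for w in workers if w.worker_type == "decode"]
    if not prefill or not decode:
        raise LookupError("need at least one prefill and one decode worker")
    p, d = rng.choice(prefill), rng.choice(decode)
    return {
        "prefill_addr": p.addr,
        "decode_addr": d.addr,
        # Assumption: the prefill worker hosts the bootstrap endpoint.
        "bootstrap_host": p.addr.rsplit(":", 1)[0],
        "bootstrap_port": p.bootstrap_port,
        # 63 random bits always fit in a signed 64-bit integer.
        "bootstrap_room": rng.getrandbits(63),
    }
```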

Controller PD orchestration (`areal/experimental/inference_service/controller/`)

controller.py: when config.pd_disaggregation is true, the controller
forks two inference server groups (prefill + decode) with appropriate
--disaggregation-mode flags and bootstrap ports. Data proxies are typed
(prefill / decode) and registered accordingly.
Gateway command includes --pd-disaggregation flag when PD is active.
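
For orientation, a launch command for each server group might be assembled as below. This is a rough sketch, not the controller's implementation: `sglang_launch_cmd` is a hypothetical helper, and apart from --disaggregation-mode (named in this PR) the flag spellings, in particular --disaggregation-bootstrap-port, reflect my understanding of SGLang's CLI and should be checked against the installed version.

```python
import sys
from typing import Optional


def sglang_launch_cmd(
    model: str,
    mode: str,
    port: int,
    tp_size: int = 1,
    bootstrap_port: Optional[int] = None,
) -> list[str]:
    """Build an argv list for one SGLang server in PD mode (illustrative)."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unsupported disaggregation mode: {mode!r}")
    cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model,
        "--port", str(port),
        "--tp-size", str(tp_size),
        "--disaggregation-mode", mode,
    ]
    if bootstrap_port is not None:
        # Assumed flag name; verify against your SGLang version.
        cmd += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return cmd
```

Returning an argv list rather than a shell string avoids quoting problems when the command is handed to a process launcher.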

Data proxy typing (`areal/experimental/inference_service/data_proxy/`)

app.py: accepts --worker-type prefill|decode|regular and --bootstrap-port
arguments. The worker type is advertised to the Router during registration.
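
A minimal argparse sketch of these two arguments follows. The defaults ("regular" and no bootstrap port) and the `registration_payload` helper with its field names are assumptions for illustration; only the flag names and the three worker types come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the worker-type and bootstrap-port arguments."""
    parser = argparse.ArgumentParser(description="data proxy (sketch)")
    parser.add_argument(
        "--worker-type",
        choices=["prefill", "decode", "regular"],
        default="regular",  # assumed default
    )
    parser.add_argument("--bootstrap-port", type=int, default=None)
    return parser


def registration_payload(args: argparse.Namespace, addr: str) -> dict:
    """Hypothetical body a proxy could send when registering with the Router."""
    return {
        "addr": addr,
        "worker_type": args.worker_type,
        "bootstrap_port": args.bootstrap_port,
    }
```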

Timeout defaults

LocalScheduler.startup_timeout: 30 s → 300 s
RolloutControllerV2._WORKERS_READY_TIMEOUT: 30 s → 300 s

CLI & config

areal/api/cli_args.py: pd_disaggregation: bool field on
InferenceEngineConfig; disaggregation_mode changed from Literal to
str (OmegaConf compat).
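
One of this PR's commit messages describes deriving pd_disaggregation from the backend string via a regex in `__post_init__`. A minimal sketch of that idea is shown here; the class name `EngineConfigSketch` and the exact pattern are assumptions, not the code in cli_args.py.

```python
import re
from dataclasses import dataclass

# Matches backend(P:...|D:...), e.g. sglang(P:d1t1p1|D:d1t1p1).
_PD_BACKEND_RE = re.compile(r"^\w+\(\s*P:[^|()]+\|\s*D:[^|()]+\)$")


@dataclass
class EngineConfigSketch:
    """Stand-in config: the flag is derived from the backend string."""
    backend: str
    pd_disaggregation: bool = False

    def __post_init__(self) -> None:
        # Turn PD on automatically when the backend string uses the PD form,
        # so no separate override is needed.
        if _PD_BACKEND_RE.match(self.backend.strip()):
            self.pd_disaggregation = True
```

Deriving the flag keeps the backend string as the single source of truth, so the two settings cannot disagree.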

Documentation

docs/en/tutorial/pd_disaggregation.md and docs/zh/tutorial/pd_disaggregation.md
— architecture overview, Hydra quoting rules, quickstart with
sglang(P:d1t1p1|D:d1t1p1).
examples/experimental/inference_service/README.md — PD example command

Test plan

Unit tests

tests/test_pd_alloc_mode.py — 10 cases covering parser, allocation
merging, backward compat, and edge cases:

uv run pytest tests/test_pd_alloc_mode.py -v

Existing integration tests updated for PD:
test_controller.py, test_controller_integration.py,
test_gateway_integration.py, test_data_proxy_integration.py.

E2E test case (requires transfer engine installed)

# PD disaggregation (2 workers: prefill + decode)
python3 examples/experimental/inference_service/online_rollout.py \
    --config examples/experimental/inference_service/online_rollout.yaml \
    --model Qwen/Qwen3-0.6B \
    'scheduler.type=local' \
    'rollout.agent.mode=online' \
    'rollout.backend="sglang(P:d1t1p1|D:d1t1p1)"' \
    'train_dataset.batch_size=4'

Manual verification

# Verify PD routing via gateway
curl -s http://<gateway>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-test123456" \
    -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":50}'

Expected: valid JSON response with choices[0].message.content containing
the model's answer (not an error).
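
The same check can be scripted. The helper below is a sketch that validates a response body already fetched from the gateway; `extract_answer` is a hypothetical name, and treating a top-level "error" key as failure is an assumption about the error shape.

```python
import json


def extract_answer(raw: str) -> str:
    """Return choices[0].message.content, or raise if the body is not valid."""
    data = json.loads(raw)
    if "error" in data:  # assumed error shape
        raise ValueError(f"gateway returned an error: {data['error']}")
    content = data["choices"][0]["message"]["content"]
    if not isinstance(content, str) or not content.strip():
        raise ValueError("empty or non-string message content")
    return content
```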

Known limitations

Streaming is not supported in PD mode (non-streaming only).
PD mode only supports the sglang backend; vllm is not supported.

Scope

This PR: DP=2 (1P1D), TP=n, without weight_transfer (not yet supported by V2)
Next PR: DP=N, TP=N (heterogeneous PD), without weight_transfer (not yet supported by V2)
Following PR: integrate weight_transfer (after the V2 loop is done)

Test Result

I ran the E2E test case with two configs, using the mooncake transfer engine:

Backend: sglang(P:d1t1p1|D:d1t1p1)
Backend: sglang(P:d1t2p1|D:d1t1p1)

The chat_completion API returns exactly the same response for both configs.

Note

This PR implements PD (Prefill-Decode) disaggregation on top of the RolloutControllerV2 inference service architecture. The inference-side changes (gateway routing, worker type registry, bootstrap triplet propagation, and KV cache transfer coordination) are complete and have been validated via E2E inference tests.

However, RolloutControllerV2 currently lacks the InferenceEngine interface methods required for weight synchronization (init_weights_update_group, update_weights_from_distributed, etc.). Therefore, the full RL training loop — which depends on the weight-update closed loop between the training engine and the inference engine — cannot be end-to-end tested until RolloutControllerV2 completes its weight-sync path.

Once the weight-update capability lands in RolloutControllerV2 (via XCCL, AWEX, or disk mode), the PD disaggregation logic in this PR should work transparently without additional changes, since all weight-sync broadcast mechanics are orthogonal to the prefill/decode group split.

Related Issue

Details in issue #1329.

Type of Change

…ggregation for inference

- Add examples/experimental/inference_service/pd_online_rollout.py: thin online-rollout entry that asserts rollout.pd_disaggregation=true. Uses the existing online_rollout.yaml + CLI overrides instead of a separate YAML.
- Add docs/{en,zh}/tutorial/pd_disaggregation.md and link from _toc.yml.
- Update examples README with an Example 3 section pointing at the override-based run command.
- Router: surface worker_type / bootstrap_port on /workers; cap bootstrap_room to 63 bits to fit SGLang's signed-int64 limit.
- pyproject: remove mooncake-transfer-engine from the sglang extra. Users now install mooncake-transfer-engine or nixl themselves when enabling PD.

Reuse online_rollout.py directly with CLI overrides for PD; the existing config validation in InferenceEngineConfig.__post_init__ already covers the pd_online_rollout.py guard rails. Trim README and tutorial pages to the minimum: install KV transport engine, then one override-based command.

Introduce a new backend string format that encodes PD group structure directly in the rollout.backend field: sglang(P:d1t1p1|D:d1t1p1). pd_disaggregation is now auto-derived from the backend string via regex in InferenceEngineConfig.__post_init__, eliminating the need for a separate rollout.pd_disaggregation=true flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Add pd_parallel and pd_group grammar rules to ALLOCATION_GRAMMAR
- Add pd_group and pd_parallel transformer methods
- Update from_str to return synthetic allocation for PD specs (dp_size=sum)
- Add from_str_multi for callers needing individual PD groups
- Fix OmegaConf Literal type annotation error in SGLangConfig
- Update Hydra quoting in docs for PD syntax (parentheses conflict)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- LocalScheduler.startup_timeout: 30s → 300s (model loading needs time)
- RolloutControllerV2._WORKERS_READY_TIMEOUT: 30s → 300s (match scheduler)
- Remove unused Literal import from cli_args.py
- Add 10 unit tests for sglang(P:...|D:...) PD allocation parsing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ZiyiTsang · 2026-05-30T13:37:35Z

Ready for human review.
This PR does not involve weight updates, so it is relatively harmless.

ZiyiTsang added 2 commits May 25, 2026 11:07

feat(inference): add support for prefill-decode disaggregation mode

e8cb567

feat(cli): add pd_disaggregation option to enable prefill-decode disa…

ad4f254

…ggregation for inference


ZiyiTsang changed the title ~~Feat support pd~~ Support PD Disaggregation May 25, 2026

ZiyiTsang and others added 5 commits May 26, 2026 14:53

merge: origin/main into feat--support-PD

c5a293b

ZiyiTsang changed the title ~~Support PD Disaggregation~~ Support PD Disaggregation (P/D=1,TP=1) May 28, 2026

ZiyiTsang changed the title ~~Support PD Disaggregation (P/D=1,TP=1)~~ Support PD Disaggregation (P+D=2,TP=1) May 28, 2026

ZiyiTsang changed the title ~~Support PD Disaggregation (P+D=2,TP=1)~~ feat: Support PD Disaggregation: DP=2(1P1D),TP=1 May 28, 2026

Merge branch 'main' into feat--support-PD

f7081e9

ZiyiTsang changed the title ~~feat: Support PD Disaggregation: DP=2(1P1D),TP=1~~ feat: Support PD Disaggregation: DP=2(1P1D),TP=n May 30, 2026

ZiyiTsang marked this pull request as ready for review May 30, 2026 13:36

ZiyiTsang requested review from CormickKneey, HwVanICI, PrometheusComing, TaoZex, fishcrap, garrett4wade, guozhihao-224, nuzant, rchardx and sitabulaixizawaluduo as code owners May 30, 2026 13:36

ZiyiTsang wants to merge 9 commits into areal-project:main from ZiyiTsang:feat--support-PD
