
feat: Support PD Disaggregation: DP=2(1P1D),TP=n #1364

Open
ZiyiTsang wants to merge 9 commits into areal-project:main from ZiyiTsang:feat--support-PD

Collaborator

ZiyiTsang commented on May 25, 2026

Motivation

Large language models suffer from low GPU utilisation during autoregressive
decoding: the decode phase is memory-bound and leaves compute idle, whereas
prefill is compute-bound. PD (Prefill-Decode) disaggregation splits the single
inference role into two specialised roles, one for prefill and one for decode,
to speed up rollout.

Architecture

  1. Gateway receives the client request. In PD mode it asks the Router for
    a matched (prefill, decode) pair and dispatches the request to both
    concurrently, injecting bootstrap_host/port/room into the JSON body.
  2. Router maintains a registry of workers typed as prefill, decode, or
    regular. The /route_pd endpoint picks one worker from each pool.
  3. Data Proxies translate between the OpenAI chat-completions API and the
    backend's native generate API. Each proxy is typed and advertises its role
    to the Router.
  4. SGLang servers run with --disaggregation-mode prefill|decode. After
    prefill finishes, the mooncake transfer engine pushes the KV cache to the
    decode server, which then continues token generation.
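
The request flow in steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PDPairSketch`, `fake_proxy_call`, and `dual_dispatch_sketch` are hypothetical names, the HTTP calls to the Data Proxies are replaced by an in-process stub, and returning the decode side's response to the client is an assumption.

```python
import asyncio
import copy
from dataclasses import dataclass


@dataclass
class PDPairSketch:
    """Hypothetical stand-in for the (prefill, decode) pair from the Router."""
    prefill_addr: str
    decode_addr: str
    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int


async def fake_proxy_call(addr: str, body: dict) -> dict:
    """Stub for an HTTP POST to a Data Proxy; echoes what it received."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"worker": addr, "received": body}


async def dual_dispatch_sketch(pair: PDPairSketch, body: dict) -> dict:
    """Send the same request to both workers concurrently."""
    payload = copy.deepcopy(body)  # do not mutate the caller's request
    payload.update(
        bootstrap_host=pair.bootstrap_host,
        bootstrap_port=pair.bootstrap_port,
        bootstrap_room=pair.bootstrap_room,
    )
    _prefill_resp, decode_resp = await asyncio.gather(
        fake_proxy_call(pair.prefill_addr, payload),
        fake_proxy_call(pair.decode_addr, payload),
    )
    # Assumption: the decode worker generates the final tokens, so its
    # response is the one handed back to the client.
    return decode_resp


pair = PDPairSketch("prefill:8001", "decode:8002", "10.0.0.1", 8998, 42)
request = {"model": "Qwen/Qwen3-0.6B", "messages": []}
result = asyncio.run(dual_dispatch_sketch(pair, request))
```

Both workers receive the same bootstrap triplet, which is what lets the decode server match the incoming KV cache to the right request.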

What's added

New backend syntax (areal/api/alloc_mode.py)

sglang(P:d1t1p1|D:d1t1p1)
 │       │              │
 │       └── groups     └── each group: name:d{dp}t{tp}p{pp}
 └── backend name
  • Lark grammar rules pd_parallel and pd_group parse the syntax.
  • ModelAllocation.from_str() merges groups into a synthetic allocation
    (dp_size = sum), stashing individual groups in alloc._pd_groups.
  • ModelAllocation.from_str_multi() returns the raw per-group list.
  • Fully backward-compatible: non-PD specs work unchanged.
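
To make the syntax concrete, here is a small standalone parser for the PD form only. It uses `re` rather than the Lark grammar this PR adds, and `GroupSketch`, `parse_pd_spec`, and `merged_dp_size` are hypothetical names chosen for illustration; non-PD specs are simply rejected here, unlike in the real parser.

```python
import re
from dataclasses import dataclass

# name:d{dp}t{tp}p{pp}, e.g. "P:d1t2p1"
_GROUP_RE = re.compile(
    r"^(?P<name>[A-Za-z]+):d(?P<dp>\d+)t(?P<tp>\d+)p(?P<pp>\d+)$"
)
# backend(group|group|...), e.g. "sglang(P:d1t1p1|D:d1t1p1)"
_SPEC_RE = re.compile(r"^(?P<backend>\w+)\((?P<groups>[^()]*)\)$")


@dataclass(frozen=True)
class GroupSketch:
    name: str
    dp: int
    tp: int
    pp: int


def parse_pd_spec(spec: str) -> tuple[str, list[GroupSketch]]:
    """Split a PD backend string into its backend name and groups."""
    outer = _SPEC_RE.match(spec.strip())
    if outer is None:
        raise ValueError(f"not a PD backend spec: {spec!r}")
    groups = []
    for part in outer["groups"].split("|"):
        m = _GROUP_RE.match(part.strip())
        if m is None:
            raise ValueError(f"bad PD group: {part!r}")
        groups.append(
            GroupSketch(m["name"], int(m["dp"]), int(m["tp"]), int(m["pp"]))
        )
    return outer["backend"], groups


def merged_dp_size(groups: list[GroupSketch]) -> int:
    """dp_size of the merged allocation is the sum over groups."""
    return sum(g.dp for g in groups)


backend, groups = parse_pd_spec("sglang(P:d1t2p1|D:d1t1p1)")
```

Here `merged_dp_size(groups)` is 2, which corresponds to the DP=2 (1P1D) configuration in the PR title.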

Gateway PD dispatch (areal/experimental/inference_service/gateway/)

  • streaming.py: new PDPair dataclass, query_router_pd() to fetch a
    matched pair from the Router, and pd_dual_dispatch() that concurrently
    forwards the request to both prefill and decode Data Proxies with injected
    bootstrap fields.
  • app.py: chat_completions checks config.pd_disaggregation; when true
    it routes through pd_dual_dispatch instead of the normal single-worker
    path. Streaming is not supported in PD mode (returns 400).
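
The branching in `chat_completions` described above amounts to something like the following. This is a sketch under assumptions: `choose_dispatch_path` and `PDStreamingNotSupported` are hypothetical names, and reading the `stream` field from the request body follows the OpenAI chat-completions convention rather than anything confirmed about this PR's code.

```python
class PDStreamingNotSupported(Exception):
    """Hypothetical error; a gateway would map it to an HTTP 400 response."""

    status_code = 400


def choose_dispatch_path(pd_disaggregation: bool, body: dict) -> str:
    """Return which dispatch path a request would take."""
    if not pd_disaggregation:
        return "single_worker"
    if body.get("stream", False):
        # PD mode is non-streaming only.
        raise PDStreamingNotSupported("streaming is not supported in PD mode")
    return "pd_dual_dispatch"
```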

Router PD support (areal/experimental/inference_service/router/)

  • state.py: WorkerInfo gains worker_type ("regular" | "prefill" | "decode")
    and bootstrap_port fields; WorkerRegistry adds get_prefill_workers()
    and get_decode_workers() methods.
  • app.py: new POST /route_pd endpoint that picks one prefill and one
    decode worker, returning their addresses plus a shared bootstrap triplet.
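
A rough model of the typed registry and pair selection is shown below. The class and field names with a `Sketch` suffix are hypothetical. Three things are assumptions rather than facts from this PR: the round-robin selection policy, taking the bootstrap host from the prefill worker's address, and drawing the room ID at random. The 63-bit bound on `bootstrap_room` reflects the cap this PR applies to fit SGLang's signed-int64 limit.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerInfoSketch:
    addr: str
    worker_type: str = "regular"  # "regular" | "prefill" | "decode"
    bootstrap_port: Optional[int] = None


class RegistrySketch:
    def __init__(self) -> None:
        self._workers: list[WorkerInfoSketch] = []
        self._counter = 0  # assumption: simple round-robin cursor

    def register(self, worker: WorkerInfoSketch) -> None:
        self._workers.append(worker)

    def get_prefill_workers(self) -> list[WorkerInfoSketch]:
        return [w for w in self._workers if w.worker_type == "prefill"]

    def get_decode_workers(self) -> list[WorkerInfoSketch]:
        return [w for w in self._workers if w.worker_type == "decode"]

    def route_pd(self) -> dict:
        """Pick one worker from each pool plus a shared bootstrap triplet."""
        prefill = self.get_prefill_workers()
        decode = self.get_decode_workers()
        if not prefill or not decode:
            raise LookupError("need at least one prefill and one decode worker")
        p = prefill[self._counter % len(prefill)]
        d = decode[self._counter % len(decode)]
        self._counter += 1
        return {
            "prefill_addr": p.addr,
            "decode_addr": d.addr,
            "bootstrap_host": p.addr.split(":")[0],
            "bootstrap_port": p.bootstrap_port,
            # 63 random bits always fit in a signed 64-bit integer.
            "bootstrap_room": secrets.randbits(63),
        }


registry = RegistrySketch()
registry.register(WorkerInfoSketch("10.0.0.1:8001", "prefill", 8998))
registry.register(WorkerInfoSketch("10.0.0.2:8002", "decode"))
route = registry.route_pd()
```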

Controller PD orchestration (areal/experimental/inference_service/controller/)

  • controller.py: when config.pd_disaggregation is true, the controller
    forks two inference server groups (prefill + decode) with appropriate
    --disaggregation-mode flags and bootstrap ports. Data proxies are typed
    (prefill / decode) and registered accordingly.
  • Gateway command includes --pd-disaggregation flag when PD is active.

Data proxy typing (areal/experimental/inference_service/data_proxy/)

  • app.py: accepts --worker-type prefill|decode|regular and --bootstrap-port
    arguments. The worker type is advertised to the Router during registration.
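
For illustration, the two arguments could be declared with `argparse` as below. Whether the data proxy actually uses `argparse`, the default values, and the shape of `registration_payload` are all assumptions; only the flag names and the three worker types come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="data proxy (sketch)")
    parser.add_argument(
        "--worker-type",
        choices=["prefill", "decode", "regular"],
        default="regular",  # assumed default
    )
    parser.add_argument("--bootstrap-port", type=int, default=None)
    return parser


args = build_parser().parse_args(
    ["--worker-type", "prefill", "--bootstrap-port", "8998"]
)
# Hypothetical payload a proxy might send when registering with the Router.
registration_payload = {
    "worker_type": args.worker_type,
    "bootstrap_port": args.bootstrap_port,
}
```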

Timeout defaults

  • LocalScheduler.startup_timeout: 30 s → 300 s
  • RolloutControllerV2._WORKERS_READY_TIMEOUT: 30 s → 300 s

CLI & config

  • areal/api/cli_args.py: pd_disaggregation: bool field on
    InferenceEngineConfig; disaggregation_mode changed from Literal to
    str (OmegaConf compat).
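
A later commit in this PR derives `pd_disaggregation` from the backend string via a regex in `InferenceEngineConfig.__post_init__`. The idea can be sketched as follows; `EngineConfigSketch` and the exact pattern are hypothetical and only cover the two-group `P:...|D:...` form.

```python
import re
from dataclasses import dataclass

# Matches backend(P:<group>|D:<group>), e.g. "sglang(P:d1t1p1|D:d1t1p1)".
_PD_BACKEND_RE = re.compile(r"^\w+\(\s*P:[^|()]+\|\s*D:[^|()]+\)$")


@dataclass
class EngineConfigSketch:
    backend: str = ""
    pd_disaggregation: bool = False

    def __post_init__(self) -> None:
        # Turn PD mode on automatically when the backend string has PD groups.
        if _PD_BACKEND_RE.match(self.backend.strip()):
            self.pd_disaggregation = True


pd_cfg = EngineConfigSketch(backend="sglang(P:d1t1p1|D:d1t1p1)")
plain_cfg = EngineConfigSketch(backend="some-non-pd-spec")
```

With this approach a single `rollout.backend` override is enough to enable PD mode, so no separate flag has to be kept in sync with it.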

Documentation

  • docs/en/tutorial/pd_disaggregation.md and docs/zh/tutorial/pd_disaggregation.md
    — architecture overview, Hydra quoting rules, quickstart with
    sglang(P:d1t1p1|D:d1t1p1).
  • examples/experimental/inference_service/README.md — PD example command

Test plan

Unit tests

tests/test_pd_alloc_mode.py — 10 cases covering parser, allocation
merging, backward compat, and edge cases:

uv run pytest tests/test_pd_alloc_mode.py -v

Existing integration tests updated for PD:
test_controller.py, test_controller_integration.py,
test_gateway_integration.py, test_data_proxy_integration.py.

E2E test case (requires transfer engine installed)

# PD disaggregation (2 workers: prefill + decode)
python3 examples/experimental/inference_service/online_rollout.py \
    --config examples/experimental/inference_service/online_rollout.yaml \
    --model Qwen/Qwen3-0.6B \
    'scheduler.type=local' \
    'rollout.agent.mode=online' \
    'rollout.backend="sglang(P:d1t1p1|D:d1t1p1)"' \
    'train_dataset.batch_size=4'

Manual verification

# Verify PD routing via gateway
curl -s http://<gateway>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-test123456" \
    -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":50}'

Expected: valid JSON response with choices[0].message.content containing
the model's answer (not an error).
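
The same check can be scripted. The helper below is a sketch: `extract_answer` is a hypothetical name, and treating a top-level `error` key as the failure signal follows the OpenAI API convention, which is an assumption about this gateway.

```python
import json


def extract_answer(raw: str) -> str:
    """Return choices[0].message.content or raise if the response is bad."""
    payload = json.loads(raw)
    if "error" in payload:
        raise ValueError(f"gateway returned an error: {payload['error']}")
    content = payload["choices"][0]["message"]["content"]
    if not isinstance(content, str) or not content:
        raise ValueError("empty or missing message content")
    return content


# Stand-in for the body returned by the curl command above.
sample = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "4"}}]}
)
answer = extract_answer(sample)
```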

Known limitations

  • Streaming is not supported in PD mode (non-streaming only).
  • PD mode only supports the sglang backend; vllm is not supported.

Scope

This PR: DP=2 (1P1D), TP=n, without weight_transfer (RolloutControllerV2 does not support it yet)
Next PR: DP=N, TP=N (heterogeneous PD groups), without weight_transfer (RolloutControllerV2 does not support it yet)
Following PR: integrate weight_transfer (after the V2 loop is done)

Test Result

I ran the E2E test case with two configurations, using the mooncake transfer engine:

  • Backend: sglang(P:d1t1p1|D:d1t1p1)
  • Backend: sglang(P:d1t2p1|D:d1t1p1)

The chat_completion API returned exactly the same output in both cases.

Note

This PR implements PD (Prefill-Decode) disaggregation on top of the RolloutControllerV2 inference service architecture. The inference-side changes (gateway routing, worker type registry, bootstrap triplet propagation, and KV cache transfer coordination) are complete and have been validated via E2E inference tests.

However, RolloutControllerV2 currently lacks the InferenceEngine interface methods required for weight synchronization (init_weights_update_group, update_weights_from_distributed, etc.). Therefore, the full RL training loop — which depends on the weight-update closed loop between the training engine and the inference engine — cannot be end-to-end tested until RolloutControllerV2 completes its weight-sync path.

Once the weight-update capability lands in RolloutControllerV2 (via XCCL, AWEX, or disk mode), the PD disaggregation logic in this PR should work transparently without additional changes, since all weight-sync broadcast mechanics are orthogonal to the prefill/decode group split.

Related Issue

Details are in issue #1329.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

ZiyiTsang changed the title from "Feat support pd" to "Support PD Disaggregation" on May 25, 2026
ZiyiTsang and others added 5 commits May 26, 2026 14:53
- Add examples/experimental/inference_service/pd_online_rollout.py: thin
  online-rollout entry that asserts rollout.pd_disaggregation=true. Uses
  the existing online_rollout.yaml + CLI overrides instead of a separate
  YAML.
- Add docs/{en,zh}/tutorial/pd_disaggregation.md and link from _toc.yml.
- Update examples README with an Example 3 section pointing at the
  override-based run command.
- Router: surface worker_type / bootstrap_port on /workers; cap
  bootstrap_room to 63 bits to fit SGLang's signed-int64 limit.
- pyproject: remove mooncake-transfer-engine from the sglang extra.
  Users now install mooncake-transfer-engine or nixl themselves when
  enabling PD.
Reuse online_rollout.py directly with CLI overrides for PD; the existing
config validation in InferenceEngineConfig.__post_init__ already covers
the pd_online_rollout.py guard rails.

Trim README and tutorial pages to the minimum: install KV transport
engine, then one override-based command.
Introduce a new backend string format that encodes PD group structure
directly in the rollout.backend field: sglang(P:d1t1p1|D:d1t1p1).
pd_disaggregation is now auto-derived from the backend string via regex
in InferenceEngineConfig.__post_init__, eliminating the need for a
separate rollout.pd_disaggregation=true flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add pd_parallel and pd_group grammar rules to ALLOCATION_GRAMMAR
- Add pd_group and pd_parallel transformer methods
- Update from_str to return synthetic allocation for PD specs (dp_size=sum)
- Add from_str_multi for callers needing individual PD groups
- Fix OmegaConf Literal type annotation error in SGLangConfig
- Update Hydra quoting in docs for PD syntax (parentheses conflict)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ZiyiTsang changed the title from "Support PD Disaggregation" to "Support PD Disaggregation (P/D=1,TP=1)" on May 28, 2026
ZiyiTsang changed the title from "Support PD Disaggregation (P/D=1,TP=1)" to "Support PD Disaggregation (P+D=2,TP=1)" on May 28, 2026
- LocalScheduler.startup_timeout: 30s → 300s (model loading needs time)
- RolloutControllerV2._WORKERS_READY_TIMEOUT: 30s → 300s (match scheduler)
- Remove unused Literal import from cli_args.py
- Add 10 unit tests for sglang(P:...|D:...) PD allocation parsing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ZiyiTsang changed the title from "Support PD Disaggregation (P+D=2,TP=1)" to "feat: Support PD Disaggregation: DP=2(1P1D),TP=1" on May 28, 2026
ZiyiTsang changed the title from "feat: Support PD Disaggregation: DP=2(1P1D),TP=1" to "feat: Support PD Disaggregation: DP=2(1P1D),TP=n" on May 30, 2026
ZiyiTsang marked this pull request as ready for review on May 30, 2026 13:36
@ZiyiTsang
Collaborator Author

Ready for human review.
This PR does not involve weight updates, so it is relatively low-risk.
