Add MoEDispatcher for expert-parallel token dispatch/combine #492
Closed
Conversation
Promotes the dispatch/combine kernels from examples/31_expert_sharded_moe/ into iris/ccl/ as a production MoEDispatcher class with pre-allocated buffers and a handle-based API. Dispatch routes tokens to expert-owning ranks via direct iris.store scatter; combine sends results back with masked reduction.

New files:
- iris/ccl/moe_utils.py: ExptAssignment, RaggedTensorMetadata, topk, BitmatrixMetadata, and the masked reduce kernel
- iris/ccl/moe_dispatch.py: MoEDispatcher, DispatchHandle, MoEDispatchConfig, and the Triton dispatch/combine kernels
- tests/ccl/test_moe_dispatch.py: E2E, dispatch-only, combine-only, buffer reuse, topk=1, and handle immutability tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
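A hedged usage sketch of the handle-based API described above, assuming direct construction of the dispatcher. Method arguments and the `handle.received_tokens` attribute are illustrative assumptions, not the confirmed interface (the PR also exposes a ctx.ccl.moe_dispatcher() factory).

```python
import torch
from iris.ccl.moe_dispatch import MoEDispatcher

def moe_layer_step(dispatcher: MoEDispatcher, tokens: torch.Tensor,
                   expert_ids: torch.Tensor) -> torch.Tensor:
    # Dispatch: scatter each token to the rank owning its expert(s) via direct
    # iris.store writes into the pre-allocated symmetric-heap buffers.
    handle = dispatcher.dispatch(tokens, expert_ids)

    # Run the local experts on whatever tokens landed on this rank. The
    # attribute name is hypothetical; identity "experts" keep the sketch short.
    expert_out = handle.received_tokens

    # Combine: send expert outputs back to their source ranks and apply the
    # masked reduction over each token's top-k expert contributions.
    return dispatcher.combine(expert_out, handle)
```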
Pre-allocated dispatch/combine buffers were hardcoded to bfloat16, causing Triton compilation errors when used with float32 inputs (tl.dot operand type mismatch). Now accepts a dtype parameter (default: bfloat16) that controls buffer allocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
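A minimal illustration of the dtype fix above; the real allocation happens inside MoEDispatcher, and the parameter name and placement here are assumptions.

```python
import torch

def allocate_staging_buffers(max_tokens: int, hidden: int,
                             dtype: torch.dtype = torch.bfloat16,
                             device: str = "cuda"):
    # Previously these buffers were always bfloat16, so float32 activations hit
    # a tl.dot operand-type mismatch inside the Triton kernels. Threading the
    # caller's dtype through keeps kernel operands consistent.
    dispatch_buf = torch.empty(max_tokens, hidden, dtype=dtype, device=device)
    combine_buf = torch.empty(max_tokens, hidden, dtype=dtype, device=device)
    return dispatch_buf, combine_buf
```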
Benchmarks iris MoEDispatcher (direct iris.store scatter) against a naive approach using torch.distributed.all_gather + host-side sorting across various batch sizes, hidden dims, and topk values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
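A hedged sketch of the naive baseline as described above, shown for the topk=1 case: all-gather every rank's tokens and expert ids, sort on the host, and slice out the tokens owned by this rank's experts. Shapes, the contiguous expert-sharding rule, and the helper name are assumptions; torch.distributed must already be initialized. This is not the benchmark script itself.

```python
import torch
import torch.distributed as dist

def naive_dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor,
                   experts_per_rank: int) -> torch.Tensor:
    world, rank = dist.get_world_size(), dist.get_rank()

    # Replicate every rank's tokens and routing ids on all ranks.
    tok_list = [torch.empty_like(tokens) for _ in range(world)]
    id_list = [torch.empty_like(expert_ids) for _ in range(world)]
    dist.all_gather(tok_list, tokens)
    dist.all_gather(id_list, expert_ids)
    all_tokens, all_ids = torch.cat(tok_list), torch.cat(id_list)

    # Host-side sort by expert id, then keep only this rank's experts
    # (assuming experts are sharded contiguously across ranks).
    order = torch.argsort(all_ids.cpu()).to(all_ids.device)
    all_tokens, all_ids = all_tokens[order], all_ids[order]
    mine = (all_ids // experts_per_rank) == rank
    return all_tokens[mine]
```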
Ruff F821 flagged `dispatcher` as undefined in a lambda inside a loop. Convert to a def with default args to eagerly bind the loop variable. Also apply ruff auto-fixes (unused `time` import, f-strings without placeholders). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
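A self-contained illustration of the binding issue behind that fix (stand-in strings, not the benchmark's dispatcher objects). A lambda created in a loop captures the loop variable lazily, so every callable sees its final value; a def with a default argument evaluates the value at definition time and binds it eagerly.

```python
dispatchers = ["iris", "naive"]

# Before: late binding -- both callables end up using the last dispatcher.
fns_late = []
for dispatcher in dispatchers:
    fns_late.append(lambda: f"run {dispatcher}")

# After: the default argument freezes the current loop value each iteration.
fns_eager = []
for dispatcher in dispatchers:
    def run(dispatcher=dispatcher):
        return f"run {dispatcher}"
    fns_eager.append(run)

assert [f() for f in fns_late] == ["run naive", "run naive"]
assert [f() for f in fns_eager] == ["run iris", "run naive"]
```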
Summary
- Promotes examples/31_expert_sharded_moe/ to a production iris.ccl module
- MoEDispatcher class with pre-allocated buffers and a handle-based API (dispatch() → combine())
- moe_utils.py with routing helpers (ExptAssignment, RaggedTensorMetadata, topk, reduce)
- ctx.ccl.moe_dispatcher() factory method

Key Design
- iris.store() scatter: no AllToAll/AllToAllv; sparse per-token routing via the symmetric heap
- All buffers allocated once in __init__: dispatch, combine, and all-gather routing buffers
- dispatch() returns an opaque DispatchHandle consumed by combine(), whose masked reduction is sketched below
- _convert_dp_to_ep and _convert_ep_to_dp are identical to the example
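A hedged sketch of the masked reduction the combine step applies, per the PR summary above: each token's top-k expert outputs are summed, with invalid slots masked out. Any weighting by router probabilities is omitted here, and the tensor layout is an assumption.

```python
import torch

def masked_combine(expert_out: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    # expert_out: [tokens, topk, hidden] results returned from the expert ranks
    # valid:      [tokens, topk] mask marking slots that hold a real result
    return (expert_out * valid.unsqueeze(-1).to(expert_out.dtype)).sum(dim=1)
```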
Performance (iris vs naive all_to_all)

iris shows near-constant latency, winning at T_local >= 48 tokens/rank with up to 8.8x speedup.
Full performance study: https://gist.github.com/mawad-amd/c43c02e3662a180a88e26faa31c62f52
Test plan
🤖 Generated with Claude Code