
Add MoEDispatcher for expert-parallel token dispatch/combine #492

Closed
mawad-amd wants to merge 6 commits into main from muhaawad/moe-token-dispatch

Conversation

@mawad-amd
Collaborator

Summary

  • Promote MoE dispatch/combine from examples/31_expert_sharded_moe/ to production iris.ccl module
  • Add MoEDispatcher class with pre-allocated buffers and a handle-based API (dispatch() / combine())
  • Add moe_utils.py with routing helpers (ExptAssignment, RaggedTensorMetadata, topk, reduce)
  • Add ctx.ccl.moe_dispatcher() factory method
  • Add benchmark comparing iris vs naive all_to_all dispatch
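
The handle-based API above can be sketched with a single-process stand-in. Everything below (the class name MoEDispatcherSketch, the DispatchHandle fields, the argsort-based scatter) is hypothetical and only illustrates the pattern; the real dispatcher scatters tokens across ranks via iris.store() on the symmetric heap.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class DispatchHandle:
    # Opaque routing record produced by dispatch() and consumed by
    # combine(). frozen=True mirrors the "handle-frozen" test: callers
    # cannot mutate a handle after dispatch.
    perm: np.ndarray   # dispatch-order permutation of token indices
    num_tokens: int

class MoEDispatcherSketch:
    """Single-process stand-in for the handle-based dispatch/combine API."""

    def __init__(self, num_experts: int, max_tokens: int, hidden: int,
                 dtype=np.float32):
        # Buffers are allocated once in __init__, never per call.
        self.num_experts = num_experts
        self.dispatch_buf = np.zeros((max_tokens, hidden), dtype=dtype)

    def dispatch(self, tokens: np.ndarray, expert_ids: np.ndarray) -> DispatchHandle:
        # Scatter tokens into the pre-allocated buffer, grouped by expert.
        order = np.argsort(expert_ids, kind="stable")
        n = tokens.shape[0]
        self.dispatch_buf[:n] = tokens[order]
        return DispatchHandle(perm=order, num_tokens=n)

    def combine(self, expert_out: np.ndarray, handle: DispatchHandle) -> np.ndarray:
        # Undo the permutation recorded in the handle, restoring token order.
        out = np.empty_like(expert_out)
        out[handle.perm] = expert_out
        return out
```

A round trip through dispatch() and combine() with identity experts returns the tokens in their original order, which is the invariant the e2e tests check.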

Key Design

  • Direct iris.store() scatter — no AllToAll/AllToAllv; sparse per-token routing via symmetric heap
  • Pre-allocated buffers in __init__ — dispatch, combine, all-gather routing buffers allocated once
  • Handle-based API — dispatch() returns an opaque DispatchHandle consumed by combine()
  • Reuses proven kernels — _convert_dp_to_ep and _convert_ep_to_dp are identical to the example
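
As a rough illustration of the routing helpers, the top-k step can be sketched in NumPy. The real topk in moe_utils.py runs on-GPU in Triton; topk_route below is a hypothetical stand-in, and the softmax-over-selected-logits gating is an assumption about the routing scheme.

```python
import numpy as np

def topk_route(logits: np.ndarray, k: int):
    """Pick the top-k experts per token from router logits.

    Returns (expert_ids, gates), where gates are normalized over the
    selected k experts and serve as combine weights.
    """
    # Sort each row descending and keep the first k expert indices.
    expert_ids = np.argsort(-logits, axis=1)[:, :k]
    picked = np.take_along_axis(logits, expert_ids, axis=1)
    # Softmax over the selected k logits -> per-token combine weights.
    e = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates = e / e.sum(axis=1, keepdims=True)
    return expert_ids, gates
```

Each token's expert_ids row tells dispatch() which ranks own its experts; the gates row is carried through to the masked reduction in combine().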

Performance (iris vs naive all_to_all)

| T_local | iris (us) | naive (us) | Speedup |
|--------:|----------:|-----------:|--------:|
| 32      | 28,462    | 19,368     | 0.68x   |
| 64      | 24,177    | 34,645     | 1.43x   |
| 128     | 24,757    | 66,385     | 2.68x   |
| 256     | 24,679    | 125,387    | 5.08x   |
| 512     | 25,987    | 227,935    | 8.77x   |

iris shows near-constant latency and wins for T_local >= 48 tokens/rank, with up to 8.8x speedup at T_local = 512.

Full performance study: https://gist.github.com/mawad-amd/c43c02e3662a180a88e26faa31c62f52

Test plan

  • MoE example tests pass (2/2)
  • New MoE dispatch tests pass (21/21) — e2e, dispatch-only, combine-only, buffer-reuse, topk=1, handle-frozen
  • CCL regression tests pass (no new failures)
  • Parametrized: T_local=[32,128], H=[64,256], k=[1,2], dtype=[bf16,fp32]

🤖 Generated with Claude Code

mawad-amd and others added 4 commits March 27, 2026 23:44
Promotes the dispatch/combine kernels from examples/31_expert_sharded_moe/
into iris/ccl/ as a production MoEDispatcher class with pre-allocated
buffers and a handle-based API. Dispatch routes tokens to expert-owning
ranks via direct iris.store scatter; combine sends results back with
masked reduction.

New files:
- iris/ccl/moe_utils.py: ExptAssignment, RaggedTensorMetadata, topk,
  BitmatrixMetadata, and masked reduce kernel
- iris/ccl/moe_dispatch.py: MoEDispatcher, DispatchHandle,
  MoEDispatchConfig, and the Triton dispatch/combine kernels
- tests/ccl/test_moe_dispatch.py: E2E, dispatch-only, combine-only,
  buffer reuse, topk=1, and handle immutability tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
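
The "masked reduction" in the combine path can be sketched as a weighted sum over the top-k expert outputs, zeroing invalid slots. This is a hypothetical NumPy stand-in for the masked reduce kernel in moe_utils.py; the shapes and the mask convention are assumptions.

```python
import numpy as np

def masked_reduce(expert_out: np.ndarray, gates: np.ndarray,
                  valid_mask: np.ndarray) -> np.ndarray:
    """Combine top-k expert outputs per token.

    expert_out: (T, k, H) outputs returned from the k selected experts
    gates:      (T, k)    combine weights from routing
    valid_mask: (T, k)    1.0 for slots that actually hold a result
    """
    # Invalid slots contribute zero regardless of their buffer contents.
    w = gates * valid_mask
    return (expert_out * w[..., None]).sum(axis=1)
```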
Pre-allocated dispatch/combine buffers were hardcoded to bfloat16,
causing Triton compilation errors when used with float32 inputs
(tl.dot operand type mismatch). Now accepts a dtype parameter
(default: bfloat16) that controls buffer allocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
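
The dtype fix described above amounts to threading a dtype parameter through buffer allocation instead of hardcoding it. A minimal sketch (the function name is hypothetical; NumPy lacks bfloat16, so float16 stands in for the real default):

```python
import numpy as np

def alloc_buffers(max_tokens: int, hidden: int, dtype=np.float16):
    # Buffer dtype now follows the parameter rather than being
    # hardcoded, so float32 inputs no longer hit a tl.dot operand
    # type mismatch at Triton compile time.
    dispatch_buf = np.zeros((max_tokens, hidden), dtype=dtype)
    combine_buf = np.zeros((max_tokens, hidden), dtype=dtype)
    return dispatch_buf, combine_buf
```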
Benchmarks iris MoEDispatcher (direct iris.store scatter) against
a naive approach using torch.distributed.all_gather + host-side
sorting across various batch sizes, hidden dims, and topk values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners March 28, 2026 09:44
Copilot AI review requested due to automatic review settings March 28, 2026 09:44
@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Mar 28, 2026
mawad-amd and others added 2 commits March 28, 2026 13:24
Ruff F821 flagged `dispatcher` as undefined in a lambda inside a loop.
Convert to a def with default args to eagerly bind the loop variable.
Also apply ruff auto-fixes (unused `time` import, f-strings without
placeholders).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd mawad-amd closed this Mar 28, 2026
