WCER: run a Mixture-of-Experts using only the experts your workload uses #3575
guruswami-ai started this conversation in Show and tell
A MoE keeps all of its experts in memory but routes each token to only a few of them. When a server runs a fixed workload (mostly code, or chat, or math), most experts almost never get picked. WCER traces which experts a workload actually uses, keeps only those in memory, restricts routing to that set, and checks that the trimmed model produces the same tokens the full model would under the same restriction. How much you save depends on how concentrated the routing is, which you can measure from one trace before you deploy.
Code, traces, patches and figures: https://github.com/guruswami-ai/wcer
Savings are predictable from one trace
The number of experts needed to cover 90% of routing predicts how much memory you can cut. It held across 5 models from 4 families:
So one cheap trace tells you whether residency is worth it, including when it isn't (balanced models like Mixtral). It isn't a property of the model family either: the two DeepSeek models sit at opposite ends, so you measure rather than assume.
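As a rough illustration of the coverage number above, here is a minimal sketch that computes how many experts are needed to cover 90% of routing for one layer. The function name and the flat list-of-counts input are assumptions made for this example, not the repo's trace format or API, and the histograms are made up.

```python
def experts_for_coverage(counts, target=0.90):
    """Smallest number of experts whose selections cover `target` of all routing.

    `counts` is a hypothetical per-layer selection histogram: counts[i] is how
    many times expert i was selected in the trace.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    covered = 0
    # Take the most-used experts first and stop once the target share is reached.
    for k, c in enumerate(sorted(counts, reverse=True), start=1):
        covered += c
        if covered / total >= target:
            return k
    return len(counts)


# Made-up histograms for an 8-expert layer (1000 selections each).
concentrated = [700, 210, 40, 20, 10, 10, 5, 5]
balanced = [125] * 8

print(experts_for_coverage(concentrated))  # 2
print(experts_for_coverage(balanced))      # 8
```

A concentrated layer reaches the target with a small fraction of its experts, while a perfectly balanced one needs all of them, which is the case where residency is not worth it.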
The trimmed model matches the full model exactly
Under the same routing restriction, the trimmed model picks the same next token as the full model (top-1 agreement 1.0, relative logit difference 0.0). This is a memory optimization. For the chosen expert set there is no quality tradeoff.
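The two equivalence numbers can be computed along these lines. This is a sketch under stated assumptions: the post does not define relative logit difference, so the definition here (largest absolute difference divided by the largest absolute reference logit) is a guess at a reasonable one, and the function names and toy logits are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def top1_agreement(full_logits, trimmed_logits):
    """Fraction of positions where both models pick the same next token."""
    matches = sum(
        argmax(f) == argmax(t) for f, t in zip(full_logits, trimmed_logits)
    )
    return matches / len(full_logits)


def relative_logit_diff(full_logits, trimmed_logits):
    """Assumed definition: max |full - trimmed| divided by max |full|."""
    max_diff = max(
        abs(f - t)
        for f_row, t_row in zip(full_logits, trimmed_logits)
        for f, t in zip(f_row, t_row)
    )
    max_ref = max(abs(f) for f_row in full_logits for f in f_row)
    return max_diff / max_ref if max_ref else max_diff


# Toy logits: two positions, three-token vocabulary.
full = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
identical = [row[:] for row in full]
drifted = [[0.1, 2.0, -1.0], [0.1, 0.2, 0.3]]

print(top1_agreement(full, identical), relative_logit_diff(full, identical))
print(top1_agreement(full, drifted))
```

Identical outputs give agreement 1.0 and difference 0.0, the values reported above; any drift in the trimmed model shows up in one or both numbers.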
Two more findings
A shared expert cushions mistakes. Models with an always-on shared expert (DeepSeek V2 and V4) degrade gracefully when you pick the wrong resident set, around 2.5x perplexity. Models without one (OLMoE, Qwen3) degrade catastrophically, around 194x. The shared expert acts as a safety net against a bad guess.
Pick experts by router weight, not frequency. Choosing experts by how strongly the router prefers them, rather than how often it picks them, recovers close to full quality at half the experts (Qwen3, weighted at 50%: 15.20 ppl vs 15.19 full).
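To make the difference between the two selection rules concrete, here is a small sketch that ranks experts either by how often they were selected or by the total router weight they received. The trace layout (a list of per-token lists of (expert id, router weight) pairs) and the function name are assumptions for illustration only, and the numbers are made up.

```python
from collections import defaultdict


def pick_resident(trace, k, by="weight"):
    """Return the ids of the k highest-scoring experts, sorted by id.

    by="frequency": score is the number of times an expert was selected.
    by="weight":    score is the summed router weight it received.
    """
    if by not in ("weight", "frequency"):
        raise ValueError("by must be 'weight' or 'frequency'")
    score = defaultdict(float)
    for token in trace:
        for expert, weight in token:
            score[expert] += weight if by == "weight" else 1.0
    # Highest score first; ties are broken by the lower expert id.
    ranked = sorted(score, key=lambda e: (-score[e], e))
    return sorted(ranked[:k])


# Expert 0 is selected for every token but always with a small weight.
trace = [
    [(0, 0.2), (1, 0.8)],
    [(0, 0.2), (2, 0.8)],
    [(0, 0.1), (1, 0.9)],
    [(0, 0.3), (2, 0.7)],
    [(0, 0.1), (1, 0.9)],
]

print(pick_resident(trace, 2, by="frequency"))  # [0, 1]
print(pick_resident(trace, 2, by="weight"))     # [1, 2]
```

In this toy trace the frequency rule keeps an expert that is picked often but contributes little, while the weight rule keeps the two experts that carry most of the routed mass.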
Scope and limits
It saves memory and time-to-first-token. It does not improve throughput, since tokens per second is roughly unchanged (per-token compute is the same), and it does not speed up cold load from disk, since the loader still reads the expert rows.
It is not "best quality per GB". Against a dense model that fits the same RAM, a trimmed large MoE trades quality for speed (dense Qwen3-14B: 10.96 ppl at 26.5 tok/s; WCER Qwen3-30B-A3B at 50%: 16.41 ppl at 83.2 tok/s). WCER makes a chosen MoE fit. It does not claim to beat every alternative.
Validated on Apple Silicon and MLX, 4-bit, single node, prefill traces, perplexity quality. Non-Apple hardware, batched throughput, and task-accuracy metrics are still open.
Background: this came out of expert-parallelism work
WCER is a pivot from expert parallelism (EP), not a replacement for it. EP puts different experts on different nodes so a cluster can host a MoE too big for any one machine. It is still the right answer for genuinely oversized models. It just isn't viable as it stands today, so we built what we could ship now. The two compose: EP for what won't fit anywhere, WCER for what you can shrink to fit.
While getting EP running on a 2-node Apple Silicon setup over Thunderbolt, building on the open MLX EP work (the all_to_all PR #3164 and draft #3158), we found:
- [...] all_to_all gap (roughly 10x down to 4x) but doesn't close it, because the interconnect is latency-bound. It also doesn't save memory in the naive form, since each rank loads the full model and then shards.
- [...] all_to_all beyond 2 ranks plus Group.split(), neither of which is upstream yet (the available collective is capped at size 2).
Caveat: this was naive EP. We didn't replicate the shared-expert path or overlap the all_to_all behind shared compute, both of which should help prefill. The per-layer collective in decode is the structural cost.
So EP is gated, not dead. It is waiting on the upstream collective and the optimizations above, and on this kind of interconnect it is a capacity tool rather than a speed win. Rather than wait, we pivoted to the complementary question. Instead of moving experts across nodes, reduce how many have to be resident at all, which avoids the per-token collectives entirely. That is WCER.
I'd be glad to add these EP measurements to the open EP threads (#3158 and #3164) and help move that work along. They are a useful data point on where EP pays off (capacity, large batches) versus where it doesn't (single-node-capable models, latency-bound links). This is part of a longer line of cluster-scaling work from the same project: pipeline-parallelism patches for Llama, Qwen2 and Mixtral (ml-explore/mlx-lm#1051) and pipeline-versus-tensor parallelism on Kimi-K2 1T (#2990).
The traces are reusable on their own
Each trace is a versioned expert-trace/1 JSON: per-layer selection histograms, coverage curves, top co-activation pairs, hash-layer ids, and optionally router-weighted importance. I haven't found a published expert-activation profile set for MLX MoE models, so these may be useful by themselves. The repo ships 15 of them (5 models, 3 workloads each).
The one change models need
Most models run unmodified. DeepSeek needs a small opt-in hook, MoEGate._resident_mask. It defaults to None, which is stock behavior; when set to a boolean mask it pushes non-resident experts below the selection threshold before arg-partition. That is about 6 lines per model, with no behavior change when it is unset. Both patches are in the repo under patches/.
I'm happy to open a PR against mlx-lm for the hook if it would be welcome. I wanted to check interest and approach here first rather than send an unsolicited PR. If you'd rather not carry it, it works fine as an external patch.
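The masking idea behind the hook can be sketched in plain Python. This is not the patch from the repo and not mlx-lm code: the function name is hypothetical, a full sort stands in for arg-partition, and using negative infinity to push non-resident experts below the selection threshold is an assumption about one way to do it.

```python
import math


def route_top_k(router_logits, k, resident_mask=None):
    """Pick the k experts with the highest router logits for one token.

    resident_mask=None reproduces unrestricted (stock) routing. With a boolean
    mask, non-resident experts are set to -inf before selection, so they can
    only be chosen if fewer than k experts are resident.
    """
    if resident_mask is not None:
        router_logits = [
            x if keep else -math.inf
            for x, keep in zip(router_logits, resident_mask)
        ]
    # Full sort used here for clarity; a real gate would use arg-partition.
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    return order[:k]


logits = [2.0, 0.5, 1.5, 3.0]
mask = [True, True, True, False]  # expert 3 is not resident

print(route_top_k(logits, 2))        # [3, 0]
print(route_top_k(logits, 2, mask))  # [0, 2]
```

With the mask unset the selection is unchanged, which mirrors the default-off behavior described above; with it set, routing falls back to the best resident experts.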
Questions
Does the _resident_mask hook seem reasonable to upstream (default off, no in-tree consumer yet), or is it better left as an external patch?
Related work from this project: the 290-run benchmark suite (#3300), pipeline-versus-tensor parallelism on Kimi-K2 1T (#2990), systematic quant/context benchmarks (#3209), an RDMA-over-Thunderbolt transfer guide (#3481), pipeline-parallelism patches for mlx-lm models (ml-explore/mlx-lm#1051), and an interactive distributed-inference simulator (ml-explore/mlx-lm#1070).