WCER: run a Mixture-of-Experts using only the experts your workload uses #3575

guruswami-ai · 2026-05-21T11:40:40Z


A MoE keeps all of its experts in memory but routes each token to only a few of them. When a server runs a fixed workload (mostly code, or chat, or math), most experts almost never get picked. WCER traces which experts a workload actually uses, keeps only those in memory, restricts routing to that set, and checks that the trimmed model produces the same tokens the full model would under the same restriction. How much you save depends on how concentrated the routing is, which you can measure from one trace before you deploy.

Code, traces, patches and figures: https://github.com/guruswami-ai/wcer

Savings are predictable from one trace

The share of experts needed to cover 90% of routing predicts how much memory you can cut. The relationship held across 5 models from 4 families:

model	family	experts needed for 90% of routing (% of all experts)	memory cut at full quality
Mixtral-8x7B	Mistral	88% (7/8)	~14% (not worth it)
OLMoE-1B-7B	OLMo	72%	~23%
DeepSeek-V2-Lite	DeepSeek	66%	~23%
Qwen3-30B-A3B	Qwen	38%	~47%
DeepSeek-V4-Flash	DeepSeek	25%	~68%

So one cheap trace tells you whether residency is worth it, including when it isn't (balanced models like Mixtral). It isn't a property of the model family either: the two DeepSeek models sit at opposite ends, so you measure rather than assume.
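As a rough illustration of the measurement, the sketch below counts how many experts are needed to reach a coverage target from a single selection histogram. It assumes the histogram is a plain list of per-expert selection counts for one layer; the function name and the example counts are hypothetical and are not taken from the repo or its trace format.

```python
def experts_for_coverage(counts, target=0.90):
    """Smallest number of experts whose selections cover `target` of routing.

    `counts` is a hypothetical per-expert selection histogram for one layer.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    covered = 0
    # Take experts from most to least selected until the target is reached.
    for k, c in enumerate(sorted(counts, reverse=True), start=1):
        covered += c
        if covered >= target * total:
            return k
    return len(counts)


# Made-up histograms: one balanced, one concentrated.
balanced = [100] * 8
skewed = [500, 300, 120, 40, 20, 10, 6, 4]

print(experts_for_coverage(balanced))  # 8 of 8: residency is not worth it
print(experts_for_coverage(skewed))    # 3 of 8: most experts are rarely used
```

A real model has one histogram per MoE layer, so in practice the count would be computed per layer and then summarized across layers.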

The trimmed model matches the full model exactly

Under the same routing restriction, the trimmed model picks the same next token as the full model (top-1 agreement 1.0, relative logit difference 0.0). Dropping the non-resident experts is therefore purely a memory optimization: for the chosen expert set, it costs no additional quality.
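A minimal sketch of the two checks, assuming each model's output is available as a list of per-position logit rows. The post does not define "relative logit difference" precisely, so the normalization used here (largest absolute difference divided by the largest absolute full-model logit) is an assumption, and both function names are hypothetical.

```python
def top1_agreement(full_logits, trimmed_logits):
    """Fraction of positions where both models pick the same argmax token."""
    same = 0
    for a, b in zip(full_logits, trimmed_logits):
        same += int(a.index(max(a)) == b.index(max(b)))
    return same / len(full_logits)


def max_relative_logit_diff(full_logits, trimmed_logits, eps=1e-12):
    """Worst-case |full - trimmed| per position, scaled by the largest |full| logit.

    This normalization is an assumption; the post does not specify one.
    """
    worst = 0.0
    for a, b in zip(full_logits, trimmed_logits):
        scale = max(abs(x) for x in a) + eps
        diff = max(abs(x - y) for x, y in zip(a, b))
        worst = max(worst, diff / scale)
    return worst


# Made-up logits for two positions over a three-token vocabulary.
full = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
trimmed = [row[:] for row in full]  # an exact match, as the post reports

print(top1_agreement(full, trimmed))           # 1.0
print(max_relative_logit_diff(full, trimmed))  # 0.0
```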

Two more findings

A shared expert cushions mistakes. Models with an always-on shared expert (DeepSeek V2 and V4) degrade gracefully when you pick the wrong resident set, around 2.5x perplexity. Models without one (OLMoE, Qwen3) degrade catastrophically, around 194x. The shared expert acts as a safety net against a bad guess.

Pick experts by router weight, not frequency. Choosing experts by how strongly the router prefers them, rather than how often it picks them, recovers close to full quality at half the experts (Qwen3, weighted at 50%: 15.20 ppl vs 15.19 full).
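To make the difference between the two rankings concrete, here is a small sketch. It assumes the trace can be reduced to (expert id, router weight) pairs, one per selection, and that "how strongly the router prefers them" means summed router weight; the exact importance formula is not stated in the post, so treat both the aggregation and the function name as assumptions.

```python
from collections import defaultdict


def rank_experts(selections, keep):
    """Return (top-`keep` by frequency, top-`keep` by summed router weight).

    `selections` is a hypothetical list of (expert_id, router_weight) pairs.
    """
    freq = defaultdict(int)
    weight = defaultdict(float)
    for expert, w in selections:
        freq[expert] += 1
        weight[expert] += w
    # Ties are broken by expert id so the result is deterministic.
    by_freq = sorted(freq, key=lambda e: (-freq[e], e))[:keep]
    by_weight = sorted(weight, key=lambda e: (-weight[e], e))[:keep]
    return sorted(by_freq), sorted(by_weight)


# Made-up trace: expert 2 is picked rarely but with a high weight.
trace = [(0, 0.25)] * 4 + [(1, 0.25)] * 3 + [(2, 0.75)] * 2

by_freq, by_weight = rank_experts(trace, keep=2)
print(by_freq)    # [0, 1]
print(by_weight)  # [0, 2]
```

In this toy trace, frequency ranking drops expert 2 even though it carries the most total router weight, which is the kind of case where the two criteria disagree.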

Scope and limits

WCER reduces resident memory and time-to-first-token. It does not improve throughput: tokens per second is roughly unchanged, because per-token compute is the same. It also does not speed up cold load from disk, since the loader still reads the expert rows.

It is not "best quality per GB". Against a dense model that fits the same RAM, a trimmed large MoE trades quality for speed (dense Qwen3-14B: 10.96 ppl at 26.5 tok/s; WCER Qwen3-30B-A3B at 50%: 16.41 ppl at 83.2 tok/s). WCER makes a chosen MoE fit. It does not claim to beat every alternative.

Validated on Apple Silicon and MLX, 4-bit, single node, prefill traces, perplexity quality. Non-Apple hardware, batched throughput, and task-accuracy metrics are still open.

Background: this came out of expert-parallelism work

WCER is a pivot from expert parallelism (EP), not a replacement for it. EP puts different experts on different nodes so a cluster can host a MoE too big for any one machine. It is still the right answer for genuinely oversized models. It just isn't viable as it stands today, so we built what we could ship now. The two compose: EP for what won't fit anywhere, WCER for what you can shrink to fit.

While getting EP running on a 2-node Apple Silicon setup over Thunderbolt, building on the open MLX EP work (the all_to_all PR #3164 and draft #3158), we found:

The plumbing works. EP2 was bit-correct (OLMoE-1B-7B, 2 ranks vs single node, top-1 token agreement 1.000).
For a model that already fits one node, EP2 is currently a throughput loss, around 5x slower prefill and 12x slower decode at batch 1. Batching narrows the per-layer all_to_all gap (roughly 10x down to 4x) but doesn't close it, because the interconnect is latency-bound. It also doesn't save memory in the naive form, since each rank loads the full model and then shards.
N-rank EP (experts across many nodes) needs all_to_all beyond 2 ranks plus Group.split(), neither of which is upstream yet (the available collective is capped at size 2).

Caveat: this was naive EP. We didn't replicate the shared-expert path or overlap the all_to_all behind shared compute, both of which should help prefill. The per-layer collective in decode is the structural cost.

So EP is gated, not dead. It is waiting on the upstream collective and the optimizations above, and on this kind of interconnect it is a capacity tool rather than a speed win. Rather than wait, we pivoted to the complementary question. Instead of moving experts across nodes, reduce how many have to be resident at all, which avoids the per-token collectives entirely. That is WCER.

I'd be glad to add these EP measurements to the open EP threads (#3158 and #3164) and help move that work along. They are a useful data point on where EP pays off (capacity, large batches) versus where it doesn't (single-node-capable models, latency-bound links). This is part of a longer line of cluster-scaling work from the same project: pipeline-parallelism patches for Llama, Qwen2 and Mixtral (ml-explore/mlx-lm#1051) and pipeline-versus-tensor parallelism on Kimi-K2 1T (#2990).

The traces are reusable on their own

Each trace is a versioned expert-trace/1 JSON: per-layer selection histograms, coverage curves, top co-activation pairs, hash-layer ids, and optionally router-weighted importance. I haven't found a published expert-activation profile set for MLX MoE models, so these may be useful by themselves. The repo ships 15 of them (5 models, 3 workloads each).

The one change models need

Most models run unmodified. DeepSeek needs a small opt-in hook, MoEGate._resident_mask. It defaults to None, which is stock behavior; when set to a boolean mask it pushes non-resident experts below the selection threshold before arg-partition. That is about 6 lines per model, with no behavior change when it is unset. Both patches are in the repo under patches/.
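The idea behind the hook can be sketched in plain Python. This is not the MLX patch from the repo: the function name, the list-based scores, and the use of negative infinity as the "below the selection threshold" value are all assumptions made for illustration.

```python
import math


def select_experts(scores, top_k, resident_mask=None):
    """Pick the top_k experts by router score, optionally limited to a resident set.

    resident_mask=None keeps stock behavior. A boolean mask pushes
    non-resident experts to -inf before the top-k selection.
    """
    if resident_mask is not None:
        if sum(resident_mask) < top_k:
            raise ValueError("resident set is smaller than top_k")
        scores = [s if keep else -math.inf
                  for s, keep in zip(scores, resident_mask)]
    # Sort indices by descending score; ties are broken by index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:top_k]


scores = [0.1, 0.7, 0.05, 0.15]  # made-up router scores for 4 experts

print(select_experts(scores, top_k=2))                             # [1, 3]
print(select_experts(scores, top_k=2,
                     resident_mask=[True, False, True, True]))     # [3, 0]
```

With the mask unset, the selection is unchanged, which mirrors the "no behavior change when it is unset" property described above.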

I'm happy to open a PR against mlx-lm for the hook if it would be welcome. I wanted to check interest and approach here first rather than send an unsolicited PR. If you'd rather not carry it, it works fine as an external patch.

Questions

Does the opt-in _resident_mask hook seem reasonable to upstream (default off, no in-tree consumer yet), or is it better left as an external patch?
Has anyone measured routing concentration on other MoEs (GLM-4.x, Qwen3-235B, DeepSeek-V3)? I'd like to know whether the 90%-coverage predictor holds at larger scale.
Is there an MLX-friendly way to read only the resident expert rows from disk, so residency cuts load time as well as resident memory?

Related work from this project: the 290-run benchmark suite (#3300), pipeline-versus-tensor parallelism on Kimi-K2 1T (#2990), systematic quant/context benchmarks (#3209), an RDMA-over-Thunderbolt transfer guide (#3481), pipeline-parallelism patches for mlx-lm models (ml-explore/mlx-lm#1051), and an interactive distributed-inference simulator (ml-explore/mlx-lm#1070).
