Summary
When MoE experts are streamed on demand (e.g., from disk or CPU RAM to the
GPU), the expert cache could be driven by the observed activation pattern
instead of a generic LRU-style policy.
Motivation
Expert activation is not uniformly distributed: it strongly depends on the
workload. For example, when the model is used in chat mode in a specific
language (e.g., Italian), only a limited subset of experts is activated most
of the time.
Proposal
- Track expert activation statistics during inference (per layer, e.g., a
frequency counter or sliding window).
- Use these statistics to drive the cache: pin or prioritize the "hot"
experts for the current session/workload, evicting rarely used ones first
(LFU-like or frequency-weighted policy instead of plain LRU).
- Optionally allow saving/loading an activation profile, so a known workload
(e.g., "Italian chat") can pre-warm the cache at startup.
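The three points above can be sketched as a single toy cache. This is only an illustration under stated assumptions, not code from any existing runtime: the class name `FrequencyExpertCache`, its methods, the decay parameter and the JSON profile format are all hypothetical, and the actual weight transfer is left out.

```python
import json
from collections import defaultdict


class FrequencyExpertCache:
    """Toy expert cache with decayed-frequency eviction (hypothetical API)."""

    def __init__(self, capacity, decay=0.99):
        self.capacity = capacity          # max resident experts, assumed >= 1
        self.decay = decay                # multiplier applied to all scores per token
        self.scores = defaultdict(float)  # (layer, expert) -> decayed activation count
        self.resident = set()             # keys currently held in the cache
        self.hits = 0
        self.misses = 0

    def step(self):
        """Age all scores; call once per generated token so stale experts fade."""
        for key in self.scores:
            self.scores[key] *= self.decay

    def access(self, layer, expert):
        """Record one activation and return True on a cache hit."""
        key = (layer, expert)
        self.scores[key] += 1.0
        if key in self.resident:
            self.hits += 1
            return True
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # Candidate for eviction: the resident expert with the lowest score.
            victim = min(self.resident, key=lambda k: self.scores[k])
            if self.scores[key] < self.scores[victim]:
                return False  # colder than everything cached: stream it, do not keep it
            self.resident.discard(victim)
        self.resident.add(key)  # a real loader would copy the weights here
        return False

    def save_profile(self, path):
        """Write the scores as JSON, with keys encoded as "layer:expert"."""
        data = {f"{layer}:{expert}": s for (layer, expert), s in self.scores.items()}
        with open(path, "w") as f:
            json.dump(data, f)

    def load_profile(self, path):
        """Load saved scores and pre-warm the cache with the hottest experts."""
        with open(path) as f:
            data = json.load(f)
        for name, score in data.items():
            layer, expert = map(int, name.split(":"))
            self.scores[(layer, expert)] = score
        hottest = sorted(self.scores, key=self.scores.get, reverse=True)
        self.resident = set(hottest[: self.capacity])
```

In this sketch a cold expert that misses is used once but not retained, so a burst of rare experts cannot push out the hot set. That is the main behavioral difference from plain LRU, where the most recent access always wins.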
Expected benefit
Higher cache hit rate on the expert weights → fewer host-to-device (or
disk-to-RAM) transfers → faster token generation in streaming mode,
especially on systems where experts don't fit in VRAM.
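The chain above can be made concrete with a back-of-the-envelope model. The sketch below assumes that all experts have the same size, that every miss triggers a full transfer, and that transfers are not overlapped with compute. The function name and every number in the example are placeholders, not measurements.

```python
def transfer_seconds_per_token(hit_rate, layers, experts_per_token,
                               expert_bytes, bandwidth_bytes_per_s):
    """Expected time spent fetching missed experts for one generated token."""
    lookups = layers * experts_per_token  # expert fetches needed per token
    missed_bytes = (1.0 - hit_rate) * lookups * expert_bytes
    return missed_bytes / bandwidth_bytes_per_s


# Placeholder configuration: 32 MoE layers, 2 experts per token,
# 50 MB per expert, 2 GB/s effective transfer bandwidth.
config = dict(layers=32, experts_per_token=2,
              expert_bytes=50e6, bandwidth_bytes_per_s=2e9)

for hit_rate in (0.80, 0.90):
    seconds = transfer_seconds_per_token(hit_rate, **config)
    print(f"hit rate {hit_rate:.0%}: {seconds:.2f} s of transfers per token")
```

Under these assumptions transfer time scales with the miss rate (1 - hit rate), so raising the hit rate from 80% to 90% halves the time spent on transfers.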
Possible concerns
- Small overhead for tracking activations (one counter update per selected
expert per layer per token, which should be negligible next to a weight
transfer).
- The profile must adapt if the workload changes mid-session (sliding
window / decay handles this).
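How quickly a decayed counter adapts depends on the decay factor. The sketch below is a simplified model, assuming one decay step per token, a formerly hot expert that goes completely idle and a new expert that is activated on every token. Both function names are hypothetical.

```python
def decay_for_half_life(half_life_tokens):
    """Per-token decay factor that halves a score after `half_life_tokens` tokens."""
    return 0.5 ** (1.0 / half_life_tokens)


def tokens_until_overtake(old_score, decay):
    """Tokens until an expert activated on every token outscores an idle one.

    The idle expert starts at `old_score` and only decays; the new expert
    starts at 0 and gains 1.0 per token on top of the same decay.
    """
    old, new, tokens = old_score, 0.0, 0
    while new <= old:
        old *= decay
        new = new * decay + 1.0
        tokens += 1
    return tokens


decay = decay_for_half_life(100)
print("with decay:   ", tokens_until_overtake(50.0, decay), "tokens")
print("without decay:", tokens_until_overtake(50.0, 1.0), "tokens")
```

A shorter half-life makes the cache follow workload changes faster, at the cost of forgetting experts that are hot over the long term but used in bursts.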
This is especially relevant on memory-constrained consumer hardware
(e.g., Apple Silicon with 16 GB of unified memory), where the experts cannot
all stay resident and are repeatedly re-read from disk via mmap. In these
setups every cache miss costs a disk read, so even a modest improvement in
hit rate should noticeably improve token generation speed.