Feature request: usage-aware expert caching for MoE expert streaming #390

@GiorgioOppo

Summary

When streaming experts in MoE models (e.g., from disk or CPU RAM to GPU), the
caching policy could follow the observed expert usage pattern instead of a
generic LRU-style policy.

Motivation

Expert activation is not uniformly distributed: it strongly depends on the
workload. For example, when the model is used in chat mode in a specific
language (e.g., Italian), only a limited subset of experts is activated most
of the time.

Proposal

  • Track expert activation statistics during inference (per layer, e.g., a
    frequency counter or sliding window).
  • Use these statistics to drive the cache: pin or prioritize the "hot"
    experts for the current session/workload, evicting rarely used ones first
    (LFU-like or frequency-weighted policy instead of plain LRU).
  • Optionally allow saving/loading an activation profile, so a known workload
    (e.g., "Italian chat") can pre-warm the cache at startup.
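
To make the proposal concrete, here is a minimal sketch of a frequency-weighted cache with decay and a savable activation profile. Everything in it is hypothetical and for illustration only: the class name `UsageAwareExpertCache`, the `load_fn` loader callback, the `decay` value, and the `"layer:expert"` profile keys are my assumptions, not part of any existing API in this project.

```python
import json
from collections import defaultdict


class UsageAwareExpertCache:
    """Hypothetical frequency-weighted cache keyed by (layer, expert)."""

    def __init__(self, capacity, load_fn, decay=0.99):
        self.capacity = capacity          # max experts resident at once
        self.load_fn = load_fn            # assumed loader: (layer, expert) -> weights
        self.decay = decay                # per-token decay so old usage fades
        self.scores = defaultdict(float)  # decayed activation counts
        self.resident = {}                # (layer, expert) -> loaded weights
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        self.scores[key] += 1.0           # track every activation
        if key in self.resident:
            self.hits += 1
            return self.resident[key]
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the lowest decayed score.
            victim = min(self.resident, key=lambda k: self.scores[k])
            del self.resident[victim]
        self.resident[key] = self.load_fn(layer, expert)
        return self.resident[key]

    def end_of_token(self):
        # Call once per generated token to age all scores.
        for key in self.scores:
            self.scores[key] *= self.decay

    def export_profile(self):
        # JSON-serializable activation profile, e.g. for "Italian chat".
        return {f"{l}:{e}": s for (l, e), s in self.scores.items()}

    def prewarm(self, profile):
        # Intended for startup: seed scores, then load the hottest experts.
        for name, score in profile.items():
            layer, expert = map(int, name.split(":"))
            self.scores[(layer, expert)] = score
        hottest = sorted(self.scores, key=self.scores.get, reverse=True)
        for key in hottest[: self.capacity]:
            if key not in self.resident and len(self.resident) < self.capacity:
                self.resident[key] = self.load_fn(*key)


cache = UsageAwareExpertCache(capacity=2, load_fn=lambda l, e: f"weights[{l}][{e}]")
for expert in [0, 0, 0, 1, 2]:
    cache.get(0, expert)
print(sorted(cache.resident))  # [(0, 0), (0, 2)]
profile_json = json.dumps(cache.export_profile())
```

With this access pattern a plain LRU cache of the same size would evict expert 0, the most frequently used one, when expert 2 arrives; the frequency-weighted policy evicts expert 1 instead. A real implementation would probably keep separate statistics and budgets per layer and avoid the linear scan on eviction.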

Expected benefit

Higher cache hit rate on the expert weights → fewer host-to-device (or
disk-to-RAM) transfers → faster token generation in streaming mode,
especially on systems where experts don't fit in VRAM.
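
As a rough back-of-envelope model of this chain (my own simplification, not a measurement): let $h$ be the expert cache hit rate, $L$ the number of MoE layers, $k$ the number of experts routed per token per layer, $S$ the size of one expert in bytes, and $B$ the effective transfer bandwidth. Assuming a uniform hit rate across layers and transfers that are not overlapped with compute, the transfer time per token is approximately

$$
t_{\text{transfer}} \approx (1 - h)\, L\, k\, \frac{S}{B}
$$

Under these assumptions, raising the hit rate from $h$ to $h'$ scales the transfer time by $(1 - h') / (1 - h)$. With purely illustrative values, going from $h = 0.80$ to $h' = 0.90$ would halve it.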

Possible concerns

  • Small overhead for tracking activations (expected to be negligible: one
    counter update per routed expert).
  • The profile must adapt if the workload changes mid-session (sliding
    window / decay handles this).
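
A minimal sketch of the sliding-window variant, to show how the statistics could follow a workload change. The class name `SlidingWindowCounter`, its methods, and the window size are hypothetical; one such counter per layer is assumed.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Hypothetical per-layer counter over the last `window` activations."""

    def __init__(self, window):
        self.window = window
        self.events = deque()    # expert ids, oldest first
        self.counts = Counter()  # activations currently inside the window

    def record(self, expert_id):
        self.events.append(expert_id)
        self.counts[expert_id] += 1
        if len(self.events) > self.window:
            # Forget the activation that just fell out of the window.
            old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def hottest(self, n):
        # The n most frequently activated experts within the window.
        return [expert for expert, _ in self.counts.most_common(n)]


layer_stats = SlidingWindowCounter(window=4)
for expert_id in [1, 1, 1, 1, 2, 2, 2]:  # workload shifts from expert 1 to 2
    layer_stats.record(expert_id)
print(layer_stats.hottest(1))  # [2]
```

After the shift, expert 2 becomes the hottest entry once it dominates the window, so a cache driven by `hottest()` would stop protecting expert 1. The window size trades stability against reaction speed; exponential decay gives a similar effect without storing the event history.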

This is especially relevant on memory-constrained consumer hardware
(e.g., Apple Silicon with 16 GB of unified memory), where experts cannot all
stay resident and are repeatedly re-read from disk via mmap. Even a
modest improvement in cache hit rate would noticeably improve token
generation speed in these setups.
