Feature request: usage-aware expert caching for MoE expert streaming #390

## Summary
When MoE experts are streamed on demand (e.g., from disk or CPU RAM to the
GPU), the expert cache could be driven by the observed activation pattern
instead of a generic LRU-style policy.

## Motivation
Expert activation is not uniformly distributed: it depends strongly on the
workload. For example, when the model is used in chat mode in a specific
language (e.g., Italian), a limited subset of experts accounts for most of
the activations.

## Proposal
- Track expert activation statistics during inference (per layer, e.g., a
  frequency counter or sliding window).
- Use these statistics to drive the cache: pin or prioritize the "hot"
  experts for the current session/workload, evicting rarely used ones first
  (LFU-like or frequency-weighted policy instead of plain LRU).
- Optionally allow saving/loading an activation profile, so a known workload
  (e.g., "Italian chat") can pre-warm the cache at startup.
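
To make the three points above concrete, here is a minimal sketch in plain Python. It is not tied to any existing code in the project: `ExpertUsageTracker`, `FrequencyAwareCache`, `load_fn`, and the JSON profile layout are all hypothetical names chosen for illustration, and the "weights" are placeholder strings.

```python
import json
from collections import Counter, defaultdict


class ExpertUsageTracker:
    """Counts how often each (layer, expert) pair is activated."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # layer -> Counter({expert_id: hits})

    def record(self, layer, expert_ids):
        self.counts[layer].update(expert_ids)

    def frequency(self, layer, expert_id):
        return self.counts[layer][expert_id]

    def hottest(self, layer, n):
        return [e for e, _ in self.counts[layer].most_common(n)]

    def to_json(self):
        # Hypothetical profile format: {"layer": {"expert_id": count}}
        return json.dumps({str(l): dict(c) for l, c in self.counts.items()})

    @classmethod
    def from_json(cls, text):
        tracker = cls()
        for layer, experts in json.loads(text).items():
            tracker.counts[int(layer)] = Counter(
                {int(e): n for e, n in experts.items()}
            )
        return tracker


class FrequencyAwareCache:
    """Keeps at most `capacity` experts resident; evicts the least-used one."""

    def __init__(self, capacity, tracker, load_fn):
        self.capacity = capacity
        self.tracker = tracker
        self.load_fn = load_fn  # assumed callback: (layer, expert_id) -> weights
        self.resident = {}      # (layer, expert_id) -> weights
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        self.tracker.record(layer, [expert_id])
        key = (layer, expert_id)
        if key in self.resident:
            self.hits += 1
            return self.resident[key]
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # LFU-like eviction: drop the resident expert with the lowest count.
            victim = min(self.resident, key=lambda k: self.tracker.frequency(*k))
            del self.resident[victim]
        self.resident[key] = self.load_fn(layer, expert_id)
        return self.resident[key]

    def prewarm(self, per_layer):
        """Load the hottest experts from an existing profile before serving."""
        for layer in list(self.tracker.counts):
            for expert_id in self.tracker.hottest(layer, per_layer):
                if len(self.resident) < self.capacity:
                    key = (layer, expert_id)
                    self.resident[key] = self.load_fn(layer, expert_id)


def fake_load(layer, expert_id):
    # Stand-in for the real disk/RAM -> GPU transfer.
    return f"weights[{layer}][{expert_id}]"


cache = FrequencyAwareCache(capacity=2, tracker=ExpertUsageTracker(), load_fn=fake_load)
for expert in [1, 1, 1, 2, 3, 1]:
    cache.get(0, expert)
# Expert 1 stays resident when expert 3 arrives; plain LRU would have evicted it.

profile = cache.tracker.to_json()                 # save, e.g. as "italian-chat.json"
restored = ExpertUsageTracker.from_json(profile)  # load at next startup
```

A pure frequency count has a known weakness: a newly loaded expert starts with a low count and is the first eviction candidate, so a real policy would likely mix frequency with recency.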

## Expected benefit
Higher cache hit rate on the expert weights → fewer host-to-device (or
disk-to-RAM) transfers → faster token generation in streaming mode,
especially on systems where experts don't fit in VRAM.
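
As a rough back-of-the-envelope model of this chain (all symbols are assumptions introduced here, not measurements): let $h$ be the expert cache hit rate, $L$ the number of MoE layers, $k$ the experts routed per token per layer, $S$ the size of one expert in bytes, and $W$ the effective bandwidth of the slowest link involved.

$$
B_{\text{token}} \approx (1 - h)\, L\, k\, S,
\qquad
t_{\text{token}} \gtrsim \frac{B_{\text{token}}}{W}
$$

Transfer volume scales with the miss rate $1 - h$, not with $h$ itself: raising $h$ from 0.8 to 0.9 halves $B_{\text{token}}$. The bound on $t_{\text{token}}$ only matters when generation is transfer-bound, and it ignores compute time and any overlap between transfers and compute.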

## Possible concerns
- Small overhead for tracking activations (should be negligible).
- The profile must adapt if the workload changes mid-session (sliding
  window / decay handles this).
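
One way the decay idea could look, as a sketch only: an exponentially decayed counter in which old activations fade so the "hot" set can follow a workload change. The class name `DecayedUsage`, the per-token `tick()`, and the decay factor of 0.9 are illustrative assumptions.

```python
class DecayedUsage:
    """Exponentially decayed activation score per expert (lazy decay on read)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.step = 0
        self._score = {}  # key -> (score, step at last update)

    def tick(self):
        # Call once per generated token (or per batch).
        self.step += 1

    def score(self, key):
        value, last = self._score.get(key, (0.0, self.step))
        return value * self.decay ** (self.step - last)

    def record(self, key):
        self._score[key] = (self.score(key) + 1.0, self.step)


usage = DecayedUsage(decay=0.9)
for _ in range(50):          # first workload: only expert_a is used
    usage.tick()
    usage.record("expert_a")
for _ in range(50):          # workload changes: only expert_b is used
    usage.tick()
    usage.record("expert_b")

# Both experts were activated 50 times, so a plain counter would tie them;
# the decayed score ranks the recently used expert_b far above expert_a.
assert usage.score("expert_b") > usage.score("expert_a")
```

A smaller decay factor adapts faster but forgets a stable workload sooner; the right value would have to be tuned empirically.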

Usage-aware caching is especially relevant on memory-constrained consumer
hardware (e.g., Apple Silicon with 16 GB of unified memory), where the
experts cannot all stay resident and are repeatedly re-read from disk via
mmap. Even a modest improvement in cache hit rate would noticeably improve
token generation speed in these setups.
