fix(cache): enforce SSD cache limit across model switches (#1915) #1922

Closed
imi4u36d wants to merge 1 commit into jundot:main from imi4u36d:fix/ssd-cache-limit-model-switch

Conversation

imi4u36d commented Jun 18, 2026

Contributor

Problem

SSD cache usage exceeds the configured limit when switching between models.

When switching models, each model creates a new PagedSSDCacheManager sharing the same SSD cache directory. Incompatible blocks from other models are left on disk but not indexed, making them invisible to the LRU eviction logic. This causes two issues:

  1. _get_effective_max_size() miscalculates the limit — it uses index.total_size + disk_free, treating incompatible blocks' disk space as "available". The effective limit is inflated and eviction never triggers for those blocks.

  2. Incompatible blocks are never evicted: _enforce_size_limit_for_new_block() only evicts indexed (compatible) blocks via self._index.evict_until_size(). Blocks from previous models accumulate unboundedly.

Reproduction scenario from the issue:

  • 64GB SSD, 45GB cache limit configured
  • Model A fills 30GB of cache → 30GB on disk
  • Switch to Model B → index scans, total_size = 0 (Model A's blocks are incompatible)
  • disk_available = 0 + 34GB (free) = 34GB → effective_max = 33.6GB
  • Model B writes 33GB → actual disk usage = 63GB, far exceeding the 45GB limit
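
The arithmetic in this scenario can be sketched as follows. This is an illustration only: `SAFE_RATIO = 0.99` is an assumed placeholder rather than the project's actual `_DISK_SAFE_RATIO`, and taking the minimum of the configured limit and the disk-derived limit is an assumption about how the effective limit is combined.

```python
GB = 1024 ** 3
SAFE_RATIO = 0.99  # assumed placeholder, not the project's real _DISK_SAFE_RATIO


def pre_fix_effective_max(index_total_size, disk_free, configured_max):
    """Pre-fix calculation: only indexed bytes are counted as cache usage."""
    disk_available = index_total_size + disk_free
    disk_limit = int(disk_available * SAFE_RATIO)
    # Assumption: the effective limit is the smaller of the two bounds.
    return min(configured_max, disk_limit)


disk_capacity = 64 * GB
configured_max = 45 * GB
model_a_bytes = 30 * GB  # still on disk, but not indexed by Model B's manager
disk_free = disk_capacity - model_a_bytes

# Model B's index is empty after the scan, so index_total_size is 0.
limit_for_b = pre_fix_effective_max(0, disk_free, configured_max)
worst_case_on_disk = model_a_bytes + limit_for_b

print(f"limit seen by Model B: {limit_for_b / GB:.1f} GB")
print(f"worst-case cache directory size: {worst_case_on_disk / GB:.1f} GB")
```

The limit Model B sees stays below the configured 45GB, so nothing looks wrong from inside the manager, yet the directory as a whole ends up well above it.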

Fix

Two changes in omlx/cache/paged_ssd_cache.py:

1. _get_effective_max_size() — use actual disk usage

# Before (buggy):
disk_available = self._index.total_size + disk_free
disk_limit = int(disk_available * self._DISK_SAFE_RATIO)

# After (fixed):
# usage is the shutil.disk_usage() result for the cache directory
disk_limit = int((usage.used + usage.free) * self._DISK_SAFE_RATIO)

Using shutil.disk_usage().used + .free accounts for all files in the cache directory, including incompatible blocks from other models that aren't in this manager's index.

2. _scan_existing_files() + new _cleanup_incompatible_blocks() — evict stale blocks

During startup scan, incompatible blocks are now collected. If the total on-disk cache (compatible + incompatible) exceeds the effective limit, incompatible blocks are evicted oldest-first to reclaim space.
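
A minimal sketch of the oldest-first selection described above, not the code in this PR. The function name, the `(path, size, mtime)` tuple shape, and the use of modification time as the age signal are all assumptions made for illustration.

```python
def pick_incompatible_to_evict(blocks, compatible_bytes, effective_max):
    """Choose incompatible blocks to delete, oldest first.

    blocks: iterable of (path, size_bytes, mtime) for incompatible blocks
            (hypothetical shape, chosen for this sketch).
    Returns (paths_to_delete, total_bytes_after_eviction).
    """
    blocks = list(blocks)
    total = compatible_bytes + sum(size for _path, size, _mtime in blocks)
    victims = []
    # Smallest mtime first, i.e. the oldest block is evicted first.
    for path, size, _mtime in sorted(blocks, key=lambda b: b[2]):
        if total <= effective_max:
            break
        victims.append(path)
        total -= size
    return victims, total


victims, remaining = pick_incompatible_to_evict(
    [("a.blk", 10, 3.0), ("b.blk", 10, 1.0), ("c.blk", 10, 2.0)],
    compatible_bytes=5,
    effective_max=20,
)
print(victims, remaining)
```

Note that nothing is selected when the total is already within the limit at scan time, even if later writes by the new model would push the directory over it.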

Tests

Updated 4 test cases in test_paged_ssd_cache.py to match the corrected disk space calculation. All 134 SSD cache tests pass.

Closes #1915

When switching models, each model creates a new PagedSSDCacheManager
that shares the same SSD cache directory.  Incompatible blocks from
other models are left on disk but not indexed, making them invisible
to the LRU eviction logic.  This caused two problems:

1. _get_effective_max_size() used index.total_size + disk_free, which
   treated incompatible blocks' disk space as available, so the effective
   limit was inflated and eviction never triggered.

2. _enforce_size_limit_for_new_block() only evicts indexed (compatible)
   blocks, so incompatible blocks from previous models accumulated
   unboundedly, exceeding the configured cache limit.

Fix:
- Change _get_effective_max_size() to use shutil.disk_usage().used + .free
  so all on-disk usage (including incompatible blocks) is accounted for.
- Add _cleanup_incompatible_blocks() which removes incompatible blocks
  (oldest first) during startup scan when total on-disk cache exceeds
  the effective limit.

Fixes jundot#1915

jundot commented Jun 19, 2026

Owner

Thanks for tracing the incompatible-block path and putting this PR together. It pointed at the right area of the cache stack.

I’m going to close this version rather than merge it because I ended up taking a different accounting approach. The main issue is that usage.used + usage.free is effectively filesystem capacity, not cache-directory usage, and startup cleanup only handles incompatible blocks that already exceed the configured limit. In the reported scenario, old-model cache can remain below the configured limit, then the new model can add its own cache and push the shared cache directory over the limit.
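
To make the capacity-versus-directory distinction concrete, here is a small sketch comparing the two quantities. `directory_size` is a hypothetical helper written for this example and is not part of omlx; only `shutil.disk_usage`, `os.walk`, and `os.path.getsize` are real standard-library calls.

```python
import os
import shutil
import tempfile


def directory_size(path):
    """Sum the sizes of regular files under path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished between listing and stat
    return total


with tempfile.TemporaryDirectory() as cache_dir:
    with open(os.path.join(cache_dir, "block.bin"), "wb") as f:
        f.write(b"\0" * 4096)

    usage = shutil.disk_usage(cache_dir)
    capacity_like = usage.used + usage.free  # describes the whole filesystem
    dir_bytes = directory_size(cache_dir)    # describes only this directory

    print("used + free:", capacity_like)
    print("total:      ", usage.total)
    print("directory:  ", dir_bytes)
```

`used + free` stays close to the filesystem's total size no matter how much or how little the cache directory holds, so it cannot serve as a measure of cache usage.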

I handled this separately by tracking incompatible blocks for shared-budget accounting and eviction, while keeping them invisible to the current model’s lookup path. Thanks again for the investigation.
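
For readers following along, the general idea of shared-budget accounting could look something like the sketch below. This is not the code that was merged: the class, its method names, and the choice to evict other models' blocks before the current model's are all assumptions made for illustration.

```python
from collections import OrderedDict


class SharedBudgetIndex:
    """Hypothetical illustration; not the project's actual class."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._compatible = OrderedDict()  # block_id -> size, LRU order
        self._foreign = OrderedDict()     # other models' blocks, oldest first

    def add_foreign(self, block_id, size):
        """Record an incompatible block so it counts toward the budget."""
        self._foreign[block_id] = size

    @property
    def total_size(self):
        return sum(self._compatible.values()) + sum(self._foreign.values())

    def lookup(self, block_id):
        """Only compatible blocks are visible to lookups."""
        if block_id in self._compatible:
            self._compatible.move_to_end(block_id)
            return self._compatible[block_id]
        return None

    def add(self, block_id, size):
        """Insert a block, evicting until the shared budget is respected."""
        evicted = []
        while self.total_size + size > self.max_bytes:
            if self._foreign:        # assumed policy: foreign blocks go first
                victim, _ = self._foreign.popitem(last=False)
            elif self._compatible:
                victim, _ = self._compatible.popitem(last=False)
            else:
                break
            evicted.append(victim)
        self._compatible[block_id] = size
        return evicted


idx = SharedBudgetIndex(max_bytes=45)
idx.add_foreign("model_a_block", 30)
evicted = idx.add("model_b_block", 20)
print(evicted, idx.total_size, idx.lookup("model_a_block"))
```

The point of the sketch is that the size total and the lookup table are decoupled: every block on disk counts toward the limit, but only the current model's blocks can be returned.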

jundot closed this Jun 19, 2026

Development

Successfully merging this pull request may close these issues.

[BUG] The SSD cache usage exceeds the limit.

2 participants