fix(cache): enforce SSD cache limit across model switches (#1915) by imi4u36d · Pull Request #1922 · jundot/omlx

imi4u36d · 2026-06-18T01:11:57Z

Problem

SSD cache usage exceeds the configured limit when switching between models.

When switching models, each model creates a new PagedSSDCacheManager sharing the same SSD cache directory. Incompatible blocks from other models are left on disk but not indexed, making them invisible to the LRU eviction logic. This causes two issues:

1. _get_effective_max_size() miscalculates the limit: it uses index.total_size + disk_free, which treats the disk space held by incompatible blocks as "available". The effective limit is inflated, and eviction never triggers for those blocks.
2. Incompatible blocks are never evicted: _enforce_size_limit_for_new_block() only evicts indexed (compatible) blocks via self._index.evict_until_size(), so blocks from previous models accumulate without bound.

Reproduction scenario from the issue:

- 64GB SSD, 45GB cache limit configured
- Model A fills 30GB of cache → 30GB on disk
- Switch to Model B → index scans, total_size = 0 (Model A's blocks are incompatible)
- disk_available = 0 + 34GB (free) = 34GB → effective_max = 33.6GB
- Model B writes 33GB → actual disk usage = 63GB, far exceeding the 45GB limit
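
The arithmetic in this scenario can be replayed with a short sketch of the pre-fix calculation. The value of `DISK_SAFE_RATIO` and the final `min()` against the configured limit are assumptions made for illustration; neither is shown in this PR, and the function name `buggy_effective_max` is hypothetical.

```python
GB = 1024 ** 3
DISK_SAFE_RATIO = 0.99  # assumed value; the real constant is not shown in this PR


def buggy_effective_max(configured_limit, indexed_size, disk_free,
                        safe_ratio=DISK_SAFE_RATIO):
    """Pre-fix calculation: only indexed bytes plus free bytes are considered."""
    disk_available = indexed_size + disk_free
    disk_limit = int(disk_available * safe_ratio)
    # Assumption: the effective limit is the smaller of the two bounds.
    return min(configured_limit, disk_limit)


# Scenario from the issue: 64GB SSD, 45GB limit, 30GB left behind by Model A.
model_a_on_disk = 30 * GB
effective_max = buggy_effective_max(
    configured_limit=45 * GB,
    indexed_size=0,        # Model A's blocks are not in Model B's index
    disk_free=34 * GB,
)
# Model B may fill its own index up to effective_max while Model A's
# blocks stay on disk, so the directory can hold both amounts at once.
worst_case_on_disk = model_a_on_disk + effective_max

print(f"effective_max      = {effective_max / GB:.2f} GB")
print(f"worst-case on disk = {worst_case_on_disk / GB:.2f} GB")
```

Under these assumptions the limit seen by Model B stays just below the free space, while the combined on-disk total ends up well above the configured 45GB.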

Fix

Two changes in omlx/cache/paged_ssd_cache.py:

1. `_get_effective_max_size()` — use actual disk usage

# Before (buggy):
disk_available = self._index.total_size + disk_free
disk_limit = int(disk_available * self._DISK_SAFE_RATIO)

# After (fixed):
disk_limit = int((usage.used + usage.free) * self._DISK_SAFE_RATIO)

Using shutil.disk_usage().used + .free accounts for all files in the cache directory, including incompatible blocks from other models that aren't in this manager's index.

2. `_scan_existing_files()` + new `_cleanup_incompatible_blocks()` — evict stale blocks

During the startup scan, incompatible blocks are now collected. If the total on-disk cache (compatible + incompatible) exceeds the effective limit, incompatible blocks are evicted oldest-first to reclaim space.
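
A minimal sketch of this oldest-first cleanup follows. It is not the code from the PR: the function signature, the use of file modification time as the age, and the return value are all assumptions chosen to keep the example self-contained.

```python
from pathlib import Path
from typing import Iterable, List


def cleanup_incompatible_blocks(
    incompatible: Iterable[Path],
    compatible_size: int,
    effective_max: int,
) -> List[Path]:
    """Delete incompatible block files, oldest first, until the total fits.

    Returns the paths that were removed. Hypothetical helper for illustration.
    """
    # Snapshot (mtime, size, path) once so sorting and accounting agree.
    entries = []
    for path in incompatible:
        st = path.stat()
        entries.append((st.st_mtime, st.st_size, path))
    entries.sort(key=lambda entry: entry[0])  # oldest modification time first

    total = compatible_size + sum(size for _, size, _ in entries)
    removed: List[Path] = []
    for _, size, path in entries:
        if total <= effective_max:
            break  # already within the limit, keep the remaining blocks
        path.unlink(missing_ok=True)
        total -= size
        removed.append(path)
    return removed
```

Note that a cleanup of this shape only acts when the total already exceeds the limit at scan time; it does nothing about blocks written afterwards.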

Tests

Updated 4 test cases in test_paged_ssd_cache.py to match the corrected disk space calculation. All 134 SSD cache tests pass.

Closes #1915

When switching models, each model creates a new PagedSSDCacheManager that shares the same SSD cache directory. Incompatible blocks from other models are left on disk but not indexed, making them invisible to the LRU eviction logic. This caused two problems:

1. _get_effective_max_size() used index.total_size + disk_free, which treated incompatible blocks' disk space as available, so the effective limit was inflated and eviction never triggered.
2. _enforce_size_limit_for_new_block() only evicts indexed (compatible) blocks, so incompatible blocks from previous models accumulated unboundedly, exceeding the configured cache limit.

Fix:

- Change _get_effective_max_size() to use shutil.disk_usage().used + .free so all on-disk usage (including incompatible blocks) is accounted for.
- Add _cleanup_incompatible_blocks() which removes incompatible blocks (oldest first) during startup scan when total on-disk cache exceeds the effective limit.

Fixes jundot#1915

jundot · 2026-06-19T03:43:57Z

Thanks for tracing the incompatible-block path and putting this PR together. It pointed at the right area of the cache stack.

I’m going to close this version rather than merge it because I ended up taking a different accounting approach. The main issue is that usage.used + usage.free is effectively filesystem capacity, not cache-directory usage, and startup cleanup only handles incompatible blocks that already exceed the configured limit. In the reported scenario, old-model cache can remain below the configured limit, then the new model can add its own cache and push the shared cache directory over the limit.
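
The distinction between filesystem capacity and cache-directory usage can be illustrated with the standard library. The helper names `directory_usage` and `filesystem_capacity` are hypothetical, and summing `st_size` is only one possible way to measure a directory (it reports apparent file size, not allocated blocks).

```python
import os
import shutil
from pathlib import Path


def directory_usage(cache_dir: Path) -> int:
    """Sum the sizes of all regular files under cache_dir, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            try:
                total += (Path(root) / name).stat().st_size
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
    return total


def filesystem_capacity(cache_dir: Path) -> int:
    """used + free for the filesystem that holds cache_dir.

    shutil.disk_usage reports on the whole filesystem, so this value does not
    change with how much of the space belongs to the cache directory itself.
    """
    usage = shutil.disk_usage(cache_dir)
    return usage.used + usage.free
```

For a nearly empty cache directory on a large disk, `directory_usage` returns a few bytes while `filesystem_capacity` returns roughly the size of the disk, which is why the latter cannot stand in for cache usage.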

I handled this separately by tracking incompatible blocks for shared-budget accounting and eviction, while keeping them invisible to the current model’s lookup path. Thanks again for the investigation.
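
One way to picture this kind of shared-budget accounting is the toy index below. It is not the maintainer's implementation, which is not shown on this page; the class name, methods, and eviction order are all assumptions. Every block counts toward one budget and can be evicted, but only compatible blocks are returned by lookup.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class SharedBudgetIndex:
    """Toy index: all blocks share one budget, only compatible ones are visible."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.total_size = 0
        # block_id -> (size, compatible); insertion order is oldest first.
        self._blocks: "OrderedDict[str, Tuple[int, bool]]" = OrderedDict()

    def add(self, block_id: str, size: int, compatible: bool) -> List[str]:
        """Register a block and return the ids evicted to stay within budget."""
        if block_id in self._blocks:
            old_size, _ = self._blocks.pop(block_id)
            self.total_size -= old_size
        self._blocks[block_id] = (size, compatible)
        self.total_size += size

        evicted: List[str] = []
        # Evict oldest entries, compatible or not, but never the new block.
        while self.total_size > self.max_size and len(self._blocks) > 1:
            victim, (victim_size, _) = self._blocks.popitem(last=False)
            self.total_size -= victim_size
            evicted.append(victim)
        return evicted

    def lookup(self, block_id: str) -> Optional[int]:
        """Return the size of a compatible block, or None if absent or incompatible."""
        entry = self._blocks.get(block_id)
        if entry is None or not entry[1]:
            return None
        self._blocks.move_to_end(block_id)  # mark as most recently used
        return entry[0]
```

With a budget of 45 units, registering a 30-unit incompatible block and then adding compatible blocks evicts the old block as soon as the combined total would pass 45, while lookups for the incompatible block always miss.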

imi4u36d mentioned this pull request Jun 18, 2026

[BUG] The SSD cache usage exceeds the limit. #1915

Open

jundot closed this Jun 19, 2026

imi4u36d wants to merge 1 commit into jundot:main from imi4u36d:fix/ssd-cache-limit-model-switch
