fix(cache): enforce SSD cache limit across model switches (#1915) #1922
imi4u36d wants to merge 1 commit
When switching models, each model creates a new PagedSSDCacheManager that shares the same SSD cache directory. Incompatible blocks from other models are left on disk but not indexed, making them invisible to the LRU eviction logic. This caused two problems:

1. _get_effective_max_size() used index.total_size + disk_free, which treated incompatible blocks' disk space as available, so the effective limit was inflated and eviction never triggered.
2. _enforce_size_limit_for_new_block() only evicts indexed (compatible) blocks, so incompatible blocks from previous models accumulated unboundedly, exceeding the configured cache limit.

Fix:

- Change _get_effective_max_size() to use shutil.disk_usage().used + .free so all on-disk usage (including incompatible blocks) is accounted for.
- Add _cleanup_incompatible_blocks(), which removes incompatible blocks (oldest first) during the startup scan when the total on-disk cache exceeds the effective limit.

Fixes jundot#1915
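The second problem can be illustrated with a toy model. This is only a sketch: ToyIndex, needs_eviction, and the sizes used are hypothetical stand-ins, not the actual omlx classes or the figures from the issue. It shows that a size check based solely on the index reports "within budget" while the directory is already over the limit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a per-model block index (not the real omlx class)."""

    blocks: dict[str, int] = field(default_factory=dict)  # block id -> size in bytes

    @property
    def total_size(self) -> int:
        return sum(self.blocks.values())


def needs_eviction(index: ToyIndex, new_block_size: int, max_size: int) -> bool:
    """Size check that only sees indexed (compatible) blocks."""
    return index.total_size + new_block_size > max_size


GB = 1024**3
max_size = 20 * GB  # assumed configured cache limit

# Blocks written by a previous model: present on disk, absent from the new index.
on_disk_unindexed = 30 * GB

# The manager created for the new model starts with an empty index.
index = ToyIndex()
new_block = 1 * GB

triggered = needs_eviction(index, new_block, max_size)
actual_on_disk = on_disk_unindexed + index.total_size + new_block
over_limit = actual_on_disk > max_size
```

Here triggered is False even though over_limit is True, which is the gap the commit message describes: bytes that are not indexed are never counted against the limit.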
Thanks for tracing the incompatible-block path and putting this PR together. It pointed at the right area of the cache stack.

I'm going to close this version rather than merge it because I ended up taking a different accounting approach. The main issue is that I handled this separately by tracking incompatible blocks for shared-budget accounting and eviction, while keeping them invisible to the current model's lookup path.

Thanks again for the investigation.
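One way to read the approach described in this comment is sketched below. It is not the maintainer's code: SharedBudgetIndex and its methods are hypothetical names, and the LRU policy is an assumption made for illustration. Every block counts toward one byte budget and can be evicted, but only compatible blocks are returned by lookups.

```python
from collections import OrderedDict
from typing import Optional


class SharedBudgetIndex:
    """Hypothetical index: one shared byte budget, two visibility classes."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        # block id -> (size, compatible); insertion order doubles as LRU order.
        self._blocks: "OrderedDict[str, tuple[int, bool]]" = OrderedDict()
        self.total_size = 0

    def add(self, block_id: str, size: int, compatible: bool) -> list[str]:
        """Register a block, then evict least-recently-used blocks to fit the budget."""
        previous = self._blocks.pop(block_id, None)
        if previous is not None:
            self.total_size -= previous[0]  # avoid double-counting a re-added block
        self._blocks[block_id] = (size, compatible)
        self.total_size += size

        evicted = []
        # Keep at least the block that was just added.
        while self.total_size > self.max_size and len(self._blocks) > 1:
            victim, (victim_size, _) = self._blocks.popitem(last=False)
            self.total_size -= victim_size
            evicted.append(victim)
        return evicted

    def lookup(self, block_id: str) -> Optional[int]:
        """Return the block size for compatible blocks only; others stay invisible."""
        entry = self._blocks.get(block_id)
        if entry is None or not entry[1]:
            return None
        self._blocks.move_to_end(block_id)  # mark as most recently used
        return entry[0]


idx = SharedBudgetIndex(max_size=100)
idx.add("old-model-block", 60, compatible=False)
idx.add("new-1", 30, compatible=True)
hidden = idx.lookup("old-model-block")
visible = idx.lookup("new-1")
evicted = idx.add("new-2", 40, compatible=True)
```

In this toy run the block from the old model is never returned by lookup, yet it is the first one evicted once the shared budget is exceeded.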
Problem
SSD cache usage exceeds the configured limit when switching between models.
When switching models, each model creates a new PagedSSDCacheManager sharing the same SSD cache directory. Incompatible blocks from other models are left on disk but not indexed, making them invisible to the LRU eviction logic. This causes two issues:

1. _get_effective_max_size() miscalculates the limit: it uses index.total_size + disk_free, treating incompatible blocks' disk space as "available". The effective limit is inflated and eviction never triggers for those blocks.
2. Incompatible blocks are never evicted: _enforce_size_limit_for_new_block() only evicts indexed (compatible) blocks via self._index.evict_until_size(). Blocks from previous models accumulate unboundedly.

Reproduction scenario from the issue:

disk_available = 0 + 34GB (free) = 34GB → effective_max = 33.6GB

Fix
Two changes in omlx/cache/paged_ssd_cache.py:

1. _get_effective_max_size(): use actual disk usage

Using shutil.disk_usage().used + .free accounts for all files in the cache directory, including incompatible blocks from other models that aren't in this manager's index.

2. _scan_existing_files() + new _cleanup_incompatible_blocks(): evict stale blocks

During the startup scan, incompatible blocks are now collected. If the total on-disk cache (compatible + incompatible) exceeds the effective limit, incompatible blocks are evicted oldest-first to reclaim space.
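The oldest-first cleanup can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the function signature, the .blk file names, and the use of file modification time as the age of a block are all hypothetical. Because shutil.disk_usage reports figures for the whole filesystem containing the given path, the sketch sums file sizes directly to measure what the cache directory holds.

```python
import os
import tempfile
from pathlib import Path


def cleanup_incompatible_blocks(
    incompatible: list[Path], compatible_size: int, max_size: int
) -> list[Path]:
    """Delete incompatible block files, oldest first, until the total fits max_size."""
    # Stat each file once: (modification time, size in bytes, path).
    entries = []
    for path in incompatible:
        st = path.stat()
        entries.append((st.st_mtime, st.st_size, path))
    entries.sort(key=lambda e: e[0])  # oldest modification time first

    total = compatible_size + sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in entries:
        if total <= max_size:
            break  # budget satisfied; keep the remaining (newer) blocks
        path.unlink()
        total -= size
        removed.append(path)
    return removed


def demo() -> list[str]:
    """Create three 100-byte stale blocks and trim the cache to a 200-byte budget."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for age, name in enumerate(["oldest.blk", "middle.blk", "newest.blk"]):
            path = Path(tmp) / name
            path.write_bytes(b"\0" * 100)
            # Set explicit timestamps so the ordering is deterministic.
            os.utime(path, (1_000 + age, 1_000 + age))
            paths.append(path)
        removed = cleanup_incompatible_blocks(paths, compatible_size=50, max_size=200)
        return [p.name for p in removed]
```

With 50 bytes of compatible data and 300 bytes of stale blocks against a 200-byte budget, demo() removes oldest.blk and middle.blk and keeps newest.blk.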
Tests
Updated 4 test cases in test_paged_ssd_cache.py to match the corrected disk space calculation. All 134 SSD cache tests pass.

Closes #1915