ArchiveScanner: fix WriteCacheData crash when pool cache outlives .sdp archives#2961
Conversation
…p archives CArchiveScanner::WriteCacheData has a verification loop that erases- during-iteration on poolFilesInfo (a spring::unordered_map). Two design defects combine into a deterministic 0xc0000005 crash: A. Empty-map dereference. The loop wraps it == end() back to begin(), then breaks if it == st (the original randomized start). When the loop empties the map mid-iteration, begin() == end() but st points at a smaller bucket index, so the next ExistenceTest(end()) call dereferences past the allocated buckets array. B. Vacuous missing-verdict when allPoolRootDirs is empty. The loop builds allPoolRootDirs by filtering archiveInfos for .sdp entries. With zero .sdp archives in scan dirs (e.g. dev install with empty data/packages/ but a stale archivecache.lua from a prior prod run), ExistenceTest iterates an empty set and returns false unconditionally. Every entry gets erased, then defect A triggers. Fix: - Wrap the verification loop in if (!allPoolRootDirs.empty()) -- with no roots, the only safe action is to leave the cache alone, since the ExistenceTest cannot meaningfully verify anything. - Add if (poolFilesInfo.empty()) break inside the loop body as defense in depth against any other future code path that drains the map mid- iteration. Reproduces deterministically on WSL2 dev installs that point at the prod BAR launcher data dir: prod populated archivecache.lua with 10k+ pool entries, dev clears data/packages/, next launch erases the whole cache and crashes. Regression originally introduced in spring/spring PR beyond-all-reason#2078 (deadline target for pool files scanning), which replaced a naturally- terminating it != end() loop with a time-bounded wraparound loop.
Co-authored-by: sprunk <spr.ng@o2.pl>
specific variable conveying what we're protecting against Co-authored-by: sprunk <spr.ng@o2.pl>
|
What if The change so far looks good despite that potential remaining bug though. |
unordered_map::erase invalidates the erased iterator. If st is the entry being erased, the termination check silently stops working and the loop runs to its 1s deadline re-checking survivors. Re-anchor st to erase()'s return value when st is the one erased; wrap to begin() if that landed on end()
|
Right. st dangles after erase, so it == st never trips again. Any iterator held across a mutation is suspect; erase(it) only promises a valid return value for it itself, not for other iterators into the container. My higher order language intuition completely failed me here. The probabilistic sweep still feels shaky to me, is there a reason we can't do a bounded full pass here and bail for pathological cases? Happy to take it as a follow up if so. |
@lhog would be best to say here, though AFAICT the reason is that 1s timeout is important for UX reasons, but if (say) you can process 1234 files per second then quickly enough you'd end up always processing the same first 1234 files on each run and never reach the others before it times out. |
|
On the thrash concern: the loop picks a random startOffset each call (mt19937 seeded from system time), so different launches sample different windows. So it shouldn't lock onto the same 1234 files. Sorts itself out over N runs. Follow-up idea regardless (for a subsequent PR): stat pool root mtimes on startup, store them in the cache, skip the sweep when unchanged. When they have changed, do a full bounded pass instead of the probabilistic loop. The mtime gate means we only pay that cost when something actually moved. I like this because the cache becomes authoritative - downstream code can trust it instead of treating it as a best-effort. |
|
The idea was to have it time bound due to not being able to predict the amount of files in the pool folder and how fast or slow the file IO is. Also to have the start element randomized, so across several runs it eventually cleans out. I'm aware that that loop had pathological behavior in a few rare cases, but IIRC the fix looked too clumsy so I skipped it. Will have a look into the PR, hopefully this week. |
|
@lhog No rush man, but this is probably ready to go as-is. Definitely want your review but I don't think we need to worry about any further refactors here, was just me spitballing. |
fyi ivand is taking a break atm, so you may need to get another reviewer if you for some reason want this in soon. |
|
@sprunk do you want to merge this? I think it might solve some weird edge cases people would just wipe their cache for (if they're experienced enough) |
|
Thanks. Maybe @lhog will still find improvements when he comes back but let's take what's there. |
…p archives (beyond-all-reason#2961) Fixes an invalid end iterator dereference
Consolidates main's non-layer content onto the canonical macos-layer base: .github/ CI/deployment workflows plus the fork's engine customizations (GetPrevFrameChecksum, Engine.FeatureSupport.hasChecksums, focus-surface naming, ArchiveScanner beyond-all-reason#2961). Tree is preserved verbatim from main; only the history is reorganised so the macOS layer is the canonical macos-layer series underneath.
CArchiveScanner::WriteCacheData has a verification loop that erases-during-iteration on poolFilesInfo (a spring::unordered_map). Two design defects combine into a deterministic 0xc0000005 crash:
A. Empty-map dereference. The loop wraps it == end() back to begin(), then breaks if it == st (the original randomized start). When the loop empties the map mid-iteration, begin() == end() but st points at a smaller bucket index, so the next ExistenceTest(end()) call dereferences past the allocated buckets array.
B. Vacuous missing-verdict when allPoolRootDirs is empty. The loop builds allPoolRootDirs by filtering archiveInfos for .sdp entries. With zero .sdp archives in scan dirs (e.g. dev install with empty data/packages/ but a stale archivecache.lua from a prior prod run), ExistenceTest iterates an empty set and returns false unconditionally. Every entry gets erased, then defect A triggers.
Fix:
Wrap the verification loop in if (!allPoolRootDirs.empty()) -- with no roots, the only safe action is to leave the cache alone, since the ExistenceTest cannot meaningfully verify anything.
Add if (poolFilesInfo.empty()) break inside the loop body as defense in depth against any other future code path that drains the map mid-iteration.
Reproduces deterministically on WSL2 dev installs that point at the prod BAR launcher data dir: prod populated archivecache.lua with 10k+ pool entries, dev clears data/packages/, next launch erases the whole cache and crashes.
Regression originally introduced in spring/spring PR #2078 (deadline target for pool files scanning), which replaced a naturally-terminating it != end() loop with a time-bounded wraparound loop.
LLMs
yes