Skip to content

ArchiveScanner: fix WriteCacheData crash when pool cache outlives .sdp archives#2961

Merged
sprunk merged 4 commits into
beyond-all-reason:masterfrom
keithharvey:fix/archivescanner-empty-pool-roots-crash
Jun 10, 2026
Merged

ArchiveScanner: fix WriteCacheData crash when pool cache outlives .sdp archives#2961
sprunk merged 4 commits into
beyond-all-reason:masterfrom
keithharvey:fix/archivescanner-empty-pool-roots-crash

Conversation

@keithharvey

@keithharvey keithharvey commented May 5, 2026

Copy link
Copy Markdown
Contributor

CArchiveScanner::WriteCacheData has a verification loop that erases-during-iteration on poolFilesInfo (a spring::unordered_map). Two design defects combine into a deterministic 0xc0000005 crash:

A. Empty-map dereference. The loop wraps it == end() back to begin(), then breaks if it == st (the original randomized start). When the loop empties the map mid-iteration, begin() == end() but st points at a smaller bucket index, so the next ExistenceTest(end()) call dereferences past the allocated buckets array.

B. Vacuous missing-verdict when allPoolRootDirs is empty. The loop builds allPoolRootDirs by filtering archiveInfos for .sdp entries. With zero .sdp archives in scan dirs (e.g. dev install with empty data/packages/ but a stale archivecache.lua from a prior prod run), ExistenceTest iterates an empty set and returns false unconditionally. Every entry gets erased, then defect A triggers.

Fix:

  • Wrap the verification loop in if (!allPoolRootDirs.empty()) -- with no roots, the only safe action is to leave the cache alone, since the ExistenceTest cannot meaningfully verify anything.

  • Add if (poolFilesInfo.empty()) break inside the loop body as defense in depth against any other future code path that drains the map mid-iteration.

Reproduces deterministically on WSL2 dev installs that point at the prod BAR launcher data dir: prod populated archivecache.lua with 10k+ pool entries, dev clears data/packages/, next launch erases the whole cache and crashes.

Regression originally introduced in spring/spring PR #2078 (deadline target for pool files scanning), which replaced a naturally-terminating it != end() loop with a time-bounded wraparound loop.

LLMs

yes

…p archives

CArchiveScanner::WriteCacheData has a verification loop that erases-

during-iteration on poolFilesInfo (a spring::unordered_map). Two

design defects combine into a deterministic 0xc0000005 crash:

A. Empty-map dereference. The loop wraps it == end() back to begin(),

   then breaks if it == st (the original randomized start). When the

   loop empties the map mid-iteration, begin() == end() but st points

   at a smaller bucket index, so the next ExistenceTest(end()) call

   dereferences past the allocated buckets array.

B. Vacuous missing-verdict when allPoolRootDirs is empty. The loop

   builds allPoolRootDirs by filtering archiveInfos for .sdp entries.

   With zero .sdp archives in scan dirs (e.g. dev install with empty

   data/packages/ but a stale archivecache.lua from a prior prod run),

   ExistenceTest iterates an empty set and returns false unconditionally.

   Every entry gets erased, then defect A triggers.

Fix:

- Wrap the verification loop in if (!allPoolRootDirs.empty()) -- with

  no roots, the only safe action is to leave the cache alone, since the

  ExistenceTest cannot meaningfully verify anything.

- Add if (poolFilesInfo.empty()) break inside the loop body as defense

  in depth against any other future code path that drains the map mid-

  iteration.

Reproduces deterministically on WSL2 dev installs that point at the

prod BAR launcher data dir: prod populated archivecache.lua with 10k+

pool entries, dev clears data/packages/, next launch erases the whole

cache and crashes.

Regression originally introduced in spring/spring PR beyond-all-reason#2078 (deadline

target for pool files scanning), which replaced a naturally-

terminating it != end() loop with a time-bounded wraparound loop.
Comment thread rts/System/FileSystem/ArchiveScanner.cpp Outdated
Comment thread rts/System/FileSystem/ArchiveScanner.cpp Outdated
keithharvey and others added 2 commits May 5, 2026 08:43
Co-authored-by: sprunk <spr.ng@o2.pl>
specific variable conveying what we're protecting against

Co-authored-by: sprunk <spr.ng@o2.pl>
@sprunk

sprunk commented May 5, 2026

Copy link
Copy Markdown
Collaborator

What if st is erased? Won't it fail to hit the it == st condition and go through already-processed elements over and over until it hits the timeout?

The change so far looks good despite that potential remaining bug though.

  unordered_map::erase invalidates the erased iterator. If st is the
  entry being erased, the  termination check silently stops
  working and the loop runs to its 1s deadline re-checking survivors.

  Re-anchor st to erase()'s return value when st is the one erased;
  wrap to begin() if that landed on end()
@keithharvey

keithharvey commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Right. st dangles after erase, so it == st never trips again. Any iterator held across a mutation is suspect; erase(it) only promises a valid return value for it itself, not for other iterators into the container. My higher order language intuition completely failed me here.

The probabilistic sweep still feels shaky to me, is there a reason we can't do a bounded full pass here and bail for pathological cases? Happy to take it as a follow up if so.

@sprunk

sprunk commented May 5, 2026

Copy link
Copy Markdown
Collaborator

is there a reason we can't do a bounded full pass here and bail for pathological cases?

@lhog would be best to say here, though AFAICT the reason is that 1s timeout is important for UX reasons, but if (say) you can process 1234 files per second then quickly enough you'd end up always processing the same first 1234 files on each run and never reach the others before it times out.

@sprunk sprunk requested a review from lhog May 5, 2026 16:23
@keithharvey

Copy link
Copy Markdown
Contributor Author

On the thrash concern: the loop picks a random startOffset each call (mt19937 seeded from system time), so different launches sample different windows. So it shouldn't lock onto the same 1234 files. Sorts itself out over N runs.

Follow-up idea regardless (for a subsequent PR): stat pool root mtimes on startup, store them in the cache, skip the sweep when unchanged. When they have changed, do a full bounded pass instead of the probabilistic loop. The mtime gate means we only pay that cost when something actually moved. I like this because the cache becomes authoritative - downstream code can trust it instead of treating it as a best-effort.

@lhog

lhog commented May 5, 2026

Copy link
Copy Markdown
Collaborator

The idea was to have it time bound due to not being able to predict the amount of files in the pool folder and how fast or slow the file IO is. Also to have the start element randomized, so across several runs it eventually cleans out.

I'm aware that that loop had pathological behavior in a few rare cases, but IIRC the fix looked too clumsy so I skipped it. Will have a look into the PR, hopefully this week.

@keithharvey

Copy link
Copy Markdown
Contributor Author

@lhog No rush man, but this is probably ready to go as-is. Definitely want your review but I don't think we need to worry about any further refactors here, was just me spitballing.

@bruno-dasilva

Copy link
Copy Markdown
Collaborator

@lhog No rush man, but this is probably ready to go as-is. Definitely want your review but I don't think we need to worry about any further refactors here, was just me spitballing.

fyi ivand is taking a break atm, so you may need to get another reviewer if you for some reason want this in soon.

@keithharvey

Copy link
Copy Markdown
Contributor Author

@sprunk do you want to merge this? I think it might solve some weird edge cases people would just wipe their cache for (if they're experienced enough)

@sprunk sprunk merged commit c6fb86e into beyond-all-reason:master Jun 10, 2026
3 checks passed
@sprunk

sprunk commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Thanks. Maybe @lhog will still find improvements when he comes back but let's take what's there.

Mearman pushed a commit to ExaDev/RecoilEngine that referenced this pull request Jun 12, 2026
Mearman added a commit to ExaDev/RecoilEngine that referenced this pull request Jun 16, 2026
Consolidates main's non-layer content onto the canonical macos-layer base:
.github/ CI/deployment workflows plus the fork's engine customizations
(GetPrevFrameChecksum, Engine.FeatureSupport.hasChecksums, focus-surface naming,
ArchiveScanner beyond-all-reason#2961). Tree is preserved verbatim from main; only the history is
reorganised so the macOS layer is the canonical macos-layer series underneath.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants