Skip to content

Fix flaky rollback-footprint assertion in TruncateOrphansAfterRecoveryIT#1126

Open
Andrii Lomakin (andrii0lomakin) wants to merge 2 commits into
developfrom
ci-fix/20260604-173513
Open

Fix flaky rollback-footprint assertion in TruncateOrphansAfterRecoveryIT#1126
Andrii Lomakin (andrii0lomakin) wants to merge 2 commits into
developfrom
ci-fix/20260604-173513

Conversation

@andrii0lomakin

@andrii0lomakin Andrii Lomakin (andrii0lomakin) commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes a flaky failure of TruncateOrphansAfterRecoveryIT.cleanReopenAfterRollbackSkipsPassAndLeavesNoOrphan that turned the integration pipeline red on develop (failed run, macOS arm / JDK 21). The fix is test-only; production code is unchanged.

The build runs with -Dmaven.test.failure.ignore=true, so the EnricoMi "Publish Test Results" step is the gate that fails the job when any test result is a failure.

Root Cause

The test asserted that the physical .pcl File.length() measured immediately after a rolled-back transaction equals the pre-TX baseline (expected: 1024L but was: 9216L). That is not a deterministic function of the rollback:

  • A fresh cluster eagerly allocates its first data page when the first row is inserted.
  • WOWCache may flush that buffered page to disk on its background thread at any point during the doomed transaction.
  • The rollback restores the logical horizon but does not physically truncate an already-flushed page — a tolerated physical orphan, reclaimed lazily by the allocator or by the orphan-truncation pass on a dirty reopen (pinned by cleanReopenDoesNotRepairPreExistingPhysicalOrphan).

So File.length() after a rollback reflects background-flush timing, not the rollback outcome. On most platforms the eager page was still buffered when the assertion read the size (so it read the baseline); on macOS arm / JDK 21 the background flush had landed (9216 = 1024 + 8192). The same class of WOWCache lazy-flush timing bug was fixed for the commit direction in #1118.

Changes

core/src/test/java/.../TruncateOrphansAfterRecoveryIT.java (test-only):

  • Replace the racy physical-size assertion with the deterministic invariant the rollback guarantees: after a clean reopen, SELECT count(*) FROM RolledBack reads 0. Read end-to-end (survives close + reopen).
  • Guard against a vacuous pass: assert the RolledBack class still exists on reopen (it is created outside the doomed TX), so a zero count can never come from an absent class.
  • Pin the count != 0 direction in the companion positive control committedAllocationGrowsPclSoRollbackAssertionIsNonVacuous, which now also asserts the same count(*) query reads GROWTH_ROW_COUNT against a committed population.
  • Sync the method/helper Javadoc and the surviving size-assertion message (which only proves the reopen is a physical no-op, not that the rollback left no flushed page).

The storage-level contract (WAL not replayed, orphan pass skipped, .pcl size stable across the reopen) stays covered by the three unchanged assertions. The primary proof that a rollback discards its footprint remains the unit test EndAtomicOperationRollbackSkipsCommitTest.

Known unfixed: a second, unrelated flake in the same run

The same failed run also failed LocalPaginatedStorageRestoreFromWALIT.testSimpleRestore on Linux arm / JDK 21 with RecordNotFoundException: #19:121 not found during the post-restore DatabaseCompare. This is a genuine, rare WAL-restore/SI divergence (the WAL-restored copy is missing a delete the live DB applied), not a test-harness artifact — the same bug class has been patched repeatedly (YTDB-510, YTDB-545, and a prior @Ignore). It is deliberately not addressed here: masking it with a retry or @Ignore would hide a real durability divergence, and a sound root-cause fix is a separate, deep storage investigation. Both failures are flaky and unrelated to the triggering commit (YTDB-651, which touches only SQL prefilter / MATCH executor code); the same tests passed on JDK 25 across all architectures and on Linux x86 / Windows under JDK 21.

Motivation

CI failure on develop: https://github.com/JetBrains/youtrackdb/actions/runs/26942924852

Test plan

  • cleanReopenAfterRollbackSkipsPassAndLeavesNoOrphan passes locally (-Dyoutrackdb.test.env=ci)
  • committedAllocationGrowsPclSoRollbackAssertionIsNonVacuous passes (confirms the committed run persists exactly GROWTH_ROW_COUNT rows)
  • Full TruncateOrphansAfterRecoveryIT class (34 tests) passes
  • spotless:check clean; all core test sources compile
  • Dimensional review completed (test-behavior, test-completeness, test-structure, code-quality); all should-fix findings addressed

Placeholder commit — fix in progress.
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

cleanReopenAfterRollbackSkipsPassAndLeavesNoOrphan compared the physical
.pcl File.length() after a rolled-back transaction against the pre-TX
baseline and expected them equal. That assertion is not a deterministic
function of the rollback: a fresh cluster eagerly allocates its first
data page on the first insert, and WOWCache may flush that page to disk
on its background thread at any point during the doomed transaction. The
rollback restores the logical horizon but does not physically truncate an
already-flushed page (a tolerated physical orphan, cleaned by the
allocator or the dirty-reopen pass). So the post-rollback file size
depends on background-flush timing, which is exactly why it raced the
assertion on macOS arm / JDK 21 (observed 9216 vs the 1024 baseline)
while passing everywhere else.

Replace the physical-size assertion with the deterministic invariant the
rollback actually guarantees: after a clean reopen the rolled-back class
holds zero committed rows. Guard against a vacuous pass (class-exists
check before the count), and pin the count != 0 direction in the
companion positive control, which now asserts the same count(*) query
reads GROWTH_ROW_COUNT against a committed population. The storage-level
behavior (gate skipped, no truncation on clean reopen, file size stable
across the reopen) stays covered by the three unchanged assertions.
@andrii0lomakin Andrii Lomakin (andrii0lomakin) changed the title [ci-fix] Investigating: flaky storage recovery ITs on JDK21/arm Fix flaky rollback-footprint assertion in TruncateOrphansAfterRecoveryIT Jun 4, 2026
@andrii0lomakin Andrii Lomakin (andrii0lomakin) marked this pull request as ready for review June 4, 2026 18:07
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Test Count Gate Results

Tolerance: 5% drop allowed per module

Overall: ✅ 29108 tests (baseline: 29108, +0)

Module Baseline Current Change Status
core 18560 18560 +0
docker-tests 1891 1891 +0
embedded 1931 1931 +0
examples 6 6 +0
gremlin-annotations 30 30 +0
jmh-ldbc 39 39 +0
server 5504 5504 +0
tests 1147 1147 +0

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Coverage Gate Results

Thresholds: 85% line, 70% branch

Line Coverage: ✅ No coverable lines in diff

Branch Coverage: ✅ No branches in changed lines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant