Fix flaky rollback-footprint assertion in TruncateOrphansAfterRecoveryIT#1126
Open
Andrii Lomakin (andrii0lomakin) wants to merge 2 commits into
Open
Fix flaky rollback-footprint assertion in TruncateOrphansAfterRecoveryIT#1126Andrii Lomakin (andrii0lomakin) wants to merge 2 commits into
Andrii Lomakin (andrii0lomakin) wants to merge 2 commits into
Conversation
Placeholder commit — fix in progress.
Contributor
|
Note Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported. |
cleanReopenAfterRollbackSkipsPassAndLeavesNoOrphan compared the physical .pcl File.length() after a rolled-back transaction against the pre-TX baseline and expected them equal. That assertion is not a deterministic function of the rollback: a fresh cluster eagerly allocates its first data page on the first insert, and WOWCache may flush that page to disk on its background thread at any point during the doomed transaction. The rollback restores the logical horizon but does not physically truncate an already-flushed page (a tolerated physical orphan, cleaned by the allocator or the dirty-reopen pass). So the post-rollback file size depends on background-flush timing, which is exactly why it raced the assertion on macOS arm / JDK 21 (observed 9216 vs the 1024 baseline) while passing everywhere else. Replace the physical-size assertion with the deterministic invariant the rollback actually guarantees: after a clean reopen the rolled-back class holds zero committed rows. Guard against a vacuous pass (class-exists check before the count), and pin the count != 0 direction in the companion positive control, which now asserts the same count(*) query reads GROWTH_ROW_COUNT against a committed population. The storage-level behavior (gate skipped, no truncation on clean reopen, file size stable across the reopen) stays covered by the three unchanged assertions.
Test Count Gate ResultsTolerance: 5% drop allowed per module Overall: ✅ 29108 tests (baseline: 29108, +0)
|
Coverage Gate ResultsThresholds: 85% line, 70% branch Line Coverage: ✅ No coverable lines in diffBranch Coverage: ✅ No branches in changed lines |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a flaky failure of
TruncateOrphansAfterRecoveryIT.cleanReopenAfterRollbackSkipsPassAndLeavesNoOrphanthat turned the integration pipeline red ondevelop(failed run, macOS arm / JDK 21). The fix is test-only; production code is unchanged.The build runs with
-Dmaven.test.failure.ignore=true, so the EnricoMi "Publish Test Results" step is the gate that fails the job when any test result is a failure.Root Cause
The test asserted that the physical
.pclFile.length()measured immediately after a rolled-back transaction equals the pre-TX baseline (expected: 1024L but was: 9216L). That is not a deterministic function of the rollback:WOWCachemay flush that buffered page to disk on its background thread at any point during the doomed transaction.cleanReopenDoesNotRepairPreExistingPhysicalOrphan).So
File.length()after a rollback reflects background-flush timing, not the rollback outcome. On most platforms the eager page was still buffered when the assertion read the size (so it read the baseline); on macOS arm / JDK 21 the background flush had landed (9216 = 1024 + 8192). The same class of WOWCache lazy-flush timing bug was fixed for the commit direction in #1118.Changes
core/src/test/java/.../TruncateOrphansAfterRecoveryIT.java(test-only):SELECT count(*) FROM RolledBackreads0. Read end-to-end (survives close + reopen).RolledBackclass still exists on reopen (it is created outside the doomed TX), so a zero count can never come from an absent class.count != 0direction in the companion positive controlcommittedAllocationGrowsPclSoRollbackAssertionIsNonVacuous, which now also asserts the samecount(*)query readsGROWTH_ROW_COUNTagainst a committed population.The storage-level contract (WAL not replayed, orphan pass skipped,
.pclsize stable across the reopen) stays covered by the three unchanged assertions. The primary proof that a rollback discards its footprint remains the unit testEndAtomicOperationRollbackSkipsCommitTest.Known unfixed: a second, unrelated flake in the same run
The same failed run also failed
LocalPaginatedStorageRestoreFromWALIT.testSimpleRestoreon Linux arm / JDK 21 withRecordNotFoundException: #19:121 not foundduring the post-restoreDatabaseCompare. This is a genuine, rare WAL-restore/SI divergence (the WAL-restored copy is missing a delete the live DB applied), not a test-harness artifact — the same bug class has been patched repeatedly (YTDB-510, YTDB-545, and a prior@Ignore). It is deliberately not addressed here: masking it with a retry or@Ignorewould hide a real durability divergence, and a sound root-cause fix is a separate, deep storage investigation. Both failures are flaky and unrelated to the triggering commit (YTDB-651, which touches only SQL prefilter / MATCH executor code); the same tests passed on JDK 25 across all architectures and on Linux x86 / Windows under JDK 21.Motivation
CI failure on
develop: https://github.com/JetBrains/youtrackdb/actions/runs/26942924852Test plan
cleanReopenAfterRollbackSkipsPassAndLeavesNoOrphanpasses locally (-Dyoutrackdb.test.env=ci)committedAllocationGrowsPclSoRollbackAssertionIsNonVacuouspasses (confirms the committed run persists exactlyGROWTH_ROW_COUNTrows)TruncateOrphansAfterRecoveryITclass (34 tests) passesspotless:checkclean; all core test sources compile