fix: deflake //rs/tests/consensus/backup:backup_manager_upgrade_test_head_nns_colocate#9711
Open
basvandijk wants to merge 1 commit intomasterfrom
Open
fix: deflake //rs/tests/consensus/backup:backup_manager_upgrade_test_head_nns_colocate#9711basvandijk wants to merge 1 commit intomasterfrom
basvandijk wants to merge 1 commit intomasterfrom
Conversation
…head_nns_colocate Increase the cold_storage_exists timeout from 120 seconds (12 * 10s) to 900 seconds (90 * 10s) to accommodate worst-case mutex blocking from rsync retries. Root cause: After the backup process is restarted, the cold_store thread needs to acquire the artifacts_guard mutex to check need_cold_storage_move(). However, rsync_spool holds this same mutex for its entire duration, including 60-second sleeps between retries on failure (up to 5 retries per node with RETRIES_RSYNC_HOST=5). With nodes_syncing=2 nodes, this can block the cold storage thread for up to ~10 minutes. The previous timeout of 120 seconds was insufficient when rsync encounters transient failures (e.g., rsync code 24: files vanished).
Contributor
There was a problem hiding this comment.
Pull request overview
This PR deflakes the consensus backup system test backup_manager_upgrade_test_head_nns_colocate by extending the time the test waits for cold storage to appear after restarting the backup process, accounting for mutex blocking caused by rsync_spool retry behavior.
Changes:
- Increased the
cold_storage_existspolling window from 120s to 900s (12 → 90 iterations at 10s each). - Added an in-code comment documenting the mutex-blocking root cause and worst-case wait rationale.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root Cause
After the backup process is restarted during the test, the
cold_storethread inbackup_manager.rsneeds to acquire theartifacts_guardmutex to checkneed_cold_storage_move(). However,rsync_spoolholds this same mutex for its entire duration, including 60-second sleeps between retries on failure (up to 5 retries per node withRETRIES_RSYNC_HOST=5).When rsync encounters transient failures (e.g., rsync exit code 24: "some files vanished before they could be transferred"),
rsync_spoolretries while holding the mutex. Withnodes_syncing=2, this can block the cold storage thread for up to ~10 minutes. The previouscold_storage_existstimeout of 120 seconds (12 iterations × 10s) was insufficient, causing the test to panic with "No cold storage" before the cold storage thread ever got a chance to run.Fix
Increase the
cold_storage_existstimeout from 120 seconds (12 × 10s) to 900 seconds (90 × 10s) to accommodate worst-case mutex blocking from rsync retries.Evidence
Both flaky runs showed the same pattern:
rsync_spoolfails with code 24 and retries with 60-second sleeps while holding the mutexIn contrast, passing runs had rsync succeed on the first attempt, allowing the cold storage thread to proceed immediately.
This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.