fix: deflake //rs/tests/consensus/backup:backup_manager_upgrade_test_head_nns_colocate#9711

Open
basvandijk wants to merge 1 commit into master from ai/deflake-backup_manager_upgrade_test-2026-04-02
Conversation

@basvandijk
Collaborator

Root Cause

After the backup process is restarted during the test, the cold_store thread in backup_manager.rs needs to acquire the artifacts_guard mutex to check need_cold_storage_move(). However, rsync_spool holds this same mutex for its entire duration, including 60-second sleeps between retries on failure (up to 5 retries per node with RETRIES_RSYNC_HOST=5).

When rsync encounters transient failures (e.g., rsync exit code 24: "some files vanished before they could be transferred"), rsync_spool retries while holding the mutex. With nodes_syncing=2, this can block the cold storage thread for up to ~10 minutes. The previous cold_storage_exists timeout of 120 seconds (12 iterations × 10s) was insufficient, causing the test to panic with "No cold storage" before the cold storage thread ever got a chance to run.

Fix

Increase the cold_storage_exists timeout from 120 seconds (12 × 10s) to 900 seconds (90 × 10s) to accommodate worst-case mutex blocking from rsync retries.

Evidence

Both flaky runs showed the same pattern:

  • After backup restart, rsync_spool fails with code 24 and retries with 60-second sleeps while holding the mutex
  • The cold storage thread blocks on the mutex and never runs within the 120-second window
  • The test panics with "No cold storage"

In contrast, passing runs had rsync succeed on the first attempt, allowing the cold storage thread to proceed immediately.


This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…head_nns_colocate

Increase the cold_storage_exists timeout from 120 seconds (12 * 10s) to 900 seconds
(90 * 10s) to accommodate worst-case mutex blocking from rsync retries.

Root cause: After the backup process is restarted, the cold_store thread needs to
acquire the artifacts_guard mutex to check need_cold_storage_move(). However,
rsync_spool holds this same mutex for its entire duration, including 60-second
sleeps between retries on failure (up to 5 retries per node with
RETRIES_RSYNC_HOST=5). With nodes_syncing=2 nodes, this can block the cold storage
thread for up to ~10 minutes. The previous timeout of 120 seconds was insufficient
when rsync encounters transient failures (e.g., rsync code 24: files vanished).
Contributor

Copilot AI left a comment


Pull request overview

This PR deflakes the consensus backup system test backup_manager_upgrade_test_head_nns_colocate by extending the time the test waits for cold storage to appear after restarting the backup process, accounting for mutex blocking caused by rsync_spool retry behavior.

Changes:

  • Increased the cold_storage_exists polling window from 120s to 900s (12 → 90 iterations at 10s each).
  • Added an in-code comment documenting the mutex-blocking root cause and worst-case wait rationale.


@basvandijk basvandijk marked this pull request as ready for review April 2, 2026 08:55
@basvandijk basvandijk requested a review from a team as a code owner April 2, 2026 08:55
