fix: deflake //rs/tests/consensus/backup:backup_manager_upgrade_test_head_nns_colocate#9711

Open
basvandijk wants to merge 1 commit into master from ai/deflake-backup_manager_upgrade_test-2026-04-02
Conversation

@basvandijk
Collaborator

Root Cause

After the backup process is restarted during the test, the cold_store thread in backup_manager.rs needs to acquire the artifacts_guard mutex to check need_cold_storage_move(). However, rsync_spool holds this same mutex for its entire duration, including 60-second sleeps between retries on failure (up to 5 retries per node with RETRIES_RSYNC_HOST=5).

When rsync encounters transient failures (e.g., rsync exit code 24: "some files vanished before they could be transferred"), rsync_spool retries while holding the mutex. With nodes_syncing=2, this can block the cold storage thread for up to ~10 minutes. The previous cold_storage_exists timeout of 120 seconds (12 iterations × 10s) was insufficient, causing the test to panic with "No cold storage" before the cold storage thread ever got a chance to run.

Fix

Increase the cold_storage_exists timeout from 120 seconds (12 × 10s) to 900 seconds (90 × 10s) to accommodate worst-case mutex blocking from rsync retries.

Evidence

Both flaky runs showed the same pattern:

  • After backup restart, rsync_spool fails with code 24 and retries with 60-second sleeps while holding the mutex
  • The cold storage thread blocks on the mutex and never runs within the 120-second window
  • The test panics with "No cold storage"

In contrast, passing runs had rsync succeed on the first attempt, allowing the cold storage thread to proceed immediately.


This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…head_nns_colocate

Increase the cold_storage_exists timeout from 120 seconds (12 * 10s) to 900 seconds
(90 * 10s) to accommodate worst-case mutex blocking from rsync retries.

Root cause: After the backup process is restarted, the cold_store thread needs to
acquire the artifacts_guard mutex to check need_cold_storage_move(). However,
rsync_spool holds this same mutex for its entire duration, including 60-second
sleeps between retries on failure (up to 5 retries per node with
RETRIES_RSYNC_HOST=5). With nodes_syncing=2 nodes, this can block the cold storage
thread for up to ~10 minutes. The previous timeout of 120 seconds was insufficient
when rsync encounters transient failures (e.g., rsync code 24: files vanished).
Contributor

Copilot AI left a comment


Pull request overview

This PR deflakes the consensus backup system test backup_manager_upgrade_test_head_nns_colocate by extending the time the test waits for cold storage to appear after restarting the backup process, accounting for mutex blocking caused by rsync_spool retry behavior.

Changes:

  • Increased the cold_storage_exists polling window from 120s to 900s (12 → 90 iterations at 10s each).
  • Added an in-code comment documenting the mutex-blocking root cause and worst-case wait rationale.


@basvandijk basvandijk marked this pull request as ready for review April 2, 2026 08:55
@basvandijk basvandijk requested a review from a team as a code owner April 2, 2026 08:55
