fix flake in test for restarted followers not causing an election #669

Open
tgross wants to merge 1 commit into main from test-flake-follower-removal-no-election

tgross commented Mar 19, 2026

While working on #666, I encountered a test flake caused by a race condition in `TestRaft_FollowerRemovalNoElection`, which also obscures the intent of the test. After we restart the follower, the `c.ConnectFully` call typically completes quickly enough that the follower never enters the candidate state, so we're not exercising the intended behavior. But if the `c.ConnectFully` call takes too long, the follower can miss heartbeats and start an election. This is expected behavior, and its request for an election will be rejected. But in that case, the follower will not have settled on the leader by the time we check it, and the test will fail when it shouldn't.

Add an extra sleep before we reconnect the cluster to force the restarted follower to become a candidate, so that we're exercising the intent of the test. Then wait until the follower emits an event showing that it has a leader before continuing to the remaining assertions.

Note that to hit this flake you need to run this test many times, e.g. with `go test -v -failfast -count=30 . -run TestRaft_FollowerRemovalNoElection`. The work I'm doing in #666 changes the timing slightly, which caused the flake to happen maybe 10% of the time instead of 5%.

tgross marked this pull request as ready for review March 19, 2026 20:53
tgross requested review from a team as code owners March 19, 2026 20:53
tgross force-pushed the test-flake-follower-removal-no-election branch from 33f120f to e2cd0a7 March 20, 2026 13:13