fix flake in test for restarted followers not causing an election #669
Open
While working on #666, I encountered a test flake caused by a race condition in `TestRaft_FollowerRemovalNoElection`. The race also obscures the intent of the test.

After we restart the follower, the `c.ConnectFully` call typically completes quickly enough that the follower never enters the candidate state, so we're not exercising the intended behavior. But if the `c.ConnectFully` call takes too long, the follower can miss heartbeats and start an election. This is expected behavior, and its election request will be rejected. In that case, though, the follower will not have settled on the leader by the time we check it, and the test fails when it shouldn't.

This PR adds an extra sleep before we reconnect the cluster to force the restarted follower to become a candidate, so that we're exercising the intent of the test. It then waits until the follower shows an event indicating that it has a leader before continuing to the remaining assertions.
Note that to hit this flake you need to run this test many times, for example with `go test -v -failfast -count=30 . -run TestRaft_FollowerRemovalNoElection`. The work I'm doing in #666 changes the timing slightly, which caused the flake to happen maybe 10% of the time instead of 5%.