feat: implement validator sync-lag duty gate#1385
Open
zemse wants to merge 11 commits into
Open
Conversation
a8ee1ff to
eab4692
Compare
eab4692 to
b19214c
Compare
01a0e34 to
c944d1f
Compare
Contributor
Author
|
I've pushed some commits on my branch that dont appear in this PR, so I'll try if close open works |
Add the test surface with minimal stubs so the tests compile. The next commit adds the implementation that makes them pass.
When the validator's local head falls too far behind wall-clock, pause block production and attestation duties until it catches up. Skip the pause if the whole network is stalled so the chain can still recover.
The loop bound 0..=4_u64 was the literal value of SYNC_LAG_THRESHOLD; reference the constant directly so the test follows future tweaks.
Plain subtraction underflows if HYSTERESIS_BAND ever exceeds SYNC_LAG_THRESHOLD, turning the resume condition into a permanently-true check after wrap. const_assert catches it at compile time.
If the node crashes while behind and restarts, init=false fires one signing attempt on the very first tick before is_synced_for_duties has a chance to close the gate. Starting paused lets the first predicate call unpause immediately when lag is within threshold, so clean-start behaviour is unchanged. The sync-lag test helper resets the flag to false explicitly because those tests exercise predicate behaviour from a running-normally state.
Four review fixes bundled because they all touch the same predicate: - Introduce DutyKind enum + AsRef<str> so the predicate, handlers, and metric labels share one source of truth instead of stringly-typed "block" / "attestation" literals scattered across the file. A future pass can swap the hand-rolled AsRef impl for strum::AsRefStr. - Add LeanDB::snapshot_head_and_max_slot so the predicate body shrinks to a single let-else and the RwLock + Mutex pair is held only for the duration of the synchronous DB reads. - Fold the FieldNotInitilized pre-genesis case into the same helper, mirroring the StoreError match used in fork_choice/beacon/handlers.rs. - Extract LeanChainService::skip_for_lag to kill the duplicate ~25-line preambles in handle_produce_block / handle_build_attestation_data; the handlers now skip-or-continue in three lines.
Two review-driven comment moves in service.rs: - The strum::AsRefStr TODO sat above the DutyKind enum but actually describes the hand-rolled AsRef<str> impl beneath it, so move it there. - The "start paused" doc on the duties_paused field is really about the constructor's initial value, not the field declaration. Move it above duties_paused: true in LeanChainService::new where the invariant is set. No behaviour change, no public API change.
The snapshot_head_and_max_slot helper added in cfdea16 didn't actually shrink the lock window beyond what the inline form gave us. Both forms take the LeanStoreWriter read lock and the inner LeanDB mutex once and do the same three reads under that scope. Moving the reads into LeanDB just spread the sync-lag gate across two crates without a real win. Inline the three reads back into is_synced_for_duties and drop LeanDB::snapshot_head_and_max_slot from crates/storage/src/db/lean.rs. Keep the StoreError::FieldNotInitilized -> Ok(SyncedForDuties::Yes) match arm at the call site so the pre-genesis case is still handled (this part was a real fix in cfdea16, not just refactor noise). No behaviour change at runtime.
…vers restart The earlier change (bbeb4ed) initialized duties_paused to true to defend against a restart-while-behind firing one duty before is_synced_for_duties could close the gate. The defense is unnecessary, because the ProduceBlock and BuildAttestationData handlers short-circuit on sync_status == Syncing before is_synced_for_duties is ever called (service.rs:422 and :436), and a fresh node always boots into sync_status = Syncing. By the time the chain service has transitioned to Synced, update_sync_status has already decided the node is caught up, so a duties_paused = false start is no closer to firing a bad duty than duties_paused = true would be. The hysteresis logic at runtime correctly engages duties_paused = true when lag exceeds SYNC_LAG_THRESHOLD, so the pause defense still kicks in when genuinely behind. Drop the start-paused defense, remove the explicit service.duties_paused = false override in the sync-lag test helper (no longer needed once the default is false).
The pause and resume-from-stall logs include max_seen_slot and network_lag; the resume-from-local-catchup log was missing both. Add them so operators have the same diagnostic shape on every transition.
73f09ee to
2018266
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What was wrong?
Fixes: #1376
The validator signs blocks and attestations even when it's catching up to the chain head, so it sends out useless or risky signatures.
How was it fixed?
Skip the duty when the local head lags wall-clock by too much. A small hysteresis band stops the state from flapping in and out, and if the whole network looks stalled the check lets duties run anyway so the chain can recover.
To-Do