Skip to content

feat: implement validator sync-lag duty gate#1385

Open
zemse wants to merge 11 commits into
ReamLabs:masterfrom
zemse:feat/validator-sync-lag-duty-gate
Open

feat: implement validator sync-lag duty gate#1385
zemse wants to merge 11 commits into
ReamLabs:masterfrom
zemse:feat/validator-sync-lag-duty-gate

Conversation

@zemse
Copy link
Copy Markdown
Contributor

@zemse zemse commented May 15, 2026

What was wrong?

Fixes: #1376

The validator signs blocks and attestations even when it's catching up to the chain head, so it sends out useless or risky signatures.

How was it fixed?

Skip the duty when the local head lags wall-clock by too much. A small hysteresis band stops the state from flapping in and out, and if the whole network looks stalled the check lets duties run anyway so the chain can recover.

To-Do

@zemse zemse force-pushed the feat/validator-sync-lag-duty-gate branch from a8ee1ff to eab4692 Compare May 15, 2026 07:26
@zemse zemse changed the title feat: implement validator sync-lag duty gate (#1376) feat: implement validator sync-lag duty gate (#1376) May 15, 2026
@zemse zemse force-pushed the feat/validator-sync-lag-duty-gate branch from eab4692 to b19214c Compare May 15, 2026 08:44
@zemse zemse force-pushed the feat/validator-sync-lag-duty-gate branch 2 times, most recently from 01a0e34 to c944d1f Compare May 24, 2026 19:20
@zemse zemse changed the title feat: implement validator sync-lag duty gate (#1376) feat: implement validator sync-lag duty gate May 24, 2026
@zemse zemse marked this pull request as ready for review May 27, 2026 11:55
@zemse zemse marked this pull request as draft May 27, 2026 12:00
@zemse
Copy link
Copy Markdown
Contributor Author

zemse commented May 27, 2026

I've pushed some commits on my branch that dont appear in this PR, so I'll try if close open works

@zemse zemse closed this May 27, 2026
@zemse zemse reopened this May 27, 2026
zemse added 11 commits May 27, 2026 21:28
Add the test surface with minimal stubs so the tests compile. The
next commit adds the implementation that makes them pass.
When the validator's local head falls too far behind wall-clock,
pause block production and attestation duties until it catches up.
Skip the pause if the whole network is stalled so the chain can
still recover.
The loop bound 0..=4_u64 was the literal value of SYNC_LAG_THRESHOLD;
reference the constant directly so the test follows future tweaks.
Plain subtraction underflows if HYSTERESIS_BAND ever exceeds
SYNC_LAG_THRESHOLD, turning the resume condition into a permanently-true
check after wrap. const_assert catches it at compile time.
If the node crashes while behind and restarts, init=false fires one
signing attempt on the very first tick before is_synced_for_duties has
a chance to close the gate. Starting paused lets the first predicate
call unpause immediately when lag is within threshold, so clean-start
behaviour is unchanged.

The sync-lag test helper resets the flag to false explicitly because
those tests exercise predicate behaviour from a running-normally state.
Four review fixes bundled because they all touch the same predicate:

- Introduce DutyKind enum + AsRef<str> so the predicate, handlers, and
  metric labels share one source of truth instead of stringly-typed
  "block" / "attestation" literals scattered across the file. A future
  pass can swap the hand-rolled AsRef impl for strum::AsRefStr.
- Add LeanDB::snapshot_head_and_max_slot so the predicate body shrinks
  to a single let-else and the RwLock + Mutex pair is held only for
  the duration of the synchronous DB reads.
- Fold the FieldNotInitilized pre-genesis case into the same helper,
  mirroring the StoreError match used in fork_choice/beacon/handlers.rs.
- Extract LeanChainService::skip_for_lag to kill the duplicate ~25-line
  preambles in handle_produce_block / handle_build_attestation_data;
  the handlers now skip-or-continue in three lines.
Two review-driven comment moves in service.rs:

- The strum::AsRefStr TODO sat above the DutyKind enum but actually
  describes the hand-rolled AsRef<str> impl beneath it, so move it
  there.
- The "start paused" doc on the duties_paused field is really about
  the constructor's initial value, not the field declaration. Move
  it above duties_paused: true in LeanChainService::new where the
  invariant is set.

No behaviour change, no public API change.
The snapshot_head_and_max_slot helper added in cfdea16 didn't actually
shrink the lock window beyond what the inline form gave us. Both forms
take the LeanStoreWriter read lock and the inner LeanDB mutex once and
do the same three reads under that scope. Moving the reads into LeanDB
just spread the sync-lag gate across two crates without a real win.

Inline the three reads back into is_synced_for_duties and drop
LeanDB::snapshot_head_and_max_slot from crates/storage/src/db/lean.rs.
Keep the StoreError::FieldNotInitilized -> Ok(SyncedForDuties::Yes)
match arm at the call site so the pre-genesis case is still handled
(this part was a real fix in cfdea16, not just refactor noise).

No behaviour change at runtime.
…vers restart

The earlier change (bbeb4ed) initialized duties_paused to true to defend
against a restart-while-behind firing one duty before is_synced_for_duties
could close the gate. The defense is unnecessary, because the ProduceBlock
and BuildAttestationData handlers short-circuit on sync_status == Syncing
before is_synced_for_duties is ever called (service.rs:422 and :436), and
a fresh node always boots into sync_status = Syncing.

By the time the chain service has transitioned to Synced, update_sync_status
has already decided the node is caught up, so a duties_paused = false start
is no closer to firing a bad duty than duties_paused = true would be. The
hysteresis logic at runtime correctly engages duties_paused = true when
lag exceeds SYNC_LAG_THRESHOLD, so the pause defense still kicks in when
genuinely behind.

Drop the start-paused defense, remove the explicit service.duties_paused
= false override in the sync-lag test helper (no longer needed once the
default is false).
The pause and resume-from-stall logs include max_seen_slot and network_lag;
the resume-from-local-catchup log was missing both. Add them so operators
have the same diagnostic shape on every transition.
@zemse zemse force-pushed the feat/validator-sync-lag-duty-gate branch 2 times, most recently from 73f09ee to 2018266 Compare May 27, 2026 16:02
@zemse zemse marked this pull request as ready for review May 27, 2026 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add validator sync-lag duty gate

1 participant