Summary
Follow-up investigation from devnet-4 / zeam_0 Loki logs around large [clock] / [forkchoice] slot_interval … duration= spikes (same symptom class as #863). This issue tracks confirming the root cause (network completion backlog vs synchronous chain.onBlock) and the performance work once attribution is clear.
Parent / context: #863.
What we already observed (logs)
- [clock] and [forkchoice] durations spike together within ~0.5s wall time (examples of ~22s and ~49s), consistent with a single busy events.run(.until_done) draining the xev loop before the next tickInterval() / tickIntervalUnlocked() (see pkgs/node/src/clock.zig and pkgs/node/src/forkchoice.zig).
- In tight Loki windows bracketing those spikes, log volume is dominated by [node], rust-bridge, [reqresp], and gossip receive paths (attestations / aggregations; fewer gossip blocks). We did not see chain-worker / onBlock log lines in those windows.
- Caveat: successful onBlock may not emit per-block info logs; logs alone cannot refute long synchronous onBlock on the hot path.
Suggested performance directions (after attribution)
- Stronger batching or deferring work to chain-worker; rate limits or coalescing on hot gossip handlers; review subscription / mesh params for aggregator load.
- Reqresp storms: dedup in-flight requests; cap parallel reqresp work; ensure responses are chunked and do not run huge synchronous SSZ on the loop.
- rust-bridge callbacks: shrink callback work done on the xev thread; queue to a worker or a Zig-side bounded executor; audit for blocking or very large copies.
- Slow onBlock: ensure heavy paths use chain-worker; split STF / forkchoice steps; bound processPendingBlocks iterations (metrics already exist: lean_pending_blocks_drain_iters).
- Operator clarity: docs should state that slot_interval duration is a loop-health signal, not fork-choice algorithm latency alone (#863).
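The reqresp direction above (dedup in-flight requests, cap parallel work) can be illustrated with a small sketch. It is written in Python with asyncio purely for illustration; the node itself is Zig on an xev loop, and `DedupLimiter`, `fetch`, and `fake_request` are hypothetical names, not anything in the tree. It assumes requests can be identified by a hashable key such as a block root.

```python
import asyncio


class DedupLimiter:
    """Coalesce identical in-flight requests and cap concurrent work.

    Hypothetical helper for illustration only; not part of the zeam tree.
    """

    def __init__(self, max_parallel):
        self._sem = asyncio.Semaphore(max_parallel)
        self._in_flight = {}  # key -> asyncio.Task for the pending request

    async def _run(self, key, do_request):
        # At most max_parallel requests hold the semaphore at once.
        async with self._sem:
            return await do_request(key)

    async def fetch(self, key, do_request):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real request.
            task = asyncio.ensure_future(self._run(key, do_request))
            self._in_flight[key] = task
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # Later callers share the same task; shield keeps one cancelled
        # caller from cancelling the shared request.
        return await asyncio.shield(task)


async def demo():
    calls = 0
    active = 0
    peak = 0

    async def fake_request(key):
        nonlocal calls, active, peak
        calls += 1
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for network latency
        active -= 1
        return f"block-{key}"

    limiter = DedupLimiter(max_parallel=2)
    keys = ["a", "b", "c"] * 4  # 12 requests for 3 distinct keys
    results = await asyncio.gather(
        *(limiter.fetch(k, fake_request) for k in keys)
    )
    return calls, peak, results
```

In this sketch, twelve concurrent requests for three distinct keys result in three underlying calls, with no more than two running at once. Whether the equivalent belongs in the Zig reqresp handler or in the rust-bridge is left open until attribution is done.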
Hypothesis to prove or disprove
- … events.run busy.
- … chain.onBlock (or other long completions) still contributing tail latency even when not visible in logs.
Investigation plan (next live devnet)
- lean_tick_interval_duration_seconds / zeam_fork_choice_tick_interval_duration_seconds
- zeam_chain_onblock_duration_seconds (histogram tail / +Inf)
- lean_chain_queue_depth, lean_chain_queue_dropped_total, lean_chain_worker_loop_iters_total (if --chain-worker on)
- zeam_libp2p_swarm_command_dropped_total (back-pressure signal)
- events.run iteration wall-time sampling (log or metric when a single drain exceeds e.g. 500ms / 1s), attributed by completion type where feasible (suggested in #863, "Investigate slow slot_interval / tick duration (event-loop starvation vs nominal 0.8s)").
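The events.run wall-time sampling item above could look roughly like the following. This is a sketch in Python for illustration only, not the Zig implementation; `DrainSampler`, `run_completion`, and the completion kind labels are hypothetical, and the 0.5s default simply mirrors the example threshold mentioned in the plan. The clock is injectable so the logic can be exercised without real delays.

```python
import time
from collections import defaultdict


class DrainSampler:
    """Time one loop drain and attribute wall time to completion kinds.

    Hypothetical helper for illustration only; not part of the zeam tree.
    """

    def __init__(self, threshold_s=0.5, clock=time.monotonic):
        self.threshold_s = threshold_s
        self._clock = clock
        self._start = None
        self._by_kind = defaultdict(float)

    def begin_drain(self):
        # Call right before the loop starts draining completions.
        self._start = self._clock()
        self._by_kind.clear()

    def run_completion(self, kind, callback):
        """Run one completion callback, charging its duration to `kind`."""
        t0 = self._clock()
        try:
            return callback()
        finally:
            self._by_kind[kind] += self._clock() - t0

    def end_drain(self):
        """Return a report if the drain met the threshold, else None."""
        total = self._clock() - self._start
        if total < self.threshold_s:
            return None
        # Most expensive completion kinds first.
        worst = sorted(
            self._by_kind.items(), key=lambda kv: kv[1], reverse=True
        )
        return {"total_s": total, "by_kind": worst}
```

A caller would wrap each drain in `begin_drain()` / `end_drain()` and log or export the report only when it is not None, so the steady-state cost is two clock reads per completion.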
References in tree
- pkgs/node/src/clock.zig: Clock.run / tickInterval
- pkgs/node/src/forkchoice.zig: tickIntervalUnlocked
- pkgs/node/src/node.zig: chain.onBlock call sites
- pkgs/metrics: tick interval, zeam_chain_onblock_duration_seconds, chain-queue metrics
Acceptance criteria