Investigate xev loop spikes: network/gossip/reqresp vs onBlock (Loki follow-up to #863) #867

Description

@ch4r10t33r

Summary

Follow-up investigation from devnet-4 / zeam_0 Loki logs around large [clock] / [forkchoice] slot_interval duration= spikes (same symptom class as #863). This issue tracks confirming the root cause (network completion backlog vs synchronous chain.onBlock) and the performance work to follow once attribution is clear.

Parent / context: #863.

What we already observed (logs)

  • [clock] and [forkchoice] durations spike together, logged within ~0.5s of each other in wall time (examples of ~22s and ~49s). This is consistent with a single busy events.run(.until_done) draining the xev loop before the next tickInterval() / tickIntervalUnlocked() (see pkgs/node/src/clock.zig and pkgs/node/src/forkchoice.zig).
  • In tight Loki windows bracketing those spikes, log volume is dominated by [node], rust-bridge, [reqresp], and gossip receive paths (attestations / aggregations; fewer gossip blocks). We did not see chain-worker / onBlock log lines in those windows.
  • Caveat: successful onBlock may not emit per-block info logs; logs alone cannot refute long synchronous onBlock on the hot path.
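To make the "dominated by" observation repeatable across spike windows, lines can be counted per component inside each window. The following is a minimal Python sketch under stated assumptions: the input is a list of (timestamp, line) pairs already exported from Loki, the component is taken from the first bracketed tag on the line, and the sample lines are invented only to show the input shape. `component_counts` and `TAG` are hypothetical names, not part of zeam.

```python
import re
from collections import Counter

# Assumption: the first "[name]" token on a line identifies the component.
TAG = re.compile(r"\[([A-Za-z][\w-]*)\]")


def component_counts(entries, start, end):
    """Count log lines per component for timestamps in [start, end].

    entries: iterable of (unix_seconds, line) pairs, e.g. exported from Loki.
    """
    counts = Counter()
    for ts, line in entries:
        if not (start <= ts <= end):
            continue
        match = TAG.search(line)
        if match:
            counts[match.group(1)] += 1
        elif "rust-bridge" in line:
            # rust-bridge lines are assumed to carry no bracketed tag.
            counts["rust-bridge"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up sample lines, only to illustrate the expected input.
sample = [
    (100.0, "[clock] slot_interval duration=22.1s"),
    (100.2, "[reqresp] response chunk received"),
    (100.3, "rust-bridge: gossip message delivered"),
    (100.4, "[node] gossip attestation received"),
    (100.4, "[node] gossip aggregation received"),
    (130.0, "[node] outside the window"),
]
window = component_counts(sample, 99.5, 100.5)
print(window.most_common())
```

A count like this only shows which components logged, so the caveat above still applies: a silent onBlock would not appear in it.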

Hypothesis to prove or disprove

  1. Primary: loop starvation from bursty network completions (GossipSub-style deliveries, reqresp, rust-bridge callbacks) keeping events.run busy.
  2. Secondary (metrics): occasional multi-second chain.onBlock (or other long completions) still contributing tail latency even when not visible in logs.
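The secondary hypothesis can be checked from histogram buckets alone. Prometheus histogram buckets are cumulative, so the number of observations above a bound is the +Inf count minus the count at that bound, and the difference between two scrapes isolates one spike window. The Python sketch below assumes illustrative bucket bounds and counts; the real bucket layout of zeam_chain_onblock_duration_seconds is not stated in this issue, and `tail_count` / `window_tail` are hypothetical helper names.

```python
INF = float("inf")


def tail_count(buckets, bound):
    """Observations above `bound`, from cumulative Prometheus buckets.

    buckets: dict mapping each `le` upper bound (INF for +Inf) to its
    cumulative count. `bound` must be one of the finite `le` values.
    """
    return buckets[INF] - buckets[bound]


def window_tail(before, after, bound):
    """Slow observations that occurred between two scrapes of the histogram."""
    return tail_count(after, bound) - tail_count(before, bound)


# Illustrative numbers only, not measurements from any devnet.
scrape_before = {0.1: 900, 0.5: 980, 1.0: 995, 5.0: 998, INF: 1000}
scrape_after = {0.1: 950, 0.5: 1035, 1.0: 1051, 5.0: 1056, INF: 1060}

print(tail_count(scrape_after, 1.0))                   # slower than 1s since start
print(window_tail(scrape_before, scrape_after, 1.0))   # slower than 1s in the window
print(window_tail(scrape_before, scrape_after, 5.0))   # slower than 5s in the window
```

A non-zero multi-second tail inside a spike window would support the secondary hypothesis even when no onBlock lines appear in Loki.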

Investigation plan (next live devnet)

  1. Prometheus correlation (same host, aligned timestamps as Loki spikes):
    • lean_tick_interval_duration_seconds / zeam_fork_choice_tick_interval_duration_seconds
    • zeam_chain_onblock_duration_seconds (histogram tail / +Inf)
    • lean_chain_queue_depth, lean_chain_queue_dropped_total, lean_chain_worker_loop_iters_total (if --chain-worker on)
    • zeam_libp2p_swarm_command_dropped_total (back-pressure signal)
  2. Optional: per-completion or per-events.run iteration wall-time sampling (log or metric when a single drain exceeds e.g. 500ms / 1s), attributed by completion type where feasible (suggestion from #863, "Investigate slow slot_interval / tick duration (event-loop starvation vs nominal 0.8s)").
  3. Infra sanity: CPU steal, disk, noisy neighbours on the validator VM (rule out before deep Zig changes).
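The per-drain sampling in step 2 can be sketched language-independently. The Python below is an illustration of the idea only, not the xev or zeam API: `timed_drain`, the `(kind, callback)` shape, and the kind labels are all assumptions. It times each completion, sums wall time per kind, and flags the drain when the total crosses a threshold. The clock is injectable so the example is deterministic.

```python
import time
from collections import defaultdict


def timed_drain(completions, threshold_s=0.5, clock=time.perf_counter):
    """Run every pending completion once and attribute wall time by kind.

    completions: iterable of (kind, callback) pairs; `kind` is a free-form
    label such as "gossip" or "reqresp" (hypothetical names).
    Returns (total_seconds, per_kind_seconds, exceeded_threshold).
    """
    per_kind = defaultdict(float)
    drain_start = clock()
    for kind, callback in completions:
        started = clock()
        callback()
        per_kind[kind] += clock() - started
    total = clock() - drain_start
    return total, dict(per_kind), total >= threshold_s


class FakeClock:
    """Deterministic clock for the demo; callbacks advance it explicitly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


fake = FakeClock()
pending = [
    ("gossip", lambda: fake.advance(0.25)),
    ("reqresp", lambda: fake.advance(0.5)),
    ("gossip", lambda: fake.advance(0.25)),
]
total, per_kind, slow = timed_drain(pending, threshold_s=0.5, clock=fake)
print(total, per_kind, slow)  # 1.0 {'gossip': 0.5, 'reqresp': 0.5} True
```

In a real loop the per-kind totals would be logged or exported as a metric only when the threshold is exceeded, so the sampling adds little overhead on healthy drains.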

Suggested performance directions (after attribution)

| If attribution shows… | Possible improvements |
| --- | --- |
| Gossip / mesh flood | Stronger batching or defer work to chain-worker; rate limits or coalescing on hot gossip handlers; review subscription / mesh params for aggregator load. |
| Reqresp storms | Dedup in-flight requests; cap parallel reqresp work; ensure responses chunk and do not run huge synchronous SSZ on the loop. |
| rust-bridge callbacks | Shrink callback work done on the xev thread; queue to worker or Zig-side bounded executor; audit for blocking or very large copies. |
| Slow onBlock | Ensure heavy paths use chain-worker; split STF / forkchoice steps; bound processPendingBlocks iterations (metrics already exist: lean_pending_blocks_drain_iters). |
| Operator clarity | Docs: slot_interval duration is a loop-health signal, not fork-choice algorithm latency alone (#863). |
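As one illustration of the "Reqresp storms" row, deduplicating in-flight requests and capping parallel work can be combined in a single small tracker. This Python sketch is a design illustration under assumptions, not zeam's reqresp code: `InFlightRequests`, its method names, and the string keys are hypothetical.

```python
class InFlightRequests:
    """Track outstanding requests so duplicates share one network round trip."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.waiters = {}  # key -> list of callbacks waiting for the response

    def request(self, key, on_response):
        """Return "sent", "joined", or "rejected" for this request."""
        if key in self.waiters:
            # Same key already outstanding: piggyback instead of re-sending.
            self.waiters[key].append(on_response)
            return "joined"
        if len(self.waiters) >= self.max_in_flight:
            # Cap on parallel work reached: caller should retry or drop.
            return "rejected"
        self.waiters[key] = [on_response]
        return "sent"

    def complete(self, key, response):
        """Deliver one response to every waiter; returns how many were notified."""
        callbacks = self.waiters.pop(key, [])
        for callback in callbacks:
            callback(response)
        return len(callbacks)


got = []
tracker = InFlightRequests(max_in_flight=2)
results = [
    tracker.request("root-a", got.append),
    tracker.request("root-a", got.append),
    tracker.request("root-b", got.append),
    tracker.request("root-c", got.append),
]
notified = tracker.complete("root-a", "block-a")
print(results, notified, got)
```

Counting the "joined" and "rejected" outcomes as metrics would also show whether duplicate requests are frequent enough to matter before any larger change is made.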

References in tree

  • pkgs/node/src/clock.zig: Clock.run / tickInterval
  • pkgs/node/src/forkchoice.zig: tickIntervalUnlocked
  • pkgs/node/src/node.zig: chain.onBlock call sites
  • pkgs/metrics: tick interval, zeam_chain_onblock_duration_seconds, chain-queue metrics

Acceptance criteria

  • Documented Prometheus + (optional) trace evidence for at least one spike window naming the dominant contributor(s).
  • Actionable issue(s) or PR(s) for the top 1–2 fixes, or explicit “infra / load test artefact” conclusion with monitoring recommendations.
