libp2p-glue: soft wedge of swarm task on memory-limited (≤4 GiB) hosts after first reconnect-storm #958

Description

@ch4r10t33r

Symptom

On hosts where the zeam container runs with a small memory limit (≤4 GiB), the libp2p Rust bridge's swarm task soft-wedges ~5 minutes after startup, around the time of the first reconnect-storm (multiple peers hitting MAX_RECONNECT_ATTEMPTS=5 in close succession).

  • Process stays alive — Docker's restart policy does NOT fire (it only restarts on process death).
  • The Zig libxev / clock thread keeps ticking; the validator keeps trying to construct attestations but fails with "behind peers".
  • The libp2p swarm task stops processing entirely: no gossip ingress, no reqresp responses, no [#942 arm] diag heartbeat.
  • head stays frozen indefinitely (observed for 1h+ without recovery).

Only an external restart (or fleet redeploy) un-wedges the node.

Evidence (2026-05-31 devnet, image 2f54ab43 = commit e789de6f on PR #953 branch)

Same image, same code path on every node. The differentiator is environmental:

| node | host RAM | container mem limit | uptime | head @ clock=1183 | status |
|---|---|---|---|---|---|
| zeam_8 | 30 GiB | 30.59 GiB | 1 hour | 1175 | ✅ healthy, in lockstep with ethlambda/grandine |
| zeam_9 | 15 GiB | 4 GiB | 1 sec | 53 | OOM-restart cycle |
| zeam_10 | 15 GiB | 4 GiB | 1 hour | 53 | ⚠️ soft wedge (head unchanged for 30+ min) |
| zeam_11 | 15 GiB | 4 GiB | 2 sec | 61 | OOM-restart cycle |
| zeam_12 | 15 GiB | 4 GiB | 11 sec | 5 | restarted, fresh sync |
| zeam_13 | 15 GiB | 4 GiB | 1 hour | 53 | ⚠️ soft wedge (head unchanged for 30+ min) |
| zeam_14 | 15 GiB | (no limit) | Restarting (137) | | OOM-killed |
| zeam_15 | 15 GiB | 4 GiB | 12 sec | 0 | restarted, fresh sync |
| ethlambda_1 | | | | 1175 | ✅ reference |
| grandine_2 | | | | 1175 | ✅ reference |

OOM restarts on 4 GiB hosts are accepted (shared machines). The wedges on zeam_10 / zeam_13 are NOT OOMs — both processes are alive for 1+ hour with the swarm task dead.

What's been ruled out

The libp2p-glue re-entrant Mutex deadlock that PR #953 fixed (commits 71f24084, c6cf2241, 03b4ef39, e789de6f) is NOT the cause of this wedge:

  • zeam_8 (30 GiB) hits the same trigger — the first "Max reconnection attempts (5) reached for peer …" warning fires at slot 57 — and the swarm task survives cleanly, then handles 9 more of the same warnings over the next hour.
  • zeam_13 (4 GiB) fires the same warning at slot 55 (11 seconds earlier) and the swarm task wedges immediately afterward, never to recover.
  • The same fix code, same call sequence, same data — only the host/container memory limit differs.

Diagnostic signature on wedged hosts

The strongest fingerprint from the debug-level file log (/opt/lean-quickstart/data/<container>/consensus.log):

On wedged nodes (zeam_10, zeam_13), the tokio-spawned diag task (spawn_quic_diag_emitter) never fires, not even once, in the entire log file (zero #942 arm strings in 24+ MB of log).

On healthy zeam_8 with the same image, the same diag task fires within 0.6 seconds of container start and continues firing every 10 seconds for the full uptime (74,802 #942 strings in 457 MB).

So the wedge does not look libp2p-internal; it looks like a tokio runtime symptom: the spawned diag task is never polled on wedged hosts. The swarm task itself runs healthily for ~5 minutes (we see normal peer-connect / Status request / Status response events), then stops.
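A minimal sketch of how this fingerprint could be checked mechanically: count the log lines that contain the #942 arm marker. The function name is hypothetical, and scanning raw bytes (so that stray non-UTF-8 bytes in the log do not abort the scan) is an assumption made for illustration, not something taken from zeam.

```rust
use std::io::{self, BufRead};

/// Marker string emitted by the diag heartbeat, as quoted in this issue.
const HEARTBEAT_MARKER: &str = "#942 arm";

/// Counts the lines in `reader` that contain the heartbeat marker.
/// Works on raw bytes so invalid UTF-8 in the log does not stop the scan.
fn count_heartbeats<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let marker = HEARTBEAT_MARKER.as_bytes();
    let mut line = Vec::new();
    let mut count = 0;
    loop {
        line.clear();
        // read_until returns 0 only at end of input.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        if line.windows(marker.len()).any(|w| w == marker) {
            count += 1;
        }
    }
    Ok(count)
}
```

Fed the consensus.log of a wedged node (for example through `std::io::BufReader::new(File::open(path)?)`), a result of zero would match the signature described above; a healthy node should report a count that grows with uptime.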

Hypotheses

  1. Tokio runtime worker starvation under memory pressure. With new_multi_thread().worker_threads(2), if the system is paging or close to the cgroup memory limit, worker threads may not get scheduled. The spawn_quic_diag_emitter heartbeat never firing is consistent with this.
  2. Zig allocator / chain-state memory growth crowding out tokio. zeam's Zig side (chain worker, fork choice, validator) is memory-hungry. Under a 4 GiB limit, the headroom left for tokio's per-task allocations (mostly small but numerous: each reconnect schedules a tokio::time::Sleep future) might be too tight.
  3. A blocking call in the swarm task's path under load, for example a synchronous lock or Drop somewhere in the OutgoingConnectionError / reqresp path that is fast on a well-provisioned machine but slow under memory pressure. The code audit identified none, but this is worth re-checking.
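To help tell hypotheses 1 and 2 apart from hypothesis 3, it may be useful to record how close the container is to its limit at the moment the wedge starts. The sketch below assumes a cgroup v2 host where the container sees its own `memory.current` and `memory.max` files under `/sys/fs/cgroup`; both function names are hypothetical.

```rust
use std::fs;

/// Computes remaining memory headroom in bytes from the raw contents of
/// cgroup v2 `memory.current` and `memory.max`.
/// Returns None when the limit is "max" (unlimited) or a value is unparsable.
fn headroom_bytes(current_raw: &str, max_raw: &str) -> Option<u64> {
    let current: u64 = current_raw.trim().parse().ok()?;
    let max_raw = max_raw.trim();
    if max_raw == "max" {
        return None; // no limit configured for this cgroup
    }
    let max: u64 = max_raw.parse().ok()?;
    // Usage can briefly exceed the limit; clamp at zero instead of underflowing.
    Some(max.saturating_sub(current))
}

/// Reads headroom for the current container.
/// ASSUMPTION: cgroup v2 mounted at /sys/fs/cgroup with a private cgroup namespace.
fn read_headroom() -> Option<u64> {
    let current = fs::read_to_string("/sys/fs/cgroup/memory.current").ok()?;
    let max = fs::read_to_string("/sys/fs/cgroup/memory.max").ok()?;
    headroom_bytes(&current, &max)
}
```

Logging this value from a plain OS thread (not a tokio task) every few seconds would show whether headroom collapses just before the swarm task stops, which hypotheses 1 and 2 predict and hypothesis 3 does not require.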

Why this isn't a blocker for PR #953

PR #953 closes the original #942 race / deadlock chain and is verified healthy on adequately-provisioned hosts (zeam_8 at 30 GiB: 1 hour clean, 9× reconnect-storm survived). The soft wedge on memory-limited hosts is a distinct issue.

Suggested investigations

  • Reproduce locally with a 4 GiB cgroup limit and capture a Tokio runtime trace (RUSTFLAGS="--cfg tokio_unstable", tokio-console).
  • Audit allocations in the libp2p-glue hot path; consider preallocating RECONNECT_DELAYS_SECS Sleep futures or capping concurrent reconnect tasks.
  • Profile the Zig side's memory footprint during the catch-up burst — chain-worker queue, fork-choice tree, message_id_fn cache — and consider per-component caps.
  • Compare tokio::time::interval behavior on the diag emitter task vs a std::thread::sleep loop in a separate OS thread to confirm whether the runtime or the spawn primitive is at fault.
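The last comparison could be set up roughly as follows: the task under test bumps a shared counter each time it runs, and a watchdog on a separate OS thread (independent of any async runtime) checks whether the counter advanced. This is a stdlib-only sketch; `Heartbeat`, `StallDetector`, and `spawn_watchdog` are hypothetical names, and wiring `beat()` into the diag emitter or the swarm loop is left as an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Shared counter that the task under test bumps each time it runs.
#[derive(Clone, Default)]
struct Heartbeat(Arc<AtomicU64>);

impl Heartbeat {
    fn beat(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    fn value(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Tracks consecutive checks in which the counter did not advance.
struct StallDetector {
    last_seen: u64,
    missed: u32,
    max_missed: u32,
}

impl StallDetector {
    fn new(max_missed: u32) -> Self {
        Self { last_seen: 0, missed: 0, max_missed }
    }

    /// Returns true once the counter has stood still for `max_missed` checks in a row.
    fn observe(&mut self, current: u64) -> bool {
        if current == self.last_seen {
            self.missed += 1;
        } else {
            self.last_seen = current;
            self.missed = 0;
        }
        self.missed >= self.max_missed
    }
}

/// Spawns a plain OS thread that reports a stall and then exits.
fn spawn_watchdog(hb: Heartbeat, period: Duration, max_missed: u32) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut detector = StallDetector::new(max_missed);
        loop {
            thread::sleep(period);
            if detector.observe(hb.value()) {
                eprintln!("watchdog: heartbeat stalled at {}", hb.value());
                return;
            }
        }
    })
}
```

If the watchdog thread keeps reporting while the counter stands still, the OS is still scheduling the process and the runtime has stopped polling the task. If the watchdog itself goes quiet, the whole process is being starved, which would point at memory pressure rather than the spawn primitive.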

Operational mitigation in the interim

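Add a Docker healthcheck that detects the soft wedge. Example: confirm that lean_head_slot from the metrics endpoint advances over a window, or that the [#942 arm] heartbeat in /opt/lean-quickstart/data/<container>/consensus.log has fired in the last N seconds. Note that a failing healthcheck only marks the container unhealthy; outside Swarm mode, Docker's restart policy still fires only on process exit, so the check has to be paired with something that acts on it (for example, a watcher that restarts unhealthy containers, or a check that terminates the main process on failure). With that in place, this is operationally equivalent to the OOM auto-recovery we already accept on small hosts.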
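A rough sketch of the head-advance variant of that check. It assumes the metrics endpoint serves Prometheus text format with lean_head_slot as a plain sample, and that the caller fetches the body and persists the previous value between runs (for example in a small state file); `metric_value` and `healthcheck` are hypothetical helper names. The exit codes follow Docker's HEALTHCHECK convention of 0 for healthy and 1 for unhealthy.

```rust
/// Extracts the first sample value for `name` from Prometheus text-format output.
/// Sketch only: label values containing '}' are not handled.
fn metric_value(body: &str, name: &str) -> Option<f64> {
    body.lines()
        .filter(|l| !l.starts_with('#')) // skip HELP / TYPE comment lines
        .find_map(|l| {
            let rest = l.strip_prefix(name)?;
            // Require a boundary so a longer metric name with the same prefix does not match.
            if !(rest.starts_with(' ') || rest.starts_with('{')) {
                return None;
            }
            // Skip an optional `{label="..."}` set, then take the value token.
            let after_labels = match rest.find('}') {
                Some(i) => &rest[i + 1..],
                None => rest,
            };
            after_labels.split_whitespace().next()?.parse::<f64>().ok()
        })
}

/// Returns (exit_code, head_to_persist).
/// Exit code 0 = healthy, 1 = unhealthy (metric missing or head did not advance).
fn healthcheck(body: &str, previous: Option<u64>) -> (i32, Option<u64>) {
    let current = match metric_value(body, "lean_head_slot") {
        Some(v) => v as u64,
        None => return (1, None),
    };
    // The first run has nothing to compare against, so it is treated as healthy.
    let healthy = previous.map_or(true, |p| current > p);
    (if healthy { 0 } else { 1 }, Some(current))
}
```

The healthcheck interval would need to be comfortably longer than a slot so that a healthy node always shows progress between two samples, and a few retries before declaring the container unhealthy would avoid flapping during a fresh sync.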

Repro environment

Related work in PR #953:

  • 71f24084 — replaced RECONNECT_QUEUE delay-map with tokio::spawn + SwarmCommand::DialReconnect
  • c6cf2241 — moved reqresp FFI dispatch off swarm task
  • 03b4ef39 — let-binding before if let to release MutexGuard
  • e789de6f — removed redundant lock acquisition in schedule_reconnection max-attempts branch
