Symptom
On hosts where the zeam container runs with a small memory limit (≤4 GiB), the libp2p Rust bridge's swarm task soft-wedges ~5 minutes after startup, around the time of the first reconnect-storm (multiple peers hitting MAX_RECONNECT_ATTEMPTS=5 in close succession).
- Process stays alive — Docker's restart policy does NOT fire (it only restarts on process death).
- The Zig libxev / clock thread keeps ticking; the validator keeps trying to construct attestations but fails with "behind peers".
- The libp2p swarm task stops processing entirely: no gossip ingress, no reqresp responses, no [#942 arm] diag heartbeat.
- head stays frozen indefinitely (observed for 1h+ without recovery).
Only an external restart (or fleet redeploy) un-wedges the node.
Evidence (2026-05-31 devnet, image 2f54ab43 = commit e789de6f on PR #953 branch)
Same image, same code path on every node. The differentiator is environmental:
| node | host RAM | container mem limit | uptime | head @ clock=1183 | status |
|---|---|---|---|---|---|
| zeam_8 | 30 GiB | 30.59 GiB | 1 hour | 1175 | ✅ healthy, in lockstep with ethlambda/grandine |
| zeam_9 | 15 GiB | 4 GiB | 1 sec | 53 | OOM-restart cycle |
| zeam_10 | 15 GiB | 4 GiB | 1 hour | 53 | ⚠️ soft wedge (head unchanged for 30+ min) |
| zeam_11 | 15 GiB | 4 GiB | 2 sec | 61 | OOM-restart cycle |
| zeam_12 | 15 GiB | 4 GiB | 11 sec | 5 | restarted, fresh sync |
| zeam_13 | 15 GiB | 4 GiB | 1 hour | 53 | ⚠️ soft wedge (head unchanged for 30+ min) |
| zeam_14 | 15 GiB | (no limit) | Restarting (137) | - | OOM-killed |
| zeam_15 | 15 GiB | 4 GiB | 12 sec | 0 | restarted, fresh sync |
| ethlambda_1 | - | - | - | 1175 | ✅ reference |
| grandine_2 | - | - | - | 1175 | ✅ reference |
OOM restarts on 4 GiB hosts are accepted (shared machines). The wedges on zeam_10 / zeam_13 are NOT OOMs — both processes are alive for 1+ hour with the swarm task dead.
What's been ruled out
The libp2p-glue re-entrant Mutex deadlock that PR #953 fixed (commits 71f24084, c6cf2241, 03b4ef39, e789de6f) is NOT the cause of this wedge:
- zeam_8 (30 GiB) hits the same trigger — the first "Max reconnection attempts (5) reached for peer …" warning fires at slot 57 — and the swarm task survives cleanly, then handles 9 more of the same warnings over the next hour.
- zeam_13 (4 GiB) fires the same warning at slot 55 (11 seconds earlier) and the swarm task wedges immediately afterward, never to recover.
- The same fix code, same call sequence, same data — only the host/container memory limit differs.
Diagnostic signature on wedged hosts
The strongest fingerprint from the debug-level file log (/opt/lean-quickstart/data/<container>/consensus.log):
- On wedged nodes (zeam_10, zeam_13), the tokio-spawned diag task (spawn_quic_diag_emitter) never fires, not even once, in the entire log file (zero #942 arm strings in 24+ MB of log).
- On healthy zeam_8 with the same image, the same diag task fires within 0.6 seconds of container start and keeps firing every 10 seconds for the full uptime (74,802 #942 strings in 457 MB).
So the wedge isn't libp2p-internal — it's a tokio runtime symptom: spawned tasks never get polled. The swarm task itself runs healthily for ~5 minutes (we see normal peer-connect / Status request / Status response events), then stops.
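The heartbeat counts quoted above can be reproduced with a small scan of the log file. The following is a minimal sketch, assuming each heartbeat line contains the literal substring #942 arm (as in the figures quoted here); count_heartbeats is a hypothetical helper name, not something from the zeam tree.

```rust
use std::io::BufRead;

/// Marker substring assumed to appear on every diag heartbeat line.
const HEARTBEAT_MARKER: &str = "#942 arm";

/// Counts the lines that contain the heartbeat marker.
/// Bytes that are not valid UTF-8 are replaced lossily, so a stray binary
/// fragment in the log does not abort the scan.
fn count_heartbeats<R: BufRead>(reader: R) -> usize {
    let mut count = 0;
    for line in reader.split(b'\n') {
        // Stop at the first I/O error instead of retrying forever.
        let Ok(bytes) = line else { break };
        if String::from_utf8_lossy(&bytes).contains(HEARTBEAT_MARKER) {
            count += 1;
        }
    }
    count
}
```

To run it against a node, wrap the log in a reader, for example std::io::BufReader::new(std::fs::File::open(path)?), where path points at the container's consensus.log. A count of zero after more than a few seconds of uptime matches the wedged signature described above.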
Hypotheses
- Tokio runtime worker starvation under memory pressure. With new_multi_thread().worker_threads(2), if the system is paging or close to the cgroup memory limit, worker threads may not get scheduled. The spawn_quic_diag_emitter heartbeat never firing is consistent with this.
- Zig allocator / chain-state memory growth crowding out tokio. zeam's Zig side (chain worker, fork choice, validator) is memory-hungry. Under a 4 GiB limit, the headroom for tokio's per-task allocations (mostly small but numerous: each reconnect schedules a tokio::time::Sleep future) might be too tight.
- A blocking syscall in the swarm task's path under load. A synchronous lock or Drop somewhere in the OutgoingConnectionError / reqresp path that's fast on a fast machine but slow under memory pressure. None was identified by code audit, but this is worth re-checking.
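The first two hypotheses both depend on how close the container sits to its cgroup limit when the wedge happens, so sampling that directly may help separate them. Below is a minimal sketch, assuming a cgroup v2 host where the container sees its own memory.current and memory.max files under /sys/fs/cgroup; the function names are hypothetical.

```rust
/// Parses the contents of a cgroup v2 `memory.max` file.
/// The file holds either a byte count or the literal string "max" (no limit).
/// Returns None when there is no limit or the input cannot be parsed.
fn parse_limit(memory_max: &str) -> Option<u64> {
    let s = memory_max.trim();
    if s == "max" {
        None
    } else {
        s.parse().ok()
    }
}

/// Fraction of the memory limit currently in use (0.75 means 75%).
/// Returns None when no limit is set or either input is malformed.
fn usage_ratio(memory_current: &str, memory_max: &str) -> Option<f64> {
    let limit = parse_limit(memory_max)?;
    let current: u64 = memory_current.trim().parse().ok()?;
    if limit == 0 {
        return None;
    }
    Some(current as f64 / limit as f64)
}
```

Feeding it the two files via std::fs::read_to_string and logging the ratio every few seconds from a plain OS thread (not a tokio task) would keep the samples coming even if the runtime stops polling.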
Why this isn't a blocker for PR #953
PR #953 closes the original #942 race / deadlock chain and is verified healthy on adequately-provisioned hosts (zeam_8 at 30 GiB: 1 hour clean, 9× reconnect-storm survived). The soft wedge on memory-limited hosts is a distinct issue.
Suggested investigations
- Reproduce locally with a 4 GiB cgroup limit and capture a Tokio runtime trace (RUSTFLAGS="--cfg tokio_unstable", tokio-console).
- Audit allocations in the libp2p-glue hot path; consider preallocating RECONNECT_DELAYS_SECS Sleep futures or capping concurrent reconnect tasks.
- Profile the Zig side's memory footprint during the catch-up burst (chain-worker queue, fork-choice tree, message_id_fn cache) and consider per-component caps.
- Compare tokio::time::interval behavior on the diag emitter task vs a std::thread::sleep loop in a separate OS thread to confirm whether the runtime or the spawn primitive is at fault.
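The OS-thread side of the last comparison could look roughly like this. It is a minimal sketch using only the standard library; OsHeartbeat is a hypothetical name, and wiring its tick count into the existing diag log output is left out.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Heartbeat driven by a dedicated OS thread, independent of any async runtime.
struct OsHeartbeat {
    ticks: Arc<AtomicU64>,
    stop_flag: Arc<AtomicBool>,
    handle: Option<JoinHandle<()>>,
}

impl OsHeartbeat {
    /// Spawns the thread; it sleeps for `period`, bumps the counter, and repeats.
    fn start(period: Duration) -> Self {
        let ticks = Arc::new(AtomicU64::new(0));
        let stop_flag = Arc::new(AtomicBool::new(false));
        let (t, s) = (Arc::clone(&ticks), Arc::clone(&stop_flag));
        let handle = thread::spawn(move || {
            while !s.load(Ordering::Relaxed) {
                thread::sleep(period);
                t.fetch_add(1, Ordering::Relaxed);
            }
        });
        Self { ticks, stop_flag, handle: Some(handle) }
    }

    /// Number of ticks observed so far.
    fn ticks(&self) -> u64 {
        self.ticks.load(Ordering::Relaxed)
    }

    /// Signals the thread to exit, waits for it, and returns the final count.
    fn stop(mut self) -> u64 {
        self.stop_flag.store(true, Ordering::Relaxed);
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
        self.ticks.load(Ordering::Relaxed)
    }
}
```

If this counter keeps advancing on a wedged node while the tokio-side heartbeat stays silent, the runtime is the likelier culprit. If both stop, the process as a whole is probably not getting CPU time.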
Operational mitigation in the interim
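Add a Docker healthcheck that detects the soft wedge so the container can be restarted automatically. Example: confirm that lean_head_slot from the metrics endpoint advances over a window, or that the [#942 arm] heartbeat in /opt/lean-quickstart/data/<container>/consensus.log has fired in the last N seconds. Note that on a standalone Docker engine a failing healthcheck only marks the container unhealthy; the restart policy still reacts only to process exit, so the check has to be paired with something that acts on the unhealthy status (for example Swarm mode or an external watcher). With that in place, this is operationally equivalent to the OOM auto-recovery we already accept on small hosts.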
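The head-advance variant of the check can be sketched as follows. This assumes the metrics endpoint serves Prometheus text exposition format with an unlabelled lean_head_slot sample, and that the caller keeps the previous reading between runs (for example in a small state file). Both function names are hypothetical.

```rust
/// Extracts the value of an unlabelled `lean_head_slot` sample from
/// Prometheus-style metrics text. Comment lines (# HELP, # TYPE) are skipped.
fn parse_head_slot(metrics: &str) -> Option<u64> {
    metrics.lines().find_map(|line| {
        let line = line.trim();
        if line.starts_with('#') {
            return None;
        }
        let mut parts = line.split_whitespace();
        if parts.next()? != "lean_head_slot" {
            return None;
        }
        // Prometheus values are floats; accept only finite, non-negative ones.
        let value: f64 = parts.next()?.parse().ok()?;
        if value.is_finite() && value >= 0.0 {
            Some(value as u64)
        } else {
            None
        }
    })
}

/// Maps two consecutive readings to a Docker healthcheck exit code:
/// 0 = healthy, 1 = unhealthy.
fn exit_code(previous: Option<u64>, current: Option<u64>) -> i32 {
    match (previous, current) {
        // Head advanced since the last check.
        (Some(prev), Some(cur)) if cur > prev => 0,
        // First sample: nothing to compare against yet.
        (None, Some(_)) => 0,
        // Head did not advance, or the metric could not be read.
        _ => 1,
    }
}
```

The interval between checks acts as the window, so it should be comfortably longer than the time the head normally takes to move, otherwise a healthy node will be flagged.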
Repro environment
- Commit: e789de6f (head of perf/blocks-by-range-threshold-942-followup, PR #953 "node, cli, types: blocks_by_range threshold 64→4 + per-call aggregator-children cap (#942, #940)")
- Image: 2f54ab43fd5eae...
Related work in PR #953:
- 71f24084: replaced RECONNECT_QUEUE delay-map with tokio::spawn + SwarmCommand::DialReconnect
- c6cf2241: moved reqresp FFI dispatch off swarm task
- 03b4ef39: let-binding before if let to release MutexGuard
- e789de6f: removed redundant lock acquisition in schedule_reconnection max-attempts branch