libp2p-glue: soft wedge of swarm task on memory-limited (≤4 GiB) hosts after first reconnect-storm

## Symptom

On hosts where the zeam container runs with a small memory limit (≤4 GiB), the libp2p Rust bridge's swarm task **soft-wedges** ~5 minutes after startup, around the time of the first reconnect-storm (multiple peers hitting `MAX_RECONNECT_ATTEMPTS=5` in close succession).

- Process stays alive — Docker's restart policy does NOT fire (it only restarts on process death).
- The Zig libxev / clock thread keeps ticking; the validator keeps trying to construct attestations but fails with "behind peers".
- The libp2p swarm task stops processing entirely: no gossip ingress, no reqresp responses, no `[#942 arm]` diag heartbeat.
- `head` stays frozen indefinitely (observed for 1h+ without recovery).

Only an external restart (or fleet redeploy) un-wedges the node.

## Evidence (2026-05-31 devnet, image `2f54ab43` = commit `e789de6f` on PR #953 branch)

Same image, same code path on every node. The differentiator is environmental:

| node | host RAM | container mem limit | uptime | head @ clock=1183 | status |
|---|---|---|---|---|---|
| **zeam_8** | **30 GiB** | **30.59 GiB** | 1 hour | **1175** | ✅ healthy, in lockstep with ethlambda/grandine |
| zeam_9 | 15 GiB | 4 GiB | 1 sec | 53 | OOM-restart cycle |
| **zeam_10** | 15 GiB | **4 GiB** | **1 hour** | **53** | ⚠️ soft wedge (head unchanged for 30+ min) |
| zeam_11 | 15 GiB | 4 GiB | 2 sec | 61 | OOM-restart cycle |
| zeam_12 | 15 GiB | 4 GiB | 11 sec | 5 | restarted, fresh sync |
| **zeam_13** | 15 GiB | **4 GiB** | **1 hour** | **53** | ⚠️ soft wedge (head unchanged for 30+ min) |
| zeam_14 | 15 GiB | (no limit) | Restarting (137) | — | OOM-killed |
| zeam_15 | 15 GiB | 4 GiB | 12 sec | 0 | restarted, fresh sync |
| ethlambda_1 | — | — | — | 1175 | ✅ reference |
| grandine_2 | — | — | — | 1175 | ✅ reference |

OOM restarts on 4 GiB hosts are accepted (shared machines). The wedges on **zeam_10 / zeam_13 are NOT OOMs**: both processes have been alive for 1+ hour with the swarm task dead.

## What's been ruled out

The libp2p-glue re-entrant Mutex deadlock that PR #953 fixed (commits `71f24084`, `c6cf2241`, `03b4ef39`, `e789de6f`) is **NOT** the cause of this wedge:

- zeam_8 (30 GiB) **hits the same trigger** — the first "Max reconnection attempts (5) reached for peer …" warning fires at slot 57 — and the swarm task survives cleanly, then handles **9 more** of the same warnings over the next hour.
- zeam_13 (4 GiB) fires the same warning at slot 55 (11 seconds earlier) and the swarm task wedges immediately afterward, **never to recover**.
- The same fix code, same call sequence, same data — only the host/container memory limit differs.

## Diagnostic signature on wedged hosts

The strongest fingerprint from the debug-level file log (`/opt/lean-quickstart/data/<container>/consensus.log`):

> On wedged nodes (zeam_10, zeam_13), the tokio-spawned diag task (`spawn_quic_diag_emitter`) **never fires, not even once**, anywhere in the log file (zero `#942 arm` strings in 24+ MB of log).
>
> On healthy zeam_8 with the same image, the same diag task fires within **0.6 seconds** of container start and continues firing every 10 seconds for the full uptime (74,802 `#942` strings in 457 MB).

This suggests the wedge is not libp2p-internal but a tokio runtime symptom: on the affected hosts, spawned tasks never get polled. The swarm task itself runs healthily for ~5 minutes (we see normal peer-connect / Status request / Status response events), then stops.
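To make the fingerprint above easy to check on other hosts, here is a minimal sketch that counts heartbeat lines in a log and reports where the last one occurred. The `#942 arm` marker comes from this issue; `scan_heartbeats` and `HeartbeatStats` are hypothetical names, and the layout of the rest of each log line is not assumed.

```rust
use std::io::{self, BufRead};

/// Marker string quoted in this issue; everything else about the line layout is unknown.
const HEARTBEAT_MARKER: &str = "#942 arm";

/// Summary of heartbeat occurrences in one log stream (hypothetical helper type).
pub struct HeartbeatStats {
    /// Number of lines containing the marker.
    pub count: usize,
    /// 1-based line number of the last line containing the marker, if any.
    pub last_line: Option<usize>,
    /// Total number of lines read.
    pub total_lines: usize,
}

/// Scans a log line by line. Bytes are decoded lossily so that a stray
/// non-UTF-8 byte in a large debug log does not abort the scan.
pub fn scan_heartbeats<R: BufRead>(mut reader: R) -> io::Result<HeartbeatStats> {
    let mut stats = HeartbeatStats { count: 0, last_line: None, total_lines: 0 };
    let mut buf = Vec::new();
    loop {
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        stats.total_lines += 1;
        if String::from_utf8_lossy(&buf).contains(HEARTBEAT_MARKER) {
            stats.count += 1;
            stats.last_line = Some(stats.total_lines);
        }
    }
    Ok(stats)
}
```

Wrapping the `consensus.log` file in a `std::io::BufReader` and passing it in should give `count == 0` on a node matching the wedged signature. A `last_line` far from `total_lines` would instead point at a heartbeat that started and later stopped.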

## Hypotheses

1. **Tokio runtime worker starvation under memory pressure.** With `new_multi_thread().worker_threads(2)`, if the system is paging or close to the cgroup memory limit, worker threads may not get scheduled. The `spawn_quic_diag_emitter` heartbeat never firing is consistent with this.
2. **Zig allocator / chain-state memory growth crowding out tokio.** zeam's Zig side (chain worker, fork choice, validator) is memory-hungry. Under a 4 GiB limit, the headroom left for tokio's per-task allocations (mostly small but numerous: each reconnect schedules a `tokio::time::Sleep` future) might be too tight.
3. **A blocking call in the swarm task's path under load.** A synchronous lock or Drop somewhere in the OutgoingConnectionError / reqresp path that is fast on an unloaded machine but slow under memory pressure. The code audit identified none, but this is worth re-checking.
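Hypotheses 1 and 2 both depend on how close the container is to its cgroup limit when the wedge happens. A minimal sketch for sampling that is below. It assumes the container sees a cgroup v2 unified hierarchy, where `memory.current` and `memory.max` sit under `/sys/fs/cgroup`; cgroup v1 hosts use different file names. All function and type names are hypothetical.

```rust
use std::fs;

/// Parsed contents of cgroup v2 `memory.max`.
#[derive(Debug, PartialEq)]
pub enum MemLimit {
    /// The file contains the literal `max`, i.e. no limit is set.
    Unlimited,
    /// Limit in bytes.
    Bytes(u64),
}

/// Parses the raw text of `memory.max`; returns None if it is neither `max` nor a number.
pub fn parse_memory_max(raw: &str) -> Option<MemLimit> {
    match raw.trim() {
        "max" => Some(MemLimit::Unlimited),
        s => s.parse().ok().map(MemLimit::Bytes),
    }
}

/// Fraction of the limit currently in use; None when there is no finite limit.
pub fn usage_ratio(current: u64, limit: &MemLimit) -> Option<f64> {
    match limit {
        MemLimit::Bytes(max) if *max > 0 => Some(current as f64 / *max as f64),
        _ => None,
    }
}

/// Reads both files from the assumed cgroup v2 location inside the container.
pub fn read_cgroup_usage() -> Option<f64> {
    let current: u64 = fs::read_to_string("/sys/fs/cgroup/memory.current")
        .ok()?
        .trim()
        .parse()
        .ok()?;
    let limit = parse_memory_max(&fs::read_to_string("/sys/fs/cgroup/memory.max").ok()?)?;
    usage_ratio(current, &limit)
}
```

Logging `read_cgroup_usage()` from a plain OS thread every few seconds would show whether the ratio is near 1.0 at the moment the swarm task stops, which would favour the memory-pressure hypotheses over hypothesis 3.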

## Why this isn't a blocker for PR #953

PR #953 closes the original #942 race / deadlock chain and is verified healthy on adequately-provisioned hosts (zeam_8 at 30 GiB: 1 hour clean, 9× reconnect-storm survived). The soft wedge on memory-limited hosts is a distinct issue.

## Suggested investigations

- Reproduce locally with a 4 GiB cgroup limit and capture a Tokio runtime trace (build with `RUSTFLAGS="--cfg tokio_unstable"`, inspect with `tokio-console`).
- Audit allocations in the libp2p-glue hot path; consider preallocating `RECONNECT_DELAYS_SECS` Sleep futures or capping concurrent reconnect tasks.
- Profile the Zig side's memory footprint during the catch-up burst — chain-worker queue, fork-choice tree, message_id_fn cache — and consider per-component caps.
- Compare `tokio::time::interval` behavior on the diag emitter task vs a `std::thread::sleep` loop in a separate OS thread to confirm whether the runtime or the spawn primitive is at fault.
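For the last comparison, the OS-thread side could look roughly like the sketch below: a heartbeat driven by `std::thread::sleep` that never touches the tokio runtime. `OsHeartbeat` and the `[os-heartbeat]` log prefix are hypothetical; the real diag emitter's output format is not reproduced here.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Heartbeat on a dedicated OS thread, independent of any async runtime.
pub struct OsHeartbeat {
    ticks: Arc<AtomicU64>,
    stop_flag: Arc<AtomicBool>,
    handle: Option<JoinHandle<()>>,
}

impl OsHeartbeat {
    /// Spawns the thread; it ticks once immediately, then once per `period`.
    pub fn start(period: Duration) -> Self {
        let ticks = Arc::new(AtomicU64::new(0));
        let stop_flag = Arc::new(AtomicBool::new(false));
        let (t, s) = (Arc::clone(&ticks), Arc::clone(&stop_flag));
        let handle = thread::spawn(move || {
            while !s.load(Ordering::Relaxed) {
                let n = t.fetch_add(1, Ordering::Relaxed) + 1;
                eprintln!("[os-heartbeat] tick {n}");
                thread::sleep(period);
            }
        });
        Self { ticks, stop_flag, handle: Some(handle) }
    }

    /// Number of ticks emitted so far.
    pub fn ticks(&self) -> u64 {
        self.ticks.load(Ordering::Relaxed)
    }

    /// Signals the thread to stop, waits for it, and returns the final tick count.
    pub fn stop(mut self) -> u64 {
        self.stop_flag.store(true, Ordering::Relaxed);
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
        self.ticks()
    }
}
```

If this heartbeat keeps ticking on a 4 GiB host while the tokio-spawned diag task stays silent, the process is still being scheduled and the problem is narrowed to the runtime. If both go silent, the starvation is at the OS or cgroup level.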

## Operational mitigation in the interim

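Add a Docker healthcheck that detects the soft wedge, for example by confirming that `lean_head_slot` from the metrics endpoint advances over a window, or that the `[#942 arm]` heartbeat in `/opt/lean-quickstart/data/<container>/consensus.log` has fired in the last N seconds. Note that on a standalone Docker engine a failing healthcheck only marks the container `unhealthy`; the restart policy fires on process exit, so the check (or an external watcher) has to terminate the process for the restart to happen. Operationally this is equivalent to the OOM auto-recovery we already accept on small hosts.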
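The head-advance check could be sketched as below. It assumes the metrics endpoint serves the Prometheus text exposition format and that `lean_head_slot` is a gauge; `parse_gauge`, `judge` and `Health` are hypothetical names. Fetching the two samples and acting on the result are left to the caller.

```rust
/// Extracts a non-negative gauge value from Prometheus text exposition output.
/// Returns the first sample whose metric name matches `name` exactly.
pub fn parse_gauge(body: &str, name: &str) -> Option<u64> {
    body.lines()
        .map(str::trim)
        .filter(|l| !l.starts_with('#')) // skip HELP / TYPE comment lines
        .find_map(|l| {
            let rest = l.strip_prefix(name)?;
            // After the name we accept either a label set or whitespace, so that
            // a longer metric name sharing the prefix is not matched by mistake.
            let rest = if rest.starts_with('{') {
                &rest[rest.find('}')? + 1..]
            } else if rest.starts_with(char::is_whitespace) {
                rest
            } else {
                return None;
            };
            let v: f64 = rest.split_whitespace().next()?.parse().ok()?;
            if v.is_finite() && v >= 0.0 { Some(v as u64) } else { None }
        })
}

/// Verdict from two samples taken one observation window apart.
#[derive(Debug, PartialEq)]
pub enum Health {
    Healthy,
    Wedged,
    Unknown,
}

/// The head must strictly advance within the window; a missing sample is inconclusive.
pub fn judge(earlier: Option<u64>, later: Option<u64>) -> Health {
    match (earlier, later) {
        (Some(a), Some(b)) if b > a => Health::Healthy,
        (Some(_), Some(_)) => Health::Wedged,
        _ => Health::Unknown,
    }
}
```

The window should be long enough to cover normal sync pauses. Treating `Unknown` as a failure or as a pass is a policy choice: failing fast also catches a dead metrics endpoint, at the cost of restarts during startup.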

## Repro environment

- zeam commit: `e789de6f` (head of `perf/blocks-by-range-threshold-942-followup`, PR #953)
- Image SHA: `2f54ab43fd5eae...`
- Devnet hosts: see ansible-devnet config; reproduces reliably on the 15 GiB hosts with a 4 GiB container limit (zeam_10, zeam_13).
- Healthy reference: zeam_8 (30 GiB host, 30.59 GiB container limit).

Related work in PR #953:
- `71f24084` — replaced `RECONNECT_QUEUE` delay-map with `tokio::spawn` + `SwarmCommand::DialReconnect`
- `c6cf2241` — moved reqresp FFI dispatch off swarm task
- `03b4ef39` — let-binding before `if let` to release MutexGuard
- `e789de6f` — removed redundant lock acquisition in `schedule_reconnection` max-attempts branch
