
network: replace rust libp2p-glue with zig-libp2p v0.1.34 #968

Draft
ch4r10t33r wants to merge 38 commits into main from feat/replace-libp2p-glue

@ch4r10t33r commented Jun 2, 2026

Contributor

Summary

Replaces zeam's Rust libp2p-glue FFI stack with a pure-Zig networking path built on zig-libp2p v0.1.34 (transitively zquic v1.6.15). zeam node beam, simtest, and the Docker devnet entry point now use EthLibp2pV2 over QUIC + libp2p TLS.

Net change: ~8,000 LOC of Rust glue removed; ~1,900 LOC of native Zig networking added (ethlibp2p_v2.zig, gossip_codec.zig).

What changed

Networking (pkgs/network/src/ethlibp2p_v2.zig)

  • QUIC transport via zig-libp2p QuicRuntime: listen/dial, multistream-select, gossipsub, req/resp.
  • Host identity: secp256k1 private key read from --node-key (lean-quickstart .key format), matching eth-beacon-genesis ENR-derived /p2p/... peer ids.
  • In-memory TLS PEM (no cert files on disk; works in FROM scratch containers).
  • Gossipsub: SSZ + snappy encode/decode, publish/subscribe, #942 snappy hardening reused from gossip_codec.zig.
  • Req/resp: status, blocks_by_root, blocks_by_range — SSZ framing, callback dispatch, in-flight failure on peer disconnect.
  • Peer events: connect/disconnect/dial-failure with base58 PeerIds matching the legacy Rust path.
  • Bootnodes: registered as host.registerKnownPeer and host.addDirectPeer so they are always mesh-eligible and GRAFTed on the next gossipsub heartbeat instead of waiting for a SUBSCRIBE/GRAFT round trip.
  • Metrics parity: lean_gossip_mesh_peers, zeam_libp2p_swarm_command_dropped_total.
  • Observability: gossip / SUBSCRIBE / mesh-status logs land on the debug channel (only the "registered bootnode" message stays at info).
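For readers unfamiliar with how a secp256k1 key maps to a base58 `/p2p/...` peer id, here is a minimal sketch following the libp2p peer-id convention (protobuf-wrapped public key, identity multihash, base58btc). It is not zeam's or zig-libp2p's code; the function names are hypothetical and the input is assumed to be an already-derived 33-byte compressed public key.

```python
# Illustrative sketch only: names are hypothetical, not zeam / zig-libp2p APIs.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Base58btc-encode bytes; each leading zero byte becomes a '1'."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def b58decode(text: str) -> bytes:
    """Inverse of b58encode."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


def peer_id_from_secp256k1(compressed_pubkey: bytes) -> str:
    """Build a base58 PeerId from a 33-byte compressed secp256k1 public key."""
    if len(compressed_pubkey) != 33 or compressed_pubkey[0] not in (2, 3):
        raise ValueError("expected a 33-byte compressed secp256k1 public key")
    # protobuf PublicKey: field 1 (key type, Secp256k1 = 2), field 2 (33 key bytes)
    key_proto = b"\x08\x02\x12\x21" + compressed_pubkey
    # Short keys use the identity multihash: code 0x00 followed by the length.
    multihash = bytes([0x00, len(key_proto)]) + key_proto
    return b58encode(multihash)
```

Because the multihash starts with a zero byte, ids produced this way always begin with `1`; matching the legacy Rust path then comes down to both sides agreeing on the same public-key bytes.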

Removed

  • rust/libp2p-glue/ (~5,700 LOC) and pkgs/network/src/ethlibp2p.zig (~2,400 LOC).
  • zeam-glue libp2p feature flag from the Rust prover workspace (prover staticlibs unchanged).

Other zeam fixes on this branch

  • node/chain: fix use-after-free in produceBlock / finalizeProposalIfReady: wrapOwnedStateIntoRc frees the post_state wrapper, but forkChoice.onBlock was still passed the dangling pointer (segfault at first block proposal under ReleaseSafe).
  • node: publish blocks before the redundant local onBlock, fix mesh-peer metric registration, fix gossip-source attribution so the proposer-coverage report no longer reports gossip=none.
  • cli, build: restore the mock simtest harness and params-scaled timeouts.
  • leanSpec: bump submodule to e8014f9 for devnet5 SSZ fixtures.
  • Docker: exclude **/target/ from build context (avoids copying multi-GB rust/target/).
  • AGENTS.md: document the "commit and push when implementation work is complete" workflow.

Upstream pins (zig-libp2p v0.1.11 → v0.1.34)

Initial bring-up:

| Release | What zeam needed |
|---|---|
| v0.1.11 | zquic v1.6.13; zig↔go QUIC ping; gossipsub cross-impl fixes |
| v0.1.12 | zquic v1.6.14 Initial retransmit replay (ethlambda inbound dials) |
| v0.1.13 | Inbound gossip publish; zquic v1.6.15 client Handshake CRYPTO reassembly |
| v0.1.14 | Fix CertificateNotYetValid for rcgen/quinn certs (RFC 5280 UTCTime two-digit year) |
| v0.1.15 | Raise gossip/req-resp wire caps for hash-sig blocks; skip oversize gossip frames |
| v0.1.16 | unified-testing TCP+TLS+Yamux transport interop (#186) + interop_quic cross-impl fixes (#183/#184) |
| v0.1.17 | persistent per-peer /meshsub/1.0.0 stream (#189) |

Multistream / req-resp parity with rust-libp2p:

| Release | What zeam needed |
|---|---|
| v0.1.18 | Delimited req/resp and /ipfs/id/1.0.0 + /ipfs/ping/1.0.0 inbound for rust-libp2p |
| v0.1.19 | Persist multistream accumulator across drive ticks (was losing inbound bytes on reactor boundary) |
| v0.1.20 | Forward multistream-select tail into dispatch accs (handle byte loss after \n) |
| v0.1.21 | Fix '/' vs delimited framing collision in initiator/responder negotiate |
| v0.1.22 | Respond to inbound /ipfs/id/push/1.0.0 from rust-libp2p (otherwise the connection RST'd) |
| v0.1.23 | Release inbound req/resp raw-app slots after response (slot leak wedged status RPCs) |
| v0.1.24 | Surface remote-initiated streams on outbound connections (rust-libp2p opens inbound req/resp on the dialer side) |
| v0.1.25 | Drain gossipsub outbox GRAFT/PRUNE to persistent mesh streams (heartbeat had nowhere to write) |
| v0.1.26 | Read server-initiated streams from Client.raw_app_recv (mirror of v0.1.24 in the raw-app layer) |
| v0.1.27 | Accept /meshsub/{1.0,1.1,1.2,1.3}.0 in the multistream responder (rust-libp2p offers 1.1 / 1.2) |
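The v0.1.19 and v0.1.20 entries both concern bytes that arrive split across reads or trail the final negotiation line. A rough sketch of that idea, assuming the standard multistream-select framing (unsigned-varint length, message, trailing `\n`): the accumulator keeps partial input between calls and hands any leftover bytes to the negotiated protocol. The class and method names are hypothetical, not zig-libp2p's.

```python
# Illustrative sketch only: class and method names are hypothetical.
from typing import Optional


def encode_message(protocol: str) -> bytes:
    """Frame one multistream-select message: uvarint length + text + newline."""
    body = protocol.encode() + b"\n"
    n, prefix = len(body), bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        prefix.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(prefix) + body


class MultistreamAccumulator:
    """Buffers inbound bytes so partial frames survive across reads (ticks)."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> None:
        self._buf += chunk

    def next_message(self) -> Optional[str]:
        """Return one complete message, or None if more bytes are needed."""
        length, shift = 0, 0
        for i, byte in enumerate(self._buf):
            length |= (byte & 0x7F) << shift
            if byte & 0x80 == 0:
                header = i + 1
                break
            shift += 7
            if shift > 63:
                raise ValueError("varint too long")
        else:
            return None  # length prefix not complete yet
        if len(self._buf) < header + length:
            return None  # body not complete yet; keep what we have
        body = bytes(self._buf[header:header + length])
        del self._buf[:header + length]
        if not body.endswith(b"\n"):
            raise ValueError("multistream message must end with a newline")
        return body[:-1].decode()

    def take_tail(self) -> bytes:
        """Bytes received after the last parsed message; they belong to the
        negotiated protocol and must be forwarded, not discarded."""
        tail = bytes(self._buf)
        self._buf.clear()
        return tail
```

A caller would keep one accumulator per stream, call `next_message()` until negotiation finishes, then pass `take_tail()` to the protocol handler so nothing after the final `\n` is lost.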

Persistent-stream stability + protobuf forward-compat:

| Release | What zeam needed |
|---|---|
| v0.1.28 | Recreate wedged persistent gossip streams on handshake/write failure |
| v0.1.29 | Route gossip publishes via per-message streams (superseded by v0.1.30; overran rust-libp2p's MAX_SUBSTREAM_ATTEMPTS = 1) |
| v0.1.30 | Single persistent /meshsub stream per peer for the connection lifetime; mark broken on failure instead of opening a second stream |
| v0.1.31 | Surface remote outbound QUIC CONNECTION_CLOSE to host.onConnectionClosed so the connection_manager can redial with backoff |
| v0.1.32 | Treat unknown SubOpts protobuf fields as forward-compat (skip, not UnsupportedWireType); isolate inbound RPC decode phases so a subscription decode error no longer drops the embedded control/publish frames |
| v0.1.33 | Log unknown protobuf fields once per (scope, field_number, wire_type) so we can file upstream issues for the fields rust-libp2p actually sends |
| v0.1.34 | Collapse persistent /meshsub handshake into one reactor tick; eager-submit the first dial from registerKnownPeer; fan out to=null SUBSCRIBE/UNSUBSCRIBE broadcasts in drainGossipsubOutbox instead of dropping them |
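The v0.1.32 / v0.1.33 behaviour (skip unknown fields by wire type, log each distinct one once) can be sketched as follows. This is a simplified illustration, not zig-libp2p's decoder: it assumes the usual gossipsub `SubOpts` layout (field 1 `subscribe` as a bool, field 2 `topicid` as a string), and the function names and log format are made up for the example.

```python
# Illustrative sketch only: function names and log text are hypothetical.
def read_uvarint(buf: bytes, pos: int):
    """Decode one protobuf varint starting at pos; return (value, new_pos)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def skip_field(buf: bytes, pos: int, wire_type: int) -> int:
    """Advance past a field we do not understand, based on its wire type."""
    if wire_type == 0:                      # varint
        _, pos = read_uvarint(buf, pos)
    elif wire_type == 1:                    # fixed 64-bit
        pos += 8
    elif wire_type == 2:                    # length-delimited
        n, pos = read_uvarint(buf, pos)
        pos += n
    elif wire_type == 5:                    # fixed 32-bit
        pos += 4
    else:
        raise ValueError(f"unsupported wire type {wire_type}")
    if pos > len(buf):
        raise ValueError("truncated field")
    return pos


def decode_sub_opts(buf: bytes, seen: set, warnings: list, scope: str = "SubOpts"):
    """Decode subscribe/topicid; skip unknown fields, logging each kind once."""
    subscribe, topic, pos = None, None, 0
    while pos < len(buf):
        key, pos = read_uvarint(buf, pos)
        field, wire = key >> 3, key & 0x07
        if field == 1 and wire == 0:
            value, pos = read_uvarint(buf, pos)
            subscribe = bool(value)
        elif field == 2 and wire == 2:
            n, pos = read_uvarint(buf, pos)
            if pos + n > len(buf):
                raise ValueError("truncated topic id")
            topic = buf[pos:pos + n].decode()
            pos += n
        else:
            pos = skip_field(buf, pos, wire)
            tag = (scope, field, wire)
            if tag not in seen:             # once per (scope, field, wire type)
                seen.add(tag)
                warnings.append(f"unknown field {field} wire_type {wire} in {scope}")
    return subscribe, topic
```

Keeping the `seen` set outside the decoder is what makes the log line appear once per process rather than once per message.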

Devnet verification (local-docker, zeam ↔ ethlambda)

Verified on lean-quickstart local-devnet with 0xpartha/zeam:zig:

| Check | Result |
|---|---|
| zeam ↔ ethlambda QUIC/TLS handshake | Peer connections establish both directions |
| CertificateNotYetValid | Fixed (v0.1.14) |
| Persistent /meshsub stream | Stable for the lifetime of the QUIC connection (v0.1.30) |
| GRAFT / PRUNE round trip | Drained to peers (v0.1.25) |
| Cross-client gossip mesh | lean_gossip_mesh_peers > 0, block + attestation propagation observed |
| Cross-client req/resp (status, blocks_by_*) | Working both directions |
| Inbound /ipfs/id/push | Replied (v0.1.22) |
| Remote QUIC close detection | host.onConnectionClosed fires; connection_manager redials with backoff (v0.1.31) |
| Forward-compatible protobuf | Unknown SubOpts fields skipped + logged once instead of breaking the RPC (v0.1.32 / v0.1.33) |
| Block production | All zeam nodes propose blocks without segfault (chain UAF fix) |
| Chain progress | Slots advance, finality observed on healthy runs |

Known limitations (follow-up, not blockers for this PR)

  • Chain forks under mixed-client load: Even with stable gossip and req/resp, the chain can still fork on busy slots. The suspected cause is zeam's fork-choice tie-breaking / safe-target selection in pkgs/node/src/chain.zig (consensus layer, not transport). Tracked separately.
  • Hash-sig block size: blocks reach ~3 MiB snappy-compressed; v0.1.15 raises wire caps to 16–64 MiB headroom.

Test plan

  • zig build test --summary all
  • zig build simtest — three nodes over real QUIC (all-to-all mesh, blocks propagate)
  • EthLibp2pV2 unit tests (protocol mapping, peer events, bootnode parsing, QuicRuntime bring-up)
  • Docker devnet: zeam ↔ ethlambda peering, gossip, block production
  • Persistent /meshsub stream survives cross-client GRAFT/PRUNE under sustained load
  • Full cross-client justification/finality across long-running devnets (gated on the fork-choice follow-up above)

@ch4r10t33r changed the title from "feat(network): replace Rust libp2p-glue with zig-libp2p v0.1.0 (skeleton)" to "feat(network): begin replacing Rust libp2p-glue with zig-libp2p v0.1.0" on Jun 2, 2026
@ch4r10t33r changed the title from "feat(network): begin replacing Rust libp2p-glue with zig-libp2p v0.1.0" to "feat(network): begin replacing Rust libp2p-glue with zig-libp2p" on Jun 3, 2026
@ch4r10t33r changed the title from "feat(network): begin replacing Rust libp2p-glue with zig-libp2p" to "feat(network): replace Rust libp2p-glue with zig-libp2p v0.1.2" on Jun 3, 2026
@ch4r10t33r changed the title from "feat(network): replace Rust libp2p-glue with zig-libp2p v0.1.2" to "feat(network): replace rust libp2p-glue with zig-libp2p v0.1.2" on Jun 3, 2026
@ch4r10t33r force-pushed the feat/replace-libp2p-glue branch from 0382486 to a21b7ad on June 4, 2026 23:32
@ch4r10t33r changed the title from "feat(network): replace rust libp2p-glue with zig-libp2p v0.1.2" to "network: replace rust libp2p-glue with zig-libp2p v0.1.15" on Jun 9, 2026
@ch4r10t33r marked this pull request as ready for review on June 9, 2026 09:13
@ch4r10t33r changed the title from "network: replace rust libp2p-glue with zig-libp2p v0.1.15" to "network: replace rust libp2p-glue with zig-libp2p v0.1.16" on Jun 9, 2026
@ch4r10t33r changed the title from "network: replace rust libp2p-glue with zig-libp2p v0.1.16" to "network: replace rust libp2p-glue with zig-libp2p v0.1.17" on Jun 10, 2026
@ch4r10t33r marked this pull request as draft on June 10, 2026 21:40

Replaces zeam's Rust libp2p-glue FFI stack with a pure-Zig networking path
built on zig-libp2p v0.1.34 (transitively zquic v1.6.15). `zeam node beam`,
simtest, and the Docker devnet entry point now use `EthLibp2pV2` over QUIC +
libp2p TLS. Net change: ~8,000 LOC of Rust glue removed; ~1,900 LOC of native
Zig networking added (`ethlibp2p_v2.zig`, `gossip_codec.zig`).

Networking (`pkgs/network/src/ethlibp2p_v2.zig`):
  - QUIC transport via QuicRuntime: listen/dial, multistream-select,
    gossipsub, req/resp.
  - Host identity: secp256k1 private key from `--node-key` matching
    eth-beacon-genesis ENR-derived `/p2p/...` peer ids.
  - In-memory TLS PEM (no cert files on disk; works in `FROM scratch`).
  - Gossipsub: SSZ + snappy encode/decode, publish/subscribe with #942
    snappy hardening reused from `gossip_codec.zig`.
  - Req/resp: status, blocks_by_root, blocks_by_range with SSZ framing,
    callback dispatch, in-flight failure on peer disconnect.
  - Peer events: connect/disconnect/dial-failure with base58 PeerIds.
  - Bootnodes promoted to gossipsub direct peers so they are always mesh-
    eligible and GRAFTed on the next heartbeat instead of waiting for a
    SUBSCRIBE/GRAFT round trip.
  - Metrics parity: `lean_gossip_mesh_peers`,
    `zeam_libp2p_swarm_command_dropped_total`.
  - Gossip / SUBSCRIBE / mesh-status logs land on debug; only
    `registered bootnode` stays at info.

Removed:
  - `rust/libp2p-glue/` (~5,700 LOC) and `pkgs/network/src/ethlibp2p.zig`
    (~2,400 LOC).
  - `zeam-glue` libp2p feature flag from the Rust prover workspace.

Other zeam fixes on this branch:
  - node/chain: fix use-after-free in `produceBlock` /
    `finalizeProposalIfReady` (wrapOwnedStateIntoRc freed `post_state`
    while forkChoice.onBlock was still passed the dangling pointer).
  - node: publish blocks before redundant local onBlock; fix mesh-peer
    metric registration; fix gossip-source attribution so the proposer-
    coverage report no longer reports `gossip=none`.
  - cli, build: restore mock simtest harness and params-scaled timeouts.
  - leanSpec: bump submodule to e8014f9 for devnet5 SSZ fixtures.
  - Docker: exclude `**/target/` from build context.
  - AGENTS.md: document the "commit and push when implementation work is
    complete" workflow.

Upstream pins (zig-libp2p v0.1.11 -> v0.1.34) span initial bring-up,
multistream / req-resp parity with rust-libp2p, and persistent-stream
stability + protobuf forward-compat. See PR description for the per-release
breakdown.

@ch4r10t33r force-pushed the feat/replace-libp2p-glue branch from 6cfbdf6 to 3394f6f on June 11, 2026 13:51
@ch4r10t33r changed the title from "network: replace rust libp2p-glue with zig-libp2p v0.1.17" to "network: replace rust libp2p-glue with zig-libp2p v0.1.34" on Jun 11, 2026

Picks up the gossipsub fix that keeps direct peers in the topic mesh on
inbound PRUNE. Without it, rust-libp2p PRUNEing the attestation subnet
mesh evicted the (direct) ethlambda peer from zeam's mesh, silently
killing zeam->ethlambda attestation delivery a few slots after startup
and stalling justification/finalization.

While propose_inflight is set, queue locally produced attestations
that reference the in-flight head and flush them immediately after
the block hits gossip. Prevents peers from seeing Unknown head block
when the Type-2 proof delay pushes block gossip past interval-1
attestation publish. Bump zig-libp2p to v0.1.36.
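
The hold-and-flush behaviour described in this commit can be pictured with a small sketch. Everything here is hypothetical (class name, method names, the dict-shaped attestation, and the assumption that the in-flight block root is known when the proposal starts); it only illustrates the ordering guarantee, not zeam's actual node code.

```python
# Illustrative sketch only: names and data shapes are hypothetical.
class AttestationGate:
    """Holds attestations for an in-flight proposal until the block is gossiped."""

    def __init__(self, publish):
        self._publish = publish          # callable that sends one attestation
        self.propose_inflight = None     # root of the block being produced, if any
        self._held = []

    def begin_proposal(self, block_root):
        self.propose_inflight = block_root

    def on_local_attestation(self, attestation):
        # Only attestations voting for the not-yet-gossiped head are delayed;
        # everything else goes out immediately.
        if self.propose_inflight is not None and attestation["head"] == self.propose_inflight:
            self._held.append(attestation)
        else:
            self._publish(attestation)

    def on_block_gossiped(self):
        # The block is on the wire, so peers can now resolve the head root.
        self.propose_inflight = None
        held, self._held = self._held, []
        for attestation in held:
            self._publish(attestation)
```

The point of the design is ordering: peers never receive an attestation whose head block they could not yet have seen from this node.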
Outbound-only persistent gossip publish fixes zeam→ethlambda interop
when the rust peer dials first.

Detects outbound QUIC connection close on draining flag instead of
waiting for phase==.closed. Fixes the case where ethlambda closes its
inbound legs to zeam after ~44s; previously zeam silently kept
publishing on the dead connection forever.

The success branch in `publishBlock` borrowed a cached post-state via
`chain.statesGet(block_root)` and registered the LIFO-pair sentinel
`defer state_borrow.assertReleasedOrPanic();` but never registered the
matching `defer state_borrow.deinit();`. Per the contract documented at
`locking.zig:1182`, the assert is intentionally registered FIRST so it
runs LAST — observing `released = true` only after `deinit` has already
dropped the underlying lock + refcount. Without the second defer the
borrow is never released, so the sentinel panics on every successful
publish path:

    thread N panic: BorrowedState dropped without release; backing=none
    pkgs/node/src/locking.zig:1205 in assertReleasedOrPanic
    pkgs/node/src/node.zig:3586    in publishBlock

This trips
`Node: publishBlock persists locally produced blocks for blocks-by-root
sync` and aborts the node test binary with SIGABRT. The fix is the
missing companion defer plus a comment pointing at the contract so a
future refactor does not drop it again.
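
The LIFO contract is easy to get backwards, so here is a small analogue using Python's `contextlib.ExitStack`, whose callbacks also run in reverse registration order. It only models the ordering; `BorrowedState` and `publish_block` below are stand-ins written for this example, not the Zig types from `locking.zig` or `node.zig`.

```python
# Illustrative analogue only: models Zig's LIFO `defer` ordering with ExitStack.
from contextlib import ExitStack


class BorrowedState:
    def __init__(self) -> None:
        self.released = False

    def deinit(self) -> None:
        self.released = True  # stands in for dropping the lock + refcount

    def assert_released_or_panic(self) -> None:
        if not self.released:
            raise RuntimeError("BorrowedState dropped without release")


def publish_block(register_release: bool) -> str:
    borrow = BorrowedState()
    with ExitStack() as defers:
        # Registered FIRST, so it runs LAST and can observe released == True.
        defers.callback(borrow.assert_released_or_panic)
        if register_release:
            # Registered SECOND, so it runs FIRST. Omitting it reproduces the bug.
            defers.callback(borrow.deinit)
        return "published"
```

With `register_release=False` the sentinel fires on scope exit even though the body succeeded, which mirrors the panic on every successful publish path.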
zig-libp2p v0.1.42 ships zquic v1.6.17, which adds RFC 9000 §10.1.2
keepalive PINGs. zquic's previous `checkPto` only emitted PINGs when
`bytes_in_flight > 0`; in zeam's gossipsub-heavy workload that meant
no ACK-eliciting packet left the QUIC client between our own slot
publishes while a rust-libp2p peer kept publishing, so the peer's
idle timer silently expired and rust-libp2p closed the connection
with an error-class reason ~44s after handshake. The bump fixes the
recurring zeam <-> ethlambda drops observed in the local devnet.

Also picks up the gossipsub IHAVE->IWANT handler, the
GRAFT-during-backoff score penalty, and the autonat v2 amplification
cost fix from zig-libp2p v0.1.41 (#192 upstream).

Pulls in two fixes for the recurring ~80s reason=error connection
closes between zeam and ethlambda on stable mesh topics:

* App-layer keepalive on persistent /meshsub streams: a 20s
  empty-control gossipsub RPC heartbeat that keeps rust-libp2p's
  connection handler from idle-closing when our gossipsub layer has
  nothing else to say (transport quic_runtime fix).

* zquic v1.6.18: declare connection lost after 2x max_idle_timeout
  without ACKs, so detectOutboundConnectionClose evicts and redials
  even when CONNECTION_CLOSE itself is dropped (kernel UDP buffer
overflow, NAT rebind, peer crash).

Pulls in zquic v1.7.0 via zig-libp2p v0.1.44, which buffers
flow-control-blocked raw STREAM bytes instead of silently dropping
them (RFC 9000 §4, §19.9, §19.13).  This is the actual root-cause
fix for the zeam ↔ ethlambda gossipsub stream wedge.

Before: rust-libp2p / quinn's default 128 KiB per-stream receive
window was easily exceeded by a single 188 KB aggregation or 235 KB
block; zquic would drop the tail of the frame and the raw-stream
writer adapter would advance send_offset past the gap, permanently
misaligning the receiver.  All earlier mitigations (v1.6.17
keepalive PINGs, v1.6.18 connection-lost detection, zig-libp2p
v0.1.43 app-layer gossip keepalive) addressed downstream symptoms.

zeam tests: all suites pass on the new dependency.

Pulls in the send_offset fix and zquic client CC gating so zeam→quinn gossip
does not wedge on STREAM offset holes after congestion blocks.

v0.1.50 recorded zquic as a .path git URL, breaking Docker fetches.

Byte-granular pacing lets sub-MSS gossip frames drain instead of
stalling behind an MSS token floor on the outbound client leg.

Coalesced pending STREAM entries are now capped at one MTU, so the byte-granular
pacer always has enough credit to drain head-of-queue under loopback bursts.

Brings in:

  * zquic v1.7.9: `Server.sendRawStreamData` now returns `usize`
    (bytes accepted).  Previously the server-side stream send silently
    dropped bytes when the pending-stream-send queue was exhausted;
    the embedder then advanced its `send_offset` past those bytes and
    the peer hung forever on the resulting STREAM gap.

  * zig-libp2p v0.1.54: every server-side `sendRawStreamData` call
    site now honors the accepted-bytes return so that refusals leave
    the per-stream offset unchanged and the next tick can retry.

Combined with the earlier client-side pacer/coalesce fixes (zquic
1.7.7/1.7.8), this closes both halves of the asymmetric gossip
wedge between zeam (zquic) and ethlambda (quinn).
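
A toy model of the "honor the accepted-bytes return" rule may help reviewers. The queue and writer classes below are hypothetical stand-ins, not the zquic or zig-libp2p types; the only point is that the writer advances its offset by what the queue actually accepted, so a refusal is retried on the next tick instead of leaving a gap.

```python
# Illustrative sketch only: classes are hypothetical stand-ins for the real types.
class BoundedSendQueue:
    """Accepts at most `capacity` pending bytes and reports how many it took."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._pending = bytearray()

    def send(self, data: bytes) -> int:
        room = self.capacity - len(self._pending)
        accepted = min(room, len(data))
        self._pending += data[:accepted]
        return accepted  # may be 0 when the queue is full

    def drain(self, n: int) -> bytes:
        """Simulate the transport putting up to n bytes on the wire."""
        out = bytes(self._pending[:n])
        del self._pending[:n]
        return out


class StreamWriter:
    """Embedder side: tracks how much of the payload the queue has accepted."""

    def __init__(self, queue: BoundedSendQueue, payload: bytes) -> None:
        self.queue = queue
        self.payload = payload
        self.send_offset = 0

    def tick(self) -> int:
        accepted = self.queue.send(self.payload[self.send_offset:])
        # Advance only by what was accepted; a refusal leaves the offset
        # unchanged so the same bytes are offered again on the next tick.
        self.send_offset += accepted
        return accepted

    def done(self) -> bool:
        return self.send_offset == len(self.payload)
```

The pre-fix behaviour corresponds to `self.send_offset += len(data)` regardless of the return value, which is exactly what produces a permanent STREAM gap on the receiver.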
Pulls in zquic v1.7.10: pacer burst budget now scales with cwnd
(`max(16 × MSS, cwnd / 8)`).  On loopback / high-bdp devnets this
clears a 200-frame gossipsub block in one drain pass instead of
fragmenting it across ~12 ms of pacer refills, which was the
remaining cause of pacer-stall bursts (`135 entries / 161 KB`)
observed after the v1.7.9 server-side fix.
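
For scale, the burst formula quoted in this commit can be evaluated directly. The function name is hypothetical, and the 1200-byte MSS and the example cwnd values are assumptions chosen for illustration, not measurements from the devnet.

```python
# Illustrative arithmetic only: MSS and cwnd values below are assumed.
def pacer_burst_budget(cwnd_bytes: int, mss_bytes: int) -> int:
    """Burst budget as quoted for zquic v1.7.10: max(16 x MSS, cwnd / 8)."""
    return max(16 * mss_bytes, cwnd_bytes // 8)


ASSUMED_MSS = 1200  # bytes; hypothetical value for the example

small = pacer_burst_budget(cwnd_bytes=10_000, mss_bytes=ASSUMED_MSS)
large = pacer_burst_budget(cwnd_bytes=2_000_000, mss_bytes=ASSUMED_MSS)
print(small, large)  # prints "19200 250000"
```

Under these assumed numbers the floor of 16 × MSS dominates for small windows, while a 2 MB window lifts the budget above the 161 KB burst size mentioned above.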
zquic v1.7.10's cwnd-scaled pacer burst regressed goodput on
loopback devnets: kernel UDP buffer drops -> RACK -> Cubic cut
cwnd -> LD ring + pending queue overran -> embedder saw silent
queue-full rejections.

v1.7.11 reverts to the v1.7.9 16 x MSS burst cap, which on the
devnet:

  * eliminated all queue_full warnings
  * restored zeam finalization to slot 60+
  * restored ethlambda gossip-receive throughput

Picks up the prioritized + log.warn-visible "declare conn lost
after 2× idle_timeout" branch in both `Client.checkPto` and
`Server.checkPto`.  Fixes the silent 10-minute wedge where a zeam
outbound to ethlambda stayed pinned at `ld=2048/2048, cc_bif=2.5MB`
after ethlambda's quinn evicted its end of the peer record —
`detectOutboundConnectionClose` will now flip `draining` and
`connection_manager` will redial within a minute instead of never.

…v0.1.65

`cli/main.zig`: add a runtime `DEBUG_QUIC` gate for the noisy QUIC-stack
log scopes (`.zquic`, `.quic_runtime`, `.quic_dcutr`, `.quic_relay`,
`.connection_manager`, `.tls`).  A small `quicAwareLogFn` installed via
`std_options.logFn` consults a `quic_debug_enabled` atomic and drops
`info`/`debug` messages from those scopes when it's false; `warn` and
`err` always pass through so genuine problems remain visible.  The
atomic is flipped to true at startup when the `DEBUG_QUIC` env var is
set to `1` / `true` / `yes` / `on` (matching `lean-quickstart`'s
shell-side contract for ethlambda/quinn).  Without this the recent
backpressure work in zig-libp2p emitted tens of MB of `quic_runtime:
persistent gossip outbox paused` lines per node on a sustained devnet
run, dwarfing the actually-useful chain logs.

`build.zig.zon`: bump zig-libp2p 0.1.64 → 0.1.65, which pulls in zquic
v1.7.19 and its three raw-stream retx alias-free guards (the latent
UAF that surfaced as a jemalloc SIGSEGV on the previous wedged-cwnd
devnet run).  No source-side changes required for the bump.

Zig 0.16 removed `std.process.getEnvVarOwned`; the CLI binary already
links libc (for the rust glue + rocksdb), so the env read uses
`std.c.getenv` directly — same pattern as `pkgs/xmss/src/shadow_cost.zig`.
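
The filtering rule described for `quicAwareLogFn` is small enough to sketch. This is a Python rendering of the rule as stated in the commit message, not the Zig implementation: the function names are hypothetical, exact (case-sensitive) matching of the env values is an assumption, and non-QUIC scopes are simply passed through here rather than going through the normal level filter.

```python
# Illustrative sketch only: mirrors the described rule, not the Zig code.
import os

QUIC_SCOPES = frozenset(
    {"zquic", "quic_runtime", "quic_dcutr", "quic_relay", "connection_manager", "tls"}
)
TRUTHY = frozenset({"1", "true", "yes", "on"})


def quic_debug_enabled(environ=os.environ) -> bool:
    """True when DEBUG_QUIC is set to one of the accepted values."""
    return environ.get("DEBUG_QUIC", "") in TRUTHY


def should_log(level: str, scope: str, quic_debug: bool) -> bool:
    """Decide whether a message from `scope` at `level` is emitted."""
    if scope not in QUIC_SCOPES:
        return True                 # other scopes are not affected by the gate
    if level in ("warn", "err"):
        return True                 # genuine problems always pass through
    return quic_debug               # info/debug from QUIC scopes need the gate
```

Reading the env var once at startup and caching the result matches the description of flipping an atomic at startup, so the per-message check stays cheap.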
Picks up the proactive persistent-gossip wedge timer in
quic_runtime: the /meshsub outbox is now declared wedged after 20 s
of fully-stuck backpressure (instead of waiting for zquic's 60 s
no-ACK conn-lost timer), which drives the existing
markPersistentGossipBroken -> closePeerConnectionForGossipRecovery ->
connection_manager redial -> replaySubscribeToPeer recovery path 3x
faster. Also promotes five recovery events from log.info to log.warn
so they survive the QUIC-stack log filter zeam installs by default
(previously the entire wedge -> close -> redial -> migrate sequence
was invisible without DEBUG_QUIC=1).

Fixes the asymmetric-gossip-induced finalization stall observed
between zeam and ethlambda on local-devnet, where the FFG aggregator
saw only 2/3 instead of 3/3 attestations and finalization stayed at
genesis even though the head kept advancing to slot 150+.

Picks up the redial dedupe fix in quic_runtime.handleDial: the
pre-dial check was using peerHasActiveConnection (inbound OR outbound)
instead of outbound_by_peer.contains, so every redial submitted by
connection_manager after the v0.1.66 wedge timer tore down the
outbound was being silently no-op'd while the peer's inbound leg was
still alive. That left gossip publish permanently dead to the peer
even though req/resp on the inbound kept working -- exactly the
observed local-devnet pattern where the wedge timer fired correctly
but ethlambda saw zero fresh zeam attestations afterwards and stayed
at Justified=1/Finalized=0 while zeam's pair self-justified to head
400+.
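
The difference between the two pre-dial checks is easiest to see side by side. The table class below is a hypothetical model of the two peer maps named in the commit message; it is not the `quic_runtime` data structure.

```python
# Illustrative sketch only: a toy model of the two pre-dial checks.
class ConnectionTable:
    def __init__(self) -> None:
        self.inbound_by_peer = set()
        self.outbound_by_peer = set()

    def peer_has_active_connection(self, peer: str) -> bool:
        return peer in self.inbound_by_peer or peer in self.outbound_by_peer

    def should_dial_before_fix(self, peer: str) -> bool:
        # Old check: any live leg suppresses the dial, including inbound-only.
        return not self.peer_has_active_connection(peer)

    def should_dial_after_fix(self, peer: str) -> bool:
        # New check: only an existing outbound leg suppresses the dial.
        return peer not in self.outbound_by_peer


table = ConnectionTable()
table.inbound_by_peer.add("ethlambda")  # outbound torn down, inbound still alive
```

In the state set up above, the old check refuses the redial while the new one allows it, which is the case the wedge timer exposed.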
Pulls in zquic v1.7.20 which adds secp256r1 to the ClientHello
supported_groups so QUIC dials to lantern (ngtcp2 + BoringSSL libp2p
server with ECDSA-P256 cert) complete the handshake instead of being
silently dropped at stalled_phase=initial.

Pulls in zquic v1.7.21: routes incoming QUIC Initials by client-chosen
DCID (init_dcid) instead of peer address, so peers like lantern /
c-lean-libp2p (ngtcp2) that retry the handshake with a fresh DCID get
their Initial keys re-derived correctly instead of silently failing AEAD
against the previous attempt's keys.  Fixes zeam <-> lantern peering in
the local devnet.

Picks up zquic v1.7.23 via zig-libp2p v0.1.71. The client now mints an
18-byte DCID for its first Initial so that ngtcp2 + AWS-LC servers
(lantern) stop silently dropping our handshake. zeam<->zeam and
zeam<->ethlambda paths were unaffected by the previous 8-byte DCID.