network: replace rust libp2p-glue with zig-libp2p v0.1.34 #968
Draft
ch4r10t33r wants to merge 38 commits into
Replaces zeam's Rust libp2p-glue FFI stack with a pure-Zig networking path
built on zig-libp2p v0.1.34 (transitively zquic v1.6.15). `zeam node beam`,
simtest, and the Docker devnet entry point now use `EthLibp2pV2` over QUIC +
libp2p TLS. Net change: ~8,000 LOC of Rust glue removed; ~1,900 LOC of native
Zig networking added (`ethlibp2p_v2.zig`, `gossip_codec.zig`).
Networking (`pkgs/network/src/ethlibp2p_v2.zig`):
- QUIC transport via QuicRuntime: listen/dial, multistream-select,
gossipsub, req/resp.
- Host identity: secp256k1 private key from `--node-key` matching
eth-beacon-genesis ENR-derived `/p2p/...` peer ids.
- In-memory TLS PEM (no cert files on disk; works in `FROM scratch`).
- Gossipsub: SSZ + snappy encode/decode, publish/subscribe with #942
snappy hardening reused from `gossip_codec.zig`.
- Req/resp: status, blocks_by_root, blocks_by_range with SSZ framing,
callback dispatch, in-flight failure on peer disconnect.
- Peer events: connect/disconnect/dial-failure with base58 PeerIds.
- Bootnodes promoted to gossipsub direct peers so they are always mesh-
eligible and GRAFTed on the next heartbeat instead of waiting for a
SUBSCRIBE/GRAFT round trip.
- Metrics parity: `lean_gossip_mesh_peers`,
`zeam_libp2p_swarm_command_dropped_total`.
- Gossip / SUBSCRIBE / mesh-status logs land on debug; only
`registered bootnode` stays at info.
Removed:
- `rust/libp2p-glue/` (~5,700 LOC) and `pkgs/network/src/ethlibp2p.zig`
(~2,400 LOC).
- `zeam-glue` libp2p feature flag from the Rust prover workspace.
Other zeam fixes on this branch:
- node/chain: fix use-after-free in `produceBlock` /
`finalizeProposalIfReady` (wrapOwnedStateIntoRc freed `post_state`
while forkChoice.onBlock was still passed the dangling pointer).
- node: publish blocks before redundant local onBlock; fix mesh-peer
metric registration; fix gossip-source attribution so the proposer-
coverage report no longer reports `gossip=none`.
- cli, build: restore mock simtest harness and params-scaled timeouts.
- leanSpec: bump submodule to e8014f9 for devnet5 SSZ fixtures.
- Docker: exclude `**/target/` from build context.
- AGENTS.md: document the "commit and push when implementation work is
complete" workflow.
Upstream pins (zig-libp2p v0.1.11 -> v0.1.34) span initial bring-up,
multistream / req-resp parity with rust-libp2p, and persistent-stream
stability + protobuf forward-compat. See PR description for the per-release
breakdown.
Picks up the gossipsub fix that keeps direct peers in the topic mesh on inbound PRUNE. Without it, rust-libp2p PRUNEing the attestation subnet mesh evicted the (direct) ethlambda peer from zeam's mesh, silently killing zeam->ethlambda attestation delivery a few slots after startup and stalling justification/finalization.
While propose_inflight is set, queue locally produced attestations that reference the in-flight head and flush them immediately after the block hits gossip. Prevents peers from seeing Unknown head block when the Type-2 proof delay pushes block gossip past interval-1 attestation publish. Bump zig-libp2p to v0.1.36.
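The queue-and-flush behaviour described above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the code on this branch: `Publisher`, `Attestation`, `publishAttestation`, and `blockPublished` are hypothetical names, a block root is shortened to a `u64`, and gossip is reduced to an in-memory log of 'B' (block) and 'A' (attestation) events.

```zig
const std = @import("std");

// Assumption: a block root is modeled as a u64 instead of a 32-byte hash.
const Root = u64;
const Attestation = struct { head: Root };

const Publisher = struct {
    /// Root of the locally proposed block that has not hit gossip yet.
    propose_inflight: ?Root = null,
    /// Fixed-capacity queue, for brevity only.
    pending: [8]Attestation = undefined,
    pending_len: usize = 0,
    /// Ordered record of what reached gossip: 'B' = block, 'A' = attestation.
    log: [16]u8 = undefined,
    log_len: usize = 0,

    fn gossip(self: *Publisher, kind: u8) void {
        self.log[self.log_len] = kind;
        self.log_len += 1;
    }

    /// Holds back attestations whose head is the in-flight block.
    fn publishAttestation(self: *Publisher, att: Attestation) void {
        if (self.propose_inflight) |root| {
            if (att.head == root) {
                self.pending[self.pending_len] = att;
                self.pending_len += 1;
                return;
            }
        }
        self.gossip('A');
    }

    /// Called once the block is on gossip: flush the queue right after it.
    fn blockPublished(self: *Publisher) void {
        self.gossip('B');
        self.propose_inflight = null;
        for (self.pending[0..self.pending_len]) |_| self.gossip('A');
        self.pending_len = 0;
    }
};

test "attestation for the in-flight head is published after the block" {
    var p = Publisher{ .propose_inflight = 42 };
    p.publishAttestation(.{ .head = 42 }); // queued, not gossiped yet
    p.blockPublished(); // block first, then the queued attestation
    try std.testing.expect(std.mem.eql(u8, p.log[0..p.log_len], "BA"));
}
```

Attestations that reference any other head bypass the queue, so only the votes that would otherwise arrive before their block are delayed.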
Outbound-only persistent gossip publish fixes zeam→ethlambda interop when the rust peer dials first.
Detects outbound QUIC connection close on draining flag instead of waiting for phase==.closed. Fixes the case where ethlambda closes its inbound legs to zeam after ~44s; previously zeam silently kept publishing on the dead connection forever.
The success branch in `publishBlock` borrowed a cached post-state via
`chain.statesGet(block_root)` and registered the LIFO-pair sentinel
`defer state_borrow.assertReleasedOrPanic();` but never registered the
matching `defer state_borrow.deinit();`. Per the contract documented at
`locking.zig:1182`, the assert is intentionally registered FIRST so it
runs LAST — observing `released = true` only after `deinit` has already
dropped the underlying lock + refcount. Without the second defer the
borrow is never released, so the sentinel panics on every successful
publish path:
thread N panic: BorrowedState dropped without release; backing=none
pkgs/node/src/locking.zig:1205 in assertReleasedOrPanic
pkgs/node/src/node.zig:3586 in publishBlock
This trips the test
`Node: publishBlock persists locally produced blocks for blocks-by-root
sync` and aborts the node test binary with SIGABRT. The fix is the
missing companion defer plus a comment pointing at the contract so a
future refactor does not drop it again.
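The LIFO pairing can be illustrated with a small standalone sketch. It assumes a toy `BorrowedState` that models only the `released` flag (no lock, no refcount); `runPairedDefers` is a hypothetical helper that records the order in which the two defers run.

```zig
const std = @import("std");

/// Hypothetical stand-in for the borrowed-state handle; it models only the
/// `released` flag, not the real lock or refcount.
const BorrowedState = struct {
    released: bool = false,

    /// Releases the borrow (in the real code: drops the lock + refcount).
    fn deinit(self: *BorrowedState) void {
        self.released = true;
    }

    /// Sentinel that panics if the borrow was never released.
    fn assertReleasedOrPanic(self: *const BorrowedState) void {
        if (!self.released) @panic("BorrowedState dropped without release");
    }
};

/// Registers the two defers in contract order and returns the order in
/// which they ran: 'D' for deinit, 'A' for the assert.
fn runPairedDefers() [2]u8 {
    var order: [2]u8 = undefined;
    var next: usize = 0;
    {
        var state_borrow = BorrowedState{};
        // Registered FIRST, so it runs LAST and sees released == true.
        defer {
            state_borrow.assertReleasedOrPanic();
            order[next] = 'A';
            next += 1;
        }
        // Registered SECOND, so it runs FIRST and releases the borrow.
        defer {
            state_borrow.deinit();
            order[next] = 'D';
            next += 1;
        }
    }
    return order;
}

test "deinit runs before the sentinel assert" {
    const order = runPairedDefers();
    try std.testing.expect(order[0] == 'D');
    try std.testing.expect(order[1] == 'A');
}
```

Removing the second `defer` from this sketch reproduces the failure mode: the sentinel then runs against `released == false` and panics.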
zig-libp2p v0.1.42 ships zquic v1.6.17, which adds RFC 9000 §10.1.2 keepalive PINGs. zquic's previous `checkPto` only emitted PINGs when `bytes_in_flight > 0`; in zeam's gossipsub-heavy workload that meant no ACK-eliciting packet left the QUIC client between our own slot publishes while a rust-libp2p peer kept publishing, so the peer's idle timer silently expired and rust-libp2p closed the connection with an error-class reason ~44s after handshake. The bump fixes the recurring zeam <-> ethlambda drops observed in the local devnet. Also picks up the gossipsub IHAVE->IWANT handler, the GRAFT-during-backoff score penalty, and the autonat v2 amplification cost fix from zig-libp2p v0.1.41 (#192 upstream).
Pulls in two fixes for the recurring ~80s reason=error connection closes between zeam and ethlambda on stable mesh topics:
* App-layer keepalive on persistent /meshsub streams: a 20s empty-control gossipsub RPC heartbeat that keeps rust-libp2p's connection handler from idle-closing when our gossipsub layer has nothing else to say (transport quic_runtime fix).
* zquic v1.6.18: declare connection lost after 2x max_idle_timeout without ACKs, so detectOutboundConnectionClose evicts and redials even when CONNECTION_CLOSE itself is dropped (kernel UDP buffer overflow, NAT rebind, peer crash).
Pulls in zquic v1.7.0 via zig-libp2p v0.1.44, which buffers flow-control-blocked raw STREAM bytes instead of silently dropping them (RFC 9000 §4, §19.9, §19.13). This is the actual root-cause fix for the zeam ↔ ethlambda gossipsub stream wedge.

Before: rust-libp2p / quinn's default 128 KiB per-stream receive window was easily exceeded by a single 188 KB aggregation or 235 KB block; zquic would drop the tail of the frame and the raw-stream writer adapter would advance send_offset past the gap, permanently misaligning the receiver. All earlier mitigations (v1.6.17 keepalive PINGs, v1.6.18 connection-lost detection, zig-libp2p v0.1.43 app-layer gossip keepalive) addressed downstream symptoms.

zeam tests: all suites pass on the new dependency.
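A rough sketch of "buffer instead of drop" under stream flow control follows. It is a simplified model, not zquic's implementation: `SendStream` and its fields are hypothetical, payload bytes are tracked only as offsets, and 188,000 bytes stands in for the 188 KB aggregation.

```zig
const std = @import("std");

/// Hypothetical sender-side view of one stream; bytes are tracked as
/// absolute offsets only.
const SendStream = struct {
    /// Highest offset the peer currently allows (its receive window).
    window_limit: usize,
    /// End offset of everything the application has handed over.
    queued_end: usize = 0,
    /// End offset of everything actually put on the wire.
    sent_offset: usize = 0,

    /// The application writes `len` bytes; they are buffered, never dropped.
    fn write(self: *SendStream, len: usize) void {
        self.queued_end += len;
    }

    /// Sends what the window allows and returns the number of bytes sent.
    fn drain(self: *SendStream) usize {
        const sendable_end = @min(self.queued_end, self.window_limit);
        const n = sendable_end - self.sent_offset;
        self.sent_offset = sendable_end;
        return n;
    }

    /// Peer raised its limit (MAX_STREAM_DATA); limits never shrink.
    fn onMaxStreamData(self: *SendStream, new_limit: usize) void {
        self.window_limit = @max(self.window_limit, new_limit);
    }
};

test "bytes beyond the window wait instead of being dropped" {
    var s = SendStream{ .window_limit = 128 * 1024 };
    s.write(188_000); // larger than the 128 KiB window

    try std.testing.expect(s.drain() == 131_072); // window-sized prefix
    try std.testing.expect(s.drain() == 0); // blocked, tail stays buffered

    s.onMaxStreamData(256 * 1024);
    try std.testing.expect(s.drain() == 56_928); // remaining tail
    try std.testing.expect(s.sent_offset == 188_000); // no offset gap
}
```

Because `sent_offset` only ever moves over bytes that were really sent, the receiver never sees a hole in the stream.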
Pulls send_offset fix and zquic client CC gating so zeam→quinn gossip does not wedge on STREAM offset holes after congestion blocks.
v0.1.50 recorded zquic as a .path git URL, breaking Docker fetches.
Byte-granular pacing lets sub-MSS gossip frames drain instead of stalling behind an MSS token floor on the outbound client leg.
Coalesced pending STREAM entries now capped at one MTU, so the byte-granular pacer always has enough credit to drain head-of-queue under loopback bursts.
Brings in:
* zquic v1.7.9: `Server.sendRawStreamData` now returns `usize`
(bytes accepted). Previously the server-side stream send silently
dropped bytes when the pending-stream-send queue was exhausted;
the embedder then advanced its `send_offset` past those bytes and
the peer hung forever on the resulting STREAM gap.
* zig-libp2p v0.1.54: every server-side `sendRawStreamData` call
site now honors the accepted-bytes return so that refusals leave
the per-stream offset unchanged and the next tick can retry.
Combined with the earlier client-side pacer/coalesce fixes (zquic
1.7.7/1.7.8), this closes both halves of the asymmetric gossip
wedge between zeam (zquic) and ethlambda (quinn).
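The accepted-bytes contract can be sketched as below. This is a toy model under stated assumptions: `PendingQueue` and `StreamWriter` are hypothetical types, and this `sendRawStreamData` only mimics the "returns bytes accepted" shape described above rather than zquic's real signature.

```zig
const std = @import("std");

/// Hypothetical bounded queue standing in for the server's
/// pending-stream-send queue.
const PendingQueue = struct {
    free_bytes: usize,

    /// Accepts as many bytes as fit and returns that count.
    fn sendRawStreamData(self: *PendingQueue, data: []const u8) usize {
        const accepted = @min(data.len, self.free_bytes);
        self.free_bytes -= accepted;
        return accepted;
    }
};

/// Hypothetical embedder-side writer for one stream.
const StreamWriter = struct {
    send_offset: usize = 0,

    /// Offers everything past `send_offset`, then advances the offset only
    /// by what the queue actually accepted.
    fn flush(self: *StreamWriter, queue: *PendingQueue, payload: []const u8) void {
        const accepted = queue.sendRawStreamData(payload[self.send_offset..]);
        self.send_offset += accepted;
    }
};

test "refused bytes leave send_offset unchanged" {
    const payload = "0123456789";
    var queue = PendingQueue{ .free_bytes = 4 };
    var writer = StreamWriter{};

    writer.flush(&queue, payload); // 4 of 10 bytes accepted
    try std.testing.expect(writer.send_offset == 4);

    writer.flush(&queue, payload); // queue full: nothing accepted
    try std.testing.expect(writer.send_offset == 4);

    queue.free_bytes = 16; // queue drained; the retry resumes at offset 4
    writer.flush(&queue, payload);
    try std.testing.expect(writer.send_offset == payload.len);
}
```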
Pulls in zquic v1.7.10: pacer burst budget now scales with cwnd (`max(16 × MSS, cwnd / 8)`). On loopback / high-bdp devnets this clears a 200-frame gossipsub block in one drain pass instead of fragmenting it across ~12 ms of pacer refills, which was the remaining cause of pacer-stall bursts (`135 entries / 161 KB`) observed after the v1.7.9 server-side fix.
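As a worked instance of the `max(16 × MSS, cwnd / 8)` budget, the sketch below uses a hypothetical `pacerBurstBudget` helper; the 1200-byte MSS and the cwnd values are illustrative assumptions, not figures from zquic.

```zig
const std = @import("std");

/// Burst budget in bytes: the larger of a fixed 16-packet floor and
/// one eighth of the congestion window.
fn pacerBurstBudget(mss: u64, cwnd: u64) u64 {
    return @max(16 * mss, cwnd / 8);
}

test "budget is the floor for small cwnd and scales for large cwnd" {
    // Small window: 100_000 / 8 = 12_500 < 16 * 1200 = 19_200.
    try std.testing.expect(pacerBurstBudget(1200, 100_000) == 19_200);
    // Large window: 1_000_000 / 8 = 125_000 > 19_200.
    try std.testing.expect(pacerBurstBudget(1200, 1_000_000) == 125_000);
}
```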
zquic v1.7.10's cwnd-scaled pacer burst regressed goodput on loopback devnets: kernel UDP buffer drops -> RACK -> Cubic cut cwnd -> LD ring + pending queue overran -> embedder saw silent queue-full rejections. v1.7.11 reverts to the v1.7.9 16 x MSS burst cap, which on the devnet:
* eliminated all queue_full warnings
* restored zeam finalization to slot 60+
* restored ethlambda gossip-receive throughput
Picks up the prioritized + log.warn-visible "declare conn lost after 2× idle_timeout" branch in both `Client.checkPto` and `Server.checkPto`. Fixes the silent 10-minute wedge where a zeam outbound to ethlambda stayed pinned at `ld=2048/2048, cc_bif=2.5MB` after ethlambda's quinn evicted its end of the peer record — `detectOutboundConnectionClose` will now flip `draining` and `connection_manager` will redial within a minute instead of never.
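The "declare conn lost after 2× idle_timeout" rule reduces to a single comparison, sketched here with a hypothetical `connectionLost` helper. The millisecond units, the `>=` boundary, and the 30 s timeout in the test are assumptions for illustration; the real `checkPto` branches are not reproduced.

```zig
const std = @import("std");

/// True once no ACK has arrived for at least twice the idle timeout.
/// Saturating subtraction guards against a clock reading that is older
/// than the last ACK timestamp.
fn connectionLost(now_ms: u64, last_ack_ms: u64, max_idle_timeout_ms: u64) bool {
    return (now_ms -| last_ack_ms) >= 2 * max_idle_timeout_ms;
}

test "connection is declared lost at 2x the idle timeout" {
    const idle_ms = 30_000; // assumed timeout, for illustration
    try std.testing.expect(!connectionLost(59_999, 0, idle_ms));
    try std.testing.expect(connectionLost(60_000, 0, idle_ms));
}
```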
…shsub, zquic v1.7.17)
…v0.1.65

`cli/main.zig`: add a runtime `DEBUG_QUIC` gate for the noisy QUIC-stack log scopes (`.zquic`, `.quic_runtime`, `.quic_dcutr`, `.quic_relay`, `.connection_manager`, `.tls`). A small `quicAwareLogFn` installed via `std_options.logFn` consults a `quic_debug_enabled` atomic and drops `info`/`debug` messages from those scopes when it's false; `warn` and `err` always pass through so genuine problems remain visible. The atomic is flipped to true at startup when the `DEBUG_QUIC` env var is set to `1` / `true` / `yes` / `on` (matching `lean-quickstart`'s shell-side contract for ethlambda/quinn). Without this the recent backpressure work in zig-libp2p emitted tens of MB of `quic_runtime: persistent gossip outbox paused` lines per node on a sustained devnet run, dwarfing the actually-useful chain logs.

`build.zig.zon`: bump zig-libp2p 0.1.64 → 0.1.65, which pulls in zquic v1.7.19 and its three raw-stream retx alias-free guards (the latent UAF that surfaced as a jemalloc SIGSEGV on the previous wedged-cwnd devnet run). No source-side changes required for the bump.

Zig 0.16 removed `std.process.getEnvVarOwned`; the CLI binary already links libc (for the rust glue + rocksdb), so the env read uses `std.c.getenv` directly, the same pattern as `pkgs/xmss/src/shadow_cost.zig`.
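The filtering decision can be sketched as two pure predicates. This is an illustration, not the `quicAwareLogFn` from `cli/main.zig`: `Scope`, `isQuicScope`, `envEnablesQuicDebug`, and `shouldLog` are hypothetical names, the `chain` scope is added only for contrast, exact (case-sensitive) matching of the env value is an assumption, and the env read and atomic flag are left out.

```zig
const std = @import("std");

/// The noisy QUIC-stack scopes listed above, plus one hypothetical
/// non-QUIC scope (`chain`) for contrast.
const Scope = enum { zquic, quic_runtime, quic_dcutr, quic_relay, connection_manager, tls, chain };

fn isQuicScope(scope: Scope) bool {
    return switch (scope) {
        .zquic, .quic_runtime, .quic_dcutr, .quic_relay, .connection_manager, .tls => true,
        .chain => false,
    };
}

/// True for the DEBUG_QUIC values listed above; null means "unset".
fn envEnablesQuicDebug(value: ?[]const u8) bool {
    const v = value orelse return false;
    const accepted = [_][]const u8{ "1", "true", "yes", "on" };
    for (accepted) |candidate| {
        if (std.mem.eql(u8, v, candidate)) return true;
    }
    return false;
}

/// Drops info/debug from QUIC scopes unless the gate is open;
/// warn and err always pass.
fn shouldLog(level: std.log.Level, scope: Scope, quic_debug_enabled: bool) bool {
    if (!isQuicScope(scope)) return true;
    if (quic_debug_enabled) return true;
    return switch (level) {
        .err, .warn => true,
        .info, .debug => false,
    };
}

test "QUIC scopes are quiet by default but warnings survive" {
    try std.testing.expect(!shouldLog(.debug, .quic_runtime, false));
    try std.testing.expect(shouldLog(.warn, .quic_runtime, false));
    try std.testing.expect(shouldLog(.debug, .chain, false));
    try std.testing.expect(shouldLog(.info, .zquic, envEnablesQuicDebug("on")));
}
```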
Picks up the proactive persistent-gossip wedge timer in quic_runtime: the /meshsub outbox is now declared wedged after 20 s of fully-stuck backpressure (instead of waiting for zquic's 60 s no-ACK conn-lost timer), which drives the existing markPersistentGossipBroken -> closePeerConnectionForGossipRecovery -> connection_manager redial -> replaySubscribeToPeer recovery path 3x faster. Also promotes five recovery events from log.info to log.warn so they survive the QUIC-stack log filter zeam installs by default (previously the entire wedge -> close -> redial -> migrate sequence was invisible without DEBUG_QUIC=1). Fixes the asymmetric-gossip-induced finalization stall observed between zeam and ethlambda on local-devnet, where the FFG aggregator saw only 2/3 instead of 3/3 attestations and finalization stayed at genesis even though the head kept advancing to slot 150+.
Picks up the redial dedupe fix in quic_runtime.handleDial: the pre-dial check was using peerHasActiveConnection (inbound OR outbound) instead of outbound_by_peer.contains, so every redial submitted by connection_manager after the v0.1.66 wedge timer tore down the outbound was being silently no-op'd while the peer's inbound leg was still alive. That left gossip publish permanently dead to the peer even though req/resp on the inbound kept working -- exactly the observed local-devnet pattern where the wedge timer fired correctly but ethlambda saw zero fresh zeam attestations afterwards and stayed at Justified=1/Finalized=0 while zeam's pair self-justified to head 400+.
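The difference between the two pre-dial checks can be shown with a minimal sketch. `PeerLegs`, `skipDialBefore`, and `skipDialAfter` are hypothetical names that model `peerHasActiveConnection` and `outbound_by_peer.contains` as plain booleans; the real lookups in `quic_runtime.handleDial` are not reproduced.

```zig
const std = @import("std");

/// Hypothetical per-peer view: which connection legs are currently alive.
const PeerLegs = struct { inbound: bool, outbound: bool };

/// Pre-fix check: any live leg suppresses the dial.
fn skipDialBefore(legs: PeerLegs) bool {
    return legs.inbound or legs.outbound;
}

/// Post-fix check: only a live outbound leg suppresses the dial.
fn skipDialAfter(legs: PeerLegs) bool {
    return legs.outbound;
}

test "redial proceeds when only the inbound leg is alive" {
    // State right after the wedge timer tore down the outbound leg.
    const inbound_only = PeerLegs{ .inbound = true, .outbound = false };
    try std.testing.expect(skipDialBefore(inbound_only)); // redial was dropped
    try std.testing.expect(!skipDialAfter(inbound_only)); // redial now goes out

    // A live outbound leg still dedupes the dial.
    const both = PeerLegs{ .inbound = true, .outbound = true };
    try std.testing.expect(skipDialAfter(both));
}
```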
Pulls in zquic v1.7.20 which adds secp256r1 to the ClientHello supported_groups so QUIC dials to lantern (ngtcp2 + BoringSSL libp2p server with ECDSA-P256 cert) complete the handshake instead of being silently dropped at stalled_phase=initial.
Pulls in zquic v1.7.21: routes incoming QUIC Initials by client-chosen DCID (init_dcid) instead of peer address, so peers like lantern / c-lean-libp2p (ngtcp2) that retry the handshake with a fresh DCID get their Initial keys re-derived correctly instead of silently failing AEAD against the previous attempt's keys. Fixes zeam <-> lantern peering in the local devnet.
Picks up zquic v1.7.23 via zig-libp2p v0.1.71. The client now mints an 18-byte DCID for its first Initial so that ngtcp2 + AWS-LC servers (lantern) stop silently dropping our handshake. zeam<->zeam and zeam<->ethlambda paths were unaffected by the previous 8-byte DCID.
Summary

Replaces zeam's Rust `libp2p-glue` FFI stack with a pure-Zig networking path built on `zig-libp2p` v0.1.34 (transitively zquic v1.6.15). `zeam node beam`, simtest, and the Docker devnet entry point now use `EthLibp2pV2` over QUIC + libp2p TLS.

Net change: −~8,000 LOC of Rust glue removed; +~1,900 LOC of native Zig networking (`ethlibp2p_v2.zig`, `gossip_codec.zig`).

What changed

Networking (`pkgs/network/src/ethlibp2p_v2.zig`)

- QUIC transport via `zig-libp2p` `QuicRuntime`: listen/dial, multistream-select, gossipsub, req/resp.
- Host identity: secp256k1 private key from `--node-key` (lean-quickstart `.key` format), matching `eth-beacon-genesis` ENR-derived `/p2p/...` peer ids.
- In-memory TLS PEM (no cert files on disk; works in `FROM scratch` containers).
- Gossipsub: SSZ + snappy encode/decode, publish/subscribe with `#942` snappy hardening reused from `gossip_codec.zig`.
- Bootnodes promoted to gossipsub direct peers (`host.registerKnownPeer` and `host.addDirectPeer`) so they are always mesh-eligible and GRAFTed on the next gossipsub heartbeat instead of waiting for a SUBSCRIBE/GRAFT round trip.
- Metrics parity: `lean_gossip_mesh_peers`, `zeam_libp2p_swarm_command_dropped_total`.
- Gossip / SUBSCRIBE / mesh-status logs land on the `debug` channel (only `registered bootnode` stays at `info`).

Removed

- `rust/libp2p-glue/` (~5,700 LOC) and `pkgs/network/src/ethlibp2p.zig` (~2,400 LOC).
- `zeam-glue` libp2p feature flag from the Rust prover workspace (prover staticlibs unchanged).

Other zeam fixes on this branch

- `node/chain`: fix use-after-free in `produceBlock` / `finalizeProposalIfReady`. `wrapOwnedStateIntoRc` frees the `post_state` wrapper, but `forkChoice.onBlock` was still passed the dangling pointer (segfault at first block proposal under ReleaseSafe).
- `node`: publish blocks before the redundant local `onBlock`, fix mesh-peer metric registration, fix gossip-source attribution so the proposer-coverage report no longer reports `gossip=none`.
- `cli`, `build`: restore the mock simtest harness and params-scaled timeouts.
- `leanSpec`: bump submodule to `e8014f9` for devnet5 SSZ fixtures.
- Docker: exclude `**/target/` from build context (avoids copying multi-GB `rust/target/`).
- `AGENTS.md`: document the "commit and push when implementation work is complete" workflow.

Upstream pins (zig-libp2p v0.1.11 → v0.1.34)
Initial bring-up:

- … `CertificateNotYetValid` for rcgen/quinn certs (RFC 5280 UTCTime two-digit year)
- … `/meshsub/1.0.0` stream (#189)

Multistream / req-resp parity with rust-libp2p:

- … `/ipfs/id/1.0.0` + `/ipfs/ping/1.0.0` inbound for rust-libp2p
- … `\n`)
- … `'/' vs delimited framing` collision in initiator/responder negotiate
- … `/ipfs/id/push/1.0.0` from rust-libp2p (otherwise the connection RST'd)
- … `Client.raw_app_recv` (mirror of v0.1.24 in the raw-app layer)
- … `/meshsub/{1.0,1.1,1.2,1.3}.0` in the multistream responder (rust-libp2p offers 1.1 / 1.2)

Persistent-stream stability + protobuf forward-compat:

- … `MAX_SUBSTREAM_ATTEMPTS = 1`)
- … `/meshsub` stream per peer for the connection lifetime; mark `broken` on failure instead of opening a second stream
- … `CONNECTION_CLOSE` to `host.onConnectionClosed` so the connection_manager can redial with backoff
- … `SubOpts` protobuf fields as forward-compat (skip, not `UnsupportedWireType`); isolate inbound RPC decode phases so a subscription decode error no longer drops the embedded control/publish frames
- … `(scope, field_number, wire_type)` so we can file upstream issues for the fields rust-libp2p actually sends
- … `/meshsub` handshake into one reactor tick; eager-submit the first dial from `registerKnownPeer`; fan out `to=null` SUBSCRIBE/UNSUBSCRIBE broadcasts in `drainGossipsubOutbox` instead of dropping them

Devnet verification (local-docker, zeam ↔ ethlambda)
Verified on `lean-quickstart` `local-devnet` with `0xpartha/zeam:zig`:

- … `CertificateNotYetValid`
- … `/meshsub` stream
- `lean_gossip_mesh_peers > 0`, block + attestation propagation observed
- … `/ipfs/id/push`
- `host.onConnectionClosed` fires; connection_manager redials with backoff (v0.1.31)
- … `SubOpts` fields skipped + logged once instead of breaking the RPC (v0.1.32 / v0.1.33)

Known limitations (follow-up, not blockers for this PR)
- … `pkgs/node/src/chain.zig` (consensus layer, not transport). Tracked separately.

Test plan

- `zig build test --summary all`
- `zig build simtest`: three nodes over real QUIC (all-to-all mesh, blocks propagate)
- `EthLibp2pV2` unit tests (protocol mapping, peer events, bootnode parsing, QuicRuntime bring-up)
- … `/meshsub` stream survives cross-client GRAFT/PRUNE under sustained load