gossipsub/req-resp: rust-libp2p parity stability batch (flood-publish, backoff slack, O(1) IDONTWANT, prune decorrelation, inbound reaper)#277

Merged: ch4r10t33r merged 4 commits into main from fix/rust-parity-stability-batch on Jun 28, 2026

@ch4r10t33r (Collaborator)

Stability fixes from a systematic rust-libp2p gap analysis across the subsystems zeam uses. Each fix brings zig-libp2p to parity with rust-libp2p behavior; the whole batch was adversarially reviewed (two real bugs caught and fixed before merge).

Included

  • C2 flood-publish — originator delivers its own published message to ALL subscribed peers (rust default), not just the current mesh → decouples first-hop attestation coverage from mesh convergence (a sub-quorum cause).
  • C1 backoff slack — pruned peer stays GRAFT-ineligible 1 heartbeat past exact expiry (rust is_backoff_with_slack) so both ends don't re-graft in lockstep → kills the synchronized GRAFT/PRUNE flap on dense topics. (Correct narrow form; not the previously-reverted subscriber bypass.)
  • H5 decorrelated prune — per-node salted prune tie-break so equal-scored peers aren't pruned in identical byte order network-wide (was correlated → fed the flap).
  • P1 O(1) IDONTWANT — was an O(n) ArrayList scanned per-forward × per-mesh-peer (the single-owner-thread ceiling); now a keyed HashMap.
  • P4 per-RPC publish-count cap (unbounded decode → bounded).
  • R1 inbound req/resp reaper — stale inbound streams leaked raw-app slots forever → node silently stopped serving sync; now reaped (gated to req/resp + pre-handshake so persistent gossip/relay streams are never touched).

Adversarial review (done, two bugs fixed)

  • R1 was initially broken (reaped persistent /meshsub gossip every 30s) → gated to req/resp protocol indices only.
  • C1 was initially inert (pruneBackoff GC at exact expiry defeated the slack) → fixed + slack-boundary test added.
  • C2/H5/P1/P4 confirmed memory-safe and correct.

Deferred (own focused PRs — documented, not forgotten)

  • H4 (mesh_outbound_min/direction floor) + M8 (opportunistic rotation): need connection direction threaded into gossipsub independent of the scoring flag.
  • M6: subsumed by C2.
  • P2 (gossip promises) + P3 (per-heartbeat IHAVE caps): adversarial-only.
  • R2/R3/N1: need zquic reset-state / close-reason APIs.
  • N3 (outbound-leg reconciler) + pull_fifo/recent_seen O(1): follow P1.
  • N2: keep conn-limits null.

Build clean; 504/506 tests. Pure zig-libp2p.

… (C1), decorrelated prune (H5)

C2: flood_publish (rust default on) — originator delivers its own message to
ALL subscribed peers (remote_interest), not just the current mesh, so a
validator's attestation reaches every subnet member regardless of mesh
convergence state. Decouples first-hop coverage from mesh size.
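
A minimal sketch of the recipient choice described above, in Rust for comparison with rust-libp2p. The function name `publish_targets`, the `u32` peer id, and the set types are illustrative assumptions, not the zig-libp2p API; score thresholds and explicit peers are left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical peer identifier (real peer ids are multihash bytes).
type PeerId = u32;

/// Choose recipients for a message that this node originates.
/// With flood_publish on, every peer known to be subscribed to the topic
/// receives it; with it off, only the current mesh does.
fn publish_targets(
    flood_publish: bool,
    subscribed: &BTreeSet<PeerId>,
    mesh: &BTreeSet<PeerId>,
) -> BTreeSet<PeerId> {
    if flood_publish {
        subscribed.clone()
    } else {
        mesh.clone()
    }
}
```

The point of the sketch is that the first hop no longer depends on how many peers have been grafted so far; forwarding of messages received from others is unaffected.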

C1: backoff_slack_heartbeats (rust is_backoff_with_slack, default 1) — a
pruned peer stays GRAFT-ineligible 1 heartbeat past exact expiry so both ends
don't re-graft in lockstep -> re-collide -> re-prune (synchronized flap on
dense topics). Slack applies to GRAFT-SEND only; GRAFT-ACCEPT stays exact.

H5: per-node salted prune tie-break — equal-scored peers (all 0 with scoring
off) were pruned in raw peer-id byte order, identical on every node ->
network-wide correlated pruning. Salt with our own peer id to decorrelate.
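
One way to picture the salted tie-break is the sketch below. It is an assumption-laden illustration: FNV-1a stands in for whatever hash the real code uses, scores are integers for simplicity, and `salted_key` / `prune_order` are hypothetical names.

```rust
/// FNV-1a over the salt followed by the peer id. A stand-in hash chosen
/// only because it is short and deterministic.
fn salted_key(salt: &[u8], peer_id: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in salt.iter().chain(peer_id.iter()) {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Order prune candidates: lowest score first. Ties are broken by the
/// salted hash rather than by raw peer-id bytes, with the raw bytes as a
/// last resort so the order is total.
fn prune_order(salt: &[u8], mut peers: Vec<(Vec<u8>, i64)>) -> Vec<Vec<u8>> {
    peers.sort_by(|a, b| {
        a.1.cmp(&b.1)
            .then_with(|| salted_key(salt, &a.0).cmp(&salted_key(salt, &b.0)))
            .then_with(|| a.0.cmp(&b.0))
    });
    peers.into_iter().map(|(id, _)| id).collect()
}
```

Passing the local peer id as `salt` keeps the order stable on one node while making it unlikely that two nodes rank the same equal-scored peers identically.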

H3 (inbound GRAFT mesh_n_high ceiling) intentionally NOT included: it would
obsolete the still-valid heartbeat-prune path + the flap it targets is fixed
by C1. 503/505 tests.

P1: IDONTWANT was an O(n) ArrayList scanned (and mutated) per-forward ×
per-mesh-peer over up to max_idontwant_entries (16384) — a direct multiplier
on the single gossipsub-owner-thread ceiling that has recurred as ACK-drop
churn under attestation load (worsened by flood-publish fanning to more
peers). Converted to a keyed HashMap<{peer,id}, expires_ms>: O(1)
peerWantsNotPublish (self-sweeping on lookup) + O(1) rememberIDontWant; bounded
via pruneIDontWant in the heartbeat sweep + cap-evict. rust uses hashed
structures throughout.
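
A rough Rust sketch of the keyed table described above. The type and method names are illustrative, and the eviction policy when the cap is hit (sweep expired entries, then drop one arbitrary entry) is an assumption; the commit message only says "cap-evict".

```rust
use std::collections::HashMap;

type PeerId = u32; // hypothetical; real peer ids are byte strings

/// Keyed IDONTWANT table: (peer, message id) -> expires_ms.
struct IDontWant {
    entries: HashMap<(PeerId, Vec<u8>), u64>,
    cap: usize,
}

impl IDontWant {
    fn new(cap: usize) -> Self {
        IDontWant { entries: HashMap::new(), cap }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Heartbeat sweep: drop every expired entry.
    fn prune(&mut self, now_ms: u64) {
        self.entries.retain(|_, exp| *exp > now_ms);
    }

    /// Insert in O(1) on the common path. At the cap, sweep first and,
    /// if still full, evict one arbitrary entry (assumed policy).
    fn remember(&mut self, peer: PeerId, id: Vec<u8>, now_ms: u64, ttl_ms: u64) {
        let key = (peer, id);
        if self.entries.len() >= self.cap && !self.entries.contains_key(&key) {
            self.prune(now_ms);
            if self.entries.len() >= self.cap {
                let victim = self.entries.keys().next().cloned();
                if let Some(k) = victim {
                    self.entries.remove(&k);
                }
            }
        }
        self.entries.insert(key, now_ms + ttl_ms);
    }

    /// O(1) lookup that removes the entry if it has expired.
    fn peer_does_not_want(&mut self, peer: PeerId, id: &[u8], now_ms: u64) -> bool {
        let key = (peer, id.to_vec());
        match self.entries.get(&key).copied() {
            Some(exp) if exp > now_ms => true,
            Some(_) => {
                self.entries.remove(&key);
                false
            }
            None => false,
        }
    }
}
```

The per-forward cost becomes one hash lookup per mesh peer instead of a scan over up to `max_idontwant_entries` items.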

P4: decodePublishes had no per-RPC message-count cap (one 128 MiB frame could
decode ~3M tiny publishes → unbounded alloc + owner-thread forward loop).
Capped at max_publishes_per_rpc (8192), rejecting absurd frames.
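
The shape of the cap can be shown with a toy decoder. The framing here (one length byte per message) is a made-up stand-in for the real protobuf RPC, and `decode_publishes` / `DecodeError` are hypothetical names; only the count check mirrors the change.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    TooManyPublishes,
    Truncated,
}

/// Decode length-prefixed messages (toy framing: one length byte each),
/// rejecting the whole frame once it would exceed `max_publishes`.
fn decode_publishes(mut buf: &[u8], max_publishes: usize) -> Result<Vec<Vec<u8>>, DecodeError> {
    let mut out = Vec::new();
    while let Some((&len, rest)) = buf.split_first() {
        // Check the count before allocating for the next message.
        if out.len() == max_publishes {
            return Err(DecodeError::TooManyPublishes);
        }
        let len = len as usize;
        if rest.len() < len {
            return Err(DecodeError::Truncated);
        }
        out.push(rest[..len].to_vec());
        buf = &rest[len..];
    }
    Ok(out)
}
```

Rejecting the frame, rather than truncating it at the cap, matches the "rejecting absurd frames" wording above and avoids forwarding a partial RPC.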

503/505 tests.
…t silently stops sync serving

Inbound req/resp streams had NO timeout and NO age reap (unlike the outbound
OutboundRequest.deadline_ms reaper). A peer that opens a /blocks_by_range or
/status stream, completes multistream-select, then stalls (no request body,
never FINs) pinned the InboundStream + its zquic raw-app slot FOREVER. Across
peers this leaks toward the 64-slot raw_app_streams table -> RawAppStreamSlotsFull
-> the node silently stops answering peers' sync requests (slow mesh-wide
finality degradation).

Fix: InboundStream.created_ms + inbound_request_reap_ms (30s, 2x the outbound
reaper so a slow legitimate response is never falsely reaped); advanceInboundStreams
reaps any not-yet-response_fin_sent stream past the deadline via the existing
removeInboundStreamAt (releases the raw slot + frees the stream). 503/505 tests.
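
A simplified sketch of the age check alone, with hypothetical Rust names mirroring the fields mentioned above. It deliberately omits slot release and the protocol gating added later in this PR, so on its own it would also reap long-lived streams.

```rust
/// Minimal stand-in for the real InboundStream.
struct InboundStream {
    created_ms: u64,
    response_fin_sent: bool,
}

/// 30 s, per the commit message (twice the outbound reaper's deadline).
const INBOUND_REQUEST_REAP_MS: u64 = 30_000;

/// Drop every stream that has not finished its response and is at least
/// INBOUND_REQUEST_REAP_MS old. Returns how many streams were reaped.
fn reap_stale(streams: &mut Vec<InboundStream>, now_ms: u64) -> usize {
    let before = streams.len();
    streams.retain(|s| {
        s.response_fin_sent
            || now_ms.saturating_sub(s.created_ms) < INBOUND_REQUEST_REAP_MS
    });
    before - streams.len()
}
```

In the real code each reaped stream also releases its raw-app slot, which is what stops the leak toward `RawAppStreamSlotsFull`.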

R2/R3/R4 (surface RESET_STREAM, cap concurrent outbound streams, bound
resp_acc) deferred: R4 is already lifetime-bounded by the outbound reaper; R2
needs a zquic reset-state API; R3 needs an app-level stream-budget accountant —
each its own focused change.
…koff slack effective (C1)

Adversarial review of the batch found two real bugs:

R1 (CRITICAL): the age-based inbound reaper did not discriminate stream type.
Persistent /meshsub gossip (protocol_index 0..3) + relay-hop/stop + dcutr
streams NEVER set response_fin_sent, so the 30s reaper tore EVERY inbound
gossip stream down every 30s (then it re-surfaced and failed multistream-select
against mid-stream bytes) — far worse than the slot leak it fixes. Gated the
reaper to req/resp protocol indices (proto_meshsub_last_index < p <
proto_relay_hop) and pre-handshake stalls (protocol_index == null) only.
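
The gate can be written as a small predicate. The constant values below are assumptions chosen for illustration (the commit message only fixes meshsub at indices 0..3); the real protocol table may differ.

```rust
/// Assumed layout: 0..=3 are /meshsub variants, req/resp protocols follow,
/// and relay-hop starts at PROTO_RELAY_HOP. The value 10 is hypothetical.
const PROTO_MESHSUB_LAST_INDEX: usize = 3;
const PROTO_RELAY_HOP: usize = 10;

/// True if the age-based reaper is allowed to consider this stream.
fn reaper_may_touch(protocol_index: Option<usize>) -> bool {
    match protocol_index {
        // Stalled before multistream-select finished: safe to reap.
        None => true,
        // Only req/resp indices, strictly between meshsub and relay-hop.
        Some(p) => PROTO_MESHSUB_LAST_INDEX < p && p < PROTO_RELAY_HOP,
    }
}
```

Persistent gossip, relay, and dcutr streams fall outside the open interval, so their missing `response_fin_sent` flag no longer makes them look stale.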

C1 (inert): pruneBackoff() GC'd backoff entries at EXACT expiry before the
heartbeat's graft selection, so the slack window was never observable — the
GRAFT/PRUNE-flap damping was dead code. pruneBackoff now GCs at expires+slack.
Added a test probing the slack boundary (was missing; the existing test jumped
past it).
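
Why GC at exact expiry defeated the slack can be seen in a small model where "backed off" simply means "still present in the map". The names and the map layout are hypothetical, not the zig-libp2p structures.

```rust
use std::collections::HashMap;

/// GC backoff entries only once expiry plus slack has passed.
/// Using `now_ms < *expires_ms` here instead would delete the entry at
/// exact expiry and make the slack window unobservable.
fn prune_backoff(backoff: &mut HashMap<u32, u64>, now_ms: u64, slack_ms: u64) {
    backoff.retain(|_, expires_ms| now_ms < *expires_ms + slack_ms);
}

/// Heartbeat order: GC first, then graft selection consults the map.
fn graft_send_allowed(
    backoff: &mut HashMap<u32, u64>,
    peer: u32,
    now_ms: u64,
    slack_ms: u64,
) -> bool {
    prune_backoff(backoff, now_ms, slack_ms);
    !backoff.contains_key(&peer)
}
```

A boundary test in this model checks three instants: exactly at expiry (still blocked), one millisecond before expiry plus slack (still blocked), and at expiry plus slack (allowed).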

Review confirmed C2/H5/P1/P4 sound. 504/506 tests.
@ch4r10t33r merged commit a62d52f into main on Jun 28, 2026
3 of 4 checks passed