gossipsub/req-resp: rust-libp2p parity stability batch (flood-publish, backoff slack, O(1) IDONTWANT, prune decorrelation, inbound reaper) #277
Merged
… (C1), decorrelated prune (H5)

C2: flood_publish (rust default on). The originator delivers its own message to ALL subscribed peers (remote_interest), not just the current mesh, so a validator's attestation reaches every subnet member regardless of mesh convergence state. This decouples first-hop coverage from mesh size.
C1: backoff_slack_heartbeats (rust is_backoff_with_slack, default 1). A pruned peer stays GRAFT-ineligible for 1 heartbeat past exact expiry so both ends don't re-graft in lockstep -> re-collide -> re-prune (synchronized flap on dense topics). Slack applies to GRAFT-SEND only; GRAFT-ACCEPT stays exact.
H5: per-node salted prune tie-break. Equal-scored peers (all 0 with scoring off) were pruned in raw peer-id byte order, identical on every node -> network-wide correlated pruning. Salt with our own peer id to decorrelate.
H3 (inbound GRAFT mesh_n_high ceiling) intentionally NOT included: it would obsolete the still-valid heartbeat-prune path, and the flap it targets is fixed by C1.
503/505 tests.
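The H5 tie-break can be illustrated with a minimal Rust sketch (the change itself lives in Zig code that is not shown here). The `Peer` struct, `salted_key`, `prune_order`, and the use of `DefaultHasher` are all assumptions made for illustration; the PR only states that the tie-break is salted with the node's own peer id.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical peer record: raw peer-id bytes plus a score.
#[derive(Clone, Debug, PartialEq)]
pub struct Peer {
    pub id: Vec<u8>,
    pub score: i64,
}

/// Tie-break key that depends on BOTH our own id and the candidate's id,
/// so two nodes generally rank the same equal-scored peers differently.
/// (Assumption: any stable hash works; DefaultHasher is used for brevity.)
fn salted_key(local_id: &[u8], peer_id: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    local_id.hash(&mut h);
    peer_id.hash(&mut h);
    h.finish()
}

/// Returns peers in prune order: lowest score first, ties broken by the
/// per-node salted key instead of raw peer-id byte order.
pub fn prune_order(local_id: &[u8], mut peers: Vec<Peer>) -> Vec<Peer> {
    peers.sort_by_key(|p| (p.score, salted_key(local_id, &p.id)));
    peers
}
```

With scoring off, every peer has score 0, so the salted key alone decides the order. Each node gets a stable order of its own, and different nodes are very unlikely to share it.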
P1: IDONTWANT was an O(n) ArrayList scanned (and mutated) per-forward ×
per-mesh-peer over up to max_idontwant_entries (16384) — a direct multiplier
on the single gossipsub-owner-thread ceiling that has recurred as ACK-drop
churn under attestation load (worsened by flood-publish fanning to more
peers). Converted to a keyed HashMap<{peer,id}, expires_ms>: O(1)
peerWantsNotPublish (self-sweeping on lookup) + O(1) rememberIDontWant; bounded
via pruneIDontWant in the heartbeat sweep + cap-evict. rust uses hashed
structures throughout.
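A minimal Rust sketch of the keyed-map shape described in P1, written only to illustrate the idea. The type name `IDontWant`, the method names, and the evict-one-arbitrary-entry policy at the cap are assumptions; the PR names `peerWantsNotPublish`, `rememberIDontWant`, and `pruneIDontWant` but does not show their bodies.

```rust
use std::collections::HashMap;

/// Hypothetical key: (peer id, message id), both as raw bytes.
type Key = (Vec<u8>, Vec<u8>);

pub struct IDontWant {
    entries: HashMap<Key, u64>, // value = expiry time in ms
    cap: usize,
}

impl IDontWant {
    pub fn new(cap: usize) -> Self {
        Self { entries: HashMap::new(), cap }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Expected O(1) insert. At the cap, one arbitrary entry is evicted
    /// (assumed policy; the real eviction rule is not stated in the PR).
    pub fn remember(&mut self, peer: &[u8], msg: &[u8], expires_ms: u64) {
        let key = (peer.to_vec(), msg.to_vec());
        if self.entries.len() >= self.cap && !self.entries.contains_key(&key) {
            if let Some(victim) = self.entries.keys().next().cloned() {
                self.entries.remove(&victim);
            }
        }
        self.entries.insert(key, expires_ms);
    }

    /// Expected O(1) lookup that removes the entry if it has expired.
    pub fn peer_wants_not(&mut self, peer: &[u8], msg: &[u8], now_ms: u64) -> bool {
        let key = (peer.to_vec(), msg.to_vec());
        match self.entries.get(&key).copied() {
            Some(exp) if exp > now_ms => true,
            Some(_) => {
                self.entries.remove(&key);
                false
            }
            None => false,
        }
    }

    /// Heartbeat sweep: drop every expired entry in one pass.
    pub fn prune(&mut self, now_ms: u64) {
        self.entries.retain(|_, exp| *exp > now_ms);
    }
}
```

The per-forward check becomes a single hash lookup per mesh peer instead of a list scan, and the heartbeat sweep plus the cap keep the map bounded.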
P4: decodePublishes had no per-RPC message-count cap (one 128 MiB frame could
decode ~3M tiny publishes → unbounded alloc + owner-thread forward loop).
Capped at max_publishes_per_rpc (8192), rejecting absurd frames.
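The cap can be sketched in Rust as a count check inside the decode loop. The wire format below (a 1-byte length prefix per publish) is a toy stand-in, not the real protobuf framing, and `decode_publishes` and `DecodeError` are hypothetical names; only the 8192 limit comes from the PR.

```rust
/// Limit taken from the PR text (max_publishes_per_rpc).
pub const MAX_PUBLISHES_PER_RPC: usize = 8192;

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    TooManyPublishes,
    Truncated,
}

/// Toy framing (assumption): each publish is one length byte followed by
/// that many payload bytes. Decoding stops with an error as soon as the
/// frame would yield more than `max` publishes, so allocation and the
/// later forward loop are both bounded by `max`.
pub fn decode_publishes(frame: &[u8], max: usize) -> Result<Vec<&[u8]>, DecodeError> {
    let mut out = Vec::new();
    let mut rest = frame;
    while let Some((&len, tail)) = rest.split_first() {
        if out.len() == max {
            return Err(DecodeError::TooManyPublishes);
        }
        let len = len as usize;
        if tail.len() < len {
            return Err(DecodeError::Truncated);
        }
        let (msg, next) = tail.split_at(len);
        out.push(msg);
        rest = next;
    }
    Ok(out)
}
```

A caller would pass `MAX_PUBLISHES_PER_RPC` as `max` and drop the whole RPC on `TooManyPublishes` rather than forwarding a partial batch.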
503/505 tests.
…t silently stops sync serving

Inbound req/resp streams had NO timeout and NO age reap (unlike the outbound OutboundRequest.deadline_ms reaper). A peer that opens a /blocks_by_range or /status stream, completes multistream-select, then stalls (no request body, never FINs) pinned the InboundStream and its zquic raw-app slot FOREVER. Across peers this leaks toward the 64-slot raw_app_streams table -> RawAppStreamSlotsFull -> the node silently stops answering peers' sync requests (slow mesh-wide finality degradation).
Fix: InboundStream.created_ms + inbound_request_reap_ms (30s, 2x the outbound reaper so a slow legitimate response is never falsely reaped); advanceInboundStreams reaps any not-yet-response_fin_sent stream past the deadline via the existing removeInboundStreamAt (releases the raw slot + frees the stream).
503/505 tests.
R2/R3/R4 (surface RESET_STREAM, cap concurrent outbound streams, bound resp_acc) deferred: R4 is already lifetime-bounded by the outbound reaper; R2 needs a zquic reset-state API; R3 needs an app-level stream-budget accountant. Each is its own focused change.
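A rough Rust sketch of an age-based inbound reaper of this kind. The `Kind` enum is a hypothetical simplification of the protocol-index check, and it already includes the stream-type gate that the later review commit in this PR adds; `created_ms`, `response_fin_sent`, and the 30s value come from the PR, the rest is assumed.

```rust
/// 30s, per the PR (2x the outbound reaper).
pub const INBOUND_REQUEST_REAP_MS: u64 = 30_000;

/// Hypothetical stand-in for the protocol-index ranges in the real code.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Kind {
    PreHandshake, // multistream-select not finished yet
    ReqResp,      // e.g. /status, /blocks_by_range
    Persistent,   // e.g. /meshsub, relay, dcutr: never reaped by age
}

#[derive(Debug)]
pub struct InboundStream {
    pub id: u32,
    pub kind: Kind,
    pub created_ms: u64,
    pub response_fin_sent: bool,
}

/// A stream is reaped only if it is a request/response (or pre-handshake)
/// stream, has not finished its response, and is older than the deadline.
fn should_reap(s: &InboundStream, now_ms: u64) -> bool {
    let reapable_kind = matches!(s.kind, Kind::PreHandshake | Kind::ReqResp);
    let age_ms = now_ms.saturating_sub(s.created_ms);
    reapable_kind && !s.response_fin_sent && age_ms > INBOUND_REQUEST_REAP_MS
}

/// Removes reapable streams in place and returns their ids, so the caller
/// can release the matching transport slots.
pub fn reap_inbound(streams: &mut Vec<InboundStream>, now_ms: u64) -> Vec<u32> {
    let mut reaped = Vec::new();
    streams.retain(|s| {
        let dead = should_reap(s, now_ms);
        if dead {
            reaped.push(s.id);
        }
        !dead
    });
    reaped
}
```

Calling `reap_inbound` once per tick is enough for this sketch. A stalled req/resp stream is dropped after 30s, while persistent gossip streams are left alone.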
…koff slack effective (C1)

Adversarial review of the batch found two real bugs:
R1 (CRITICAL): the age-based inbound reaper did not discriminate stream type. Persistent /meshsub gossip (protocol_index 0..3), relay-hop/stop, and dcutr streams NEVER set response_fin_sent, so the 30s reaper tore EVERY inbound gossip stream down every 30s (each then re-surfaced and failed multistream-select against mid-stream bytes), far worse than the slot leak it fixes. Gated the reaper to req/resp protocol indices (proto_meshsub_last_index < p < proto_relay_hop) and pre-handshake stalls (protocol_index == null) only.
C1 (inert): pruneBackoff() GC'd backoff entries at EXACT expiry before the heartbeat's graft selection, so the slack window was never observable and the GRAFT/PRUNE-flap damping was dead code. pruneBackoff now GCs at expires+slack. Added a test probing the slack boundary (was missing; the existing test jumped past it).
Review confirmed C2/H5/P1/P4 sound.
504/506 tests.
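The C1 interaction between the slack check and the GC can be sketched in Rust as follows. `Backoff` and its method names are hypothetical, and the 700 ms heartbeat in the usage note is an assumed value. What the sketch takes from the PR is the rule: GRAFT-ACCEPT uses the exact expiry, GRAFT-SEND uses expiry plus slack, and entries are only collected once expiry plus slack has passed.

```rust
use std::collections::HashMap;

pub struct Backoff {
    expires_ms: HashMap<Vec<u8>, u64>, // peer id -> exact backoff expiry
    slack_ms: u64,                     // slack heartbeats * heartbeat interval
}

impl Backoff {
    pub fn new(heartbeat_ms: u64, slack_heartbeats: u64) -> Self {
        Self {
            expires_ms: HashMap::new(),
            slack_ms: heartbeat_ms * slack_heartbeats,
        }
    }

    pub fn len(&self) -> usize {
        self.expires_ms.len()
    }

    pub fn set(&mut self, peer: &[u8], expires_ms: u64) {
        self.expires_ms.insert(peer.to_vec(), expires_ms);
    }

    /// GRAFT-ACCEPT check: exact expiry, no slack.
    pub fn is_backed_off(&self, peer: &[u8], now_ms: u64) -> bool {
        matches!(self.expires_ms.get(peer), Some(&e) if now_ms < e)
    }

    /// GRAFT-SEND check: the peer stays ineligible until expiry + slack.
    pub fn is_backed_off_with_slack(&self, peer: &[u8], now_ms: u64) -> bool {
        let slack = self.slack_ms;
        matches!(self.expires_ms.get(peer), Some(&e) if now_ms < e.saturating_add(slack))
    }

    /// GC must keep entries through the slack window; collecting at exact
    /// expiry would make the slack check above unobservable.
    pub fn gc(&mut self, now_ms: u64) {
        let slack = self.slack_ms;
        self.expires_ms.retain(|_, e| now_ms < e.saturating_add(slack));
    }
}
```

With a 700 ms heartbeat and one heartbeat of slack, a backoff that expires at t = 10 000 ms still blocks GRAFT-SEND until t = 10 700 ms, even if `gc` runs at t = 10 000 ms.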
Stability fixes from a systematic rust-libp2p gap analysis across the subsystems zeam uses. Each is rust-parity; the whole batch was adversarially reviewed (two real bugs caught + fixed before merge).
Included
… is_backoff_with_slack) so both ends don't re-graft in lockstep → kills the synchronized GRAFT/PRUNE flap on dense topics. (Correct narrow form; not the previously-reverted subscriber bypass.)
Adversarial review (done, two bugs fixed)
… /meshsub gossip every 30s) → gated to req/resp protocol indices only.
… pruneBackoff GC at exact expiry defeated the slack) → fixed + slack-boundary test added.
Deferred (own focused PRs: documented, not forgotten)
H4 (mesh_outbound_min/direction floor) + M8 (opportunistic rotation) need connection-direction threaded into gossipsub independent of the scoring flag; M6 subsumed by C2; P2 (gossip promises) + P3 (per-heartbeat IHAVE caps) are adversarial-only; R2/R3/N1 need zquic reset-state / close-reason APIs; N3 (outbound-leg reconciler) + pull_fifo/recent_seen O(1) follow P1. Keep conn-limits null (N2).
Build clean; 504/506 tests. Pure zig-libp2p.