gossipsub: gate flood-publish by message size — fix drive saturation + outbox overflow (coverage ceiling)#279

Merged
ch4r10t33r merged 1 commit into main from fix/flood-publish-size-gate on Jun 28, 2026

@ch4r10t33r

Collaborator

After v0.2.52 lifted coverage to ~28 (no more total collapse), the residual +0/-N decay (late=none) was traced to drive-loop saturation; the churn is a downstream effect, not the cause. The live steady-state log shows SLOW drive iter ... inbound_streams=134ms together with persistent hits on the gossip outbox caps (priority 1024, bulk 64), which drop frames.
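
To make the outbox symptom concrete, here is a minimal sketch of a capped two-lane outbox. Only the two cap values (1024 priority, 64 bulk) come from the log above; the `LaneOutbox` class, the drop-newest policy, and the frame names are hypothetical and are not taken from zig-libp2p.

```python
from collections import deque


class LaneOutbox:
    """Bounded FIFO lane. ASSUMPTION: a full lane rejects (drops) the new frame."""

    def __init__(self, cap):
        self.cap = cap
        self.frames = deque()
        self.dropped = 0

    def push(self, frame):
        # Once the lane is at its cap, count the frame as dropped.
        if len(self.frames) >= self.cap:
            self.dropped += 1
            return False
        self.frames.append(frame)
        return True


priority = LaneOutbox(cap=1024)  # cap value from the log line
bulk = LaneOutbox(cap=64)        # cap value from the log line

# Offer 70 frames to the 64-slot bulk lane without draining it.
for i in range(70):
    bulk.push(f"block-frame-{i}")

print(len(bulk.frames), bulk.dropped)  # 64 6
```

If the drive loop is too slow to drain a lane, every frame beyond the cap is lost in this model, which is the shape of the drops reported above.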

Root cause

v0.2.51 applied flood_publish to every topic with no gating. On the dense block topic (~31 subscribers, ~3 MiB blocks) the proposer fanned its block out to all 31 peers (≈4× mesh_n) on the 64-slot bulk lane → bulk cap hit (block drops) + ~4× inbound fan-in → advanceInboundStreams 134ms → ~200ms iterations → ACK starvation → no-ACK teardown churn. Flooding the dense/large block topic is net-harmful: blocks propagate fine via the mesh, and coverage is about small attestations, so flooding blocks adds pure volume for zero coverage benefit.
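
The volume argument can be checked with back-of-the-envelope arithmetic. The block size and subscriber count are the approximate figures quoted above; `mesh_n = 8` is an assumption inferred from the "≈4× mesh_n" ratio, not a value stated in this PR.

```python
MIB = 1024 * 1024

block_bytes = 3 * MIB  # "~3 MiB blocks"
subscribers = 31       # "~31 subscribers"
mesh_n = 8             # ASSUMPTION: inferred from "31 peers (≈4× mesh_n)"

# Bytes the proposer sends for one block under each publish strategy.
flood_bytes = subscribers * block_bytes
mesh_bytes = mesh_n * block_bytes
ratio = flood_bytes / mesh_bytes

print(flood_bytes // MIB)  # 93
print(mesh_bytes // MIB)   # 24
print(f"{ratio:.1f}")      # 3.9
```

Under these assumptions, flooding one block costs the proposer roughly 93 MiB of egress versus 24 MiB for mesh-only publishing.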

Fix

Gate flood_publish to messages ≤ flood_publish_max_message_bytes (128 KiB, configurable). Small attestations/aggregations still flood (the intended first-hop-coverage win that took coverage 8→28); large blocks fall back to the proven mesh-only forward path (pre-flood behavior). The change is a single condition selecting between two already-reviewed code paths, so risk is low. It cuts the dominant volume, relieving both the outbox overflow and the 134ms saturation; two independent live investigations confirmed this as the dominant lever.
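
The gate described above can be sketched as follows. This is an illustration, not the zig-libp2p code: only the names `flood_publish` and `flood_publish_max_message_bytes` and the 128 KiB default come from this PR. The `GossipsubConfig` class, the `publish_targets` function, the 2 KiB attestation size, and the mesh size of 8 are hypothetical.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass(frozen=True)
class GossipsubConfig:
    # Hypothetical config holder; field names follow the PR description.
    flood_publish: bool = True
    flood_publish_max_message_bytes: int = 128 * KIB


def publish_targets(cfg, message_len, topic_subscribers, mesh_peers):
    """Pick recipients for a locally published message."""
    # Small messages flood to every known subscriber of the topic.
    if cfg.flood_publish and message_len <= cfg.flood_publish_max_message_bytes:
        return list(topic_subscribers)
    # Large messages use the mesh-only path (the pre-flood behavior).
    return list(mesh_peers)


subscribers = [f"peer{i}" for i in range(31)]
mesh = subscribers[:8]  # ASSUMPTION: mesh of 8 peers
cfg = GossipsubConfig()

attestation_targets = publish_targets(cfg, 2 * KIB, subscribers, mesh)  # assumed size
block_targets = publish_targets(cfg, 3 * KIB * KIB, subscribers, mesh)  # ~3 MiB block

print(len(attestation_targets), len(block_targets))  # 31 8
```

The comparison is inclusive, matching the "≤" in the description: in this sketch a message of exactly 128 KiB still floods, and one byte more takes the mesh path.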

Build clean; 504/506 tests. Pure zig-libp2p.

…tion + outbox overflow

After v0.2.52 (which fixed the dominant non-delivery path, lifting coverage to
~28), the residual +0/-N decay (late=none) traced to DRIVE-LOOP SATURATION, not
churn (the churn is downstream). Live steady-state log: SLOW drive iter
inbound_streams=120-134ms + persistent gossip priority(1024)/bulk(64) outbox
caps hit (dropping attestation/block frames).

ROOT: v0.2.51 flood_publish applied to EVERY topic with no gating. On the dense
block topic (~31 subscribers, ~3 MiB blocks) the proposer fanned its block to
all 31 peers (~4x mesh_n) on the 64-slot bulk lane → bulk cap hit (block drops)
+ ~4x inbound fan-in → advanceInboundStreams 134ms → 200ms iters → ACK
starvation → no-ACK teardown churn. flood on the dense/large block topic is
net-harmful (blocks propagate fine via mesh; coverage is about small
attestations) — zero coverage benefit for the volume.

Fix: gate flood_publish to messages <= flood_publish_max_message_bytes (128 KiB,
configurable). Small attestations/aggregations still flood (the intended
first-hop-coverage benefit); large blocks fall back to the proven mesh-only
forward path (pre-flood behavior). Cuts the dominant volume → relieves BOTH the
outbox overflow AND the 134ms saturation (confirmed dominant lever by two
independent live investigations). Build clean; 504/506 tests. Pure zig-libp2p.
ch4r10t33r merged commit fe2a816 into main on Jun 28, 2026
3 checks passed