transport: gate persistent /meshsub teardown on last-leg — fix +0/-N coverage decay (residual finality blocker)#278

Merged
ch4r10t33r merged 1 commit into main from fix/persistent-gossip-last-leg on Jun 28, 2026

@ch4r10t33r

Collaborator

THE residual finality blocker after v0.2.51. With the prior batch, coverage rose from 8 to 21 but parked 1–2 short of the 22 quorum, and finality stalled. The live aggregator log was decisive: late=none with diff=+0/-7 means the attestations were not delivered at all (rather than arriving late), so the aggregate only loses attesters and never gains them.
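To make the +0/-N reading concrete, here is a minimal sketch of how such a diff could be derived from two consecutive aggregates. The function name `coverage_diff`, the use of integer attester indices, and the exact string format are assumptions for illustration; only the 22 quorum and the +0/-7 figure come from the description above.

```python
QUORUM = 22  # quorum size quoted in the description above


def coverage_diff(prev: set, curr: set) -> str:
    """Format attesters gained/lost between two aggregates as '+G/-L'.

    Hypothetical helper: the real aggregator's bookkeeping may differ.
    """
    gained = len(curr - prev)  # attesters present now but not before
    lost = len(prev - curr)    # attesters present before but missing now
    return f"+{gained}/-{lost}"


prev = set(range(21))  # 21 attesters covered (illustrative indices)
curr = set(range(14))  # 7 of them dropped, none added

print(coverage_diff(prev, curr))  # +0/-7
print(len(prev) >= QUORUM)        # False
```

A diff whose gained side stays at 0 across rounds is what distinguishes "never delivered" from "delivered late": late attestations would eventually show up as gains.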

Root cause (transport analog of the v0.2.45 connection_manager fix)

The per-peer persistent /meshsub stream (sh.persistent_gossip, keyed by peer, bound to ONE leg, outbound-preferred) was destroyed unconditionally on ANY leg close. Under sharded QUIC a peer holds 2 legs, so a flap of the non-stream-bearing leg destroyed a live gossip stream on the surviving leg. Meanwhile connection_manager.onConnectionClosed correctly returned "not fully disconnected" (other leg up) → gossipsub.onPeerDisconnected did NOT fire → no re-establish, no SUBSCRIBE replay → that peer's attestations stopped flowing → monotonic +0/-N coverage decay, never restored. v0.2.45 fixed the gossipsub-state wipe; the transport stream teardown was never given the same last-leg discipline.
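The failure mode can be reproduced in a toy model. This is a behavioural sketch in Python, not the zig-libp2p code: `PeerModel` and `close_leg_ungated` are hypothetical names, and the two legs are reduced to the labels "inbound" and "outbound".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PeerModel:
    """Toy model of one peer: two QUIC legs and one persistent gossip stream."""
    legs: set = field(default_factory=lambda: {"inbound", "outbound"})
    stream_leg: Optional[str] = "outbound"  # stream is outbound-preferred


def close_leg_ungated(peer: PeerModel, leg: str) -> bool:
    """Pre-fix behaviour: destroy the stream on ANY leg close.

    Returns True only when the peer is fully disconnected, which in this
    model is the only case that would trigger re-establish and SUBSCRIBE
    replay.
    """
    peer.legs.discard(leg)
    peer.stream_leg = None  # unconditional teardown: the bug
    return not peer.legs


peer = PeerModel()
fully_disconnected = close_leg_ungated(peer, "inbound")  # non-stream leg flaps

print(fully_disconnected)  # False
print(peer.stream_leg)     # None
```

The peer ends up with a live outbound leg, no gossip stream, and no disconnect event to trigger recovery, which matches the stuck state described above.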

Fix

Gate destroyPersistentGossipStream at both close sites (onLifecycleClosed for inbound, detectOutboundConnectionClose for outbound) on last-leg: destroy only if the stream was on THIS (closing) leg (g.raw tag match) OR liveLegShardForPeer(peer) == null. If the stream was on the closing leg but the peer survives, replay SUBSCRIBE so it re-learns interest (lazy reopen on next publish; cross-shard re-route via fanDirectedGossip). The outbound site is reordered: fetchRemove runs before the gate so the live-leg probe excludes the closing leg, and the conn is freed after the destroy (map values are pointers → raw.release touches a live client).
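The gate can be sketched in the same toy style. This is a Python model of the decision only, under the assumption that leg identity can be reduced to a label; `PeerModel`, `close_leg_gated`, and the `replay_subscribe` callback are hypothetical stand-ins, not the zig-libp2p functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PeerModel:
    """Toy model of one peer: its live legs and the leg carrying the stream."""
    legs: set = field(default_factory=lambda: {"inbound", "outbound"})
    stream_leg: Optional[str] = "outbound"


def close_leg_gated(peer: PeerModel, closing_leg: str,
                    replay_subscribe: Callable[[PeerModel], None]) -> None:
    """Destroy the stream only on a tag match or on last-leg close."""
    # Remove the closing leg first so the liveness probe cannot count it.
    peer.legs.discard(closing_leg)
    peer_survives = bool(peer.legs)
    stream_on_closing_leg = peer.stream_leg == closing_leg

    if stream_on_closing_leg or not peer_survives:
        peer.stream_leg = None
        if peer_survives:
            # Stream died with its leg but the peer is still connected.
            replay_subscribe(peer)


replays = []

a = PeerModel()
close_leg_gated(a, "inbound", replays.append)   # non-stream leg flaps
print(a.stream_leg, len(replays))               # outbound 0

b = PeerModel()
close_leg_gated(b, "outbound", replays.append)  # stream-bearing leg closes
print(b.stream_leg, len(replays))               # None 1
```

If the probe ran before the closing leg was removed, a last-leg close would still see one live leg and skip the teardown, which is the reason for the reordering at the outbound site.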

Adversarial review (done, no changes needed)

No UAF (verified the fetchRemove-before-destroy window keeps the client alive), no leak (stream + leg co-located per shard; last-leg close always reaps it; deinit backstop), leg-liveness correct including the cross-shard straddle, tag comparison valid, and it genuinely stops the decay.

Build clean; 504/506 tests pass. Pure zig-libp2p change.

…coverage decay

THE residual finality blocker after v0.2.51 (coverage held ~21 but parked 1-2
short of the 22 quorum; aggregator logs showed late=none + diff=+0/-7, i.e.
attestations NOT delivered, never recovered).

ROOT CAUSE (transport analog of the v0.2.45 connection_manager fix, one layer
down): the per-peer persistent /meshsub stream (sh.persistent_gossip, keyed by
peer, bound to ONE leg — outbound-preferred) was destroyed UNCONDITIONALLY on
ANY leg close. Under sharded QUIC a peer holds 2 legs; a flap of the
non-stream-bearing leg destroyed a LIVE gossip stream on the surviving leg,
while connection_manager correctly kept peer-level state (other leg up) so no
re-establish/SUBSCRIBE-replay fired → that peer's attestations stopped flowing
→ monotonic +0/-N coverage decay, never restored.

Fix: gate destroyPersistentGossipStream at both close sites (onLifecycleClosed
inbound + detectOutboundConnectionClose outbound) on last-leg — destroy only if
the stream was on THIS (closing) leg (g.raw tag match) OR liveLegShardForPeer ==
null. If the stream was on the closing leg but the peer survives, replay
SUBSCRIBE so it re-learns interest (lazy reopen on next publish / cross-shard
re-route via fanDirectedGossip). Outbound site reordered: fetchRemove before
the gate so liveLegShardForPeer excludes the closing leg; conn freed after the
destroy (map values are pointers, so raw.release touches a live client).

Adversarially reviewed: no UAF, no leak, leg-liveness correct incl. cross-shard
straddle, stops the decay. Build clean; 504/506 tests.
@ch4r10t33r ch4r10t33r merged commit b9b2659 into main Jun 28, 2026
3 of 4 checks passed