transport: gate persistent /meshsub teardown on last-leg — fix +0/-N coverage decay (residual finality blocker)#278
Merged
Conversation
…coverage decay THE residual finality blocker after v0.2.51 (coverage held ~21 but parked 1-2 short of the 22 quorum; aggregator logs showed late=none + diff=+0/-7, i.e. attestations NOT delivered, never recovered). ROOT CAUSE (transport analog of the v0.2.45 connection_manager fix, one layer down): the per-peer persistent /meshsub stream (sh.persistent_gossip, keyed by peer, bound to ONE leg — outbound-preferred) was destroyed UNCONDITIONALLY on ANY leg close. Under sharded QUIC a peer holds 2 legs; a flap of the non-stream-bearing leg destroyed a LIVE gossip stream on the surviving leg, while connection_manager correctly kept peer-level state (other leg up) so no re-establish/SUBSCRIBE-replay fired → that peer's attestations stopped flowing → monotonic +0/-N coverage decay, never restored. Fix: gate destroyPersistentGossipStream at both close sites (onLifecycleClosed inbound + detectOutboundConnectionClose outbound) on last-leg — destroy only if the stream was on THIS (closing) leg (g.raw tag match) OR liveLegShardForPeer == null. If the stream was on the closing leg but the peer survives, replay SUBSCRIBE so it re-learns interest (lazy reopen on next publish / cross-shard re-route via fanDirectedGossip). Outbound site reordered: fetchRemove before the gate so liveLegShardForPeer excludes the closing leg; conn freed after the destroy (map values are pointers, so raw.release touches a live client). Adversarially reviewed: no UAF, no leak, leg-liveness correct incl. cross-shard straddle, stops the decay. Build clean; 504/506 tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
THE residual finality blocker after v0.2.51. Coverage went 8→21 with the prior batch but parked 1–2 short of the 22 quorum and finality stalled. The live aggregator log was decisive:
late=none+diff=+0/-7→ attestations not delivered (not late), aggregate only loses, never gains.Root cause (transport analog of the v0.2.45 connection_manager fix)
The per-peer persistent
/meshsubstream (sh.persistent_gossip, keyed by peer, bound to ONE leg — outbound-preferred) was destroyed unconditionally on ANY leg close. Under sharded QUIC a peer holds 2 legs; a flap of the non-stream-bearing leg destroyed a live gossip stream on the surviving leg. Meanwhileconnection_manager.onConnectionClosedcorrectly returned "not fully disconnected" (other leg up) →gossipsub.onPeerDisconnecteddid NOT fire → no re-establish, no SUBSCRIBE replay → that peers attestations stopped flowing → monotonic+0/-Ncoverage decay, never restored. v0.2.45 fixed the gossipsub-state wipe; the transport stream teardown was never given the same last-leg discipline.Fix
Gate
destroyPersistentGossipStreamat both close sites on last-leg: destroy only if the stream was on THIS (closing) leg (g.rawtag match) ORliveLegShardForPeer(peer) == null. If the stream was on the closing leg but the peer survives, replay SUBSCRIBE (lazy reopen on next publish; cross-shard re-route viafanDirectedGossip). Outbound site reordered so the live-leg probe excludes the closing leg, with the conn freed after the destroy (map values are pointers →raw.releasetouches a live client).Adversarial review (done, no changes needed)
No UAF (verified the fetchRemove-before-destroy window keeps the client alive), no leak (stream + leg co-located per shard; last-leg close always reaps it; deinit backstop), leg-liveness correct including the cross-shard straddle, tag comparison valid, and it genuinely stops the decay.
Build clean; 504/506 tests. Pure zig-libp2p.