Skip to content

In-house discv5#11728

Open
flcl42 wants to merge 193 commits into
masterfrom
generic-kad-2
Open

In-house discv5#11728
flcl42 wants to merge 193 commits into
masterfrom
generic-kad-2

Conversation

@flcl42

@flcl42 flcl42 commented May 22, 2026

Copy link
Copy Markdown
Contributor

Changes

  • Even more generic KAD
  • Our own discv5 that shares code with discv4
  • Optimized and checked for vulnerabilities

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Remarks

Change list

  • Extracted the reusable Kademlia routing logic into Nethermind.Kademlia: routing table, lookup algorithm, node health tracking, distance abstractions, and generic node/key operators now live outside discv4-specific code.
  • Rewired discv4 to consume the shared Kademlia core through a protocol adapter, while keeping discv4-specific bonding, signed RLP messages, FINDNODE/NEIGHBORS, ENR request/response, and XDC extension hooks in the discv4 layer.
  • Replaced the external Lantern-based discv5 implementation with native Nethermind code: UDP packet codec, WHOAREYOU handshake, session/key derivation, message codec, PING/PONG, FINDNODE/NODES, TALKREQ/TALKRESP, and ENR propagation.
  • Added a separate discv5 Kademlia instance and adapter so discv5 FINDNODE distance requests map to GetAllAtDistance, while discv4 keeps nearest-node lookup semantics.
  • Unified discovery lifecycle shape: discv4 and discv5 both run through discovery app lifecycle, child Autofac protocol scopes, Kademlia node sources, persistence, channel activation, and node-removal handling.
  • Kept execution peer candidates separate from discovery routing endpoints: discv5 routing uses ENR UDP endpoints, while peer-pool candidates are emitted only from usable ENR TCP endpoints.
  • Hardened ENR handling: preserved unknown ENR entries, rejected consensus-only eth2 records before execution peer admission, skipped malformed records, and prevented stale ENR sequence downgrades.
  • Hardened UDP receive paths and hot allocations: cheap pre-copy packet checks, global/per-IP discv4 inbound limiting, bounded receive queues, pooled/owned discv5 message decode, direct RLP length calculations, and reduced serializer buffer copies.
  • Bounded discv5 state that can grow under remote input: sessions, sent challenges, pending nonces, response handlers, known ENRs, recent-node dedupe, and NODES response accumulation.
  • Improved shutdown/disposal behavior so discovery node-source enumeration stops background discovery work when the caller disposes enumeration without externally cancelling.
  • Refactored discovery code organization for reviewability: discv4-specific messages/serializers/handlers stay in discv4 folders, discv5 serializers are per-message, concrete leaf types are sealed where not used as extension hooks.
  • Updated network defaults and smoke-validated behavior: mainnet uses both discoveries, testnets use V5-only, and mainnet/Hoodi runs showed discv5 discovery traffic without discv5 packet-loop errors.
  • Added focused interoperability and regression coverage, including go-ethereum discv5 wire vectors, packet codec/session tests, serializer round-trips, Kademlia adapter/source tests, ENR endpoint filtering tests, and discovery lifecycle/disposal tests.

Architecture overview

The new shape separates reusable Kademlia table mechanics from protocol transport. Nethermind.Kademlia owns generic routing, lookup, health tracking, and distance services; Nethermind.Network.Discovery supplies Ethereum discovery-specific adapters, persistence, node sources, and lifecycle. Discv4 and discv5 each create their own protocol child scope and own Kademlia instance, then bind protocol-specific message senders and node-source behavior on top of the same core. Discv4 keeps signed RLP packet handling, bonding, and nearest-neighbour discovery, while discv5 owns packet/session encryption, ENR-based messages, and distance-bucket FINDNODE semantics. Both versions feed discovered execution peer candidates through role-specific ENR conversion so UDP discovery endpoints stay in routing and TCP endpoints go to the peer pool.

High-level architecture

flowchart TB
    NetworkConfig["Network config<br/>DiscoveryVersion + bootnodes"] --> Composite["CompositeDiscoveryApp"]
    Composite --> V4App["Discv4 DiscoveryApp"]
    Composite --> V5App["Discv5 DiscoveryV5App"]

    V4App --> V4Scope["discv4 child scope"]
    V5App --> V5Scope["discv5 child scope"]

    V4Scope --> V4Adapter["discv4 KademliaAdapter<br/>bonding + FINDNODE nearest lookup"]
    V5Scope --> V5Adapter["discv5 KademliaAdapter<br/>session-aware FINDNODE distances"]

    V4Adapter --> Kad4["Kademlia<PublicKey, Node><br/>discv4 instance"]
    V5Adapter --> Kad5["Kademlia<PublicKey, Node><br/>discv5 instance"]

    Kad4 --> Core["Nethermind.Kademlia<br/>routing table + lookup + health"]
    Kad5 --> Core

    V4App --> V4Handler["NettyDiscoveryHandler<br/>signed RLP UDP packets"]
    V5App --> V5Handler["NettyDiscoveryV5Handler"]
    V5Handler --> PacketCodec["PacketCodec<br/>WHOAREYOU + sessions + AES-GCM"]
    PacketCodec --> MessageCodec["MessageCodec<br/>PING/PONG/FINDNODE/NODES/TALK"]

    V4Adapter --> NodeSource4["discv4 NodeSource"]
    V5Adapter --> NodeSource5["discv5 NodeSource"]
    NodeSource4 --> PeerPool["PeerPool candidates"]
    NodeSource5 --> PeerPool
Loading

Key moment: discv5 packet handling

sequenceDiagram
    participant UDP as NettyDiscoveryV5Handler
    participant Codec as PacketCodec
    participant Adapter as Discv5 KademliaAdapter
    participant Kad as Kademlia

    UDP->>UDP: cheap size checks and bounded queue admission
    UDP->>Codec: accepted packet bytes
    Codec->>Codec: decode ordinary, WHOAREYOU, or handshake packet
    alt no usable session
        Codec-->>UDP: send WHOAREYOU challenge
    else authenticated message
        Codec->>Adapter: Discv5Message
        Adapter->>Kad: mark inbound health / add or refresh node
        alt FINDNODE
            Adapter->>Kad: GetAllAtDistance(requested distances)
            Kad-->>Adapter: matching nodes
            Adapter-->>Codec: NODES responses with ENRs
        else PING/PONG/TALK
            Adapter-->>Codec: protocol response or handler completion
        end
        Codec-->>UDP: encrypted response packet
    end
Loading

Key moment: ENR endpoint separation

flowchart LR
    ENR["ENR"] --> Validate["validate secp256k1<br/>node id, routability, sequence"]
    Validate --> Eth2{"consensus-only<br/>eth2 record?"}
    Eth2 -- yes --> Reject["reject for execution discovery"]
    Eth2 -- no --> Record["NodeRecord"]

    Record --> Routing["TryFromDiscoveryEnr<br/>UDP endpoint"]
    Routing --> KadTable["discv5 Kademlia table"]

    Record --> Candidate{"has usable TCP endpoint?"}
    Candidate -- yes --> PeerNode["TryFromEnr<br/>TCP endpoint"]
    PeerNode --> PeerPool["PeerPool"]
    Candidate -- no --> SkipPeer["skip RLPx peer candidate"]

    Record --> Sequence{"newer ENR sequence?"}
    Sequence -- yes --> Cache["replace known ENR"]
    Sequence -- no --> Keep["keep newer cached ENR"]
Loading

Key moment: receive-path hardening

flowchart TD
    Packet["UDP packet"] --> Cheap["read size/type from buffer"]
    Cheap --> Valid{"valid enough<br/>for this protocol?"}
    Valid -- no --> DropOrForward["drop or forward without copy"]
    Valid -- yes --> Limit["global + per-IP limits"]
    Limit --> Accepted{"within limits<br/>and queue capacity?"}
    Accepted -- no --> Drop["drop before packet copy/expensive work"]
    Accepted -- yes --> Queue["bounded channel"]
    Queue --> Workers["limited dispatch workers"]
    Workers --> Decode["deserialize / decrypt accepted traffic"]
    Decode --> Adapter["protocol adapter"]
    Adapter --> Kad["Kademlia + response handlers"]
Loading

@LukaszRozmej LukaszRozmej left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review pass on the new commits (8d91fb8c, d00b47fd, fb775f47, 34a5c467).

Comment thread src/Nethermind/Nethermind.Network/NetworkStorage.cs
Comment thread src/Nethermind/Nethermind.Network.Enr/NodeRecord.cs Outdated
Comment thread src/Nethermind/Nethermind.Network/PeerPool.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Stats/Model/Node.cs
Comment thread src/Nethermind/Nethermind.Network/NetworkStorage.cs Outdated
Comment thread src/Nethermind/Nethermind.Core/Caching/LruCache.cs

@LukaszRozmej LukaszRozmej left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through 50%.

I know it is a huge PR and it is hard to have full mastery of it. But my hunch is that while it is a good PR, there is still room for polish and it is not yet polished as it could be on all aspects.

Comment thread src/Nethermind/Nethermind.Core/Caching/LruCache.cs
Comment thread src/Nethermind/Nethermind.Crypto/SecP256k1Ecdh.cs
Comment thread src/Nethermind/Nethermind.Kademlia/DoubleEndedLru.cs Outdated
Comment thread src/Nethermind/Nethermind.Kademlia/DoubleEndedLru.cs Outdated
Comment thread src/Nethermind/Nethermind.Kademlia/DoubleEndedLru.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv5/Kademlia/KademliaAdapter.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv5/Kademlia/KademliaAdapter.cs Outdated
flcl42 and others added 12 commits June 9, 2026 13:40
Six near-identical NodeRecord builders across DiscoveryV5AppTests,
Discv4/Kademlia, and Discv5 test files are replaced by a single shared
TestEnrBuilder.BuildSigned helper. Two duplicate TestEth2Entry nested
classes are merged into one shared internal type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KBucketTests, KademliaTests, and NodeHealthTrackerTests each defined a
private static ToHash(ValueHash256) that resolved to the same expression
as the existing IdentityNodeHashProvider.ToHash and
ValueHashKeyOperator<T>.ToHash. Call the shared helpers directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/Nethermind/Nethermind.Runner.Test/DatabasePurgerTests.cs
@flcl42 flcl42 requested review from LukaszRozmej, ak88 and asdacap June 11, 2026 08:45
@flcl42

flcl42 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@claude review again

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Claude finished @flcl42's task in 9m 11s —— View job


PR Review: In-house discv5 — Round 4

  • Gather context and load rules
  • Check status of previous findings (LukaszRozmej round 2)
  • Review recent commits and changed files
  • Review key files (KademliaAdapter, PacketCodec, NodesMsgSerializer, Session, NetworkStorage, DoubleEndedLru)
  • Post inline comments
  • Final summary

Summary

This is the fourth review cycle. The PR is in very good shape — the latest commits (9af85fc9 More fixes, 7aeafbd0 More locals init skipped, 1abe8a97 More pooled, 43cc5d25 Remove unused discovery usings) address all remaining findings from the previous rounds.

Status of LukaszRozmej Round 2 findings

Finding Status
PruneLastBucketRefreshTicks unsound <= short-circuit ✅ Fixed — now iterates all keys and removes inactive ones
DiscoveryPersistenceManager double-registered ✅ Fixed — Discv4 KademliaModule no longer re-registers it
SendMessage/SendResponse race with stale disposed session ✅ Fixed — TryRemove on disposed-session path eliminated; falls through to re-handshake
Session.WriteNextNonce not lock-guarded ✅ Documented — <remarks> documents the required precondition (callers must hold a valid write key copy)
LruCache.Clear() lock-ordering hazard ✅ Fixed — NotifyEvictedValues called outside the lock
LruCache.NotifyEvictedValues vs NotifyEvicted inconsistency ✅ Fixed — both paths route through NotifyEvicted which null-guards
DiscoveryPersistenceManager.RunDiscoveryPersistenceCommit batch safety ✅ Acceptable — StartBatch() auto-discards any prior batch via DiscardBatchNoLock
NetworkStorage.Commit() catch leaves dirty in-memory view ✅ Fixed — ClearLocalCacheNoLock now sets _loadedFromDb = false, forcing reload on next access
Redundant DiscardBatch() in PeerPool / DiscoveryPersistenceManager ✅ Fixed — DiscardBatch removed from INetworkStorage interface entirely
NodeRecord.DiscoveryIp/DiscoveryPort for ip + udp6 ℹ️ Remains — acceptable; no real client emits such a record
NodeSource.cs missing IsConsensusOnlyNodeRecord check ✅ Fixed — check present at line 98
DoubleEndedLru sub-optimal with LinkedList ✅ Fixed — rewritten with a preallocated intrusive array + free list; O(1) access to both ends, no steady-state allocation
SkipLocalsInit on hot methods ✅ Fixed — applied across PacketCodec, KademliaAdapter, SecP256k1Ecdh
Lock class usage ✅ Fixed — Lock used throughout KBucketTree, NodeHealthTracker, DoubleEndedLru, KademliaAdapter
NodesMsgSerializer throws entire NODES message on bad ENR ✅ Fixed — TryDecodeNodeRecord wraps FromBytes and skips malformed entries

Remaining findings

Low

Defensive null guards for non-nullable Node (project-rule violation)LookupKNearestNeighbour.cs:64, KademliaAdapter.cs:86, KademliaAdapter.cs:634

Null guards exist for a type constrained to notnull. Corresponding tests (KademliaAdapterTests.cs:67,140-142) still inject null! to exercise them. Per the project rule: "Don't add error handling, fallbacks, or validation for scenarios that can't happen." Cleaner to remove the guards and fix the mocks to not use null!. (Inline comment posted.)

What's well done in this revision

  • DoubleEndedLru rewrite: intrusive array-based linked list with a slot free list — zero steady-state allocations, better cache locality than LinkedList<T>. O(1) all operations.
  • LruCache.OnEvict as event: the promotion from Action<TValue> ctor param to event Action<TValue>? OnEvict is cleaner; callers can subscribe/unsubscribe.
  • All evict callbacks now called outside the cache lock: Set, Delete, Clear, SetOrGet all release the McsLock before invoking onEvict. No lock-ordering hazard.
  • NodesMsgSerializer.TryDecodeNodeRecord: per-ENR try/catch with a specific exception filter (RlpException | ArgumentException | InvalidOperationException | FormatException) correctly isolates bad ENRs without silently swallowing unexpected failures.
  • PacketCodec._decodeMaskingAesPool: ObjectPool<Aes> correctly handles the multi-worker decode path introduced in round 3.
  • NetworkStorage recovery: the _loadedFromDb flag introduced in commit 84a7beed closes the gap where ClearLocalCache after a failed commit would leave the in-memory view stale.
  • DiscoveryKademliaConfigFactory placeholder comment: the one-line "The table only needs the local node identity here; its endpoint is never contacted" remark correctly documents why 127.0.0.1:9999 is intentional.

Overall

All High and Medium findings across four review rounds have been resolved. The one remaining Low finding (null guards for non-nullable types) is a stylistic violation of the project rule that does not affect correctness or safety. This PR is ready to merge.

Comment thread src/Nethermind/Nethermind.Kademlia/LookupKNearestNeighbour.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/KademliaDiscoveryApp.cs Outdated
flcl42 and others added 3 commits June 12, 2026 12:42
A throw from packetCodec.TryDecode ran outside the per-packet handler's
catch, so a single crafted packet could terminate one of the packet
workers. Catch per iteration and rethrow only on shutdown cancellation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants