feat(raft): MuxAcceptor — single port per node (M2.T3)#591

Merged: osvaldoandrade merged 2 commits into main from feat/raft-mux-transport on May 18, 2026

Conversation

@osvaldoandrade

Owner

Summary

Closes M2.T3. Every Pebble shard's raft group can now share one TCP listener per node, opt-in via `cfg.Raft.MuxEnabled` (env: `RAFT_MUX_ENABLED=true`). A 4-shard, 3-node deployment goes from 12 listeners to 3.

The non-mux path (M1/M2 per-shard +offset) stays intact so existing deployments don't break — flipping the flag on a live cluster requires re-bootstrap because the wire format gains a 4-byte BE group ID prefix.

Architecture

```
node-1
└── TCP listener :7000 (MuxAcceptor)
├── conn[groupID=0] → shard 0 raft group
├── conn[groupID=1] → shard 1 raft group
├── conn[groupID=2] → shard 2 raft group
└── conn[groupID=3] → shard 3 raft group
```

Wire format on each accept: 4-byte BE `uint32` group ID, then raft's existing `NetworkTransport` protocol takes over. No new RPCs, no protobuf — the demux is at the bottom of the stack.

What landed

Part 1 — `MuxAcceptor` foundation (commit 1 in the branch)

  • `internal/raft/mux_transport.go` — `MuxAcceptor` owns one `net.Listener` + per-group queues; `RegisterGroup(groupID)` returns a `hraft.StreamLayer`. `Accept()` pops connections routed to its group; `Dial()` writes the group prefix then hands off the raw `net.Conn` to hashicorp/raft.
  • 5 unit tests: two-group routing, duplicate registration error, unknown-group cleanup, accept-unblocks-on-close, concurrent traffic (10 dials per group × 2 groups).

Part 2 — wireup (commit 2)

  • `internal/raft.Config` gains `StreamLayer` (optional). When set, `openInternal` uses `hraft.NewNetworkTransport` on top of it; otherwise it falls back to the existing TCP transport.
  • `pkg/config.RaftConfig.MuxEnabled` + `RAFT_MUX_ENABLED` env var.
  • `application_pebble.go` opens one `MuxAcceptor` per node when enabled, registers a group per shard, threads the `StreamLayer` through. Peer addresses pass through unchanged (no `+shardIdx` offset in mux mode).
  • Shutdown order: `raft.Close → muxAcceptor.Close → pebble.Close`. Both startup-failure cleanup and `TracingShutdown` honor it.
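To make the listener arithmetic concrete, here is a minimal sketch of how a shard's address might be derived in each mode. It assumes the legacy offset means "base port plus shard index", which this PR implies but does not spell out; `shardAddr` and `distinctListeners` are hypothetical names, not functions from the codebase.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// shardAddr (hypothetical) returns the address a shard's raft group binds.
// Assumption: the legacy path adds shardIdx to the base port, while mux mode
// passes the base address through unchanged.
func shardAddr(base string, shardIdx int, muxEnabled bool) (string, error) {
	if muxEnabled {
		return base, nil
	}
	host, portStr, err := net.SplitHostPort(base)
	if err != nil {
		return "", err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(host, strconv.Itoa(port+shardIdx)), nil
}

// distinctListeners counts the unique bind addresses one node needs.
func distinctListeners(base string, shards int, muxEnabled bool) (int, error) {
	seen := make(map[string]struct{})
	for i := 0; i < shards; i++ {
		addr, err := shardAddr(base, i, muxEnabled)
		if err != nil {
			return 0, err
		}
		seen[addr] = struct{}{}
	}
	return len(seen), nil
}

func main() {
	for _, mux := range []bool{false, true} {
		n, err := distinctListeners("node-1:7000", 4, mux)
		if err != nil {
			panic(err)
		}
		fmt.Printf("mux=%v: %d per node, %d across 3 nodes\n", mux, n, 3*n)
	}
}
```

With 4 shards this reports 4 listeners per node (12 across 3 nodes) on the legacy path and 1 per node (3 in total) with mux enabled, matching the numbers in the summary.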

Test plan

  • 5 `MuxAcceptor` unit tests pass
  • `TestRaft_Mux_3Node_4Shard` — 3 nodes × 4 shards = 12 raft groups via 3 listeners. Same failover correctness as M2.T5 (kill node, re-elect, 60 tasks consistent on survivors). ~2.8 s.
  • All pre-existing raft + pebble + app tests still pass — non-mux path unchanged
  • Manual: enable `muxEnabled: true` on the deploy compose template, verify a 3-node bring-up only opens port 7000 on each container

🤖 Generated with Claude Code

…art 1)

The first piece of M2.T3: a multiplexed transport so every Pebble
shard's raft group can share one TCP port instead of binding
shardIdx-offset listeners. Reduces operational complexity (one
firewall rule, one network policy entry per node) without changing
the hashicorp/raft protocol — the wire shape is just a 4-byte BE
group ID prefix followed by raft's existing NetworkTransport bytes.

Layered to plug into hraft.NewNetworkTransport via its StreamLayer
abstraction:

  acceptor := NewMuxAcceptor(":7000", logOut)
  shard0 := acceptor.RegisterGroup(0)  // hraft.StreamLayer
  shard1 := acceptor.RegisterGroup(1)
  transport0 := hraft.NewNetworkTransport(shard0, ...)
  transport1 := hraft.NewNetworkTransport(shard1, ...)

What the acceptor does:
- Opens one TCP listener on bindAddr.
- Each accepted connection's first 4 bytes are read as a BE uint32
  group ID under a 1-second deadline; the connection is then handed
  off raw to the matching registered StreamLayer's accept queue.
- Unknown group IDs close the connection silently — a malformed peer
  doesn't block the route goroutine.
- Close() unwinds the listener + all registered StreamLayer queues;
  pending Accept() calls return immediately.

Each StreamLayer:
- Accept() pops connections routed to its group; blocks until one
  arrives or the acceptor closes.
- Dial(addr, timeout) opens a TCP connection to addr, writes the
  group ID prefix, then yields the raw net.Conn so hashicorp/raft's
  NetworkTransport runs its handshake on top.

Wire format: 4-byte BE uint32 group ID, then raft.NetworkTransport
bytes. Backward-incompatible with hashicorp/raft's stock
NewTCPTransport (which doesn't write the prefix), so M2.T3-part2
(wiring) must flip all shards atomically — not a rolling upgrade.

Tests (5 passing):
- TwoGroupsRouteIndependently: dial + accept round-trip per group
  with no crossover.
- DuplicateRegistrationErrors: same groupID twice → error.
- UnknownGroupClosesConn: connection with an unregistered groupID
  is dropped cleanly.
- AcceptUnblocksOnClose: pending Accept returns on acceptor.Close.
- ConcurrentTraffic: 10 simultaneous dials per group × 2 groups
  all route correctly under contention.

Next: wire MuxAcceptor into application_pebble.go's raft startup so
all shards share one port. That's M2.T3 part 2.

The big win behind M2.T3: every Pebble shard's raft group now shares
one TCP listener per node when cfg.Raft.MuxEnabled=true. The non-mux
path keeps the M1/M2 per-shard +offset behavior so existing
deployments don't break.

Wiring:
- internal/raft.Config gains StreamLayer (optional hraft.StreamLayer).
  openInternal uses hraft.NewNetworkTransport on top when set,
  otherwise falls back to hraft.NewTCPTransport with cfg.BindAddr.
- pkg/config.RaftConfig gains MuxEnabled bool (default false) +
  RAFT_MUX_ENABLED env override.
- pkg/app/application_pebble.go: when cfg.Raft.MuxEnabled, opens one
  MuxAcceptor at cfg.Raft.BindAddr, registers a group per shardIdx,
  passes the StreamLayer through to raftpkg.OpenWithPebble. Every
  shard binds the same port (the acceptor's); peers come through
  cfg.Raft.Peers unchanged (no per-shard offset). Non-mux path
  unchanged.
- Shutdown order: raft.Close → muxAcceptor.Close → pebble.Close.
  cleanupStartupFailure and TracingShutdown both honor the order.

Wire format: 4-byte BE uint32 group ID prefix, then raft's
NetworkTransport bytes. Incompatible with hraft's stock TCPTransport
(which doesn't write the prefix), so flipping MuxEnabled on a live
cluster requires re-bootstrap. M1/M2 deployments stay on the legacy
path until they opt in.

Tests:
- TestRaft_Mux_3Node_4Shard: 3 nodes × 4 shards = 12 raft groups
  across just 3 listeners. Same failover semantics as
  TestRaft_MultiShard_3Node (kill node, re-elect, 60 tasks consistent
  on survivors) but using mux throughout. Runs in ~2.8 s.
- Pre-existing raft tests (non-mux path) all still pass — the legacy
  flag-off route is unchanged.

This closes M2.T3 as a feature: future deployments use mux for the
cleaner single-port story, and legacy ones get there at their next
re-bootstrap.
@osvaldoandrade osvaldoandrade merged commit f1c2332 into main May 18, 2026
2 checks passed
@osvaldoandrade osvaldoandrade deleted the feat/raft-mux-transport branch May 18, 2026 15:12