feat(raft): M1 — replicate Pebble writes through hashicorp/raft (in-tree) #584

Merged: osvaldoandrade merged 10 commits into `main` from `feat/raft-m1` on May 18, 2026

@osvaldoandrade

Summary

Milestone 1 of the raft replication work documented in the plan. Adds an in-tree `internal/raft/` package that wires hashicorp/raft on top of the existing Pebble store. When `cfg.Raft.Enabled` is true, every `Set`/`Delete`/`CommitBatch` on the existing pebble repository layer flows through `raft.Apply` and lands on every replica's Pebble via the FSM. Reads stay local. The rest of codeq is unaware that replication is happening.

Decisions locked in plan (asked + approved):

  • Build in-tree first, no extracted `raft-pebble` library
  • Multi-raft per shard as eventual goal (M2); single-raft per node for M1
  • Library: `github.com/hashicorp/raft`
  • Mutually exclusive with the legacy `cluster` mode and with `numShards > 1`

What landed (M1.T1 through M1.T10)

| # | Title | Tests |
|---|-------|-------|
| T1 | Skeleton package + hashicorp/raft dep | |
| T2 | LogStore on Pebble (`raft/log/` prefix, msgpack values) | 9 |
| T3 | StableStore on Pebble (`raft/stable/` prefix, BE uint64) | 6 |
| T4 | Snapshot via `pebble.NewSnapshot` (iter-based stream, not Checkpoint) + FSM Persist/Restore | 7 |
| T5 | FSM.Apply over `pebble.Batch.Repr` (deterministic replication on every node) | 8 |
| T6 | `raft.DB` wrapper + `NewRaft` wireup + TCP transport + leader signal | 11 |
| T7 | `config.RaftConfig` + validation + `application_pebble.go` branch (`AttachReplicator`) | 2 E2E |
| T8 | Reaper `LeaderGate`: followers stay passive | 3 |
| T9 | 3-node cluster failover end-to-end (REST in-process) | 1 (320 ms run) |
| T10 | Bench: raft vs single-node baseline | 1 |

Numbers

  • 47+ tests added (internal/raft + repository/pebble + pkg/app)
  • 3-node failover: leader killed at t=0, new leader elected and writes continue in ~320 ms total
  • Bench (12-core dev box, loopback): single-node REST baseline ~9.7k cycles/s; raft 3-node ~7.1k cycles/s; raft retains ~73% of baseline throughput. Each "cycle" = create + claim + submit, so it's three writes per cycle through the raft.Apply pipeline.
  • Build clean, vet clean, `go test ./...` passes everywhere except the existing `internal/bench` long-running perf timeout (pre-existing, unrelated)

Architectural notes

  • Delegation seam: `internal/repository/pebble.DB` got a new `Replicator` interface and `AttachReplicator()` method. When attached, the existing coalescer is bypassed and writes route through `Replicator.Replicate(repr)`. The repository code (`task_repository.go`, `reaper.go`, etc.) is unchanged.
  • Shared Pebble: `raft.OpenWithPebble(ctx, cfg, pdb)` lets the application layer open Pebble once and share it between the repository wrapper and the raft FSM (different key prefixes: `codeq/` for user state, `raft/log/` and `raft/stable/` for raft internals).
  • Lifecycle: raft.Close runs before pebble.Close — otherwise the FSM apply pipeline would write to a closed DB and panic.
  • Reads stay local — follower reads may be stale; the repository code (and clients) handle that explicitly. M2 can add a linearizable read path if needed.
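
To make the delegation seam concrete, here is a minimal sketch under stated assumptions: a map stands in for Pebble, a `key=value` string stands in for `pebble.Batch.Repr()`, and the `Replicator` method signatures, `fakeRaft`, and `applyLocal` are illustrative names rather than the real code. Only the routing decision (attached or not, leader or not) is meant to mirror the notes above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrNotLeader mirrors the sentinel named in the PR; this definition is assumed.
var ErrNotLeader = errors.New("not leader")

// Replicator carries the two methods the PR names; the signatures are assumptions.
type Replicator interface {
	IsLeader() bool
	Replicate(repr []byte) error
}

// DB is a toy stand-in for the repository wrapper (map instead of Pebble).
type DB struct {
	local map[string]string
	repl  Replicator
}

func NewDB() *DB { return &DB{local: map[string]string{}} }

// AttachReplicator switches the write path from local to replicated.
func (d *DB) AttachReplicator(r Replicator) { d.repl = r }

// applyLocal plays the role of the FSM apply step on a replica.
func (d *DB) applyLocal(repr []byte) error {
	k, v, ok := strings.Cut(string(repr), "=")
	if !ok {
		return fmt.Errorf("corrupt repr %q", repr)
	}
	d.local[k] = v
	return nil
}

// Set writes locally when no replicator is attached; otherwise it requires
// leadership and hands the encoded write to the replicator.
func (d *DB) Set(key, value string) error {
	repr := []byte(key + "=" + value)
	if d.repl == nil {
		return d.applyLocal(repr)
	}
	if !d.repl.IsLeader() {
		return ErrNotLeader
	}
	return d.repl.Replicate(repr)
}

// Get always reads local state, as in the PR.
func (d *DB) Get(key string) (string, bool) { v, ok := d.local[key]; return v, ok }

// fakeRaft "commits" instantly and applies to a single replica.
type fakeRaft struct {
	leader bool
	fsm    *DB
}

func (f *fakeRaft) IsLeader() bool              { return f.leader }
func (f *fakeRaft) Replicate(repr []byte) error { return f.fsm.applyLocal(repr) }

func main() {
	db := NewDB()
	db.AttachReplicator(&fakeRaft{leader: false, fsm: db})
	fmt.Println(db.Set("codeq/task/1", "PENDING")) // follower: write is rejected
	db.AttachReplicator(&fakeRaft{leader: true, fsm: db})
	fmt.Println(db.Set("codeq/task/1", "PENDING")) // leader: replicated, then applied
	fmt.Println(db.Get("codeq/task/1"))
}
```

Callers of `Set` and `Get` do not change when a replicator is attached, which is why the repository files can stay untouched.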

Mutual exclusion

`cfg.Validate` rejects:

  • `raft.enabled + cluster.enabled` (different replication models)
  • `raft.enabled + sharding.enabled` (legacy multi-backend sharding)
  • `raft.enabled + persistenceProvider != pebble`
  • `raft.enabled` without `selfId` or `bindAddr`, or with `selfId` missing from `peers`
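
A rough sketch of how these rejections could be expressed follows. The struct layout, field names, and error strings are assumptions made for illustration (in particular, `Peers` as a map from node ID to address); only the rules themselves come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// RaftConfig is a hypothetical, trimmed-down view of the raft settings.
type RaftConfig struct {
	Enabled  bool
	SelfID   string
	BindAddr string
	Peers    map[string]string // node ID -> address (assumed shape)
}

// Config is a hypothetical stand-in for the application config.
type Config struct {
	PersistenceProvider string
	ClusterEnabled      bool
	ShardingEnabled     bool
	Raft                RaftConfig
}

// Validate applies the mutual-exclusion rules only when raft is enabled.
func (c Config) Validate() error {
	r := c.Raft
	if !r.Enabled {
		return nil
	}
	switch {
	case c.ClusterEnabled:
		return errors.New("raft.enabled and cluster.enabled are mutually exclusive")
	case c.ShardingEnabled:
		return errors.New("raft.enabled and sharding.enabled are mutually exclusive")
	case c.PersistenceProvider != "pebble":
		return fmt.Errorf("raft.enabled requires persistenceProvider=pebble, got %q", c.PersistenceProvider)
	case r.SelfID == "":
		return errors.New("raft.enabled requires selfId")
	case r.BindAddr == "":
		return errors.New("raft.enabled requires bindAddr")
	}
	// The membership check applies only when a peer list is configured.
	if len(r.Peers) > 0 {
		if _, ok := r.Peers[r.SelfID]; !ok {
			return fmt.Errorf("selfId %q is not listed in peers", r.SelfID)
		}
	}
	return nil
}

func main() {
	cfg := Config{
		PersistenceProvider: "pebble",
		Raft: RaftConfig{
			Enabled:  true,
			SelfID:   "node-1",
			BindAddr: "127.0.0.1:7000",
			Peers:    map[string]string{"node-1": "127.0.0.1:7000"},
		},
	}
	fmt.Println("valid config:", cfg.Validate())
	cfg.ClusterEnabled = true
	fmt.Println("raft + cluster:", cfg.Validate())
}
```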

What's NOT in this PR

  • No extraction: the package lives at `internal/raft/`; if a future PeterDB needs a standalone `raft-pebble` lib, the boundary is clean enough to extract then.
  • Single-raft per node: M2 (next milestone) adds multi-raft per shard. Validation currently rejects `numShards > 1` together with `raft.enabled`.
  • No linearizable reads: reads pass through to local Pebble, so reading from a stale follower is allowed.
  • No gRPC multiplexed transport: M1 uses hashicorp/raft's stock TCP transport. M2 swaps to a multiplexed gRPC transport for multi-raft.

Test plan

  • `go build ./...` clean
  • `go vet ./...` clean
  • `go test ./internal/raft ./internal/repository/pebble ./pkg/app -count=1` — all pass
  • 3-node failover (`TestRaft3NodeFailover`) passes deterministically (~320 ms)
  • Bench (`TestRaftBench_VsSingleNode`) passes with raft ≥ 30% of baseline
  • Manual: stand up a 3-node deployment via `codeq install --target docker` (out of scope — Compose templates aren't raft-aware yet; follow-up task)

🤖 Generated with Claude Code

First milestone of the raft replication work documented in
/home/ova/.claude/plans/woolly-stargazing-orbit.md. M1 builds an
in-tree internal/raft/ package with the same surface as
internal/repository/pebble.DB so the repository layer can swap one
for the other without touching the service tier.

This commit lays the skeleton — every file compiles, no behavior
change, no tests yet. Subsequent tasks fill in the implementations:

- db.go        : DB wrapper with Open/Close, read pass-through, write
                 path stubs (writes still go direct to Pebble until T6)
- fsm.go       : hashicorp/raft.FSM stub (T5 fills Apply/Snapshot/Restore)
- log_store.go : hashicorp/raft.LogStore stub (T2)
- stable_store.go : hashicorp/raft.StableStore stub (T3)
- snapshot.go  : hashicorp/raft.SnapshotStore stub (T4)
- leader.go    : placeholder LeaseTable type for the wrapper surface

Dependencies added: github.com/hashicorp/raft v1.7.3 + transitive
(go-immutable-radix, go-metrics, go-msgpack/v2, golang-lru, go-uuid).

hashicorp/raft.LogStore implementation backed by the same Pebble store
the FSM uses. Log entries live under the `raft/log/<be8 index>` prefix;
values are msgpack-encoded raft.Log structs (same wire shape as
hashicorp/raft-boltdb so the encoding is unsurprising).

API contracts verified by tests:
- FirstIndex / LastIndex on empty store → 0
- StoreLog + GetLog round-trip preserves Index, Term, Type, Data
- GetLog on missing index → hraft.ErrLogNotFound (raft uses the sentinel
  to decide between catch-up entries and snapshot install)
- StoreLogs commits a whole batch atomically via pebble.Batch
- StoreLogs(nil) is a no-op
- DeleteRange [min, max] is inclusive on both ends; uses pebble's native
  range delete with a +1 carry on the end key
- DeleteRange of prefix, suffix, and single-entry ranges all behave
- DeleteRange with min > max returns an error
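
The key layout and the inclusive-to-exclusive conversion behind DeleteRange can be sketched with the standard library alone. This is an illustration, not the package's code: `logKey` and `indexFromKey` reuse the helper names mentioned in this commit but their bodies are guesses, and `deleteBounds` is a hypothetical helper that produces the half-open `[start, end)` pair a native range delete would take.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const logPrefix = "raft/log/"

// logKey builds "raft/log/" followed by the 8-byte big-endian index, so the
// byte order of keys matches the numeric order of indexes.
func logKey(index uint64) []byte {
	k := make([]byte, len(logPrefix), len(logPrefix)+8)
	copy(k, logPrefix)
	return binary.BigEndian.AppendUint64(k, index)
}

// indexFromKey is the inverse of logKey and rejects keys outside the prefix.
func indexFromKey(k []byte) (uint64, error) {
	if len(k) != len(logPrefix)+8 || string(k[:len(logPrefix)]) != logPrefix {
		return 0, errors.New("not a raft log key")
	}
	return binary.BigEndian.Uint64(k[len(logPrefix):]), nil
}

// deleteBounds turns the inclusive [min, max] into a half-open [start, end)
// by adding one to the end key, carrying across bytes when a byte wraps.
func deleteBounds(min, max uint64) (start, end []byte, err error) {
	if min > max {
		return nil, nil, fmt.Errorf("invalid range: min %d > max %d", min, max)
	}
	end = logKey(max)
	for i := len(end) - 1; i >= 0; i-- {
		end[i]++
		if end[i] != 0 {
			break // no carry needed past this byte
		}
	}
	return logKey(min), end, nil
}

func main() {
	start, end, err := deleteBounds(3, 5)
	fmt.Printf("start=%x\nend=%x\nerr=%v\n", start, end, err)
}
```

Incrementing the key bytes rather than the integer keeps the bound valid even when `max` is the largest uint64: the carry runs into the prefix and still yields a key that sorts after every log entry.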

Internal helpers (logKey/indexFromKey/encodeLog/decodeLog) kept private
so the wire format can evolve without an API break.

hashicorp/raft.StableStore implementation backed by the shared Pebble
store. Keys live under `raft/stable/<key>`. Values:
- Set/Get: raw bytes
- SetUint64/GetUint64: 8-byte big-endian encoding

Missing-key contract follows the documented spec (raft/stable.go:16-17):
- Get → (nil, nil)
- GetUint64 → (0, nil)

raft consumes both forms permissively (api.go also accepts "not found"
error string), but honoring the documented contract is safer.
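
As an illustration of the missing-key contract and the 8-byte big-endian encoding, here is a small sketch in which a map replaces Pebble. The method set matches the StableStore interface; the type name, constructor, and the length check in `GetUint64` are assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const stablePrefix = "raft/stable/"

// stableStore is a toy stand-in: a map replaces the shared Pebble store.
type stableStore struct{ kv map[string][]byte }

func newStableStore() *stableStore { return &stableStore{kv: map[string][]byte{}} }

// Set stores a private copy of val under the prefixed key.
func (s *stableStore) Set(key, val []byte) error {
	s.kv[stablePrefix+string(key)] = append([]byte(nil), val...)
	return nil
}

// Get returns (nil, nil) for a missing key rather than an error.
func (s *stableStore) Get(key []byte) ([]byte, error) {
	return s.kv[stablePrefix+string(key)], nil
}

// SetUint64 encodes v as 8 big-endian bytes.
func (s *stableStore) SetUint64(key []byte, v uint64) error {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], v)
	return s.Set(key, buf[:])
}

// GetUint64 returns (0, nil) for a missing key and rejects malformed values.
func (s *stableStore) GetUint64(key []byte) (uint64, error) {
	b, _ := s.Get(key)
	if b == nil {
		return 0, nil
	}
	if len(b) != 8 {
		return 0, fmt.Errorf("stable value for %q has %d bytes, want 8", key, len(b))
	}
	return binary.BigEndian.Uint64(b), nil
}

func main() {
	s := newStableStore()
	term, err := s.GetUint64([]byte("CurrentTerm"))
	fmt.Println(term, err) // missing key: zero value, no error
	_ = s.SetUint64([]byte("CurrentTerm"), 7)
	term, err = s.GetUint64([]byte("CurrentTerm"))
	fmt.Println(term, err)
}
```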

Tests:
- Missing keys return zero values (no error)
- Set/Get and SetUint64/GetUint64 round-trip
- Overwrite preserves last write
- Prefix isolation: a raw Pebble write outside `raft/stable/` is invisible
  to the StableStore (sanity check that the prefix scoping is wired)

Revised the approach from the plan: instead of pebble.Checkpoint (which
captures the WHOLE LSM, including raft log + stable keys, and produces
sst-file-shaped snapshots that are large and opaque), the FSM uses
pebble.NewSnapshot for a consistent point-in-time view, then iterates
ONLY the codeq/ prefix and streams (key, value) pairs into the raft
SnapshotSink. Restore wipes codeq/ first (DeleteRange) then re-applies
in batches.

Why iter-based over checkpoint:
- Snapshot size = sum of live keys, not SST file footprint. Smaller.
- No need to filter out raft/log/, raft/stable/ during restore.
- Plain length-prefixed binary format (magic + version + tagged records)
  parseable with stdlib only.

SnapshotStore wrapping: hashicorp/raft already ships FileSnapshotStore,
which handles snapshot metadata + retention (last 3 kept, older purged).
newSnapshotStore is a thin factory.

Format:
  [4]  magic   "CDQS"
  [4]  version uint32 BE (= 1)
  loop:
    [1] tag        0x01=entry, 0x00=eof
    if entry:
      [4]   klen    uint32 BE
      [klen] key    raw
      [4]   vlen    uint32 BE
      [vlen] val    raw
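
The layout above is small enough to sketch with the standard library. The version below is a simplified illustration, not the package's code: `writeSnapshot` and `readSnapshot` reuse names from this commit, but their signatures are assumed, and `kv`, `appendField`, `readField`, `encode`, and `decode` are hypothetical helpers. Records are held in a slice here, whereas the real code streams from a Pebble iterator.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

const magic, version = "CDQS", uint32(1)
const tagEOF, tagEntry = byte(0x00), byte(0x01)

// kv is a hypothetical in-memory record.
type kv struct{ Key, Val []byte }

// appendField appends a uint32 big-endian length followed by the raw bytes.
func appendField(dst, b []byte) []byte {
	dst = binary.BigEndian.AppendUint32(dst, uint32(len(b)))
	return append(dst, b...)
}

// writeSnapshot emits header, one tagged record per entry, then the EOF tag.
func writeSnapshot(w io.Writer, entries []kv) error {
	hdr := binary.BigEndian.AppendUint32([]byte(magic), version)
	if _, err := w.Write(hdr); err != nil {
		return err
	}
	for _, e := range entries {
		rec := appendField(appendField([]byte{tagEntry}, e.Key), e.Val)
		if _, err := w.Write(rec); err != nil {
			return err
		}
	}
	_, err := w.Write([]byte{tagEOF})
	return err
}

func readField(r io.Reader) ([]byte, error) {
	var n uint32
	if err := binary.Read(r, binary.BigEndian, &n); err != nil {
		return nil, err
	}
	buf := make([]byte, n)
	_, err := io.ReadFull(r, buf)
	return buf, err
}

// readSnapshot validates the header, then reads records until the EOF tag.
func readSnapshot(r io.Reader) ([]kv, error) {
	var hdr [8]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, fmt.Errorf("truncated header: %w", err)
	}
	if string(hdr[:4]) != magic || binary.BigEndian.Uint32(hdr[4:]) != version {
		return nil, errors.New("bad magic or unsupported version")
	}
	var out []kv
	for {
		var tag [1]byte
		if _, err := io.ReadFull(r, tag[:]); err != nil {
			return nil, fmt.Errorf("truncated stream: %w", err)
		}
		if tag[0] == tagEOF {
			return out, nil
		}
		if tag[0] != tagEntry {
			return nil, fmt.Errorf("unknown tag 0x%02x", tag[0])
		}
		k, err := readField(r)
		if err != nil {
			return nil, err
		}
		v, err := readField(r)
		if err != nil {
			return nil, err
		}
		out = append(out, kv{Key: k, Val: v})
	}
}

func encode(entries []kv) ([]byte, error) {
	var buf bytes.Buffer
	err := writeSnapshot(&buf, entries)
	return buf.Bytes(), err
}

func decode(b []byte) ([]kv, error) { return readSnapshot(bytes.NewReader(b)) }

func main() {
	b, _ := encode([]kv{{Key: []byte("codeq/a"), Val: []byte("1")}})
	got, err := decode(b)
	fmt.Println(len(b), len(got), err)
}
```

The explicit EOF tag is what lets the reader tell a complete snapshot from a truncated one: a stream that ends without `0x00` is reported as an error instead of being accepted as a shorter snapshot.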

Tests (all passing):
- Round-trip: 3 codeq/ keys survive write+read; a sibling raft/log/
  key in the source is filtered out (FSM scope).
- Restore wipes existing codeq/ state (replace semantics).
- Empty source clears the destination.
- Bad magic / unsupported version / truncated stream all error.
- newSnapshotStore creates the directory and exposes an empty List().

fsm.go: Snapshot creates pebble.NewSnapshot synchronously (raft FSM
lock held), returns an fsmSnapshot whose Persist streams via
writeSnapshot. Restore reads the stream via readSnapshot.

FSM.Apply turns committed raft log entries into Pebble batch commits.
The wire format is pebble.Batch.Repr() — same bytes the leader builds
locally, shipped through raft, deserialized via SetRepr on every
replica. That's the canonical pebble replication path (the format is
versioned + stable; CockroachDB uses the same).

Implementation notes:
- log.Data is copied before SetRepr because Batch takes ownership of
  the slice and raft reuses log buffers on next dispatch.
- Non-LogCommand entries (LogNoop, LogBarrier, LogConfiguration) and
  empty Data are no-ops. raft's runFSM filters most of these but
  defensive checks cost nothing.
- Apply returns error-or-nil through the `any` return; the leader
  retrieves it via raft.ApplyFuture.Response().
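
The three notes above can be shown in a stripped-down form. Everything in this sketch is a stand-in: `Log`, `LogType`, and `FSM` are local types rather than the hashicorp/raft ones, and `recorder.commit` replaces the real SetRepr-plus-commit step with an arbitrary length check. The point is the shape of Apply: filter, copy, commit, and return the error through `any`.

```go
package main

import (
	"errors"
	"fmt"
)

// LogType and Log are local stand-ins for the raft library's types.
type LogType uint8

const (
	LogCommand LogType = iota
	LogNoop
)

type Log struct {
	Type LogType
	Data []byte
}

// FSM delegates the storage write to commit (hypothetical hook).
type FSM struct {
	commit func(repr []byte) error
}

// Apply ignores non-command and empty entries, copies Data so the storage
// layer never aliases a buffer the caller may reuse, and returns error-or-nil.
func (f *FSM) Apply(l *Log) any {
	if l.Type != LogCommand || len(l.Data) == 0 {
		return nil
	}
	repr := append([]byte(nil), l.Data...) // private copy
	if err := f.commit(repr); err != nil {
		return fmt.Errorf("SetRepr: %w", err)
	}
	return nil
}

// recorder keeps the slice it was handed and rejects very short input
// (an arbitrary rule that stands in for batch decoding errors).
type recorder struct{ kept []byte }

func (r *recorder) commit(repr []byte) error {
	if len(repr) < 4 {
		return errors.New("batch too short")
	}
	r.kept = repr
	return nil
}

func main() {
	rec := &recorder{}
	fsm := &FSM{commit: rec.commit}
	buf := []byte("batch-bytes")
	fmt.Println(fsm.Apply(&Log{Type: LogCommand, Data: buf}))
	buf[0] = 'X' // simulate the caller reusing its buffer
	fmt.Println(string(rec.kept))
}
```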

Tests (8 passing):
- Set writes the key
- Delete removes the key
- Mixed Set+Delete in one batch lands atomically (the codeq Claim
  pattern)
- LogNoop is a no-op even when Data is non-empty
- nil and empty Data are no-ops
- Corrupt Data returns an error with "SetRepr" in the message
- Two replicas applying the same Repr converge to identical state
  (the property raft replication needs)
- Apply → Snapshot → Restore on a sibling produces a replica that
  matches the original, wiring T4 + T5 end-to-end

Integrates LogStore (T2), StableStore (T3), SnapshotStore (T4), FSM
(T5), and an hraft.NewTCPTransport into a working raft node. The DB
wrapper exposes the same surface as internal/repository/pebble.DB so
the repository layer (T7) can swap one for the other without touching
the service tier.

Open flow:
- Opens pebble at <Path>/state/
- Creates the snapshot dir at <Path>/snapshots/ and wraps with
  raft.NewFileSnapshotStore (retain=3)
- Wires log + stable stores over the same pebble (different prefixes)
- Builds the FSM, opens an hraft.NewTCPTransport
- If Bootstrap=true and HasExistingState=false, calls
  raft.BootstrapCluster with the configured peers (defaulting to a
  self-only cluster when PeerAddrs is empty — handy for tests)
- Calls hraft.NewRaft and starts a goroutine relaying LeaderCh to a
  buffered channel for the reaper gate (T8)
- recoverSeq scans existing pending keys to seed the seq counter

Write path:
- IsLeader gate returns ErrNotLeader to the repository layer (no
  transparent forwarding — that's the cluster router's job, not the
  storage layer's)
- Set/Delete build a 1-op pebble.Batch and route through CommitBatch
- CommitBatch takes the batch's Repr, copies it, raft.Apply(repr,
  timeout). Blocks until the entry is committed and applied locally.
- FSM.Apply on every replica (including the leader) does the actual
  Pebble write via SetRepr.

Read path:
- Get/Has/Iter stay local. Reads on followers may be stale; the
  repository layer (and clients) handle that explicitly.

Test additions (10):
- SingleNodeBootstrapElects: bootstraps a 1-voter cluster, becomes
  leader within ~50ms.
- SetRoutesThroughRaftAndReadsLocal: replicated write visible via
  local Get.
- DeleteRoutesThroughRaft: same for deletes.
- CommitBatchAtomicMultiKey: multi-key batch lands as a unit.
- CommitBatch_EmptyIsNoop: empty batch returns nil without raft hit.
- NextSeqMonotonic: per-node atomic counter behaves.
- IsLeader_FollowerWritesReturnErrNotLeader: post-shutdown writes
  surface ErrNotLeader.
- ReopenPreservesState: durable across process restart
  (HasExistingState skips re-bootstrap).
- IterReturnsLiveRange: read range scan works as before.
- SeqRecoveryOnReopen: pending key with seq=42 recovered on reopen,
  NextSeq returns 43.

raft package now has 41 passing tests total.

Closes the loop: when cfg.Raft.Enabled, every write on the existing
internal/repository/pebble.DB flows through raft.Apply, lands in the
FSM on every replica, and commits to the local Pebble. The repository
code is unchanged — the integration sits on a clean delegation seam.

Design changes:

- internal/raft.DB grew two constructors:
    Open(ctx, cfg)                       opens its own pebble (standalone)
    OpenWithPebble(ctx, cfg, pdb)        wraps a caller-owned pebble
  Close() only closes pebble when Open() created it. OpenWithPebble's
  pebble lifecycle stays with the caller. Both share an internal
  openInternal() that does the raft wireup.

- Added raft.DB.Replicate(repr) public method — the entry point
  pebble.DB calls when it has a batch to ship. Same body as the old
  internal CommitBatch path, just exposed.

- internal/repository/pebble.DB:
    + Replicator interface (IsLeader, Replicate)
    + AttachReplicator(r) field installer
    + Set/Delete/CommitBatch route through repl.Replicate when attached
      (and return ErrNotLeader if !IsLeader); fall back to the existing
      coalescer/direct paths when not attached
  No change to the repos that use *DB — Leases, NextSeq, Iter, Get all
  stay where they are. *raft.DB satisfies pebble.Replicator
  (compile-time asserted inside the raft package).

- pkg/config.RaftConfig: Enabled, SelfID, BindAddr, Bootstrap, Peers,
  HeartbeatMS / ElectionMS / LeaderLeaseMS / CommitMS, ApplyTimeoutSeconds.

- Validation: raft.enabled is mutually exclusive with cluster.enabled
  and sharding.enabled; persistenceProvider must be pebble; SelfID
  must appear in Peers when Peers is non-empty.

- application_pebble.go: after opening the pebble shard(s), if
  cfg.Raft.Enabled it opens raft.OpenWithPebble sharing the local
  pebble and calls AttachReplicator on the wrapper. Shutdown closes
  raft BEFORE pebble (otherwise raft's apply pipeline writes into a
  closed DB and panics).

Tests:
- pkg/app TestNewApplication_RaftEnabled: full create→claim→submit→get
  REST cycle against a single-node raft-bootstrap cluster. Verifies
  the wireup, leader election, write replication, and local reads.
- pkg/app TestRaftConfig_MutualExclusion: validator rejects each
  invalid combination (raft+cluster, raft+sharding, raft+redis, raft
  without selfID).

Module-wide test sweep: every package passes except internal/bench
(unrelated 120s timeout on long-running perf harnesses). raft package
holds at 41 passing tests; pkg/app gains 2 more.

internal/raft/leader.go removed (placeholder LeaseTable type is gone;
the actual lease table stays on pebble.DB where it always belonged).

In raft mode every replica's pebble.DB sees the same state via FSM
apply, but only the leader can write. Without a gate the followers'
reapers would either:
  - issue writes that fail with ErrNotLeader (noisy logs), or
  - duplicate work the leader is already doing (lease expired locally
    looks the same on every node).

The fix is a tiny check at the top of reaper.loop: if LeaderGate is set
and returns false, skip the tick. The gate is wired from
application_pebble.go to raft.DB.IsLeader when raft is enabled, and
left nil for single-node / non-raft deployments (existing behavior
unchanged).

API surface:
- ReaperOptions.LeaderGate func() bool   — opt-in; nil = run every tick
- Reaper.leaderGate internal mirror, consulted in loop()
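
A minimal sketch of the gate follows, assuming a `tick` method that the real loop would call on each timer fire; `NewReaper`, `tick`, and the `sweep` callback are illustrative names, and the timer itself is left out so the behavior is easy to follow.

```go
package main

import "fmt"

// ReaperOptions carries the opt-in gate; nil means "run every tick".
type ReaperOptions struct {
	LeaderGate func() bool
}

// Reaper mirrors the option internally and owns the sweep work.
type Reaper struct {
	leaderGate func() bool
	sweep      func()
}

func NewReaper(opts ReaperOptions, sweep func()) *Reaper {
	return &Reaper{leaderGate: opts.LeaderGate, sweep: sweep}
}

// tick skips the sweep when a gate is set and reports false; it returns
// true when the sweep actually ran.
func (r *Reaper) tick() bool {
	if r.leaderGate != nil && !r.leaderGate() {
		return false
	}
	r.sweep()
	return true
}

func main() {
	leader := false
	sweeps := 0
	r := NewReaper(ReaperOptions{LeaderGate: func() bool { return leader }}, func() { sweeps++ })
	r.tick() // follower: skipped
	leader = true
	r.tick() // leader: sweep runs
	fmt.Println("sweeps:", sweeps)
}
```

Because the gate is evaluated on every tick rather than once at startup, a node that wins an election starts sweeping on its next tick without any extra signalling.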

Failover: when the leader dies, the new leader's reaper begins
sweeping on its next tick (≤ leaseInterval, default 2s). The brief
window where no reaper runs is recovered by the foreground repair
path at claim time — same as today.

Tests (3 new):
- SkipsTickWhenNotLeader: gate=false, 6+ ticks elapse, expired lease
  task stays IN_PROGRESS (no DLQ transition). Gate callback observed
  to fire ≥ 2 times.
- RunsTickWhenLeader: gate=true ⇒ DLQ transition happens normally.
- NilGate_RunsAlways: backward-compat for single-node deployments.

pkg/app wires reaperOpts.LeaderGate = raftNodes[0].IsLeader when
cfg.Raft.Enabled. The wireup uses raftNodes[0] because M1 has exactly
one shard ⇒ one raft node; M2 multi-raft will need a per-shard gate.

The M1 proof: three in-process codeq apps joined into a raft cluster,
writes go to the leader and replicate to followers, kill the leader,
survivors elect a new one, writes continue against the new leader,
all 30 tasks consistent across the two survivors. Runs in ~320ms.

What it covers:
- Multi-node bootstrap. Only node-1 has cfg.Raft.Bootstrap=true with
  the full Peers map; node-2 and node-3 start fresh and pick up the
  initial Configuration from the leader via raft.AppendEntries.
- Pre-grabbed loopback ports so the bootstrap configuration can
  reference final addresses; the brief release-to-reuse window is
  fine on a single-machine test.
- waitForLeader probes each node's REST surface — POST /tasks against
  a follower returns ErrNotLeader (translated by the controller),
  against the leader returns 202.
- After failover the test confirms the elected leader is one of the
  two survivors (not the just-killed leader) and that GET /tasks/:id
  returns the same body on both remaining nodes for all 30 task IDs.

Helpers (raftTestNode, startRaftNode, createTaskOnLeader, getTask)
encapsulate the setup so future tests in this package can stack on
the same primitives.

raft package + pebble package + pkg/app all still pass.

Apples-to-apples REST throughput comparison: single-node Pebble vs
3-node raft cluster, both driven through the same HTTP surface with
32 concurrent goroutines running create → claim → submit cycles for
5s after a 1s warmup.

Numbers on the dev box (12-core):
  single-node baseline: ~9.7k cycles/s
  raft 3-node:          ~7.1k cycles/s
  ratio:                ~0.73 (raft retains 70-75% of baseline)

That ratio is the cost of replication: 1 raft RTT to acknowledge the
commit + 2 follower appendEntries per write. Each cycle is 3 writes
(create, claim, submit-result), so the per-write cost is well-amortized
by the 32-way concurrency. The plan's "≥30k tasks/s" target was set
against the Phase 8 in-process bench (which hits ~83k); REST adds an
order of magnitude of overhead, so the meaningful metric here is the
ratio.

The test fails only if raft drops below 30% of baseline — below that
something is wrong (mutex storms, transport retries) and the bench
should refuse to report bogus numbers.
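
The arithmetic behind the ratio and the 30% floor is simple enough to write down. In this sketch `checkRatio` and `minRatio` are hypothetical names, and the inputs are the rounded figures quoted above rather than fresh measurements.

```go
package main

import "fmt"

// minRatio is the failure floor described above.
const minRatio = 0.30

// checkRatio returns raft throughput as a fraction of baseline and an error
// when the fraction falls below the floor.
func checkRatio(baseline, raft float64) (float64, error) {
	if baseline <= 0 {
		return 0, fmt.Errorf("baseline must be positive, got %v", baseline)
	}
	ratio := raft / baseline
	if ratio < minRatio {
		return ratio, fmt.Errorf("raft at %.0f%% of baseline, below the %.0f%% floor", ratio*100, minRatio*100)
	}
	return ratio, nil
}

func main() {
	ratio, err := checkRatio(9700, 7100) // cycles/s from the numbers above
	fmt.Printf("ratio=%.2f err=%v\n", ratio, err) // prints ratio=0.73 err=<nil>
}
```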

Skipped under -short.

@osvaldoandrade merged commit fa5c87f into main on May 18, 2026
2 checks passed
@osvaldoandrade deleted the feat/raft-m1 branch on May 18, 2026 12:54