feat(raft): M1 — replicate Pebble writes through hashicorp/raft (in-tree)#584
Merged
Conversation
First milestone of the raft replication work documented in
/home/ova/.claude/plans/woolly-stargazing-orbit.md. M1 builds an
in-tree internal/raft/ package with the same surface as
internal/repository/pebble.DB so the repository layer can swap one
for the other without touching the service tier.
This commit lays the skeleton — every file compiles, no behavior
change, no tests yet. Subsequent tasks fill in the implementations:
- db.go : DB wrapper with Open/Close, read pass-through, write
path stubs (writes still go direct to Pebble until T6)
- fsm.go : hashicorp/raft.FSM stub (T5 fills Apply/Snapshot/Restore)
- log_store.go : hashicorp/raft.LogStore stub (T2)
- stable_store.go : hashicorp/raft.StableStore stub (T3)
- snapshot.go : hashicorp/raft.SnapshotStore stub (T4)
- leader.go : placeholder LeaseTable type for the wrapper surface
Dependencies added: github.com/hashicorp/raft v1.7.3 + transitive
(go-immutable-radix, go-metrics, go-msgpack/v2, golang-lru, go-uuid).
hashicorp/raft.LogStore implementation backed by the same Pebble store the FSM uses. Log entries live under the `raft/log/<be8 index>` prefix; values are msgpack-encoded raft.Log structs (same wire shape as hashicorp/raft-boltdb so the encoding is unsurprising). API contracts verified by tests: - FirstIndex / LastIndex on empty store → 0 - StoreLog + GetLog round-trip preserves Index, Term, Type, Data - GetLog on missing index → hraft.ErrLogNotFound (raft uses the sentinel to decide between catch-up entries and snapshot install) - StoreLogs commits a whole batch atomically via pebble.Batch - StoreLogs(nil) is a no-op - DeleteRange [min, max] is inclusive on both ends; uses pebble's native range delete with a +1 carry on the end key - DeleteRange of prefix, suffix, and single-entry ranges all behave - DeleteRange with min > max returns an error Internal helpers (logKey/indexFromKey/encodeLog/decodeLog) kept private so the wire format can evolve without an API break.
hashicorp/raft.StableStore implementation backed by the shared Pebble store. Keys live under `raft/stable/<key>`. Values: - Set/Get: raw bytes - SetUint64/GetUint64: 8-byte big-endian encoding Missing-key contract follows the documented spec (raft/stable.go:16-17): - Get → (nil, nil) - GetUint64 → (0, nil) raft consumes both forms permissively (api.go also accepts "not found" error string), but honoring the documented contract is safer. Tests: - Missing keys return zero values (no error) - Set/Get and SetUint64/GetUint64 round-trip - Overwrite preserves last write - Prefix isolation: a raw Pebble write outside `raft/stable/` is invisible to the StableStore (sanity check that the prefix scoping is wired)
Revised the approach from the plan: instead of pebble.Checkpoint (which
captures the WHOLE LSM, including raft log + stable keys, and produces
sst-file-shaped snapshots that are large and opaque), the FSM uses
pebble.NewSnapshot for a consistent point-in-time view, then iterates
ONLY the codeq/ prefix and streams (key, value) pairs into the raft
SnapshotSink. Restore wipes codeq/ first (DeleteRange) then re-applies
in batches.
Why iter-based over checkpoint:
- Snapshot size = sum of live keys, not SST file footprint. Smaller.
- No need to filter out raft/log/, raft/stable/ during restore.
- Plain length-prefixed binary format (magic + version + tagged records)
parseable with stdlib only.
SnapshotStore wrapping: hashicorp/raft already ships FileSnapshotStore,
which handles snapshot metadata + retention (last 3 kept, older purged).
newSnapshotStore is a thin factory.
Format:
[4] magic "CDQS"
[4] version uint32 BE (= 1)
loop:
[1] tag 0x01=entry, 0x00=eof
if entry:
[4] klen uint32 BE
[klen] key raw
[4] vlen uint32 BE
[vlen] val raw
Tests (all passing):
- Round-trip: 3 codeq/ keys survive write+read; a sibling raft/log/
key in the source is filtered out (FSM scope).
- Restore wipes existing codeq/ state (replace semantics).
- Empty source clears the destination.
- Bad magic / unsupported version / truncated stream all error.
- newSnapshotStore creates the directory and exposes an empty List().
fsm.go: Snapshot creates pebble.NewSnapshot synchronously (raft FSM
lock held), returns an fsmSnapshot whose Persist streams via
writeSnapshot. Restore reads the stream via readSnapshot.
FSM.Apply turns committed raft log entries into Pebble batch commits. The wire format is pebble.Batch.Repr() — same bytes the leader builds locally, shipped through raft, deserialized via SetRepr on every replica. That's the canonical pebble replication path (the format is versioned + stable; CockroachDB uses the same). Implementation notes: - log.Data is copied before SetRepr because Batch takes ownership of the slice and raft reuses log buffers on next dispatch. - Non-LogCommand entries (LogNoop, LogBarrier, LogConfiguration) and empty Data are no-ops. raft's runFSM filters most of these but defensive checks cost nothing. - Apply returns error-or-nil through the `any` return; the leader retrieves it via raft.ApplyFuture.Response(). Tests (8 passing): - Set writes the key - Delete removes the key - Mixed Set+Delete in one batch lands atomically (the codeq Claim pattern) - LogNoop is a no-op even when Data is non-empty - nil and empty Data are no-ops - Corrupt Data returns an error with "SetRepr" in the message - Two replicas applying the same Repr converge to identical state (the property raft replication needs) - Apply → Snapshot → Restore on a sibling produces a replica that matches the original — wires T4 + T5 end-to-end
Integrates LogStore (T2), StableStore (T3), SnapshotStore (T4), FSM (T5), and an hraft.NewTCPTransport into a working raft node. The DB wrapper exposes the same surface as internal/repository/pebble.DB so the repository layer (T7) can swap one for the other without touching the service tier. Open flow: - Opens pebble at <Path>/state/ - Creates the snapshot dir at <Path>/snapshots/ and wraps with raft.NewFileSnapshotStore (retain=3) - Wires log + stable stores over the same pebble (different prefixes) - Builds the FSM, opens an hraft.NewTCPTransport - If Bootstrap=true and HasExistingState=false, calls raft.BootstrapCluster with the configured peers (defaulting to a self-only cluster when PeerAddrs is empty — handy for tests) - Calls hraft.NewRaft and starts a goroutine relaying LeaderCh to a buffered channel for the reaper gate (T8) - recoverSeq scans existing pending keys to seed the seq counter Write path: - IsLeader gate returns ErrNotLeader to the repository layer (no transparent forwarding — that's the cluster router's job, not the storage layer's) - Set/Delete build a 1-op pebble.Batch and route through CommitBatch - CommitBatch takes the batch's Repr, copies it, raft.Apply(repr, timeout). Blocks until the entry is committed and applied locally. - FSM.Apply on every replica (including the leader) does the actual Pebble write via SetRepr. Read path: - Get/Has/Iter stay local. Reads on followers may be stale; the repository layer (and clients) handle that explicitly. Test additions (10): - SingleNodeBootstrapElects: bootstraps a 1-voter cluster, becomes leader within ~50ms. - SetRoutesThroughRaftAndReadsLocal: replicated write visible via local Get. - DeleteRoutesThroughRaft: same for deletes. - CommitBatchAtomicMultiKey: multi-key batch lands as a unit. - CommitBatch_EmptyIsNoop: empty batch returns nil without raft hit. - NextSeqMonotonic: per-node atomic counter behaves. - IsLeader_FollowerWritesReturnErrNotLeader: post-shutdown writes surface ErrNotLeader. - ReopenPreservesState: durable across process restart (HasExisting- State skips re-bootstrap). - IterReturnsLiveRange: read range scan works as before. - SeqRecoveryOnReopen: pending key with seq=42 recovered on reopen, NextSeq returns 43. raft package now has 41 passing tests total.
Closes the loop: when cfg.Raft.Enabled, every write on the existing
internal/repository/pebble.DB flows through raft.Apply, lands in the
FSM on every replica, and commits to the local Pebble. The repository
code is unchanged — the integration sits on a clean delegation seam.
Design changes:
- internal/raft.DB grew two constructors:
Open(ctx, cfg) opens its own pebble (standalone)
OpenWithPebble(ctx, cfg, pdb) wraps a caller-owned pebble
Close() only closes pebble when Open() created it. OpenWithPebble's
pebble lifecycle stays with the caller. Both share an internal
openInternal() that does the raft wireup.
- Added raft.DB.Replicate(repr) public method — the entry point
pebble.DB calls when it has a batch to ship. Same body as the old
internal CommitBatch path, just exposed.
- internal/repository/pebble.DB:
+ Replicator interface (IsLeader, Replicate)
+ AttachReplicator(r) field installer
+ Set/Delete/CommitBatch route through repl.Replicate when attached
(and return ErrNotLeader if !IsLeader); fall back to the existing
coalescer/direct paths when not attached
No change to the repos that use *DB — Leases, NextSeq, Iter, Get all
stay where they are. *raft.DB satisfies pebble.Replicator (compile-
time asserted inside the raft package).
- pkg/config.RaftConfig: Enabled, SelfID, BindAddr, Bootstrap, Peers,
HeartbeatMS / ElectionMS / LeaderLeaseMS / CommitMS, ApplyTimeoutSeconds.
- Validation: raft.enabled is mutually exclusive with cluster.enabled
and sharding.enabled; persistenceProvider must be pebble; SelfID
must appear in Peers when Peers is non-empty.
- application_pebble.go: after opening the pebble shard(s), if
cfg.Raft.Enabled it opens raft.OpenWithPebble sharing the local
pebble and calls AttachReplicator on the wrapper. Shutdown closes
raft BEFORE pebble (otherwise raft's apply pipeline writes into a
closed DB and panics).
Tests:
- pkg/app TestNewApplication_RaftEnabled: full create→claim→submit→get
REST cycle against a single-node raft-bootstrap cluster. Verifies
the wireup, leader election, write replication, and local reads.
- pkg/app TestRaftConfig_MutualExclusion: validator rejects each
invalid combination (raft+cluster, raft+sharding, raft+redis, raft
without selfID).
Module-wide test sweep: every package passes except internal/bench
(unrelated 120s timeout on long-running perf harnesses). raft package
holds at 41 passing tests; pkg/app gains 2 more.
internal/raft/leader.go removed (placeholder LeaseTable type is gone;
the actual lease table stays on pebble.DB where it always belonged).
In raft mode every replica's pebble.DB sees the same state via FSM
apply, but only the leader can write. Without a gate the followers'
reapers would either:
- issue writes that fail with ErrNotLeader (noisy logs), or
- duplicate work the leader is already doing (lease expired locally
looks the same on every node).
The fix is a tiny check at the top of reaper.loop: if LeaderGate is set
and returns false, skip the tick. The gate is wired from
application_pebble.go to raft.DB.IsLeader when raft is enabled, and
left nil for single-node / non-raft deployments (existing behavior
unchanged).
API surface:
- ReaperOptions.LeaderGate func() bool — opt-in; nil = run every tick
- Reaper.leaderGate internal mirror, consulted in loop()
Failover: when the leader dies, the new leader's reaper begins
sweeping on its next tick (≤ leaseInterval, default 2s). The brief
window where no reaper runs is recovered by the foreground repair
path at claim time — same as today.
Tests (3 new):
- SkipsTickWhenNotLeader: gate=false, 6+ ticks elapse, expired lease
task stays IN_PROGRESS (no DLQ transition). Gate callback observed
to fire ≥ 2 times.
- RunsTickWhenLeader: gate=true ⇒ DLQ transition happens normally.
- NilGate_RunsAlways: backward-compat for single-node deployments.
pkg/app wires reaperOpts.LeaderGate = raftNodes[0].IsLeader when
cfg.Raft.Enabled. The wireup uses raftNodes[0] because M1 has exactly
one shard ⇒ one raft node; M2 multi-raft will need a per-shard gate.
The M1 proof: three in-process codeq apps joined into a raft cluster, writes go to the leader and replicate to followers, kill the leader, survivors elect a new one, writes continue against the new leader, all 30 tasks consistent across the two survivors. Runs in ~320ms. What it covers: - Multi-node bootstrap. Only node-1 has cfg.Raft.Bootstrap=true with the full Peers map; node-2 and node-3 start fresh and pick up the initial Configuration from the leader via raft.AppendEntries. - Pre-grabbed loopback ports so the bootstrap configuration can reference final addresses; the brief release-to-reuse window is fine on a single-machine test. - waitForLeader probes each node's REST surface — POST /tasks against a follower returns ErrNotLeader (translated by the controller), against the leader returns 202. - After failover the test confirms the elected leader is one of the two survivors (not the just-killed leader) and that GET /tasks/:id returns the same body on both remaining nodes for all 30 task IDs. Helpers (raftTestNode, startRaftNode, createTaskOnLeader, getTask) encapsulate the setup so future tests in this package can stack on the same primitives. raft package + pebble package + pkg/app all still pass.
Apples-to-apples REST throughput comparison: single-node Pebble vs 3-node raft cluster, both driven through the same HTTP surface with 32 concurrent goroutines running create → claim → submit cycles for 5s after a 1s warmup. Numbers on the dev box (12-core): single-node baseline: ~9.7k cycles/s raft 3-node: ~7.1k cycles/s ratio: ~0.73 (raft retains 70-75% of baseline) That ratio is the cost of replication: 1 raft RTT to acknowledge the commit + 2 follower appendEntries per write. Each cycle is 3 writes (create, claim, submit-result), so the per-write cost is well-amortized by the 32-way concurrency. The plan's "≥30k tasks/s" target was set against the Phase 8 in-process bench (which hits ~83k); REST adds an order of magnitude of overhead, so the meaningful metric here is the ratio. The test fails only if raft drops below 30% of baseline — below that something is wrong (mutex storms, transport retries) and the bench should refuse to report bogus numbers. Skipped under -short.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Milestone 1 of the raft replication work documented in the plan. Adds an in-tree `internal/raft/` package that wires hashicorp/raft on top of the existing Pebble store. When `cfg.Raft.Enabled` is true, every `Set`/`Delete`/`CommitBatch` on the existing pebble repository layer flows through `raft.Apply` and lands on every replica's Pebble via the FSM. Reads stay local. The rest of codeq is unaware that replication is happening.
Decisions locked in plan (asked + approved):
What landed (M1.T1 through M1.T10)
Numbers
Architectural notes
Mutual exclusion
`cfg.Validate` rejects:
What's NOT in this PR
Test plan
🤖 Generated with Claude Code