feat(raft): M1 — replicate Pebble writes through hashicorp/raft (in-tree) by osvaldoandrade · Pull Request #584 · osvaldoandrade/codeq

osvaldoandrade · 2026-05-18T12:53:37Z

Summary

Milestone 1 of the raft replication work documented in the plan. Adds an in-tree `internal/raft/` package that wires hashicorp/raft on top of the existing Pebble store. When `cfg.Raft.Enabled` is true, every `Set`/`Delete`/`CommitBatch` on the existing pebble repository layer flows through `raft.Apply` and lands on every replica's Pebble via the FSM. Reads stay local. The rest of codeq is unaware that replication is happening.

Decisions locked in plan (asked + approved):

Build in-tree first, no extracted `raft-pebble` library
Multi-raft per shard as eventual goal (M2); single-raft per node for M1
Library: `github.com/hashicorp/raft`
Mutually exclusive with the legacy `cluster` mode and with `numShards > 1`

What landed (M1.T1 through M1.T10)

#	Title	Tests
T1	Skeleton package + hashicorp/raft dep	—
T2	LogStore on Pebble (`raft/log/` prefix, msgpack values)	9
T3	StableStore on Pebble (`raft/stable/` prefix, BE uint64)	6
T4	Snapshot via `pebble.NewSnapshot` (iter-based stream, not Checkpoint) + FSM Persist/Restore	7
T5	FSM.Apply over `pebble.Batch.Repr` (deterministic replication on every node)	8
T6	`raft.DB` wrapper + `NewRaft` wireup + TCP transport + leader signal	11
T7	`config.RaftConfig` + validation + `application_pebble.go` branch (`AttachReplicator`)	2 E2E
T8	Reaper `LeaderGate` — followers stay passive	3
T9	3-node cluster failover end-to-end (REST in-process)	1 (320 ms run)
T10	Bench: raft vs single-node baseline	1

Numbers

47+ tests added (internal/raft + repository/pebble + pkg/app)
3-node failover: leader killed at t=0, new leader elected and writes continue in ~320 ms total
Bench (12-core dev box, loopback): single-node REST baseline ~9.7k cycles/s; raft 3-node ~7.1k cycles/s; raft retains ~73% of baseline throughput. Each "cycle" = create + claim + submit, so it's three writes per cycle through the raft.Apply pipeline.
Build clean, vet clean, `go test ./...` passes everywhere except the existing `internal/bench` long-running perf timeout (pre-existing, unrelated)

Architectural notes

Delegation seam: `internal/repository/pebble.DB` got a new `Replicator` interface and `AttachReplicator()` method. When attached, the existing coalescer is bypassed and writes route through `Replicator.Replicate(repr)`. The repository code (`task_repository.go`, `reaper.go`, etc.) is unchanged.
Shared Pebble: `raft.OpenWithPebble(ctx, cfg, pdb)` lets the application layer open Pebble once and share it between the repository wrapper and the raft FSM (different key prefixes: `codeq/` for user state, `raft/log/` and `raft/stable/` for raft internals).
Lifecycle: raft.Close runs before pebble.Close — otherwise the FSM apply pipeline would write to a closed DB and panic.
Reads stay local — follower reads may be stale; the repository code (and clients) handle that explicitly. M2 can add a linearizable read path if needed.

Mutual exclusion

`cfg.Validate` rejects:

`raft.enabled + cluster.enabled` (different replication models)
`raft.enabled + sharding.enabled` (legacy multi-backend sharding)
`raft.enabled + persistenceProvider != pebble`
`raft.enabled` without `selfId` / `bindAddr` / `selfId` not in `peers`

What's NOT in this PR

No extraction: the package lives at `internal/raft/`; if a future PeterDB needs a standalone `raft-pebble` lib, the boundary is clean enough to extract then.
Single-raft per node: M2 (next milestone) adds multi-raft per shard. Validation currently rejects `numShards > 1` together with `raft.enabled`.
No linearizable reads: reads pass through to local Pebble. Hitting the leader from a stale follower is allowed.
No gRPC multiplexed transport: M1 uses hashicorp/raft's stock TCP transport. M2 swaps to a multiplexed gRPC transport for multi-raft.

Test plan

`go build ./...` clean
`go vet ./...` clean
`go test ./internal/raft ./internal/repository/pebble ./pkg/app -count=1` — all pass
3-node failover (`TestRaft3NodeFailover`) passes deterministically (~320 ms)
Bench (`TestRaftBench_VsSingleNode`) passes with raft ≥ 30% of baseline
Manual: stand up a 3-node deployment via `codeq install --target docker` (out of scope — Compose templates aren't raft-aware yet; follow-up task)

🤖 Generated with Claude Code

First milestone of the raft replication work documented in /home/ova/.claude/plans/woolly-stargazing-orbit.md. M1 builds an in-tree internal/raft/ package with the same surface as internal/repository/pebble.DB so the repository layer can swap one for the other without touching the service tier. This commit lays the skeleton — every file compiles, no behavior change, no tests yet. Subsequent tasks fill in the implementations: - db.go : DB wrapper with Open/Close, read pass-through, write path stubs (writes still go direct to Pebble until T6) - fsm.go : hashicorp/raft.FSM stub (T5 fills Apply/Snapshot/Restore) - log_store.go : hashicorp/raft.LogStore stub (T2) - stable_store.go : hashicorp/raft.StableStore stub (T3) - snapshot.go : hashicorp/raft.SnapshotStore stub (T4) - leader.go : placeholder LeaseTable type for the wrapper surface Dependencies added: github.com/hashicorp/raft v1.7.3 + transitive (go-immutable-radix, go-metrics, go-msgpack/v2, golang-lru, go-uuid).

hashicorp/raft.LogStore implementation backed by the same Pebble store the FSM uses. Log entries live under the `raft/log/<be8 index>` prefix; values are msgpack-encoded raft.Log structs (same wire shape as hashicorp/raft-boltdb so the encoding is unsurprising). API contracts verified by tests: - FirstIndex / LastIndex on empty store → 0 - StoreLog + GetLog round-trip preserves Index, Term, Type, Data - GetLog on missing index → hraft.ErrLogNotFound (raft uses the sentinel to decide between catch-up entries and snapshot install) - StoreLogs commits a whole batch atomically via pebble.Batch - StoreLogs(nil) is a no-op - DeleteRange [min, max] is inclusive on both ends; uses pebble's native range delete with a +1 carry on the end key - DeleteRange of prefix, suffix, and single-entry ranges all behave - DeleteRange with min > max returns an error Internal helpers (logKey/indexFromKey/encodeLog/decodeLog) kept private so the wire format can evolve without an API break.

hashicorp/raft.StableStore implementation backed by the shared Pebble store. Keys live under `raft/stable/<key>`. Values: - Set/Get: raw bytes - SetUint64/GetUint64: 8-byte big-endian encoding Missing-key contract follows the documented spec (raft/stable.go:16-17): - Get → (nil, nil) - GetUint64 → (0, nil) raft consumes both forms permissively (api.go also accepts "not found" error string), but honoring the documented contract is safer. Tests: - Missing keys return zero values (no error) - Set/Get and SetUint64/GetUint64 round-trip - Overwrite preserves last write - Prefix isolation: a raw Pebble write outside `raft/stable/` is invisible to the StableStore (sanity check that the prefix scoping is wired)

Revised the approach from the plan: instead of pebble.Checkpoint (which captures the WHOLE LSM, including raft log + stable keys, and produces sst-file-shaped snapshots that are large and opaque), the FSM uses pebble.NewSnapshot for a consistent point-in-time view, then iterates ONLY the codeq/ prefix and streams (key, value) pairs into the raft SnapshotSink. Restore wipes codeq/ first (DeleteRange) then re-applies in batches. Why iter-based over checkpoint: - Snapshot size = sum of live keys, not SST file footprint. Smaller. - No need to filter out raft/log/, raft/stable/ during restore. - Plain length-prefixed binary format (magic + version + tagged records) parseable with stdlib only. SnapshotStore wrapping: hashicorp/raft already ships FileSnapshotStore, which handles snapshot metadata + retention (last 3 kept, older purged). newSnapshotStore is a thin factory. Format: [4] magic "CDQS" [4] version uint32 BE (= 1) loop: [1] tag 0x01=entry, 0x00=eof if entry: [4] klen uint32 BE [klen] key raw [4] vlen uint32 BE [vlen] val raw Tests (all passing): - Round-trip: 3 codeq/ keys survive write+read; a sibling raft/log/ key in the source is filtered out (FSM scope). - Restore wipes existing codeq/ state (replace semantics). - Empty source clears the destination. - Bad magic / unsupported version / truncated stream all error. - newSnapshotStore creates the directory and exposes an empty List(). fsm.go: Snapshot creates pebble.NewSnapshot synchronously (raft FSM lock held), returns an fsmSnapshot whose Persist streams via writeSnapshot. Restore reads the stream via readSnapshot.

FSM.Apply turns committed raft log entries into Pebble batch commits. The wire format is pebble.Batch.Repr() — same bytes the leader builds locally, shipped through raft, deserialized via SetRepr on every replica. That's the canonical pebble replication path (the format is versioned + stable; CockroachDB uses the same). Implementation notes: - log.Data is copied before SetRepr because Batch takes ownership of the slice and raft reuses log buffers on next dispatch. - Non-LogCommand entries (LogNoop, LogBarrier, LogConfiguration) and empty Data are no-ops. raft's runFSM filters most of these but defensive checks cost nothing. - Apply returns error-or-nil through the `any` return; the leader retrieves it via raft.ApplyFuture.Response(). Tests (8 passing): - Set writes the key - Delete removes the key - Mixed Set+Delete in one batch lands atomically (the codeq Claim pattern) - LogNoop is a no-op even when Data is non-empty - nil and empty Data are no-ops - Corrupt Data returns an error with "SetRepr" in the message - Two replicas applying the same Repr converge to identical state (the property raft replication needs) - Apply → Snapshot → Restore on a sibling produces a replica that matches the original — wires T4 + T5 end-to-end

Integrates LogStore (T2), StableStore (T3), SnapshotStore (T4), FSM (T5), and an hraft.NewTCPTransport into a working raft node. The DB wrapper exposes the same surface as internal/repository/pebble.DB so the repository layer (T7) can swap one for the other without touching the service tier. Open flow: - Opens pebble at <Path>/state/ - Creates the snapshot dir at <Path>/snapshots/ and wraps with raft.NewFileSnapshotStore (retain=3) - Wires log + stable stores over the same pebble (different prefixes) - Builds the FSM, opens an hraft.NewTCPTransport - If Bootstrap=true and HasExistingState=false, calls raft.BootstrapCluster with the configured peers (defaulting to a self-only cluster when PeerAddrs is empty — handy for tests) - Calls hraft.NewRaft and starts a goroutine relaying LeaderCh to a buffered channel for the reaper gate (T8) - recoverSeq scans existing pending keys to seed the seq counter Write path: - IsLeader gate returns ErrNotLeader to the repository layer (no transparent forwarding — that's the cluster router's job, not the storage layer's) - Set/Delete build a 1-op pebble.Batch and route through CommitBatch - CommitBatch takes the batch's Repr, copies it, raft.Apply(repr, timeout). Blocks until the entry is committed and applied locally. - FSM.Apply on every replica (including the leader) does the actual Pebble write via SetRepr. Read path: - Get/Has/Iter stay local. Reads on followers may be stale; the repository layer (and clients) handle that explicitly. Test additions (10): - SingleNodeBootstrapElects: bootstraps a 1-voter cluster, becomes leader within ~50ms. - SetRoutesThroughRaftAndReadsLocal: replicated write visible via local Get. - DeleteRoutesThroughRaft: same for deletes. - CommitBatchAtomicMultiKey: multi-key batch lands as a unit. - CommitBatch_EmptyIsNoop: empty batch returns nil without raft hit. - NextSeqMonotonic: per-node atomic counter behaves. - IsLeader_FollowerWritesReturnErrNotLeader: post-shutdown writes surface ErrNotLeader. - ReopenPreservesState: durable across process restart (HasExisting- State skips re-bootstrap). - IterReturnsLiveRange: read range scan works as before. - SeqRecoveryOnReopen: pending key with seq=42 recovered on reopen, NextSeq returns 43. raft package now has 41 passing tests total.

Closes the loop: when cfg.Raft.Enabled, every write on the existing internal/repository/pebble.DB flows through raft.Apply, lands in the FSM on every replica, and commits to the local Pebble. The repository code is unchanged — the integration sits on a clean delegation seam. Design changes: - internal/raft.DB grew two constructors: Open(ctx, cfg) opens its own pebble (standalone) OpenWithPebble(ctx, cfg, pdb) wraps a caller-owned pebble Close() only closes pebble when Open() created it. OpenWithPebble's pebble lifecycle stays with the caller. Both share an internal openInternal() that does the raft wireup. - Added raft.DB.Replicate(repr) public method — the entry point pebble.DB calls when it has a batch to ship. Same body as the old internal CommitBatch path, just exposed. - internal/repository/pebble.DB: + Replicator interface (IsLeader, Replicate) + AttachReplicator(r) field installer + Set/Delete/CommitBatch route through repl.Replicate when attached (and return ErrNotLeader if !IsLeader); fall back to the existing coalescer/direct paths when not attached No change to the repos that use *DB — Leases, NextSeq, Iter, Get all stay where they are. *raft.DB satisfies pebble.Replicator (compile- time asserted inside the raft package). - pkg/config.RaftConfig: Enabled, SelfID, BindAddr, Bootstrap, Peers, HeartbeatMS / ElectionMS / LeaderLeaseMS / CommitMS, ApplyTimeoutSeconds. - Validation: raft.enabled is mutually exclusive with cluster.enabled and sharding.enabled; persistenceProvider must be pebble; SelfID must appear in Peers when Peers is non-empty. - application_pebble.go: after opening the pebble shard(s), if cfg.Raft.Enabled it opens raft.OpenWithPebble sharing the local pebble and calls AttachReplicator on the wrapper. Shutdown closes raft BEFORE pebble (otherwise raft's apply pipeline writes into a closed DB and panics). Tests: - pkg/app TestNewApplication_RaftEnabled: full create→claim→submit→get REST cycle against a single-node raft-bootstrap cluster. Verifies the wireup, leader election, write replication, and local reads. - pkg/app TestRaftConfig_MutualExclusion: validator rejects each invalid combination (raft+cluster, raft+sharding, raft+redis, raft without selfID). Module-wide test sweep: every package passes except internal/bench (unrelated 120s timeout on long-running perf harnesses). raft package holds at 41 passing tests; pkg/app gains 2 more. internal/raft/leader.go removed (placeholder LeaseTable type is gone; the actual lease table stays on pebble.DB where it always belonged).

In raft mode every replica's pebble.DB sees the same state via FSM apply, but only the leader can write. Without a gate the followers' reapers would either: - issue writes that fail with ErrNotLeader (noisy logs), or - duplicate work the leader is already doing (lease expired locally looks the same on every node). The fix is a tiny check at the top of reaper.loop: if LeaderGate is set and returns false, skip the tick. The gate is wired from application_pebble.go to raft.DB.IsLeader when raft is enabled, and left nil for single-node / non-raft deployments (existing behavior unchanged). API surface: - ReaperOptions.LeaderGate func() bool — opt-in; nil = run every tick - Reaper.leaderGate internal mirror, consulted in loop() Failover: when the leader dies, the new leader's reaper begins sweeping on its next tick (≤ leaseInterval, default 2s). The brief window where no reaper runs is recovered by the foreground repair path at claim time — same as today. Tests (3 new): - SkipsTickWhenNotLeader: gate=false, 6+ ticks elapse, expired lease task stays IN_PROGRESS (no DLQ transition). Gate callback observed to fire ≥ 2 times. - RunsTickWhenLeader: gate=true ⇒ DLQ transition happens normally. - NilGate_RunsAlways: backward-compat for single-node deployments. pkg/app wires reaperOpts.LeaderGate = raftNodes[0].IsLeader when cfg.Raft.Enabled. The wireup uses raftNodes[0] because M1 has exactly one shard ⇒ one raft node; M2 multi-raft will need a per-shard gate.

The M1 proof: three in-process codeq apps joined into a raft cluster, writes go to the leader and replicate to followers, kill the leader, survivors elect a new one, writes continue against the new leader, all 30 tasks consistent across the two survivors. Runs in ~320ms. What it covers: - Multi-node bootstrap. Only node-1 has cfg.Raft.Bootstrap=true with the full Peers map; node-2 and node-3 start fresh and pick up the initial Configuration from the leader via raft.AppendEntries. - Pre-grabbed loopback ports so the bootstrap configuration can reference final addresses; the brief release-to-reuse window is fine on a single-machine test. - waitForLeader probes each node's REST surface — POST /tasks against a follower returns ErrNotLeader (translated by the controller), against the leader returns 202. - After failover the test confirms the elected leader is one of the two survivors (not the just-killed leader) and that GET /tasks/:id returns the same body on both remaining nodes for all 30 task IDs. Helpers (raftTestNode, startRaftNode, createTaskOnLeader, getTask) encapsulate the setup so future tests in this package can stack on the same primitives. raft package + pebble package + pkg/app all still pass.

Apples-to-apples REST throughput comparison: single-node Pebble vs 3-node raft cluster, both driven through the same HTTP surface with 32 concurrent goroutines running create → claim → submit cycles for 5s after a 1s warmup. Numbers on the dev box (12-core): single-node baseline: ~9.7k cycles/s raft 3-node: ~7.1k cycles/s ratio: ~0.73 (raft retains 70-75% of baseline) That ratio is the cost of replication: 1 raft RTT to acknowledge the commit + 2 follower appendEntries per write. Each cycle is 3 writes (create, claim, submit-result), so the per-write cost is well-amortized by the 32-way concurrency. The plan's "≥30k tasks/s" target was set against the Phase 8 in-process bench (which hits ~83k); REST adds an order of magnitude of overhead, so the meaningful metric here is the ratio. The test fails only if raft drops below 30% of baseline — below that something is wrong (mutex storms, transport retries) and the bench should refuse to report bogus numbers. Skipped under -short.

osvaldoandrade added 10 commits May 18, 2026 04:04

osvaldoandrade merged commit fa5c87f into main May 18, 2026
2 checks passed

osvaldoandrade deleted the feat/raft-m1 branch May 18, 2026 12:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(raft): M1 — replicate Pebble writes through hashicorp/raft (in-tree)#584

feat(raft): M1 — replicate Pebble writes through hashicorp/raft (in-tree)#584
osvaldoandrade merged 10 commits into
mainfrom
feat/raft-m1

osvaldoandrade commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

osvaldoandrade commented May 18, 2026

Summary

What landed (M1.T1 through M1.T10)

Numbers

Architectural notes

Mutual exclusion

What's NOT in this PR

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant