Skip to content

persist blocks and FullCommitQCs in data layer via WAL#3126

Open
wen-coding wants to merge 3 commits intomainfrom
wen/persist_data
Open

persist blocks and FullCommitQCs in data layer via WAL#3126
wen-coding wants to merge 3 commits intomainfrom
wen/persist_data

Conversation

@wen-coding
Copy link
Copy Markdown
Contributor

@wen-coding wen-coding commented Mar 27, 2026

Summary

  • Add GlobalBlockPersister and FullCommitQCPersister backed by indexedWAL for data-layer crash recovery
  • Group both into a DataWAL parameter struct consumed by data.NewState
  • NewState recovery verifies all loaded data via insertQC/insertBlock (signatures + hash matching), treating WAL data as untrusted. Blocks outside QC range are ignored. Cursors fast-forward after pruning
  • DataWAL.reconcile() fixes WAL cursor inconsistencies (crash between parallel truncations) at construction time. NewState just asserts consistency
  • Migrate A/B file persistence from hand-rolled CRC32-C format to protobuf PersistedWrapper
  • Add TruncateWhile to indexedWAL for predicate-based truncation (single-pass scan+truncate)
  • Make indexedWAL.TruncateBefore handle walIdx >= nextIdx via TruncateAll (removes special-case branches)
  • Remove unused BlockStore interface (superseded by DataWAL)
  • Rename GlobalCommitQCPersisterFullCommitQCPersister; on-disk dir to fullcommitqcs
  • Rename LoadNextNext() on both persisters
  • Move loaded WAL data into persisters via ConsumeLoaded() (thread-safe, clearer ownership)
  • Async persistence: PushQC/PushBlock are pure in-memory operations — errors only reflect verification failures. Background runPersist goroutine writes QCs (eagerly up to nextQC) and blocks (up to nextBlock) in parallel via scope.Parallel. Persistence errors propagate vertically via Run()
  • nextBlockToPersist cursor advances to min(persistedQC, persistedBlock) after both are durable. PushAppHash (now takes ctx) waits on this cursor, ensuring AppVotes are only issued for persisted data
  • Parallel WAL truncation in DataWAL.TruncateBefore
  • Pruning keeps at least one entry to prevent empty WALs on restart (inner.first always recoverable). TODO: proper fix would not prune until AppQC exists

Test plan

  • globalblocks_test.go: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, continue after reload
  • fullcommitqcs_test.go: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, mid-range truncation, continue after reload
  • wal_test.go: TruncateWhile — empty, none match, partial, all, reopen after
  • state_test.go:
    • TestStateRecoveryFromWAL — full recovery; third restart verifies WALs not wiped
    • TestStateRecoveryBlocksOnly — QCs WAL lost, blocks re-pushed with QC
    • TestStateRecoveryQCsOnly — blocks WAL lost, cursor sync via reconcile
    • TestStateRecoveryAfterPruning — both WALs truncated, only tail survives
    • TestStateRecoverySkipsStaleBlocks — blocks before first QC range ignored
    • TestStateRecoveryBlocksBehindQCs — QCs ahead of blocks, gap re-fetched
    • TestStateRecoveryIgnoresBlocksBeyondQC — blocks beyond QC range ignored
    • TestPruningKeepsLastEntry — pruning never empties state; restart recovers
    • TestExecution — async persistence + PushAppHash wait semantics
  • All existing tests pass (data, avail, consensus, p2p/giga)
  • gofmt clean

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 27, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedApr 2, 2026, 5:44 PM

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 76.24309% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.60%. Comparing base (3d64c1b) to head (8542d48).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
sei-tendermint/internal/autobahn/data/state.go 76.43% 21 Missing and 16 partials ⚠️
...nternal/autobahn/consensus/persist/globalblocks.go 76.76% 16 Missing and 7 partials ⚠️
...ternal/autobahn/consensus/persist/fullcommitqcs.go 78.75% 12 Missing and 5 partials ⚠️
...dermint/internal/autobahn/consensus/persist/wal.go 68.00% 4 Missing and 4 partials ⚠️
sei-tendermint/internal/autobahn/data/testonly.go 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3126      +/-   ##
==========================================
+ Coverage   58.55%   58.60%   +0.05%     
==========================================
  Files        2099     2101       +2     
  Lines      173847   174176     +329     
==========================================
+ Hits       101803   102084     +281     
- Misses      62873    62894      +21     
- Partials     9171     9198      +27     
Flag Coverage Δ
sei-chain-pr 75.02% <76.24%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-tendermint/internal/autobahn/data/testonly.go 61.32% <0.00%> (ø)
...dermint/internal/autobahn/consensus/persist/wal.go 67.08% <68.00%> (+3.15%) ⬆️
...ternal/autobahn/consensus/persist/fullcommitqcs.go 78.75% <78.75%> (ø)
...nternal/autobahn/consensus/persist/globalblocks.go 76.76% <76.76%> (ø)
sei-tendermint/internal/autobahn/data/state.go 78.11% <76.43%> (+11.61%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@wen-coding wen-coding changed the title persist FullCommitQCs in data layer via WAL persist blocks and FullCommitQCs in data layer via WAL Mar 27, 2026
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from 73df12b to 2613700 Compare March 31, 2026 00:16
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from e24ad7d to 9c79a0c Compare March 31, 2026 21:34
@wen-coding wen-coding requested a review from pompon0 March 31, 2026 21:36
return nil
}
if keepIdx >= w.nextIdx {
return w.TruncateAll()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, isn't w.wal.TruncateBefore(w.nextIdx) equivalent to w.wal.TruncateAll()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main difference is that when we are removing all entries, we don't need to verify every removed entry any more. We verified in TruncateBefore because we were afraid somehow we messed up with entry boundary and thought we removed one entry but actually screwed up the whole file (that said, how we can recover from this case is a big problem.)

If you think we can still verify, then we can do TruncateBefore.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this distinction rather artificial in terms of requirements, but that's not a strong opinion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, changed

}
// Persist QCs and blocks in parallel.
if err := scope.Parallel(func(ps scope.ParallelScope) error {
ps.Spawn(func() error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that it may happen that blocks will be persisted without their corresponding qcs. You should resolve it when loading.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we now skip blocks without QC

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaict NewGlobalBlockPersister still does s.next = loaded[len(loaded)-1].Number + 1 even if the last block does not have the corresponding QC and reconciliation only fixes the prefix mismatch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, got it, I see that the extra blocks are ignored on load, and then the writes of those blocks are ignored afterwards. I suppose that's fine, while again it is an asymmetry that truncation consistency is maintained by WAL, while suffix consistency is not.

@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from ec822a7 to 6801499 Compare April 1, 2026 18:41
@wen-coding wen-coding requested a review from pompon0 April 1, 2026 22:56
}

// globalBlockState is the mutable state protected by GlobalBlockPersister's mutex.
type globalBlockState struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

globalBlockState does not have committee, despite it is needed in the same sense as in case of fullcommitqcsState. Is this asymmetry intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally added it, then figured we can save a few bytes by passing firstBlock in directly and not saving committee.

But I agree symmetry is a good reason, adding it back.


// zeroCommittee is a committee with FirstBlock()==0, used by persister tests
// that persist blocks starting at global block number 0.
var zeroCommittee = func() *types.Committee {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these tests testing any properties that are specific to chain starting at 0? If not, can we make the tests use a randomized starting point instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I don't believe we depend on any tests starting at 0, changed.

- Add GlobalBlockPersister and FullCommitQCPersister backed by indexedWAL
- Group into DataWAL struct; NewState takes DataWAL for crash recovery
- NewState verifies loaded data via insertQC/insertBlock (signatures + hashes)
- DataWAL.reconcile() fixes WAL cursor inconsistencies at construction
- Add TruncateWhile to indexedWAL; TruncateBefore handles walIdx >= nextIdx
- Async persistence via runPersist with parallel QC/block writes
- PushQC/PushBlock are pure in-memory; errors only from verification
- nextBlockToPersist cursor gates PushAppHash for AppVote safety
- Pruning keeps at least one entry to prevent empty WALs on restart

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
}

// insertQC verifies and inserts a FullCommitQC into the inner state.
// On the first QC, it sets inner.first and related cursors (handles
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cursors are already set in constructor, no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, now I see what you mean. In case inner is empty, it will assume the new starting point instead of assuming committee.FirstBlock(). nvm then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: imo an explicit prune() call for shifting starting point would be more readable. Up to you

// insertBlock verifies a block and inserts it into the inner state.
// Requires a QC to already be present for block n.
func (i *inner) insertBlock(committee *types.Committee, n types.GlobalBlockNumber, block *types.Block) error {
if err := block.Verify(committee); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this verification (relatively expensive) can be moved after the early exit checks.

if err := block.Verify(committee); err != nil {
return fmt.Errorf("block.Verify(): %w", err)
}
if n < i.first || n >= i.nextQC {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a weird corner case with this implementation: in case of, let's say archival node (or a node with a very long retention), pruning does not remove the QC ranges of blocks atomically. Therefore we will end up with the first QC missing a prefix of blocks on restart. After restart nextBlock will be set to first - data.State will think that there are some very old blocks missing and will try to fetch them. That is not a desirable behavior, since old blocks are stored for the convenience of peers - our node should not be blocked on availability of old blocks that it already has executed.


// insertBlock verifies a block and inserts it into the inner state.
// Requires a QC to already be present for block n.
func (i *inner) insertBlock(committee *types.Committee, n types.GlobalBlockNumber, block *types.Block) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we deduplicate logic of insertBlock with PushQC/PushBlock?

wen-coding and others added 2 commits April 2, 2026 10:18
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace zeroCommittee with types.GenCommittee (random FirstBlock).
All block numbers are now relative to committee.FirstBlock().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants