persist: replace file-per-entry with WAL and refactor into generic indexedWAL#3044
persist: replace file-per-entry with WAL and refactor into generic indexedWAL#3044wen-coding wants to merge 47 commits intomainfrom
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3044 +/- ##
===========================================
+ Coverage 58.42% 76.05% +17.62%
===========================================
Files 2088 15 -2073
Lines 172108 1424 -170684
===========================================
- Hits 100552 1083 -99469
+ Misses 62620 227 -62393
+ Partials 8936 114 -8822
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| for lane, first := range laneFirsts { | ||
| lw, ok := bp.lanes[lane] | ||
| if !ok { | ||
| continue // no WAL yet; PersistBlock will create one lazily | ||
| } | ||
| lane, fileN, err := parseBlockFilename(entry.Name()) | ||
| if err != nil { | ||
| firstBN, ok := lw.firstBlockNum().Get() | ||
| if !ok || first <= firstBN { | ||
| continue | ||
| } | ||
| first, ok := laneFirsts[lane] | ||
| if ok && fileN >= first { | ||
| continue | ||
| walIdx := lw.firstIdx + uint64(first-firstBN) | ||
| if err := lw.TruncateBefore(walIdx); err != nil { | ||
| return fmt.Errorf("truncate lane %s WAL before block %d: %w", lane, first, err) | ||
| } | ||
| path := filepath.Join(bp.dir, entry.Name()) | ||
| if err := os.Remove(path); err != nil && !os.IsNotExist(err) { | ||
| logger.Warn("failed to delete block file", "path", path, "err", err) | ||
| } |
Check warning
Code scanning / CodeQL
Iteration over map
| dbwal.Config{ | ||
| WriteBufferSize: 0, // synchronous writes | ||
| WriteBatchSize: 1, // no batching | ||
| FsyncEnabled: true, |
There was a problem hiding this comment.
Be aware of the latency impact here, if putting write in critical path, it would introduce some noticeable latency. Would recommend synchronous write + nofsync for perf reason, fsync does provide stronger guarantees, but the chance of all validators hitting power off at the same time is pretty rare
There was a problem hiding this comment.
That's reasonable, changed
There was a problem hiding this comment.
@yzang2019 Talked with @pompon0 about this, since lane block QC is only f+1, theoretically one OS crash can screw us. We are changing all lanes and commitqc writes to happen concurrently, so latency is less of a concern. Is it okay if I change the Fsync back to true here?
| // Used when all entries are stale (e.g. the prune anchor advanced past | ||
| // everything persisted). | ||
| // | ||
| // TODO: sei-db/wal doesn't expose tidwall/wal's AllowEmpty option, so there's |
There was a problem hiding this comment.
Is the plan to expose that in a separate PR? It should be pretty simple to add though?
There was a problem hiding this comment.
PR 3049 merged and switched to TruncateAll() here.
| if err := w.wal.Write(entry); err != nil { | ||
| return err | ||
| } | ||
| if w.firstIdx == 0 { |
There was a problem hiding this comment.
Recommend using Count() == 0 instead of relying on firstIdx == 0 as a sentinel for "WAL is empty" incase the assumption of wal starting index from 0 is not valid in the future if we switch the wal library
| return nil, nil | ||
| } | ||
| entries := make([]T, 0, w.Count()) | ||
| err := w.wal.Replay(w.firstIdx, w.nextIdx-1, func(_ uint64, entry T) error { |
There was a problem hiding this comment.
There's no validation that len(entries) == w.Count() after a successful replay. If Replay succeeds but returns fewer entries than expected (e.g., the underlying WAL silently truncated its tail on open due to corruption), ReadAll would return a short slice with no error.
| return nil | ||
| } | ||
| for _, lw := range bp.lanes { | ||
| if err := lw.Close(); err != nil { |
There was a problem hiding this comment.
If one lane's WAL fails to close, the remaining lanes are never closed. Use errors.Join to accumulate errors and close all lanes
c0727e6 to
422aa8f
Compare
| // defaultStaleRetention is how long a lane WAL is kept after its last write | ||
| // before being deleted when the lane is no longer in the committee. This gives | ||
| // catching-up peers time to fetch blocks from the stale lane. | ||
| const defaultStaleRetention = 30 * time.Minute |
There was a problem hiding this comment.
that should not be an issue, given that we persist data until it is executed anyway.
There was a problem hiding this comment.
Let's talk about lane retention in tomorrow's syncup.
There was a problem hiding this comment.
Okay, per our discussion, removed stale lane removal for now and added a TODO for later. Ready for another look please.
| return &BlockPersister{lanes: map[types.LaneID]*laneWAL{}, staleRetention: defaultStaleRetention}, nil, nil | ||
| } | ||
| dir := filepath.Join(sd, "blocks") | ||
| dir := filepath.Join(sd, blocksDir) |
There was a problem hiding this comment.
nit: "Dir" suffix seems redundant as a part of a filepath, no?
There was a problem hiding this comment.
It is, but naming the constant blocks sounds a bit weird, since it seems to imply these are the blocks when it's only a path. Any better suggestions?
There was a problem hiding this comment.
I'm not sure if I follow, but I'd recommend dir := filepath.Join(sd, "lanes"), since blocks are actually in WALs and we have WAL per lane.
| entries, err := os.ReadDir(bp.dir) | ||
| if err != nil { | ||
| return fmt.Errorf("list blocks dir for cleanup: %w", err) | ||
| if len(laneFirsts) == 0 { |
There was a problem hiding this comment.
nit: is this the right layer to check such invariants?
There was a problem hiding this comment.
That's fair, removed.
| if ok && fileN >= first { | ||
| if first >= lw.nextBlockNum { | ||
| // Anchor advanced past all persisted blocks for this lane. | ||
| if err := lw.TruncateAll(); err != nil { |
There was a problem hiding this comment.
is truncation synchronous? How expensive it is?
There was a problem hiding this comment.
TruncateAll is synchronous, it removes the segment files and then change some internal pointers, not too expensive. I don't expect this to happen very often though. In practice if every validator keeps emitting blocks, there should always be 1 or 2 lane blocks which are generated but not in AppQC yet. Unless they do "generate a block then wait for a while, generate another block then wait for a while", are you imagining that as an attack?
| "filenameLane", lane, | ||
| slog.Uint64("filenameNum", uint64(n)), | ||
| ) | ||
| if time.Since(lw.LastWriteTime()) < bp.staleRetention { |
There was a problem hiding this comment.
why have you switched from the strict availability guarantees to time-based ones?
There was a problem hiding this comment.
This time-based approach is only for how long we retain a lane not in current committee any more. I originally delete the stale lane immediately, but then freaked out thinking some block in flight might still include lane blocks in the stale lanes, so I changed this to wait for 30 minutes before deletion. We can delete this for now and wait for the Epoch implementation.
For things other than stale lanes, truncation is not time-based.
| @@ -673,32 +704,23 @@ func (s *State) runPersist(ctx context.Context, pers persisters) error { | |||
| s.markCommitQCsPersisted(batch.commitQCs[len(batch.commitQCs)-1]) | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
I think it is time to actually do those writes concurrently - i.e. persistence of commitQCs and each lane are independent and will be done faster in parallel. We need to minimize the critical path of sending the votes.
aeeddbf to
8de2a83
Compare
Replace the file-per-block and file-per-commitqc persistence with sei-db/wal. Blocks use one WAL per lane so that truncation is independent (no stale-lane problem). CommitQCs use a single WAL with a linear RoadIndex-to-WAL-index mapping. Key changes: - BlockPersister: per-lane WAL in blocks/<hex_lane_id>/ subdirs, lazy lane creation, independent per-lane TruncateBefore. - CommitQCPersister: single WAL in commitqcs/, tracks firstWALIdx and nextWALIdx locally for correct truncation mapping. - Remove all file-per-entry code: filename construction/parsing, directory scanning, individual file read/write/delete, corrupt file skipping. - Rewrite tests for WAL semantics (append-only, truncation, replay). Made-with: Cursor
Extract common WAL mechanics (index tracking, typed write/replay, truncation) into a generic indexedWAL[T] backed by sei-db/wal, replacing the duplicated raw-bytes WAL setup in both blocks.go and commitqcs.go. Key changes: - Add indexedWAL[T] with codec[T] interface for typed serialization - laneWAL embeds indexedWAL; firstBlockNum() returns Option for safety - DeleteBefore now removes stale lane WALs (validators no longer in committee) and their directories - Add empty-WAL guard to CommitQCPersister.DeleteBefore - Add direct unit tests for indexedWAL (wal_test.go) - Add TODO for dynamic committee membership support Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Instead of extracting the lane ID from the first replayed entry (and closing/skipping empty WALs), decode it from the hex directory name. This keeps the WAL open so the lane can receive blocks without reopening it. Made-with: Cursor
Made-with: Cursor
Validate the directory name (hex decode + PublicKeyFromBytes) before opening the WAL, avoiding a redundant hex.DecodeString call. Add tests for both skip paths: non-hex directory name and valid hex but invalid public key length. Made-with: Cursor
Replace callback-based Replay on indexedWAL with ReadAll that returns a slice. Remove the defensive sort in blocks loadAll since WAL entries are already in append order. Fix stale Replay reference in godoc. Made-with: Cursor
TruncateBefore now reads and verifies the entry at the target WAL index before truncating, catching index-mapping corruption before data loss. PersistCommitQC and PersistBlock enforce strict sequential order to prevent gaps that would break the linear domain-to-WAL-index mapping. Made-with: Cursor
The non-contiguous commitQC test now expects the gap to be caught
at PersistCommitQC time ("out of sequence") rather than at NewState
load time, matching the defense-in-depth guard added earlier.
Made-with: Cursor
Now that sei-db/wal exposes AllowEmpty and TruncateAll (#3049), use them to clear a WAL in-place instead of the heavier close → remove directory → reopen pattern. - Enable AllowEmpty in WAL config. - Replace Reset() with TruncateAll() — single call, no dir removal. - Remove dir/codec fields from indexedWAL (only needed for reopen). - Eliminate firstIdx == 0 sentinel: Count() is now just nextIdx - firstIdx, empty when equal. Write() no longer needs the first-write bookkeeping branch. - Update openIndexedWAL to handle AllowEmpty's empty-log reporting (first > last) uniformly with the non-empty case. Made-with: Cursor
…runcates" Made-with: Cursor
sei-db/wal.NewWAL no longer accepts a *slog.Logger parameter. Made-with: Cursor
Instead of immediately deleting lane WALs not in the current committee, retain them for 30 minutes (defaultStaleRetention) after the last write. This gives catching-up peers time to fetch blocks from lanes that have left the committee. - Add lastWriteTime to indexedWAL, initialized to time.Now() on open and updated on every Write() - Add staleRetention field to BlockPersister (default 30m) - DeleteBefore skips stale lane deletion when lastWriteTime is recent - Tests use staleRetention=0 for immediate deletion behavior Made-with: Cursor
Move commitQC DeleteBefore inside the anchor-persist block and derive the truncation index directly from anchor.CommitQC rather than from the in-memory queue's first index. This makes the safety invariant explicit: we only truncate WAL entries that the on-disk anchor covers. Remove the now-unused commitQCFirst field from persistBatch. Made-with: Cursor
Annotate all guard sites where persistence is disabled (dir/iw is None) with inline comments so the no-op behavior is immediately obvious. Made-with: Cursor
Make Close() internal (close()) since it's only called within the persist package — by tests and constructors for error cleanup. Add no-op comments, TODO for metrics, fix truncation comment, and correct close() godoc. Made-with: Cursor
Group blocks by lane in runPersist step 4 and write each lane's blocks in a separate errgroup goroutine. Each goroutine calls markBlockPersisted when its lane finishes so voting unblocks per-lane without waiting for all lanes. MaybeCreateLane pre-creates lane WALs sequentially before launching goroutines so bp.lanes is read-only during the concurrent phase — no mutex needed. PersistBlock now requires the lane to exist (returns an error otherwise) to prevent silent data races from lazy map writes. Made-with: Cursor
Move CommitQC writes into the same errgroup as per-lane block writes so they run in parallel. CommitQCs go in one goroutine; blocks fan out one goroutine per lane. Each goroutine publishes its result as soon as it finishes. No ordering dependency between the two — they write to independent WALs and update independent inner fields. Made-with: Cursor
Made-with: Cursor
With epoch-based fixed committees, stale lane cleanup is unnecessary. Added TODO to implement it after dynamic committee changes land. Made-with: Cursor
b6be8e4 to
273ab80
Compare
Made-with: Cursor
Made-with: Cursor
| // 4. Prune old data. | ||
| if err := pers.blocks.DeleteBefore(batch.laneFirsts); err != nil { | ||
| return fmt.Errorf("block deleteBefore: %w", err) | ||
| var g errgroup.Group |
| if err := pers.blocks.PersistBlock(proposal); err != nil { | ||
| return fmt.Errorf("persist block %s/%d: %w", h.Lane(), h.BlockNumber(), err) | ||
| lane := proposal.Msg().Block().Header().Lane() | ||
| if err := pers.blocks.MaybeCreateLane(lane); err != nil { |
There was a problem hiding this comment.
nit: how about making MaybeCreateLane an implementation detail of PersistBlock?
There was a problem hiding this comment.
What do you mean? MaybeCreateLane will write to shared map, so it's not safe to do inside PersistBlock, added a comment.
| // all persisted entries, DeleteBefore resets the WAL so new writes | ||
| // start clean. Runs every cycle (not just on anchor change) so | ||
| // that stale lane retention timeouts are evaluated promptly. | ||
| if err := pers.blocks.DeleteBefore(batch.laneFirsts); err != nil { |
There was a problem hiding this comment.
lanes can be pruned in parallel (maybe even in parallel to insertions, but I don't know how complicated would that be)
There was a problem hiding this comment.
done, make them run in parallel to insertions would be a bit tricker, because we can't run deletion and insertion on the same lane in parallel. So let's just keep it as is for now, in normal cases pruning should be trivial.
| // Persist only indices 0-4 to the CommitQC WAL. | ||
| cp, _, err := persist.NewCommitQCPersister(utils.Some(dir)) | ||
| require.NoError(t, err) | ||
| for i := 0; i < 5; i++ { |
| require.NoError(t, err) | ||
|
|
||
| require.Equal(t, types.RoadIndex(9), state.FirstCommitQC()) | ||
| latest, ok := state.LastCommitQC().Load().Get() |
There was a problem hiding this comment.
nit: require.NoError(t, utils.TestDiff(utils.Some(...),state.LastCommitQC().Load()))
| if bp.noop { | ||
| return nil | ||
| // PARALLEL-SAFE: called concurrently (one goroutine per lane). | ||
| // Only touch the per-lane *laneWAL here — no shared mutable state. |
There was a problem hiding this comment.
is lw.Write() parallel-safe though?
There was a problem hiding this comment.
We use a separate laneWAL per lane, so it should be parallel-safe, updated comments.
Made-with: Cursor
…lock Made-with: Cursor
Each lane's WAL is independent, so TruncateAll and TruncateBefore can run concurrently across lanes via scope.Parallel. Made-with: Cursor
Made-with: Cursor
…Write Made-with: Cursor
- wal.go: add INVARIANT comment on firstIdx <= nextIdx, CRITICAL comment on AllowEmpty dependency, wrap TruncateBefore error with context - commitqcs.go: document volatile cursor obligation after crash recovery, fix error string casing for consistency - wal_test.go: use _ for unused param in acceptAny Made-with: Cursor
After TruncateAll, set lw.nextBlockNum = first so the contiguity check in PersistBlock catches stale block numbers instead of accepting any value when Count() == 0. Made-with: Cursor
Tighten PersistBlock to reject stale block numbers after DeleteBefore re-anchors nextBlockNum via TruncateAll. New tests: - wal: TruncateAll on empty, reopen after TruncateAll with no writes, successive TruncateBefore calls - blocks: stale block number rejected after TruncateAll - commitqcs: DeleteBefore at already-pruned index is no-op, multiple progressive DeleteBefore calls Made-with: Cursor
… gaps - wal.go: add FirstIdx() accessor to avoid direct field access from callers; add bounds validation in TruncateBefore for clear errors - blocks.go, commitqcs.go: use FirstIdx() instead of .firstIdx - blocks.go: make close() idempotent by nil-ing lanes after close - commitqcs.go: make Close() idempotent by clearing iw Option - state_test.go: add test for anchor past all persisted blocks (block-WAL TruncateAll recovery path) - wal_test.go: use FirstIdx() accessor in all assertions Made-with: Cursor
…nment Made-with: Cursor
Summary
Replace file-per-block and file-per-commitqc persistence in
blocks.goandcommitqcs.gowithsei-db/wal, and extract common WAL mechanics into a genericindexedWAL[T].wal.go(new): genericindexedWAL[T]withcodec[T]interface, providingWrite,ReadAll,TruncateBefore(with verify callback and bounds validation),TruncateAll,FirstIdx,Count, andClose. Handles monotonic index tracking, typed serialization, and full WAL lifecycle. Opens the underlying WAL withAllowEmpty: true(critical — index arithmetic depends on this convention). Emptiness is defined byfirstIdx == nextIdxwith documentedINVARIANT: firstIdx <= nextIdx.TruncateAlldelegates towal.TruncateAll()(a synchronous reset that removes all segment files) and advancesfirstIdxtonextIdxsoCount() == 0while preserving the index counter. Fsync is enabled for additional durability; the prune anchor (persisted via A/B files with fsync) is the crash-recovery watermark.ReadAllincludes a post-replay count check to detect silent data loss.blocks.go: one WAL per lane inblocks/<hex_lane_id>/subdirectories, with independent per-lane truncation (parallelized viascope.Parallel), andTruncateAll()when the prune anchor advances past all persisted blocks. AfterTruncateAll,nextBlockNumis re-anchored soPersistBlockrejects stale block numbers even whenCount() == 0.MaybeCreateLanepre-creates lane WALs sequentially before concurrent writes;PersistBlockrequires the lane to exist (returns error otherwise) to prevent silent data races from lazy map writes.PersistBlockenforces strict contiguous block numbers.DeleteBeforeverifies block numbers before truncating via a defense-in-depth callback.loadAllchecks for gaps at replay time.close()(unexported — only used by tests and constructor error cleanup) useserrors.Jointo ensure all lane WALs are closed even if one fails; idempotent (nil-out after close).commitqcs.go: single WAL incommitqcs/, linear RoadIndex-to-WAL-index mapping viaFirstIdx()accessor.PersistCommitQCsilently ignores duplicates (idx < next) for idempotent startup, rejects gaps (idx > next).DeleteBeforeadvances the write cursor and truncates the WAL viaTruncateAllwhen the anchor advances past all entries — cursor advancement happens before the count-zero check so it works correctly even after a crash betweenTruncateAlland the first new write. After crash recovery with an empty WAL, the caller MUST callDeleteBefore(anchor)to re-establish the cursor (documented in constructor).loadAllchecks for gaps at replay time.Close()is idempotent (clears internal Option).state.go: startup prunes stale WAL entries viaDeleteBefore, then re-persists in-memory CommitQCs (no-op in normal case; writes anchor's QC after a WALTruncateAll). RuntimerunPersistreordered: anchor + commitQC prune → block prune → concurrent writes viascope.Parallel. CommitQCs and blocks are persisted concurrently: one goroutine for CommitQCs, one goroutine per lane for blocks (sequential within each lane). Each goroutine publishes its result (markCommitQCsPersisted/markBlockPersisted) as soon as it finishes. CommitQC WAL truncation is co-located with the anchor persist step so the truncation point is derived directly from the on-disk anchor, making the safety invariant explicit: we only truncate entries the anchor covers.ResetNextmethod.DeleteBefore).Concurrency design (no mutex)
runPersistgroups blocks by lane, callsMaybeCreateLanesequentially (must happen before concurrent writes because it writes to the sharedlanesmap), then launchesscope.Parallelwith one goroutine for CommitQCs and one goroutine per block lane. During the concurrent phasebp.lanesis read-only (no map writes), so nosync.Mutexis needed. Each*laneWALis exclusively owned by one goroutine —lw.Write()itself is NOT thread-safe, safety comes from lane isolation.scope.Parallelwaits for all goroutines before the nextcollectPersistBatchiteration, ensuring no two batches write to the same lane concurrently.DeleteBeforealso parallelizes per-lane truncation viascope.Parallel— each lane'sTruncateAll/TruncateBeforeoperates on an independent WAL directory with no shared mutable state.Test plan
wal_test.go: empty start, write+read, reopen with data, reopen after truncate, truncate all but last, verify callback (accept + reject), write after truncate, TruncateAll, stale nextIdx detection, TruncateAll on empty WAL, reopen after TruncateAll with no writes, successive TruncateBefore callsblocks_test.go: empty dir, persist+load, multiple lanes, delete-before (single/multi/empty lane/restart), noop, delete-then-persist, delete-past-all (TruncateAll), stale block rejected after TruncateAll, empty WAL survives reopen, MaybeCreateLane (creates on first call, idempotent, PersistBlock errors without it), skip non-hex/invalid-key dirs, out-of-sequence rejection, gap detection at load timecommitqcs_test.go: empty dir, persist+load, delete-before, duplicate is no-op, gap rejected, noop, delete-then-persist, delete-past-all (TruncateAll + cursor advance), crash-recovery (TruncateAll then crash before write, restart with empty WAL), gap detection at load time, already-pruned DeleteBefore is no-op, progressive DeleteBefore callsstate_test.go: anchor past all persisted commitQCs truncates WAL and re-persists anchor's QC, anchor past all persisted blocks truncates lane WAL, loads persisted blocks, loads persisted AppQC and blocks together