fix: address critical data safety, consistency, and performance issues#429
Conversation
bootjp commented Mar 24, 2026
- pebbleStore: make saveLastCommitTS durable (Sync) and atomic within ApplyMutations WriteBatch to prevent timestamp rollback on crash
- pebbleStore: hold mutex across conflict check and batch commit to eliminate TOCTOU race in ApplyMutations
- pebbleStore: use bounded iterators in GetAt/LatestCommitTS instead of unbounded NewIter(nil) to reduce Pebble overhead
- pebbleStore: switch PutAt/DeleteAt/ExpireAt to NoSync since Raft log already provides durability for FSM-driven writes
- mvccStore: replace tree.Each() with Iterator loop in ScanAt for early termination on limit, avoiding full treemap traversal
- fsm: add prefix byte to Raft command encoding to eliminate double protobuf deserialization on the hot path (backward compatible)
- txn_keys: pre-allocate []byte prefix slices at package level and add common prefix fast-path check to eliminate per-call heap allocations
- sharded_coordinator: log errors in abortPreparedTxn instead of silently discarding them
- coordinator: add 5s timeout to redirect gRPC forward calls
- shard_store: add 5s timeout to all proxy gRPC calls
- adapter/internal: add VerifyLeader quorum check to Forward handler
- docs: add comprehensive review_todo.md with remaining issues
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the system's robustness, performance, and data integrity. It addresses several critical issues related to data safety and consistency by ensuring atomic operations and proper synchronization. Performance bottlenecks in data access and serialization have been alleviated through optimized iteration, reduced disk syncs, and memory allocation improvements. Additionally, it bolsters system resilience by introducing timeouts for inter-service communication and improving error visibility, while also documenting a comprehensive list of remaining technical debt.
Code Review
This pull request addresses several critical and high-priority issues identified during a comprehensive code review, as detailed in the newly added docs/review_todo.md file. Key changes include enhancing data durability by ensuring lastCommitTS is synchronously persisted in pebbleStore, improving concurrency by adding leader verification and timeouts to gRPC forward calls in adapter, coordinator, and shard_store, and optimizing performance by refining mvccStore and pebbleStore scan operations, improving Raft command deserialization, and reducing heap allocations for transaction key prefixes. Additionally, transaction abort failures are now logged to prevent silent data inconsistencies. No specific feedback was provided in the review comments.
Pull request overview
This PR focuses on strengthening durability/concurrency guarantees in the Pebble-backed MVCC store while reducing hot-path overhead across MVCC scans, Raft command handling, and routing/proxy gRPC calls.
Changes:
- Improves MVCC scan performance (early termination for treemap scans; bounded Pebble iterators for point lookups).
- Hardens Pebble commit-timestamp persistence and ApplyMutations atomicity/concurrency (batch-persist lastCommitTS; hold the mutex through conflict check + commit).
- Reduces overhead and improves operational safety (Raft command prefix to avoid double unmarshal; txn key prefix byte-slice reuse; gRPC proxy timeouts; leader verification; error logging on abort).
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| store/mvcc_store.go | Reworks in-memory ScanAt to use an iterator for early exit on limit. |
| store/lsm_store.go | Adds batch persistence for lastCommitTS, locks ApplyMutations end-to-end, and introduces bounded iterators for GetAt/LatestCommitTS; adjusts sync behavior for writes. |
| kv/txn_keys.go | Preallocates txn prefix byte slices and adds a common-prefix fast path. |
| kv/transaction.go | Prefixes Raft command encodings to support fast decode dispatch. |
| kv/fsm.go | Decodes Raft commands using the new prefix while retaining legacy fallback. |
| kv/sharded_coordinator.go | Logs abort failures instead of silently discarding errors. |
| kv/coordinator.go | Adds a timeout to redirect Forward gRPC calls. |
| kv/shard_store.go | Adds timeouts to proxy RawKV gRPC calls. |
| adapter/internal.go | Adds VerifyLeader check to the Forward handler. |
| docs/review_todo.md | Adds a consolidated review TODO document of remaining issues. |
@copilot open a new pull request to apply changes based on the comments in this thread
…paction
- pebbleStore: implement Compact() with batched MVCC garbage collection that iterates keys, keeps one version <= minTS per user key, and deletes older versions in batches of 1000
- pebbleStore: implement RetentionController interface (MinRetainedTS / SetMinRetainedTS) so FSMCompactor no longer skips pebbleStore
- mvccStore: split Compact into a 2-phase approach: scan under RLock to collect candidates, then apply updates in batched Lock/Unlock cycles (batch size 500) to avoid blocking all reads/writes during compaction
- docs: update review_todo.md marking completed items

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ding
- store: add TxnInternalKeyPrefix constant for compaction filtering
- mvccStore/pebbleStore: skip keys with !txn| prefix during Compact() to prevent breaking lock resolution by pruning commit/rollback records
- txn_codec: replace bytes.Buffer with direct slice encoding in EncodeTxnMeta, encodeTxnLock, encodeTxnIntent to eliminate per-call heap allocations
- docs: update review_todo.md marking 1.6 and 3.8 as done, downgrade 1.2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsafe
- Fix meta key namespace collision (4.1): prefix with \x00, add backward compat migration in findMaxCommitTS, isMetaKey helper for scan/compact
- Add background LockResolver (1.4/2.3): periodically scans expired locks on leader nodes and resolves orphaned commit/abort locks
- Add forward retry with limit (2.2): forwardWithRetry re-fetches leader address on each failure, bounds retries to 3
- Add decodeKeyUnsafe (3.9): zero-copy key decode for temporary comparisons in hot iteration paths (GetAt, ExistsAt, LatestCommitTS, Compact)
- Document Snapshot Isolation (4.2): clarify write-skew limitation in handleOnePhaseTxnRequest and MVCCStore.ApplyMutations
- Document cross-shard scan limitation (4.5) and VerifyLeader TOCTOU (4.3)
- Mark DynamoDB ConditionCheck (4.4) as already addressed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Dispatch
- FSM Abort path (5.1): 10 tests covering Prepare→Abort flow, commit rejection, lock/intent cleanup, rollback record, edge cases
- PebbleStore transactions (5.2): 14 tests covering ApplyMutations conflict detection, atomic batch, TTL, LatestCommitTS, Compact with meta/txn key preservation
- Coordinate.Dispatch (5.3): 6 tests covering raw put/del, one-phase txn, validation, and automatic startTS assignment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ShardedCoordinator abort (5.4): test verifying that when Shard2 Prepare fails, Shard1 receives correct Abort and locks are cleaned up
- Concurrent access (5.6): 7 new race-detected tests covering concurrent PutAt, GetAt+PutAt, ApplyMutations, ScanAt+PutAt, and scan snapshot consistency for MVCCStore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@copilot open a new pull request to apply changes based on the comments in this thread
…cated prefix bytes Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com> Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/8c736c29-0405-4197-93a2-8f290254f51d
…rdinator_abort_test.go, and lsm_store.go Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com> Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/aed61b49-9084-425e-bd0e-899be4c0bf9d
…ction
- kv/fsm.go: extract decodeLegacyRaftRequest from decodeRaftRequests default case
- cmd/server/demo.go: extract setupFSMStore from run function
- store/lsm_store.go: extract compactEntryState/shouldDeleteEntry/compactFlushBatch for Compact; extract restoreMVCCEntries for restoreFromStreamingMVCC; extract restoreGobEntries for restoreFromLegacyGob; add hash import
- store/mvcc_store.go: lift compactEntry to package scope; extract computeScanCapHint/collectScanResults for ScanAt; extract compactPhase1/compactPhase2 for Compact
- store/mvcc_store_concurrency_test.go: extract goroutine bodies into concurrentGetAtWriter/concurrentGetAtReader, scanAtWriter/scanAtScanner, snapshotConsistencyWriter/snapshotConsistencyScanner helpers
- kv/lock_resolver.go: fix pre-existing undefined ok/userKey variables by using txnUserKeyFromLockKey; apply gofmt alignment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…etypeassert, wrapcheck, mnd, gosec, unused) Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com> Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/aed61b49-9084-425e-bd0e-899be4c0bf9d
After rebase onto main, pebbleStore.getAt rejects reads below the effective min retained timestamp. Update compaction tests to read at or above the compaction watermark and expect ErrReadTSCompacted for reads below it.
13366d2 to e156718
updateLastCommitTS and the subsequent s.lastCommitTS read in ApplyMutations were not protected by mtx, racing with LastCommitTS() which reads under mtx.RLock(). Wrap both in mtx.Lock()/Unlock().
Add maxMarshaledCommandSize (64 MiB) check before computing 1+len(b) allocation size, preventing integer overflow on extremely large marshaled payloads. Resolves CodeQL size-computation alert.
@copilot open a new pull request to apply changes based on the comments in this thread
- transaction: extract prependByte helper to avoid CodeQL overflow alert on 1+len(b) arithmetic in marshalRaftCommand
- txn_keys: fix TxnKeyPrefix comment to note import-cycle constraint instead of claiming store imports it
- main: wire LockResolver into server lifecycle so background lock resolution actually runs in production
- tests: relax concurrent ApplyMutations assertions; mvccStore serializes fully, so write conflicts may not occur
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The Copilot suggestion incorrectly removed the newVersions field from compactEntry, which is still used in compactPhase1 and compactVersions.
- marshalRaftCommand: use >= for size check to include prefix byte in the bound, preventing off-by-one allocation beyond maxMarshaledCommandSize
- compactEntry: remove unused newVersions field (phase2 recomputes)
- mvccStore.Compact: always advance minRetainedTS even when no keys need version trimming, so reads below minTS are correctly rejected
- mvccStore/pebbleStore ScanAt: change end bound to exclusive [start, end) for consistency with ReverseScanAt and prefixScanEnd usage
- pebbleStore.getAt: reuse seekKey for LowerBound instead of encoding twice
- marshalRaftCommand: check len(b)+1 against the max so the prefix byte is included in the size bound
- mvcc_store.go: add missing blank line after compactEntry struct (gci)
```go
// prependByte returns a new slice with prefix followed by data.
func prependByte(prefix byte, data []byte) []byte {
	out := make([]byte, len(data)+1)
```
Check failure (Code scanning / CodeQL): Size computation for allocation may overflow (High)

Copilot Autofix (AI, 2 days ago)
In general, to fix size-computation-overflow issues for allocations, ensure that any arithmetic used to compute allocation sizes is guarded by checks against a known safe maximum, or done in a wider integer type with explicit range validation. When a helper like prependByte may be called with arbitrary data, it should either validate the input length or accept a prevalidated size.

For this code, the best minimal fix is to add an explicit upper bound check in prependByte before computing len(data)+1. We already have a global constant maxMarshaledCommandSize set to 64 MiB and used to bound len(b)+1 in marshalRaftCommand. We can reuse this same constant in prependByte to ensure the computed allocation size does not exceed this limit. Specifically:

- In prependByte, before `out := make([]byte, len(data)+1)`, add a conditional: if `len(data)+1 > maxMarshaledCommandSize`, panic or return a zero value. Since this helper is only used internally with inputs that are already checked against maxMarshaledCommandSize, this branch should be unreachable; using a panic with an explanatory message is appropriate and keeps the function signature unchanged.
- This introduces no functional change for valid code paths, but guarantees that even if prependByte is ever called from elsewhere with a too-large slice, it will fail fast without risking integer overflow.
- No new imports or types are needed; we just add the check inside kv/transaction.go at the prependByte definition.
```go
// prependByte returns a new slice with prefix followed by data.
func prependByte(prefix byte, data []byte) []byte {
	if len(data)+1 > maxMarshaledCommandSize {
		// This should be unreachable for callers that validate input sizes
		// against maxMarshaledCommandSize before calling prependByte.
		panic("prependByte: data too large")
	}
	out := make([]byte, len(data)+1)
	out[0] = prefix
	copy(out[1:], data)
	return out
}
```
…ests
- mvccStore: extract non-allocating compactKeepIndex helper so phase1 avoids allocating discarded version slices
- pebbleStore.ApplyMutations: add applyMu to serialize conflict check through batch commit, preventing TOCTOU under concurrent callers
- LockResolver: add 5 focused tests covering committed/rolled-back resolution, non-expired lock skip, leader-only execution, and Close
```go
// prepareLock writes a PREPARE request (which creates a lock) for a key.
func prepareLock(t *testing.T, g *ShardGroup, startTS uint64, key, primaryKey, value []byte, lockTTLms uint64) {
```
🚫 [golangci] reported by reviewdog 🐶
prepareLock - startTS always receives startTS (10) (unparam)