Skip to content

fix: address critical data safety, consistency, and performance issues#429

Merged
bootjp merged 21 commits intomainfrom
feature/refactor
Mar 26, 2026
Merged

fix: address critical data safety, consistency, and performance issues#429
bootjp merged 21 commits intomainfrom
feature/refactor

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Mar 24, 2026

  • pebbleStore: make saveLastCommitTS durable (Sync) and atomic within ApplyMutations WriteBatch to prevent timestamp rollback on crash
  • pebbleStore: hold mutex across conflict check and batch commit to eliminate TOCTOU race in ApplyMutations
  • pebbleStore: use bounded iterators in GetAt/LatestCommitTS instead of unbounded NewIter(nil) to reduce Pebble overhead
  • pebbleStore: switch PutAt/DeleteAt/ExpireAt to NoSync since Raft log already provides durability for FSM-driven writes
  • mvccStore: replace tree.Each() with Iterator loop in ScanAt for early termination on limit, avoiding full treemap traversal
  • fsm: add prefix byte to Raft command encoding to eliminate double protobuf deserialization on the hot path (backward compatible)
  • txn_keys: pre-allocate []byte prefix slices at package level and add common prefix fast-path check to eliminate per-call heap allocations
  • sharded_coordinator: log errors in abortPreparedTxn instead of silently discarding them
  • coordinator: add 5s timeout to redirect gRPC forward calls
  • shard_store: add 5s timeout to all proxy gRPC calls
  • adapter/internal: add VerifyLeader quorum check to Forward handler
  • docs: add comprehensive review_todo.md with remaining issues

@bootjp bootjp requested a review from Copilot March 24, 2026 19:49
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's robustness, performance, and data integrity. It addresses several critical issues related to data safety and consistency by ensuring atomic operations and proper synchronization. Performance bottlenecks in data access and serialization have been alleviated through optimized iteration, reduced disk syncs, and memory allocation improvements. Additionally, it bolsters system resilience by introducing timeouts for inter-service communication and improving error visibility, while also documenting a comprehensive list of remaining technical debt.

Highlights

  • Data Durability & Atomicity: The pebbleStore.saveLastCommitTS now uses pebble.Sync and is atomically persisted within ApplyMutations to prevent timestamp rollbacks on crash.
  • Concurrency Safety: The pebbleStore.ApplyMutations method now holds a mutex across conflict checks and batch commits, eliminating Time-of-Check-to-Time-of-Use (TOCTOU) race conditions.
  • Performance Improvements (PebbleStore): GetAt and LatestCommitTS in pebbleStore now utilize bounded iterators, significantly reducing overhead. Additionally, PutAt, DeleteAt, and ExpireAt operations have been switched to pebble.NoSync as Raft's log already provides durability.
  • Performance Improvements (MVCC & FSM): mvccStore.ScanAt was optimized to use an iterator loop for early termination, avoiding full treemap traversals. FSM command encoding now incorporates a prefix byte to prevent double protobuf deserialization on the hot path.
  • Memory Optimization: The txn_keys package now pre-allocates []byte prefix slices and includes a common prefix fast-path check, reducing per-call heap allocations.
  • Error Handling & Timeouts: sharded_coordinator.abortPreparedTxn now logs errors instead of silently discarding them. Furthermore, gRPC forward calls in coordinator and proxy calls in shard_store have been configured with 5-second timeouts to prevent indefinite blocking.
  • Leader Verification: The adapter/internal.go file now includes a VerifyLeader quorum check in its Forward handler to ensure requests are processed by the current leader.
  • Documentation: A comprehensive review_todo.md document was added, outlining remaining critical issues across data loss, concurrency, performance, consistency, and test coverage.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses several critical and high-priority issues identified during a comprehensive code review, as detailed in the newly added docs/review_todo.md file. Key changes include enhancing data durability by ensuring lastCommitTS is synchronously persisted in pebbleStore, improving concurrency by adding leader verification and timeouts to gRPC forward calls in adapter, coordinator, and shard_store, and optimizing performance by refining mvccStore and pebbleStore scan operations, improving Raft command deserialization, and reducing heap allocations for transaction key prefixes. Additionally, transaction abort failures are now logged to prevent silent data inconsistencies. No specific feedback was provided in the review comments.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR focuses on strengthening durability/concurrency guarantees in the Pebble-backed MVCC store while reducing hot-path overhead across MVCC scans, Raft command handling, and routing/proxy gRPC calls.

Changes:

  • Improves MVCC scan performance (early termination for treemap scans; bounded Pebble iterators for point lookups).
  • Hardens Pebble commit-timestamp persistence and ApplyMutations atomicity/concurrency (batch-persist lastCommitTS, hold mutex through conflict check + commit).
  • Reduces overhead and improves operational safety (Raft command prefix to avoid double unmarshal; txn key prefix byte-slice reuse; gRPC proxy timeouts; leader verification; error logging on abort).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
store/mvcc_store.go Reworks in-memory ScanAt to use an iterator for early exit on limit.
store/lsm_store.go Adds batch persistence for lastCommitTS, locks ApplyMutations end-to-end, and introduces bounded iterators for GetAt/LatestCommitTS; adjusts sync behavior for writes.
kv/txn_keys.go Preallocates txn prefix byte slices and adds a common-prefix fast path.
kv/transaction.go Prefixes Raft command encodings to support fast decode dispatch.
kv/fsm.go Decodes Raft commands using the new prefix while retaining legacy fallback.
kv/sharded_coordinator.go Logs abort failures instead of silently discarding errors.
kv/coordinator.go Adds a timeout to redirect Forward gRPC calls.
kv/shard_store.go Adds timeouts to proxy RawKV gRPC calls.
adapter/internal.go Adds VerifyLeader check to the Forward handler.
docs/review_todo.md Adds a consolidated review TODO document of remaining issues.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Mar 25, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 25, 2026

@bootjp I've opened a new pull request, #432, to work on those changes. Once the pull request is ready, I'll request review from you.

@bootjp bootjp requested a review from Copilot March 25, 2026 14:51
- pebbleStore: make saveLastCommitTS durable (Sync) and atomic within
  ApplyMutations WriteBatch to prevent timestamp rollback on crash
- pebbleStore: hold mutex across conflict check and batch commit to
  eliminate TOCTOU race in ApplyMutations
- pebbleStore: use bounded iterators in GetAt/LatestCommitTS instead of
  unbounded NewIter(nil) to reduce Pebble overhead
- pebbleStore: switch PutAt/DeleteAt/ExpireAt to NoSync since Raft log
  already provides durability for FSM-driven writes
- mvccStore: replace tree.Each() with Iterator loop in ScanAt for early
  termination on limit, avoiding full treemap traversal
- fsm: add prefix byte to Raft command encoding to eliminate double
  protobuf deserialization on the hot path (backward compatible)
- txn_keys: pre-allocate []byte prefix slices at package level and add
  common prefix fast-path check to eliminate per-call heap allocations
- sharded_coordinator: log errors in abortPreparedTxn instead of
  silently discarding them
- coordinator: add 5s timeout to redirect gRPC forward calls
- shard_store: add 5s timeout to all proxy gRPC calls
- adapter/internal: add VerifyLeader quorum check to Forward handler
- docs: add comprehensive review_todo.md with remaining issues
…paction

- pebbleStore: implement Compact() with batched MVCC garbage collection
  that iterates keys, keeps one version <= minTS per user key, and
  deletes older versions in batches of 1000
- pebbleStore: implement RetentionController interface (MinRetainedTS /
  SetMinRetainedTS) so FSMCompactor no longer skips pebbleStore
- mvccStore: split Compact into 2-phase approach — scan under RLock to
  collect candidates, then apply updates in batched Lock/Unlock cycles
  (batch size 500) to avoid blocking all reads/writes during compaction
- docs: update review_todo.md marking completed items

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

bootjp and others added 5 commits March 26, 2026 01:07
…ding

- store: add TxnInternalKeyPrefix constant for compaction filtering
- mvccStore/pebbleStore: skip keys with !txn| prefix during Compact()
  to prevent breaking lock resolution by pruning commit/rollback records
- txn_codec: replace bytes.Buffer with direct slice encoding in
  EncodeTxnMeta, encodeTxnLock, encodeTxnIntent to eliminate per-call
  heap allocations
- docs: update review_todo.md marking 1.6 and 3.8 as done, downgrade 1.2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsafe

- Fix meta key namespace collision (4.1): prefix with \x00, add backward
  compat migration in findMaxCommitTS, isMetaKey helper for scan/compact
- Add background LockResolver (1.4/2.3): periodically scans expired locks
  on leader nodes and resolves orphaned commit/abort locks
- Add forward retry with limit (2.2): forwardWithRetry re-fetches leader
  address on each failure, bounds retries to 3
- Add decodeKeyUnsafe (3.9): zero-copy key decode for temporary comparisons
  in hot iteration paths (GetAt, ExistsAt, LatestCommitTS, Compact)
- Document Snapshot Isolation (4.2): clarify write-skew limitation in
  handleOnePhaseTxnRequest and MVCCStore.ApplyMutations
- Document cross-shard scan limitation (4.5) and VerifyLeader TOCTOU (4.3)
- Mark DynamoDB ConditionCheck (4.4) as already addressed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Dispatch

- FSM Abort path (5.1): 10 tests covering Prepare→Abort flow, commit
  rejection, lock/intent cleanup, rollback record, edge cases
- PebbleStore transactions (5.2): 14 tests covering ApplyMutations
  conflict detection, atomic batch, TTL, LatestCommitTS, Compact with
  meta/txn key preservation
- Coordinate.Dispatch (5.3): 6 tests covering raw put/del, one-phase
  txn, validation, and automatic startTS assignment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ShardedCoordinator abort (5.4): test verifying that when Shard2 Prepare
  fails, Shard1 receives correct Abort and locks are cleaned up
- Concurrent access (5.6): 7 new race-detected tests covering concurrent
  PutAt, GetAt+PutAt, ApplyMutations, ScanAt+PutAt, and scan snapshot
  consistency for MVCCStore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Mar 25, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 25, 2026

@bootjp I've opened a new pull request, #433, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 5 commits March 26, 2026 01:12
…cated prefix bytes

Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/8c736c29-0405-4197-93a2-8f290254f51d
…rdinator_abort_test.go, and lsm_store.go

Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/aed61b49-9084-425e-bd0e-899be4c0bf9d
…ction

- kv/fsm.go: extract decodeLegacyRaftRequest from decodeRaftRequests default case
- cmd/server/demo.go: extract setupFSMStore from run function
- store/lsm_store.go: extract compactEntryState/shouldDeleteEntry/compactFlushBatch
  for Compact; extract restoreMVCCEntries for restoreFromStreamingMVCC;
  extract restoreGobEntries for restoreFromLegacyGob; add hash import
- store/mvcc_store.go: lift compactEntry to package scope; extract
  computeScanCapHint/collectScanResults for ScanAt; extract
  compactPhase1/compactPhase2 for Compact
- store/mvcc_store_concurrency_test.go: extract goroutine bodies into
  concurrentGetAtWriter/concurrentGetAtReader, scanAtWriter/scanAtScanner,
  snapshotConsistencyWriter/snapshotConsistencyScanner helpers
- kv/lock_resolver.go: fix pre-existing undefined ok/userKey variables by
  using txnUserKeyFromLockKey; apply gofmt alignment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…etypeassert, wrapcheck, mnd, gosec, unused)

Co-authored-by: bootjp <1306365+bootjp@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bootjp/elastickv/sessions/aed61b49-9084-425e-bd0e-899be4c0bf9d
After rebase onto main, pebbleStore.getAt rejects reads below the
effective min retained timestamp. Update compaction tests to read at
or above the compaction watermark and expect ErrReadTSCompacted for
reads below it.
@bootjp bootjp force-pushed the feature/refactor branch from 13366d2 to e156718 Compare March 25, 2026 16:16
updateLastCommitTS and the subsequent s.lastCommitTS read in
ApplyMutations were not protected by mtx, racing with LastCommitTS()
which reads under mtx.RLock(). Wrap both in mtx.Lock()/Unlock().
@bootjp bootjp requested a review from Copilot March 25, 2026 16:55
Add maxMarshaledCommandSize (64 MiB) check before computing 1+len(b)
allocation size, preventing integer overflow on extremely large
marshaled payloads. Resolves CodeQL size-computation alert.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Mar 26, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

- transaction: extract prependByte helper to avoid CodeQL overflow alert
  on 1+len(b) arithmetic in marshalRaftCommand
- txn_keys: fix TxnKeyPrefix comment to note import-cycle constraint
  instead of claiming store imports it
- main: wire LockResolver into server lifecycle so background lock
  resolution actually runs in production
- tests: relax concurrent ApplyMutations assertions — mvccStore
  serializes fully so write conflicts may not occur
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

bootjp and others added 2 commits March 26, 2026 14:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The Copilot suggestion incorrectly removed the newVersions field from
compactEntry, which is still used in compactPhase1 and compactVersions.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

- marshalRaftCommand: use >= for size check to include prefix byte in
  the bound, preventing off-by-one allocation beyond maxMarshaledCommandSize
- compactEntry: remove unused newVersions field (phase2 recomputes)
- mvccStore.Compact: always advance minRetainedTS even when no keys
  need version trimming, so reads below minTS are correctly rejected
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

- mvccStore/pebbleStore ScanAt: change end bound to exclusive [start, end
  for consistency with ReverseScanAt and prefixScanEnd usage
- pebbleStore.getAt: reuse seekKey for LowerBound instead of encoding twice
- marshalRaftCommand: check len(b)+1 against max so the prefix byte is
  included in the size bound
- mvcc_store.go: add missing blank line after compactEntry struct (gci)
EOF
)

// prependByte returns a new slice with prefix followed by data.
func prependByte(prefix byte, data []byte) []byte {
out := make([]byte, len(data)+1)

Check failure

Code scanning / CodeQL

Size computation for allocation may overflow High

This operation, which is used in an
allocation
, involves a
potentially large value
and might overflow.
This operation, which is used in an
allocation
, involves a
potentially large value
and might overflow.

Copilot Autofix

AI 2 days ago

In general, to fix size-computation-overflow issues for allocations, ensure that any arithmetic used to compute allocation sizes is guarded by checks against a known safe maximum, or done in a wider integer type with explicit range validation. When a helper like prependByte may be called with arbitrary data, it should either validate the input length or accept a prevalidated size.

For this code, the best minimal fix is to add an explicit upper bound check in prependByte before computing len(data)+1. We already have a global constant maxMarshaledCommandSize set to 64 MiB and used to bound len(b)+1 in marshalRaftCommand. We can reuse this same constant in prependByte to ensure the computed allocation size does not exceed this limit. Specifically:

  • In prependByte, before out := make([]byte, len(data)+1), add a conditional:
    • If len(data)+1 > maxMarshaledCommandSize, panic or return a zero value. Since this helper is only used internally with inputs that are already checked against maxMarshaledCommandSize, this branch should be unreachable; using a panic with an explanatory message is appropriate and keeps the function signature unchanged.
  • This introduces no functional change for valid code paths, but guarantees that even if prependByte is ever called from elsewhere with a too-large slice, it will fail fast without risking integer overflow.
  • No new imports or types are needed; we just add the check inside kv/transaction.go at the prependByte definition.

Suggested changeset 1
kv/transaction.go

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/kv/transaction.go b/kv/transaction.go
--- a/kv/transaction.go
+++ b/kv/transaction.go
@@ -100,6 +100,11 @@
 
 // prependByte returns a new slice with prefix followed by data.
 func prependByte(prefix byte, data []byte) []byte {
+	if len(data)+1 > maxMarshaledCommandSize {
+		// This should be unreachable for callers that validate input sizes
+		// against maxMarshaledCommandSize before calling prependByte.
+		panic("prependByte: data too large")
+	}
 	out := make([]byte, len(data)+1)
 	out[0] = prefix
 	copy(out[1:], data)
EOF
@@ -100,6 +100,11 @@

// prependByte returns a new slice with prefix followed by data.
func prependByte(prefix byte, data []byte) []byte {
if len(data)+1 > maxMarshaledCommandSize {
// This should be unreachable for callers that validate input sizes
// against maxMarshaledCommandSize before calling prependByte.
panic("prependByte: data too large")
}
out := make([]byte, len(data)+1)
out[0] = prefix
copy(out[1:], data)
Copilot is powered by AI and may make mistakes. Always verify output.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

…ests

- mvccStore: extract non-allocating compactKeepIndex helper so phase1
  avoids allocating discarded version slices
- pebbleStore.ApplyMutations: add applyMu to serialize conflict check
  through batch commit, preventing TOCTOU under concurrent callers
- LockResolver: add 5 focused tests covering committed/rolled-back
  resolution, non-expired lock skip, leader-only execution, and Close
}

// prepareLock writes a PREPARE request (which creates a lock) for a key.
func prepareLock(t *testing.T, g *ShardGroup, startTS uint64, key, primaryKey, value []byte, lockTTLms uint64) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [golangci] reported by reviewdog 🐶
prepareLock - startTS always receives startTS (10) (unparam)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated no new comments.

@bootjp bootjp merged commit efec2bc into main Mar 26, 2026
10 of 11 checks passed
@bootjp bootjp deleted the feature/refactor branch March 26, 2026 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants