
feat(rivetkit): sqlite vfs v2#4673

Draft
NathanFlurry wants to merge 19 commits into 04-15-chore_rename_sandbox_-_kitchen_sink from feat/sqlite-vfs-v2

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app bot commented Apr 16, 2026

🚅 Deployed to the rivet-pr-4673 environment in rivet-frontend

Service            | Status                      | Web | Updated (UTC)
frontend-cloud     | 😴 Sleeping (View Logs)     | Web | Apr 19, 2026 at 3:34 pm
frontend-inspector | 😴 Sleeping (View Logs)     | Web | Apr 18, 2026 at 11:24 pm
kitchen-sink       | ❌ Build Failed (View Logs) | Web | Apr 18, 2026 at 11:13 pm
website            | 😴 Sleeping (View Logs)     | Web | Apr 18, 2026 at 4:21 pm
mcp-hub            | ✅ Success (View Logs)      | Web | Apr 16, 2026 at 4:08 pm
ladle              | ❌ Build Failed (View Logs) | Web | Apr 16, 2026 at 9:18 am

Member Author

NathanFlurry commented Apr 16, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions
Contributor

github-actions bot commented Apr 16, 2026

Preview packages published to npm

Install with:

npm install rivetkit@pr-4673

All packages published as 0.0.0-pr.4673.33edc3b with tag pr-4673.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-33edc3b
docker pull rivetdev/engine:full-33edc3b

Individual packages:

npm install rivetkit@pr-4673
npm install @rivetkit/react@pr-4673
npm install @rivetkit/rivetkit-native@pr-4673
npm install @rivetkit/workflow-engine@pr-4673

@NathanFlurry NathanFlurry changed the title feat: [US-001] - [Create v1 SQLite VFS baseline test suite] feat(rivetkit): sqlite vfs v2 Apr 16, 2026

xWrite in the v2 VFS was calling resolve_pages for every dirty page,
even when the write was a full-page aligned overwrite. For page numbers
beyond db_size_pages (newly allocated pages), this meant fetching from
the engine to get data that doesn't exist.

A 256-row INSERT transaction (~1MB of dirty data) was making 288 round
trips to the engine: one get_pages RTT per new page allocation. At
128ms staging RTT that's ~37s of theoretical network time vs the ~130ms
needed for just the final commit.

Now:
- Full-page aligned writes (offset % page_size == 0 && amt % page_size == 0)
  skip resolve_pages entirely and use a zero-filled page as the base.
- Partial writes filter out pages > db_size_pages before calling
  resolve_pages, synthesizing zero pages locally for new allocations.
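
The two cases above can be sketched as follows. This is a minimal illustration with hypothetical names (is_full_page_aligned, pages_to_resolve), not the actual VFS API:

```rust
// Hypothetical sketch of the xWrite fast-path classification described
// above. Names and signatures are illustrative, not the real VFS code.

/// A full-page aligned overwrite never needs the old page contents:
/// the base can be a zero-filled page, so resolve_pages is skipped.
fn is_full_page_aligned(offset: u64, amt: u64, page_size: u64) -> bool {
    amt > 0 && offset % page_size == 0 && amt % page_size == 0
}

/// For partial writes, only pages that already exist need a fetch;
/// pages beyond the current database size are synthesized as zeros.
fn pages_to_resolve(dirty_pages: &[u64], db_size_pages: u64) -> Vec<u64> {
    dirty_pages
        .iter()
        .copied()
        .filter(|&pgno| pgno <= db_size_pages)
        .collect()
}

fn main() {
    // 4 KiB pages: a fully aligned 1 MiB write skips resolve_pages.
    assert!(is_full_page_aligned(0, 1 << 20, 4096));
    // A write straddling a page boundary still needs the base page.
    assert!(!is_full_page_aligned(100, 4096, 4096));
    // Pages 11 and 12 are new allocations: zero-fill locally, no RTT.
    assert_eq!(pages_to_resolve(&[9, 10, 11, 12], 10), vec![9, 10]);
    println!("ok");
}
```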

Direct engine profiling tests verify 0 engine fetches for 1MB, 5MB, and
100-hot-row-updates workloads.

Adds new stress-test workloads to the kitchen-sink bench:
churnInsertDelete, mixedOltpLarge, growingAggregation,
indexCreationOnLargeTable, bulkUpdate1000Rows, truncateAndRegrow,
manySmallTables.

The commit, compaction, and takeover paths all read meta_key and then
write meta_key within the same transaction. Meta reads were using
snapshot isolation, which does NOT register a read conflict range in
UniversalDB (matching FoundationDB semantics).

Without a read conflict range, two concurrent transactions could both
read meta at the same snapshot, both plan mutations based on that stale
read, and both commit with last-write-wins semantics. For commit vs
compaction this caused head_txid to rewind: commit advances from N to
N+1, compaction then writes its own meta with head=N preserved (from
its snapshot), overwriting the commit's head=N+1. The VFS sees its own
successful commit response but the engine state is behind, and the next
commit fence-mismatches with 'commit head_txid X+1 did not match current
head_txid X'.

Fix: introduce tx_get_value_serializable in udb and use it for meta reads
in the fast-path commit, the compaction shard write, and the takeover
TOCTOU re-check. Serializable reads register the key in the read conflict
range so concurrent writes by other transactions trigger a retry.
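
The race can be modeled with a toy optimistic-concurrency store. This is an illustration only, not the UniversalDB API; conflict detection is simplified to "the store advanced past my read version":

```rust
// Toy model of the snapshot-vs-serializable race described above.
use std::collections::HashMap;

struct Store {
    data: HashMap<String, u64>,
    version: u64, // bumped on every successful commit
}

struct Txn {
    read_version: u64,
    conflict_keys: Vec<String>, // keys read serializably
    writes: Vec<(String, u64)>,
}

impl Store {
    fn begin(&self) -> Txn {
        Txn { read_version: self.version, conflict_keys: vec![], writes: vec![] }
    }
    fn read_snapshot(&self, _txn: &Txn, key: &str) -> u64 {
        *self.data.get(key).unwrap_or(&0) // registers no conflict key
    }
    fn read_serializable(&self, txn: &mut Txn, key: &str) -> u64 {
        txn.conflict_keys.push(key.to_string());
        *self.data.get(key).unwrap_or(&0)
    }
    /// Fails if anything committed after this txn's read version and the
    /// txn registered conflict keys (simplified: any intervening commit).
    fn commit(&mut self, txn: Txn) -> bool {
        if !txn.conflict_keys.is_empty() && self.version != txn.read_version {
            return false; // conflict: caller retries with a fresh read
        }
        for (k, v) in txn.writes {
            self.data.insert(k, v);
        }
        self.version += 1;
        true
    }
}

fn main() {
    let mut store = Store { data: HashMap::new(), version: 0 };
    store.data.insert("head_txid".into(), 5);
    store.version = 1;

    // Commit path advances head 5 -> 6; compaction path (snapshot read)
    // preserves the stale head = 5 from its snapshot.
    let mut commit_txn = store.begin();
    let compact_txn_stale = {
        let mut t = store.begin();
        let stale = store.read_snapshot(&t, "head_txid");
        t.writes.push(("head_txid".into(), stale));
        t
    };
    let head = store.read_serializable(&mut commit_txn, "head_txid");
    commit_txn.writes.push(("head_txid".into(), head + 1));
    assert!(store.commit(commit_txn)); // head is now 6
    // No conflict key was registered, so the stale write lands and
    // rewinds head back to 5 -- the bug.
    assert!(store.commit(compact_txn_stale));
    assert_eq!(store.data["head_txid"], 5);

    // With serializable reads the second committer is forced to retry.
    let mut a = store.begin();
    let mut b = store.begin();
    let ha = store.read_serializable(&mut a, "head_txid");
    a.writes.push(("head_txid".into(), ha + 1));
    let hb = store.read_serializable(&mut b, "head_txid");
    b.writes.push(("head_txid".into(), hb));
    assert!(store.commit(a));
    assert!(!store.commit(b)); // conflict detected
    println!("ok");
}
```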

Adds regression tests in the v2 VFS Direct-engine harness:
- autocommit_inserts_maintain_head_txid_consistency
- autocommit_survives_close_reopen
- concurrent_multi_actor_autocommits
- stress_concurrent_multi_actor_autocommits

Local bench against the engine now passes Insert single x100/x1000/x10000
(autocommit workloads) that previously timed out or fence-mismatched.

The bench still has occasional fence errors under heavy concurrent
multi-actor load. Those do not reproduce in the Direct engine stress
test (5 actors x 200 interleaved commits), so the remaining issue is
not in the storage layer and needs separate investigation.

Both the RocksDB and Postgres drivers called
`substitute_versionstamp_if_incomplete` inside `Operation::Set` paths,
which reads the last 4 bytes of the value as an "offset" and
unconditionally truncates them before checking whether the value is
actually a versionstamped payload. The helper then swallows the
validation error, so a value whose trailing 4 bytes happen to form a
plausible u32 pointing at a 0x33 byte gets silently corrupted:
4 bytes lost, no error surfaced.

This showed up in the v2 SQLite slow path: large transactions (~50 MiB)
split into 7-8 MiB staged chunks, UDB chunks the serde_bare blob into
10 KiB physical chunks, and a small fraction of those chunks have
trailing bytes that look like a versionstamp offset. One corrupted
chunk is enough to break serde_bare decoding at commit_finalize time
with "unexpected end of input", aborting the commit.

Versionstamp substitution only belongs on SetVersionstampedKey and
SetVersionstampedValue mutation types, which go through AtomicOp. Plain
`Set` must preserve the bytes verbatim, so this change removes the
substitution from both driver Set paths and drops the now-unused import.
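
The failure mode can be reproduced with a toy reconstruction of the buggy helper. This hypothetical version only mimics the buggy shape (read trailing 4 bytes as an offset, validate loosely, truncate); the real function and its byte layout live in the drivers:

```rust
// Toy reconstruction of the silent-truncation bug described above.
// Not the real driver code; endianness and validation are assumptions.
fn buggy_substitute(value: &[u8]) -> Vec<u8> {
    if value.len() < 4 {
        return value.to_vec();
    }
    let (payload, tail) = value.split_at(value.len() - 4);
    let offset = u32::from_le_bytes(tail.try_into().unwrap()) as usize;
    // "Validation": does the offset point at a plausible marker byte?
    // If so, assume a versionstamped payload and drop the trailing
    // offset bytes -- silently corrupting plain Set values.
    if offset < payload.len() && payload[offset] == 0x33 {
        return payload.to_vec(); // 4 bytes lost, no error surfaced
    }
    value.to_vec()
}

fn main() {
    // A plain chunk whose byte 2 happens to be 0x33 and whose trailing
    // 4 bytes happen to decode to offset 2: silently shortened by 4.
    let chunk = vec![0x00, 0x01, 0x33, 0x04, 0x05, 0x02, 0x00, 0x00, 0x00];
    assert_eq!(buggy_substitute(&chunk).len(), chunk.len() - 4);

    // A chunk whose tail doesn't decode to a plausible offset survives.
    let safe = vec![0x00, 0x01, 0x02, 0xFF, 0xFF, 0xFF, 0xFF];
    assert_eq!(buggy_substitute(&safe), safe);
    println!("ok");
}
```

With ~5,000 physical chunks per 50 MiB transaction, even a tiny per-chunk probability of this coincidence makes corruption near-certain at that scale, which is why the 50MB bench was the reliable reproducer.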

Adds bench-parity Direct-engine tests in
rivetkit-typescript/packages/sqlite-native/src/v2/vfs.rs so kitchen-sink
bench workloads (large TX insert 500KB/10MB/50MB, churn insert/delete,
mixed OLTP, bulk update 1000 rows, truncate + regrow, many small
tables, index creation on 10k rows, growing aggregation, insert TX
x10000) run against the real engine without the full stack. The 50MB
test is the minimal reproducer for this bug and failed reliably before
the fix.
@claude

claude bot commented Apr 17, 2026

SQLite VFS v2 Review

Overview

This PR introduces a new server-side SQLite storage backend (engine/packages/sqlite-storage) for Rivet Actors, extending the envoy protocol to v2 with SQLite-specific message types. The design uses a DELTA/SHARD log-structured approach with generation-based writer fencing, LTX v3 encoding, and a background compaction coordinator. The actor-side v2 VFS implementation lives in the native Rust module.

This is a large, well-structured implementation. Test coverage is genuinely good — fault injection hooks, concurrency tests, and per-module unit tests all verify the failure modes that matter most.


Issues

1. AI development artifacts committed to the repo

scripts/ralph/ contains PRD docs, user story reviews, progress tracking, and a shell script that appear to be AI agent working files from the development process. Similarly, .claude/scheduled_tasks.lock is a tool artifact. Neither belongs in the final PR.

2. Dead allocation in compaction/shard.rs:136

let _shard_txids = shard_rows
    .iter()
    .map(|row| row.txid)
    .collect::<BTreeSet<_>>();

This BTreeSet is computed and immediately discarded. It was likely used in an earlier version of the merge algorithm. Remove it.

3. Missing limits documentation

CLAUDE.md requires that limit changes be reflected in website/src/content/docs/actors/limits.mdx. The PR introduces SQLITE_MAX_DELTA_BYTES = 8 MiB as the per-commit payload limit on the fast path, and SqliteMeta.max_delta_bytes is exposed to actors. The existing limits page does not document this per-commit constraint, which is a practical ceiling actors will hit. The 10 GiB storage cap was already documented.

4. Hardcoded compaction threshold with no constant or comment

takeover.rs:86: should_schedule_compaction = recovery_plan.live_delta_count >= 32;

The value 32 is unexplained. A named constant (e.g. COMPACTION_DELTA_THRESHOLD) or a brief comment would make the intent clear.

5. get_or_load_pidx TOCTOU under concurrent takeovers

Entry::Vacant(entry) => {
    drop(entry); // releases the lock — a second caller can race here
    let index = DeltaPageIndex::load_from_store(...).await?;
    // re-enter; second insert wins, first is discarded
}

Two concurrent takeovers for the same actor will both see Vacant, both scan the PIDX prefix, and the second insert wins. This is not a correctness issue (both reads are identical), but it can double the DB round-trips during concurrent takeovers. A comment explaining the trade-off would help the next reader.
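
The benign-TOCTOU trade-off can be sketched with a simplified Mutex-guarded cache (the real code uses scc::HashMap and an async load; names here are illustrative):

```rust
// Minimal sketch of the check-then-load pattern described above,
// assuming a synchronous stand-in for load_from_store.
use std::collections::HashMap;
use std::sync::Mutex;

fn get_or_load(cache: &Mutex<HashMap<u32, String>>, key: u32, loads: &mut u32) -> String {
    if let Some(v) = cache.lock().unwrap().get(&key) {
        return v.clone();
    }
    // Lock released here: a concurrent caller can also reach this point
    // and perform a redundant load. Both loads return identical data,
    // so the duplication costs round-trips, not correctness.
    *loads += 1;
    let index = format!("pidx-{key}"); // stands in for load_from_store
    cache.lock().unwrap().entry(key).or_insert_with(|| index.clone());
    index
}

fn main() {
    let cache = Mutex::new(HashMap::new());
    let mut loads = 0;
    assert_eq!(get_or_load(&cache, 7, &mut loads), "pidx-7");
    assert_eq!(get_or_load(&cache, 7, &mut loads), "pidx-7");
    assert_eq!(loads, 1); // second call hits the cache
    println!("ok");
}
```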

6. Fast-path commit sends to compaction_tx unconditionally

Every successful commit() enqueues the actor_id. For a high-frequency actor writing small transactions the unbounded channel will accumulate many identical signals between compaction passes. The coordinator already deduplicates by skipping new workers while one is running (spawn_worker_if_needed), so correctness is fine. Consider only signaling when the in-memory delta count crosses a threshold (e.g. matching the >= 32 check in takeover) to reduce channel pressure.
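
The suggested gating could look roughly like this; COMPACTION_DELTA_THRESHOLD and ActorState are hypothetical names, and the threshold mirrors the >= 32 check in takeover.rs:

```rust
// Hypothetical sketch of threshold-gated compaction signaling.
const COMPACTION_DELTA_THRESHOLD: usize = 32;

struct ActorState {
    live_delta_count: usize,
}

/// Only enqueue the actor when the delta log is long enough to be
/// worth compacting, instead of signaling on every commit.
fn should_signal_compaction(state: &ActorState) -> bool {
    state.live_delta_count >= COMPACTION_DELTA_THRESHOLD
}

fn main() {
    assert!(!should_signal_compaction(&ActorState { live_delta_count: 5 }));
    assert!(should_signal_compaction(&ActorState { live_delta_count: 32 }));
    println!("ok");
}
```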


Nits

  • error.rs uses thiserror directly instead of the #[derive(RivetError)] pattern from rivet_error::* described in CLAUDE.md. If sqlite-storage is intentionally exempt as a lower-level crate, a note in its CLAUDE.md or a comment would prevent future confusion.

  • sqlite_runtime.rs uses a process-wide OnceCell<Arc<SqliteEngine>> with a fixed subspace. A doc comment noting the singleton invariant (and that it prevents multiple engines per process) would help avoid surprises in future integration tests.

  • EMPTY_DB_PAGE_HEADER_PREFIX in lib.rs is a 108-byte magic constant with no comment explaining what it represents or how it was derived. Even a one-line note linking it to the SQLite file format spec would help the next reader.


Positives

  • Correct use of tx_get_value_serializable vs tx_get_value with inline documentation explaining why conflict-range registration matters for fence checks.
  • Fault-injection test hooks (fail_next_fast_commit_write, pause_before_commit) cover the failure modes that matter most.
  • Batch compaction amortizes PIDX and DELTA scans across a full batch — the op-count assertion in the worker test makes this contract explicit and machine-checked.
  • Generation fencing and concurrent-takeover detection are solid. The compare-and-swap pattern in takeover.rs is clean.
  • encode_db_head_with_usage convergence loop for self-referential size accounting is correct.
  • Proper use of scc::HashMap throughout per project conventions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
