
perf(nbd): zero-alloc, zero-copy synchronous dispatch #2395

Draft
ValentaTomas wants to merge 5 commits into main from perf/zero-alloc-nbd

Conversation

@ValentaTomas
Member

Summary

  • Replace async goroutine-per-request NBD dispatch with a synchronous loop using mmap slices directly
  • Reads: Overlay.ReadSlices returns per-block mmap slices (overlay cache or template cache), written to the NBD socket via writev (net.Buffers) — zero user-space copies, zero allocations after warmup
  • Writes: Cache.WriteSlice returns a writable mmap reference; socket data is read directly into the mmap page — zero user-space copies, zero allocations
  • Removes per-request: goroutines, channels, sync.Pool, write lock, WaitGroup, and shutting-down machinery

Profile context

From profiling on n2-highmem-80 nodes, NBD dispatch was the dominant allocator after the multipart upload fix (#2394):

Path                              Cumulative allocs   What it was
Dispatch.Handle (write buffers)   57 TB (47%)         make([]byte, request.Length) per write
cmdRead.func1 (read buffers)      20 TB (16%)         make([]byte, length) per read

This PR eliminates both — zero allocations after the first request at each size.

How it works

Reads: Instead of allocating a buffer, copying mmap data into it, then writing to socket — we collect per-block mmap slice references and send them via writev in a single syscall.

Writes: Instead of allocating a buffer, reading socket data into it, then copying into mmap — we get a writable reference to the overlay cache mmap and read socket data directly into it.

Parallelism is preserved via FlagCanMulticonn (multiple NBD socket connections per device), rather than per-request goroutines within a connection.

Test plan

  • go build ./... passes
  • go vet passes
  • Existing NBD tests pass (require root / nbd kernel module)
  • Profile on staging to verify allocation elimination

@cursor

cursor bot commented Apr 14, 2026

PR Summary

Medium Risk
Touches the hot-path NBD I/O read/write flow and introduces lock-held writable slices, so bugs could cause data corruption, deadlocks, or kernel-visible I/O errors if alignment/length or unlock/commit handling is wrong.

Overview
Reworks NBD dispatch to avoid per-request data buffers by reading and writing directly against mmap-backed cache slices: reads now gather block-aligned slices from cache/device (Overlay.ReadSlices) and stream them straight to the NBD socket, while writes read socket data directly into a writable mmap window (Cache.WriteSlice/Overlay.WriteFrom). This updates the NBD backend interface to these slice-based APIs, shrinks the dispatch read buffer to header-sized, and adjusts test utilities and the slow-backend test to exercise slice-based reads with proper context cancellation.

Reviewed by Cursor Bugbot for commit f2ca90c. Bugbot is set up for automated code reviews on this repo.


// done(false) releases the write lock without marking dirty (use on write failure).
func (c *Cache) WriteSlice(off, length int64) ([]byte, func(bool), error) {
c.mu.Lock()

Cache mutex held during socket I/O. c.mu.Lock() is acquired here and not released until done() is called in dispatch.go after the entire fp.Read loop completes (line 195). The mutex is held for the full duration of reading the write payload off the socket. With FlagCanMulticonn, multiple Dispatch goroutines share the same Provider. While one goroutine is blocked on slow socket data holding c.mu, every other connection's reads and writes stall on this lock. The old design held the lock only briefly inside WriteAt, after socket data was already buffered locally. The fix is to read socket data into a temp buffer first, then acquire the lock only to copy into and mark the mmap.

func (o *Overlay) ReadSlices(ctx context.Context, off, length int64, dest [][]byte) ([][]byte, error) {
dest = dest[:0]
numBlocks := length / o.blockSize

Silent truncation for non-blockSize-aligned lengths. numBlocks := length / o.blockSize discards trailing bytes when length is not a multiple of blockSize, with no error. If a non-aligned length is passed, writeReadResponse sends fewer bytes than requested and the kernel NBD driver treats this as an I/O error. Kernel NBD requests are block-aligned in practice, but a guard returning an error for non-aligned lengths would surface any future misuse immediately.
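
A guard of the kind this comment proposes might look like the following; `checkAligned` is a hypothetical helper, not code from the PR.

```go
package main

import "fmt"

// checkAligned rejects requests whose offset or length is not a
// multiple of the block size, instead of letting
// numBlocks := length / blockSize silently drop a partial trailing
// block and send a short (kernel-visible as I/O error) response.
func checkAligned(off, length, blockSize int64) error {
	if off%blockSize != 0 {
		return fmt.Errorf("offset %d not aligned to block size %d", off, blockSize)
	}
	if length%blockSize != 0 {
		return fmt.Errorf("length %d not aligned to block size %d", length, blockSize)
	}
	return nil
}

func main() {
	fmt.Println(checkAligned(8192, 4096, 4096)) // aligned → <nil>
	fmt.Println(checkAligned(8192, 4097, 4096)) // partial block → error
}
```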

Replace async goroutine-per-request NBD dispatch with a synchronous
loop using mmap slices directly.

Reads: Overlay.ReadSlices returns per-block mmap slices, written to the
NBD socket via writev (net.Buffers) — zero user-space copies.

Writes: socket data is read into a reusable buffer, then copied into
the mmap under a brief lock via Cache.WriteSlice.

Removes per-request: goroutines, channels, sync.Pool, write lock,
WaitGroup, and shutting-down machinery.
…ignment

Restore the 32 MB dispatchMaxWriteBufferSize check removed in the
zero-alloc rewrite — a corrupted NBD header could cause an unbounded
allocation and OOM.

Add offset alignment validation in Overlay.ReadSlices to match the
existing length check.
ReadSlices calls device.Slice instead of ReadAt, so the SlowDevice
test wrapper must delay Slice as well to trigger the kernel NBD
I/O timeout in TestSlowBackend_ShortTimeout.
Copy mmap slice data into a reusable contiguous buffer before writing
the NBD read response, matching the old ReadAt behavior. This avoids
potential issues with sending mmap references directly while the
underlying data could be modified by concurrent operations.
@ValentaTomas force-pushed the perf/zero-alloc-nbd branch 3 times, most recently from 5c06f2c to f0ec082 on April 16, 2026 at 07:52
Replace per-request buffer allocations in NBD dispatch with zero-copy
mmap slice access. Reads use ReadSlices to get per-block mmap references
and copy into a pooled buffer; writes use WriteSlice to get a writable
mmap reference and copy socket data directly in.

Keeps the existing goroutine-per-request dispatch structure unchanged.

@cursor (cursor bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

}

return dest, nil
}

Mmap slices read/written without lock protection cause data race

Medium Severity

ReadSlices calls cache.Slice() without holding any lock, and the returned mmap slice references are later written to the socket by writeSlices. Previously, Overlay.ReadAt went through Cache.ReadAt which held c.mu.RLock(), providing mutual exclusion with writes via RWMutex. Now, a concurrent WriteFrom on another connection (multiconn) can modify the same mmap pages while writeSlices is reading them, producing torn data in the NBD response.



c.mu.Unlock()
}, nil
}

Exclusive cache lock held during entire socket read

Medium Severity

Cache.WriteSlice acquires the exclusive write lock (c.mu.Lock()), and Overlay.WriteFrom calls io.ReadFull from a network socket before releasing it via the done callback. For writes up to 32 MB, this holds the exclusive lock for the entire duration of the socket read. The old code read socket data into a temporary buffer first, then held the lock only for a brief memcpy. This serializes all lock-dependent cache operations (Close, ExportToDiff, other WriteAt calls) across all multiconn connections during potentially slow I/O.

