uffd: add support for UFFD_EVENT_REMOVE events#1896
Open
bchalios wants to merge 32 commits intofeat/firecracker-v1.14from
Open
uffd: add support for UFFD_EVENT_REMOVE events#1896bchalios wants to merge 32 commits intofeat/firecracker-v1.14from
bchalios wants to merge 32 commits intofeat/firecracker-v1.14from
Conversation
aef7cd8 to
eab27ea
Compare
a1d0a02 to
03e2966
Compare
b766eb3 to
4d197e4
Compare
e639acf to
88420f2
Compare
d2e5d09 to
8ea97e4
Compare
When guest memory is backed by 4K pages, Firecracker backs it with anonymous pages. Anonymous pages can only be write-protected when there is page entry present. This means that until we receive a fault for it a page is not write-protected. This is fine. Not present (not faulted) pages cannot be dirty, by definition. When handling page faults for known pages, we use UFFDIO_COPY with write-protection mode, which automatically sets write-protection for the faulted page. However, imagine this sequence: 1. We resume from a snapshot 2. We receive a UFFD_EVENT_REMOVE for a range that was not previously faulted in. 3. We receive a page fault for a page within the removed region. The correct way to handle this is providing the zero page and that's what we do. However, UFFDIO_ZERO does not have a write-protected mode, so the newly setup page is not write-protected tracked. Such a page will always be reported dirty from Firecracker (present and !write-protected) and it will be needlessly included in the snapshot. Handle this by explicitly marking such pages as write-protected. Handle the race condition by providing the zero page without waking up the thread, marking it write protected and then waking up the thread. Note, that huge pages don't have this issue because: 1. Hugetlbfs-backed pages are write-protected by Firecracker 2. we always handle huge page faults using UFFDIO_COPY with write-protection mode. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Balloon devices provide memory reclamation facilities through free page reporting, as a more efficient mechanism (in terms of latency and CPU time) than traditional ballooning. Free page reporting instructs the guest to periodically report memory that has been freed, so that we can reclaim it in the host side. It is enabled before starting the sandbox and does not require any further host-side orchestration. Enable free page reporting for all new templates using Firecracker versions >= v1.14.0. Also, allow users to optionally disable it for these versions. Older Firecracker versions don't support the feature. Trying to enable it for those, will return an error. Co-authored-by: Valenta Tomas <valenta.and.tomas@gmail.com> Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Test that if we always copy with WP on, write-protection will be handled correctly by the kernel. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add machinery for being able to coordinate the steps of the helper process that operates the UFFD serve loop. Then add tests for various scenarios of UFFD remove and page fault events. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Having 1M parallel operations sometimes causes intermittent issues in tests. Use 10K instead as used in other tests. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
We are handling page fault events within go routines without much synchronization across across iterations of the serve loop (apart from when we are about to shut down the loop). This works fine when we only have page fault events, but it might be problematic with remove events in the mix. When we handle a remove event, we need to make sure that no page fault handling Go routine is still runing for the same page, as this cuases a race condition for the book keeping of the page state. Handle this, by ensuring that all outstanding Go routines have completed before handling remove events. Do that only when we do have remove events in the queue. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Previous commit added logic to wait for all Go routines handling page faults from previous iterations before handling remove events. There is one more race condition, though, between handling remove events and prefaulting operations. The previous fix did not handle this one. Re-use the settleRequests RW Mutex to instead of waiting the errorgroup. All pagefault operations are taking the read lock, so page- and pre-faulting can happen concurrently. Handling remove events takes the write lock, so no pre/page-faulting can happen while we handle those. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
In crossProcessServe(), `cleanup` is initially set to a function that
signals the original fdExit and drains exitUffd (a buffered channel of
capacity 1).
When a gated test issues a pause ('P'), stopServe() calls the original
cleanup, which drains exitUffd. On resume ('R'), startServe() creates a
new fdExit and Serve goroutine, and reassigns both `stopServe` and
`cleanup` to a new stop function.
However, `defer cleanup()` captures the *value* of cleanup at the time
the defer statement executes — the original closure — not the variable
itself. So when SIGUSR1 arrives and crossProcessServe returns, the
deferred call invokes the original cleanup, which blocks forever on
<-exitUffd because that channel was already drained during the pause
step. The subprocess never exits, cmd.Wait() in the parent test hangs,
and the parallel test slot is never released, causing unrelated tests
waiting at t.Parallel() to time out.
Fix by deferring through the variable so the call always dispatches to
whatever cleanup currently points to at return time:
defer func() { cleanup() }()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
readEvents stored pointers into a loop-local [24]byte arg variable. Since Go may reuse the same stack slot across iterations, all pointers could alias the last event's data when multiple events arrive in one batch (e.g. gated tests). This caused the handler to serve faults at wrong addresses, leaving faulting threads blocked — producing the 11-minute test timeout in CI. Also fix HasHugePages/HasFreePageReporting semver checks to correctly handle major versions > 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
There is a race between goroutine fault handlers and the main serve loop: a goroutine can get EAGAIN from the ioctl (due to a concurrent VMA change on a different page), defer its fault, and push it to the deferred queue AFTER the main loop has already drained that queue and re-entered poll(). If no new UFFD events arrive, poll blocks forever and the deferred fault is never retried — the faulting guest thread hangs indefinitely. Fix: add a self-pipe (wakeupPipe) monitored by poll(). When a goroutine defers a fault, it writes a byte to the pipe, guaranteeing poll() returns and the deferred fault gets drained on the next iteration. Also restructured the serve loop so that uffd reads only happen when POLLIN is set on the uffd fd (not unconditionally), and deferred faults are always drained regardless of which fd triggered the wakeup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
madvise(MADV_DONTNEED) returns immediately but the UFFD_EVENT_REMOVE is delivered asynchronously. Without a sleep, pageStatesOnce() may read state before the handler has processed the remove event. The gated tests already had sleeps; the non-gated TestRemove tests did not, causing flaky "4k selective remove" failures. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
The 4K removed-page read path uses a three-step sequence: zero(DONTWAKE) → writeProtect(WP) → wake(). If step 1 succeeds but step 2 returns EAGAIN (concurrent REMOVE), the page is mapped but the guest thread stays asleep. On retry, zero() returns EEXIST and we returned early without calling wake(), leaving the thread blocked forever. Fix by unconditionally calling wake() on EEXIST. If the thread is sleeping due to DONTWAKE this unblocks it; if already awake the wake is a harmless no-op. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
… test The cross-process UFFD test helper was spawned without -test.timeout=0, so Go's default 10m timeout killed it before the parent's 20m timeout, leaving madvise/page-fault syscalls permanently blocked. Also update Firecracker v1.14 version to v1.14.1_76f16f0 (the 458ca91 release does not exist) and enable FreePageReporting in the smoke test for FC versions >= v1.14 so the full balloon→REMOVE→zero-fill path is exercised in CI.
…t/free-page-reporting # Conflicts: # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
…t/free-page-reporting # Conflicts: # packages/orchestrator/template-manager.proto
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 3e1a67d. Configure here.
…t/free-page-reporting
This field lives on feat/free-page-reporting only. It was temporarily dropped on feat/firecracker-v1.14 so that branch can merge cleanly to main without exposing a half-wired API field. Re-add it here (and regenerate pb.go) so this branch retains ownership of the proto change, and future main→feat/free-page-reporting merges don't conflict on this file.
This was referenced Apr 18, 2026
… hash Previously `parts[1]` was accessed unconditionally after `strings.Split(fcVersion, "_")`. Any version string without an underscore (e.g. a bare "v1.10.1") would panic with index out of range. The commitHash field is never read anywhere, so the simplest fix is a bounds check before assignment. Mirrors the same fix in #2436 so this branch stays conflict-free regardless of merge order.
…t/free-page-reporting # Conflicts: # packages/api/internal/sandbox/sandbox_features.go # packages/orchestrator/pkg/sandbox/uffd/memory/mapping.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/prefault.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
This was referenced Apr 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Enabling Firecracker free-page-reporting feature requires us to handle remove events (UFFD_EVENT_REMOVE) in our userfaultfd handler. These events are triggered whenever Firecracker calls madvise(MADV_DONTNEED) (or similar) on a range of guest memory addresses.
The main thing that changes on our logic is that page faults in a page that has previously been removed need to be served with a zero page rather than a page from the snapshot file.
This commit changes the page fault serving logic to:
This is dependent on the part of #1858 that enables free page reporting on the Firecracker side.
Note
High Risk
High risk because it changes low-level userfaultfd event handling and page population semantics (including new ioctls and concurrency/defer logic), which can affect VM stability and memory correctness. It also introduces a new template/sandbox configuration flag that changes runtime behavior based on Firecracker version and feature flags.
Overview
This PR upgrades the default Firecracker version to
v1.14.1and introduces opt-in free page reporting, wiring a newfreePageReportingsetting from template creation through template build/sandbox startup to Firecracker balloon configuration. To support this, the userfaultfd handler is reworked to read and processUFFD_EVENT_REMOVEevents, track per-page state, zero-fill pages after removal, and defer/retry faults that race with removes, with tests expanded to validate remove, write-protect, and gated ordering scenarios.Written by Cursor Bugbot for commit b35c9f7. This will update automatically on new commits. Configure here.