
refactor(uffd): drain pagefault events into a batch per poll cycle #2445

Draft
ValentaTomas wants to merge 1 commit into main from refactor/uffd-batch-read-events

Conversation


@ValentaTomas ValentaTomas commented Apr 18, 2026

Extracted from #1896. Pure structural refactor of the UFFD event loop, no behaviour change.

`Userfaultfd.Serve` previously read exactly one event per poll/read cycle. This change drains all queued events into a slice (until EAGAIN) and processes them as a batch in the same iteration.

Why: the upcoming UFFD_EVENT_REMOVE handling needs REMOVE and PAGEFAULT events from the same poll cycle delivered together, so REMOVEs can be applied to the page tracker before the matching faults run.

Same EINTR/EAGAIN handling, same WRITE/read branching, same wg.Go dispatch, same faultPage signature, same eagain/noData counters. The only structural addition is the outer for _, pagefault := range pagefaults loop, which accounts for most of the diff (indentation churn).

Note: the arm64-tests / packages/shared failure is the unrelated TestChangeResponseHeader flake fixed by #2443; this PR doesn't touch packages/shared.


cursor bot commented Apr 18, 2026

PR Summary

Medium Risk
Refactors the Userfaultfd.Serve event loop in a low-level memory fault handler; while intended as behavior-preserving, subtle differences in EAGAIN/EINTR handling or batch ordering could affect page-fault processing under load.

Overview
Userfaultfd.Serve now drains all queued UFFD_EVENT_PAGEFAULT messages into a batch (reading until the non-blocking read returns EAGAIN) and then processes that batch in one poll cycle, instead of handling only a single event per poll/read iteration. Counter logging and error handling were adjusted accordingly, and the unexpected-flag comment was clarified to document why synchronous WP/MINOR faults aren't expected.

Reviewed by Cursor Bugbot for commit a1a4dae. Bugbot is set up for automated code reviews on this repo.

@ValentaTomas ValentaTomas force-pushed the refactor/uffd-batch-read-events branch from ce956f4 to 0c877e1 Compare April 18, 2026 21:58
@ValentaTomas ValentaTomas changed the base branch from feat/firecracker-v1.14 to main April 18, 2026 21:58

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce956f4c94


Comment on lines +133 to +134
if n == 0 {
break

P1: Return EOF from readEvents instead of silently breaking

In readEvents, n == 0 is handled as a normal loop termination, but for userfaultfd this indicates EOF/closed descriptor (the userfaultfd(2) example treats this as a fatal condition). Returning an empty batch here makes Serve continue its poll loop without surfacing closure, which can leave the handler stuck (or spinning) after the uffd is closed during teardown/failure paths. This should be propagated as an explicit error or shutdown signal.



msg := (*UffdMsg)(unsafe.Pointer(&buf[0]))

if event := getMsgEvent(msg); event != UFFD_EVENT_PAGEFAULT {

When a non-PAGEFAULT event is encountered mid-batch, `return nil, ErrUnexpectedEventType` silently discards any pagefaults already accumulated in `pagefaults`. Once UFFD_EVENT_REMOVE is registered in the follow-up PR, a batch like [PF1, PF2, REMOVE, PF3] will cause readEvents to drop PF1 and PF2 before the caller ever sees them — those guest memory accesses will never be resolved, hanging the VM.

Consider returning the partial slice alongside the error: return pagefaults, ErrUnexpectedEventType, and let Serve decide whether to flush what was collected before propagating the error.

Comment on lines +130 to +137
// Read returned 0 bytes, meaning the writing end has been closed.
// This should not happen in practice unless something (us or
// Firecracker) has closed the fd.
if n == 0 {
break
}

msg := (*UffdMsg)(unsafe.Pointer(&buf[0]))

🔴 In readEvents(), when syscall.Read returns n=0 (EOF / fd closed), the function silently breaks and returns (nil, nil) instead of propagating an error. Back in Serve(), unix.Poll() on a closed uffd fd keeps returning POLLIN (and POLLHUP), but POLLHUP only increments a counter and never causes Serve() to return—resulting in a 100% CPU spin loop with no diagnostic output and no clean shutdown. Fix by returning a sentinel error (e.g. io.EOF or a named ErrFdClosed) when n==0 so Serve() can exit cleanly.

Extended reasoning...

The bug: silent swallowing of EOF in readEvents().

When the userfaultfd file descriptor is closed (e.g. because Firecracker exits or crashes), the kernel raises POLLHUP on the fd. The POLLIN bit may also be set depending on kernel version and whether the waitqueues are still active. In Serve(), the POLLHUP flag is handled only by incrementing uffdErrorCounter—it never causes Serve() to return or break the loop. If POLLIN is not set, the !hasEvent check fires a continue, spinning back to poll() immediately; poll() returns POLLHUP again; infinite loop at 100% CPU.

If POLLIN is set alongside POLLHUP, readEvents() is called. Inside readEvents(), syscall.Read returns n=0, err=nil (the standard EOF behavior for a closed fd). The code reaches the if n == 0 { break } guard, breaks out of the drain loop, and returns (pagefaults, nil)—an empty slice with no error.

Why the existing code does not prevent the spin.

Serve() receives (nil, nil) from readEvents(). The for _, pf := range pagefaults loop is a no-op on an empty slice. Control falls through to the end of the outer for loop and jumps back to unix.Poll(). Since the fd closure is permanent, poll() immediately returns POLLIN (or POLLHUP) again on the next call. This repeats indefinitely, consuming a full CPU core with no log output and no way to observe that anything is wrong.

Step-by-step proof.

  1. Firecracker exits; the kernel closes its handle on the uffd fd.
  2. unix.Poll(pollFds, -1) returns immediately with uffdFd.Revents = POLLIN | POLLHUP.
  3. uffdErrorCounter.Increase(POLLHUP) is called; Serve() does not return.
  4. hasEvent(uffdFd.Revents, unix.POLLIN) is true → readEvents(ctx) is called.
  5. syscall.Read returns n=0, err=nil → the if n == 0 { break } fires → readEvents returns (nil, nil).
  6. for _, pf := range nil is a no-op → outer loop falls through to step 2.
  7. Steps 2–6 repeat without pause at 100% CPU.

The only external escape would be fdExit.Reader() independently signaling POLLIN on pollFds[1], but if Firecracker crashed without the orchestrator's process monitor firing Stop(), that signal never arrives.

How to fix.

Return a sentinel error from readEvents() when n==0:

Serve() already returns on any non-nil error from readEvents(), so this single change is sufficient to break the spin. Optionally add a u.logger.Warn before returning so operators can observe the event in logs.

@ValentaTomas ValentaTomas marked this pull request as draft April 18, 2026 22:16
@ValentaTomas ValentaTomas force-pushed the refactor/uffd-batch-read-events branch from 0c877e1 to 4c356cc Compare April 18, 2026 22:19
@ValentaTomas ValentaTomas changed the title refactor(uffd): drain pagefault events into a batch via readEvents() refactor(uffd): drain pagefault events into a batch per poll cycle Apr 18, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 4c356cc.

// There is no error so we can proceed.

eagainCounter.Log(ctx)
noDataCounter.Log(ctx)

eagainCounter now always outputs a constant value of 1

Low Severity

eagainCounter lost its diagnostic value in this refactor. Every drain loop now terminates with exactly one EAGAIN (the normal "queue empty" signal), so Increase is called once, then Log immediately prints and resets. The counter always outputs count: 1 on every poll cycle, producing constant debug-log noise with zero information. The PR description itself notes this counter "no longer makes sense" and states it is dropped, but it was retained.

Additional Locations (1)


@ValentaTomas ValentaTomas force-pushed the refactor/uffd-batch-read-events branch from 4c356cc to d76e1ff Compare April 18, 2026 22:52
Instead of reading exactly one UFFD event per poll/read cycle, drain all
queued events into a slice (until EAGAIN) and process them as a batch.

Structural prerequisite for upcoming UFFD_EVENT_REMOVE handling: REMOVE and
PAGEFAULT events for the same address need to be delivered from the same
poll cycle so REMOVEs can be applied to the page tracker before the
matching faults run.

No behaviour change for the current PAGEFAULT-only registration: the same
EINTR/EAGAIN handling, the same WRITE/read branching, the same `wg.Go`
dispatch, the same `faultPage` signature, and the same eagain/noData
counters.
@ValentaTomas ValentaTomas force-pushed the refactor/uffd-batch-read-events branch from d76e1ff to a1a4dae Compare April 18, 2026 23:18
