
fix(spawn): prevent nested spawnSync from corrupting event loop handle #27993

Draft

ant-kurt wants to merge 2 commits into main from spawnsync-nested-event-loop-fix

Conversation

@ant-kurt (Collaborator) commented:

What does this PR do?

Fixes a bug where nested spawnSync calls corrupt vm.event_loop_handle, causing the process to exit unexpectedly when the main event loop becomes orphaned.

Root cause

SpawnSyncEventLoop (introduced in #24436) is a singleton that saves vm.event_loop_handle in prepare() and restores it in cleanup(). If spawnSync is called recursively — which can happen when an async subprocess's completion callback runs during the isolated loop's tick and itself calls spawnSync — the inner prepare() overwrites the singleton's original_event_loop_handle with the already-overridden value. Both cleanups then restore to the isolated loop instead of the main loop.

The result: stdin, timers, and async subprocess sockets (all registered on the main loop) are orphaned. Once pending work drains, the process exits with code 0.
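A minimal sketch of the root cause, using hypothetical Go names (the real code is the Zig singleton in SpawnSyncEventLoop.zig; Go's `defer` runs in LIFO order like Zig's, which makes the cleanup ordering faithful). The point is that the singleton's saved field is shared across nesting levels, so the inner prepare clobbers the outer call's saved value:

```go
package main

import "fmt"

var eventLoopHandle = "main-loop"
var savedHandle string // singleton field: one copy for ALL nesting levels

func spawnSync(nested bool) {
	savedHandle = eventLoopHandle // inner call saves "isolated-loop", not "main-loop"
	eventLoopHandle = "isolated-loop"
	defer func() { eventLoopHandle = savedHandle }() // both cleanups restore the clobbered value
	if nested {
		spawnSync(false) // a completion callback drained by the isolated tick calls spawnSync again
	}
}

func finalHandle() string {
	spawnSync(true)
	return eventLoopHandle
}

func main() {
	fmt.Println(finalHandle()) // prints "isolated-loop": the main loop is orphaned
}
```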

Evidence

Observed on Rocky Linux 8. A perf trace of the affected process shows:

  • Two epoll_create1 calls on the main thread (main loop + isolated spawnSync loop)
  • epoll_ctl(isolated_epfd, EPOLL_CTL_DEL, fd) = -1 ENOENT — attempting to unregister async subprocess sockets from the wrong loop
  • Zero read() calls on /dev/pts/* for the entire 13-second run (stdin never polled)
  • Stack pointer from socketpair() calls shows nesting: inner spawnSync ~14KB deeper than outer on a downward-growing stack
  • Process exits cleanly via exit_group() after the last epoll_pwait on the isolated epfd returns 0

The race requires an async spawn's completion callback to be queued when spawnSync starts, then drained by the isolated loop's tick. On fast machines with newer kernels this window is too narrow to hit; on Rocky 8 (older kernel, slower syscalls, EINVAL retries on memfd_create(MFD_EXEC) and getrandom(GRND_INSECURE)) it triggers consistently.

Fix

Two complementary changes:

1. Stack-local save/restore at the call site (js_bun_spawn_bindings.zig)

Each spawnSync invocation saves vm.event_loop_handle on its own stack frame before prepare(), and restores from that stack-local after cleanup(). LIFO defer order guarantees the outermost restore runs last with the correct value, regardless of what the singleton's field contains.
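A sketch of this call-site fix under hypothetical Go names (the real change is in js_bun_spawn_bindings.zig). Each invocation keeps its own saved copy on its stack frame; since `defer` runs in LIFO order in Go as in Zig, the outermost restore runs last and reinstalls the main loop:

```go
package main

import "fmt"

var eventLoopHandle = "main-loop"

func spawnSync(nested bool) {
	saved := eventLoopHandle // stack-local: one copy per invocation
	eventLoopHandle = "isolated-loop"
	defer func() { eventLoopHandle = saved }() // LIFO: the outermost restore runs last
	if nested {
		spawnSync(false) // inner restore reinstalls "isolated-loop", harmlessly
	}
}

func finalHandle() string {
	spawnSync(true)
	return eventLoopHandle
}

func main() {
	fmt.Println(finalHandle()) // prints "main-loop"
}
```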

2. Nesting counter on the singleton (SpawnSyncEventLoop.zig)

prepare()/cleanup() only save/restore on the outermost call (nesting_depth == 0). Nested calls are no-ops. This makes the singleton independently correct even if the call-site stack-local is removed in a future refactor.
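A sketch of the nesting-counter fix, again with hypothetical Go names (the real change is in SpawnSyncEventLoop.zig). prepare()/cleanup() touch the saved handle only at depth 0, so nested calls cannot clobber it:

```go
package main

import "fmt"

var eventLoopHandle = "main-loop"

type spawnSyncLoop struct {
	saved        string
	nestingDepth int
}

var loop spawnSyncLoop // singleton

func (l *spawnSyncLoop) prepare() {
	if l.nestingDepth == 0 { // only the outermost call saves
		l.saved = eventLoopHandle
	}
	l.nestingDepth++
	eventLoopHandle = "isolated-loop"
}

func (l *spawnSyncLoop) cleanup() {
	l.nestingDepth--
	if l.nestingDepth == 0 { // only the outermost call restores
		eventLoopHandle = l.saved
	}
}

func spawnSync(nested bool) {
	loop.prepare()
	defer loop.cleanup()
	if nested {
		spawnSync(false) // nested prepare/cleanup are save/restore no-ops
	}
}

func finalHandle() string {
	spawnSync(true)
	return eventLoopHandle
}

func main() {
	fmt.Println(finalHandle()) // prints "main-loop"
}
```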

Either fix alone resolves the bug; both together give defense in depth.

Testing

TODO — this is why the PR is a draft.

The race is timing-dependent and does not trigger on fast CI machines. The fix has been verified manually on the affected Rocky 8 system (process now stays running). Still working out how to add deterministic test coverage — options being considered:

  • Test-only Zig binding that exposes vm.event_loop_handle as an integer for before/after comparison
  • Fault-injection env var that forces the isolated loop to drain the main queue
  • Artificial syscall latency to widen the race window

Feedback on preferred testing approach welcome.

How did you verify your code works?

  • Built patched bun release binary with both fixes
  • Embedded it in a downstream application that was failing on Rocky 8
  • Confirmed the application now starts and stays running in the previously-failing environment

@robobun (Collaborator) commented Mar 10, 2026

Updated 12:32 PM PT - Mar 11th, 2026

@alii, your commit 0f645d0 has 1 failure in Build #39312 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 27993

That installs a local version of the PR into your bun-27993 executable, so you can run:

bun-27993 --bun
