
fix(spawn): prevent nested spawnSync from corrupting event loop handle #27993

Draft

ant-kurt wants to merge 2 commits into main from spawnsync-nested-event-loop-fix

Conversation

@ant-kurt (Collaborator) commented:

What does this PR do?

Fixes a bug where nested spawnSync calls corrupt vm.event_loop_handle, causing the process to exit unexpectedly when the main event loop becomes orphaned.

Root cause

SpawnSyncEventLoop (introduced in #24436) is a singleton that saves vm.event_loop_handle in prepare() and restores it in cleanup(). If spawnSync is called recursively — which can happen when an async subprocess's completion callback runs during the isolated loop's tick and itself calls spawnSync — the inner prepare() overwrites the singleton's original_event_loop_handle with the already-overridden value. Both cleanups then restore to the isolated loop instead of the main loop.

The result: stdin, timers, and async subprocess sockets (all registered on the main loop) are orphaned. Once pending work drains, the process exits with code 0.
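A minimal sketch of the root cause, using hypothetical Go names (the real code is the Zig singleton in SpawnSyncEventLoop.zig; Go's `defer` runs in LIFO order like Zig's, which makes the cleanup ordering faithful). The point is that the singleton's saved field is shared across nesting levels, so the inner prepare clobbers the outer call's saved value:

```go
package main

import "fmt"

var eventLoopHandle = "main-loop"
var savedHandle string // singleton field: one copy for ALL nesting levels

func spawnSync(nested bool) {
	savedHandle = eventLoopHandle // inner call saves "isolated-loop", not "main-loop"
	eventLoopHandle = "isolated-loop"
	defer func() { eventLoopHandle = savedHandle }() // both cleanups restore the clobbered value
	if nested {
		spawnSync(false) // a completion callback drained by the isolated tick calls spawnSync again
	}
}

func finalHandle() string {
	spawnSync(true)
	return eventLoopHandle
}

func main() {
	fmt.Println(finalHandle()) // prints "isolated-loop": the main loop is orphaned
}
```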

Evidence

Observed on Rocky Linux 8. A perf trace of the affected process shows:

  • Two epoll_create1 calls on the main thread (main loop + isolated spawnSync loop)
  • epoll_ctl(isolated_epfd, EPOLL_CTL_DEL, fd) = -1 ENOENT — attempting to unregister async subprocess sockets from the wrong loop
  • Zero read() calls on /dev/pts/* for the entire 13-second run (stdin never polled)
  • Stack pointer from socketpair() calls shows nesting: inner spawnSync ~14KB deeper than outer on a downward-growing stack
  • Process exits cleanly via exit_group() after the last epoll_pwait on the isolated epfd returns 0

The race requires an async spawn's completion callback to be queued when spawnSync starts, then drained by the isolated loop's tick. On fast machines with newer kernels this window is too narrow to hit; on Rocky 8 (older kernel, slower syscalls, EINVAL retries on memfd_create(MFD_EXEC) and getrandom(GRND_INSECURE)) it triggers consistently.

Fix

Two complementary changes:

1. Stack-local save/restore at the call site (js_bun_spawn_bindings.zig)

Each spawnSync invocation saves vm.event_loop_handle on its own stack frame before prepare(), and restores from that stack-local after cleanup(). LIFO defer order guarantees the outermost restore runs last with the correct value, regardless of what the singleton's field contains.
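A sketch of this call-site fix under hypothetical Go names (the real change is in js_bun_spawn_bindings.zig). Each invocation keeps its own saved copy on its stack frame; since `defer` runs in LIFO order in Go as in Zig, the outermost restore runs last and reinstalls the main loop:

```go
package main

import "fmt"

var eventLoopHandle = "main-loop"

func spawnSync(nested bool) {
	saved := eventLoopHandle // stack-local: one copy per invocation
	eventLoopHandle = "isolated-loop"
	defer func() { eventLoopHandle = saved }() // LIFO: the outermost restore runs last
	if nested {
		spawnSync(false) // inner restore reinstalls "isolated-loop", harmlessly
	}
}

func finalHandle() string {
	spawnSync(true)
	return eventLoopHandle
}

func main() {
	fmt.Println(finalHandle()) // prints "main-loop"
}
```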

2. Nesting counter on the singleton (SpawnSyncEventLoop.zig)

prepare()/cleanup() only save/restore on the outermost call (nesting_depth == 0). Nested calls are no-ops. This makes the singleton independently correct even if the call-site stack-local is removed in a future refactor.
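A sketch of the nesting-counter fix, again with hypothetical Go names (the real change is in SpawnSyncEventLoop.zig). prepare()/cleanup() touch the saved handle only at depth 0, so nested calls cannot clobber it:

```go
package main

import "fmt"

var eventLoopHandle = "main-loop"

type spawnSyncLoop struct {
	saved        string
	nestingDepth int
}

var loop spawnSyncLoop // singleton

func (l *spawnSyncLoop) prepare() {
	if l.nestingDepth == 0 { // only the outermost call saves
		l.saved = eventLoopHandle
	}
	l.nestingDepth++
	eventLoopHandle = "isolated-loop"
}

func (l *spawnSyncLoop) cleanup() {
	l.nestingDepth--
	if l.nestingDepth == 0 { // only the outermost call restores
		eventLoopHandle = l.saved
	}
}

func spawnSync(nested bool) {
	loop.prepare()
	defer loop.cleanup()
	if nested {
		spawnSync(false) // nested prepare/cleanup are save/restore no-ops
	}
}

func finalHandle() string {
	spawnSync(true)
	return eventLoopHandle
}

func main() {
	fmt.Println(finalHandle()) // prints "main-loop"
}
```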

Either fix alone resolves the bug; both together give defense in depth.

Testing

TODO — this is why the PR is a draft.

The race is timing-dependent and does not trigger on fast CI machines. The fix has been verified manually on the affected Rocky 8 system (process now stays running). Still working out how to add deterministic test coverage — options being considered:

  • Test-only Zig binding that exposes vm.event_loop_handle as an integer for before/after comparison
  • Fault-injection env var that forces the isolated loop to drain the main queue
  • Artificial syscall latency to widen the race window

Feedback on preferred testing approach welcome.

How did you verify your code works?

  • Built patched bun release binary with both fixes
  • Embedded it in a downstream application that was failing on Rocky 8
  • Confirmed the application now starts and stays running in the previously-failing environment

@robobun (Collaborator) commented Mar 10, 2026

Updated 12:32 PM PT - Mar 11th, 2026

@alii, your commit 0f645d0 has 1 failure in Build #39312 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 27993

That installs a local version of the PR into your bun-27993 executable, so you can run:

bun-27993 --bun
