fix(spawn): prevent nested spawnSync from corrupting event loop handle (#27993)
> **Collaborator** (updated 12:32 PM PT, Mar 11th, 2026)
>
> ❌ @alii, your commit 0f645d0 has 1 failure.
>
> 🧪 To try this PR locally: `bunx bun-pr 27993`. That installs a local version of the PR into your `bun-27993 --bun`.
## What does this PR do?

Fixes a bug where nested `spawnSync` calls corrupt `vm.event_loop_handle`, causing the process to exit unexpectedly when the main event loop becomes orphaned.

### Root cause

`SpawnSyncEventLoop` (introduced in #24436) is a singleton that saves `vm.event_loop_handle` in `prepare()` and restores it in `cleanup()`. If `spawnSync` is called recursively, which can happen when an async subprocess's completion callback runs during the isolated loop's tick and itself calls `spawnSync`, the inner `prepare()` overwrites the singleton's `original_event_loop_handle` with the already-overridden value. Both cleanups then restore to the isolated loop instead of the main loop.

The result: stdin, timers, and async subprocess sockets (all registered on the main loop) are orphaned. Once pending work drains, the process exits with code 0.
### Evidence

Observed on Rocky Linux 8. A `perf trace` of the affected process shows:

- Two `epoll_create1` calls on the main thread (main loop + isolated spawnSync loop)
- `epoll_ctl(isolated_epfd, EPOLL_CTL_DEL, fd) = -1 ENOENT`: attempts to unregister async subprocess sockets from the wrong loop
- No `read()` calls on `/dev/pts/*` for the entire 13-second run (stdin never polled)
- Stack pointers of `socketpair()` calls show nesting: the inner spawnSync is ~14KB deeper than the outer on a downward-growing stack
- `exit_group()` after the last `epoll_pwait` on the isolated epfd returns 0

The race requires an async spawn's completion callback to be queued when `spawnSync` starts, then drained by the isolated loop's tick. On fast machines with newer kernels this window is too narrow to hit; on Rocky 8 (older kernel, slower syscalls, EINVAL retries on `memfd_create(MFD_EXEC)` and `getrandom(GRND_INSECURE)`) it triggers consistently.
### Fix

Two complementary changes:

**1. Stack-local save/restore at the call site (`js_bun_spawn_bindings.zig`)**

Each `spawnSync` invocation saves `vm.event_loop_handle` on its own stack frame before `prepare()`, and restores from that stack-local after `cleanup()`. LIFO defer order guarantees the outermost restore runs last with the correct value, regardless of what the singleton's field contains.

**2. Nesting counter on the singleton (`SpawnSyncEventLoop.zig`)**

`prepare()`/`cleanup()` only save/restore on the outermost call (`nesting_depth == 0`). Nested calls are no-ops. This makes the singleton independently correct even if the call-site stack-local is removed in a future refactor.

Either fix alone resolves the bug; both together give defense in depth.
### Testing

TODO: this is why the PR is a draft.

The race is timing-dependent and does not trigger on fast CI machines. The fix has been verified manually on the affected Rocky 8 system (the process now stays running). Still working out how to add deterministic test coverage. Options being considered:

- Exposing `vm.event_loop_handle` as an integer for before/after comparison

Feedback on the preferred testing approach is welcome.
## How did you verify your code works?