fix(api): skip kill after waiting for pause to complete by jakubno · Pull Request #2412 · e2b-dev/infra

jakubno · 2026-04-16T13:41:40Z

When a kill request arrives while a sandbox is being paused, the code now waits for the pause transition to finish and returns success instead of proceeding to kill the sandbox. Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill.

cursor · 2026-04-16T13:41:47Z

PR Summary

Medium Risk
Touches sandbox lifecycle state-transition logic in both storage backends, which could change cleanup behavior if callers previously relied on Kill after Pause to perform additional work. Scope is small and covered by updated concurrency/idempotency tests.

Overview
Updates both in-memory and Redis-backed sandbox removal flows so that once a sandbox is in Pausing, a subsequent Kill request (including one that waited for the pause transition) returns alreadyDone and does not transition to Killing. Tests are adjusted to assert the state remains Pausing and to remove expectations that a follow-up kill callback advances or completes a kill transition.

^{Reviewed by Cursor Bugbot for commit cb4a46b. Bugbot is set up for automated code reviews on this repo. Configure here.}

claude · 2026-04-16T13:58:16Z

+		case currentState == sandbox.StatePausing:
+			// Pause is a terminal transition — the sandbox is already being
+			// removed, so there's nothing left to kill. Just report done.
+			return true, func(context.Context, error) {}, nil


🔴 The PR changes kill-while-pausing to return alreadyDone=true (no-op), but four existing tests that document the old behavior were not updated and will now fail. The tests must be updated to assert alreadyDone=true and State==StatePausing instead of alreadyDone=false and State==StateKilling.

Extended reasoning...

The PR introduces a new semantic: when a kill request arrives while a sandbox is being paused, the code now treats pause as a terminal transition and returns alreadyDone=true rather than waiting for pause to complete and then retrying the kill. This is implemented in two places: memory/operations.go lines 195-198 (new case currentState == sandbox.StatePausing) and redis/state_change.go lines 343-345 (new if sbx.State == sandbox.StatePausing check after waiting for transition).

Four existing tests document the OLD behavior and will now fail:

memory/operations_test.go TestStartRemoving_PauseThenKill (lines 86-128): asserts assert.False(t, alreadyDone2) (line 121) and assert.Equal(t, sandbox.StateKilling, sbx.State()) (line 123).

memory/operations_test.go TestConcurrency/'eviction flag is not propagated on retry after waiting' (lines 762-815): asserts assert.False(t, killAlreadyDone) (line 810) and assert.Equal(t, sandbox.StateKilling, sbx.State()) (line 812).

redis/state_change_test.go TestStartRemoving_PauseThenKill (lines 113-169): asserts assert.False(t, alreadyDone2) (line 156) and assert.Equal(t, sandbox.StateKilling, updated.State) (line 162).

redis/state_change_test.go TestConcurrency/'eviction flag is not propagated on retry after waiting' (lines 907-966): asserts assert.False(t, killAlreadyDone) (line 958) and assert.Equal(t, sandbox.StateKilling, got.State) (line 963).

Step-by-step proof for memory/operations_test.go TestStartRemoving_PauseThenKill:

Test calls StartRemoving with StatePausing action, which sets sbx._data.State = StatePausing and sbx.transition = newErrorOnce().

A goroutine sleeping 100ms holds the pause transition open.

Test then calls StartRemoving with StateKilling action.

startRemoving sees transition \!= nil, reads currentState = StatePausing, releases lock, and calls transition.WaitWithContext.

After waiting, enters the switch. currentState == newState is false (Pausing ≠ Killing). New case currentState == sandbox.StatePausing matches — returns true, emptyCallback, nil.

Test receives alreadyDone2 = true and state remains StatePausing.

assert.False(t, alreadyDone2) FAILS. assert.Equal(t, sandbox.StateKilling, sbx.State()) FAILS.

The same failure pattern applies symmetrically to the redis tests via the new if sbx.State == sandbox.StatePausing guard added after waitForTransition returns.

The fix is to update all four tests to assert the new semantics: alreadyDone=true and State==StatePausing when a kill is requested while a pause is in progress.

claude · 2026-04-16T13:58:16Z

+	if sbx.State == sandbox.StatePausing {
+		return sbx, true, func(context.Context, error) {}, nil
+	}
+
 	// Retry with new state after transition completes
 	return s.StartRemoving(ctx, teamID, sbx.SandboxID, sandbox.RemoveOpts{Action: stateAction})


🔴 The PR skips kill in the concurrent case (kill arrives while pause is in-flight), but the sequential case (kill arrives after pause fully commits) is unaffected: AllowedTransitions[StatePausing][StateKilling] = true still allows kill to proceed via the normal path, resulting in timing-dependent behavior where the same logical operation—killing a paused sandbox—produces opposite results depending on a millisecond race. If the stated semantics ('pause is terminal, nothing to kill') should hold universally, AllowedTransitions[StatePausing][StateKilling] should be removed to close both paths consistently.

Extended reasoning...

What the bug is:
The PR adds logic to skip kill when a sandbox is currently being paused (concurrent case). In the Redis path (handleExistingTransition), after waiting for an in-flight pause transition, the code now checks if sbx.State == sandbox.StatePausing and returns alreadyDone=true. Similarly, in the memory path, case currentState == sandbox.StatePausing returns alreadyDone=true. This covers the concurrent race. However, AllowedTransitions[StatePausing][StateKilling] = true remains in states.go, meaning a kill that arrives after the pause completes (no active transition) will never reach handleExistingTransition—it proceeds straight through the normal allowed-transitions path and transitions the sandbox to StateKilling.

The specific code path that triggers it:
In the sequential case, StartRemoving in the Redis path finds transactionID == "" (no active transition), falls through to the AllowedTransitions check at line ~130, finds AllowedTransitions[StatePausing][StateKilling] = true, and proceeds with the kill. The new if sbx.State == sandbox.StatePausing guard is inside handleExistingTransition, which is only entered when transactionID \!= "". Once the pause transition has completed and cleared its key, the guard is unreachable.

Why existing code doesn't prevent it:
AllowedTransitions[StatePausing][StateKilling] = true is unchanged by the PR. No guard exists in the non-transition-in-progress code path to reject kill requests for sandboxes already in StatePausing. The PR's new code only intercepts the case where the pause is still running.

Impact:
A kill request for a paused sandbox produces opposite results depending on a timing race: arrive 1 ms before the pause commits → kill is skipped (alreadyDone=true); arrive 1 ms after the pause commits → kill proceeds (state=StateKilling). Additionally, the test TestStartRemoving_PauseThenKill (redis/state_change_test.go:113) previously asserted alreadyDone2=false and State==StateKilling in the concurrent case, but with the PR's changes that test now breaks because the new code returns alreadyDone=true for exactly that scenario.

Addressing the refutation:
The refutation argues that AllowedTransitions[StatePausing][StateKilling] was intentionally preserved for operator cleanup of stuck/failed pauses, and that the PR comment explicitly scopes only to the concurrent ('WHILE') case. This is a reasonable interpretation. However, the PR's own stated rationale—"Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill"—applies with equal force once the pause has completed: the sandbox is in StatePausing in both cases, and the cleanup argument applies identically. If the sequential path is intentionally preserved, the PR comment should document this distinction, and the concurrent-case test should be updated to match the new expected behavior. As written, the PR breaks TestStartRemoving_PauseThenKill's assertions without updating them, and leaves behavior asymmetric without documentation.

How to fix it:
Either (a) remove AllowedTransitions[StatePausing][StateKilling] to make 'kill of a paused sandbox always skips' consistent across both paths, or (b) explicitly document that the sequential path is intentional (cleanup/force-kill), update TestStartRemoving_PauseThenKill to reflect the new concurrent behavior, and add a comment explaining why the two paths differ.

Step-by-step proof:

Sandbox is in StateRunning. Client A calls StartRemoving with action Pause. AllowedTransitions[StateRunning][StatePausing]=true, so the pause transition begins; sandbox state is set to StatePausing and transitionKey is populated.

Pause transition callback fires; transitionKey is deleted; sandbox remains in StatePausing.

Client B calls StartRemoving with action Kill. transactionID == "" (no active transition), so handleExistingTransition is not called. Code reaches sandbox.AllowedTransitions[sbx.State][newState] check: AllowedTransitions[StatePausing][StateKilling] = true. Kill proceeds; sandbox moves to StateKilling. alreadyDone=false is returned.

Had Client B arrived 1 ms earlier (before step 2), it would have entered handleExistingTransition, waited for the pause to complete, then hit the new if sbx.State == sandbox.StatePausing guard and returned alreadyDone=true. Opposite result, same logical operation.

Move the StatePausing guard from the unlocked wait path into the locked 'no transition' section of startRemoving/StartRemoving so it reads actual state under the mutex (memory) or distributed lock (Redis). The recursive retry after waiting naturally re-acquires the lock and hits the check.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit cb4a46b. Configure here.}

cursor · 2026-04-16T14:10:35Z

+	// Pause is a terminal transition — the sandbox is already being
+	// removed, so there's nothing left to kill. Just report done.
+	if sbx._data.State == sandbox.StatePausing {
+		return true, func(context.Context, error) {}, nil


Overly broad pausing check skips all transitions, not just kill

Medium Severity

The new guard if sbx._data.State == sandbox.StatePausing returns alreadyDone=true for any requested transition, not just kill. The comment says "nothing left to kill," but the condition doesn't verify that newState == sandbox.StateKilling. A snapshot request (StateActionSnapshot) targeting a StatePausing sandbox — previously rejected with InvalidStateTransitionError — now silently returns success. The condition needs to be scoped to newState == sandbox.StateKilling to match the stated intent.

Additional Locations (1)

packages/api/internal/sandbox/storage/redis/state_change.go#L108-L110

^{Reviewed by Cursor Bugbot for commit cb4a46b. Configure here.}

jakubno requested review from ValentaTomas and dobrac as code owners April 16, 2026 13:41

e2b-request-same-site-reviewers bot assigned arkamar Apr 16, 2026

jakubno marked this pull request as draft April 16, 2026 13:43

claude bot reviewed Apr 16, 2026

View reviewed changes

cursor bot reviewed Apr 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(api): skip kill after waiting for pause to complete#2412

fix(api): skip kill after waiting for pause to complete#2412
jakubno wants to merge 2 commits intomainfrom
fix/skip-kill-during-pause

jakubno commented Apr 16, 2026

Uh oh!

cursor bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

claude bot Apr 16, 2026

Uh oh!

claude bot Apr 16, 2026

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jakubno commented Apr 16, 2026

Uh oh!

cursor bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

claude bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Apr 16, 2026

Choose a reason for hiding this comment

Overly broad pausing check skips all transitions, not just kill

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cursor bot commented Apr 16, 2026 •

edited

Loading