fix(api): skip kill after waiting for pause to complete#2412
fix(api): skip kill after waiting for pause to complete#2412
Conversation
When a kill request arrives while a sandbox is being paused, the code now waits for the pause transition to finish and returns success instead of proceeding to kill the sandbox. Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit cb4a46b. Bugbot is set up for automated code reviews on this repo. Configure here. |
| case currentState == sandbox.StatePausing: | ||
| // Pause is a terminal transition — the sandbox is already being | ||
| // removed, so there's nothing left to kill. Just report done. | ||
| return true, func(context.Context, error) {}, nil |
There was a problem hiding this comment.
🔴 The PR changes kill-while-pausing to return alreadyDone=true (no-op), but four existing tests that document the old behavior were not updated and will now fail. The tests must be updated to assert alreadyDone=true and State==StatePausing instead of alreadyDone=false and State==StateKilling.
Extended reasoning...
The PR introduces a new semantic: when a kill request arrives while a sandbox is being paused, the code now treats pause as a terminal transition and returns alreadyDone=true rather than waiting for pause to complete and then retrying the kill. This is implemented in two places: memory/operations.go lines 195-198 (new case currentState == sandbox.StatePausing) and redis/state_change.go lines 343-345 (new if sbx.State == sandbox.StatePausing check after waiting for transition).
Four existing tests document the OLD behavior and will now fail:
memory/operations_test.goTestStartRemoving_PauseThenKill(lines 86-128): assertsassert.False(t, alreadyDone2)(line 121) andassert.Equal(t, sandbox.StateKilling, sbx.State())(line 123).memory/operations_test.goTestConcurrency/'eviction flag is not propagated on retry after waiting'(lines 762-815): assertsassert.False(t, killAlreadyDone)(line 810) andassert.Equal(t, sandbox.StateKilling, sbx.State())(line 812).redis/state_change_test.goTestStartRemoving_PauseThenKill(lines 113-169): assertsassert.False(t, alreadyDone2)(line 156) andassert.Equal(t, sandbox.StateKilling, updated.State)(line 162).redis/state_change_test.goTestConcurrency/'eviction flag is not propagated on retry after waiting'(lines 907-966): assertsassert.False(t, killAlreadyDone)(line 958) andassert.Equal(t, sandbox.StateKilling, got.State)(line 963).
Step-by-step proof for memory/operations_test.go TestStartRemoving_PauseThenKill:
- Test calls
StartRemovingwithStatePausingaction, which setssbx._data.State = StatePausingandsbx.transition = newErrorOnce(). - A goroutine sleeping 100ms holds the pause transition open.
- Test then calls
StartRemovingwithStateKillingaction. startRemovingseestransition \!= nil, readscurrentState = StatePausing, releases lock, and callstransition.WaitWithContext.- After waiting, enters the switch.
currentState == newStateis false (Pausing ≠ Killing). New casecurrentState == sandbox.StatePausingmatches — returnstrue, emptyCallback, nil. - Test receives
alreadyDone2 = trueand state remainsStatePausing. assert.False(t, alreadyDone2)FAILS.assert.Equal(t, sandbox.StateKilling, sbx.State())FAILS.
The same failure pattern applies symmetrically to the redis tests via the new if sbx.State == sandbox.StatePausing guard added after waitForTransition returns.
The fix is to update all four tests to assert the new semantics: alreadyDone=true and State==StatePausing when a kill is requested while a pause is in progress.
| if sbx.State == sandbox.StatePausing { | ||
| return sbx, true, func(context.Context, error) {}, nil | ||
| } | ||
|
|
||
| // Retry with new state after transition completes | ||
| return s.StartRemoving(ctx, teamID, sbx.SandboxID, sandbox.RemoveOpts{Action: stateAction}) |
There was a problem hiding this comment.
🔴 The PR skips kill in the concurrent case (kill arrives while pause is in-flight), but the sequential case (kill arrives after pause fully commits) is unaffected: AllowedTransitions[StatePausing][StateKilling] = true still allows kill to proceed via the normal path, resulting in timing-dependent behavior where the same logical operation—killing a paused sandbox—produces opposite results depending on a millisecond race. If the stated semantics ('pause is terminal, nothing to kill') should hold universally, AllowedTransitions[StatePausing][StateKilling] should be removed to close both paths consistently.
Extended reasoning...
What the bug is:
The PR adds logic to skip kill when a sandbox is currently being paused (concurrent case). In the Redis path (handleExistingTransition), after waiting for an in-flight pause transition, the code now checks if sbx.State == sandbox.StatePausing and returns alreadyDone=true. Similarly, in the memory path, case currentState == sandbox.StatePausing returns alreadyDone=true. This covers the concurrent race. However, AllowedTransitions[StatePausing][StateKilling] = true remains in states.go, meaning a kill that arrives after the pause completes (no active transition) will never reach handleExistingTransition—it proceeds straight through the normal allowed-transitions path and transitions the sandbox to StateKilling.
The specific code path that triggers it:
In the sequential case, StartRemoving in the Redis path finds transactionID == "" (no active transition), falls through to the AllowedTransitions check at line ~130, finds AllowedTransitions[StatePausing][StateKilling] = true, and proceeds with the kill. The new if sbx.State == sandbox.StatePausing guard is inside handleExistingTransition, which is only entered when transactionID \!= "". Once the pause transition has completed and cleared its key, the guard is unreachable.
Why existing code doesn't prevent it:
AllowedTransitions[StatePausing][StateKilling] = true is unchanged by the PR. No guard exists in the non-transition-in-progress code path to reject kill requests for sandboxes already in StatePausing. The PR's new code only intercepts the case where the pause is still running.
Impact:
A kill request for a paused sandbox produces opposite results depending on a timing race: arrive 1 ms before the pause commits → kill is skipped (alreadyDone=true); arrive 1 ms after the pause commits → kill proceeds (state=StateKilling). Additionally, the test TestStartRemoving_PauseThenKill (redis/state_change_test.go:113) previously asserted alreadyDone2=false and State==StateKilling in the concurrent case, but with the PR's changes that test now breaks because the new code returns alreadyDone=true for exactly that scenario.
Addressing the refutation:
The refutation argues that AllowedTransitions[StatePausing][StateKilling] was intentionally preserved for operator cleanup of stuck/failed pauses, and that the PR comment explicitly scopes only to the concurrent ('WHILE') case. This is a reasonable interpretation. However, the PR's own stated rationale—"Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill"—applies with equal force once the pause has completed: the sandbox is in StatePausing in both cases, and the cleanup argument applies identically. If the sequential path is intentionally preserved, the PR comment should document this distinction, and the concurrent-case test should be updated to match the new expected behavior. As written, the PR breaks TestStartRemoving_PauseThenKill's assertions without updating them, and leaves behavior asymmetric without documentation.
How to fix it:
Either (a) remove AllowedTransitions[StatePausing][StateKilling] to make 'kill of a paused sandbox always skips' consistent across both paths, or (b) explicitly document that the sequential path is intentional (cleanup/force-kill), update TestStartRemoving_PauseThenKill to reflect the new concurrent behavior, and add a comment explaining why the two paths differ.
Step-by-step proof:
- Sandbox is in
StateRunning. Client A callsStartRemovingwith actionPause.AllowedTransitions[StateRunning][StatePausing]=true, so the pause transition begins; sandbox state is set toStatePausingandtransitionKeyis populated. - Pause transition callback fires;
transitionKeyis deleted; sandbox remains inStatePausing. - Client B calls
StartRemovingwith actionKill.transactionID == ""(no active transition), sohandleExistingTransitionis not called. Code reachessandbox.AllowedTransitions[sbx.State][newState]check:AllowedTransitions[StatePausing][StateKilling] = true. Kill proceeds; sandbox moves toStateKilling.alreadyDone=falseis returned. - Had Client B arrived 1 ms earlier (before step 2), it would have entered
handleExistingTransition, waited for the pause to complete, then hit the newif sbx.State == sandbox.StatePausingguard and returnedalreadyDone=true. Opposite result, same logical operation.
Move the StatePausing guard from the unlocked wait path into the locked 'no transition' section of startRemoving/StartRemoving so it reads actual state under the mutex (memory) or distributed lock (Redis). The recursive retry after waiting naturally re-acquires the lock and hits the check.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cb4a46b. Configure here.
| // Pause is a terminal transition — the sandbox is already being | ||
| // removed, so there's nothing left to kill. Just report done. | ||
| if sbx._data.State == sandbox.StatePausing { | ||
| return true, func(context.Context, error) {}, nil |
There was a problem hiding this comment.
Overly broad pausing check skips all transitions, not just kill
Medium Severity
The new guard if sbx._data.State == sandbox.StatePausing returns alreadyDone=true for any requested transition, not just kill. The comment says "nothing left to kill," but the condition doesn't verify that newState == sandbox.StateKilling. A snapshot request (StateActionSnapshot) targeting a StatePausing sandbox — previously rejected with InvalidStateTransitionError — now silently returns success. The condition needs to be scoped to newState == sandbox.StateKilling to match the stated intent.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit cb4a46b. Configure here.


When a kill request arrives while a sandbox is being paused, the code now waits for the pause transition to finish and returns success instead of proceeding to kill the sandbox. Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill.