Skip to content

fix(api): skip kill after waiting for pause to complete#2412

Draft
jakubno wants to merge 2 commits intomainfrom
fix/skip-kill-during-pause
Draft

fix(api): skip kill after waiting for pause to complete#2412
jakubno wants to merge 2 commits intomainfrom
fix/skip-kill-during-pause

Conversation

@jakubno
Copy link
Copy Markdown
Member

@jakubno jakubno commented Apr 16, 2026

When a kill request arrives while a sandbox is being paused, the code now waits for the pause transition to finish and returns success instead of proceeding to kill the sandbox. Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill.

When a kill request arrives while a sandbox is being paused, the code
now waits for the pause transition to finish and returns success
instead of proceeding to kill the sandbox. Pause is a terminal
transition — the sandbox is already being removed, so there is
nothing left to kill.
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 16, 2026

PR Summary

Medium Risk
Touches sandbox lifecycle state-transition logic in both storage backends, which could change cleanup behavior if callers previously relied on Kill after Pause to perform additional work. Scope is small and covered by updated concurrency/idempotency tests.

Overview
Updates both in-memory and Redis-backed sandbox removal flows so that once a sandbox is in Pausing, a subsequent Kill request (including one that waited for the pause transition) returns alreadyDone and does not transition to Killing. Tests are adjusted to assert the state remains Pausing and to remove expectations that a follow-up kill callback advances or completes a kill transition.

Reviewed by Cursor Bugbot for commit cb4a46b. Bugbot is set up for automated code reviews on this repo. Configure here.

@jakubno jakubno marked this pull request as draft April 16, 2026 13:43
Comment on lines +195 to +198
case currentState == sandbox.StatePausing:
// Pause is a terminal transition — the sandbox is already being
// removed, so there's nothing left to kill. Just report done.
return true, func(context.Context, error) {}, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The PR changes kill-while-pausing to return alreadyDone=true (no-op), but four existing tests that document the old behavior were not updated and will now fail. The tests must be updated to assert alreadyDone=true and State==StatePausing instead of alreadyDone=false and State==StateKilling.

Extended reasoning...

The PR introduces a new semantic: when a kill request arrives while a sandbox is being paused, the code now treats pause as a terminal transition and returns alreadyDone=true rather than waiting for pause to complete and then retrying the kill. This is implemented in two places: memory/operations.go lines 195-198 (new case currentState == sandbox.StatePausing) and redis/state_change.go lines 343-345 (new if sbx.State == sandbox.StatePausing check after waiting for transition).

Four existing tests document the OLD behavior and will now fail:

  1. memory/operations_test.go TestStartRemoving_PauseThenKill (lines 86-128): asserts assert.False(t, alreadyDone2) (line 121) and assert.Equal(t, sandbox.StateKilling, sbx.State()) (line 123).
  2. memory/operations_test.go TestConcurrency/'eviction flag is not propagated on retry after waiting' (lines 762-815): asserts assert.False(t, killAlreadyDone) (line 810) and assert.Equal(t, sandbox.StateKilling, sbx.State()) (line 812).
  3. redis/state_change_test.go TestStartRemoving_PauseThenKill (lines 113-169): asserts assert.False(t, alreadyDone2) (line 156) and assert.Equal(t, sandbox.StateKilling, updated.State) (line 162).
  4. redis/state_change_test.go TestConcurrency/'eviction flag is not propagated on retry after waiting' (lines 907-966): asserts assert.False(t, killAlreadyDone) (line 958) and assert.Equal(t, sandbox.StateKilling, got.State) (line 963).

Step-by-step proof for memory/operations_test.go TestStartRemoving_PauseThenKill:

  1. Test calls StartRemoving with StatePausing action, which sets sbx._data.State = StatePausing and sbx.transition = newErrorOnce().
  2. A goroutine sleeping 100ms holds the pause transition open.
  3. Test then calls StartRemoving with StateKilling action.
  4. startRemoving sees transition \!= nil, reads currentState = StatePausing, releases lock, and calls transition.WaitWithContext.
  5. After waiting, enters the switch. currentState == newState is false (Pausing ≠ Killing). New case currentState == sandbox.StatePausing matches — returns true, emptyCallback, nil.
  6. Test receives alreadyDone2 = true and state remains StatePausing.
  7. assert.False(t, alreadyDone2) FAILS. assert.Equal(t, sandbox.StateKilling, sbx.State()) FAILS.

The same failure pattern applies symmetrically to the redis tests via the new if sbx.State == sandbox.StatePausing guard added after waitForTransition returns.

The fix is to update all four tests to assert the new semantics: alreadyDone=true and State==StatePausing when a kill is requested while a pause is in progress.

Comment on lines 343 to 348
if sbx.State == sandbox.StatePausing {
return sbx, true, func(context.Context, error) {}, nil
}

// Retry with new state after transition completes
return s.StartRemoving(ctx, teamID, sbx.SandboxID, sandbox.RemoveOpts{Action: stateAction})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The PR skips kill in the concurrent case (kill arrives while pause is in-flight), but the sequential case (kill arrives after pause fully commits) is unaffected: AllowedTransitions[StatePausing][StateKilling] = true still allows kill to proceed via the normal path, resulting in timing-dependent behavior where the same logical operation—killing a paused sandbox—produces opposite results depending on a millisecond race. If the stated semantics ('pause is terminal, nothing to kill') should hold universally, AllowedTransitions[StatePausing][StateKilling] should be removed to close both paths consistently.

Extended reasoning...

What the bug is:
The PR adds logic to skip kill when a sandbox is currently being paused (concurrent case). In the Redis path (handleExistingTransition), after waiting for an in-flight pause transition, the code now checks if sbx.State == sandbox.StatePausing and returns alreadyDone=true. Similarly, in the memory path, case currentState == sandbox.StatePausing returns alreadyDone=true. This covers the concurrent race. However, AllowedTransitions[StatePausing][StateKilling] = true remains in states.go, meaning a kill that arrives after the pause completes (no active transition) will never reach handleExistingTransition—it proceeds straight through the normal allowed-transitions path and transitions the sandbox to StateKilling.

The specific code path that triggers it:
In the sequential case, StartRemoving in the Redis path finds transactionID == "" (no active transition), falls through to the AllowedTransitions check at line ~130, finds AllowedTransitions[StatePausing][StateKilling] = true, and proceeds with the kill. The new if sbx.State == sandbox.StatePausing guard is inside handleExistingTransition, which is only entered when transactionID \!= "". Once the pause transition has completed and cleared its key, the guard is unreachable.

Why existing code doesn't prevent it:
AllowedTransitions[StatePausing][StateKilling] = true is unchanged by the PR. No guard exists in the non-transition-in-progress code path to reject kill requests for sandboxes already in StatePausing. The PR's new code only intercepts the case where the pause is still running.

Impact:
A kill request for a paused sandbox produces opposite results depending on a timing race: arrive 1 ms before the pause commits → kill is skipped (alreadyDone=true); arrive 1 ms after the pause commits → kill proceeds (state=StateKilling). Additionally, the test TestStartRemoving_PauseThenKill (redis/state_change_test.go:113) previously asserted alreadyDone2=false and State==StateKilling in the concurrent case, but with the PR's changes that test now breaks because the new code returns alreadyDone=true for exactly that scenario.

Addressing the refutation:
The refutation argues that AllowedTransitions[StatePausing][StateKilling] was intentionally preserved for operator cleanup of stuck/failed pauses, and that the PR comment explicitly scopes only to the concurrent ('WHILE') case. This is a reasonable interpretation. However, the PR's own stated rationale—"Pause is a terminal transition — the sandbox is already being removed, so there is nothing left to kill"—applies with equal force once the pause has completed: the sandbox is in StatePausing in both cases, and the cleanup argument applies identically. If the sequential path is intentionally preserved, the PR comment should document this distinction, and the concurrent-case test should be updated to match the new expected behavior. As written, the PR breaks TestStartRemoving_PauseThenKill's assertions without updating them, and leaves behavior asymmetric without documentation.

How to fix it:
Either (a) remove AllowedTransitions[StatePausing][StateKilling] to make 'kill of a paused sandbox always skips' consistent across both paths, or (b) explicitly document that the sequential path is intentional (cleanup/force-kill), update TestStartRemoving_PauseThenKill to reflect the new concurrent behavior, and add a comment explaining why the two paths differ.

Step-by-step proof:

  1. Sandbox is in StateRunning. Client A calls StartRemoving with action Pause. AllowedTransitions[StateRunning][StatePausing]=true, so the pause transition begins; sandbox state is set to StatePausing and transitionKey is populated.
  2. Pause transition callback fires; transitionKey is deleted; sandbox remains in StatePausing.
  3. Client B calls StartRemoving with action Kill. transactionID == "" (no active transition), so handleExistingTransition is not called. Code reaches sandbox.AllowedTransitions[sbx.State][newState] check: AllowedTransitions[StatePausing][StateKilling] = true. Kill proceeds; sandbox moves to StateKilling. alreadyDone=false is returned.
  4. Had Client B arrived 1 ms earlier (before step 2), it would have entered handleExistingTransition, waited for the pause to complete, then hit the new if sbx.State == sandbox.StatePausing guard and returned alreadyDone=true. Opposite result, same logical operation.

Move the StatePausing guard from the unlocked wait path into the
locked 'no transition' section of startRemoving/StartRemoving so it
reads actual state under the mutex (memory) or distributed lock
(Redis). The recursive retry after waiting naturally re-acquires the
lock and hits the check.
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit cb4a46b. Configure here.

// Pause is a terminal transition — the sandbox is already being
// removed, so there's nothing left to kill. Just report done.
if sbx._data.State == sandbox.StatePausing {
return true, func(context.Context, error) {}, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overly broad pausing check skips all transitions, not just kill

Medium Severity

The new guard if sbx._data.State == sandbox.StatePausing returns alreadyDone=true for any requested transition, not just kill. The comment says "nothing left to kill," but the condition doesn't verify that newState == sandbox.StateKilling. A snapshot request (StateActionSnapshot) targeting a StatePausing sandbox — previously rejected with InvalidStateTransitionError — now silently returns success. The condition needs to be scoped to newState == sandbox.StateKilling to match the stated intent.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit cb4a46b. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants