Skip to content

fix(orchestrator): resolve StorageDiff hang when Init fails#2408

Open
dobrac wants to merge 3 commits intomainfrom
fix/storage-diff-init-hang
Open

fix(orchestrator): resolve StorageDiff hang when Init fails#2408
dobrac wants to merge 3 commits intomainfrom
fix/storage-diff-init-hang

Conversation

@dobrac
Copy link
Copy Markdown
Contributor

@dobrac dobrac commented Apr 16, 2026

  1. Permanent hang: StorageDiff.Init() did not call SetError on the SetOnce channel when OpenSeekable failed, leaving it unresolved forever. Any concurrent goroutine waiting on the chunker (via Slice, ReadAt, Close, FileSize, WriteTo) would block indefinitely — including peer server goroutines handling ReadAtBuildSeekable streams.

  2. Cache poisoning on Init failure: When Init failed, the SetOnce-based delete path had no compare-and-swap semantics — a key-only cache.Delete could silently evict a valid replacement entry stored by a concurrent goroutine after disk-pressure eviction, triggering Close() on a live, in-use diff.

While investigating I've found that the SetOnce[Chunker] async-init pattern provided no benefit: every caller of DiffStore.Get uses the returned diff immediately (ReadAt/Slice), so the "lazy blocking" via chunker.Wait() just moves the wait from Get to the first read — identical real-world latency.

This made a structural simplification possible: replace the GetOrSet + SetOnce + delete-on-failure pattern with singleflight.Do + cache-after-success. Failed inits are never cached, so there is nothing to delete.

cc @tvi for the analysis and debugging

StorageDiff.Init() did not call SetError on the SetOnce channel when
OpenSeekable failed, leaving it unresolved forever. Any concurrent
goroutine waiting on the chunker (via Slice, ReadAt, etc.) would block
indefinitely.

Use a defer to guarantee SetError is called on every error path, and
remove the now-poisoned cache entry so subsequent callers retry with a
fresh StorageDiff instead of immediately getting the stale error.
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 16, 2026

PR Summary

Medium Risk
Touches concurrency and caching behavior in the build diff hot path; mistakes could cause nil dereferences or subtle contention/regressions under load.

Overview
Fixes StorageDiff initialization so failures no longer leave callers blocked, and updates DiffStore.Get to use singleflight + explicit cache Get/Set to deduplicate concurrent Diff.Init work (and only cache a diff after successful init).

Reviewed by Cursor Bugbot for commit 6b41821. Bugbot is set up for automated code reviews on this repo. Configure here.

@dobrac dobrac added the bug Something isn't working label Apr 16, 2026
@dobrac dobrac marked this pull request as draft April 16, 2026 06:29
@dobrac dobrac marked this pull request as ready for review April 16, 2026 06:34
Comment thread packages/orchestrator/pkg/sandbox/build/cache.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/build/cache.go Outdated
@dobrac dobrac marked this pull request as draft April 16, 2026 06:52
@dobrac dobrac marked this pull request as ready for review April 16, 2026 08:20
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both previously flagged issues are fully resolved by the singleflight redesign — LGTM.

Extended reasoning...

Overview

The PR touches two files in the orchestrator's build cache hot path: cache.go (DiffStore.Get) and storage_diff.go (StorageDiff). The changes replace the GetOrSet + SetOnce[Chunker] + delete-on-failure pattern with singleflight.Do + explicit cache Set only on success.

Security risks

No security-sensitive code is touched. This is an internal concurrency fix with no auth, crypto, or permission implications.

Level of scrutiny

Medium: concurrency code in a hot path. However, the singleflight pattern is well-established, the double-check inside the group function is correct, and the simplification (removing SetOnce indirection) reduces the surface area for future races.

Prior review resolution

My previous review raised two issues against an earlier revision of this PR: (1) a key-only cache.Delete that could evict a valid replacement entry (red), and (2) spurious Warn logs from the eviction handler calling Close on a never-initialized diff (yellow). Both are resolved by the current approach — failed inits are never placed in the cache, so no delete is ever needed and the eviction handler is never triggered for failed inits. Both inline comments are marked resolved by the author. No new bugs were found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants