fix(orchestrator): resolve StorageDiff hang when Init fails#2408
fix(orchestrator): resolve StorageDiff hang when Init fails#2408
Conversation
StorageDiff.Init() did not call SetError on the SetOnce channel when OpenSeekable failed, leaving it unresolved forever. Any concurrent goroutine waiting on the chunker (via Slice, ReadAt, etc.) would block indefinitely. Use a defer to guarantee SetError is called on every error path, and remove the now-poisoned cache entry so subsequent callers retry with a fresh StorageDiff instead of immediately getting the stale error.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 6b41821. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Both previously flagged issues are fully resolved by the singleflight redesign — LGTM.
Extended reasoning...
Overview
The PR touches two files in the orchestrator's build cache hot path: cache.go (DiffStore.Get) and storage_diff.go (StorageDiff). The changes replace the GetOrSet + SetOnce[Chunker] + delete-on-failure pattern with singleflight.Do + explicit cache Set only on success.
Security risks
No security-sensitive code is touched. This is an internal concurrency fix with no auth, crypto, or permission implications.
Level of scrutiny
Medium: concurrency code in a hot path. However, the singleflight pattern is well-established, the double-check inside the group function is correct, and the simplification (removing SetOnce indirection) reduces the surface area for future races.
Prior review resolution
My previous review raised two issues against an earlier revision of this PR: (1) a key-only cache.Delete that could evict a valid replacement entry (red), and (2) spurious Warn logs from the eviction handler calling Close on a never-initialized diff (yellow). Both are resolved by the current approach — failed inits are never placed in the cache, so no delete is ever needed and the eviction handler is never triggered for failed inits. Both inline comments are marked resolved by the author. No new bugs were found.
Permanent hang:
StorageDiff.Init()did not callSetErroron theSetOncechannel whenOpenSeekablefailed, leaving it unresolved forever. Any concurrent goroutine waiting on the chunker (viaSlice,ReadAt,Close,FileSize,WriteTo) would block indefinitely — including peer server goroutines handlingReadAtBuildSeekablestreams.Cache poisoning on Init failure: When
Initfailed, theSetOnce-based delete path had no compare-and-swap semantics — a key-onlycache.Deletecould silently evict a valid replacement entry stored by a concurrent goroutine after disk-pressure eviction, triggeringClose()on a live, in-use diff.While investigating I've found that the
SetOnce[Chunker]async-init pattern provided no benefit: every caller ofDiffStore.Getuses the returned diff immediately (ReadAt/Slice), so the "lazy blocking" viachunker.Wait()just moves the wait fromGetto the first read — identical real-world latency.This made a structural simplification possible: replace the
GetOrSet+SetOnce+ delete-on-failure pattern withsingleflight.Do+ cache-after-success. Failed inits are never cached, so there is nothing to delete.cc @tvi for the analysis and debugging