fix(storage): restructure GCS read retries with fresh context per attempt#2405
Conversation
…empt The GCS client library retries within the caller's context, so the previous configuration (10 retry attempts sharing a single 10s timeout) left later attempts with almost no time to complete — the context expired before they could do meaningful work. Replace the library's built-in retry with our own loop that creates a fresh 5s context.WithTimeout per attempt (4 attempts, exponential backoff 200ms-2s). This gives every attempt a full deadline and proper backoff between them. Also remove the 10s timeout that wrapped OpenRangeReader's entire reader lifetime — it was silently cancelling progressive 4MiB chunk reads mid-stream. Callers (e.g. StreamingChunker with its 60s fetch timeout) now control the reader deadline via their own context.
…EOF propagation The previous commit (c6774c6b0) set MaxAttempts(1) on the shared object handle, which correctly disabled library retries for ReadAt (handled by retryWithBackoff) but inadvertently removed retries from Delete, Size, Put, WriteTo, and OpenRangeReader. Fix by introducing a second handle (readAtHandle) with retries disabled, used exclusively by readAtOnce. The primary handle retains the original 10-attempt library retry config for all other operations. Also fix readAtOnce swallowing io.EOF — the cache layer's isCompleteRead relies on io.EOF to identify cacheable short reads (last chunk). Without it, end-of-file chunks were never cached, causing repeated GCS fetches. Additionally replace time.After with time.NewTimer in retryWithBackoff to avoid a minor timer leak on context cancellation.
…backoff Previously, when ctx.Done() fired during the backoff sleep, the loop continued to the next iteration and called fn() on the already-cancelled context — wasting a gRPC dial before the ctx.Err() check broke out. Now we return directly from the select case.
…struction OpenSeekable and OpenBlob had identical handle/readAtHandle construction logic copy-pasted. Extract it into gcpStorage.newObject() so both methods delegate to a single place.
retryWithBackoff now returns the attempt count (1-indexed), which ReadAt passes as a retry.attempts attribute on the existing orchestrator.storage.gcs.read histogram and counters. This makes retry frequency visible in Grafana without adding new instruments.
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit c5957bc. Bugbot is set up for automated code reviews on this repo.
This PR restructures production-critical GCS read paths; removing fixed timeouts from OpenRangeReader and WriteTo shifts deadline responsibility to callers, which warrants validating the full call graph.
Overview
This PR restructures GCS read retry behavior across four files: a new gcs_retry.go helper implementing exponential backoff with a fresh context.WithTimeout per attempt, comprehensive unit tests in gcs_retry_test.go, and core changes to storage_google.go that remove the shared 10s timeout from ReadAt, OpenRangeReader, and WriteTo. The gcp_multipart.go change is a comment-only update.
Security risks
No security concerns — this is purely a retry/timeout restructuring within an internal storage abstraction. No auth, crypto, or permission changes.
Level of scrutiny
This deserves human review. The changes are well-motivated and the implementation logic in retryWithBackoff is correct, but removing the fixed timeouts from OpenRangeReader and WriteTo shifts deadline responsibility entirely to callers. If any caller omits a deadline context, reads could hang indefinitely. The PR documents this design decision explicitly, but verifying that all callers set appropriate deadlines requires reading the broader call graph, which is worth having a human validate.
Other factors
The only bug found is a nit in the test: the backoff-growth assertion uses prevGap*3/4 (75% threshold) instead of verifying growth, contradicting the comment claiming >1.5x. This is a test quality issue only — production behavior is unaffected.
Fix dogsled (triple blank identifier), nlreturn (missing blank lines before returns), and testifylint (assert.ErrorIs -> require.ErrorIs).
… retries to 1 Increase googlePerAttemptTimeout from 3s to 5s — 3s may not be enough when hitting a slow GCS slot. Also fix the constants comment to accurately describe the original failure mode (slow responses leaving 1-2 truncated attempts, not all attempts starved). Reduce sliceMaxRetries from 3 to 1 since ReadAt now does 3 proper retries internally. Combined worst-case: 2 × 16s ≈ 32s per page fault.
Add gcs-per-attempt-timeout-ms and gcs-max-read-attempts LaunchDarkly flags so retry parameters can be tuned in production without redeployment. The flags are resolved in sandbox Factory.CreateSandbox/ResumeSandbox (where the LD evaluation context with team/template is available) and embedded in context.Context via storage.WithReadRetryConfig. The storage layer reads them at ReadAt call time, falling back to package-level defaults when no override is present.
Add ±25% random jitter to the exponential backoff sleep in retryWithBackoff. During GCS brownouts affecting many sandboxes, deterministic backoff intervals would cause all retries to fire simultaneously. The jitter spreads retry load across time while preserving the exponential progression.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 838489d.
Opening this for review; the test failures don't seem to be related.
…ckoff sleep The backoff sleep path was returning the stale GCS error from the previous attempt instead of the context cancellation error, making it impossible for callers to distinguish shutdown from transient GCS failure.
…test Fixes testifylint lint violation.

The GCS client library retries within the caller's context. Our previous
config shared a single 10s context.WithTimeout across 10 retry attempts,
so later attempts had almost no time left — the context expired before
they could do meaningful work. This was the root cause behind transient
GCS failures killing sandboxes via UFFD page faults.
#2389 added a retry at the Slice level as an immediate fix. This change
addresses the underlying problem in the storage layer:

- ReadAt now retries with a fresh context.WithTimeout per attempt
  (3 attempts, exponential backoff), instead of relying on the library's
  built-in retry within a shared 10s context.
- Removed the 10s timeout that wrapped OpenRangeReader's reader lifetime,
  which was cancelling progressive 4 MiB streaming reads mid-chunk.
  Callers now control the deadline via their own context.
- Removed WriteTo's fixed timeout, used for
  large object downloads. Removed in favor of the caller's context.
- Disabled the library's built-in retries on the read path's
  object handle.