
fix(storage): restructure GCS read retries with fresh context per attempt#2405

Open

arkamar wants to merge 12 commits into main from feat/gcs-retry

Conversation

@arkamar
Contributor

@arkamar arkamar commented Apr 15, 2026

The GCS client library retries within the caller's context. Our previous
config shared a single 10s context.WithTimeout across 10 retry attempts,
so later attempts had almost no time left — the context expired before
they could do meaningful work. This was the root cause behind transient
GCS failures killing sandboxes via UFFD page faults.

#2389 added a retry at the Slice level as an immediate fix. This change
addresses the underlying problem in the storage layer:

  • ReadAt now uses its own retry loop with a fresh 3s context per attempt
    (3 attempts, exponential backoff), instead of relying on the library's
    built-in retry within a shared 10s context.
  • OpenRangeReader had a 10s timeout wrapping the entire reader lifetime,
    which was cancelling progressive 4 MiB streaming reads mid-chunk.
    Callers now control the deadline via their own context.
  • WriteTo had the same 10s context starving the library's retries for
    large object downloads. Removed in favor of the caller's context.
  • Non-read operations keep the library's built-in retry via a separate
    object handle.

arkamar added 5 commits April 15, 2026 17:21
…empt

The GCS client library retries within the caller's context, so the previous
configuration (10 retry attempts sharing a single 10s timeout) left later
attempts with almost no time to complete — the context expired before they
could do meaningful work.

Replace the library's built-in retry with our own loop that creates a fresh
5s context.WithTimeout per attempt (4 attempts, exponential backoff 200ms-2s).
This gives every attempt a full deadline and proper backoff between them.

Also remove the 10s timeout that wrapped OpenRangeReader's entire reader
lifetime — it was silently cancelling progressive 4MiB chunk reads mid-stream.
Callers (e.g. StreamingChunker with its 60s fetch timeout) now control the
reader deadline via their own context.
…EOF propagation

The previous commit (c6774c6b0) set MaxAttempts(1) on the shared object
handle, which correctly disabled library retries for ReadAt (handled by
retryWithBackoff) but inadvertently removed retries from Delete, Size,
Put, WriteTo, and OpenRangeReader.

Fix by introducing a second handle (readAtHandle) with retries disabled,
used exclusively by readAtOnce. The primary handle retains the original
10-attempt library retry config for all other operations.

Also fix readAtOnce swallowing io.EOF — the cache layer's isCompleteRead
relies on io.EOF to identify cacheable short reads (last chunk). Without
it, end-of-file chunks were never cached, causing repeated GCS fetches.

Additionally replace time.After with time.NewTimer in retryWithBackoff
to avoid a minor timer leak on context cancellation.
…backoff

Previously, when ctx.Done() fired during the backoff sleep, the loop
continued to the next iteration and called fn() on the already-cancelled
context — wasting a gRPC dial before the ctx.Err() check broke out.
Now we return directly from the select case.
…struction

OpenSeekable and OpenBlob had identical handle/readAtHandle construction
logic copy-pasted. Extract it into gcpStorage.newObject() so both
methods delegate to a single place.
retryWithBackoff now returns the attempt count (1-indexed), which
ReadAt passes as a retry.attempts attribute on the existing
orchestrator.storage.gcs.read histogram and counters. This makes
retry frequency visible in Grafana without adding new instruments.
@cursor

cursor bot commented Apr 15, 2026

PR Summary

Medium Risk
Touches core GCS read paths used during sandbox startup/page faults; incorrect retry/timeout tuning could increase latency or still cause transient read failures under load.

Overview
Refactors GCS reads to avoid a single shared timeout across retries by implementing an explicit ReadAt retry loop with fresh per-attempt deadlines, jittered exponential backoff, and telemetry for attempt counts; the per-attempt timeout and max attempts are now feature-flagged and injected once into sandbox request contexts. It also removes the fixed 10s wrapper timeout around OpenRangeReader and WriteTo so long progressive reads/downloads are governed by the caller’s context, and reduces UFFD Slice()-level retries since ReadAt now retries internally.

Reviewed by Cursor Bugbot for commit c5957bc. Bugbot is set up for automated code reviews on this repo.

@arkamar arkamar marked this pull request as draft April 15, 2026 20:09
Comment thread packages/shared/pkg/storage/gcs_retry.go

@claude claude bot left a comment


This PR restructures production-critical GCS read paths; removing fixed timeouts from OpenRangeReader and WriteTo shifts deadline responsibility to callers, which warrants validating the full call graph.


Overview

This PR restructures GCS read retry behavior across four files: a new gcs_retry.go helper implementing exponential backoff with a fresh context.WithTimeout per attempt, comprehensive unit tests in gcs_retry_test.go, and core changes to storage_google.go that remove the shared 10s timeout from ReadAt, OpenRangeReader, and WriteTo. The gcp_multipart.go change is a comment-only update.

Security risks

No security concerns — this is purely a retry/timeout restructuring within an internal storage abstraction. No auth, crypto, or permission changes.

Level of scrutiny

This deserves human review. The changes are well-motivated and the implementation logic in retryWithBackoff is correct, but removing the fixed timeouts from OpenRangeReader and WriteTo shifts deadline responsibility entirely to callers. If any caller omits a deadline context, reads could hang indefinitely. The PR documents this design decision explicitly, but verifying that all callers set appropriate deadlines requires reading the broader call graph — something worth a human to validate.

Other factors

The only bug found is a nit in the test: the backoff-growth assertion uses prevGap*3/4 (75% threshold) instead of verifying growth, contradicting the comment claiming >1.5x. This is a test quality issue only — production behavior is unaffected.

Comment thread packages/shared/pkg/storage/gcs_retry_test.go
arkamar added 4 commits April 16, 2026 08:56
Fix dogsled (triple blank identifier), nlreturn (missing blank lines
before returns), and testifylint (assert.ErrorIs -> require.ErrorIs).
… retries to 1

Increase googlePerAttemptTimeout from 3s to 5s — 3s may not be enough
when hitting a slow GCS slot. Also fix the constants comment to accurately
describe the original failure mode (slow responses leaving 1-2 truncated
attempts, not all attempts starved).

Reduce sliceMaxRetries from 3 to 1 since ReadAt now does 3 proper
retries internally. Combined worst-case: 2 × 16s ≈ 32s per page fault.
Add gcs-per-attempt-timeout-ms and gcs-max-read-attempts LaunchDarkly
flags so retry parameters can be tuned in production without redeployment.

The flags are resolved in sandbox Factory.CreateSandbox/ResumeSandbox
(where the LD evaluation context with team/template is available) and
embedded in context.Context via storage.WithReadRetryConfig. The storage
layer reads them at ReadAt call time, falling back to package-level
defaults when no override is present.
Add ±25% random jitter to the exponential backoff sleep in
retryWithBackoff. During GCS brownouts affecting many sandboxes,
deterministic backoff intervals would cause all retries to fire
simultaneously. The jitter spreads retry load across time while
preserving the exponential progression.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 838489d.

Comment thread packages/shared/pkg/storage/gcs_retry.go
@arkamar arkamar assigned jakubno and unassigned bchalios Apr 17, 2026
@arkamar arkamar marked this pull request as ready for review April 17, 2026 07:47
@arkamar
Contributor Author

arkamar commented Apr 17, 2026

Opening this for review; the test failures don't seem to be related.

Comment thread packages/shared/pkg/storage/gcs_retry.go
arkamar added 2 commits April 17, 2026 10:34
…ckoff sleep

The backoff sleep path was returning the stale GCS error from the
previous attempt instead of the context cancellation error, making it
impossible for callers to distinguish shutdown from transient GCS failure.


3 participants