fix(storage): restructure GCS read retries with fresh context per attempt#2405
Conversation
…empt The GCS client library retries within the caller's context, so the previous configuration (10 retry attempts sharing a single 10s timeout) left later attempts with almost no time to complete — the context expired before they could do meaningful work. Replace the library's built-in retry with our own loop that creates a fresh 5s context.WithTimeout per attempt (4 attempts, exponential backoff 200ms-2s). This gives every attempt a full deadline and proper backoff between them. Also remove the 10s timeout that wrapped OpenRangeReader's entire reader lifetime — it was silently cancelling progressive 4MiB chunk reads mid-stream. Callers (e.g. StreamingChunker with its 60s fetch timeout) now control the reader deadline via their own context.
…EOF propagation The previous commit (c6774c6b0) set MaxAttempts(1) on the shared object handle, which correctly disabled library retries for ReadAt (handled by retryWithBackoff) but inadvertently removed retries from Delete, Size, Put, WriteTo, and OpenRangeReader. Fix by introducing a second handle (readAtHandle) with retries disabled, used exclusively by readAtOnce. The primary handle retains the original 10-attempt library retry config for all other operations. Also fix readAtOnce swallowing io.EOF — the cache layer's isCompleteRead relies on io.EOF to identify cacheable short reads (last chunk). Without it, end-of-file chunks were never cached, causing repeated GCS fetches. Additionally replace time.After with time.NewTimer in retryWithBackoff to avoid a minor timer leak on context cancellation.
…backoff Previously, when ctx.Done() fired during the backoff sleep, the loop continued to the next iteration and called fn() on the already-cancelled context — wasting a gRPC dial before the ctx.Err() check broke out. Now we return directly from the select case.
…struction OpenSeekable and OpenBlob had identical handle/readAtHandle construction logic copy-pasted. Extract it into gcpStorage.newObject() so both methods delegate to a single place.
retryWithBackoff now returns the attempt count (1-indexed), which ReadAt passes as a retry.attempts attribute on the existing orchestrator.storage.gcs.read histogram and counters. This makes retry frequency visible in Grafana without adding new instruments.
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit c5957bc. Bugbot is set up for automated code reviews on this repo.
This PR restructures production-critical GCS read paths; removing fixed timeouts from OpenRangeReader and WriteTo shifts deadline responsibility to callers, which warrants validating the full call graph.
Overview
This PR restructures GCS read retry behavior across four files: a new gcs_retry.go helper implementing exponential backoff with a fresh context.WithTimeout per attempt, comprehensive unit tests in gcs_retry_test.go, and core changes to storage_google.go that remove the shared 10s timeout from ReadAt, OpenRangeReader, and WriteTo. The gcp_multipart.go change is a comment-only update.
Security risks
No security concerns — this is purely a retry/timeout restructuring within an internal storage abstraction. No auth, crypto, or permission changes.
Level of scrutiny
This deserves human review. The changes are well-motivated and the implementation logic in retryWithBackoff is correct, but removing the fixed timeouts from OpenRangeReader and WriteTo shifts deadline responsibility entirely to callers. If any caller omits a deadline context, reads could hang indefinitely. The PR documents this design decision explicitly, but verifying that all callers set appropriate deadlines requires reading the broader call graph, which is worth having a human validate.
Other factors
The only bug found is a nit in the test: the backoff-growth assertion uses prevGap*3/4 (75% threshold) instead of verifying growth, contradicting the comment claiming >1.5x. This is a test quality issue only — production behavior is unaffected.
Fix dogsled (triple blank identifier), nlreturn (missing blank lines before returns), and testifylint (assert.ErrorIs -> require.ErrorIs).
… retries to 1 Increase googlePerAttemptTimeout from 3s to 5s — 3s may not be enough when hitting a slow GCS slot. Also fix the constants comment to accurately describe the original failure mode (slow responses leaving 1-2 truncated attempts, not all attempts starved). Reduce sliceMaxRetries from 3 to 1 since ReadAt now does 3 proper retries internally. Combined worst-case: 2 × 16s ≈ 32s per page fault.
Add gcs-per-attempt-timeout-ms and gcs-max-read-attempts LaunchDarkly flags so retry parameters can be tuned in production without redeployment. The flags are resolved in sandbox Factory.CreateSandbox/ResumeSandbox (where the LD evaluation context with team/template is available) and embedded in context.Context via storage.WithReadRetryConfig. The storage layer reads them at ReadAt call time, falling back to package-level defaults when no override is present.
Add ±25% random jitter to the exponential backoff sleep in retryWithBackoff. During GCS brownouts affecting many sandboxes, deterministic backoff intervals would cause all retries to fire simultaneously. The jitter spreads retry load across time while preserving the exponential progression.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 838489d.
Opening this for review; the test failures don't seem to be related.
…ckoff sleep The backoff sleep path was returning the stale GCS error from the previous attempt instead of the context cancellation error, making it impossible for callers to distinguish shutdown from transient GCS failure.
…test Fixes testifylint lint violation.

The GCS client library retries within the caller's context. Our previous
config shared a single 10s context.WithTimeout across 10 retry attempts,
so later attempts had almost no time left — the context expired before
they could do meaningful work. This was the root cause behind transient
GCS failures killing sandboxes via UFFD page faults.
#2389 added a retry at the Slice level as an immediate fix. This change
addresses the underlying problem in the storage layer:

- ReadAt now retries with a fresh context.WithTimeout per attempt
  (3 attempts, exponential backoff), instead of relying on the library's
  built-in retry within a shared 10s context.
- Removed the 10s timeout that wrapped OpenRangeReader's reader lifetime,
  which was cancelling progressive 4 MiB streaming reads mid-chunk.
  Callers now control the deadline via their own context.
- Removed WriteTo's fixed timeout, used for
  large object downloads. Removed in favor of the caller's context.
- Disabled the library's built-in retries on the read path's
  object handle.