feat(transport): add HTTP retry with exponential backoff#1520
feat(transport): add HTTP retry with exponential backoff#1520
Conversation
|
b083a57 to
a264f66
Compare
|
@sentry review |
|
@cursor review |
|
@sentry review |
|
@cursor review |
|
@cursor review |
|
@cursor review |
243c880 to
d6aa792
Compare
|
@cursor review |
fbbffb2 to
abd5815
Compare
|
@cursor review |
f030cf9 to
22e3fc4
Compare
|
@sentry review |
|
@cursor review |
22e3fc4 to
ffce486
Compare
|
@cursor review |
b133fd6 to
93db900
Compare
Use a progress callback (CURLOPT_XFERINFOFUNCTION) that checks an atomic shutdown flag. When http_transport_shutdown_timeout fires, it sets the flag via curl_client_shutdown, causing curl_easy_perform to abort promptly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Replace retry_dump_queue with retry_seal, leveraging the new bgworker draining flag from upstream. Retry envelopes are already persisted to disk by retry_enqueue, so they don't need explicit dumping — only in-memory http_send_task envelopes are dumped by http_dump_queue. On shutdown timeout: flush cleanup tasks + seal retry to prevent a detached worker from writing duplicate envelopes. Disable http_retry in the crash daemon — it's a one-shot process that sends once and exits. Delegate plain cache writes (retry_count < 0) to the new sentry__envelope_write_to_cache for minidump splitting. Add no-http-retry to all cache_keep integration test runs to ensure they exercise the cache_keep code path, not the retry path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Retain backwards-compatible behavior by defaulting http_retry to false. Add "http-retry" flag to the example app for opting in. Retry integration tests now pass "http-retry" explicitly, while cache tests keep "no-http-retry" for when the default is eventually flipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The envelope tag (next_tag, envelope->tag, sentry__envelope_get_tag) and sealed_tag on retry were only used by retry_dump_cb, which was replaced by retry_seal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0c94acd to
a485974
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
The else branch clearing retry_func on retry allocation failure had no effect: http_send_task decides retry vs cache based on state->retry (the pointer), and http_transport_retry already no-ops when it's NULL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When cache_keep was enabled and retries were exhausted, handle_result would move the envelope to cache format regardless of whether the send succeeded. Only cache on network failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline single-use env dicts and use plain `env` where reused. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When bgworker shutdown times out, remaining queued envelopes are dumped to disk. With retry enabled, these should be written to cache as retry files so they get retried on next startup. The merge that replaced retry_dump_queue with retry_seal lost this: dump_queue was writing to the run directory unconditionally. Write directly to cache via sentry__run_write_cache, bypassing the sealed retry_enqueue since the seal is already set at this point. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Revert http_dump_task_cb change — dumped envelopes belong in *.run/ and get picked up by process_old_runs on next startup. Fix tests to count envelopes in both cache/ (retry) and *.run/ (shutdown dump), since slow transports may not process all sends before the timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The -00-/-01- assertions verified "stop on first failure" (only one envelope gets bumped per retry cycle). On slow transports like WinHTTP (~2s per connect failure), the startup poll may never fire during the second run because old-run envelope sends block the bgworker until shutdown. The retry file from the first run stays unprocessed. Retry bumping behavior is already covered by test_http_retry_multiple_attempts (single envelope, deterministic). This test focuses on total envelope preservation across cache/ and *.run/, which is the meaningful invariant for multiple envelopes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
retry_enqueue transitioned the state from STARTUP to RUNNING to prevent the startup poll from re-processing same-session envelopes via timestamp collision. This was needed when timestamps had second precision, but since the switch to millisecond precision (sentry__usec_time / 1000), the collision window is 1ms — negligible. The transition caused two problems: - The startup poll used backoff mode instead of immediate-send for pre-existing retry files from previous sessions - retry_flush_task during shutdown became a no-op, skipping high-count files entirely Removing the transition lets the startup poll and flush task work as designed: pre-existing retry files are sent promptly on startup and flushed on shutdown regardless of same-session enqueue activity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The -00-/-01- assertions depend on the startup poll or flush task processing the retry file from the previous run. On slow transports (WinHTTP ~2s per connect failure), the flush task is queued behind old-run send tasks and the bgworker exits (draining) before reaching it. The retry file is never bumped. Retry count bumping is deterministically tested by test_http_retry_multiple_attempts (single envelope). This test verifies no envelopes are lost across cache/ and *.run/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore the STARTUP→RUNNING transition in retry_enqueue — removing it caused the startup poll to re-process same-session envelopes (files written by enqueue with ts == startup_time got picked up and bumped). The transition is needed to prevent re-processing, but it made retry_flush_task a no-op during shutdown (state already RUNNING). Fix by making flush unconditional: its job is to send all pending retry files before shutdown, regardless of whether the startup poll already ran. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use startup_time instead of UINT64_MAX to filter files in flush task. UINT64_MAX bypassed all filtering and picked up envelopes written by retry_enqueue during the same session, bumping them from -00- to -01-. With startup_time, the flush only processes files from previous sessions (ts < startup_time), matching the startup poll's behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swap http_send_envelope parameter order to match sentry_retry_send_func_t so it can be passed directly to sentry__retry_start without a wrapper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
write_envelope is used by run_write_envelope and run_write_external which never need retry format. Move the retry path building into sentry__run_write_cache using sentry__run_make_cache_path, removing duplicate snprintf logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Leftover from the envelope tag mechanism removed in a485974. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.
Note
The new
http_retryoption was made opt-in to avoid a behavioral change.Note
Adds an explicit 15s connect timeout to both curl and WinHTTP, matching the value used by Crashpad. This reduces the time the transport worker is blocked on unreachable hosts, previously ~75s on curl (OS TCP SYN retry limit) and 60s on WinHTTP.
Failed envelopes are stored as
<db>/cache/<ts>-<n>-<uuid>.envelopeand retried on startup after a 100ms throttle, and then with exponential backoff (15min, 30min, 1h, 2h, 4h, 8h). When retries are exhausted, and offline caching is enabled, envelopes are stored as<db>/cache/<uuid>.envelopeinstead of being discarded.flowchart TD startup --> sretry{retry?} sretry -->|yes| throttle throttle -. 100ms .-> send send -->|success| discard send -->|fail| retry{retry?} retry -->|no| cache{cache?} cache -->|no| discard cache -->|yes| cache-dump[<db>/cache/<br/><uuid>.envelope] retry -->|yes| cache-retry[<db>/cache/<br/><ts>-<n>-<uuid>.envelope] cache-retry --> backoff backoff -. 2ⁿ×15min .-> sendBuilds upon:
See also:
Closes: #1316