vault: gracefully handle individual blob broadcast failures in Observation#21765
Conversation
…ation Previously, if any single payload failed to broadcast as a blob during the Observation phase, the entire observation was aborted and returned an error. This is unnecessarily disruptive — one problematic payload (e.g. transient network issue, malformed data) would prevent all other valid payloads from being included in the observation, stalling the OCR round. Now, individual broadcast failures are logged as warnings (with the request ID and error details) and the failed payload is simply excluded from PendingQueueItems. The remaining payloads continue to be broadcast and observed normally. The blob broadcast logic is extracted into a dedicated broadcastBlobPayloads method for clarity. Made-with: Cursor
|
👋 prashantkumar1982, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
✅ No conflicts with other open PRs targeting |
|
I see you updated files related to
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c16097773c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Check ctx.Err() when BroadcastBlob fails so that context.Canceled and context.DeadlineExceeded are returned immediately rather than swallowed. This preserves fail-fast semantics for expired OCR rounds while still skipping item-specific transient errors. Made-with: Cursor
| var g errgroup.Group | ||
| for i, payload := range payloads { | ||
| g.Go(func() error { | ||
| blobHandle, err := fetcher.BroadcastBlob(ctx, payload, ocr3_1types.BlobExpirationHintSequenceNumber{SeqNr: seqNr + 2}) |
There was a problem hiding this comment.
@prashantkumar1982 The way I read this a single request that takes a long time will delay the whole batch, and could even cause it to fail since there's no actual timeout associated with the request (ctx will only be cancelled when the epoch changes)
Is it worth adding an explicit timeout for these requests?
There was a problem hiding this comment.
Hmm, yes if there's a reason to believe these calls can be stuck for a long time.
My understanding was that these were local calls, and unlikely to stall the whole observation phase for a long time.
Each parallel BroadcastBlob call now gets a 2-second timeout derived from the parent context. A slow individual broadcast will be cancelled and skipped without stalling the rest of the batch. Parent context cancellation still propagates immediately for round-level failures. Made-with: Cursor
|





Summary
During the Observation phase, pending queue payloads are broadcast as blobs in parallel. Previously, if any single broadcast failed (transient network error, malformed data, etc.), the entire observation was aborted — no payloads were included, and the OCR round stalled.
This changes the behavior so that individual failures are isolated: a failed broadcast is logged as a warning (with the request ID and error) and that payload is excluded from
PendingQueueItems. All remaining payloads continue to be broadcast and observed normally.What changed
broadcastBlobPayloadsmethod for readability. It accepts payloads and request IDs, runs broadcasts concurrently, and returns only the successfully broadcast blob handles.Why
The observation step is critical to OCR round progress. Aborting it entirely because one out of N payloads hit a transient failure is disproportionate — especially since the failed payload can simply be retried in a future round. Graceful degradation keeps rounds moving and avoids cascading stalls.