
vault: gracefully handle individual blob broadcast failures in Observation#21765

Merged
prashantkumar1982 merged 3 commits into develop from vault/graceful-blob-broadcast-failures
Mar 30, 2026
Conversation

Contributor

@prashantkumar1982 prashantkumar1982 commented Mar 28, 2026

Summary

During the Observation phase, pending queue payloads are broadcast as blobs in parallel. Previously, if any single broadcast failed (transient network error, malformed data, etc.), the entire observation was aborted — no payloads were included, and the OCR round stalled.

This changes the behavior so that individual failures are isolated: a failed broadcast is logged as a warning (with the request ID and error) and that payload is excluded from PendingQueueItems. All remaining payloads continue to be broadcast and observed normally.

What changed

  • New behavior: A single blob broadcast failure no longer aborts the whole observation. The failed item is skipped, a warning is logged, and the rest proceed.
  • Refactor: The parallel broadcast logic is extracted into a broadcastBlobPayloads method for readability. It accepts payloads and request IDs, runs broadcasts concurrently, and returns only the successfully broadcast blob handles.

Why

The observation step is critical to OCR round progress. Aborting it entirely because one out of N payloads hit a transient failure is disproportionate — especially since the failed payload can simply be retried in a future round. Graceful degradation keeps rounds moving and avoids cascading stalls.

…ation

Previously, if any single payload failed to broadcast as a blob during the
Observation phase, the entire observation was aborted and returned an error.
This is unnecessarily disruptive — one problematic payload (e.g. transient
network issue, malformed data) would prevent all other valid payloads from
being included in the observation, stalling the OCR round.

Now, individual broadcast failures are logged as warnings (with the request
ID and error details) and the failed payload is simply excluded from
PendingQueueItems. The remaining payloads continue to be broadcast and
observed normally.

The blob broadcast logic is extracted into a dedicated
broadcastBlobPayloads method for clarity.

Made-with: Cursor
@github-actions
Contributor

👋 prashantkumar1982, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Contributor

github-actions bot commented Mar 28, 2026

✅ No conflicts with other open PRs targeting develop

@github-actions
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in its text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c16097773c


Check ctx.Err() when BroadcastBlob fails so that context.Canceled and
context.DeadlineExceeded are returned immediately rather than swallowed.
This preserves fail-fast semantics for expired OCR rounds while still
skipping item-specific transient errors.

Made-with: Cursor
@trunk-io

trunk-io bot commented Mar 28, 2026


var g errgroup.Group
for i, payload := range payloads {
	g.Go(func() error {
		blobHandle, err := fetcher.BroadcastBlob(ctx, payload, ocr3_1types.BlobExpirationHintSequenceNumber{SeqNr: seqNr + 2})
Contributor


@prashantkumar1982 The way I read this, a single request that takes a long time will delay the whole batch, and could even cause it to fail, since there's no actual timeout associated with the request (ctx will only be cancelled when the epoch changes)

Is it worth adding an explicit timeout for these requests?

Contributor Author


Hmm, yes if there's a reason to believe these calls can be stuck for a long time.
My understanding was that these were local calls, and unlikely to stall the whole observation phase for a long time.

Each parallel BroadcastBlob call now gets a 2-second timeout derived from
the parent context. A slow individual broadcast will be cancelled and
skipped without stalling the rest of the batch. Parent context
cancellation still propagates immediately for round-level failures.

Made-with: Cursor

@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Mar 30, 2026
Merged via the queue into develop with commit 17ffab6 Mar 30, 2026
263 of 267 checks passed
@prashantkumar1982 prashantkumar1982 deleted the vault/graceful-blob-broadcast-failures branch March 30, 2026 19:15