fix(execution): queued execution finalization and async correlation#3535
fix(execution): queued execution finalization and async correlation#3535PlaneInABottle wants to merge 9 commits intosimstudioai:stagingfrom
Conversation
PR SummaryHigh Risk Overview Threads a new Refactors Written by Cursor Bugbot for commit a8e6e0a. This will update automatically on new commits. Configure here. |
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
Greptile SummaryThis PR fixes two related reliability issues in async execution paths: it moves execution finalization out of fire-and-forget wrappers into the core execution path so terminal state is durably written for webhook, schedule, and workflow runs, and it pre-assigns correlation identifiers at the point of enqueueing so background jobs can trace back to the originating request without generating new IDs. Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant R as Route Handler<br/>(schedule/webhook/workflow)
participant Q as Job Queue
participant BG as Background Executor<br/>(schedule/webhook/workflow)
participant PP as preprocessExecution
participant LS as LoggingSession
participant EC as executeWorkflowCore
participant LOG as ExecutionLogger
R->>R: generate executionId + correlation
R->>Q: enqueue(payload { executionId, requestId, correlation }, metadata { correlation })
Q-->>BG: deliver job
BG->>BG: buildXxxCorrelation(payload)<br/>reuse or generate executionId/requestId
BG->>LS: new LoggingSession(workflowId, executionId, triggerType, requestId)
BG->>PP: preprocessExecution({ ..., triggerData: { correlation }, loggingSession })
alt Preprocessing fails
PP->>LS: logPreprocessingError({ ..., triggerData }) → safeCompleteWithError
LS->>LOG: startWorkflowExecution + completeWithError<br/>(correlation preserved in executionData)
else Preprocessing succeeds
BG->>LS: safeStart({ triggerData: { correlation } })
LS->>LOG: startWorkflowExecution<br/>executionData.correlation = triggerData.correlation
BG->>EC: executeWorkflowCore({ snapshot, loggingSession })
EC->>LS: safeStart (with metadata.correlation)
alt Execution succeeds
EC->>EC: finalizeExecutionOutcome()
EC->>LS: safeComplete / safeCompleteWithCancellation / safeCompleteWithPause
LS->>LOG: completeWorkflowExecution<br/>buildCompletedExecutionData preserves correlation
else Execution throws
EC->>EC: finalizeExecutionError()
EC->>LS: safeCompleteWithError
EC->>EC: markExecutionFinalizedByCore(error, executionId)
EC-->>BG: rethrow error
BG->>BG: wasExecutionFinalizedByCore(error, executionId) → true<br/>skip duplicate finalization
end
end
Last reviewed commit: 78db4ee |
Let executeWorkflowCore own normal-path logging start so queued workflow and schedule executions persist the richer deployment and environment metadata instead of an earlier placeholder start record.
|
@PlaneInABottle I'll review this. But can you mark the bugbot/greptile comments as resolved or respond to them if they're inaccurate. |
Prevent leaked core finalization markers from accumulating while keeping outer recovery paths idempotent. Preserve best-effort logging completion by reusing settled completion promises instead of reopening duplicate terminal writes.
Keep execution finalization cleanup best-effort so cancellation cleanup failures do not overwrite successful or failed outcomes. Restore webhook processor formatting to the repository Biome style to avoid noisy formatter churn.
Retry minimal logging for early failures, only mark core finalization after a log row actually completes, and let paused completions fall back cleanly.
Scan all finalized execution ids during TTL cleanup so refreshed keys cannot keep expired guards alive, and cover the reused-id ordering regression.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| this.completionAttemptFailed = true | ||
| throw error | ||
| }) | ||
|
|
There was a problem hiding this comment.
Escalation path in completion deduplication is dead code
Low Severity
shouldStartNewCompletionAttempt can never return true because completionAttemptFailed is only set in runCompletionAttempt's .catch handler, which only fires if the run() function rejects. The run() functions are _safeCompleteImpl, _safeCompleteWithErrorImpl, etc. — all of which swallow every error internally (via their completeWithCostOnlyLog fallback, which also swallows errors). So completionAttemptFailed is permanently false, the escalation condition is unreachable, and the PR's stated guarantee of "allowing the specific escalation path from a failed non-error completion attempt into error finalization" never actually fires.


Summary
This PR isolates the first two remediation steps from a larger webhook/async execution incident branch.
It does two things:
Problem
We were seeing cases where async/queued executions could finish logically, but terminal-state finalization was still dependent on detached follow-up behavior outside the core execution path.
That creates two operational problems:
Root cause
1) Terminal-state finalization was not owned strongly enough by the core execution path
Some queued execution flows still depended on wrapper-side cleanup/recovery behavior to finish terminal logging/finalization. That means the execution engine and the durable terminal-state write were not fully coupled.
2) Async correlation was not propagated consistently across queued boundaries
Queued webhook/schedule/workflow execution paths did not always carry the same correlation information all the way through preprocessing, enqueueing, and background execution. That made it harder to connect:
What changed
A. Finalize runs inside core execution
This PR moves terminal finalization responsibility into the core execution flow so the same code path that determines the execution outcome is also responsible for finalizing it durably.
Concretely, the changes ensure that:
B. Preserve async correlation across queued execution paths
This PR also threads correlation data through queued execution paths so the same execution/request identity survives across:
Concretely, the changes add or preserve:
Files / surfaces affected
Main execution/finalization surfaces:
apps/sim/lib/workflows/executor/execution-core.tsapps/sim/lib/logs/execution/logging-session.tsapps/sim/background/workflow-execution.tsapps/sim/background/webhook-execution.tsapps/sim/background/schedule-execution.tsAsync correlation / queueing surfaces:
apps/sim/app/api/workflows/[id]/execute/route.tsapps/sim/app/api/schedules/execute/route.tsapps/sim/lib/webhooks/processor.tsapps/sim/lib/execution/preprocessing.tsapps/sim/lib/core/async-jobs/types.tsapps/sim/lib/core/async-jobs/backends/trigger-dev.tsWhy this PR is intentionally scoped this way
This PR is intentionally limited to the first two fixes from the broader incident work:
It does not include the later follow-up work around:
That split is intentional so this PR stays focused on the minimum behavior changes needed to:
Testing
Targeted tests were rerun on this isolated branch:
Finalization / execution-core coverage
bun --cwd apps/sim vitest run lib/workflows/executor/execution-core.test.ts lib/logs/execution/logging-session.test.tsAsync correlation coverage
bun --cwd apps/sim vitest run background/async-execution-correlation.test.ts background/async-preprocessing-correlation.test.ts lib/execution/preprocessing.test.ts lib/execution/preprocessing.webhook-correlation.test.ts "app/api/workflows/[id]/execute/route.async.test.ts" "app/api/schedules/execute/route.test.ts" "app/api/webhooks/trigger/[path]/route.test.ts"Results on this PR branch:
Reviewer notes
The large diff is mostly because the webhook execution / processor surfaces sit at the boundary where correlation has to be preserved end-to-end.
The intended review focus is: