E2B sandbox timeout on simple hello_world.py validation #95
Description
Problem
When submitting simple Python tasks through the dashboard, sandbox validation produces misleading results. Originally reported as a timeout, but investigation revealed deeper structural issues.
Investigation Results
Reproduction test (dev-suite/scripts/reproduce_timeout.py) ran all 5 scenarios in under 1s total — sandbox creation (0.77s), full orchestrator path (0.29s), simple execution (0.09s). The timeout is not reproducible and was likely a transient E2B API/network issue.
Langfuse trace analysis (18 traces exported) confirmed:
- `timed_out=false` on every single sandbox result across 6 traced runs
- Every sandbox run reports `"/bin/sh: 1: ruff: not found"`
- Sandbox returns `exit_code=0` despite ruff failing (the Python wrapper script succeeds even though the subprocess fails)
- `pytest tests/ -v` collects 0 tests because `tests/` doesn't exist for simple script tasks
Root Causes (Confirmed)
1. ruff not installed on code-interpreter-v1
Every Python validation run fails immediately with `ruff: not found` (exit 127). Since commands are joined with `&&`, this prevents pytest from ever running.
2. && chain masks failures
`_run_sandbox_validation()` joins commands with `" && ".join(commands)`. When ruff fails, pytest never executes. But the Python wrapper script that calls `subprocess.run()` itself exits cleanly, so `SandboxResult.exit_code=0`. QA sees a "passed" sandbox with no test results.
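The masking can be reproduced in a few lines. This is an illustrative sketch, not the actual `e2b_runner.py` code; the `run_validation` helper is hypothetical:

```python
import subprocess

def run_validation(commands: list[str]) -> int:
    # Commands are joined with "&&", so the first failure aborts the chain.
    joined = " && ".join(commands)  # e.g. "ruff check . && pytest tests/ -v"
    result = subprocess.run(joined, shell=True, capture_output=True, text=True)
    print(result.stdout, result.stderr)
    # Bug: result.returncode (127 when ruff is missing) is never propagated.
    # The wrapper process itself exits cleanly, so the sandbox reports 0.
    return 0

exit_code = run_validation(["ruff check .", "pytest tests/ -v"])
# exit_code is 0 even if ruff was never installed
```

Because the wrapper ignores `result.returncode`, the sandbox's reported exit code reflects the wrapper process, not the validation commands.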
3. No validation strategy for simple scripts
`validation_commands.py` has one strategy: ruff + `pytest tests/`. A single-file script like `hello_world.py` with no test suite gets this treatment, which validates nothing meaningful. There's no "just run the file and check exit code" path.
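A strategy-selection path could look roughly like the sketch below. The enum member names come from the fix plan later in this issue, but `select_strategy` and its heuristics are assumptions, not the real `validation_commands.py` API:

```python
from enum import Enum, auto

class ValidationStrategy(Enum):
    SCRIPT_EXEC = auto()  # run the file, check exit code + stdout
    TEST_SUITE = auto()   # current ruff + pytest path
    LINT_ONLY = auto()    # syntax check only

def select_strategy(parsed_files: list[str]) -> ValidationStrategy:
    # Hypothetical selection logic based on parsed_files content.
    if any(f.startswith("tests/") for f in parsed_files):
        return ValidationStrategy.TEST_SUITE
    if len(parsed_files) == 1 and parsed_files[0].endswith(".py"):
        return ValidationStrategy.SCRIPT_EXEC
    return ValidationStrategy.LINT_ONLY
```

Under this sketch, `hello_world.py` alone would get `SCRIPT_EXEC` instead of a meaningless `pytest tests/` run.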
4. No validation_skipped flag
`SandboxResult` has no way to distinguish "tests passed" from "nothing was tested." QA cannot tell the difference.
Fix Plan
- Sequential command execution — Replace the `&&` join with independent `subprocess.run()` calls per command. Aggregate results. A missing ruff doesn't prevent pytest.
- Validation strategy enum — `SCRIPT_EXEC` (run file, check exit + stdout), `TEST_SUITE` (current ruff + pytest), `LINT_ONLY` (syntax check only). Selection based on `parsed_files` content.
- Ruff availability guard — `which ruff` check before running. Skip with a warning if unavailable.
- `validation_skipped` flag — Add to `SandboxResult` so QA knows when nothing was actually tested.
- Tests for all new paths.
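The sequential execution, availability guard, and `validation_skipped` flag could fit together as in this minimal sketch (the `CommandResult` shape and `run_commands_sequentially` name are assumptions, not the planned `e2b_runner.py` implementation):

```python
import shutil
import subprocess
from dataclasses import dataclass, field

@dataclass
class CommandResult:
    command: str
    exit_code: int
    skipped: bool = False

@dataclass
class SandboxResult:
    results: list[CommandResult] = field(default_factory=list)
    validation_skipped: bool = False  # True when nothing was actually tested

def run_commands_sequentially(commands: list[str]) -> SandboxResult:
    results = []
    for cmd in commands:
        tool = cmd.split()[0]
        # Availability guard: record a skip instead of failing the whole chain.
        if shutil.which(tool) is None:
            results.append(CommandResult(cmd, exit_code=127, skipped=True))
            continue
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results.append(CommandResult(cmd, proc.returncode))
    ran_anything = any(not r.skipped for r in results)
    return SandboxResult(results=results, validation_skipped=not ran_anything)
```

A missing ruff then shows up as a skipped entry with exit 127 while pytest still runs, and `validation_skipped` flips to `True` only when every command was skipped.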
Files to modify
- `src/sandbox/e2b_runner.py` — Sequential execution, `run_script()` method, `validation_skipped` flag
- `src/sandbox/validation_commands.py` — Strategy enum, selection logic
- `src/orchestrator.py` — `sandbox_validate_node` uses new strategy
- `tests/` — New tests for each strategy
Context
- Original task: `task-3ee2b4c7` (hello_world.py via dashboard)
- 6,896 tokens used, $0.08 cost, 3/3 retries exhausted
- Related: #97 (API runner bypasses Langfuse tracing — dashboard tasks are un-traced)