E2B sandbox timeout on simple hello_world.py validation #95

@Abernaughty

Description

Problem

When submitting simple Python tasks through the dashboard, sandbox validation produces misleading results. Originally reported as a timeout, but investigation revealed deeper structural issues.

Investigation Results

Reproduction test (dev-suite/scripts/reproduce_timeout.py) ran all 5 scenarios in under 1s total — sandbox creation (0.77s), full orchestrator path (0.29s), simple execution (0.09s). The timeout is not reproducible and was likely a transient E2B API/network issue.

Langfuse trace analysis (18 traces exported) confirmed:

  • timed_out=false on every sandbox result across 6 traced runs
  • Every sandbox run reports: "/bin/sh: 1: ruff: not found"
  • Sandbox returns exit_code=0 despite ruff failing (Python wrapper script succeeds even though subprocess fails)
  • pytest tests/ -v collects 0 tests because tests/ doesn't exist for simple script tasks

Root Causes (Confirmed)

1. ruff not installed on code-interpreter-v1

Every Python validation run fails immediately with ruff: not found (exit 127). Since commands are joined with &&, this prevents pytest from ever running.

2. && chain masks failures

_run_sandbox_validation() joins commands with " && ".join(commands). When ruff fails, pytest never executes. But the Python wrapper script that calls subprocess.run() itself exits cleanly, so SandboxResult.exit_code=0. QA sees a "passed" sandbox with no test results.
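The masking can be reproduced outside the sandbox with a minimal sketch. A deliberately missing tool name stands in for the absent ruff binary, since ruff itself may be installed on the machine running this:

```python
import subprocess

# Stand-in for the real validation chain; "definitely-missing-tool"
# plays the role of the absent ruff binary.
commands = ["definitely-missing-tool check .", "echo never-reached"]
joined = " && ".join(commands)

inner = subprocess.run(joined, shell=True, capture_output=True, text=True)

# /bin/sh cannot find the first command, so the chain exits 127 and
# the second command never runs -- the failure mode described above.
print(inner.returncode)                  # 127
print("never-reached" in inner.stdout)   # False

# A wrapper script that runs this chain and then simply falls off the
# end (never calling sys.exit(inner.returncode)) exits 0 itself --
# which is the value SandboxResult.exit_code ends up reporting.
```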

3. No validation strategy for simple scripts

validation_commands.py has one strategy: ruff + pytest tests/. A single-file script like hello_world.py with no test suite gets this treatment, which validates nothing meaningful. There's no "just run the file and check exit code" path.
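As a hedged sketch of what the missing path could look like (the enum names come from the fix plan below; select_strategy and its parsed_files heuristic are illustrative, not the project's actual code):

```python
from enum import Enum, auto
from pathlib import Path

class ValidationStrategy(Enum):
    SCRIPT_EXEC = auto()   # run the file, check exit code + stdout
    TEST_SUITE = auto()    # current ruff + pytest tests/ path
    LINT_ONLY = auto()     # syntax/lint check only

def select_strategy(parsed_files: list[str]) -> ValidationStrategy:
    # Illustrative heuristic: a test suite wins if any file lives
    # under tests/; a lone .py file is just executed; otherwise lint.
    if any("tests" in Path(f).parts for f in parsed_files):
        return ValidationStrategy.TEST_SUITE
    if len(parsed_files) == 1 and parsed_files[0].endswith(".py"):
        return ValidationStrategy.SCRIPT_EXEC
    return ValidationStrategy.LINT_ONLY

print(select_strategy(["hello_world.py"]).name)  # SCRIPT_EXEC
```

Under this selection, hello_world.py would get a "just run the file" treatment instead of an empty pytest collection.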

4. No validation_skipped flag

SandboxResult has no way to distinguish "tests passed" from "nothing was tested." QA cannot tell the difference.

Fix Plan

  1. Sequential command execution — Replace && join with independent subprocess.run() calls per command. Aggregate results. A missing ruff doesn't prevent pytest.
  2. Validation strategy enum — SCRIPT_EXEC (run file, check exit + stdout), TEST_SUITE (current ruff + pytest), LINT_ONLY (syntax check only). Selection based on parsed_files content.
  3. Ruff availability guard — run a which ruff check before executing. Skip with a warning if unavailable.
  4. validation_skipped flag — Add to SandboxResult so QA knows when nothing was actually tested.
  5. Tests for all new paths.

Files to modify

  • src/sandbox/e2b_runner.py — Sequential execution, run_script() method, validation_skipped flag
  • src/sandbox/validation_commands.py — Strategy enum, selection logic
  • src/orchestrator.py — sandbox_validate_node uses the new strategy
  • tests/ — New tests for each strategy
