E2B sandbox timeout on simple hello_world.py validation #95
Description
Problem
When submitting simple Python tasks through the dashboard, sandbox validation produces misleading results. Originally reported as a timeout, but investigation revealed deeper structural issues.
Investigation Results
Reproduction test (dev-suite/scripts/reproduce_timeout.py) ran all 5 scenarios in under 1s total — sandbox creation (0.77s), full orchestrator path (0.29s), simple execution (0.09s). The timeout is not reproducible and was likely a transient E2B API/network issue.
Langfuse trace analysis (18 traces exported) confirmed:
- `timed_out=false` on every single sandbox result across 6 traced runs
- Every sandbox run reports `"/bin/sh: 1: ruff: not found"`
- Sandbox returns `exit_code=0` despite ruff failing (the Python wrapper script succeeds even though the subprocess fails)
- `pytest tests/ -v` collects 0 tests because `tests/` doesn't exist for simple script tasks
Root Causes (Confirmed)
1. ruff not installed on code-interpreter-v1
Every Python validation run fails immediately with `ruff: not found` (exit 127). Since commands are joined with `&&`, this prevents pytest from ever running.
2. && chain masks failures
`_run_sandbox_validation()` joins commands with `" && ".join(commands)`. When ruff fails, pytest never executes. But the Python wrapper script that calls `subprocess.run()` itself exits cleanly, so `SandboxResult.exit_code=0`. QA sees a "passed" sandbox with no test results.
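The masking can be reproduced in a few lines. This is an illustrative sketch, not the actual `e2b_runner.py` code; the `run_validation` helper is hypothetical:

```python
import subprocess

def run_validation(commands: list[str]) -> int:
    # Commands are joined with "&&", so the first failure aborts the chain.
    joined = " && ".join(commands)  # e.g. "ruff check . && pytest tests/ -v"
    result = subprocess.run(joined, shell=True, capture_output=True, text=True)
    print(result.stdout, result.stderr)
    # Bug: result.returncode (127 when ruff is missing) is never propagated.
    # The wrapper process itself exits cleanly, so the sandbox reports 0.
    return 0

exit_code = run_validation(["ruff check .", "pytest tests/ -v"])
# exit_code is 0 even if ruff was never installed
```

Because the wrapper ignores `result.returncode`, the sandbox's reported exit code reflects the wrapper process, not the validation commands.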
3. No validation strategy for simple scripts
`validation_commands.py` has one strategy: ruff + `pytest tests/`. A single-file script like `hello_world.py` with no test suite gets this treatment, which validates nothing meaningful. There's no "just run the file and check exit code" path.
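A strategy-selection path could look roughly like the sketch below. The enum member names come from the fix plan later in this issue, but `select_strategy` and its heuristics are assumptions, not the real `validation_commands.py` API:

```python
from enum import Enum, auto

class ValidationStrategy(Enum):
    SCRIPT_EXEC = auto()  # run the file, check exit code + stdout
    TEST_SUITE = auto()   # current ruff + pytest path
    LINT_ONLY = auto()    # syntax check only

def select_strategy(parsed_files: list[str]) -> ValidationStrategy:
    # Hypothetical selection logic based on parsed_files content.
    if any(f.startswith("tests/") for f in parsed_files):
        return ValidationStrategy.TEST_SUITE
    if len(parsed_files) == 1 and parsed_files[0].endswith(".py"):
        return ValidationStrategy.SCRIPT_EXEC
    return ValidationStrategy.LINT_ONLY
```

Under this sketch, `hello_world.py` alone would get `SCRIPT_EXEC` instead of a meaningless `pytest tests/` run.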
4. No validation_skipped flag
`SandboxResult` has no way to distinguish "tests passed" from "nothing was tested." QA cannot tell the difference.
Fix Plan
- Sequential command execution — Replace the `&&` join with independent `subprocess.run()` calls per command. Aggregate results. A missing ruff doesn't prevent pytest.
- Validation strategy enum — `SCRIPT_EXEC` (run file, check exit + stdout), `TEST_SUITE` (current ruff + pytest), `LINT_ONLY` (syntax check only). Selection based on `parsed_files` content.
- Ruff availability guard — `which ruff` check before running. Skip with a warning if unavailable.
- `validation_skipped` flag — Add to `SandboxResult` so QA knows when nothing was actually tested.
- Tests for all new paths.
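The sequential execution, availability guard, and `validation_skipped` flag could fit together as in this minimal sketch (the `CommandResult` shape and `run_commands_sequentially` name are assumptions, not the planned `e2b_runner.py` implementation):

```python
import shutil
import subprocess
from dataclasses import dataclass, field

@dataclass
class CommandResult:
    command: str
    exit_code: int
    skipped: bool = False

@dataclass
class SandboxResult:
    results: list[CommandResult] = field(default_factory=list)
    validation_skipped: bool = False  # True when nothing was actually tested

def run_commands_sequentially(commands: list[str]) -> SandboxResult:
    results = []
    for cmd in commands:
        tool = cmd.split()[0]
        # Availability guard: record a skip instead of failing the whole chain.
        if shutil.which(tool) is None:
            results.append(CommandResult(cmd, exit_code=127, skipped=True))
            continue
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results.append(CommandResult(cmd, proc.returncode))
    ran_anything = any(not r.skipped for r in results)
    return SandboxResult(results=results, validation_skipped=not ran_anything)
```

A missing ruff then shows up as a skipped entry with exit 127 while pytest still runs, and `validation_skipped` flips to `True` only when every command was skipped.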
Files to modify
- `src/sandbox/e2b_runner.py` — Sequential execution, `run_script()` method, `validation_skipped` flag
- `src/sandbox/validation_commands.py` — Strategy enum, selection logic
- `src/orchestrator.py` — `sandbox_validate_node` uses new strategy
- `tests/` — New tests for each strategy
Context
- Original task: `task-3ee2b4c7` (hello_world.py via dashboard)
- 6,896 tokens used, $0.08 cost, 3/3 retries exhausted
- Related: #97 (API runner bypasses Langfuse tracing — dashboard tasks are un-traced)