
fix(sandbox): validation strategy dispatch, sequential execution, run_script (#95)#98

Merged
Abernaughty merged 12 commits into main from
fix/sandbox-validation-strategy
Apr 2, 2026

Conversation

@Abernaughty (Owner) commented Apr 2, 2026

Summary

Fixes #95 — sandbox validation now dispatches on a ValidationStrategy enum instead of blindly running ruff + pytest for all Python files.

Root Cause (from investigation)

The original issue was reported as an E2B timeout, but reproduction tests and Langfuse trace analysis revealed three structural bugs:

  1. ruff not installed on code-interpreter-v1 — every sandbox run failed with ruff: not found (exit 127)
  2. && chain masked failures — ruff failure prevented pytest from running, but the Python wrapper exited 0, so SandboxResult.exit_code=0
  3. No "just run the file" path — hello_world.py with no test suite got ruff + pytest tests/, which validated nothing

The timeout was not reproducible (sandbox creation + execution completes in <1s).

Changes

validation_commands.py

  • ValidationStrategy enum: TEST_SUITE, SCRIPT_EXEC, LINT_ONLY, SKIP
  • Smart strategy selection: detects test files via TEST_INDICATORS set
    • Python with tests → TEST_SUITE (ruff + pytest)
    • Python without tests → SCRIPT_EXEC (run the script, check exit code)
    • Frontend → TEST_SUITE (pnpm check + tsc)
    • Non-code → SKIP
  • Ruff guard: `which ruff` check before running lint — skips with [WARN] if unavailable
  • script_file field on ValidationPlan for SCRIPT_EXEC targets

e2b_runner.py

  • validation_skipped and warnings fields on SandboxResult
  • run_tests() refactored: sequential command execution (each subprocess.run() independent). Missing ruff no longer blocks pytest. Parses __RESULTS__ JSON for per-command exit codes.
  • run_script(): new method for SCRIPT_EXEC — writes files, runs `python <file>`, reports 1 passed/1 failed
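A minimal sketch of the sequential pattern described above: each command runs in its own `subprocess.run()` call, and a `__RESULTS__` JSON trailer carries per-command exit codes back to the host. The function name and dict shape are assumptions, not the exact wrapper code the runner generates:

```python
import json
import subprocess

def run_commands_sequentially(commands: list[str], timeout: int = 120) -> list[dict]:
    """Run each command independently (no && chain) and emit a results trailer."""
    results = []
    for cmd in commands:
        try:
            r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
            results.append({"cmd": cmd, "rc": r.returncode, "out": r.stdout, "err": r.stderr})
        except subprocess.TimeoutExpired:
            results.append({"cmd": cmd, "rc": -1, "out": "", "err": "TIMEOUT"})
    # Trailer the host-side parser reads to recover per-command exit codes.
    print("__RESULTS__" + json.dumps([{"cmd": r["cmd"], "rc": r["rc"]} for r in results]))
    return results
```

Because each command is independent, a missing tool (exit 127) no longer prevents later commands from running.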

orchestrator.py

  • _run_sandbox_tests() replaces _run_sandbox_validation() — passes commands list (not compound string)
  • _run_sandbox_script() — new helper for SCRIPT_EXEC dispatch
  • sandbox_validate_node dispatches on plan.strategy:
    • SKIP → returns SandboxResult(validation_skipped=True)
    • SCRIPT_EXEC → calls _run_sandbox_script()
    • TEST_SUITE → calls _run_sandbox_tests()
  • Warnings from sandbox surfaced in trace log
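The dispatch above can be sketched like this; the runner callables and return shapes are placeholders for `_run_sandbox_script()` / `_run_sandbox_tests()`, not the node's real signature:

```python
from enum import Enum, auto

class ValidationStrategy(Enum):
    TEST_SUITE = auto()
    SCRIPT_EXEC = auto()
    LINT_ONLY = auto()
    SKIP = auto()

def dispatch(strategy, script_file=None, commands=(), run_script=None, run_tests=None):
    """Illustrative dispatch; parameter names and return shapes are assumptions."""
    if strategy is ValidationStrategy.SKIP:
        # No sandbox run at all; callers see an explicit skip marker.
        return {"validation_skipped": True, "exit_code": 0}
    if strategy is ValidationStrategy.SCRIPT_EXEC:
        return run_script(script_file)
    # TEST_SUITE and LINT_ONLY both execute their command lists sequentially.
    return run_tests(list(commands))
```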

Tests

  • test_validation_commands.py: 24 tests covering strategy selection for all file type combinations
  • test_sandbox_validate.py: 15 tests covering strategy dispatch, validation_skipped flag, warnings in trace, graph wiring

Related

Summary by CodeRabbit

  • New Features

    • Strategy-based sandbox validation: SKIP, SCRIPT_EXEC (single-file run), and TEST_SUITE (sequential commands); single-script execution added and test commands run sequentially with per-command results and warnings.
  • Bug Fixes

    • Validation now explicitly skips when inputs are missing or parsed files are empty; skip results and trace entries replace ambiguous/null responses.
  • Tests

    • Expanded coverage for strategy dispatch, script vs test execution, skip handling, warning propagation, and trace entries.


- Add ValidationStrategy enum (TEST_SUITE, SCRIPT_EXEC, LINT_ONLY, SKIP)
- Add strategy, script_file fields to ValidationPlan
- Detect test files via TEST_INDICATORS to select strategy
- SCRIPT_EXEC for single scripts without test suites
- Ruff guard: `which ruff` check before running lint
- Split PYTHON_COMMANDS into PYTHON_LINT_COMMANDS + PYTHON_TEST_COMMANDS
- Updated tests for new strategy selection logic
- Updated sandbox_validate_node tests for strategy dispatch
… warnings (#95)

- SandboxResult: add validation_skipped and warnings fields
- run_tests(): sequential command execution (drop && chain)
  Each command runs independently, aggregated results.
  Missing tools (ruff) no longer block subsequent commands (pytest).
- run_script(): new method for SCRIPT_EXEC strategy
  Executes `python <file>`, checks exit code + stdout.
  Reports as 1 passed/1 failed test for QA context.
- Parse __RESULTS__ JSON for per-command exit codes
- Detect rc=127 (command not found) and surface as warnings
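The host-side parsing this commit describes might look roughly like the following. The `__RESULTS__` marker matches the PR; the function name, return shape, and the choice to let rc 127 both warn and fail follow the later review discussion and are otherwise assumptions:

```python
import json

def aggregate_results(raw_output: str) -> tuple[int, list[str]]:
    """Recover per-command exit codes from the __RESULTS__ trailer."""
    marker = "__RESULTS__"
    idx = raw_output.rfind(marker)
    if idx == -1:
        return 1, ["__RESULTS__ trailer missing from sandbox output"]
    warnings = []
    aggregate = 0
    for entry in json.loads(raw_output[idx + len(marker):]):
        rc = entry["rc"]
        if rc == 127:
            # Surface missing tools explicitly instead of failing silently.
            warnings.append(f"command not found: {entry['cmd']}")
        if rc != 0 and aggregate == 0:
            aggregate = rc  # first non-zero child exit code wins
    return aggregate, warnings
```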
coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a ValidationStrategy-driven sandbox validation flow (SKIP, SCRIPT_EXEC, LINT_ONLY, TEST_SUITE); orchestrator dispatches to _run_sandbox_script or _run_sandbox_tests; E2B runner now accepts command lists, runs commands sequentially, adds run_script, and includes validation_skipped and warnings in results.

Changes

  • Orchestrator & dispatch — dev-suite/src/orchestrator.py: refactored sandbox_validate_node to dispatch by ValidationStrategy; added guards to skip when parsed_files is empty; calls _run_sandbox_script or _run_sandbox_tests; appends per-strategy trace entries and surfaces result.warnings.
  • Runner: tests & script execution — dev-suite/src/sandbox/e2b_runner.py: SandboxResult gains validation_skipped: bool and warnings: list[str]; run_tests now accepts commands: list[str] and executes them sequentially, aggregating outputs/exit codes and deriving warnings; added run_script(...) to execute a single Python script and map its exit/output to result fields.
  • Validation planning & formatting — dev-suite/src/sandbox/validation_commands.py: introduced the ValidationStrategy enum and ValidationPlan.strategy/script_file; get_validation_plan() now selects TEST_SUITE, SCRIPT_EXEC, LINT_ONLY, or SKIP; format_validation_summary() renders per-strategy summaries (includes python <script_file> for scripts).
  • Tests: updated expectations & coverage — dev-suite/tests/test_sandbox_validate.py, dev-suite/tests/test_validation_commands.py: expanded tests to assert strategy selection and dispatch, verify skipped behavior when parsed_files is empty, confirm warnings propagate to trace, and validate format_validation_summary() output for all strategies.

Sequence Diagram(s)

sequenceDiagram
    participant Orchestrator as Orchestrator
    participant ValidationCommands as ValidationCommands
    participant E2BRunner as E2BRunner
    participant Sandbox as Sandbox

    Orchestrator->>ValidationCommands: get_validation_plan(target_files)
    ValidationCommands-->>Orchestrator: plan { strategy, commands?, script_file? }

    alt plan.strategy == SCRIPT_EXEC
        Orchestrator->>E2BRunner: run_script(script_file, project_files, timeout)
        E2BRunner->>Sandbox: write files + exec ["python", script_file]
        Sandbox-->>E2BRunner: exit_code + output
        E2BRunner-->>Orchestrator: SandboxResult(exit_code, tests_passed/failed, warnings)
    else plan.strategy == TEST_SUITE
        Orchestrator->>E2BRunner: run_tests(commands=[...], template?)
        loop For each command
            E2BRunner->>Sandbox: exec command (shell)
            Sandbox-->>E2BRunner: rc + output
        end
        E2BRunner-->>Orchestrator: SandboxResult(exit_code, warnings)
    else plan.strategy == SKIP
        Orchestrator-->>Orchestrator: produce SandboxResult(validation_skipped=true, exit_code=0, output_summary)
    end

    Orchestrator->>Orchestrator: append warnings to trace & store result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰
I hopped through code with whiskers bright,
Split scripts and suites to set things right.
No chained ruff to hush pytest's song,
Warnings now chirp when commands go wrong.
A nimble hop — validations hum along.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 58.90%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title 'fix(sandbox): validation strategy dispatch, sequential execution, run_script (#95)' directly and clearly summarizes the main changes: strategy-based dispatch, sequential execution, and the new run_script method.
  • Linked Issues check — ✅ Passed: the PR addresses all coding requirements from issue #95: sequential execution (removed && chaining), the ValidationStrategy enum with selection logic, the run_script() method, the validation_skipped flag, and test coverage for all strategies.
  • Out of Scope Changes check — ✅ Passed: all changes align with issue #95 objectives; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dev-suite/src/orchestrator.py`:
- Around line 599-601: The sandbox validation branch changed contract by
returning SandboxResult(validation_skipped=True) for SKIP/missing-key paths but
callers (qa_node and dev-suite/src/api/runner.py around runner.py:298-318) still
expect None for "no run", so revert those branches in sandbox_validate (the
block handling plan.strategy == ValidationStrategy.SKIP and the similar block at
lines ~630-633) to return None instead of a SandboxResult; keep
SandboxResult(...) only for actual runs so existing callers that check
"sandbox_result is not None" / "exit_code == 0" continue to behave correctly.

In `@dev-suite/src/sandbox/e2b_runner.py`:
- Around line 344-365: run_tests() currently uses the wrapper process exit code
instead of aggregating child command exit codes embedded in the "__RESULTS__"
JSON trailer, and _parse_test_output() reads the merged log without extracting
that trailer; update the execution and parsing flow so self.run() preserves the
full stdout (including the "__RESULTS__..."), then modify _parse_test_output()
to locate and json.parse the "__RESULTS__" trailer produced by run_code, compute
an overall exit_code (e.g., fail if any per-command rc != 0, or pick first
non-zero) and set result.exit_code to that aggregate before returning; reference
symbols: run_tests(), self.run(), _parse_test_output(), output_summary, and the
"__RESULTS__" trailer to guide where to change behavior.
- Around line 370-376: The outer sandbox timeout passed to self.run (currently
timeout + 30) is too small for sequential child commands; change it to account
for the number of sequential commands so the outer timeout >= sum of per-command
timeouts plus slack. Compute the total expected timeout (for example
total_timeout = timeout * max(1, num_commands) + 30 or derive num_commands from
full_code/plan) and pass total_timeout instead of timeout + 30 to self.run
(retain profile=SandboxProfile.LOCKED and the same env_vars/template
parameters).

In `@dev-suite/src/sandbox/validation_commands.py`:
- Around line 90-121: The tests detection currently uses substring checks
(indicator in lower) in _has_test_files and _get_primary_script which falsely
flag names like latest_utils.py; update both functions to match test indicators
on filename/path boundaries instead—e.g., evaluate against the lowercase
basename and directory components, using startswith for prefixes like "test_"
and exact name matches like "tests" or suffix checks like "_test" (or use a
word-boundary regex) against TEST_INDICATORS so only true test files/dirs are
detected while leaving normal modules (e.g., latest_utils.py or
contest_helper.py) unaffected.
- Around line 75-77: The pytest invocation in PYTHON_TEST_COMMANDS is hardcoded
to "tests/" which mismatches _has_test_files() detection and can miss or fail to
run discovered tests; update the PYTHON_TEST_COMMANDS entry to remove the
explicit "tests/" path so pytest will auto-discover tests project-wide (e.g.,
use "python -m pytest -v --tb=short"), ensuring PYTHON_TEST_COMMANDS is the
updated string and leaving _has_test_files() logic unchanged.
- Around line 71-73: The current PYTHON_LINT_COMMANDS entry uses the shell chain
"which ruff && ruff check ... || echo ..." which masks ruff failures by turning
non-zero ruff exits into zero via the trailing ||; replace that single-line
chain with a shell if-then-else construct so ruff's exit code is preserved: test
for ruff with "if which ruff >/dev/null 2>&1; then ruff check . --select=E,F,W
--no-fix; else echo '[WARN] ruff not available -- lint skipped'; fi". Update the
PYTHON_LINT_COMMANDS list to use this if/then/else command string so validation
failures from ruff propagate.
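The difference between the two forms can be demonstrated with a small harness. `guarded_lint_command` is a hypothetical helper (the PR inlines the string), and in the checks below `false` stands in for a linter that exists but exits non-zero:

```python
import subprocess

def guarded_lint_command(tool: str, lint_cmd: str) -> str:
    """Build the if/then/else guard.

    Unlike `which TOOL && LINT || echo WARN`, this keeps the lint command's
    exit code when the tool exists but fails.
    """
    return (
        f"if command -v {tool} >/dev/null 2>&1; "
        f"then {lint_cmd}; "
        f"else echo '[WARN] {tool} not available -- lint skipped'; fi"
    )

def run(cmd: str) -> subprocess.CompletedProcess:
    # Run through the shell, capturing output and the real exit status.
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)
```

On a typical POSIX shell, `which false && false || echo masked` exits 0 even though the middle command failed, while the guarded form returns the failing command's own status.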

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 398d9b69-e125-411d-890a-7ef79c230846

📥 Commits

Reviewing files that changed from the base of the PR and between 80e0afb and 413745c.

📒 Files selected for processing (5)
  • dev-suite/src/orchestrator.py
  • dev-suite/src/sandbox/e2b_runner.py
  • dev-suite/src/sandbox/validation_commands.py
  • dev-suite/tests/test_sandbox_validate.py
  • dev-suite/tests/test_validation_commands.py

CR #1 (Critical): Aggregate child exit codes from __RESULTS__ trailer
  - run_tests() now computes aggregate_exit_code from per-command rc values
  - Non-zero child exits properly propagate instead of being masked by wrapper

CR #2 (Critical): Scale outer timeout for sequential commands
  - overall_timeout = max(len(commands), 1) * timeout + 30

CR #3 (Major): Will be fixed in orchestrator.py (pushed locally)
  - SKIP/missing-key paths revert to returning None

CR #4 (Major): Fix ruff guard to preserve exit code
  - Replaced A && B || C with if/then/else construct
  - command -v ruff instead of which ruff

CR #5 (Major): Drop hardcoded tests/ from pytest
  - python -m pytest -v --tb=short (auto-discover)

CR #6 (Major): Boundary-based test file detection
  - _is_test_file() matches on basename boundaries
  - Avoids false positives like latest_utils.py, contest_helper.py
  - Added 6 regression tests for boundary matching

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
dev-suite/tests/test_validation_commands.py (1)

184-188: Test name is misleading—doesn't actually test script_file preference.

The docstring claims to test that _get_primary_script prefers non-test files, but since test_main.py is detected as a test file, the strategy becomes TEST_SUITE (line 188). This test never exercises the script_file selection logic.

To test _get_primary_script's preference behavior, you'd need a SCRIPT_EXEC scenario with multiple candidate files (none of which are tests).

Suggested rename and additional test
-    def test_script_file_prefers_non_test(self):
-        """_get_primary_script should prefer non-test files."""
+    def test_test_file_presence_triggers_test_suite(self):
+        """Presence of test file alongside main file triggers TEST_SUITE."""
         files = ["src/main.py", "test_main.py"]
         plan = get_validation_plan(files)
         assert plan.strategy == ValidationStrategy.TEST_SUITE
+
+    def test_script_file_selects_first_py_file(self):
+        """When multiple non-test .py files exist, script_file is the first one."""
+        files = ["src/secondary.py", "src/main.py"]
+        plan = get_validation_plan(files)
+        assert plan.strategy == ValidationStrategy.SCRIPT_EXEC
+        assert plan.script_file == "src/secondary.py"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_validation_commands.py` around lines 184 - 188, The test
test_script_file_prefers_non_test currently asserts
ValidationStrategy.TEST_SUITE and never exercises _get_primary_script's
preference logic; update the test to create a SCRIPT_EXEC scenario (e.g., use
multiple non-test candidate files like "src/main.py" and "src/other.py" and call
get_validation_plan) and assert the returned plan.strategy is
ValidationStrategy.SCRIPT_EXEC and that the chosen primary script (via whatever
field on the plan references the primary script) is the non-test preferred file;
alternatively, rename the test to reflect it actually verifies test-suite
detection and add a new test that calls get_validation_plan with multiple
non-test files to verify _get_primary_script prefers the correct script.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dev-suite/tests/test_validation_commands.py`:
- Around line 66-69: The _get_primary_script() logic currently uses
PYTHON_EXTENSIONS = {".py", ".pyi"} which incorrectly treats .pyi stubs as
executable; change the selection to only consider code extensions (e.g., replace
or split PYTHON_EXTENSIONS into CODE_EXTENSIONS = {".py"} and STUB_EXTENSIONS =
{".pyi"} or simply make _get_primary_script() use {".py"}), update the docstring
to reflect it returns only .py files, and ensure any callers that used
PYTHON_EXTENSIONS for execution (e.g., code that sets
ValidationStrategy.SCRIPT_EXEC) now use the code-only set so .pyi files are not
chosen as primary scripts.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a118dde8-4783-42d4-af3e-a0a4ae0df524

📥 Commits

Reviewing files that changed from the base of the PR and between 413745c and 673c1cf.

📒 Files selected for processing (2)
  • dev-suite/src/sandbox/validation_commands.py
  • dev-suite/tests/test_validation_commands.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dev-suite/src/sandbox/validation_commands.py

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dev-suite/src/orchestrator.py`:
- Around line 612-628: The dispatch in sandbox_validate misses handling
ValidationStrategy.LINT_ONLY so lint-only plans fall through to the skip branch;
add an explicit branch for plan.strategy == ValidationStrategy.LINT_ONLY that
treats it like TEST_SUITE by calling _run_sandbox_tests with plan.commands,
template, and parsed_files (same parameters as the TEST_SUITE branch) and append
an appropriate trace message (e.g., "sandbox_validate: running lint-only
{len(plan.commands)} commands"); ensure the existing TEST_SUITE branch remains
unchanged and that the LINT_ONLY branch returns the same result shape as other
strategies.
- Around line 603-625: The code dispatches SCRIPT_EXEC/TEST_SUITE even when
parsed_files is empty, causing spurious "file not found" / "no tests collected"
errors; change the sandbox_validate logic to detect an empty parsed_files (from
state.get("parsed_files", [])) and skip sandbox validation for both
ValidationStrategy.SCRIPT_EXEC and ValidationStrategy.TEST_SUITE by returning or
setting result to None and appending a trace message like "sandbox_validate:
skipping sandbox run - no materialized files"; specifically guard before calling
_run_sandbox_script and _run_sandbox_tests so neither function is invoked when
parsed_files is falsy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef363ee1-c4b5-47ca-b32a-76438b1154a9

📥 Commits

Reviewing files that changed from the base of the PR and between 673c1cf and adefdc0.

📒 Files selected for processing (2)
  • dev-suite/src/orchestrator.py
  • dev-suite/tests/test_sandbox_validate.py

…, .pyi exclusion

CR fixes addressed:
- e2b_runner.py: aggregate exit_code from __RESULTS__ trailer instead of
  wrapper process exit code (Critical — child cmd failures were silent)
- e2b_runner.py: scale outer sandbox timeout to num_commands * timeout + 30
  (Critical — sequential commands could be killed prematurely)
- validation_commands.py: exclude .pyi stubs from SCRIPT_EXEC selection via
  EXECUTABLE_PYTHON_EXTENSIONS; .pyi-only files route to LINT_ONLY
- test_validation_commands.py: update .pyi test for LINT_ONLY, rename
  misleading test, add script_file selection coverage, add .pyi+.py coexist test
…patch

CR fixes addressed:
- orchestrator.py: guard against empty parsed_files before dispatching
  SCRIPT_EXEC/TEST_SUITE/LINT_ONLY to sandbox (Major — spurious errors)
- orchestrator.py: fold LINT_ONLY into TEST_SUITE branch so lint-only
  plans execute their commands instead of falling through to skip (Minor)
- _make_state now populates parsed_files from target_files by default
  so sandbox dispatch reaches mocked runners
- Tests needing empty parsed_files pass it explicitly
- Added test_empty_parsed_files_skips_sandbox (CR fix #4 coverage)
- Added test_lint_only_for_pyi_stubs (CR fix #5 coverage)
coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
dev-suite/src/sandbox/e2b_runner.py (1)

374-380: ⚠️ Potential issue | 🔴 Critical

Parse the status trailer before self.run() truncates stdout.

Both paths recover the real exit status from markers appended at the end of child output, but self.run() trims output_summary before either parser runs. A noisy ruff/pytest run or chatty script can drop __RESULTS__ / __EXIT_CODE__, leaving aggregate_exit_code or script_exit at 0 and reporting a false pass. Preserve the untruncated summary for internal parsing and only truncate after extracting these markers.

Also applies to: 395-406, 461-467, 471-475

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/sandbox/e2b_runner.py` around lines 374 - 380, The current flow
calls self.run(...) which returns an output_summary that is immediately
truncated before parsing for embedded markers like "__RESULTS__" and
"__EXIT_CODE__", causing loss of trailing markers; change the logic in functions
calling run (the blocks around the result = self.run(...) calls and the later
parsing that computes aggregate_exit_code and script_exit) to first keep the
full untruncated output_summary (e.g., store it as raw_output or full_summary),
parse that raw text to extract the status trailer and compute
aggregate_exit_code and script_exit from the "__RESULTS__"/"__EXIT_CODE__"
markers, and only after successful extraction perform the truncation used for
UI/display; apply the same change to the other occurrences referenced (the
blocks around lines handling aggregate_exit_code and script_exit at the other
call sites).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dev-suite/src/sandbox/e2b_runner.py`:
- Around line 352-362: The per-command TimeoutExpired handling currently sets
rc=-1 and uses err="TIMEOUT" but doesn't expose a timed_out flag and the summary
drops the timeout marker; update the subprocess TimeoutExpired except block (the
subprocess.run call and results list entries) to append a result object that
includes a dedicated timed_out: True field (e.g. {"cmd": cmd, "rc": -1, "out":
"", "err": "TIMEOUT", "timed_out": True}) so timeouts are surfaced as explicit
booleans, and adjust any downstream summary/printing logic that reads results
(the r["err"]/r["rc"] checks and the overall sandbox timed_out calculation) to
respect per-command timed_out; apply the identical change to the other
occurrence mentioned (lines 417-419).
- Around line 400-405: The loop over cmd_results currently turns rc==127 into
only a warning (warnings.append) which masks missing runtimes/tools; change the
logic in the for cr in cmd_results loop (variables: rc, cmd_results, warnings,
aggregate_exit_code) so that a 127 exit code is treated as a non-zero failure:
either remove the special-case branch for rc == 127 so it falls through to the
existing elif rc != 0 branch (which sets aggregate_exit_code), or if you keep a
warning for 127 also ensure you set aggregate_exit_code = 127 when
aggregate_exit_code == 0; preserve the existing behavior that only the first
non-zero rc updates aggregate_exit_code.

In `@dev-suite/src/sandbox/validation_commands.py`:
- Around line 240-248: get_validation_plan can return
ValidationStrategy.LINT_ONLY for .pyi-only inputs but the orchestrator currently
only handles SCRIPT_EXEC and TEST_SUITE, so LINT_ONLY plans are never executed;
update the orchestrator's dispatch/run_validation_plan logic to handle
ValidationStrategy.LINT_ONLY by invoking the same lint execution path used for
other strategies (use the ValidationPlan.commands list and template from the
returned ValidationPlan) so lint-only validation runs for .pyi inputs; reference
get_validation_plan, ValidationPlan, and ValidationStrategy.LINT_ONLY when
locating the related code to implement this branch.
- Around line 200-209: The current has_frontend and has_python branch returns a
ValidationPlan with ValidationStrategy.TEST_SUITE and FULLSTACK_COMMANDS
unconditionally, which reintroduces running pytest even when there are no test
files; update the branch in the function that builds the ValidationPlan to call
the existing _has_test_files(target_files) (or equivalent) and only include the
pytest invocation in the commands if tests are present, otherwise either (a)
remove/replace the pytest entry from FULLSTACK_COMMANDS for this case or (b)
select a non-test suite command set (e.g., a lint/build-only plan); adjust the
logic around ValidationStrategy.TEST_SUITE and the commands list so python -m
pytest is not run when _has_test_files() returns False.
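One way to implement the suggestion, including pytest in the full-stack plan only when tests exist, using hypothetical command names (the real FULLSTACK_COMMANDS differs):

```python
# Hypothetical command sets for illustration.
LINT_BUILD_COMMANDS = ["ruff check .", "pnpm check"]
PYTEST_COMMAND = "python -m pytest -v --tb=short"

def build_fullstack_commands(has_test_files: bool) -> list[str]:
    """Append pytest only when test files were actually detected."""
    commands = list(LINT_BUILD_COMMANDS)
    if has_test_files:
        commands.append(PYTEST_COMMAND)
    return commands
```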


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 299e6ca6-696c-47f9-bac0-43c1dfa2d6f6

📥 Commits

Reviewing files that changed from the base of the PR and between adefdc0 and a322ef1.

📒 Files selected for processing (3)
  • dev-suite/src/sandbox/e2b_runner.py
  • dev-suite/src/sandbox/validation_commands.py
  • dev-suite/tests/test_validation_commands.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dev-suite/tests/test_validation_commands.py

Comment on lines +352 to +362
r = subprocess.run(
cmd, shell=True, capture_output=True, text=True, timeout={timeout},
)
results.append({{"cmd": cmd, "rc": r.returncode, "out": r.stdout, "err": r.stderr}})
except subprocess.TimeoutExpired:
results.append({{"cmd": cmd, "rc": -1, "out": "", "err": "TIMEOUT"}})

for r in results:
print(r["out"])
if r["err"] and r["err"] != "TIMEOUT":
print(r["err"])

⚠️ Potential issue | 🟠 Major

Surface per-command timeouts as actual timeouts.

A child timeout is reduced to rc = -1, its "TIMEOUT" marker is dropped from the summary, and the returned model still copies timed_out only from the outer sandbox call. That makes command timeouts indistinguishable from generic failures for the orchestrator and trace logs.

Also applies to: 417-419
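A sketch of the dedicated flag the comment asks for; the field and function names are assumptions:

```python
def record_timeout(cmd: str) -> dict:
    """Per-command timeout entry with an explicit timed_out flag.

    Hypothetical helper; the PR's wrapper builds this dict inline.
    """
    return {"cmd": cmd, "rc": -1, "out": "", "err": "TIMEOUT", "timed_out": True}

def any_command_timed_out(results: list[dict]) -> bool:
    # Downstream summaries can distinguish timeouts from generic failures.
    return any(r.get("timed_out", False) for r in results)
```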

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/sandbox/e2b_runner.py` around lines 352 - 362, The per-command
TimeoutExpired handling currently sets rc=-1 and uses err="TIMEOUT" but doesn't
expose a timed_out flag and the summary drops the timeout marker; update the
subprocess TimeoutExpired except block (the subprocess.run call and results list
entries) to append a result object that includes a dedicated timed_out: True
field (e.g. {"cmd": cmd, "rc": -1, "out": "", "err": "TIMEOUT", "timed_out":
True}) so timeouts are surfaced as explicit booleans, and adjust any downstream
summary/printing logic that reads results (the r["err"]/r["rc"] checks and the
overall sandbox timed_out calculation) to respect per-command timed_out; apply
the identical change to the other occurrence mentioned (lines 417-419).
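The suggested fix can be sketched minimally; the helper names and result-dict layout below are assumptions for illustration, not the actual `e2b_runner.py` code:

```python
import subprocess

def run_command(cmd: str, timeout: float) -> dict:
    """Run one validation command; surface a timeout as an explicit flag."""
    try:
        r = subprocess.run(cmd, shell=True, capture_output=True, text=True,
                           timeout=timeout)
        return {"cmd": cmd, "rc": r.returncode, "out": r.stdout,
                "err": r.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        # rc=-1 is kept for backward compatibility; timed_out is the
        # authoritative signal for the orchestrator and trace logs.
        return {"cmd": cmd, "rc": -1, "out": "", "err": "TIMEOUT",
                "timed_out": True}

def any_timed_out(results: list[dict]) -> bool:
    """Aggregate flag a SandboxResult-style model could copy."""
    return any(r["timed_out"] for r in results)
```

With this shape, the outer model's `timed_out` can become `outer_timeout or any_timed_out(results)` rather than copying only the sandbox-level flag.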

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
dev-suite/src/orchestrator.py (1)

626-642: Redundant conditional: parsed_files is guaranteed truthy here.

The guard at lines 608-615 already returns early when parsed_files is empty for these strategies. The parsed_files if parsed_files else None expressions at lines 629 and 641 can be simplified to just parsed_files.

♻️ Simplify the redundant conditionals
         if plan.strategy == ValidationStrategy.SCRIPT_EXEC:
             trace.append(f"sandbox_validate: executing script {plan.script_file}")
             result = _run_sandbox_script(
                 script_file=plan.script_file,
                 template=plan.template,
-                parsed_files=parsed_files if parsed_files else None,
+                parsed_files=parsed_files,
             )
         # CR fix: fold LINT_ONLY into TEST_SUITE -- both run commands
         # sequentially via _run_sandbox_tests(). LINT_ONLY was falling
         # through to the skip branch instead of executing plan.commands.
         elif plan.strategy in {ValidationStrategy.TEST_SUITE, ValidationStrategy.LINT_ONLY}:
             trace.append(
                 f"sandbox_validate: running {len(plan.commands)} command(s) sequentially"
             )
             result = _run_sandbox_tests(
                 commands=plan.commands,
                 template=plan.template,
-                parsed_files=parsed_files if parsed_files else None,
+                parsed_files=parsed_files,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/orchestrator.py` around lines 626-642: the conditional
ternaries passing parsed_files are redundant because parsed_files is guaranteed
non-empty earlier; update the calls to _run_sandbox_script(...) and
_run_sandbox_tests(...) to pass parsed_files directly (replace parsed_files if
parsed_files else None with parsed_files) for the branches handling plan and
ValidationStrategy.TEST_SUITE / ValidationStrategy.LINT_ONLY; ensure you update
both call sites that reference plan.template, plan.script_file, and
plan.commands while leaving all other logic unchanged.
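For context, the `ValidationStrategy` values dispatched above come from `validation_commands.py`. A hedged sketch of the selection logic the PR summary describes; the `TEST_INDICATORS` contents, file extensions, and helper names are assumptions, not the actual module:

```python
from enum import Enum
from pathlib import PurePosixPath

class ValidationStrategy(Enum):
    TEST_SUITE = "test_suite"
    SCRIPT_EXEC = "script_exec"
    LINT_ONLY = "lint_only"
    SKIP = "skip"

# Filename fragments that mark a file as part of a test suite (assumed set).
TEST_INDICATORS = ("test_", "_test", "conftest")

def _has_test_files(files: list[str]) -> bool:
    """True if any filename contains a test indicator fragment."""
    return any(ind in PurePosixPath(f).name
               for f in files for ind in TEST_INDICATORS)

def select_strategy(files: list[str]) -> ValidationStrategy:
    """Pick a validation strategy from the set of parsed files."""
    python = [f for f in files if f.endswith(".py")]
    frontend = [f for f in files if f.endswith((".ts", ".tsx"))]
    if frontend:
        return ValidationStrategy.TEST_SUITE       # pnpm check + tsc
    if python:
        return (ValidationStrategy.TEST_SUITE      # ruff + pytest
                if _has_test_files(python)
                else ValidationStrategy.SCRIPT_EXEC)
    return ValidationStrategy.SKIP                 # non-code files
```

Under this sketch, a bare `hello_world.py` lands on SCRIPT_EXEC (run the file, check its exit code) instead of pointlessly invoking `pytest tests/`.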

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a366c29e-7ece-4e01-98e0-ed4e1f8a26a4

📥 Commits

Reviewing files that changed from the base of the PR and between a322ef1 and b56164b.

📒 Files selected for processing (2)
  • dev-suite/src/orchestrator.py
  • dev-suite/tests/test_sandbox_validate.py

… skips pytest

CR fixes addressed:
- e2b_runner.py: change `elif` to `if` so rc=127 (missing tool) both
  warns AND sets aggregate_exit_code (Critical — missing pnpm/tsc passed)
- validation_commands.py: fullstack branch now checks _has_test_files()
  and uses PYTHON_LINT_COMMANDS when no tests present (Major — "no tests
  collected" failure for mixed frontend+Python without test files)
- test_validation_commands.py: existing mixed tests now include test files,
  added test_mixed_frontend_and_python_no_tests for lint-only path
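The `elif` → `if` fix can be sketched as follows; the function and field names are illustrative, not the actual `e2b_runner.py` API:

```python
WARN_EXIT_MISSING_TOOL = 127  # shell convention for "command not found"

def aggregate(results: list[dict]) -> tuple[int, list[str]]:
    """Fold per-command exit codes into one aggregate code plus warnings.

    Using independent `if` checks (not `if`/`elif`) means a missing tool
    (rc == 127) both emits a warning AND fails the aggregate, instead of
    being swallowed by the warning branch.
    """
    warnings: list[str] = []
    exit_code = 0
    for r in results:
        if r["rc"] == WARN_EXIT_MISSING_TOOL:
            warnings.append(f"[WARN] tool not found: {r['cmd']}")
        if r["rc"] != 0:  # `if`, not `elif`: rc=127 also fails the run
            exit_code = r["rc"]
    return exit_code, warnings
```

With an `elif` in the second check, a missing `pnpm` or `tsc` produced a warning but left the aggregate at 0, which is exactly the "Critical" finding listed above.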
@Abernaughty Abernaughty merged commit 7b87ea4 into main Apr 2, 2026
3 checks passed
@Abernaughty Abernaughty deleted the fix/sandbox-validation-strategy branch April 2, 2026 20:11
Successfully merging this pull request may close these issues.

E2B sandbox timeout on simple hello_world.py validation
