
feat: E2E integration test — full pipeline with real LLMs (#86) (PR #90)

Merged
Abernaughty merged 2 commits into main from feat/86-e2e-integration-test on Apr 1, 2026

Conversation

@Abernaughty (Owner) commented Apr 1, 2026

Summary

Full end-to-end integration test that proves the orchestrator pipeline works with real LLMs, real MCP tools, and real sandbox execution. This is the "moment of truth" test — #79 (code application), #80 (tool binding), #82 (code parser), #83 (tool wiring) all converge here.

What It Tests

| Pipeline Stage | What's Validated |
| --- | --- |
| Architect (Gemini) | Produces a Blueprint with `target_files` + instructions |
| Developer (Claude + Filesystem MCP) | Generates code, uses `filesystem_read`/`filesystem_write` tools |
| apply_code | Parses `# --- FILE:` markers, writes files to workspace disk |
| sandbox_validate (E2B) | Runs pytest against written files in sandbox (conditional on `E2B_API_KEY`) |
| QA (Claude + Filesystem MCP) | Reviews code with read-only filesystem tools, produces failure report |
| flush_memory | Accumulates and persists memory writes |
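The `# --- FILE:` marker convention is the hinge between the Developer stage and apply_code. A minimal sketch of how such markers might be split into per-file contents — the real apply_code grammar may differ (e.g. trailing `---` or whitespace rules), and `split_files` is a hypothetical helper:

```python
import re

# Hypothetical FILE-marker parser; the real apply_code implementation
# may use a different grammar.
FILE_MARKER = re.compile(r"^# --- FILE: (?P<path>\S+)", re.MULTILINE)

def split_files(generated: str) -> dict[str, str]:
    """Split LLM output into {relative_path: content} using FILE markers."""
    matches = list(FILE_MARKER.finditer(generated))
    files: dict[str, str] = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(generated)
        # Strip the surrounding newlines, keep a single trailing newline.
        files[m.group("path")] = generated[start:end].strip("\n") + "\n"
    return files
```

Anything the model emits before the first marker is ignored here; a production parser would likely also validate that paths stay inside the workspace.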

Test Design

Seeded Workspace (tmp_path)

  • utils.py — stub file (agent must add the greet function)
  • test_utils.py — expects greet('World') == 'Hello, World!' and greet('Alice') == 'Hello, Alice!'
  • mcp-config.json — points Filesystem MCP at the temp workspace (if npx available)
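The seeded files might look like this (file names come from the PR; the exact stub contents are assumptions):

```python
from pathlib import Path

def seed_workspace(tmp_path: Path) -> None:
    """Seed the temp workspace with a stub and its expected test file."""
    # Stub: the agent must add greet(name) here.
    (tmp_path / "utils.py").write_text(
        "# TODO: add greet(name)\n"
    )
    # Expectations the generated code must satisfy.
    (tmp_path / "test_utils.py").write_text(
        "from utils import greet\n\n"
        "def test_greet():\n"
        "    assert greet('World') == 'Hello, World!'\n"
        "    assert greet('Alice') == 'Hello, Alice!'\n"
    )
```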

Assertions (structural, not content-based)

  • Pipeline doesn't crash, returns AgentState
  • Every graph node appears in the trace
  • Blueprint, generated_code, failure_report all populated
  • tokens_used > 0 (real API calls happened)
  • memory_writes has entries
  • Conditional: If npx available → Dev made filesystem tool calls
  • Conditional: If E2B key present → sandbox_result is not None
  • Conditional: If files parsed → they exist on disk
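The conditional assertions above could be sketched roughly as follows, with hypothetical AgentState attribute names (`tokens_used`, `blueprint`, `memory_writes`, `tool_calls`, `sandbox_result`, `parsed_files`) assumed from this summary — the real state object may differ:

```python
import os
import shutil
from pathlib import Path

def assert_structural(result, workspace: Path) -> None:
    """Structural checks only: the plumbing ran, not what the LLM wrote."""
    assert result.tokens_used > 0, "real API calls should consume tokens"
    assert result.blueprint is not None
    assert result.memory_writes, "memory entries should be queued"
    if shutil.which("npx"):  # Filesystem MCP only runs when npx exists
        assert result.tool_calls, "Developer should have used filesystem tools"
    if os.environ.get("E2B_API_KEY"):  # sandbox only runs with a key
        assert result.sandbox_result is not None
    for rel in result.parsed_files:  # any parsed file must be on disk
        assert (workspace / rel).exists(), f"{rel} missing from workspace"
```

Keeping the checks structural means the test survives LLM output variance: it verifies that each stage produced *something* of the right shape, not any particular wording.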

Gating

  • @pytest.mark.integration — excluded from CI by -m "not integration"
  • Skips if ANTHROPIC_API_KEY or GOOGLE_API_KEY missing
  • Degrades gracefully: no npx → single-shot mode, no E2B key → sandbox skipped
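The key-gating step reduces to a small helper plus pytest's `skipif`; a sketch (the helper name is hypothetical, the pytest wiring is shown in comments):

```python
import os

def missing_keys(env, required=("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")):
    """Return the required API keys absent from the given environment."""
    return [k for k in required if not env.get(k)]

# In the test module this would back the marker and skip, e.g.:
#   pytestmark = pytest.mark.integration   # CI runs -m "not integration"
#   missing = missing_keys(os.environ)
#   @pytest.mark.skipif(bool(missing), reason=f"missing keys: {missing}")
```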

Run Command

```sh
cd dev-suite
uv run --group dev --group api pytest tests/test_e2e_integration.py -v -s -m integration
```

Known Gaps (follow-up issues)

Files

| File | Action | Description |
| --- | --- | --- |
| dev-suite/tests/test_e2e_integration.py | NEW | Full pipeline integration test (2 test methods) |

Estimated Cost

$0.05–$0.50 per run depending on retries

Closes #86

Summary by CodeRabbit

  • Tests
    • Added a comprehensive end-to-end integration test for the orchestrator pipeline using real LLM providers and optional sandbox tooling.
    • Tests seed a temporary workspace, conditionally enable sandbox/trace features, and are gated by required environment configuration.
    • Produce a structured diagnostic report (status, token usage, blueprint, artifacts, tool calls, memory, trace, errors) and assert pipeline outputs, file writes, trace nodes, and terminal workflow status.

Full pipeline integration test proving the orchestrator works end-to-end:
- Architect (Gemini) generates Blueprint
- Developer (Claude) writes code using Filesystem MCP tools
- apply_code parses and writes files to workspace
- sandbox_validate runs tests in E2B (conditional on key)
- QA (Claude) reviews with read-only tools
- flush_memory persists entries

Seeded workspace: utils.py stub + test_utils.py with greet() expectations.
MCP config: Filesystem MCP via npx pointing at temp workspace.

Assertions are structural (pipeline plumbing), not content-based
(LLM output varies). Tool usage, file writes, sandbox results, and
memory writes all verified conditionally based on available keys.

Gated by @pytest.mark.integration — excluded from CI.
Skips gracefully if API keys missing from dev-suite/.env.

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5dfcc4e3-177e-4060-8e85-7647d3508c1d

📥 Commits

Reviewing files that changed from the base of the PR and between 9b8eb27 and a8ca107.

📒 Files selected for processing (1)
  • dev-suite/tests/test_e2e_integration.py

📝 Walkthrough

Adds a new end-to-end pytest integration test that seeds a temporary workspace, configures MCP/environment, runs the orchestrator pipeline against real LLM providers, captures AgentState diagnostics, and asserts on blueprint, generated code, applied files, trace nodes, memory writes, token usage, and terminal workflow status.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| E2E Integration Test Module<br>dev-suite/tests/test_e2e_integration.py | New integration test implementing workspace fixtures, MCP config seeding, environment patching, helper utilities (_has_node_in_trace, _print_result), and two test methods (test_workspace_seeded_correctly, test_full_pipeline) that run run_task() and assert on pipeline outputs and trace nodes. |
| Workspace seeds<br>dev-suite/tests/.../utils.py, dev-suite/tests/.../test_utils.py, dev-suite/tests/.../mcp-config.json | Test seeds minimal project files and an MCP config file in the temporary workspace for sandbox/tooling checks. |
| Test config / env patching<br>dev-suite/tests/... (module-level fixtures and autouse methods) | Adds fixtures and autouse setup to require API keys, patch src.orchestrator globals (TOKEN_BUDGET, MAX_RETRIES, WORKSPACE_ROOT), detect optional E2B and npx availability, and gate assertions accordingly. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant Orch as Orchestrator
    participant Arch as Architect (Gemini)
    participant Dev as Developer (Claude)
    participant CodeApp as Code Application
    participant Sandbox as Sandbox
    participant QA as QA Agent

    Test->>Test: Seed workspace (utils.py, test_utils.py) & mcp-config
    Test->>Test: Configure env (API keys, TOKEN_BUDGET, WORKSPACE_ROOT)
    Test->>Orch: run_task("Add greet(name) function")

    Orch->>Arch: Request blueprint
    Arch-->>Orch: Blueprint (targets, instructions)

    Orch->>Dev: Request implementation (tool access enabled)
    Dev->>Dev: Invoke workspace tools (file ops) if available
    Dev-->>Orch: Generated code, tool call logs

    Orch->>CodeApp: Parse & apply code
    CodeApp->>CodeApp: Write files to disk
    CodeApp-->>Orch: Parsed files list

    alt E2B available
        Orch->>Sandbox: Run sandbox validation
        Sandbox-->>Orch: Sandbox results
    end

    Orch->>QA: Run QA checks
    QA-->>Orch: QA report (failures)

    Orch-->>Test: Return AgentState (tokens, memory writes, trace, status)
    Test->>Test: Assert tokens>0, blueprint exists, files on disk, memory writes, trace nodes, terminal status
    Test->>Test: Print diagnostic report
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped a temp-dir, tidy and neat,

seeded a utils.py and a friendly test suite.
Architects, devs, and sandboxes played,
tracing their steps as tokens were paid.
Hooray — the pipeline danced end-to-end, neat!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: E2E integration test — full pipeline with real LLMs (#86)' clearly describes the main change: adding a new end-to-end integration test for the full orchestrator pipeline. |
| Linked Issues check | ✅ Passed | The PR implementation fully aligns with issue #86 objectives: creates reproducible E2E test with real LLMs, exercises all pipeline stages (Architect/Developer/apply_code/sandbox_validate/QA/flush_memory), validates structural assertions without content-based checks, includes proper markers and graceful degradation for missing keys. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to issue #86: adds single test module with fixtures and test methods to validate the E2E pipeline; no unrelated modifications to core orchestrator logic or other systems. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
dev-suite/tests/test_e2e_integration.py (1)

281-290: The “full graph” assertion still omits flush_memory.

Lines 281-286 stop at qa, but the workflow also registers flush_memory after that. result.memory_writes on Line 290 only shows writes were queued in state; a regression in the persistence hop would still pass this test.

Suggested fix:

```diff
         assert _has_node_in_trace(trace, "apply_code"), "Trace should show apply_code ran"
         assert _has_node_in_trace(trace, "sandbox_validate"), "Trace should show sandbox_validate ran"
         assert _has_node_in_trace(trace, "qa"), "Trace should show qa ran"
+        assert _has_node_in_trace(trace, "flush_memory"), "Trace should show flush_memory ran"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_e2e_integration.py` around lines 281 - 290, The trace
assertions stop at "qa" but omit the final "flush_memory" node—add assert
_has_node_in_trace(trace, "flush_memory") to ensure the full graph ran; also
replace or augment the loose assert len(result.memory_writes) > 0 by verifying
persistence (e.g., assert that entries in result.memory_writes were actually
flushed/persisted by checking a persisted flag on those writes or querying the
persistence layer/result persistence client to confirm the expected records
exist) so a regression in the persistence hop will fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dev-suite/tests/test_e2e_integration.py`:
- Around line 135-149: The test currently can pass without apply_code writing
files because it pre-creates utils.py/test_utils.py and never asserts parsed
output was written; update the test to assert result.parsed_files is non-empty,
iterate result.parsed_files (from apply_code) and for each entry verify its
relative path is inside the temporary workspace (tmp_path), that a file exists
on disk at that workspace path, and that the file content exactly equals the
parsed content returned by apply_code; ensure you stop pre-creating the same
filenames (or use unique names) so the test fails if apply_code does not write,
and add assertions that check on-disk content equality to the parsed string for
each parsed file.
- Line 28: The test patches os.environ too late because src.orchestrator's
module-level globals TOKEN_BUDGET and MAX_RETRIES are evaluated at import via
_safe_int; update the test fixture (_configure_env) to monkeypatch those module
globals directly using monkeypatch.setattr on src.orchestrator.TOKEN_BUDGET and
src.orchestrator.MAX_RETRIES (or re-evaluate them by calling
src.orchestrator._safe_int and assigning the results) so run_task() reads the
patched values; ensure you reference src.orchestrator when calling
monkeypatch.setattr rather than only modifying os.environ.
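The fix the reviewer describes boils down to assigning over the already-evaluated module globals rather than only mutating `os.environ`. A minimal stand-alone sketch, using a stand-in namespace in place of the real `src.orchestrator` module:

```python
import types

# Stand-in for src.orchestrator, whose TOKEN_BUDGET/MAX_RETRIES are
# computed from the environment once, at import time.
orchestrator = types.SimpleNamespace(TOKEN_BUDGET=100_000, MAX_RETRIES=3)

def patch_globals(module, token_budget: int, max_retries: int) -> None:
    """Patch the evaluated globals directly; changing os.environ is too late.

    In the real test this would be monkeypatch.setattr(src.orchestrator, ...),
    which also restores the originals when the test finishes.
    """
    setattr(module, "TOKEN_BUDGET", token_budget)
    setattr(module, "MAX_RETRIES", max_retries)

patch_globals(orchestrator, token_budget=20_000, max_retries=1)
```

With `monkeypatch.setattr` the patched values are automatically undone at teardown, which plain `setattr` does not give you.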

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2afa2492-a86d-4897-aece-ab5d5a17d6d4

📥 Commits

Reviewing files that changed from the base of the PR and between f44b635 and 9b8eb27.

📒 Files selected for processing (1)
  • dev-suite/tests/test_e2e_integration.py

1. Major: Patch TOKEN_BUDGET/MAX_RETRIES module globals directly via
   monkeypatch.setattr (not just os.environ), since they're evaluated
   at import time via _safe_int().

2. Major: Strengthen parsed_files assertions — require non-empty,
   verify paths stay within workspace, verify on-disk content matches
   parsed content exactly.

3. Nitpick: Add flush_memory trace assertion to cover all 6 graph nodes.
@Abernaughty Abernaughty merged commit e34685d into main Apr 1, 2026
3 checks passed
@Abernaughty Abernaughty deleted the feat/86-e2e-integration-test branch April 1, 2026 19:49


Linked issue that may be closed by this PR:

feat: E2E integration test — full pipeline with real LLMs

1 participant