
feat: E2E integration test — full pipeline with real LLMs (#86) (PR #90)

Merged
Abernaughty merged 2 commits into main from feat/86-e2e-integration-test on Apr 1, 2026

Conversation

@Abernaughty (Owner) commented Apr 1, 2026

Summary

Full end-to-end integration test that proves the orchestrator pipeline works with real LLMs, real MCP tools, and real sandbox execution. This is the "moment of truth" test — #79 (code application), #80 (tool binding), #82 (code parser), #83 (tool wiring) all converge here.

What It Tests

| Pipeline Stage | What's Validated |
| --- | --- |
| Architect (Gemini) | Produces a Blueprint with `target_files` + instructions |
| Developer (Claude + Filesystem MCP) | Generates code, uses `filesystem_read`/`filesystem_write` tools |
| apply_code | Parses `# --- FILE:` markers, writes files to workspace disk |
| sandbox_validate (E2B) | Runs pytest against written files in sandbox (conditional on `E2B_API_KEY`) |
| QA (Claude + Filesystem MCP) | Reviews code with read-only filesystem tools, produces failure report |
| flush_memory | Accumulates and persists memory writes |
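The `# --- FILE:` marker convention is the hinge between the Developer stage and apply_code. A minimal sketch of how such markers might be split into per-file contents — the real apply_code grammar may differ (e.g. trailing `---` or whitespace rules), and `split_files` is a hypothetical helper:

```python
import re

# Hypothetical FILE-marker parser; the real apply_code implementation
# may use a different grammar.
FILE_MARKER = re.compile(r"^# --- FILE: (?P<path>\S+)", re.MULTILINE)

def split_files(generated: str) -> dict[str, str]:
    """Split LLM output into {relative_path: content} using FILE markers."""
    matches = list(FILE_MARKER.finditer(generated))
    files: dict[str, str] = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(generated)
        # Strip the surrounding newlines, keep a single trailing newline.
        files[m.group("path")] = generated[start:end].strip("\n") + "\n"
    return files
```

Anything the model emits before the first marker is ignored here; a production parser would likely also validate that paths stay inside the workspace.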

Test Design

Seeded Workspace (tmp_path)

  • utils.py — stub file (agent must add the greet function)
  • test_utils.py — expects greet('World') == 'Hello, World!' and greet('Alice') == 'Hello, Alice!'
  • mcp-config.json — points Filesystem MCP at the temp workspace (if npx available)
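The seeded files might look like this (file names come from the PR; the exact stub contents are assumptions):

```python
from pathlib import Path

def seed_workspace(tmp_path: Path) -> None:
    """Seed the temp workspace with a stub and its expected test file."""
    # Stub: the agent must add greet(name) here.
    (tmp_path / "utils.py").write_text(
        "# TODO: add greet(name)\n"
    )
    # Expectations the generated code must satisfy.
    (tmp_path / "test_utils.py").write_text(
        "from utils import greet\n\n"
        "def test_greet():\n"
        "    assert greet('World') == 'Hello, World!'\n"
        "    assert greet('Alice') == 'Hello, Alice!'\n"
    )
```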

Assertions (structural, not content-based)

  • Pipeline doesn't crash, returns AgentState
  • Every graph node appears in the trace
  • Blueprint, generated_code, failure_report all populated
  • tokens_used > 0 (real API calls happened)
  • memory_writes has entries
  • Conditional: If npx available → Dev made filesystem tool calls
  • Conditional: If E2B key present → sandbox_result is not None
  • Conditional: If files parsed → they exist on disk
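The conditional assertions above could be sketched roughly as follows, with hypothetical AgentState attribute names (`tokens_used`, `blueprint`, `memory_writes`, `tool_calls`, `sandbox_result`, `parsed_files`) assumed from this summary — the real state object may differ:

```python
import os
import shutil
from pathlib import Path

def assert_structural(result, workspace: Path) -> None:
    """Structural checks only: the plumbing ran, not what the LLM wrote."""
    assert result.tokens_used > 0, "real API calls should consume tokens"
    assert result.blueprint is not None
    assert result.memory_writes, "memory entries should be queued"
    if shutil.which("npx"):  # Filesystem MCP only runs when npx exists
        assert result.tool_calls, "Developer should have used filesystem tools"
    if os.environ.get("E2B_API_KEY"):  # sandbox only runs with a key
        assert result.sandbox_result is not None
    for rel in result.parsed_files:  # any parsed file must be on disk
        assert (workspace / rel).exists(), f"{rel} missing from workspace"
```

Keeping the checks structural means the test survives LLM output variance: it verifies that each stage produced *something* of the right shape, not any particular wording.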

Gating

  • @pytest.mark.integration — excluded from CI by -m "not integration"
  • Skips if ANTHROPIC_API_KEY or GOOGLE_API_KEY missing
  • Degrades gracefully: no npx → single-shot mode, no E2B key → sandbox skipped
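The key-gating step reduces to a small helper plus pytest's `skipif`; a sketch (the helper name is hypothetical, the pytest wiring is shown in comments):

```python
import os

def missing_keys(env, required=("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")):
    """Return the required API keys absent from the given environment."""
    return [k for k in required if not env.get(k)]

# In the test module this would back the marker and skip, e.g.:
#   pytestmark = pytest.mark.integration   # CI runs -m "not integration"
#   missing = missing_keys(os.environ)
#   @pytest.mark.skipif(bool(missing), reason=f"missing keys: {missing}")
```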

Run Command

```sh
cd dev-suite
uv run --group dev --group api pytest tests/test_e2e_integration.py -v -s -m integration
```

Known Gaps (follow-up issues)

Files

| File | Action | Description |
| --- | --- | --- |
| dev-suite/tests/test_e2e_integration.py | NEW | Full pipeline integration test (2 test methods) |

Estimated Cost

$0.05–$0.50 per run depending on retries

Closes #86

Summary by CodeRabbit

  • Tests
    • Added a comprehensive end-to-end integration test for the orchestrator pipeline using real LLM providers and optional sandbox tooling.
    • Tests seed a temporary workspace, conditionally enable sandbox/trace features, and are gated by required environment configuration.
    • Produce a structured diagnostic report (status, token usage, blueprint, artifacts, tool calls, memory, trace, errors) and assert pipeline outputs, file writes, trace nodes, and terminal workflow status.

Full pipeline integration test proving the orchestrator works end-to-end:
- Architect (Gemini) generates Blueprint
- Developer (Claude) writes code using Filesystem MCP tools
- apply_code parses and writes files to workspace
- sandbox_validate runs tests in E2B (conditional on key)
- QA (Claude) reviews with read-only tools
- flush_memory persists entries

Seeded workspace: utils.py stub + test_utils.py with greet() expectations.
MCP config: Filesystem MCP via npx pointing at temp workspace.

Assertions are structural (pipeline plumbing), not content-based
(LLM output varies). Tool usage, file writes, sandbox results, and
memory writes all verified conditionally based on available keys.

Gated by @pytest.mark.integration — excluded from CI.
Skips gracefully if API keys missing from dev-suite/.env.

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5dfcc4e3-177e-4060-8e85-7647d3508c1d

📥 Commits

Reviewing files that changed from the base of the PR and between 9b8eb27 and a8ca107.

📒 Files selected for processing (1)
  • dev-suite/tests/test_e2e_integration.py

📝 Walkthrough

Adds a new end-to-end pytest integration test that seeds a temporary workspace, configures MCP/environment, runs the orchestrator pipeline against real LLM providers, captures AgentState diagnostics, and asserts on blueprint, generated code, applied files, trace nodes, memory writes, token usage, and terminal workflow status.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| E2E Integration Test Module<br>dev-suite/tests/test_e2e_integration.py | New integration test implementing workspace fixtures, MCP config seeding, environment patching, helper utilities (_has_node_in_trace, _print_result), and two test methods (test_workspace_seeded_correctly, test_full_pipeline) that run run_task() and assert on pipeline outputs and trace nodes. |
| Workspace seeds<br>dev-suite/tests/.../utils.py, dev-suite/tests/.../test_utils.py, dev-suite/tests/.../mcp-config.json | Test seeds minimal project files and an MCP config file in the temporary workspace for sandbox/tooling checks. |
| Test config / env patching<br>dev-suite/tests/... (module-level fixtures and autouse methods) | Adds fixtures and autouse setup to require API keys, patch src.orchestrator globals (TOKEN_BUDGET, MAX_RETRIES, WORKSPACE_ROOT), detect optional E2B and npx availability, and gate assertions accordingly. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant Orch as Orchestrator
    participant Arch as Architect (Gemini)
    participant Dev as Developer (Claude)
    participant CodeApp as Code Application
    participant Sandbox as Sandbox
    participant QA as QA Agent

    Test->>Test: Seed workspace (utils.py, test_utils.py) & mcp-config
    Test->>Test: Configure env (API keys, TOKEN_BUDGET, WORKSPACE_ROOT)
    Test->>Orch: run_task("Add greet(name) function")

    Orch->>Arch: Request blueprint
    Arch-->>Orch: Blueprint (targets, instructions)

    Orch->>Dev: Request implementation (tool access enabled)
    Dev->>Dev: Invoke workspace tools (file ops) if available
    Dev-->>Orch: Generated code, tool call logs

    Orch->>CodeApp: Parse & apply code
    CodeApp->>CodeApp: Write files to disk
    CodeApp-->>Orch: Parsed files list

    alt E2B available
        Orch->>Sandbox: Run sandbox validation
        Sandbox-->>Orch: Sandbox results
    end

    Orch->>QA: Run QA checks
    QA-->>Orch: QA report (failures)

    Orch-->>Test: Return AgentState (tokens, memory writes, trace, status)
    Test->>Test: Assert tokens>0, blueprint exists, files on disk, memory writes, trace nodes, terminal status
    Test->>Test: Print diagnostic report
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped a temp-dir, tidy and neat,

seeded a utils.py and a friendly test suite.
Architects, devs, and sandboxes played,
tracing their steps as tokens were paid.
Hooray — the pipeline danced end-to-end, neat!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: E2E integration test — full pipeline with real LLMs (#86)' clearly describes the main change: adding a new end-to-end integration test for the full orchestrator pipeline. |
| Linked Issues check | ✅ Passed | The PR implementation fully aligns with issue #86 objectives: creates reproducible E2E test with real LLMs, exercises all pipeline stages (Architect/Developer/apply_code/sandbox_validate/QA/flush_memory), validates structural assertions without content-based checks, includes proper markers and graceful degradation for missing keys. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to issue #86: adds single test module with fixtures and test methods to validate the E2E pipeline; no unrelated modifications to core orchestrator logic or other systems. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
dev-suite/tests/test_e2e_integration.py (1)

281-290: The “full graph” assertion still omits flush_memory.

Lines 281-286 stop at qa, but the workflow also registers flush_memory after that. result.memory_writes on Line 290 only shows writes were queued in state; a regression in the persistence hop would still pass this test.

Suggested fix:

```diff
         assert _has_node_in_trace(trace, "apply_code"), "Trace should show apply_code ran"
         assert _has_node_in_trace(trace, "sandbox_validate"), "Trace should show sandbox_validate ran"
         assert _has_node_in_trace(trace, "qa"), "Trace should show qa ran"
+        assert _has_node_in_trace(trace, "flush_memory"), "Trace should show flush_memory ran"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_e2e_integration.py` around lines 281 - 290, The trace
assertions stop at "qa" but omit the final "flush_memory" node—add assert
_has_node_in_trace(trace, "flush_memory") to ensure the full graph ran; also
replace or augment the loose assert len(result.memory_writes) > 0 by verifying
persistence (e.g., assert that entries in result.memory_writes were actually
flushed/persisted by checking a persisted flag on those writes or querying the
persistence layer/result persistence client to confirm the expected records
exist) so a regression in the persistence hop will fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dev-suite/tests/test_e2e_integration.py`:
- Around line 135-149: The test currently can pass without apply_code writing
files because it pre-creates utils.py/test_utils.py and never asserts parsed
output was written; update the test to assert result.parsed_files is non-empty,
iterate result.parsed_files (from apply_code) and for each entry verify its
relative path is inside the temporary workspace (tmp_path), that a file exists
on disk at that workspace path, and that the file content exactly equals the
parsed content returned by apply_code; ensure you stop pre-creating the same
filenames (or use unique names) so the test fails if apply_code does not write,
and add assertions that check on-disk content equality to the parsed string for
each parsed file.
- Line 28: The test patches os.environ too late because src.orchestrator's
module-level globals TOKEN_BUDGET and MAX_RETRIES are evaluated at import via
_safe_int; update the test fixture (_configure_env) to monkeypatch those module
globals directly using monkeypatch.setattr on src.orchestrator.TOKEN_BUDGET and
src.orchestrator.MAX_RETRIES (or re-evaluate them by calling
src.orchestrator._safe_int and assigning the results) so run_task() reads the
patched values; ensure you reference src.orchestrator when calling
monkeypatch.setattr rather than only modifying os.environ.
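The fix the reviewer describes boils down to assigning over the already-evaluated module globals rather than only mutating `os.environ`. A minimal stand-alone sketch, using a stand-in namespace in place of the real `src.orchestrator` module:

```python
import types

# Stand-in for src.orchestrator, whose TOKEN_BUDGET/MAX_RETRIES are
# computed from the environment once, at import time.
orchestrator = types.SimpleNamespace(TOKEN_BUDGET=100_000, MAX_RETRIES=3)

def patch_globals(module, token_budget: int, max_retries: int) -> None:
    """Patch the evaluated globals directly; changing os.environ is too late.

    In the real test this would be monkeypatch.setattr(src.orchestrator, ...),
    which also restores the originals when the test finishes.
    """
    setattr(module, "TOKEN_BUDGET", token_budget)
    setattr(module, "MAX_RETRIES", max_retries)

patch_globals(orchestrator, token_budget=20_000, max_retries=1)
```

With `monkeypatch.setattr` the patched values are automatically undone at teardown, which plain `setattr` does not give you.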

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2afa2492-a86d-4897-aece-ab5d5a17d6d4

📥 Commits

Reviewing files that changed from the base of the PR and between f44b635 and 9b8eb27.

📒 Files selected for processing (1)
  • dev-suite/tests/test_e2e_integration.py

1. Major: Patch TOKEN_BUDGET/MAX_RETRIES module globals directly via
   monkeypatch.setattr (not just os.environ), since they're evaluated
   at import time via _safe_int().

2. Major: Strengthen parsed_files assertions — require non-empty,
   verify paths stay within workspace, verify on-disk content matches
   parsed content exactly.

3. Nitpick: Add flush_memory trace assertion to cover all 6 graph nodes.
@Abernaughty Abernaughty merged commit e34685d into main Apr 1, 2026
3 checks passed
@Abernaughty Abernaughty deleted the feat/86-e2e-integration-test branch April 1, 2026 19:49


Linked issue that may be closed by this PR:

feat: E2E integration test — full pipeline with real LLMs

1 participant