feat: E2E integration test — full pipeline with real LLMs (#86) #90
Abernaughty merged 2 commits into main from
Conversation
Full pipeline integration test proving the orchestrator works end-to-end:
- Architect (Gemini) generates Blueprint
- Developer (Claude) writes code using Filesystem MCP tools
- apply_code parses and writes files to workspace
- sandbox_validate runs tests in E2B (conditional on key)
- QA (Claude) reviews with read-only tools
- flush_memory persists entries

Seeded workspace: utils.py stub + test_utils.py with greet() expectations. MCP config: Filesystem MCP via npx pointing at the temp workspace. Assertions are structural (pipeline plumbing), not content-based (LLM output varies). Tool usage, file writes, sandbox results, and memory writes are all verified conditionally based on available keys. Gated by @pytest.mark.integration — excluded from CI. Skips gracefully if API keys are missing from dev-suite/.env.
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds a new end-to-end pytest integration test that seeds a temporary workspace, configures MCP/environment, runs the orchestrator pipeline against real LLM providers, captures AgentState diagnostics, and asserts on blueprint, generated code, applied files, trace nodes, memory writes, token usage, and terminal workflow status.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant Orch as Orchestrator
    participant Arch as Architect (Gemini)
    participant Dev as Developer (Claude)
    participant CodeApp as Code Application
    participant Sandbox as Sandbox
    participant QA as QA Agent
    Test->>Test: Seed workspace (utils.py, test_utils.py) & mcp-config
    Test->>Test: Configure env (API keys, TOKEN_BUDGET, WORKSPACE_ROOT)
    Test->>Orch: run_task("Add greet(name) function")
    Orch->>Arch: Request blueprint
    Arch-->>Orch: Blueprint (targets, instructions)
    Orch->>Dev: Request implementation (tool access enabled)
    Dev->>Dev: Invoke workspace tools (file ops) if available
    Dev-->>Orch: Generated code, tool call logs
    Orch->>CodeApp: Parse & apply code
    CodeApp->>CodeApp: Write files to disk
    CodeApp-->>Orch: Parsed files list
    alt E2B available
        Orch->>Sandbox: Run sandbox validation
        Sandbox-->>Orch: Sandbox results
    end
    Orch->>QA: Run QA checks
    QA-->>Orch: QA report (failures)
    Orch-->>Test: Return AgentState (tokens, memory writes, trace, status)
    Test->>Test: Assert tokens>0, blueprint exists, files on disk, memory writes, trace nodes, terminal status
    Test->>Test: Print diagnostic report
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
dev-suite/tests/test_e2e_integration.py (1)
281-290: The “full graph” assertion still omits `flush_memory`.

Lines 281-286 stop at `qa`, but the workflow also registers `flush_memory` after that. `result.memory_writes` on line 290 only shows writes were queued in state; a regression in the persistence hop would still pass this test.

Suggested fix:

```diff
 assert _has_node_in_trace(trace, "apply_code"), "Trace should show apply_code ran"
 assert _has_node_in_trace(trace, "sandbox_validate"), "Trace should show sandbox_validate ran"
 assert _has_node_in_trace(trace, "qa"), "Trace should show qa ran"
+assert _has_node_in_trace(trace, "flush_memory"), "Trace should show flush_memory ran"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/tests/test_e2e_integration.py` around lines 281 - 290, The trace assertions stop at "qa" but omit the final "flush_memory" node—add assert _has_node_in_trace(trace, "flush_memory") to ensure the full graph ran; also replace or augment the loose assert len(result.memory_writes) > 0 by verifying persistence (e.g., assert that entries in result.memory_writes were actually flushed/persisted by checking a persisted flag on those writes or querying the persistence layer/result persistence client to confirm the expected records exist) so a regression in the persistence hop will fail the test.
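For context, a trace-membership helper like the test's `_has_node_in_trace` could be as simple as the sketch below. The trace shape (a list of dicts keyed by `"node"`) is an assumption; the real AgentState trace structure may differ.

```python
# Hypothetical sketch of _has_node_in_trace; assumes each trace entry is a
# dict whose "node" key names the graph node that produced it.
def _has_node_in_trace(trace, node_name):
    """Return True if any trace entry was produced by the named graph node."""
    return any(entry.get("node") == node_name for entry in trace)

trace = [{"node": "architect"}, {"node": "developer"}, {"node": "qa"}]
assert _has_node_in_trace(trace, "qa")
assert not _has_node_in_trace(trace, "flush_memory")  # the gap flagged above
```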
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dev-suite/tests/test_e2e_integration.py`:
- Around line 135-149: The test currently can pass without apply_code writing
files because it pre-creates utils.py/test_utils.py and never asserts parsed
output was written; update the test to assert result.parsed_files is non-empty,
iterate result.parsed_files (from apply_code) and for each entry verify its
relative path is inside the temporary workspace (tmp_path), that a file exists
on disk at that workspace path, and that the file content exactly equals the
parsed content returned by apply_code; ensure you stop pre-creating the same
filenames (or use unique names) so the test fails if apply_code does not write,
and add assertions that check on-disk content equality to the parsed string for
each parsed file.
- Line 28: The test patches os.environ too late because src.orchestrator's
module-level globals TOKEN_BUDGET and MAX_RETRIES are evaluated at import via
_safe_int; update the test fixture (_configure_env) to monkeypatch those module
globals directly using monkeypatch.setattr on src.orchestrator.TOKEN_BUDGET and
src.orchestrator.MAX_RETRIES (or re-evaluate them by calling
src.orchestrator._safe_int and assigning the results) so run_task() reads the
patched values; ensure you reference src.orchestrator when calling
monkeypatch.setattr rather than only modifying os.environ.
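The import-time-globals problem described in this comment can be reproduced with a stand-in module; `src.orchestrator`, `TOKEN_BUDGET`, and `MAX_RETRIES` are the names from the comment, while the stand-in module itself is purely illustrative.

```python
import sys
import types

import pytest

# Stand-in for src.orchestrator: these globals are evaluated once at import
# (via _safe_int in the real code), so patching os.environ afterwards is too
# late -- run_task() would still read the baked-in values.
orch = types.ModuleType("orch_demo")
orch.TOKEN_BUDGET = 100_000  # value frozen at import time
orch.MAX_RETRIES = 3
sys.modules["orch_demo"] = orch

# The suggested fix: patch the module attributes themselves.  Inside a test
# fixture this would be monkeypatch.setattr(src.orchestrator, "TOKEN_BUDGET", ...).
mp = pytest.MonkeyPatch()
mp.setattr(orch, "TOKEN_BUDGET", 200_000)
mp.setattr(orch, "MAX_RETRIES", 2)
assert orch.TOKEN_BUDGET == 200_000  # this is what run_task() now sees

mp.undo()  # fixture teardown restores the originals
assert orch.TOKEN_BUDGET == 100_000
```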
---
Nitpick comments:
In `@dev-suite/tests/test_e2e_integration.py`:
- Around line 281-290: The trace assertions stop at "qa" but omit the final
"flush_memory" node—add assert _has_node_in_trace(trace, "flush_memory") to
ensure the full graph ran; also replace or augment the loose assert
len(result.memory_writes) > 0 by verifying persistence (e.g., assert that
entries in result.memory_writes were actually flushed/persisted by checking a
persisted flag on those writes or querying the persistence layer/result
persistence client to confirm the expected records exist) so a regression in the
persistence hop will fail the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2afa2492-a86d-4897-aece-ab5d5a17d6d4
📒 Files selected for processing (1)
dev-suite/tests/test_e2e_integration.py
1. Major: Patch TOKEN_BUDGET/MAX_RETRIES module globals directly via monkeypatch.setattr (not just os.environ), since they're evaluated at import time via _safe_int().
2. Major: Strengthen parsed_files assertions — require non-empty, verify paths stay within workspace, verify on-disk content matches parsed content exactly.
3. Nitpick: Add flush_memory trace assertion to cover all 6 graph nodes.
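Finding 2 (strengthening the parsed_files assertions) might look like the following sketch. It assumes `parsed_files` is a list of `(relative_path, content)` pairs; the real apply_code return shape may differ.

```python
from pathlib import Path

def assert_parsed_files_written(parsed_files, workspace: Path) -> None:
    """Check that every file apply_code parsed was actually written to disk,
    stays inside the workspace, and matches the parsed content exactly.
    The (relative_path, content) pair shape is an assumption."""
    assert parsed_files, "apply_code should have parsed at least one file"
    root = workspace.resolve()
    for rel_path, content in parsed_files:
        target = (root / rel_path).resolve()
        # Reject paths that escape the temporary workspace.
        assert target.is_relative_to(root), f"{rel_path} escapes the workspace"
        assert target.is_file(), f"{rel_path} was never written to disk"
        assert target.read_text() == content, f"{rel_path} content mismatch"
```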
Summary
Full end-to-end integration test that proves the orchestrator pipeline works with real LLMs, real MCP tools, and real sandbox execution. This is the "moment of truth" test — #79 (code application), #80 (tool binding), #82 (code parser), #83 (tool wiring) all converge here.
What It Tests

- Developer invokes `filesystem_read`/`filesystem_write` tools
- apply_code parses `# --- FILE:` markers, writes files to workspace disk
- Sandbox validation in E2B (conditional on `E2B_API_KEY`)

Test Design
Seeded Workspace (`tmp_path`)

- `utils.py` — stub file (agent must add the `greet` function)
- `test_utils.py` — expects `greet('World') == 'Hello, World!'` and `greet('Alice') == 'Hello, Alice!'`
- `mcp-config.json` — points Filesystem MCP at the temp workspace (if npx available)
Assertions (structural, not content-based)

- `AgentState` returned with `tokens_used > 0` (real API calls happened)
- `memory_writes` has entries
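In code, such structural checks prove each pipeline stage ran without asserting anything about what the LLM wrote. The field names below follow this description and are assumptions about the real AgentState.

```python
from pathlib import Path

def assert_pipeline_plumbing(result, workspace: Path) -> None:
    """Structural assertions only; field names are assumed, not the real API."""
    assert result.tokens_used > 0, "no real API calls were made"
    assert result.blueprint, "Architect returned no blueprint"
    assert (workspace / "utils.py").exists(), "seeded workspace file missing"
    assert len(result.memory_writes) > 0, "no memory entries were queued"
```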
Gating

- `@pytest.mark.integration` — excluded from CI by `-m "not integration"`
- Skips if `ANTHROPIC_API_KEY` or `GOOGLE_API_KEY` missing
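The gating described above is standard pytest marker/skipif machinery; a minimal sketch (test name and body are illustrative, env var names come from the description):

```python
import os

import pytest

# Missing keys skip the test instead of failing it; CI excludes it entirely
# with `pytest -m "not integration"`.
requires_keys = pytest.mark.skipif(
    not (os.getenv("ANTHROPIC_API_KEY") and os.getenv("GOOGLE_API_KEY")),
    reason="API keys missing from dev-suite/.env",
)

@pytest.mark.integration
@requires_keys
def test_full_pipeline_e2e(tmp_path):
    ...  # seed workspace, run the orchestrator, assert structurally
```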
Run Command

```shell
cd dev-suite
uv run --group dev --group api pytest tests/test_e2e_integration.py -v -s -m integration
```

Known Gaps (follow-up issues)
- `publish_code_node` — tracked in "feat: publish_code_node — branch creation + PR opening after QA pass" #89
Files

- `dev-suite/tests/test_e2e_integration.py`

Estimated Cost

$0.05–$0.50 per run depending on retries
Closes #86