feat: E2E integration test — full pipeline with real LLMs #86
Closed
Labels
`component/orchestrator` (LangGraph state machine), `priority/P1` (Important - this sprint), `sprint/current`, `type/feature` (New capability)
Description
Summary
Create an end-to-end integration test that runs the full orchestrator pipeline with real LLM API keys — Architect (Gemini) generates a blueprint, Dev (Claude) writes code with tools, code gets applied to workspace, sandbox validates, QA reviews. This is the "moment of truth" test proving the system works.
Context
All infrastructure is in place:
- Code application pipeline (PR #82, Issue #79) — `apply_code_node` parses and writes files
- Agent tool binding (PR #83, Issue #80) — Dev and QA can use workspace tools
- Fullstack E2B template (`pv9fsqyxoqx3eqlgtony`) — Node 24 + pnpm + Python sandbox
- Memory layer, tracing, SSE events all wired
What's missing is proof that it all works together end-to-end with real models.
Test Design
Setup
- Create a temp directory as `WORKSPACE_ROOT`
- Seed it with a minimal Python project:
  - `utils.py` — empty or with a simple existing function
  - `test_utils.py` — a test file expecting a `greet(name)` function
  - `pyproject.toml` or minimal config so pytest can run
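The seeding step above could be sketched as a small helper; file contents here are illustrative placeholders, not taken from the repo:

```python
import tempfile
from pathlib import Path


def seed_workspace() -> Path:
    """Create a throwaway dir laid out as the minimal project the test expects."""
    root = Path(tempfile.mkdtemp(prefix="e2e-workspace-"))
    # utils.py starts empty; the Dev agent is expected to add greet() here.
    (root / "utils.py").write_text("# intentionally empty\n")
    # The seeded test encodes the acceptance criterion for the task prompt.
    (root / "test_utils.py").write_text(
        "from utils import greet\n\n"
        "def test_greet():\n"
        "    assert greet('World') == 'Hello, World!'\n"
    )
    # Minimal config so pytest can discover and run the project.
    (root / "pyproject.toml").write_text(
        '[project]\nname = "e2e-seed"\nversion = "0.0.0"\n'
    )
    return root
```

In the real test this would likely be a pytest fixture built on `tmp_path` rather than `tempfile.mkdtemp`.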
Execution
- Call `run_task("Add a greet(name) function to utils.py that returns 'Hello, {name}!'")` with:
  - `MAX_RETRIES=3`
  - `TOKEN_BUDGET=100000` (generous for real tool loops)
  - `WORKSPACE_ROOT` pointing to the temp dir
- Real `ANTHROPIC_API_KEY` and `GOOGLE_API_KEY`
- `E2B_API_KEY` optional (sandbox validation skipped if absent)
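The execution step might look roughly like this; the `orchestrator` import path is an assumption, and only the env-var names come from this issue:

```python
import os

import pytest

# Keys the test refuses to run without (per the Execution section).
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


@pytest.mark.integration
def test_full_pipeline(tmp_path, monkeypatch):
    # Graceful degradation: skip rather than fail when keys are absent.
    if any(k not in os.environ for k in REQUIRED_KEYS):
        pytest.skip("real LLM API keys not configured")

    monkeypatch.setenv("WORKSPACE_ROOT", str(tmp_path))
    monkeypatch.setenv("MAX_RETRIES", "3")
    monkeypatch.setenv("TOKEN_BUDGET", "100000")

    # Hypothetical import path; adjust to wherever run_task actually lives.
    from orchestrator import run_task

    result = run_task(
        "Add a greet(name) function to utils.py that returns 'Hello, {name}!'"
    )
    assert result is not None
```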
Assertions
Assert pipeline structure (not content quality):
- `result.status` is `WorkflowStatus.PASSED` (ideal), or at minimum the run did not crash
- `result.blueprint` is not None (Architect produced output)
- `result.generated_code` is not empty (Dev produced code)
- `result.parsed_files` has at least 1 entry (code parser worked)
- Files exist on disk in the temp workspace
- `result.memory_writes` has entries (memory layer engaged)
- `result.tokens_used > 0` (real API calls happened)
- `result.tool_calls_log` has entries if tools were available
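These checks could live in one helper; the result fields are the ones named in this issue, but the object shape is assumed (the `WorkflowStatus` check is left out here since that enum is project-specific):

```python
from pathlib import Path


def assert_pipeline_structure(result, workspace: Path) -> None:
    """Structural checks only; deliberately says nothing about code quality."""
    assert result.blueprint is not None      # Architect produced output
    assert result.generated_code             # Dev produced code
    assert len(result.parsed_files) >= 1     # code parser worked
    assert any(workspace.iterdir())          # files landed on disk
    assert result.memory_writes              # memory layer engaged
    assert result.tokens_used > 0            # real API calls happened
    if result.tool_calls_log is not None:    # only if tools were available
        assert result.tool_calls_log
```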
Markers
- Mark with `@pytest.mark.integration` — excluded from CI via `-m "not integration"`
- Skip if required API keys are missing (graceful degradation)
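The skip-on-missing-keys behavior can also be packaged as a reusable decorator in `conftest.py`; this is a sketch, not existing project code:

```python
import os

import pytest

# Apply to any test that needs live LLM calls; it skips (not fails)
# when the real keys are not configured.
requires_real_keys = pytest.mark.skipif(
    not (os.environ.get("ANTHROPIC_API_KEY") and os.environ.get("GOOGLE_API_KEY")),
    reason="real LLM API keys not configured",
)
```

Combined with `-m "not integration"` in the CI invocation, this keeps the suite green both locally and in CI.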
File
| File | Action | Description |
|---|---|---|
| `dev-suite/tests/test_e2e_integration.py` | NEW | Full pipeline integration test |
Implementation Notes
- Use the simplest possible task to minimize flakiness — "add a function" is near-impossible to fail
- Don't assert on code quality or specific file contents — LLM output varies
- Log the full trace on failure for debugging
- Consider adding a convenience script (`scripts/run_e2e.sh`) that sets up env and runs the test
- First run will likely surface edge cases in the orchestrator — that's valuable
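The proposed convenience script might be as simple as this sketch (env-var names from this issue; the guard lines fail fast with a readable message if a key is unset):

```shell
#!/usr/bin/env bash
# scripts/run_e2e.sh (sketch) — set up env and run the E2E integration test.
set -euo pipefail

# Required keys: abort with a message if either is missing.
: "${ANTHROPIC_API_KEY:?set ANTHROPIC_API_KEY}"
: "${GOOGLE_API_KEY:?set GOOGLE_API_KEY}"
# E2B_API_KEY is optional; sandbox validation is skipped if absent.

export MAX_RETRIES=3
export TOKEN_BUDGET=100000

pytest -m integration dev-suite/tests/test_e2e_integration.py -v "$@"
```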
Effort
Medium (1-2 sessions — the test itself is simple but first-run debugging is expected)
Depends On
- Issue feat: Code application pipeline — parse, write to workspace, load into sandbox #79 (Code application pipeline) — DONE
- Issue feat: Agent tool binding — give Dev and QA agents access to workspace tools #80 (Agent tool binding) — DONE
- Issue chore: Fix 7 skipped tests and test infrastructure gaps #81 (Fix skipped tests) — recommended before this
Blocks
- MVP validation — this proves the system works
Status
Done