
feat: E2E integration test — full pipeline with real LLMs #86

@Abernaughty

Description


Summary

Create an end-to-end integration test that runs the full orchestrator pipeline with real LLM API keys — Architect (Gemini) generates a blueprint, Dev (Claude) writes code with tools, code gets applied to workspace, sandbox validates, QA reviews. This is the "moment of truth" test proving the system works.

Context

All infrastructure is in place. What's missing is proof that it all works together end-to-end with real models.

Test Design

Setup

  1. Create a temp directory as WORKSPACE_ROOT
  2. Seed it with a minimal Python project:
    • utils.py — empty or with a simple existing function
    • test_utils.py — a test file expecting a greet(name) function
    • pyproject.toml or minimal config so pytest can run
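The setup steps above can be sketched as a small seeding helper (a sketch only — the exact file contents and the `seed_workspace` name are assumptions, not the project's actual fixture):

```python
import textwrap
from pathlib import Path


def seed_workspace(root: Path) -> None:
    """Seed a minimal Python project whose tests expect a greet(name) function."""
    # utils.py starts empty; the Dev agent is expected to add greet(name).
    (root / "utils.py").write_text("")
    # test_utils.py fails until greet(name) exists, giving the sandbox a
    # concrete pass/fail signal for validation.
    (root / "test_utils.py").write_text(textwrap.dedent("""\
        from utils import greet

        def test_greet():
            assert greet("World") == "Hello, World!"
    """))
    # Minimal config so pytest can discover and run the project.
    (root / "pyproject.toml").write_text(
        '[project]\nname = "e2e-fixture"\nversion = "0.0.0"\n'
    )
```

In the test itself this would run against pytest's `tmp_path` fixture, which doubles as the `WORKSPACE_ROOT`.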

Execution

  1. Call run_task("Add a greet(name) function to utils.py that returns 'Hello, {name}!'") with:
    • MAX_RETRIES=3
    • TOKEN_BUDGET=100000 (generous for real tool loops)
    • WORKSPACE_ROOT pointing to temp dir
    • Real ANTHROPIC_API_KEY and GOOGLE_API_KEY
    • E2B_API_KEY optional (sandbox validation skipped if absent)
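The environment wiring for that call might look like this (the variable names come from the list above; the `e2e_env` helper itself, and the assumption that the orchestrator reads string-valued settings from the environment, are sketches, not confirmed project API):

```python
import os

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def e2e_env(workspace_root: str) -> dict:
    """Collect the environment for a real-model pipeline run."""
    env = {
        "WORKSPACE_ROOT": workspace_root,
        "MAX_RETRIES": "3",
        # Generous budget: real tool loops burn tokens fast.
        "TOKEN_BUDGET": "100000",
    }
    # Required keys plus the optional E2B key; sandbox validation is
    # simply skipped when E2B_API_KEY is absent.
    for key in REQUIRED_KEYS + ("E2B_API_KEY",):
        if os.environ.get(key):
            env[key] = os.environ[key]
    return env
```

The test would then apply these via `monkeypatch.setenv` before invoking `run_task` with the task string above.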

Assertions

  1. Assert pipeline structure (not content quality):
    • result.status is WorkflowStatus.PASSED (ideal); at minimum, run_task returned a result rather than raising
    • result.blueprint is not None (Architect produced output)
    • result.generated_code is not empty (Dev produced code)
    • result.parsed_files has at least 1 entry (code parser worked)
    • Files exist on disk in temp workspace
    • result.memory_writes has entries (memory layer engaged)
    • result.tokens_used > 0 (real API calls happened)
    • result.tool_calls_log has entries if tools were available
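The structural checks above could be collected into one helper (the field names are taken from the list; the helper signature, and passing `WorkflowStatus.PASSED` in as `passed_status`, are assumptions for illustration):

```python
from pathlib import Path


def assert_pipeline_structure(result, passed_status, workspace: Path) -> None:
    """Assert pipeline structure only -- never content quality."""
    assert result.status == passed_status, f"pipeline did not pass: {result.status}"
    assert result.blueprint is not None      # Architect produced output
    assert result.generated_code             # Dev produced code
    assert len(result.parsed_files) >= 1     # code parser worked
    assert any(workspace.iterdir())          # files landed on disk
    assert result.memory_writes              # memory layer engaged
    assert result.tokens_used > 0            # real API calls happened
    # tool_calls_log is only asserted when tools were available, so the
    # caller checks it conditionally rather than unconditionally here.
```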

Markers

  1. Mark with @pytest.mark.integration — excluded from CI via -m "not integration"
  2. Skip if required API keys are missing (graceful degradation)
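Together, the two markers look like this (a sketch; the mark name matches the -m filter above, and the key list mirrors the Execution section):

```python
import os

import pytest

_MISSING = [k for k in ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")
            if not os.environ.get(k)]


# Excluded from CI with: pytest -m "not integration"
@pytest.mark.integration
@pytest.mark.skipif(bool(_MISSING),
                    reason=f"missing API keys: {', '.join(_MISSING)}")
def test_full_pipeline(tmp_path):
    ...  # seed workspace, set env, call run_task, assert structure
```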

File

File: dev-suite/tests/test_e2e_integration.py
Action: NEW
Description: Full pipeline integration test

Implementation Notes

  • Use the simplest possible task to minimize flakiness — "add a function" is near-impossible to fail
  • Don't assert on code quality or specific file contents — LLM output varies
  • Log the full trace on failure for debugging
  • Consider adding a convenience script (scripts/run_e2e.sh) that sets up env and runs the test
  • First run will likely surface edge cases in the orchestrator — that's valuable
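For the "log the full trace on failure" note, a minimal dump helper might suffice (the field names are assumptions drawn from the Assertions section):

```python
import json


def dump_trace(result) -> str:
    """Serialize the result's trace fields for post-mortem debugging."""
    fields = ("status", "tokens_used", "parsed_files",
              "tool_calls_log", "memory_writes")
    return json.dumps({name: getattr(result, name, None) for name in fields},
                      default=str, indent=2)
```

The test can print this from an except block around the assertions, or fold it into the assertion message, so a failed first run leaves a readable record.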

Effort

Medium (1-2 sessions — the test itself is simple but first-run debugging is expected)

Depends On

Blocks

  • MVP validation — this proves the system works

Metadata

Status: Done