
feat: E2E integration test — full pipeline with real LLMs #86

@Abernaughty

Description


Summary

Create an end-to-end integration test that runs the full orchestrator pipeline with real LLM API keys — Architect (Gemini) generates a blueprint, Dev (Claude) writes code with tools, code gets applied to workspace, sandbox validates, QA reviews. This is the "moment of truth" test proving the system works.

Context

All infrastructure is in place. What's missing is proof that it all works together end-to-end with real models.

Test Design

Setup

  1. Create a temp directory as WORKSPACE_ROOT
  2. Seed it with a minimal Python project:
    • utils.py — empty or with a simple existing function
    • test_utils.py — a test file expecting a greet(name) function
    • pyproject.toml or minimal config so pytest can run
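The setup steps above can be sketched as a small seeding helper (a sketch only — the exact file contents and the `seed_workspace` name are assumptions, not the project's actual fixture):

```python
import textwrap
from pathlib import Path


def seed_workspace(root: Path) -> None:
    """Seed a minimal Python project whose tests expect a greet(name) function."""
    # utils.py starts empty; the Dev agent is expected to add greet(name).
    (root / "utils.py").write_text("")
    # test_utils.py fails until greet(name) exists, giving the sandbox a
    # concrete pass/fail signal for validation.
    (root / "test_utils.py").write_text(textwrap.dedent("""\
        from utils import greet

        def test_greet():
            assert greet("World") == "Hello, World!"
    """))
    # Minimal config so pytest can discover and run the project.
    (root / "pyproject.toml").write_text(
        '[project]\nname = "e2e-fixture"\nversion = "0.0.0"\n'
    )
```

In the test itself this would run against pytest's `tmp_path` fixture, which doubles as the `WORKSPACE_ROOT`.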

Execution

  1. Call run_task("Add a greet(name) function to utils.py that returns 'Hello, {name}!'") with:
    • MAX_RETRIES=3
    • TOKEN_BUDGET=100000 (generous for real tool loops)
    • WORKSPACE_ROOT pointing to temp dir
    • Real ANTHROPIC_API_KEY and GOOGLE_API_KEY
    • E2B_API_KEY optional (sandbox validation skipped if absent)
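The environment wiring for that call might look like this (the variable names come from the list above; the `e2e_env` helper itself, and the assumption that the orchestrator reads string-valued settings from the environment, are sketches, not confirmed project API):

```python
import os

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def e2e_env(workspace_root: str) -> dict:
    """Collect the environment for a real-model pipeline run."""
    env = {
        "WORKSPACE_ROOT": workspace_root,
        "MAX_RETRIES": "3",
        # Generous budget: real tool loops burn tokens fast.
        "TOKEN_BUDGET": "100000",
    }
    # Required keys plus the optional E2B key; sandbox validation is
    # simply skipped when E2B_API_KEY is absent.
    for key in REQUIRED_KEYS + ("E2B_API_KEY",):
        if os.environ.get(key):
            env[key] = os.environ[key]
    return env
```

The test would then apply these via `monkeypatch.setenv` before invoking `run_task` with the task string above.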

Assertions

  1. Assert pipeline structure (not content quality):
    • result.status is WorkflowStatus.PASSED (ideal); at minimum, run_task returned a result rather than raising
    • result.blueprint is not None (Architect produced output)
    • result.generated_code is not empty (Dev produced code)
    • result.parsed_files has at least 1 entry (code parser worked)
    • Files exist on disk in temp workspace
    • result.memory_writes has entries (memory layer engaged)
    • result.tokens_used > 0 (real API calls happened)
    • result.tool_calls_log has entries if tools were available
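The structural checks above could be collected into one helper (the field names are taken from the list; the helper signature, and passing `WorkflowStatus.PASSED` in as `passed_status`, are assumptions for illustration):

```python
from pathlib import Path


def assert_pipeline_structure(result, passed_status, workspace: Path) -> None:
    """Assert pipeline structure only -- never content quality."""
    assert result.status == passed_status, f"pipeline did not pass: {result.status}"
    assert result.blueprint is not None      # Architect produced output
    assert result.generated_code             # Dev produced code
    assert len(result.parsed_files) >= 1     # code parser worked
    assert any(workspace.iterdir())          # files landed on disk
    assert result.memory_writes              # memory layer engaged
    assert result.tokens_used > 0            # real API calls happened
    # tool_calls_log is only asserted when tools were available, so the
    # caller checks it conditionally rather than unconditionally here.
```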

Markers

  1. Mark with @pytest.mark.integration — excluded from CI via -m "not integration"
  2. Skip if required API keys are missing (graceful degradation)
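Together, the two markers look like this (a sketch; the mark name matches the -m filter above, and the key list mirrors the Execution section):

```python
import os

import pytest

_MISSING = [k for k in ("ANTHROPIC_API_KEY", "GOOGLE_API_KEY")
            if not os.environ.get(k)]


# Excluded from CI with: pytest -m "not integration"
@pytest.mark.integration
@pytest.mark.skipif(bool(_MISSING),
                    reason=f"missing API keys: {', '.join(_MISSING)}")
def test_full_pipeline(tmp_path):
    ...  # seed workspace, set env, call run_task, assert structure
```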

File

File: dev-suite/tests/test_e2e_integration.py
Action: NEW
Description: Full pipeline integration test

Implementation Notes

  • Use the simplest possible task to minimize flakiness — "add a function" is near-impossible to fail
  • Don't assert on code quality or specific file contents — LLM output varies
  • Log the full trace on failure for debugging
  • Consider adding a convenience script (scripts/run_e2e.sh) that sets up env and runs the test
  • First run will likely surface edge cases in the orchestrator — that's valuable
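For the "log the full trace on failure" note, a minimal dump helper might suffice (the field names are assumptions drawn from the Assertions section):

```python
import json


def dump_trace(result) -> str:
    """Serialize the result's trace fields for post-mortem debugging."""
    fields = ("status", "tokens_used", "parsed_files",
              "tool_calls_log", "memory_writes")
    return json.dumps({name: getattr(result, name, None) for name in fields},
                      default=str, indent=2)
```

The test can print this from an except block around the assertions, or fold it into the assertion message, so a failed first run leaves a readable record.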

Effort

Medium (1-2 sessions — the test itself is simple but first-run debugging is expected)

Depends On

Blocks

  • MVP validation — this proves the system works

Metadata

Status: Done