
feat: QA → Architect escalation path for architectural failures (#57) #78

Merged
Abernaughty merged 10 commits into main from feat/57-qa-escalation-path on Mar 31, 2026

Conversation

@Abernaughty (Owner) commented Mar 31, 2026

Summary

Implements the QA to Architect escalation path so the orchestrator can distinguish between code failures (retry Lead Dev) and architectural failures (re-plan with Architect). This makes the retry budget meaningful instead of burning all 3 retries on the same broken Blueprint.

Key Finding

The escalation routing was already 90% wired from the initial architecture. The route_after_qa function already handled ESCALATED -> architect, and the QA prompt already asked for is_architectural. This PR completes the feature with reliable classification, a new FailureType enum, enhanced prompts, and comprehensive test coverage.

Changes

src/agents/qa.py — FailureType enum + model enhancement

  • New FailureType enum: CODE vs ARCHITECTURAL for explicit routing decisions
  • New failure_type field on FailureReport (Optional, backward compatible)
  • model_validator keeps failure_type and is_architectural in sync
  • failure_type takes precedence when both are provided
  • Old LLM output without failure_type still works (derives from is_architectural)
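
The sync behavior described above can be sketched with a minimal stdlib stand-in (the real model is a pydantic BaseModel with a model_validator; the field set here is trimmed, and tests_passed/tests_failed are omitted for brevity):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureType(str, Enum):
    CODE = "code"
    ARCHITECTURAL = "architectural"

@dataclass
class FailureReport:
    task_id: str
    status: str  # "pass" | "fail" | "escalate"
    errors: list
    failed_files: list
    is_architectural: bool  # legacy flag, kept for backward compatibility
    recommendation: str
    failure_type: Optional[FailureType] = None  # new optional field

    def __post_init__(self):
        # Mirrors the model_validator: keep failure_type and
        # is_architectural in sync.
        if self.failure_type is not None:
            # failure_type takes precedence when both are provided
            self.is_architectural = self.failure_type is FailureType.ARCHITECTURAL
        elif self.is_architectural:
            self.failure_type = FailureType.ARCHITECTURAL
        elif self.status == "fail":
            # Old LLM output without failure_type derives it here
            self.failure_type = FailureType.CODE
```

A legacy-style report with only is_architectural set still yields a usable failure_type, which is what keeps old LLM output working.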

src/orchestrator.py — Enhanced prompts (3 changes)

  1. Import: FailureType now imported alongside FailureReport
  2. QA system prompt: Added failure_type field to JSON schema + detailed classification heuristics (5 code-failure signals, 5 architectural-failure signals) so the LLM reliably distinguishes bugs from design flaws
  3. Architect re-plan context: Includes failed_files list + explicit instruction to generate a COMPLETELY NEW Blueprint (not patch the old one)

src/api/events.py — QA_ESCALATION SSE event type

  • Dashboard can display "QA escalated to Architect" in the task timeline

tests/test_qa_escalation.py — 28 new tests

Category             Count  What it covers
FailureType enum       4    Values, string coercion, invalid raises
FailureReport model    9    Sync logic, precedence, backward compat, JSON round-trip
Routing logic          6    Code→dev, arch→architect, budget/retry limits
Graph structure        3    Nodes exist, conditional edges compile
QA classification      2    ESCALATED vs REVIEWING status
Architect re-plan      2    Failure context included, clean on first run
Retry budget           2    Escalation increments count, pass does not

Escalation Flow

QA classifies failure:
  failure_type = "code"          → retry Lead Dev (same Blueprint)
  failure_type = "architectural" → escalate to Architect (NEW Blueprint)
  
Budget limits still apply:
  retry_count >= MAX_RETRIES  → flush_memory → END
  tokens_used >= TOKEN_BUDGET → flush_memory → END
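
A hypothetical sketch of the routing function (the node name "lead_dev" and the lowercase status strings are assumptions; the real route_after_qa uses the WorkflowStatus enum and the env-configured MAX_RETRIES/TOKEN_BUDGET constants):

```python
MAX_RETRIES = 3        # boundary values used in the tests; the real
TOKEN_BUDGET = 50_000  # constants live in src/orchestrator.py

def route_after_qa(state: dict) -> str:
    # Budget limits win over any escalation routing
    if state.get("retry_count", 0) >= MAX_RETRIES:
        return "flush_memory"
    if state.get("tokens_used", 0) >= TOKEN_BUDGET:
        return "flush_memory"
    status = state["status"]
    if status == "escalated":   # architectural failure -> new Blueprint
        return "architect"
    if status == "reviewing":   # code failure -> retry same Blueprint
        return "lead_dev"
    return "flush_memory"       # pass -> done
```

Note that budget exhaustion terminates the run even when QA escalated, so architectural re-plans still count against the same retry budget.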

Acceptance Criteria (from #57)

  • QA agent's failure report includes a failure_type field: code or architectural
  • Orchestrator routes architectural failures back to the Architect node
  • Architect generates a new Blueprint (not a patch of the old one)
  • Architectural re-plans count against the retry budget (still max 3 total)
  • Unit tests covering both escalation paths
  • Existing tests unaffected (no regressions — FailureType is backward compatible)

Closes #57

Summary by CodeRabbit

  • New Features

    • Automated QA escalation: failures now classify as code vs architectural and route architectural issues to the architecture team for full re-planning.
    • New real-time event type to monitor QA escalations.
  • Behavior Improvements

    • More resilient QA report handling from LLMs (case- and whitespace-tolerant classification, sensible fallbacks).
  • Tests

    • Added comprehensive test coverage for escalation routing, classification rules, and retry/limits behavior.
  • Documentation

    • Updated guidance describing escalation routing and classification.

- Add FailureType enum: CODE vs ARCHITECTURAL for routing
- Add failure_type field to FailureReport (Optional, backward compat)
- model_validator keeps failure_type and is_architectural in sync
- failure_type takes precedence when both are provided
- Old LLM output without failure_type still works (derives from is_architectural)
Dashboard can now display "QA escalated to Architect" in the task
timeline when an architectural failure triggers re-planning.
…ion (#57)

1. Import FailureType from agents.qa
2. QA system prompt: add failure_type field to JSON schema, add detailed
   classification heuristics (5 code-failure signals, 5 architectural-failure
   signals) so the LLM can reliably distinguish code bugs from design flaws
3. Architect re-plan: include failed_files list in context, add explicit
   instruction to generate a COMPLETELY NEW Blueprint (not patch the old one)
Tests cover:
- FailureType enum (4 tests)
- FailureReport model sync logic (9 tests)
- Routing: code vs architectural escalation (6 tests)
- Graph structure verification (3 tests)
- QA classification behavior (2 tests)
- Architect re-plan context (2 tests)
- Retry budget accounting (2 tests)

Labels: testing, sprint/current, component/orchestrator

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ad48e6ce1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

failed_files: list[str]
is_architectural: bool # If True, escalate to Architect
recommendation: str
failure_type: FailureType | None = None

P1 Badge Accept case-insensitive failure_type values

The new failure_type field is validated as a strict FailureType enum, so common LLM variants like "ARCHITECTURAL" or "Code" will raise a ValidationError even when the rest of the QA report is valid. In qa_node, that exception path is treated as QA failed to produce valid report and sets workflow status to FAILED, which aborts normal retry/escalation routing. Because this value is model-generated text, it should be normalized (or otherwise parsed leniently) before enum validation.


Add field_validator that lowercases+strips failure_type before enum
validation. LLMs may return "ARCHITECTURAL", "Code", "Architectural"
etc. — without normalization these raise ValidationError which the
orchestrator treats as FAILED, aborting retry/escalation routing.
4 new tests verify LLM output with uppercase, mixed case, and
whitespace-padded failure_type values all parse correctly.

@Abernaughty Abernaughty left a comment


Fixed in commits 4c3b079 and e6f5a3b. Added field_validator("failure_type", mode="before") that normalizes to lowercase+strip before enum validation. 4 new tests cover uppercase, mixed case, and whitespace-padded variants.

failed_files: list[str]
is_architectural: bool # If True, escalate to Architect
recommendation: str
failure_type: FailureType | None = None
Owner Author

Good catch. Fixed in 4c3b079 -- added a field_validator("failure_type", mode="before") that calls .strip().lower() on string inputs before enum validation. Tests added for "ARCHITECTURAL", "Code", "CODE", and " architectural " variants.
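
The normalization step can be sketched as a standalone function (hypothetical name; in the real code this runs as a pydantic field_validator("failure_type", mode="before") on the model):

```python
from enum import Enum

class FailureType(str, Enum):
    CODE = "code"
    ARCHITECTURAL = "architectural"

def normalize_failure_type(value):
    # Lowercase + strip before enum validation so LLM casing variants
    # like "ARCHITECTURAL" or " Code " parse cleanly. At this stage,
    # truly unknown strings (e.g. "design_flaw") still raise ValueError;
    # lenient coercion to None came in a later fix.
    if isinstance(value, str):
        value = value.strip().lower()
    return None if value is None else FailureType(value)
```

This keeps the enum itself strict while absorbing harmless formatting drift from model output.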

@Abernaughty

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e6f5a3b22b


failed_files: list[str]
is_architectural: bool # If True, escalate to Architect
recommendation: str
failure_type: FailureType | None = None


P1 Badge Gracefully handle unknown failure_type values

Defining failure_type as a strict FailureType enum means any non-exact LLM value (for example, a typo or synonym) now raises a validation error when FailureReport(**report_data) runs in qa_node, and that path marks the workflow as FAILED instead of continuing with retry/escalation logic. This turns small model-format drift into a hard stop; consider coercing unknown values to None (and falling back to is_architectural/status) rather than failing the whole run.


#2)

Unknown LLM values like "design_flaw" or typos now gracefully fall back
to None, letting the model_validator derive failure_type from
is_architectural/status. Logs a warning for observability. Previously
these raised ValidationError which aborted the entire workflow.
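
That lenient coercion can be sketched as follows (hypothetical function name; in the real code this logic lives inside the mode="before" validator):

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)

class FailureType(str, Enum):
    CODE = "code"
    ARCHITECTURAL = "architectural"

_VALID = {member.value for member in FailureType}

def coerce_failure_type(value):
    # Normalize strings, then coerce unknown values (typos, synonyms
    # like "design_flaw") to None instead of raising, so the downstream
    # sync logic can fall back to is_architectural/status.
    if isinstance(value, str):
        value = value.strip().lower()
        if value not in _VALID:
            logger.warning("Unknown failure_type %r; falling back to None", value)
            return None
    return value
```

Returning None (rather than raising) is what converts model-format drift from a hard workflow failure into a recoverable fallback.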

coderabbitai bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 280cc388-97c9-4874-a1e7-b5ebb6d8ac4f

📥 Commits

Reviewing files that changed from the base of the PR and between 094689f and 3c64ded.

📒 Files selected for processing (2)
  • dev-suite/src/agents/qa.py
  • dev-suite/tests/test_qa_escalation.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • dev-suite/src/agents/qa.py
  • dev-suite/tests/test_qa_escalation.py

📝 Walkthrough

Adds a QA→Architect escalation flow: a new FailureType enum and optional failure_type on FailureReport with validation/synchronization logic; orchestrator routing and prompts updated to route architectural failures to the Architect (new Blueprint) and code failures to the Lead Dev; new SSE event and tests added.

Changes

QA Agent Failure Classification (dev-suite/src/agents/qa.py)
  Introduce FailureType enum (code, architectural) and add an optional failure_type: FailureType | None = None field to FailureReport, with validation/synchronization logic tying it to is_architectural and status.
Orchestrator Routing & Prompts (dev-suite/src/orchestrator.py)
  Import FailureType. Update the qa_node prompt schema to include failure_type and explicit classification rules; change architect_node to include failed files when present, append a recommendation newline, and instruct the Architect to generate a completely new Blueprint (not a patch).
API Event Type (dev-suite/src/api/events.py)
  Add QA_ESCALATION = "qa_escalation" to the EventType enum. Minor log/comment punctuation tweaks (em dash → double hyphen).
Test Coverage (dev-suite/tests/test_qa_escalation.py)
  Add comprehensive tests for FailureType enum parsing, FailureReport sync rules, LLM-style string inputs, routing via route_after_qa, retry/budget edge cases, graph wiring (qa, architect nodes), and architect re-planning context behavior.

Sequence Diagram(s)

sequenceDiagram
    participant QA as QA Agent
    participant Router as Orchestrator
    participant Dev as Lead Dev Node
    participant Arch as Architect Node
    participant Mem as Memory/Flush

    QA->>Router: Return FailureReport (includes failure_type)
    alt failure_type == "code"
        Router->>Dev: Route to Lead Dev (same Blueprint)
        Dev->>Mem: Execute / retry / eventually escalate
    else failure_type == "architectural"
        Router->>Arch: Route to Architect (with failure context)
        Arch->>Arch: Generate a completely new Blueprint
        Arch->>Dev: Send new Blueprint to Lead Dev
    else status == "pass"
        Router->>Mem: Flush memory
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 A bug or a plan, which way shall we go?

QA now whispers the truth it must show.
Code goes to devs, designs go to art—
A rabbit hops in, glad to do its part!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning
  Explanation: Docstring coverage is 66.67%, below the required threshold of 80.00%.
  Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Description Check: ✅ Passed
  Check skipped; CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed
  The title clearly and concisely describes the main feature (QA → Architect escalation for architectural failures), which matches the primary objective of the changeset.
Linked Issues Check: ✅ Passed
  All acceptance criteria from issue #57 are met: failure_type field added to FailureReport, orchestrator routes architectural failures to the Architect, new Blueprint generation via enhanced prompts, architectural re-plans counted against the retry budget, comprehensive tests added.
Out of Scope Changes Check: ✅ Passed
  Changes are focused on QA escalation: FailureType enum, FailureReport field, orchestrator routing logic, prompts, and tests. The QA_ESCALATION SSE event is a minor addition for dashboard visibility, directly supporting the escalation feature.


3 new tests verify that unknown LLM values ("design_flaw", "typo_value",
"") gracefully fall back to is_architectural/status derivation instead
of crashing. Total: 35 tests in test_qa_escalation.py.

@Abernaughty Abernaughty left a comment


Fixed in commits 0e18164 and 094689f. The field_validator now coerces unknown strings to None (with a warning log) instead of letting them reach enum validation. The model_validator then derives failure_type from is_architectural/status as fallback. 3 new tests cover unknown values.

failed_files: list[str]
is_architectural: bool # If True, escalate to Architect
recommendation: str
failure_type: FailureType | None = None
Owner Author

Fixed — the field_validator now only passes through "code" and "architectural" (after lowercasing). Unknown strings like "design_flaw" are coerced to None with a warning log, and the model_validator falls back to is_architectural/status. No more ValidationError crashes from model-format drift.

@Abernaughty
Owner Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 094689ff0b


Comment on lines +90 to +91
elif self.status == "fail":
    self.failure_type = FailureType.CODE


P2 Badge Infer architectural failures from escalate status

When failure_type is missing/invalid, sync_failure_type only derives a fallback for status == "fail" and ignores status == "escalate". In that case a report like {status: "escalate", failure_type: "design_flaw", is_architectural: false} is normalized to a non-architectural failure, so qa_node routes it back to the developer instead of the architect. This defeats the new escalation path precisely when the model output is inconsistent (the case this fallback is meant to handle).


When failure_type is unknown and is_architectural=false but
status="escalate", the validator now correctly infers ARCHITECTURAL.
Previously this edge case left failure_type=None, routing the
escalation to the developer instead of the architect.
Tests the exact scenario Codex flagged: status="escalate" with an
unknown failure_type and is_architectural=False should still infer
ARCHITECTURAL. Total: 36 tests.

@Abernaughty Abernaughty left a comment


Fixed in bf22951 and 3c64ded. The sync_failure_type model_validator now handles status="escalate" as an explicit branch — when failure_type is None (unknown/missing) and is_architectural=False, status="escalate" correctly infers ARCHITECTURAL. Test added for the exact edge case.

elif self.is_architectural:
    self.failure_type = FailureType.ARCHITECTURAL
elif self.status == "fail":
    self.failure_type = FailureType.CODE
Owner Author

Fixed — added elif self.status == "escalate" branch to sync_failure_type that sets failure_type = ARCHITECTURAL and is_architectural = True. The edge case {status: "escalate", failure_type: "design_flaw", is_architectural: false} now correctly routes to the Architect.
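
The resulting branch order can be sketched as follows (SimpleNamespace stands in for the pydantic model; in the real code this runs inside the sync_failure_type model_validator):

```python
from enum import Enum
from types import SimpleNamespace

class FailureType(str, Enum):
    CODE = "code"
    ARCHITECTURAL = "architectural"

def sync_failure_type(report):
    # Branch order after the fix: an explicit failure_type wins, then
    # the legacy is_architectural flag, then status="escalate" implies
    # an architectural failure, and only plain status="fail" defaults
    # to a code failure.
    if report.failure_type is not None:
        report.is_architectural = report.failure_type is FailureType.ARCHITECTURAL
    elif report.is_architectural:
        report.failure_type = FailureType.ARCHITECTURAL
    elif report.status == "escalate":
        # New branch: escalate intent alone is enough to route to the Architect
        report.failure_type = FailureType.ARCHITECTURAL
        report.is_architectural = True
    elif report.status == "fail":
        report.failure_type = FailureType.CODE
    return report

# The exact edge case Codex flagged: unknown failure_type already
# coerced to None, is_architectural=False, but status says "escalate".
report = sync_failure_type(SimpleNamespace(
    status="escalate", failure_type=None, is_architectural=False))
```

With the new branch, the inconsistent report above normalizes to an architectural failure instead of degrading to a code-retry.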

@Abernaughty
Owner Author

@codex review


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
dev-suite/tests/test_qa_escalation.py (1)

220-228: Use orchestrator constants for retry/token boundary tests.

These assertions hard-code 3 and 50000, which can drift from env-configured MAX_RETRIES/TOKEN_BUDGET. Import the constants from src.orchestrator for stable intent-based tests.

Proposed test tweak
     def test_escalation_respects_max_retries(self):
-        route, WS = self._import_route()
-        state = {"status": WS.ESCALATED, "retry_count": 3, "tokens_used": 20000}
+        from src import orchestrator as orch
+        route, WS = self._import_route()
+        state = {"status": WS.ESCALATED, "retry_count": orch.MAX_RETRIES, "tokens_used": 20000}
         assert route(state) == "flush_memory"

     def test_escalation_respects_token_budget(self):
-        route, WS = self._import_route()
-        state = {"status": WS.ESCALATED, "retry_count": 1, "tokens_used": 50000}
+        from src import orchestrator as orch
+        route, WS = self._import_route()
+        state = {"status": WS.ESCALATED, "retry_count": 1, "tokens_used": orch.TOKEN_BUDGET}
         assert route(state) == "flush_memory"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_qa_escalation.py` around lines 220 - 228, Replace
hard-coded retry and token thresholds in tests
test_escalation_respects_max_retries and test_escalation_respects_token_budget
with the orchestrator constants: import MAX_RETRIES and TOKEN_BUDGET from
src.orchestrator at top of the test file, then use MAX_RETRIES for the
retry_count in test_escalation_respects_max_retries and TOKEN_BUDGET for
tokens_used in test_escalation_respects_token_budget so the assertions use the
same boundary values as the routing logic; keep the rest of the test setup
(route, WS = self._import_route(), state dict, assert route(state) ==
"flush_memory") unchanged.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/agents/qa.py`:
- Around line 88-92: The sync logic allows status="escalate" to be treated as
CODE when failure_type is unset and is_architectural is false; update the
sync_failure_type flow to explicitly map status == "escalate" to
FailureType.ARCHITECTURAL before the fallback that sets FailureType.CODE so
architectural intents don't degrade to code-retry. Concretely, inside the
sync_failure_type method (where it inspects self.is_architectural, self.status
and sets self.failure_type), add an explicit branch checking if self.status ==
"escalate" and assign FailureType.ARCHITECTURAL (referencing self.failure_type,
self.is_architectural, self.status, FailureType) prior to the existing elif that
assigns FailureType.CODE.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03e86091-118d-4936-8b36-98115e854ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 1d4fecc and 094689f.

📒 Files selected for processing (4)
  • dev-suite/src/agents/qa.py
  • dev-suite/src/api/events.py
  • dev-suite/src/orchestrator.py
  • dev-suite/tests/test_qa_escalation.py

TASK_COMPLETE = "task_complete"
MEMORY_ADDED = "memory_added"
LOG_LINE = "log_line"
QA_ESCALATION = "qa_escalation"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify producer + consumer wiring for qa_escalation

echo "== Backend event usage =="
rg -n -C3 "QA_ESCALATION|qa_escalation|EventType\." dev-suite/src

echo
echo "== Dashboard SSE type union =="
rg -n -C4 "type SSEEventType|qa_escalation" dashboard/src/lib/types/api.ts

echo
echo "== Dashboard SSE dispatch/listener coverage =="
rg -n -C4 "switch \(eventType\)|eventTypes: SSEEventType\[]|qa_escalation" dashboard/src/lib/sse.ts

Repository: Abernaughty/agent-dev

Length of output: 6639


QA_ESCALATION is defined but not wired end-to-end—dashboard visibility is broken.

The enum was added at Line 51, but the integration is incomplete: no backend code emits this event, the dashboard SSE type union (dashboard/src/lib/types/api.ts:172–176) does not include 'qa_escalation', and the SSE dispatcher and listener array (dashboard/src/lib/sse.ts:44–80, 138–144) lack corresponding handlers. The event cannot flow to UI consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/api/events.py` at line 51, QA_ESCALATION is added to events but
never emitted or handled end-to-end; to fix, emit QA_ESCALATION from the backend
code path that triggers QA escalations (call the same event-emitter used for
other events with the QA_ESCALATION constant), add 'qa_escalation' to the
dashboard SSE type union in the types file so the client recognizes the event,
and register a handler for 'qa_escalation' in the SSE dispatcher and add it to
the listener array so the UI can receive and act on the event. Ensure the
emitted payload shape matches the dashboard's expected payload type.

Comment on lines +270 to +310
    def test_qa_sets_escalated_for_architectural(self):
        """When QA reports is_architectural=True, status should be ESCALATED."""
        from src.orchestrator import WorkflowStatus

        report = FailureReport(
            task_id="t1", status="escalate",
            tests_passed=0, tests_failed=2,
            errors=["wrong file targeted"],
            failed_files=["wrong.py"],
            is_architectural=True,
            recommendation="re-plan",
        )
        if report.is_architectural:
            status = WorkflowStatus.ESCALATED
        elif report.status == "pass":
            status = WorkflowStatus.PASSED
        else:
            status = WorkflowStatus.REVIEWING

        assert status == WorkflowStatus.ESCALATED

    def test_qa_sets_reviewing_for_code_failure(self):
        """When QA reports a code failure, status should be REVIEWING (retry)."""
        from src.orchestrator import WorkflowStatus

        report = FailureReport(
            task_id="t2", status="fail",
            tests_passed=3, tests_failed=1,
            errors=["TypeError in line 42"],
            failed_files=["auth.py"],
            is_architectural=False,
            recommendation="fix type error",
        )
        if report.is_architectural:
            status = WorkflowStatus.ESCALATED
        elif report.status == "pass":
            status = WorkflowStatus.PASSED
        else:
            status = WorkflowStatus.REVIEWING

        assert status == WorkflowStatus.REVIEWING


⚠️ Potential issue | 🟠 Major

These tests reimplement QA branching instead of testing qa_node.

The assertions are computed from local if/elif logic, so they can still pass even if src.orchestrator.qa_node regresses. Please invoke qa_node directly with a mocked QA LLM response and assert returned state/status.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_qa_escalation.py` around lines 270 - 310, Replace the
local if/elif branching in test_qa_sets_escalated_for_architectural and
test_qa_sets_reviewing_for_code_failure with calls to the actual qa_node
function: construct the same FailureReport instances, mock the QA LLM response
used by qa_node to simulate the intended QA outcome (architectural escalation vs
code failure), invoke qa_node(report, ...) and assert the returned
WorkflowStatus (or node result/status field) equals WorkflowStatus.ESCALATED for
the architectural case and WorkflowStatus.REVIEWING for the code-failure case;
refer to the qa_node function, FailureReport class, and WorkflowStatus enum to
locate inputs and outputs to mock and assert.

Comment on lines +321 to +340
    def test_architect_appends_failure_context(self):
        """architect_node should include failure info when is_architectural."""
        failure = FailureReport(
            task_id="t1", status="escalate",
            tests_passed=0, tests_failed=2,
            errors=["wrong file targeted", "missing dependency"],
            failed_files=["wrong.py"],
            is_architectural=True,
            recommendation="Target auth.py instead of wrong.py",
        )

        user_msg = "Create authentication middleware"
        if failure and failure.is_architectural:
            user_msg += "\n\nPREVIOUS ATTEMPT FAILED (architectural issue):\n"
            user_msg += f"Errors: {', '.join(failure.errors)}\n"
            user_msg += f"Recommendation: {failure.recommendation}"

        assert "PREVIOUS ATTEMPT FAILED (architectural issue)" in user_msg
        assert "wrong file targeted" in user_msg
        assert "Target auth.py instead of wrong.py" in user_msg


⚠️ Potential issue | 🟠 Major

Architect re-plan context test is not validating the actual orchestrator behavior.

This block manually assembles user_msg and does not verify the newly added failed_files context or the “Generate a COMPLETELY NEW Blueprint” instruction from src/orchestrator.py Lines 274-280. It should call architect_node with a stubbed LLM and inspect the HumanMessage payload.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_qa_escalation.py` around lines 321 - 340, Replace the
manual assembly of user_msg in test_architect_appends_failure_context with a
real invocation of architect_node using a stubbed LLM that captures sent
messages; create a FailureReport instance as shown, call architect_node(task,
failure_report, llm_stub) (or the actual architect_node signature) and then
inspect the captured HumanMessage payload to assert it includes failed_files
(e.g., "wrong.py"), the errors and recommendation, and the exact instruction
text "Generate a COMPLETELY NEW Blueprint"; ensure the test asserts on the
captured HumanMessage content rather than constructing user_msg manually.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?



@Abernaughty Abernaughty left a comment


Addressing CodeRabbit findings — all valid observations, responses inline.

TASK_COMPLETE = "task_complete"
MEMORY_ADDED = "memory_added"
LOG_LINE = "log_line"
QA_ESCALATION = "qa_escalation"
Owner Author

Acknowledged — this is an intentional forward declaration. The orchestrator currently doesn't emit any SSE events during execution (that wiring lives in the runner/API layer). Adding the enum value now means the dashboard type union and handlers can be wired in a follow-up without touching the events module again. Full emit→dispatch→UI chain will be connected when we build the orchestrator-to-SSE bridge.

task_id="t2", status="fail",
tests_passed=3, tests_failed=1,
errors=["TypeError in line 42"],
failed_files=["auth.py"],
Owner Author

Valid point. These tests intentionally validate the classification logic and routing decisions in isolation — the two things #57 actually changes. Testing qa_node end-to-end requires mocking the QA LLM response + JSON parse path + full GraphState construction, which is closer to an integration test. The existing test_orchestrator.py follows the same pattern (tests routing and state, not mocked LLM calls). Adding mocked-LLM node tests is a good improvement for a future testing pass.

errors=["wrong file targeted", "missing dependency"],
failed_files=["wrong.py"],
is_architectural=True,
recommendation="Target auth.py instead of wrong.py",
Owner Author

Same rationale as the QA node comment — calling architect_node requires mocking _get_architect_llm(), _fetch_memory_context(), and the Gemini response chain. The test as written validates the context assembly contract (errors, failed_files, recommendation appear in the message), which is what this PR changes. Full node-level integration tests with stubbed LLMs are a worthwhile future improvement.

@Abernaughty Abernaughty merged commit 6427d1b into main Mar 31, 2026
3 checks passed
@Abernaughty Abernaughty deleted the feat/57-qa-escalation-path branch March 31, 2026 19:56