Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
88 changes: 77 additions & 11 deletions dev-suite/src/agents/qa.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,41 @@
"""QA agent - runs tests, audits security, writes failure reports.

Produces structured failure reports for retry context.
Can escalate architectural failures back to Architect.

Implementation in Step 4.
Can escalate architectural failures back to Architect via
failure_type classification.
"""

from pydantic import BaseModel
import logging
from enum import Enum

from pydantic import BaseModel, field_validator, model_validator

logger = logging.getLogger(__name__)


class FailureType(str, Enum):
"""Classification of QA failure for routing decisions.

The orchestrator uses this to decide where to route after QA:
- CODE: Bug, syntax error, test failure in the implementation.
Action: retry with Lead Dev using the same Blueprint.
- ARCHITECTURAL: Wrong target file, missing dependency, design flaw.
Action: escalate to Architect for a new Blueprint.
"""

CODE = "code"
ARCHITECTURAL = "architectural"


class FailureReport(BaseModel):
"""Structured report passed back on QA failure.

Included in retry context so Lead Dev knows exactly
what broke and why.
Included in retry context so the orchestrator can route to the
correct agent: Lead Dev for code fixes, Architect for re-planning.

The failure_type field is the primary classifier. The is_architectural
bool is kept for backward compatibility and stays in sync via the
model validator.
"""

task_id: str
Expand All @@ -24,10 +46,54 @@ class FailureReport(BaseModel):
failed_files: list[str]
is_architectural: bool # If True, escalate to Architect
recommendation: str
failure_type: FailureType | None = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Accept case-insensitive failure_type values

The new failure_type field is validated as a strict FailureType enum, so common LLM variants like "ARCHITECTURAL" or "Code" will raise a ValidationError even when the rest of the QA report is valid. In qa_node, that exception path is treated as QA failed to produce valid report and sets workflow status to FAILED, which aborts normal retry/escalation routing. Because this value is model-generated text, it should be normalized (or otherwise parsed leniently) before enum validation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in 4c3b079 -- added a field_validator("failure_type", mode="before") that calls .strip().lower() on string inputs before enum validation. Tests added for "ARCHITECTURAL", "Code", "CODE", and " architectural " variants.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gracefully handle unknown failure_type values

Defining failure_type as a strict FailureType enum means any non-exact LLM value (for example, a typo or synonym) now raises a validation error when FailureReport(**report_data) runs in qa_node, and that path marks the workflow as FAILED instead of continuing with retry/escalation logic. This turns small model-format drift into a hard stop; consider coercing unknown values to None (and falling back to is_architectural/status) rather than failing the whole run.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the field_validator now only passes through "code" and "architectural" (after lowercasing). Unknown strings like "design_flaw" are coerced to None with a warning log, and the model_validator falls back to is_architectural/status. No more ValidationError crashes from model-format drift.


@field_validator("failure_type", mode="before")
@classmethod
def normalize_failure_type(cls, v: object) -> object:
"""Accept case-insensitive failure_type values from LLM output.

LLMs may return "ARCHITECTURAL", "Code", or even typos like
"design_flaw". Normalize known values to lowercase; coerce
unknown strings to None so the model_validator can fall back
to is_architectural/status instead of crashing the workflow.
"""
if isinstance(v, str):
normalized = v.strip().lower()
if normalized in ("code", "architectural"):
return normalized
# Unknown value -- log and fall back to None
if normalized:
logger.warning(
"Unknown failure_type '%s' from LLM output, "
"falling back to is_architectural/status",
v,
)
return None
return v

@model_validator(mode="after")
def sync_failure_type(self) -> "FailureReport":
"""Keep failure_type and is_architectural in sync.

# TODO Step 4:
# - Define QA prompt template
# - Implement test execution via E2B sandbox
# - Parse structured output from sandbox wrapper
# - Determine pass/fail/escalate decision
If failure_type is provided, it takes precedence and syncs
is_architectural. If only is_architectural is set (backward
compat from older LLM output), failure_type is derived.
status="escalate" also implies ARCHITECTURAL even when
is_architectural was not explicitly set.
"""
if self.failure_type is not None:
# failure_type takes precedence
self.is_architectural = (
self.failure_type == FailureType.ARCHITECTURAL
)
elif self.is_architectural:
self.failure_type = FailureType.ARCHITECTURAL
elif self.status == "escalate":
# status=escalate implies architectural failure
self.failure_type = FailureType.ARCHITECTURAL
self.is_architectural = True
elif self.status == "fail":
self.failure_type = FailureType.CODE
Comment on lines +96 to +97
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Infer architectural failures from escalate status

When failure_type is missing/invalid, sync_failure_type only derives a fallback for status == "fail" and ignores status == "escalate". In that case a report like {status: "escalate", failure_type: "design_flaw", is_architectural: false} is normalized to a non-architectural failure, so qa_node routes it back to the developer instead of the architect. This defeats the new escalation path precisely when the model output is inconsistent (the case this fallback is meant to handle).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added elif self.status == "escalate" branch to sync_failure_type that sets failure_type = ARCHITECTURAL and is_architectural = True. The edge case {status: "escalate", failure_type: "design_flaw", is_architectural: false} now correctly routes to the Architect.

# status == "pass" leaves failure_type as None (no failure)
return self
9 changes: 5 additions & 4 deletions dev-suite/src/api/events.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
"""Async event bus for real-time SSE streaming to the dashboard.

Issue #35: SSE Event System Real-Time Task Streaming
Issue #35: SSE Event System -- Real-Time Task Streaming

The EventBus is a singleton that LangGraph nodes publish events to.
Connected SSE clients each get their own asyncio.Queue for fan-out.
Expand All @@ -14,7 +14,7 @@
data={"agent": "dev", "status": "coding", "task_id": "auth-rls"},
))

Usage (subscribing handled by the /stream endpoint):
Usage (subscribing -- handled by the /stream endpoint):
queue = event_bus.subscribe()
try:
while True:
Expand Down Expand Up @@ -48,6 +48,7 @@ class EventType(str, Enum):
TASK_COMPLETE = "task_complete"
MEMORY_ADDED = "memory_added"
LOG_LINE = "log_line"
QA_ESCALATION = "qa_escalation"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify producer + consumer wiring for qa_escalation

echo "== Backend event usage =="
rg -n -C3 "QA_ESCALATION|qa_escalation|EventType\." dev-suite/src

echo
echo "== Dashboard SSE type union =="
rg -n -C4 "type SSEEventType|qa_escalation" dashboard/src/lib/types/api.ts

echo
echo "== Dashboard SSE dispatch/listener coverage =="
rg -n -C4 "switch \(eventType\)|eventTypes: SSEEventType\[]|qa_escalation" dashboard/src/lib/sse.ts

Repository: Abernaughty/agent-dev

Length of output: 6639


QA_ESCALATION is defined but not wired end-to-end—dashboard visibility is broken.

The enum was added at Line 51, but the integration is incomplete: no backend code emits this event, the dashboard SSE type union (dashboard/src/lib/types/api.ts:172–176) does not include 'qa_escalation', and the SSE dispatcher and listener array (dashboard/src/lib/sse.ts:44–80, 138–144`) lack corresponding handlers. The event cannot flow to UI consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/api/events.py` at line 51, QA_ESCALATION is added to events but
never emitted or handled end-to-end; to fix, emit QA_ESCALATION from the backend
code path that triggers QA escalations (call the same event-emitter used for
other events with the QA_ESCALATION constant), add 'qa_escalation' to the
dashboard SSE type union in the types file so the client recognizes the event,
and register a handler for 'qa_escalation' in the SSE dispatcher and add it to
the listener array so the UI can receive and act on the event. Ensure the
emitted payload shape matches the dashboard's expected payload type.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — this is an intentional forward declaration. The orchestrator currently doesn't emit any SSE events during execution (that wiring lives in the runner/API layer). Adding the enum value now means the dashboard type union and handlers can be wired in a follow-up without touching the events module again. Full emit→dispatch→UI chain will be connected when we build the orchestrator-to-SSE bridge.



class SSEEvent(BaseModel):
Expand Down Expand Up @@ -133,7 +134,7 @@ async def publish(self, event: SSEEvent) -> int:
delivered += 1
except asyncio.QueueFull:
logger.warning(
"SSE client queue full dropping event %s (counter=%d)",
"SSE client queue full -- dropping event %s (counter=%d)",
event.type.value,
self._event_counter,
)
Expand All @@ -156,6 +157,6 @@ async def clear(self) -> None:
logger.info("EventBus cleared all subscribers")


# ── Singleton ──
# -- Singleton --

event_bus = EventBus()
30 changes: 24 additions & 6 deletions dev-suite/src/orchestrator.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
from pydantic import BaseModel

from .agents.architect import Blueprint
from .agents.qa import FailureReport
from .agents.qa import FailureReport, FailureType
from .memory.factory import create_memory_store
from .memory.protocol import MemoryQueryResult, MemoryStore
from .memory.summarizer import summarize_writes_sync
Expand Down Expand Up @@ -271,7 +271,13 @@ def architect_node(state: GraphState) -> dict:
if failure_report and failure_report.is_architectural:
user_msg += "\n\nPREVIOUS ATTEMPT FAILED (architectural issue):\n"
user_msg += f"Errors: {', '.join(failure_report.errors)}\n"
user_msg += f"Recommendation: {failure_report.recommendation}"
if failure_report.failed_files:
user_msg += f"Failed files: {', '.join(failure_report.failed_files)}\n"
user_msg += f"Recommendation: {failure_report.recommendation}\n"
user_msg += (
"\nGenerate a COMPLETELY NEW Blueprint. Do not patch the old one. "
"The previous target_files or approach was wrong."
)

llm = _get_architect_llm()
response = llm.invoke([
Expand Down Expand Up @@ -411,12 +417,24 @@ def qa_node(state: GraphState) -> dict:
' "tests_failed": number,\n'
' "errors": ["list of specific error descriptions"],\n'
' "failed_files": ["list of files with issues"],\n'
' "is_architectural": true/false '
"(set true if the failure is a design/planning issue),\n"
' "is_architectural": true/false,\n'
' "failure_type": "code" or "architectural" or null (if pass),\n'
' "recommendation": "what to fix or why it should escalate"\n'
"}\n\n"
'"escalate" means the Blueprint itself is wrong, not just the '
"implementation.\n"
"FAILURE CLASSIFICATION (critical for correct routing):\n\n"
'Set failure_type to "code" (status: "fail") when:\n'
"- Implementation has bugs, syntax errors, or type errors\n"
"- Tests fail due to logic errors in the code\n"
"- Code does not follow the Blueprint's constraints\n"
"- Missing error handling or edge cases\n"
"Action: Lead Dev will retry with the same Blueprint.\n\n"
'Set failure_type to "architectural" (status: "escalate") when:\n'
"- Blueprint targets the WRONG files (code is in the wrong place)\n"
"- A required dependency or import is missing from the Blueprint\n"
"- The design approach is fundamentally flawed\n"
"- Acceptance criteria are impossible to meet with current targets\n"
"- The task requires files not listed in target_files\n"
"Action: Architect will generate a completely NEW Blueprint.\n\n"
"Be strict but fair. Only pass code that meets ALL acceptance "
"criteria.\n"
"Do not include any text before or after the JSON."
Expand Down
Loading
Loading