Add reformatting to Tool Call Accuracy Evaluator #46090
salma-elshafey wants to merge 6 commits into main
Pull request overview
This PR updates the ToolCallAccuracyEvaluator to reformat conversation history and tool call inputs into a more readable, compact form before invoking the underlying prompty flow, and expands unit tests to cover response/query list scenarios and tool-result inclusion.
Changes:
- Reformat `query` using `reformat_conversation_history()` and reformat `tool_calls` into a `[TOOL_CALL] ...` / `[TOOL_RESULT] ...` string via `reformat_agent_response()`.
- Move intermediate-response detection and message preprocessing earlier into `_real_call()` to ensure tool parsing operates on normalized inputs.
- Extend unit tests to validate that tool call reformatting is applied and that tool results are included when present in the response.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Adds query/tool-call reformatting and shifts preprocessing/intermediate-response handling into _real_call(). |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updates mock scoring logic for reformatted tool calls and adds new tests for reformatting + tool result inclusion. |
    # Reformat conversation history for cleaner evaluation
    eval_input["query"] = reformat_conversation_history(
        eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
    )
reformat_conversation_history() expects a list of message dicts; calling it unconditionally means string queries will always hit the exception path and emit a warning (because a logger is passed), while returning the original string anyway. Guard this call (e.g., only reformat when query is a list) to avoid noisy logs and unnecessary work for the common query: str case.
Suggested change:

    - # Reformat conversation history for cleaner evaluation
    - eval_input["query"] = reformat_conversation_history(
    -     eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
    - )
    + # Reformat conversation history for cleaner evaluation when query is
    + # provided as a list of message dicts. Leave string queries unchanged.
    + if isinstance(eval_input.get("query"), list):
    +     eval_input["query"] = reformat_conversation_history(
    +         eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
    +     )
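The effect of the guard can be checked in isolation with a small sketch. `reformat_stub` and `normalize_query` are hypothetical names introduced here; the stub only stands in for `reformat_conversation_history()` and does not reproduce its output format.

```python
from typing import Any, Dict, List


def reformat_stub(messages: List[Dict[str, Any]]) -> str:
    """Hypothetical stand-in for reformat_conversation_history()."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def normalize_query(eval_input: Dict[str, Any]) -> Dict[str, Any]:
    """Reformat only list-shaped queries; leave string queries untouched."""
    if isinstance(eval_input.get("query"), list):
        eval_input["query"] = reformat_stub(eval_input["query"])
    return eval_input


# A plain string query never reaches the formatter, so no warning path is hit.
string_case = normalize_query({"query": "What's the weather like in Paris?"})

# A list of message dicts is flattened into a single string.
list_case = normalize_query(
    {"query": [{"role": "user", "content": "What's the weather like in Paris?"}]}
)
```

Using `eval_input.get("query")` rather than indexing also keeps the guard safe when the key is absent.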
    query = [
        {"role": "system", "content": "You are a helpful weather assistant."},
        {"role": "user", "content": "What's the weather like in Paris?"},
        {"role": "assistant", "content": "Let me check that for you."},
    ]
This test uses OpenAI-style messages with content as a plain string, but reformat_conversation_history() only extracts user text from content when it is a list of {type: "text", text: ...} items; with string content it will fall back to the original input (and typically log a warning), so the test doesn't actually exercise the new reformatting path. Update the test inputs to the converter/message schema that reformat_conversation_history() supports and assert on some formatted output (e.g., presence of "User turn" / "Agent turn").
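One way the test input could be reshaped is sketched below. This assumes the supported schema is `content` as a list of `{"type": "text", "text": ...}` items for every role, including the system message, which the comment above does not confirm. `format_history_stub` is a hypothetical formatter used only so the sketch runs on its own; the real test would call the evaluator and assert on its formatted query, and the "User turn" / "Agent turn" labels are taken from the review comment rather than verified against the library.

```python
from typing import Any, Dict, List


def format_history_stub(messages: List[Dict[str, Any]]) -> str:
    """Hypothetical formatter that reads only list-of-text-items content."""
    labels = {"user": "User turn", "assistant": "Agent turn"}
    counters = {"user": 0, "assistant": 0}
    out = []
    for msg in messages:
        role = msg["role"]
        if role not in labels:
            continue  # this sketch skips system and tool messages
        texts = [item["text"] for item in msg["content"] if item.get("type") == "text"]
        counters[role] += 1
        out.append(f"{labels[role]} {counters[role]}: {' '.join(texts)}")
    return "\n".join(out)


def test_query_is_reformatted() -> None:
    query = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful weather assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": "What's the weather like in Paris?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Let me check that for you."}]},
    ]
    formatted = format_history_stub(query)
    # Assert on formatted output so the test fails if reformatting is skipped.
    assert "User turn" in formatted
    assert "Agent turn" in formatted
    assert "Paris" in formatted
```

Asserting on markers that only appear after reformatting is what distinguishes this from the original test, which would pass even when the input falls back unchanged.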
Description
Please add an informative description that covers that changes made by the pull request and link all relevant issues.
If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.