Add reformatting to Tool Call Accuracy Evaluator by salma-elshafey · Pull Request #46090 · Azure/azure-sdk-for-python

salma-elshafey · 2026-04-02T20:27:21Z

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

Copilot

Pull request overview

This PR updates the ToolCallAccuracyEvaluator to reformat conversation history and tool call inputs into a more readable, compact form before invoking the underlying prompty flow, and expands unit tests to cover response/query list scenarios and tool-result inclusion.

Changes:

Reformat query using reformat_conversation_history() and reformat tool_calls into a [TOOL_CALL] ... / [TOOL_RESULT] ... string via reformat_agent_response().
Move intermediate-response detection and message preprocessing earlier into _real_call() to ensure tool parsing operates on normalized inputs.
Extend unit tests to validate that tool call reformatting is applied and that tool results are included when present in the response.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py	Adds query/tool-call reformatting and shifts preprocessing/intermediate-response handling into `_real_call()`.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py	Updates mock scoring logic for reformatted tool calls and adds new tests for reformatting + tool result inclusion.

Copilot · 2026-04-02T20:34:42Z

...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py

+        # Reformat conversation history for cleaner evaluation
+        eval_input["query"] = reformat_conversation_history(
+            eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
+        )


reformat_conversation_history() expects a list of message dicts; calling it unconditionally means string queries will always hit the exception path and emit a warning (because a logger is passed), while returning the original string anyway. Guard this call (e.g., only reformat when query is a list) to avoid noisy logs and unnecessary work for the common query: str case.

Suggested change

-        # Reformat conversation history for cleaner evaluation
-        eval_input["query"] = reformat_conversation_history(
-            eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
-        )
+        # Reformat conversation history for cleaner evaluation when query is
+        # provided as a list of message dicts. Leave string queries unchanged.
+        if isinstance(eval_input.get("query"), list):
+            eval_input["query"] = reformat_conversation_history(
+                eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
+            )
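The effect of the suggested guard can be checked in isolation. The sketch below is only an illustration: `fake_reformat`, `prepare_query`, and `ListHandler` are hypothetical names, and `fake_reformat` is a stand-in that mimics the behaviour described in the comment above (warn and return the input unchanged when it is not a list). It is not the real `reformat_conversation_history()`.

```python
import logging


def fake_reformat(query, logger=None):
    """Hypothetical stand-in, NOT the real azure-ai-evaluation helper.

    Assumption: non-list input triggers a warning and is returned unchanged,
    as described in the review comment.
    """
    if not isinstance(query, list):
        if logger:
            logger.warning("Conversation history could not be parsed; using original input.")
        return query
    return "\n".join(f"{m['role']}: {m['content']}" for m in query)


def prepare_query(eval_input, logger):
    """Apply the guard from the suggested change: only reformat list queries."""
    if isinstance(eval_input.get("query"), list):
        eval_input["query"] = fake_reformat(eval_input["query"], logger)
    return eval_input


class ListHandler(logging.Handler):
    """Collects emitted log messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger = logging.getLogger("guard_sketch")
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
logger.propagate = False

# A string query skips the helper entirely, so nothing is logged.
string_input = prepare_query({"query": "What's the weather like in Paris?"}, logger)
# A list query still goes through the reformatting step.
list_input = prepare_query({"query": [{"role": "user", "content": "Hi"}]}, logger)
```

With the guard in place, `handler.messages` stays empty for the string query. Calling the stand-in directly with a string is what would produce the warning.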

Copilot · 2026-04-02T20:34:42Z

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py

+        query = [
+            {"role": "system", "content": "You are a helpful weather assistant."},
+            {"role": "user", "content": "What's the weather like in Paris?"},
+            {"role": "assistant", "content": "Let me check that for you."},
+        ]


This test uses OpenAI-style messages with content as a plain string, but reformat_conversation_history() only extracts user text from content when it is a list of {type: "text", text: ...} items; with string content it will fall back to the original input (and typically log a warning), so the test doesn't actually exercise the new reformatting path. Update the test inputs to the converter/message schema that reformat_conversation_history() supports and assert on some formatted output (e.g., presence of "User turn" / "Agent turn").
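To make the difference between the two message shapes concrete, here is a minimal sketch. `sketch_reformat` is a hypothetical stand-in written for this illustration, not the real `reformat_conversation_history()`. The exact output labels ("System:", "User turn N:", "Agent turn N:") are assumptions loosely based on the "User turn" / "Agent turn" wording in the comment above.

```python
def sketch_reformat(messages):
    """Hypothetical stand-in, NOT the real azure-ai-evaluation helper.

    Assumption: text is only extracted when `content` is a list of
    {"type": "text", "text": ...} items; any other shape falls back to
    returning the original input unchanged.
    """
    lines = []
    turn = 0
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            # Unsupported shape (e.g. plain string content): give up and
            # hand back the original object.
            return messages
        text = " ".join(item["text"] for item in content if item.get("type") == "text")
        if msg["role"] == "system":
            lines.append(f"System: {text}")
        elif msg["role"] == "user":
            turn += 1
            lines.append(f"User turn {turn}: {text}")
        elif msg["role"] == "assistant":
            lines.append(f"Agent turn {turn}: {text}")
    return "\n".join(lines)


# String content, as in the test under review: falls back to the original input.
string_style = [{"role": "user", "content": "What's the weather like in Paris?"}]

# Typed content items: the shape the comment says the helper supports.
typed_style = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful weather assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": "What's the weather like in Paris?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Let me check that for you."}]},
]

unchanged = sketch_reformat(string_style)
formatted = sketch_reformat(typed_style)
```

A test built on the typed shape can then assert on the formatted string (for example, that it contains "User turn"), whereas the string-content input comes back as the same list object and never reaches the formatting path.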


salma-elshafey added 4 commits April 1, 2026 16:47

add reformatting to tool call accuracy evaluator

0b4ce00

Update reformatting

2ddb8b0

remove logging

887c681

update reformatting

4e83fb8

Copilot AI review requested due to automatic review settings April 2, 2026 20:27

salma-elshafey requested a review from a team as a code owner April 2, 2026 20:27

github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Apr 2, 2026

Copilot started reviewing on behalf of salma-elshafey April 2, 2026 20:28

Copilot AI reviewed Apr 2, 2026


ashaabansoliman reviewed Apr 2, 2026

...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py (resolved conversation)

salma-elshafey and others added 2 commits April 2, 2026 23:06

run black

24adf21

Merge branch 'main' into selshafey/reformat_tool_call_acc

a297ecb

salma-elshafey wants to merge 6 commits into main from selshafey/reformat_tool_call_acc

3 participants
