Add reformatting to Tool Call Accuracy Evaluator #46090

Open
salma-elshafey wants to merge 6 commits into main from selshafey/reformat_tool_call_acc

@salma-elshafey
Contributor

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings April 2, 2026 20:27
@salma-elshafey salma-elshafey requested a review from a team as a code owner April 2, 2026 20:27
@github-actions github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Apr 2, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the ToolCallAccuracyEvaluator to reformat conversation history and tool call inputs into a more readable, compact form before invoking the underlying prompty flow, and expands unit tests to cover response/query list scenarios and tool-result inclusion.

Changes:

  • Reformat query using reformat_conversation_history() and reformat tool_calls into a [TOOL_CALL] ... / [TOOL_RESULT] ... string via reformat_agent_response().
  • Move intermediate-response detection and message preprocessing earlier into _real_call() to ensure tool parsing operates on normalized inputs.
  • Extend unit tests to validate that tool call reformatting is applied and that tool results are included when present in the response.
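
To make the target format in the second bullet concrete, here is a minimal sketch of flattening tool calls and their results into [TOOL_CALL] / [TOOL_RESULT] lines. The helper name format_tool_calls, the dict keys (name, arguments, tool_call_id), and the exact line layout are assumptions for illustration only; the output of the PR's reformat_agent_response() may differ.

```python
import json
from typing import Any, Dict, List, Optional


def format_tool_calls(
    tool_calls: List[Dict[str, Any]],
    tool_results: Optional[Dict[str, Any]] = None,
) -> str:
    """Hypothetical helper: render tool calls (and results, if any) as text."""
    tool_results = tool_results or {}
    lines = []
    for call in tool_calls:
        # Serialize arguments deterministically so the output is stable.
        args = json.dumps(call.get("arguments", {}), sort_keys=True)
        lines.append(f"[TOOL_CALL] {call['name']}({args})")
        # Attach the matching result only when one was recorded for this call.
        call_id = call.get("tool_call_id")
        if call_id in tool_results:
            lines.append(f"[TOOL_RESULT] {tool_results[call_id]}")
    return "\n".join(lines)


# Assumed example data, not taken from the PR's tests.
calls = [{"tool_call_id": "call_1", "name": "fetch_weather", "arguments": {"location": "Paris"}}]
formatted = format_tool_calls(calls, {"call_1": "Sunny, 22C"})
print(formatted)
```

When no result is recorded for a call, the sketch emits only the [TOOL_CALL] line, which mirrors the "when present in the response" wording in the third bullet.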

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Adds query/tool-call reformatting and shifts preprocessing/intermediate-response handling into _real_call().
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updates mock scoring logic for reformatted tool calls and adds new tests for reformatting + tool result inclusion.

Comment on lines +232 to +235
# Reformat conversation history for cleaner evaluation
eval_input["query"] = reformat_conversation_history(
    eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
)

Copilot AI Apr 2, 2026

reformat_conversation_history() expects a list of message dicts; calling it unconditionally means string queries will always hit the exception path and emit a warning (because a logger is passed), while returning the original string anyway. Guard this call (e.g., only reformat when query is a list) to avoid noisy logs and unnecessary work for the common query: str case.

Suggested change
- # Reformat conversation history for cleaner evaluation
- eval_input["query"] = reformat_conversation_history(
-     eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
- )
+ # Reformat conversation history for cleaner evaluation when query is
+ # provided as a list of message dicts. Leave string queries unchanged.
+ if isinstance(eval_input.get("query"), list):
+     eval_input["query"] = reformat_conversation_history(
+         eval_input["query"], logger, include_system_messages=True, include_tool_messages=True
+     )
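
The effect of the suggested guard can be checked in isolation with a small sketch. Here fake_reformat is a stand-in that is NOT the real reformat_conversation_history(), and prepare_query is a hypothetical wrapper; both exist only to show that string queries pass through unchanged while list queries are reformatted.

```python
from typing import Any, Dict, List


def fake_reformat(messages: List[Dict[str, Any]]) -> str:
    """Stand-in for reformat_conversation_history(); NOT the real function."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def prepare_query(eval_input: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical wrapper: reformat only list-of-dict queries."""
    query = eval_input.get("query")
    if isinstance(query, list):
        eval_input["query"] = fake_reformat(query)
    # String (or missing) queries are left untouched, so the reformatter
    # is never called on them and cannot log a warning.
    return eval_input


as_string = prepare_query({"query": "What's the weather in Paris?"})
as_list = prepare_query(
    {"query": [{"role": "user", "content": "What's the weather in Paris?"}]}
)
print(as_string["query"])
print(as_list["query"])
```

Using eval_input.get("query") rather than indexing also keeps the guard safe when the key is absent, which matches the suggested change.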

Comment on lines +796 to +800
query = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "What's the weather like in Paris?"},
    {"role": "assistant", "content": "Let me check that for you."},
]

Copilot AI Apr 2, 2026

This test uses OpenAI-style messages with content as a plain string, but reformat_conversation_history() only extracts user text from content when it is a list of {type: "text", text: ...} items; with string content it will fall back to the original input (and typically log a warning), so the test doesn't actually exercise the new reformatting path. Update the test inputs to the converter/message schema that reformat_conversation_history() supports and assert on some formatted output (e.g., presence of "User turn" / "Agent turn").
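
As a sketch of the input shape this comment asks for, the query below gives the user and assistant messages list-valued content with {type: "text", text: ...} items. The helper extract_text is hypothetical and only mirrors the extraction rule described in the comment; it is not the library's implementation, and the exact schema accepted by reformat_conversation_history() should be confirmed against its source.

```python
from typing import Any, Dict, List


def extract_text(message: Dict[str, Any]) -> List[str]:
    """Hypothetical: collect text only from list-valued content items."""
    content = message.get("content")
    if not isinstance(content, list):
        # Plain-string content is skipped, per the rule in the review comment.
        return []
    return [
        item["text"]
        for item in content
        if isinstance(item, dict) and item.get("type") == "text" and "text" in item
    ]


query = [
    # System content is left as in the original test; how it is parsed
    # is not stated in the review comment.
    {"role": "system", "content": "You are a helpful weather assistant."},
    {
        "role": "user",
        "content": [{"type": "text", "text": "What's the weather like in Paris?"}],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "Let me check that for you."}],
    },
]

user_text = extract_text(query[1])
print(user_text)
```

With inputs in this shape, the test can then assert on the formatted output, for example the presence of "User turn" / "Agent turn" as the comment proposes.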

Labels

Evaluation (Issues related to the client library for Azure AI Evaluation)

3 participants