
Best practices for creating dataset from traces for agent evaluation.

Leon Kannanovich 0 Reputation points
2026-03-13T17:14:49.5433333+00:00

I'm trying to convert my agent traces into a dataset that I can later use for agent evaluation. For now, I'm mainly interested in capturing tool calls from the traces.

I already have tracing set up and can view the traces in the Traces tab in Azure AI Foundry. My goal is to extract these traces and build a dataset from them.

However, I couldn’t find documentation describing the recommended way to turn traces into a dataset.

A few questions:

What is the recommended way to export agent traces for dataset creation?

Should I query them from Application Insights using KQL and export the results?

Is there an SDK or API that allows programmatic access to the traces?

When I navigate from the Traces page in Azure AI Foundry to the underlying query in "Application Insights", the query returns what looks like trace metadata rather than the full trace payload.

The same happens when exporting the query results.

Is there a way to retrieve the full trace data (including tool call details) through KQL or another API?

Any guidance or examples on building evaluation datasets from agent traces would be very helpful.

Foundry Tools

Formerly known as Azure AI Services and Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


2 answers

  1. Karnam Venkata Rajeswari 565 Reputation points Microsoft External Staff Moderator
    2026-03-25T08:15:09.7033333+00:00

    Hello Leon Kannanovich,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    Agent tracing in Azure AI Foundry is designed to help observe, understand and troubleshoot agent behaviour. It provides helpful visibility into how an agent run executes, such as the sequence of steps, latency and high‑level tool usage. However, traces are not intended to function as structured datasets for evaluation or long‑term data analysis.

    Trace data is not intended to serve as an evaluation dataset: the Traces view in Azure AI Foundry focuses on debugging agent runs, monitoring performance and reliability, and inspecting execution flow at a session level.

    As a result, traces should not be treated as a canonical data export layer or as ground-truth datasets for evaluation: they are not designed to support full replay of agent executions, so they cannot reliably capture complete interaction flows.

    When trace data flows into Azure Application Insights, it is distributed across multiple tables (such as traces, customEvents, and dependencies), so any single query shows only a partial view of the trace rather than a complete, unified record.

    The best-practice architecture is to use a dual logging approach, where observability logs are clearly separated from evaluation data. In this setup, agent execution sends telemetry and debugging information to Azure Application Insights for monitoring purposes, while simultaneously writing structured dataset logs to a storage system for use in evaluation workflows.

    Please ensure that structured dataset logging captures:

    • User input
    • Model output
    • Tool names
    • Tool arguments
    • Tool responses
    • Timestamps and correlation identifiers

    Recommended storage options include:

    • Azure Blob Storage
    • Azure Data Lake
    • Cosmos DB
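As a minimal sketch of the structured-logging half of this dual approach (the file path, record schema, and helper name below are illustrative assumptions, not a Foundry API; in production the file would live in Blob Storage, Data Lake, or Cosmos DB), each agent turn could append one JSON line per interaction:

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Assumption: a local JSONL file standing in for Blob Storage / Data Lake.
DATASET_PATH = Path("eval_dataset.jsonl")

def log_eval_record(user_input, model_output, tool_calls):
    """Append one structured evaluation record per agent turn (hypothetical helper)."""
    record = {
        "correlation_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_input": user_input,
        "model_output": model_output,
        # Each tool call keeps name, arguments, and response together,
        # covering the fields listed above.
        "tool_calls": tool_calls,
    }
    with DATASET_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_eval_record(
    "What's the weather in Oslo?",
    "It is 5°C and cloudy in Oslo.",
    [{"name": "weather_lookup",
      "arguments": {"city": "Oslo"},
      "response": {"temp_c": 5, "condition": "cloudy"}}],
)
print(rec["tool_calls"][0]["name"])  # weather_lookup
```

Writing these records at execution time, rather than reconstructing them from telemetry, keeps the evaluation dataset complete even when telemetry is sampled or truncated.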

    Please check if the following troubleshooting steps help:

    1. Inspect raw telemetry
         traces
         | take 10
         | project customDimensions
      
    2. Check additional tables - customEvents and dependencies
    3. Review sampling behavior
         union traces, customEvents
         | summarize count() by itemType
      
    4. Confirm that OpenTelemetry instrumentation is enabled and that verbose logging is configured where applicable
    5. Review payload size: large tool inputs or outputs may be truncated due to telemetry limits
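If you do export KQL results, any tool-call details that were instrumented typically sit inside the customDimensions column as a JSON string. The following sketch flattens exported rows into tool-call records (the gen_ai.* attribute names are assumptions that depend on your instrumentation, and the sample rows are fabricated for illustration):

```python
import json

# Assumption: rows as exported from an Application Insights query,
# with customDimensions serialized as a JSON string per row.
exported_rows = [
    {"timestamp": "2026-03-13T17:00:00Z",
     "customDimensions": json.dumps({
         "gen_ai.tool.name": "weather_lookup",
         "gen_ai.tool.call.arguments": '{"city": "Oslo"}',
     })},
    {"timestamp": "2026-03-13T17:00:02Z",
     "customDimensions": json.dumps({"event": "unrelated"})},
]

def extract_tool_calls(rows):
    """Keep only rows whose customDimensions carry tool-call attributes."""
    records = []
    for row in rows:
        dims = json.loads(row.get("customDimensions") or "{}")
        if "gen_ai.tool.name" in dims:
            records.append({
                "timestamp": row["timestamp"],
                "tool_name": dims["gen_ai.tool.name"],
                "arguments": json.loads(dims.get("gen_ai.tool.call.arguments", "{}")),
            })
    return records

calls = extract_tool_calls(exported_rows)
print(calls[0]["tool_name"])  # weather_lookup
```

Note that this only recovers what the instrumentation attached to telemetry in the first place, which is why the dual-logging approach above is preferred for building evaluation datasets.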

    References:

    Set Up Tracing for AI Agents in Microsoft Foundry - Microsoft Foundry | Microsoft Learn

    Agent tracing in Microsoft Foundry (preview) - Microsoft Foundry | Microsoft Learn

    Application Insights telemetry data model - Azure Monitor | Microsoft Learn

    Thank you!

    Please 'Upvote' (thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.


  2. Q&A Assist
    2026-03-13T17:15:12.9666667+00:00
    AI generated content. This answer was created with AI from Q&A Assist.

    Agent traces are best used for evaluation by turning them into a curated evaluation dataset, rather than exporting raw telemetry from Application Insights.

    Recommended approach:

    1. Use traces as the source for evaluation datasets
      Traces capture the journey of a request or workflow through the agent, including tool calls, as spans with attributes and events. These are intended to be the raw material for evaluation datasets, where selected traces become test cases for quality, regression, and comparison of agent versions.
      Evaluation datasets are then used to:
      • Improve quality using known problematic examples.
      • Prevent regressions with a “golden set”.
      • Compare prompts/models/logic on the same data.
      • Target specific features (safety, domain knowledge, edge cases).
    2. Instrument traces with rich span attributes for tool calls
      To make traces useful for dataset creation, ensure tool calls are clearly represented in spans:
      • Use spans to represent each tool call (for example, one span per tool invocation).
      • Add attributes for tool name, arguments, results, and any quality or status indicators.
      • Follow OpenTelemetry semantic conventions where possible so attributes are consistent and easy to query.
        This makes it much easier to later filter and extract the tool-call spans that should become evaluation records.
    3. Apply tracing best practices to support evaluation
      When instrumenting agents and tools:
      • Use consistent span attribute names across agents and tools so queries and downstream processing are simple.
      • Correlate evaluation run IDs in attributes so quality/performance analysis can be tied back to specific evaluation runs.
      • Redact or avoid storing sensitive content (secrets, credentials, personal data) in prompts, tool arguments, or span attributes before they reach telemetry.
    4. Exporting traces: concepts and limitations
      In Azure AI Foundry, traces are exported via OpenTelemetry trace exporters to backends such as Azure Monitor / Application Insights or other OTLP-compatible platforms. These exporters send trace data (traces, spans, attributes) for storage and analysis.
      The Application Insights view and its KQL queries typically surface trace and span metadata and attributes. If only metadata is visible, ensure that:
      • The agent and tools are instrumented to attach the needed attributes (for example, tool call details) to spans.
      • Sensitive data redaction policies are not stripping the fields required for evaluation.
    5. Programmatic access and dataset building pattern
      While the context does not describe a dedicated “export traces to dataset” API, the recommended pattern is:
      • Use OpenTelemetry-compatible exporters or SDKs to send traces (including tool-call spans) to a backend.
      • From that backend, query spans with the attributes that identify tool calls and important cases (for example, by tool name, error flag, or quality score attribute).
      • Transform those query results into an evaluation dataset schema used by the evaluation framework (for example, a table or structured dataset with input, expected output, and metadata).
      • Store the resulting evaluation dataset in a governed store (such as a cataloged table) and keep it versioned for regression testing.
    6. Security and governance when exporting
      Treat trace data as production telemetry:
      • Do not export secrets or credentials when building datasets.
      • Minimize or redact personal data before it appears in telemetry or in the derived evaluation dataset.
      • Apply the same access controls and retention policies to trace-derived datasets that are used for logs and metrics.

    This pattern—rich, consistent tracing with OpenTelemetry, exporting via trace exporters, querying spans with tool-call attributes, and transforming them into a structured evaluation dataset—is the recommended way to leverage agent traces for evaluation.
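The transformation described in step 5 could be sketched as follows. The span shape and the input/expected-output/metadata schema are illustrative assumptions, not a fixed evaluation-framework format:

```python
import json

# Assumption: span records already filtered down to tool calls by a backend query.
tool_call_spans = [
    {"trace_id": "abc123",
     "tool_name": "weather_lookup",
     "arguments": {"city": "Oslo"},
     "result": {"temp_c": 5},
     "status": "OK"},
]

def to_eval_rows(spans, dataset_version="v1"):
    """Map tool-call spans onto an input/expected-output evaluation schema."""
    rows = []
    for s in spans:
        rows.append({
            "input": {"tool": s["tool_name"], "arguments": s["arguments"]},
            # The observed output serves as the initial baseline; curate it
            # into a true expected output before treating it as a golden set.
            "expected_output": s["result"],
            "metadata": {
                "trace_id": s["trace_id"],
                "status": s["status"],
                "dataset_version": dataset_version,
            },
        })
    return rows

rows = to_eval_rows(tool_call_spans)
print(json.dumps(rows[0]["metadata"]["dataset_version"]))  # "v1"
```

Keeping the trace_id in the metadata lets you trace any regression found during evaluation back to the original agent run, and the explicit dataset_version supports the versioned golden-set workflow described above.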



