Building an AI Agent Evaluation Suite with DeepEval — A Practical Guide
TL;DR: DeepEval is an open-source LLM evaluation framework that brings unit-testing patterns to AI agent assessment — with 50+ research-backed metrics, automatic trace capture, and native CI/CD integration [1]. This guide walks through building a complete agent evaluation suite: golden datasets, task completion and tool calling metrics, trajectory-level analysis, and a CI gate that blocks regressions before deploy.
Why Agent Evaluation Is Different
Traditional software testing is deterministic. You assert that add(2, 2) == 4 and move on. AI agents are non-deterministic: the same input can produce different outputs, tool calls, and reasoning paths across runs. You can’t assert a specific output; you need to assert that the agent achieved the goal within acceptable quality bounds.
This shift requires a fundamentally different approach to testing. Instead of assert-equals, you measure along dimensions like [2]:
- Task completion — Did the agent achieve the user’s goal?
- Tool calling accuracy — Did it pick the right tool with the right arguments?
- Reasoning quality — Was the chain of thought logically sound?
- Step efficiency — Did it waste turns on unnecessary actions?
- Faithfulness — Did the output stay grounded in the provided context?
DeepEval addresses this by providing evaluator LLMs (LLM-as-a-judge) that score agent behavior across these dimensions, wrapped in a pytest-like interface that integrates with your existing test infrastructure [1].
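To make the shift from exact-match assertions to bounded-quality assertions concrete, here is a minimal sketch in plain Python. It does not use DeepEval: `flaky_agent`, `judge_score`, and the 0.7 bound are hypothetical stand-ins for a real agent, an LLM judge, and a threshold you would choose yourself.

```python
import random
from statistics import mean


def flaky_agent(prompt: str, rng: random.Random) -> str:
    """Hypothetical agent: the phrasing varies from run to run."""
    templates = [
        "It is 18C and cloudy in Tokyo.",
        "Tokyo: cloudy, around 18C right now.",
        "Currently 18C with clouds over Tokyo.",
    ]
    return rng.choice(templates)


def judge_score(output: str) -> float:
    """Hypothetical judge: fraction of required facts present in the output."""
    required = ["18C", "Tokyo", "cloud"]
    return sum(fact in output for fact in required) / len(required)


def evaluate_runs(prompt: str, runs: int = 5, seed: int = 0) -> list[float]:
    """Run the agent several times and score every output."""
    rng = random.Random(seed)
    return [judge_score(flaky_agent(prompt, rng)) for _ in range(runs)]


scores = evaluate_runs("What's the current weather in Tokyo?")

# An exact-string assertion would be brittle here; assert a quality bound instead.
assert mean(scores) >= 0.7
```

The wording differs between runs, but every variant carries the required facts, so the bounded assertion holds where an equality check on the string would not.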
Installation and Setup
DeepEval requires Python 3.9+ and can use any supported LLM provider as its judge:
pip install -U "deepeval[inspect]"
The [inspect] extra adds trace visualization, which is essential for debugging agent trajectories. Set up your judge model (the LLM that scores your agent’s outputs):
export OPENAI_API_KEY="sk-..."
Or use a local model:
# Use a local Ollama model as your judge
deepeval set-local-model --model-name="ollama/qwen2.5:32b"
DeepEval’s default judge uses GPT-4o, but you can configure any model supported by the framework [1]. For production pipelines, a local judge removes the dependency on a hosted API and keeps evaluation data on your own infrastructure.
Step 1: Build a Golden Dataset
A golden dataset is a curated set of input prompts with expected outcomes. This is the foundation of your evaluation suite. Start small — 20 to 30 examples for your most critical agent use case [3].
Each golden example contains:
- Input — The user query or task description
- Expected tools — Which tools the agent should call and in what order
- Expected output — The ideal final answer (if deterministic enough)
- Context — Any additional context the agent needs (documents, schemas, API responses)
from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(
        input="What's the current weather in Tokyo?",
        expected_tools=["get_weather(location=Tokyo)"],
        expected_output="should return current temperature and conditions",
        context={"allowed_locations": ["Tokyo", "Osaka", "Kyoto"]}
    ),
    Golden(
        input="Book a flight from SFO to JFK on June 20th for 2 people",
        expected_tools=[
            "search_flights(origin=SFO, dest=JFK, date=2026-06-20)",
            "book_flight(flight_id=..., passengers=2)"
        ],
        expected_output="should confirm booking with flight details",
        context={"max_passengers": 4}
    ),
    Golden(
        input="What is the refund policy for cancelled flights?",
        expected_tools=["search_knowledge_base(query=refund policy cancelled flights)"],
        expected_output="should explain the refund policy with conditions",
    ),
]

dataset = EvaluationDataset(goldens=goldens)
Source golden examples from real production failures — tickets that required manual escalation, user complaints, or edge cases your agent mishandled. This ensures your eval suite reflects actual failure modes, not hypothetical scenarios [3].
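One way to turn production failures into golden candidates is sketched below with a plain dataclass rather than DeepEval's `Golden` class. The ticket fields (`user_message`, `resolution`, `tools`, `escalated`) are hypothetical; adapt them to whatever your support or logging system exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenRecord:
    """Plain-Python stand-in for a golden example (not DeepEval's Golden)."""
    input: str
    expected_output: str
    expected_tools: tuple[str, ...] = ()


def goldens_from_tickets(tickets: list[dict]) -> list[GoldenRecord]:
    """Keep escalated tickets only and drop duplicate user inputs."""
    seen: set[str] = set()
    records: list[GoldenRecord] = []
    for ticket in tickets:
        if not ticket.get("escalated"):
            continue  # the agent handled this one, so it is not a failure case
        key = ticket["user_message"].strip().lower()
        if key in seen:
            continue  # the same question was already captured
        seen.add(key)
        records.append(
            GoldenRecord(
                input=ticket["user_message"].strip(),
                expected_output=ticket["resolution"],
                expected_tools=tuple(ticket.get("tools", ())),
            )
        )
    return records


tickets = [
    {"user_message": "Cancel my booking", "resolution": "should cancel and confirm",
     "tools": ["cancel_booking"], "escalated": True},
    {"user_message": "cancel my booking ", "resolution": "duplicate", "escalated": True},
    {"user_message": "Weather in Osaka?", "resolution": "n/a", "escalated": False},
]
records = goldens_from_tickets(tickets)
```

Deduplicating on a normalized input keeps one frequently reported failure from dominating a 20 to 30 example dataset.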
Step 2: Instrument Your Agent with Trace Capture
DeepEval automatically captures execution traces when you wrap your agent function with the @agent decorator [4]. This gives you a span tree showing every LLM call, tool invocation, and sub-agent interaction.
from deepeval.tracing import agent

@agent
def support_agent(user_input: str, context: dict = None) -> str:
    """Your existing agent logic; no modifications needed beyond the decorator."""
    # 1. Classify intent
    intent = llm_call(f"Classify intent: {user_input}")

    # 2. Route to handler
    if intent == "weather":
        location = llm_call(f"Extract location from: {user_input}")
        weather_data = get_weather(location)
        return llm_call(f"Format response: {weather_data}")
    elif intent == "booking":
        details = llm_call(f"Extract booking details from: {user_input}")
        flights = search_flights(**details)
        booking = book_flight(**flights[0])
        return llm_call(f"Confirm booking: {booking}")
    else:
        docs = search_knowledge_base(user_input)
        return llm_call(f"Answer from {docs}")
Each call inside the decorated function becomes a span in the trace:
support_agent (root span)
├── LLM: "Classify intent..." (span)
├── LLM: "Extract location..." (span)
├── Tool: get_weather(Tokyo) (span)
└── LLM: "Format response..." (span)
This trace structure is what DeepEval uses to compute trajectory-level metrics — it can see which tools were called, in what order, and how the LLM reasoned between them [4].
Step 3: Run Agent Evaluation
With your golden dataset and instrumented agent, running evaluation is a single function call:
from deepeval import evaluate
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolCallingAccuracyMetric,
    ReasoningQualityMetric,
    StepEfficiencyMetric,
)

metrics = [
    TaskCompletionMetric(),
    ToolCallingAccuracyMetric(),
    ReasoningQualityMetric(),
    StepEfficiencyMetric(max_steps=10),
]

results = evaluate(
    dataset=dataset,
    metrics=metrics,
    agent_function=support_agent,
    max_concurrent=5,  # Run 5 evals in parallel
)

for result in results:
    print(f"Input: {result.input[:60]}...")
    print(f"  Task Completion: {result.metrics['Task Completion']:.2f}/1.0")
    print(f"  Tool Accuracy: {result.metrics['Tool Calling Accuracy']:.2f}/1.0")
    print(f"  Reasoning: {result.metrics['Reasoning Quality']:.2f}/1.0")
    print(f"  Step Efficiency: {result.metrics['Step Efficiency']:.2f}/1.0")
What each metric measures
| Metric | What it scores | Scale | When to use |
|---|---|---|---|
| Task Completion | Whether the agent achieved the user’s stated goal | 0–1 | Every evaluation — the primary success metric [5] |
| Tool Calling Accuracy | Whether the correct tool was called with correct args | 0–1 | Agents with >3 tool options; catches routing errors |
| Reasoning Quality | Logical coherence of the agent’s chain of thought | 0–1 | Complex multi-step tasks where reasoning matters |
| Step Efficiency | Whether the agent wasted turns on unnecessary actions | 0–1 | Latency-sensitive or cost-sensitive agents |
| Faithfulness | Whether output stays grounded in provided context | 0–1 | RAG agents; catches hallucination |
A score below 0.7 on any metric is a red flag. Below 0.5 means the agent is fundamentally broken for that input [5].
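The two cut-offs above can be applied mechanically when triaging a batch of results. The sketch below is plain Python and assumes scores arrive as a metric-name to float mapping; `triage` and the label names are hypothetical.

```python
RED_FLAG_THRESHOLD = 0.7  # below this, the result deserves a closer look
BROKEN_THRESHOLD = 0.5    # below this, treat the agent as broken for the input


def triage(scores: dict[str, float]) -> dict[str, str]:
    """Label each metric score as 'ok', 'red_flag', or 'broken'."""
    labels: dict[str, str] = {}
    for metric, score in scores.items():
        if score < BROKEN_THRESHOLD:
            labels[metric] = "broken"
        elif score < RED_FLAG_THRESHOLD:
            labels[metric] = "red_flag"
        else:
            labels[metric] = "ok"
    return labels


example = {
    "Task Completion": 0.92,
    "Tool Calling Accuracy": 0.64,
    "Step Efficiency": 0.41,
}
labels = triage(example)
```

Labeling per metric rather than per example makes it easier to see whether a failure is a routing problem, a reasoning problem, or wasted steps.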
Step 4: Trajectory-Level Evaluation
For complex multi-step agents, you need to evaluate the full execution trajectory — not just the final output. DeepEval’s trajectory evaluation compares the agent’s actual tool call sequence against the expected sequence in your golden dataset [4].
from deepeval.metrics import ToolCallMatchMetric
trajectory_metric = ToolCallMatchMetric(
    agent_function=support_agent,
    golden=goldens[1],  # The flight booking example
)
# This checks:
# 1. Did the agent call search_flights before book_flight?
# 2. Did both calls have the correct arguments?
# 3. Were there any unnecessary tool calls between them?
# 4. Did the agent handle errors (e.g., no flights found)?
score = trajectory_metric.measure()
print(f"Trajectory match: {score:.2f}/1.0")
Trajectory evaluation catches failure modes that end-to-end metrics miss:
- Tool ordering bugs: the agent calls book_flight before search_flights
- Redundant calls: the agent hits the knowledge base three times with the same query
- Missing recovery: the agent crashes when a tool returns an error instead of retrying
- Argument drift: the agent extracts a location string but passes a different one to the tool
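Three of the failure modes above (ordering, redundancy, argument drift) can be checked with a plain comparison of two call lists. The sketch below is independent of DeepEval; the `(name, arguments)` tuple shape and the `call` and `compare_trajectories` helpers are assumptions made for illustration.

```python
from collections import Counter

# A call is (tool name, sorted argument items); hypothetical shape.
Call = tuple[str, tuple]


def call(name: str, **args) -> Call:
    return (name, tuple(sorted(args.items())))


def compare_trajectories(expected: list[Call], actual: list[Call]) -> dict:
    """Report ordering, redundancy, and argument problems between two call lists."""
    expected_names = [name for name, _ in expected]
    actual_names = [name for name, _ in actual]

    # Ordering: expected tool names must appear in `actual` as a subsequence.
    remaining = iter(actual_names)
    in_order = all(name in remaining for name in expected_names)

    # Redundancy: identical calls (same name and arguments) made more than once.
    redundant = [c for c, count in Counter(actual).items() if count > 1]

    # Argument drift: the tool was called, but never with the expected arguments.
    drift = [
        name for name, args in expected
        if name in actual_names and (name, args) not in actual
    ]
    return {"in_order": in_order, "redundant": redundant, "argument_drift": drift}


expected = [
    call("search_flights", origin="SFO", dest="JFK"),
    call("book_flight", passengers=2),
]
actual = [
    call("book_flight", passengers=2),                 # booked before searching
    call("search_flights", origin="SFO", dest="LAX"),  # wrong destination
    call("search_flights", origin="SFO", dest="LAX"),  # repeated call
]
report = compare_trajectories(expected, actual)
```

Here `report["in_order"]` is `False`, the repeated search shows up under `redundant`, and `search_flights` is listed under `argument_drift`. Error recovery is not covered by this sketch, because it depends on how your tools surface failures.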
Step 5: CI/CD Integration
The most impactful place to run agent evaluations is in CI, before deployment. DeepEval exits with a non-zero code when scores fall below your thresholds [1], making it a natural CI gate.
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -U "deepeval[inspect]"
          pip install -r requirements.txt
      - name: Run agent evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.JUDGE_API_KEY }}
        run: |
          python -m deepeval run \
            --dataset golden_dataset.py \
            --agent agent.py:support_agent \
            --metrics task_completion tool_accuracy reasoning \
            --threshold 0.75 \
            --fail-below
Together, --threshold 0.75 and --fail-below mean: if the average score for any metric drops below 0.75, exit with code 1 and block the PR.
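The gating rule itself is simple enough to sketch in plain Python, which can be useful if you aggregate scores in your own script. The functions below are hypothetical and assume each result is a metric-name to score mapping with the same keys.

```python
from statistics import mean


def failing_metrics(results: list[dict[str, float]], threshold: float) -> dict[str, float]:
    """Return the metrics whose average across all examples is below the threshold."""
    metric_names = results[0].keys()
    averages = {m: mean(r[m] for r in results) for m in metric_names}
    return {m: avg for m, avg in averages.items() if avg < threshold}


def gate(results: list[dict[str, float]], threshold: float = 0.75) -> int:
    """Exit code for CI: 0 when every metric average clears the threshold, else 1."""
    return 1 if failing_metrics(results, threshold) else 0


results = [
    {"task_completion": 0.9, "tool_accuracy": 0.8},
    {"task_completion": 0.8, "tool_accuracy": 0.6},
]
exit_code = gate(results)  # a CI entry point would pass this to sys.exit()
```

In this example `tool_accuracy` averages about 0.7, so the gate returns 1 even though `task_completion` passes. Averaging per metric instead of across all metrics keeps one strong dimension from hiding a weak one.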
For local development, DeepEval also integrates with pytest:
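# test_agent_eval.py
import deepeval
from deepeval.metrics import TaskCompletionMetric, ToolCallingAccuracyMetric

# Project modules: the same files the CI step points at
from agent import support_agent
from golden_dataset import dataset as golden_dataset


class TestAgentEvaluation:
    def test_task_completion(self):
        deepeval.assert_test(
            dataset=golden_dataset,
            agent=support_agent,
            metric=TaskCompletionMetric(threshold=0.8)
        )

    def test_tool_accuracy(self):
        deepeval.assert_test(
            dataset=golden_dataset,
            agent=support_agent,
            metric=ToolCallingAccuracyMetric(threshold=0.85)
        )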
Run it like normal tests:
python -m pytest test_agent_eval.py -v
Step 6: Production Monitoring
CI evaluation catches regressions before deploy, but agent behavior can drift in production due to model updates, API changes, or shifting user patterns. You need ongoing monitoring.
DeepEval’s platform (or your own logging pipeline with the same metrics) lets you score live traffic asynchronously, log the results, and alert on failures:
import logging
from datetime import datetime, timezone

from deepeval.metrics import TaskCompletionMetric

# Production evaluation: lightweight, async
async def log_production_eval(user_input, agent_output, trace):
    metric = TaskCompletionMetric()
    score = await metric.ameasure(
        input=user_input,
        output=agent_output,
        trace=trace,
    )
    logging.info({
        "event": "agent_eval",
        "score": score,
        "input_truncated": user_input[:200],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    if score < 0.5:
        # Alert: agent is failing on this input.
        # alert_team is your own async notification helper, not part of DeepEval.
        await alert_team(f"Agent failure detected (score={score:.2f})")
A production eval pipeline should:
- Sample — Evaluate 10–20% of production trajectories (full eval on every call is expensive)
- Stratify — Oversample edge cases and underrepresented intents
- Alert — Flag any trajectory scoring below 0.5 for manual review
- Track — Store scores in your metrics backend (Datadog, Grafana, CloudWatch) with per-intent breakdowns
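The sampling and stratification steps can be sketched as a deterministic, hash-based decision. The code below uses only the standard library; the rates, the `RARE_INTENTS` labels, and `should_evaluate` are hypothetical and would be tuned to your own traffic.

```python
import hashlib

BASE_RATE = 0.2         # evaluate 20% of ordinary traffic
RARE_INTENT_RATE = 1.0  # hypothetical: always evaluate underrepresented intents
RARE_INTENTS = {"refund", "cancellation"}  # hypothetical intent labels


def should_evaluate(trace_id: str, intent: str) -> bool:
    """Deterministically decide whether to evaluate this trajectory."""
    rate = RARE_INTENT_RATE if intent in RARE_INTENTS else BASE_RATE
    # Hash the trace id into [0, 1) so the decision is stable across retries.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Roughly 20% of ordinary traces are selected.
sampled = sum(should_evaluate(f"trace-{i}", "weather") for i in range(10_000))
```

Hashing the trace ID instead of drawing a random number means a retried or replayed request gets the same sampling decision, which keeps dashboards stable.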
Putting It All Together
Here’s the end-to-end workflow for a production agent evaluation pipeline:
┌─────────────────────────────────────────────────────────────┐
│                      Development Loop                       │
│                                                             │
│  golden_dataset.py ──► eval.py ──► CI gate ──► deploy       │
│         ▲                             │                     │
│         │                             ▼                     │
│  real failures ─────────► update golden dataset             │
│  from production                                            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Production Loop                       │
│                                                             │
│  agent call ──► sample 20% ──► async eval ──► metrics DB    │
│      │                             │                        │
│      │                             ▼                        │
│      └──► score < 0.5 ──► alert on-call                     │
└─────────────────────────────────────────────────────────────┘
Key takeaways:
- Start with 20–30 golden examples drawn from real production failures
- Use trace-level evaluation to catch tool ordering and argument bugs
- Set a CI threshold of 0.75 minimum for all metrics before deploy
- Monitor production trajectories continuously with stratified sampling
- Feed production failures back into your golden dataset — treat evaluation as a living artifact, not a one-time setup
Agent evaluation is not a gate you install once and forget. It’s a practice — the agent equivalent of a test suite, maintained alongside the agent itself. Every time you add a new tool, change a prompt, or swap a model, run the eval suite first. If the scores drop, you catch it before users do.
Sources
[1] DeepEval — AI Agent Evaluation Quickstart, https://deepeval.com/docs/getting-started-agents
[2] DeepEval — AI Agent Evaluation Metrics Guide, https://deepeval.com/guides/guides-ai-agent-evaluation-metrics
[3] Maxim AI — Building a Golden Dataset for AI Evaluation, https://www.getmaxim.ai/articles/building-a-golden-dataset-for-ai-evaluation-a-step-by-step-guide/
[4] DeepEval — AI Agent Evaluation, https://deepeval.com/guides/guides-ai-agent-evaluation
[5] Confident AI — LLM Agent Evaluation Complete Guide (June 2026), https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide