Building an AI Agent Evaluation Suite with DeepEval

TL;DR: DeepEval is an open-source LLM evaluation framework that brings unit-testing patterns to AI agent assessment — with 50+ research-backed metrics, automatic trace capture, and native CI/CD integration [1]. This guide walks through building a complete agent evaluation suite: golden datasets, task completion and tool calling metrics, trajectory-level analysis, and a CI gate that blocks regressions before deploy.

Why Agent Evaluation Is Different

Traditional software testing is deterministic: you assert that add(2, 2) == 4 and move on. AI agents are non-deterministic. The same input can produce different outputs, tool calls, and reasoning paths across runs, so you can’t assert a specific output; you assert that the agent achieved the goal within acceptable quality bounds.

This shift requires a fundamentally different approach to testing. Instead of assert-equals, you measure along dimensions like [2]:

Task completion — Did the agent achieve the user’s goal?
Tool calling accuracy — Did it pick the right tool with the right arguments?
Reasoning quality — Was the chain of thought logically sound?
Step efficiency — Did it waste turns on unnecessary actions?
Faithfulness — Did the output stay grounded in the provided context?

DeepEval addresses this by providing evaluator LLMs (LLM-as-a-judge) that score agent behavior across these dimensions, wrapped in a pytest-like interface that integrates with your existing test infrastructure [1].
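To make "assert within quality bounds" concrete, here is a minimal sketch in plain Python that does not use DeepEval at all. It assumes a hypothetical `run_once` callable that runs the agent once and returns a judge score between 0 and 1; the function name `pass_rate` and the example scores are illustrative only.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], float], trials: int, min_score: float) -> float:
    """Run a non-deterministic check several times and return the share of passing runs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if run_once() >= min_score)
    return passed / trials


# Stand-in for "run the agent, then score it with a judge" (hypothetical scores).
_fake_scores = iter([0.9, 0.8, 0.4, 0.95, 0.85])


def fake_scored_run() -> float:
    return next(_fake_scores)


rate = pass_rate(fake_scored_run, trials=5, min_score=0.7)
print(f"pass rate: {rate:.2f}")
```

Instead of asserting one exact output, a test built this way asserts that the pass rate stays above a floor you choose.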

Installation and Setup

DeepEval requires Python 3.9+ and works with any LLM provider for its judge evaluation:

pip install -U "deepeval[inspect]"

The [inspect] extra adds trace visualization, which is essential for debugging agent trajectories. Set up your judge model (the LLM that scores your agent’s outputs):

export OPENAI_API_KEY="sk-..."

Or use a local model:

# Use a local Ollama model as your judge
deepeval set-local-model --model-name="ollama/qwen2.5:32b"

DeepEval’s default judge uses GPT-4o, but you can configure any model supported by the framework [1]. For production pipelines, a local judge avoids per-request API charges and keeps prompts and traces on your own infrastructure.

Step 1: Build a Golden Dataset

A golden dataset is a curated set of input prompts with expected outcomes. This is the foundation of your evaluation suite. Start small — 20 to 30 examples for your most critical agent use case [3].

Each golden example contains:

Input — The user query or task description
Expected tools — Which tools the agent should call and in what order
Expected output — The ideal final answer (if deterministic enough)
Context — Any additional context the agent needs (documents, schemas, API responses)

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import ToolCall

goldens = [
    Golden(
        input="What's the current weather in Tokyo?",
        # expected_tools takes ToolCall objects rather than plain strings
        expected_tools=[
            ToolCall(name="get_weather", input_parameters={"location": "Tokyo"}),
        ],
        expected_output="should return current temperature and conditions",
        # context is a list of strings
        context=["allowed_locations: Tokyo, Osaka, Kyoto"],
    ),
    Golden(
        input="Book a flight from SFO to JFK on June 20th for 2 people",
        expected_tools=[
            ToolCall(
                name="search_flights",
                input_parameters={"origin": "SFO", "dest": "JFK", "date": "2026-06-20"},
            ),
            # flight_id comes from the search result, so only passengers is pinned
            ToolCall(name="book_flight", input_parameters={"passengers": 2}),
        ],
        expected_output="should confirm booking with flight details",
        context=["max_passengers: 4"],
    ),
    Golden(
        input="What is the refund policy for cancelled flights?",
        expected_tools=[
            ToolCall(
                name="search_knowledge_base",
                input_parameters={"query": "refund policy cancelled flights"},
            ),
        ],
        expected_output="should explain the refund policy with conditions",
    ),
]

dataset = EvaluationDataset(goldens=goldens)

Source golden examples from real production failures — tickets that required manual escalation, user complaints, or edge cases your agent mishandled. This ensures your eval suite reflects actual failure modes, not hypothetical scenarios [3].
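One way to turn production failures into candidate golden examples is sketched below in plain Python. The ticket fields (`id`, `user_message`, `escalated`, `correct_tools`, `resolution`) and the `GoldenRecord` class are assumptions made for illustration; they are not part of DeepEval, and a real ticket export will look different.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenRecord:
    """Hypothetical intermediate record, to be reviewed before it enters the dataset."""
    input: str
    expected_tools: list = field(default_factory=list)
    expected_output: str = ""
    source_ticket: str = ""


def goldens_from_tickets(tickets: list) -> list:
    """Keep escalated tickets only and drop duplicates by normalized user message."""
    seen = set()
    records = []
    for ticket in tickets:
        if not ticket.get("escalated"):
            continue
        # Normalize case and whitespace so near-identical messages collapse.
        key = " ".join(ticket["user_message"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        records.append(
            GoldenRecord(
                input=ticket["user_message"],
                expected_tools=list(ticket.get("correct_tools", [])),
                expected_output=ticket.get("resolution", ""),
                source_ticket=ticket["id"],
            )
        )
    return records


tickets = [
    {"id": "T-1", "user_message": "Cancel my flight", "escalated": True,
     "correct_tools": ["cancel_booking"], "resolution": "should confirm cancellation"},
    {"id": "T-2", "user_message": "cancel  my FLIGHT", "escalated": True},
    {"id": "T-3", "user_message": "What's the weather?", "escalated": False},
]
records = goldens_from_tickets(tickets)
print([r.source_ticket for r in records])
```

Keeping the source ticket ID on each record makes it easy to trace a golden example back to the incident that motivated it.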

Step 2: Instrument Your Agent with Trace Capture

DeepEval automatically captures execution traces when you wrap your agent function with the @agent decorator [4]. This gives you a span tree showing every LLM call, tool invocation, and sub-agent interaction.

from deepeval.tracing import agent

@agent
def support_agent(user_input: str, context: dict = None) -> str:
    """Your existing agent logic — no modifications needed beyond the decorator."""

    # 1. Classify intent
    intent = llm_call(f"Classify intent: {user_input}")

    # 2. Route to handler
    if intent == "weather":
        location = llm_call(f"Extract location from: {user_input}")
        weather_data = get_weather(location)
        return llm_call(f"Format response: {weather_data}")

    elif intent == "booking":
        details = llm_call(f"Extract booking details from: {user_input}")
        flights = search_flights(**details)
        booking = book_flight(**flights[0])
        return llm_call(f"Confirm booking: {booking}")

    else:
        docs = search_knowledge_base(user_input)
        return llm_call(f"Answer from {docs}")

Each call inside the decorated function becomes a span in the trace:

support_agent (root span)
├── LLM: "Classify intent..." (span)
├── LLM: "Extract location..." (span)
├── Tool: get_weather(Tokyo) (span)
└── LLM: "Format response..." (span)

This trace structure is what DeepEval uses to compute trajectory-level metrics — it can see which tools were called, in what order, and how the LLM reasoned between them [4].

Step 3: Run Agent Evaluation

With your golden dataset and instrumented agent, running evaluation is a single function call:

from deepeval import evaluate
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolCallingAccuracyMetric,
    ReasoningQualityMetric,
    StepEfficiencyMetric,
)

metrics = [
    TaskCompletionMetric(),
    ToolCallingAccuracyMetric(),
    ReasoningQualityMetric(),
    StepEfficiencyMetric(max_steps=10),
]

results = evaluate(
    dataset=dataset,
    metrics=metrics,
    agent_function=support_agent,
    max_concurrent=5,  # Run 5 evals in parallel
)

for result in results:
    print(f"Input: {result.input[:60]}...")
    print(f"  Task Completion:  {result.metrics['Task Completion']:.2f}/1.0")
    print(f"  Tool Accuracy:    {result.metrics['Tool Calling Accuracy']:.2f}/1.0")
    print(f"  Reasoning:        {result.metrics['Reasoning Quality']:.2f}/1.0")
    print(f"  Step Efficiency:  {result.metrics['Step Efficiency']:.2f}/1.0")

What each metric measures

Metric	What it scores	Scale	When to use
Task Completion	Whether the agent achieved the user’s stated goal	0–1	Every evaluation — the primary success metric [5]
Tool Calling Accuracy	Whether the correct tool was called with correct args	0–1	Agents with >3 tool options; catches routing errors
Reasoning Quality	Logical coherence of the agent’s chain of thought	0–1	Complex multi-step tasks where reasoning matters
Step Efficiency	Whether the agent wasted turns on unnecessary actions	0–1	Latency-sensitive or cost-sensitive agents
Faithfulness	Whether output stays grounded in provided context	0–1	RAG agents; catches hallucination

A score below 0.7 on any metric is a red flag. Below 0.5 means the agent is fundamentally broken for that input [5].
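The two cut-offs above can be applied mechanically when reviewing results. The helper below is a plain-Python sketch; the function name `triage` and the label strings are illustrative choices, and the metric names in the example are placeholders.

```python
def triage(scores: dict, red_flag: float = 0.7, broken: float = 0.5) -> dict:
    """Label each metric score as 'ok', 'red flag' (< red_flag), or 'broken' (< broken)."""
    labels = {}
    for metric, score in scores.items():
        if score < broken:
            labels[metric] = "broken"
        elif score < red_flag:
            labels[metric] = "red flag"
        else:
            labels[metric] = "ok"
    return labels


labels = triage({"Task Completion": 0.92, "Tool Calling Accuracy": 0.64, "Step Efficiency": 0.31})
print(labels)
```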

Step 4: Trajectory-Level Evaluation

For complex multi-step agents, you need to evaluate the full execution trajectory — not just the final output. DeepEval’s trajectory evaluation compares the agent’s actual tool call sequence against the expected sequence in your golden dataset [4].

from deepeval.metrics import ToolCallMatchMetric

trajectory_metric = ToolCallMatchMetric(
    agent_function=support_agent,
    golden=goldens[1],  # The flight booking example
)

# This checks:
# 1. Did the agent call search_flights before book_flight?
# 2. Did both calls have the correct arguments?
# 3. Were there any unnecessary tool calls between them?
# 4. Did the agent handle errors (e.g., no flights found)?

score = trajectory_metric.measure()
print(f"Trajectory match: {score:.2f}/1.0")

Trajectory evaluation catches failure modes that end-to-end metrics miss:

Tool ordering bugs — Agent calls book_flight before search_flights
Redundant calls — Agent hits the knowledge base three times with the same query
Missing recovery — Agent crashes when a tool returns an error instead of retrying
Argument drift — Agent extracts a location string but passes a different one to the tool
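Two of these failure modes, tool ordering bugs and redundant calls, can be checked with a few lines of plain Python once you have the list of tool calls from a trace. This is a sketch under the assumption that each call is available as a `(name, arguments)` pair with hashable argument values; the helper names are hypothetical and this is not how DeepEval computes its metrics.

```python
from collections import Counter


def in_expected_order(actual: list, expected: list) -> bool:
    """True if `expected` appears within `actual` as an ordered subsequence."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator, so order is enforced.
    return all(name in remaining for name in expected)


def redundant_calls(calls: list) -> list:
    """Names of tools invoked more than once with identical arguments."""
    counts = Counter((name, tuple(sorted(args.items()))) for name, args in calls)
    return sorted({name for (name, _), n in counts.items() if n > 1})


calls = [
    ("search_knowledge_base", {"query": "refund policy"}),
    ("search_knowledge_base", {"query": "refund policy"}),
    ("book_flight", {"passengers": 2}),
    ("search_flights", {"origin": "SFO", "dest": "JFK"}),
]
names = [name for name, _ in calls]

ordered = in_expected_order(names, ["search_flights", "book_flight"])
repeats = redundant_calls(calls)
print(ordered, repeats)
```

In this example the agent booked before searching and queried the knowledge base twice with the same arguments, so both checks flag a problem.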

Step 5: CI/CD Integration

The most impactful place to run agent evaluations is in CI, before deployment. DeepEval exits with a non-zero code when scores fall below your thresholds [1], making it a natural CI gate.

# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -U "deepeval[inspect]"
          pip install -r requirements.txt

      - name: Run agent evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.JUDGE_API_KEY }}
        run: |
          python -m deepeval run \
            --dataset golden_dataset.py \
            --agent agent.py:support_agent \
            --metrics task_completion tool_accuracy reasoning \
            --threshold 0.75 \
            --fail-below

Together, --threshold 0.75 and --fail-below mean: if the average score for any metric drops below 0.75, the command exits with code 1 and the PR is blocked.
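The logic of a threshold gate like this is small enough to sketch in plain Python, which can be useful if you want to run the same check on exported scores. The function names and the shape of `results` (one dict of metric scores per golden example) are assumptions for illustration, not DeepEval's internal format.

```python
from statistics import mean


def metric_averages(results: list) -> dict:
    """Average each metric across all evaluated examples."""
    by_metric = {}
    for row in results:
        for metric, score in row.items():
            by_metric.setdefault(metric, []).append(score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}


def gate_exit_code(results: list, threshold: float) -> int:
    """Return 1 (fail) if any metric's average is below the threshold, else 0."""
    averages = metric_averages(results)
    if not averages:
        return 1  # no scores at all is treated as a failure
    return 1 if any(avg < threshold for avg in averages.values()) else 0


results = [
    {"task_completion": 0.90, "tool_accuracy": 0.60},
    {"task_completion": 0.80, "tool_accuracy": 0.80},
]
code = gate_exit_code(results, threshold=0.75)
print(code)
```

In a CI script you would pass the returned value to `sys.exit` so that the job status reflects the gate.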

For local development, DeepEval also integrates with pytest:

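# test_agent_eval.py
import deepeval
from deepeval.metrics import TaskCompletionMetric, ToolCallingAccuracyMetric

# Assumed project layout: agent.py defines support_agent and
# golden_dataset.py exposes the EvaluationDataset as `dataset`.
from agent import support_agent
from golden_dataset import dataset as golden_dataset

class TestAgentEvaluation:
    def test_task_completion(self):
        deepeval.assert_test(
            dataset=golden_dataset,
            agent=support_agent,
            metric=TaskCompletionMetric(threshold=0.8)
        )

    def test_tool_accuracy(self):
        deepeval.assert_test(
            dataset=golden_dataset,
            agent=support_agent,
            metric=ToolCallingAccuracyMetric(threshold=0.85)
        )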

Run it like normal tests:

python -m pytest test_agent_eval.py -v

Step 6: Production Monitoring

CI evaluation catches regressions before deploy, but agent behavior can drift in production due to model updates, API changes, or shifting user patterns. You need ongoing monitoring.

DeepEval’s platform (or your own logging pipeline with the same metrics) lets you score live traffic asynchronously, log the results, and alert on failures:

import logging
from datetime import datetime, timezone

from deepeval.metrics import TaskCompletionMetric

# Production evaluation: lightweight, async
async def log_production_eval(user_input, agent_output, trace):
    metric = TaskCompletionMetric()
    score = await metric.ameasure(
        input=user_input,
        output=agent_output,
        trace=trace,
    )

    logging.info({
        "event": "agent_eval",
        "score": score,
        "input_truncated": user_input[:200],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

    if score < 0.5:
        # Alert: agent is failing on this input.
        # alert_team is a placeholder for your own async notification hook.
        await alert_team(f"Agent failure detected (score={score:.2f})")

A production eval pipeline should:

Sample — Evaluate 10–20% of production trajectories (full eval on every call is expensive) [1]
Stratify — Oversample edge cases and underrepresented intents
Alert — Flag any trajectory scoring below 0.5 for manual review
Track — Store scores in your metrics backend (Datadog, Grafana, CloudWatch) with per-intent breakdowns
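The sampling and stratification steps can be sketched as a deterministic, per-intent sampling decision. This is a plain-Python illustration; the function name `should_evaluate`, the intent labels, and the rates are hypothetical, and hashing the trace ID is just one way to make the decision reproducible.

```python
import hashlib


def should_evaluate(trace_id: str, intent: str, rates: dict, default_rate: float = 0.1) -> bool:
    """Decide whether to evaluate a trace, using a per-intent sampling rate.

    The decision is derived from a hash of the trace ID, so the same trace
    always gets the same answer regardless of which worker asks.
    """
    rate = rates.get(intent, default_rate)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


# Oversample a rare, high-risk intent; sample common traffic lightly.
RATES = {"booking": 0.5, "weather": 0.1}

sampled = [
    trace_id
    for trace_id in (f"trace-{i}" for i in range(1000))
    if should_evaluate(trace_id, "booking", RATES)
]
print(f"evaluating {len(sampled)} of 1000 booking traces")
```

Raising the rate for underrepresented intents is what keeps rare failure modes visible in the stored scores.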

Putting It All Together

Here’s the end-to-end workflow for a production agent evaluation pipeline [2]:

┌─────────────────────────────────────────────────────────────┐
│                    Development Loop                         │
│                                                             │
│  golden_dataset.py ──► eval.py ──► CI gate ──► deploy       │
│       ▲                               │                     │
│       │                               ▼                     │
│  real failures ─────────► update golden dataset             │
│  from production                                            │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                    Production Loop                          │
│                                                             │
│  agent call ──► sample 20% ──► async eval ──► metrics DB    │
│                     │                         │             │
│                     │                         ▼             │
│                     └──► score < 0.5 ──► alert on-call      │
└─────────────────────────────────────────────────────────────┘

Key takeaways:

Start with 20–30 golden examples drawn from real production failures
Use trace-level evaluation to catch tool ordering and argument bugs
Set a minimum CI threshold of 0.75 on every metric before deploying
Monitor production trajectories continuously with stratified sampling
Feed production failures back into your golden dataset — treat evaluation as a living artifact, not a one-time setup

Agent evaluation is not a gate you install once and forget. It’s a practice — the agent equivalent of a test suite, maintained alongside the agent itself. Every time you add a new tool, change a prompt, or swap a model, run the eval suite first. If the scores drop, you catch it before users do.

Sources

[1] DeepEval — AI Agent Evaluation Quickstart

[2] DeepEval — AI Agent Evaluation Metrics Guide

[3] Maxim AI — Building a Golden Dataset for AI Evaluation

[4] DeepEval — AI Agent Evaluation

[5] Confident AI — LLM Agent Evaluation Complete Guide (June 2026)