Building a RAG Evaluation Pipeline: From Metrics to CI/CD Quality Gates

The bottom line: RAG systems power an estimated 60% of production AI applications in 2026 [1]. Yet most teams evaluate them ad hoc: cherry-picking examples, eyeballing outputs, and shipping with no regression detection. A 2024 industry survey found that 50% of organizations rank evaluation as their second-greatest challenge in AI deployment [2]. This guide walks through building a real RAG evaluation pipeline: which metrics you need, how to measure them, and how to gate deploys on quality thresholds.


What You’re Building

An automated evaluation pipeline that runs on every PR and deploy cycle. It measures retrieval quality and generation accuracy, catches regressions before they ship, and accumulates test cases from production failures.

The four layers:

  • Retrieval evaluation: Does the retriever return the right documents?
  • Generation evaluation: Is the LLM’s answer faithful to the retrieved context?
  • Test suite: Are those checks encoded as repeatable, pytest-style test cases?
  • CI/CD quality gates: Do metrics stay above defined thresholds?

Layer 1: Retrieval Metrics

Retrieval quality determines everything downstream. If the retriever returns irrelevant documents, no generator can produce a good answer.

Core Retrieval Metrics

Metric | What it measures | How to compute
Precision@k | Of the top-k documents, how many are relevant | relevant_in_top_k / k
Recall@k | Of all relevant documents, how many appear in top-k | relevant_in_top_k / total_relevant
Mean Reciprocal Rank (MRR) | How early the first relevant document appears | Average of 1 / rank_of_first_relevant
Context Relevance | Overall usefulness of the retrieved context | LLM-as-judge scoring

For a production RAG pipeline, track Context Precision (are the most relevant docs ranked highest?) and Context Recall (did we find everything needed?) as primary signals [3].
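
The three classical metrics in the table above need no LLM at all. As a minimal sketch, assuming each query has a ranked list of retrieved document IDs and a hand-labeled set of relevant IDs (the function names and the sample IDs here are illustrative, not part of any library):

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    ranks = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(ranks) / len(ranks) if ranks else 0.0


# Hypothetical query: two relevant docs, one of them ranked second.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=3))  # 1 relevant doc in the top 3
print(recall_at_k(retrieved, relevant, k=3))     # 1 of 2 relevant docs found
print(reciprocal_rank(retrieved, relevant))      # first hit at rank 2
```

These need labeled relevance judgments per query, which is why the LLM-judged context metrics are often easier to bootstrap.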

Implementation with RAGAS

RAGAS is the lowest-barrier entry point — install and import in under 10 lines [4]:

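from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Sample data: one row per query. Column names follow the classic RAGAS
# schema (question / answer / contexts / ground_truth); newer releases may
# rename them, so check the docs for your installed version.
data = {
    "question": [
        "What is the return policy for electronics?",
        "How do I reset my admin password?",
    ],
    "answer": [
        "Electronics can be returned within 30 days with receipt.",
        "Go to Settings > Admin > Reset Password.",
    ],
    "contexts": [
        ["All electronics returns must be within 30 days of purchase...",
         "Returns require original packaging and receipt..."],
        ["Admin password reset is available at Settings > Admin...",
         "Password resets send a confirmation email to the admin address..."],
    ],
    # Reference answers: the context metrics are scored against these
    "ground_truth": [
        "Electronics can be returned within 30 days of purchase with a receipt.",
        "Reset the admin password under Settings > Admin > Reset Password.",
    ],
}

dataset = Dataset.from_dict(data)
# evaluate() calls a judge LLM, so an API key (e.g. OPENAI_API_KEY) must be set
scores = evaluate(dataset, metrics=[
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
])

print(scores)
# Example output shape (values are illustrative, not measured):
# {'context_precision': 0.92, 'context_recall': 0.87,
#  'faithfulness': 0.95, 'answer_relevancy': 0.89}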

RAGAS decomposes answers into individual claims via LLM-as-judge, then verifies each claim against the retrieved context [4]. This gives you both a composite score and per-claim breakdown.


Layer 2: Generation Metrics

Generation evaluation answers one question: is the LLM’s output faithful to the context?

Faithfulness (Groundedness)

This is the single most important metric for RAG [3]. Every statement in the generated answer must be traceable to the retrieved documents.

How LLM-as-judge faithfulness works:

  1. Decompose the answer into atomic claims
  2. For each claim: does it appear in or can it be inferred from the context?
  3. Score = fraction of supported claims

from ragas.metrics import faithfulness

# Faithfulness returns per-sample scores
result = evaluate(dataset, metrics=[faithfulness])
# Low faithfulness (< 0.7) means the LLM is hallucinating
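
The three-step scoring described above can be sketched in a few lines. This is not how RAGAS implements it internally; it is a simplified illustration that assumes the claims are already extracted and that the judge is any callable returning True or False. The `naive_judge` below is a hypothetical stand-in (a substring match) so the example runs without an LLM:

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims that the judge marks as supported by the context."""
    if not claims:
        # Nothing to verify; scored as 0.0 here by convention (an assumption).
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    """Hypothetical stand-in for an LLM judge: case-insensitive substring match."""
    return claim.lower() in context.lower()


context = "Returns are accepted within 30 days. A receipt is required."
claims = [
    "returns are accepted within 30 days",  # supported
    "a receipt is required",                # supported
    "shipping is free",                     # not in the context
]
score = faithfulness_score(claims, context, naive_judge)
print(f"{score:.2f}")  # 2 of 3 claims supported
```

In a real pipeline, both the claim extraction and `is_supported` would be LLM calls, which is where judge-model quality starts to matter.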

Answer Correctness

For queries with ground-truth answers, compare the generated answer against the expected one. RAGAS’s answer_correctness blends factual overlap with semantic similarity:

from ragas.metrics import answer_correctness

# Requires ground_truth column in your dataset
dataset_with_gt = Dataset.from_dict({
    "question": [...],
    "answer": [...],
    "contexts": [[...], ...],
    "ground_truth": ["Expected answer text", ...],
})

result = evaluate(dataset_with_gt, metrics=[answer_correctness])

What Thresholds to Target

Based on production deployments documented across RAG systems [1][3]:

Metric | Good | Warning | Critical
Faithfulness | > 0.90 | 0.70–0.90 | < 0.70
Context Precision | > 0.85 | 0.65–0.85 | < 0.65
Context Recall | > 0.80 | 0.60–0.80 | < 0.60
Answer Relevancy | > 0.85 | 0.70–0.85 | < 0.70

These are starting points. Your domain, chunking strategy, and use case will shift them.
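
To make the bands in the table above usable in dashboards or alerts, they can be encoded as data. A minimal sketch, where the `THRESHOLDS` mapping and the `classify` function are illustrative names and the boundary values are copied from the table:

```python
from typing import Dict, Tuple

# metric -> (good if strictly above, critical if strictly below)
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "faithfulness": (0.90, 0.70),
    "context_precision": (0.85, 0.65),
    "context_recall": (0.80, 0.60),
    "answer_relevancy": (0.85, 0.70),
}


def classify(metric: str, score: float) -> str:
    """Map a score to 'good', 'warning', or 'critical' for the given metric."""
    good_above, critical_below = THRESHOLDS[metric]
    if score > good_above:
        return "good"
    if score < critical_below:
        return "critical"
    # Scores on either boundary fall into the warning band.
    return "warning"


print(classify("faithfulness", 0.95))      # good
print(classify("context_recall", 0.60))    # warning (exactly on the boundary)
print(classify("answer_relevancy", 0.50))  # critical
```

Keeping the numbers in one mapping makes it easy to adjust them per domain without touching the classification logic.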


Layer 3: Building a Test Suite with DeepEval

DeepEval uses a pytest-style API that makes evaluation look like unit tests [5]:

pip install deepeval

# test_rag_pipeline.py
from deepeval import assert_test
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase

def test_refund_policy_accuracy():
    test_case = LLMTestCase(
        input="What documentation do I need for a refund?",
        actual_output="You need the original receipt and packaging within 30 days.",
        # ContextualPrecisionMetric ranks retrieved chunks against a reference answer
        expected_output="The original receipt and packaging, within 30 days of purchase.",
        retrieval_context=[
            "Refunds require original receipt and packaging...",
            "Returns must be initiated within 30 days of purchase...",
        ]
    )
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.85),
        ContextualPrecisionMetric(threshold=0.8),
    ])

def test_returns_policy_hallucination():
    """Verify the model doesn't invent return policies."""
    source_docs = [
        "Returns without receipt are evaluated on a case-by-case basis...",
        "Manager approval required for no-receipt returns...",
    ]
    test_case = LLMTestCase(
        input="Can I return items without a receipt?",
        actual_output="Store credit may be issued at manager discretion without a receipt.",
        retrieval_context=source_docs,
        # HallucinationMetric reads `context`, not `retrieval_context`
        context=source_docs,
    )
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.85),
        # Lower is better for this metric: threshold is the maximum allowed score
        HallucinationMetric(threshold=0.1),
    ])

Run it like any test suite:

deepeval test run test_rag_pipeline.py

On failure, DeepEval exits non-zero with detailed per-claim breakdowns showing which statements were unsupported.


Layer 4: CI/CD Quality Gates

Connect evaluation to your deploy pipeline. The goal: block any PR that regresses on core metrics.

GitHub Actions Example

# .github/workflows/rag-eval.yml
name: RAG Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install deepeval ragas datasets
          pip install -r requirements.txt

      - name: Evaluate RAG pipeline
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          deepeval test run test_rag_pipeline.py

deepeval test run exits with a non-zero status when any assert_test fails, which fails the job. Mark this check as required in your branch protection rules so that a failing run blocks the PR from merging.
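
The same gating idea applies to scores computed outside DeepEval, such as the RAGAS scores from Layer 1. A minimal sketch, assuming the scores and the per-metric minimums are plain dicts; the names `find_breaches` and `gate` and the sample values are hypothetical, and in practice the minimums would be loaded from a config file:

```python
from typing import Dict, List


def find_breaches(scores: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    breaches = []
    for metric, minimum in sorted(minimums.items()):
        score = scores.get(metric)
        if score is None:
            breaches.append(f"{metric}: missing from results")
        elif score < minimum:
            breaches.append(f"{metric}: {score:.2f} < {minimum:.2f}")
    return breaches


def gate(scores: Dict[str, float], minimums: Dict[str, float]) -> int:
    """Print every breach and return a process exit code (0 = pass, 1 = fail)."""
    breaches = find_breaches(scores, minimums)
    for message in breaches:
        print(f"FAIL {message}")
    return 1 if breaches else 0


# Hypothetical values for illustration only.
minimums = {"faithfulness": 0.85, "context_precision": 0.80}
scores = {"faithfulness": 0.91, "context_precision": 0.74}
exit_code = gate(scores, minimums)
# In a CI script, finish with: sys.exit(exit_code)
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when evaluation did not run is worse than no gate.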

Production-to-Evaluation Feedback Loop

The most important pattern: convert production failures into test cases automatically [1]. When your monitoring detects a bad answer in production:

  1. Capture the full trace (query, retrieved chunks, generated response)
  2. Create a new test case with the correct expected answer
  3. Add it to the evaluation suite

Tools like Braintrust provide one-click trace-to-test-case conversion [1]. But you can build a simpler version:

# capture_failure.py: run from your monitoring webhook
import json
from datetime import datetime, timezone

def capture_rag_failure(query, retrieved_chunks, bad_answer, corrected_answer):
    """Log a production failure as a new test case."""
    test_case = {
        "question": query,
        "contexts": retrieved_chunks,
        "answer": bad_answer,
        "ground_truth": corrected_answer,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "needs_review": True,
    }
    # Append one JSON object per line (JSONL)
    with open("production_failures.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(test_case) + "\n")

Run a weekly job that reviews these, validates the ground truth, and adds them to the eval dataset.
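
One way that weekly job could look is sketched below. It assumes a reviewer marks a record as validated by setting needs_review to false in the JSONL file; that convention, and the names `load_jsonl` and `promote_reviewed`, are assumptions for illustration. Only the question and the validated ground truth are kept, since answers and contexts are regenerated by the pipeline under test at evaluation time:

```python
import json
from typing import Dict, List


def load_jsonl(path: str) -> List[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def promote_reviewed(records: List[dict]) -> Dict[str, List[str]]:
    """Keep reviewed records and reshape them into dataset-style columns."""
    columns: Dict[str, List[str]] = {"question": [], "ground_truth": []}
    seen = set()
    for record in records:
        if record.get("needs_review", True):
            continue  # not yet validated by a human
        if record["question"] in seen:
            continue  # skip repeated reports of the same query
        seen.add(record["question"])
        columns["question"].append(record["question"])
        columns["ground_truth"].append(record["ground_truth"])
    return columns


# In-memory records standing in for load_jsonl("production_failures.jsonl")
records = [
    {"question": "Q1", "ground_truth": "A1", "needs_review": False},
    {"question": "Q2", "ground_truth": "A2", "needs_review": True},
    {"question": "Q1", "ground_truth": "A1", "needs_review": False},
]
new_cases = promote_reviewed(records)
print(new_cases)
```

The resulting columns can be appended to the golden dataset and committed, so the change shows up in code review like any other.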


Common Pitfalls (and How to Avoid Them)

Pitfall | Why it hurts | Fix
Aggregate-score blindness | Average faithfulness of 0.90 hides the 10% of queries where it’s 0.30 | Track the distribution (P50 plus low-end percentiles such as P10 and P5), not just the mean [3]
One-shot eval datasets | Test data from launch doesn’t cover new user behaviors | Add production failures monthly; rotate stale cases
Wrong judge model | A 7B judge evaluating GPT-4 output misses subtle hallucinations | Use a judge in the same tier as your generator, or a stronger one [3]
No chunk-level evaluation | Retriever returns 5 chunks but generator uses 1, wasting cost and latency | Measure chunk attribution alongside faithfulness
Skipping adversarial queries | Real users ask things outside your knowledge base | Add out-of-domain and ambiguous query test cases
No versioning of eval datasets | You can’t tell if a score change is a pipeline improvement or dataset shift | Store eval datasets in git or a versioned blob store
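
The first pitfall is easy to demonstrate with numbers. The sketch below uses a nearest-rank percentile written out by hand (one of several percentile definitions) and a made-up list of ten per-query faithfulness scores in which a single bad answer hides behind a healthy mean:

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the
    data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-query faithfulness scores: nine good answers, one bad one.
scores = [0.95, 0.97, 0.30, 0.96, 0.98, 0.94, 0.99, 0.97, 0.96, 0.98]

mean = sum(scores) / len(scores)
p50 = percentile(scores, 50)
p10 = percentile(scores, 10)
print(f"mean={mean:.2f} p50={p50:.2f} p10={p10:.2f}")
```

The mean and median both look fine; only the low-end percentile exposes the failing query, which is why gating on the lower tail catches problems that an average hides.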

Putting It All Together

A minimal but production-ready setup:

rag-evaluation/
├── test_suite/
│   ├── test_retrieval.py      # DeepEval test cases
│   ├── test_faithfulness.py   # Per-domain test cases
│   └── test_edge_cases.py     # Adversarial + out-of-domain
├── data/
│   ├── golden_dataset.json    # Curated eval set (100-500 queries)
│   └── production_failures.jsonl  # Auto-captured from monitoring
├── config/
│   └── thresholds.yaml        # Per-metric pass/fail thresholds
├── run_eval.py                # Orchestrator script
└── .github/workflows/
    └── rag-eval.yml           # CI pipeline

Key takeaways:

  • Start with faithfulness and context precision: they catch the most common failure modes
  • Automate on every PR: manual eval is better than nothing, but only CI gating catches regressions before every deploy
  • Feed production failures back: your eval dataset should grow with your system
  • Track distributions, not averages: a 0.85 faithfulness average with a P10 of 0.20 means 10% of answers score 0.20 or lower, i.e. they are mostly hallucinated

References

[1] Braintrust, “Best RAG Evaluation Tools in 2026, Compared” — https://www.braintrust.dev/articles/best-rag-evaluation-tools

[2] Maxim AI, “Best Practices in RAG Evaluation: A Comprehensive Guide” — https://www.getmaxim.ai/articles/best-practices-in-rag-evaluation-a-comprehensive-guide/

[3] Future AGI, “Top 5 Tools to Evaluate RAG Performance in 2026” — https://futureagi.substack.com/p/top-5-tools-to-evaluate-rag-performance

[4] RAGAS Documentation — https://docs.ragas.io/

[5] DeepEval Documentation — https://docs.confident-ai.com/
