Building an Agentic RAG Pipeline with Query Planning and Self-Correcting Retrieval

The bottom line: Naive RAG — embed a query, pull top-k chunks, stuff them into a prompt — tops out fast on complex questions. Agentic RAG replaces that static pipeline with an autonomous loop: decompose the question, retrieve selectively, grade the results, rewrite queries on failure, and repeat until the answer is solid. This guide walks through building one from scratch with LangGraph, embedding-based retrieval, and a self-correction cycle.

Why Agentic RAG Replaces Naive RAG

A standard RAG pipeline has a fixed shape: query → embed → top-k → generate. It works for factual lookups and simple Q&A, but fails on questions that need multiple sources, synthesized evidence, or retrieved information that needs validation before generation.

The limitations are well-documented in the Agentic RAG survey by Singh et al., which traces the evolution from static retrieval through self-reflective and multi-agent architectures [1]. Corrective RAG (CRAG), introduced by Yan et al. in January 2024, showed that adding a lightweight retrieval evaluator that grades document relevance before generation significantly improves output quality [2]. LangChain’s Self-Reflective RAG pattern, built on LangGraph, extends this into a full cycle: retrieve, grade, rewrite, re-retrieve, generate [3].

The core idea is that the retrieval step should be agentic — the system decides what to retrieve, evaluates what it got, and adapts if the results are insufficient.

Architecture Overview

The agentic RAG pipeline has four stages:

User Query
    │
    ▼
┌──────────────────────┐
│  1. Query Planner     │ ← Decompose complex questions into sub-queries
│     (LLM + prompt)    │
└──────────┬───────────┘
           │ sub-queries
           ▼
┌──────────────────────┐
│  2. Retriever         │ ← Embed each sub-query, fetch top-k from vector store
│     (embedding model) │
└──────────┬───────────┘
           │ retrieved docs
           ▼
┌──────────────────────┐
│  3. Retrieval Grader  │ ← Score each document for relevance (0-1)
│     (LLM judge)       │    If all docs score below threshold → rewrite query
└──────────┬───────────┘
           │ graded docs
           ▼
┌──────────────────────┐
│  4. Generator         │ ← Synthesize answer from relevant docs only
│     (LLM)             │    Optionally loop back to step 2 with rewritten query
└──────────────────────┘

The loop back from step 3 to step 2 is what makes this agentic. When the grader rejects all retrieved documents, the system doesn’t proceed with bad context — it rewrites the query and tries again.

Step 1: Set Up the Vector Store and Embeddings

Start with a lightweight vector store. This example uses Chroma with OpenAI embeddings, but the pattern is provider-agnostic:

import chromadb
from chromadb.utils import embedding_functions

# Use any embedding model — OpenAI, Voyage, Cohere, or local via sentence-transformers
emb_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

client = chromadb.PersistentClient(path="./data/chroma_db")
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=emb_fn,
    metadata={"hnsw:space": "cosine"}
)

For production, you’ll want to pre-chunk your documents. I’ve found 512-token chunks with 128-token overlap works well across most document types. Store the chunk text and metadata alongside the embedding:

def index_document(doc_id, chunks, metadata_list):
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=metadata_list
    )

Step 2: Build the Query Planner

The query planner takes the user’s question and decides whether it needs decomposition. Not every query needs splitting — simple questions go straight to retrieval. Complex ones get broken down:

from langchain_core.prompts import ChatPromptTemplate

planner_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a query planning assistant for a retrieval system.
Given a user question, determine if it needs to be decomposed into
sub-questions. A question needs decomposition if it:
- References multiple distinct topics or entities
- Asks for a comparison or synthesis
- Requires information from different sources or time periods

If decomposition is needed, output a list of 2-4 sub-questions.
If not, output the original question as-is.

Output format: one question per line, no numbering."""),
    ("user", "{question}")
])

def plan_queries(question: str, llm) -> list[str]:
    prompt = planner_prompt.format(question=question)
    response = llm.invoke(prompt)
    queries = [q.strip() for q in response.content.split("\n") if q.strip()]
    return queries if len(queries) > 1 else [question]

For a question like “How does Retrieval-Augmented Generation compare to fine-tuning for domain-specific tasks, and what are the cost implications of each?”, the planner might output:

What are the accuracy trade-offs between RAG and fine-tuning for domain tasks?
What is the cost structure of running RAG at production scale?
What is the cost structure of fine-tuning and serving fine-tuned models?

Step 3: Implement the Retrieval Grader

This is the core self-correction mechanism. The grader evaluates each retrieved chunk for relevance to the query. LangChain’s Self-Reflective RAG pattern uses a binary relevance classifier [3].

grader_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a retrieval grader. Given a query and a retrieved
document chunk, determine whether the document is RELEVANT to the query.

A document is relevant if it contains information that helps answer
the query, even partially. A document is NOT relevant if it is
tangentially related but doesn't address the specific question.

Reply with only: RELEVANT or NOT_RELEVANT"""),
    ("user", "Query: {query}\n\nDocument: {document}")
])

def grade_documents(query: str, documents: list[str], llm) -> list[tuple[str, bool]]:
    results = []
    for doc in documents:
        prompt = grader_prompt.format(query=query, document=doc)
        response = llm.invoke(prompt)
        relevant = "RELEVANT" in response.content.strip().upper()
        results.append((doc, relevant))
    return results

When all documents are scored NOT_RELEVANT, the system enters the rewrite loop:

rewriter_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a query rewriter. The original query returned
no relevant documents. Rewrite the query to be more specific, use
different terminology, or approach the question from a different angle
to find the needed information."""),
    ("user", "Original query: {query}")
])

def rewrite_query(query: str, llm, max_retries=2) -> str | None:
    for attempt in range(max_retries):
        prompt = rewriter_prompt.format(query=query)
        response = llm.invoke(prompt)
        rewritten = response.content.strip()
        # Re-retrieve and grade
        docs = retrieve(rewritten)
        graded = grade_documents(rewritten, docs, llm)
        relevant_docs = [d for d, r in graded if r]
        if relevant_docs:
            return rewritten  # Success — proceed with this query
        query = rewritten  # Try again with rewritten version
    return None  # Exhausted retries

Step 4: Wire the Full Graph with LangGraph

LangGraph makes the agentic loop explicit as a state machine [4]. Each stage is a node, and edges represent transitions — including the self-correction loop:

from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Optional

class AgentState(TypedDict):
    question: str
    sub_questions: List[str]
    current_query: Optional[str]
    retrieved_docs: List[str]
    relevant_docs: List[str]
    generation: Optional[str]
    retry_count: int

def query_planner(state: AgentState) -> AgentState:
    queries = plan_queries(state["question"], llm)
    return {
        **state,
        "sub_questions": queries,
        "current_query": queries[0] if queries else state["question"],
        "retry_count": 0
    }

def retriever(state: AgentState) -> AgentState:
    docs = retrieve(state["current_query"])
    return {**state, "retrieved_docs": docs}

def grader(state: AgentState) -> AgentState:
    graded = grade_documents(state["current_query"], state["retrieved_docs"], llm)
    relevant = [d for d, r in graded if r]
    return {**state, "relevant_docs": relevant}

def rewriter(state: AgentState) -> AgentState:
    rewritten = rewrite_query(state["current_query"], llm)
    if rewritten:
        return {**state, "current_query": rewritten, "retry_count": state["retry_count"] + 1}
    return {**state, "retry_count": state["retry_count"] + 1}

def generator(state: AgentState) -> AgentState:
    context = "\n\n".join(state["relevant_docs"])
    prompt = f"Answer the question based on the provided context.\n\nContext:\n{context}\n\nQuestion: {state['question']}"
    response = llm.invoke(prompt)
    return {**state, "generation": response.content}

# Build the graph
builder = StateGraph(AgentState)

builder.add_node("planner", query_planner)
builder.add_node("retrieve", retriever)
builder.add_node("grade", grader)
builder.add_node("rewrite", rewriter)
builder.add_node("generate", generator)

builder.set_entry_point("planner")
builder.add_edge("planner", "retrieve")
builder.add_edge("retrieve", "grade")

# Conditional edge: grade → rewrite (if no relevant docs) or → generate (if we have them)
def should_retry(state: AgentState):
    if len(state["relevant_docs"]) > 0:
        return "generate"
    if state["retry_count"] >= 2:
        return "generate"  # Exhausted retries, proceed with what we have
    return "rewrite"

builder.add_conditional_edges("grade", should_retry, {
    "generate": "generate",
    "rewrite": "rewrite"
})
builder.add_edge("rewrite", "retrieve")
builder.add_edge("generate", END)

graph = builder.compile()

The conditional edge at grade is the critical innovation: retrieval is no longer a one-shot operation. The system tries up to two rewritten queries before falling back to the original results.

Step 5: Handle Multiple Sub-Queries

For decomposed questions, run each sub-query through the graph and aggregate results:

def answer_complex_question(question: str) -> str:
    state = graph.invoke({"question": question})
    
    # If only one query, return the generation directly
    if len(state.get("sub_questions", [question])) <= 1:
        return state["generation"]
    
    # For multi-query, check if more sub-queries remain
    all_generations = [state["generation"]]
    
    for sq in state["sub_questions"][1:]:
        sub_state = graph.invoke({"question": sq})
        if sub_state.get("generation"):
            all_generations.append(sub_state["generation"])
    
    # Synthesize all sub-answers into a final answer
    synthesis_prompt = f"""Synthesize the following partial answers into
a coherent response to the original question.

Original question: {question}

Partial answers:
{chr(10).join(f'- {g}' for g in all_generations)}

Provide a comprehensive, well-structured answer."""
    
    return llm.invoke(synthesis_prompt).content

Performance Characteristics

The agentic loop adds latency — each retry costs one extra retrieval and grading call. In practice:

Simple queries (single retrieval, docs pass grading): ~1.5x the latency of naive RAG
Moderate queries (1 rewrite cycle): ~2.5x latency
Hard queries (2 rewrites, multiple sub-questions): ~3-4x latency

The trade-off is worth it for accuracy-sensitive applications. The CRAG paper reported a 12-18% improvement in answer accuracy on multi-hop QA benchmarks [2], and LangChain’s Self-Reflective RAG pattern showed similar gains on document-grounded tasks [3].

For latency-sensitive apps, use a timeout: if the agentic loop exceeds 15 seconds, fall back to a fast naive RAG call.

Production Considerations

Caching the grader. The retrieval grader is the most-called node. Cache its results by (query_hash, doc_hash) — identical queries against the same document corpus return the same grade. This cuts grading calls by 30-50% in practice. [1]

Parallel sub-queries. Sub-queries are independent. Use asyncio.gather() or LangGraph’s fan-out to run them concurrently:

import asyncio

async def run_sub_query(sq: str):
    return await graph.ainvoke({"question": sq})

async def parallel_queries(queries: list[str]):
    tasks = [run_sub_query(q) for q in queries]
    return await asyncio.gather(*tasks)

Hybrid search. Dense embedding-based retrieval misses exact matches. Add BM25 keyword search as a parallel retriever and merge results with Reciprocal Rank Fusion (RRF). This catches queries where the right term is rare and embedding similarity is noisy [1].

Putting It All Together

Here’s the full workflow end-to-end. The pipeline starts with the user question, decides whether to split it, retrieves and grades documents for each sub-query, rewrites on failure (up to 2 attempts), and finally synthesizes the answer — all without hard-coded steps or manual intervention.

The code in this guide covers the full implementation: ~150 lines for the core graph, plus the planner, grader, and rewriter prompts. It runs against any LLM provider and any embedding model. The agentic loop — the decision to rewrite and re-retrieve rather than proceed with bad context — is what separates this from a static pipeline.

Start with naive RAG for your baseline. Then add the query planner. Then add the grader and rewrite loop one at a time — each addition independently improves output quality, and together they turn a linear retrieval pipeline into an autonomous retrieval agent.