Building an AI Code Review Agent: Architecture, Patterns, and Production Deployment

The bottom line: Automated PR review is one of the highest-ROI agent applications in 2026, but naive single-prompt approaches miss architectural issues, misunderstand context, and generate noise. This guide walks through a production-grade multi-agent code review architecture — with working code leveraging LangGraph, GitHub Actions, and structured outputs — that catches real issues without overwhelming developers.
Why Single-Prompt Code Review Fails
The first version of every AI code review tool uses a single prompt: feed the diff to an LLM and ask for issues. This fails in three predictable ways:
Lost context. The diff alone doesn’t show the surrounding codebase: imports, shared utilities, data models, and architectural patterns. Augment Code reports that effective code reviews need context beyond the PR diff to catch cross-file issues [1]. Without it, the reviewer flags “undefined variable” on new imports or misses that a function signature change breaks a caller three files away.
Single-lens blindness. One LLM call operates with one perspective. It might catch style issues but miss security vulnerabilities, or find performance problems but ignore test coverage gaps. Multi-agent architectures solve this by assigning specialized agents — a security agent, an architecture agent, a test agent — that each analyze from their own lens and then synthesize findings [2].
No evaluation loop. Without a feedback mechanism, the reviewer repeats the same mistakes — flagging false positives on patterns it doesn’t understand, missing real issues because they’re outside the diff scope, or generating vague comments (“consider refactoring”) that waste time.
Multi-Agent Code Review Architecture
The production architecture that emerged through 2026 combines three layers:
    ┌─────────────────────────────────────────┐
    │           Orchestrator Agent            │
    │   (Coordinates review, deduplicates,    │
    │      prioritizes, formats output)       │
    └────┬───────────┬────────────────────┬───┘
         │           │                    │
┌────────▼──┐  ┌─────▼────────┐  ┌────────▼────────┐
│ Security  │  │ Architecture │  │ Code Quality    │
│ Agent     │  │ Agent        │  │ Agent           │
│ (OWASP    │  │ (coupling,   │  │ (style, tests,  │
│  scan)    │  │  layering)   │  │  coverage)      │
└───────────┘  └──────────────┘  └─────────────────┘
Layer 1: The Orchestrator
The orchestrator manages the review lifecycle: it receives the PR payload, gathers context (git history, related files, CI status), dispatches to specialized agents, and synthesizes results. Zylos Research’s analysis of autonomous code review systems found that the orchestrator’s most critical function is context assembly — gathering everything an agent needs before it starts [3].
import asyncio
from typing import Dict, List

from pydantic import BaseModel


class ReviewTask(BaseModel):
    agent_type: str
    diff_content: str
    context_files: Dict[str, str]
    guidelines: str


class ReviewResult(BaseModel):
    agent_type: str
    findings: List[Dict]
    confidence: float


# gather_pr_context() and dispatch_agent() are helpers you supply;
# synthesize() is defined under Layer 3.
async def run_code_review(payload: Dict) -> Dict:
    """Orchestrate multi-agent PR review."""
    context = await gather_pr_context(payload["pr_number"])

    # One task per specialized agent; all share the same diff and context
    tasks = [
        ReviewTask(
            agent_type="security",
            diff_content=payload["diff"],
            context_files=context["files"],
            guidelines="OWASP Top 10, data flow analysis"
        ),
        ReviewTask(
            agent_type="architecture",
            diff_content=payload["diff"],
            context_files=context["files"],
            guidelines="Coupling, cohesion, layering violations"
        ),
        ReviewTask(
            agent_type="quality",
            diff_content=payload["diff"],
            context_files=context["files"],
            guidelines="Style consistency, test coverage, edge cases"
        ),
    ]

    # Run the agents concurrently
    results = await asyncio.gather(*[
        dispatch_agent(task) for task in tasks
    ])
    return synthesize(results, context["repo_metadata"])
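The orchestrator calls dispatch_agent without defining it. One way it could work is sketched below, under the assumption that each agent is an async callable registered by its agent_type. The `AGENTS` registry, the `timeout_s` parameter, the `error` field, and the dataclass stand-ins for ReviewTask and ReviewResult are all hypothetical and not part of the original design.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List


@dataclass
class Task:
    """Stand-in for the article's ReviewTask (only the fields used here)."""
    agent_type: str
    diff_content: str


@dataclass
class Result:
    """Stand-in for the article's ReviewResult, plus a hypothetical error field."""
    agent_type: str
    findings: List[dict] = field(default_factory=list)
    confidence: float = 0.0
    error: str = ""


AgentFn = Callable[[Task], Awaitable[Result]]
AGENTS: Dict[str, AgentFn] = {}  # hypothetical registry: agent_type -> coroutine function


async def dispatch_agent(task: Task, timeout_s: float = 45.0) -> Result:
    """Run the agent registered for task.agent_type without raising.

    A slow or failing agent yields an empty Result with `error` set, so
    asyncio.gather() in the orchestrator still returns the other findings.
    """
    agent = AGENTS.get(task.agent_type)
    if agent is None:
        return Result(task.agent_type, error=f"no agent registered for {task.agent_type!r}")
    try:
        return await asyncio.wait_for(agent(task), timeout=timeout_s)
    except asyncio.TimeoutError:
        return Result(task.agent_type, error=f"timed out after {timeout_s}s")
    except Exception as exc:  # isolate one agent's failure from the rest of the review
        return Result(task.agent_type, error=f"{type(exc).__name__}: {exc}")
```

Bounding each agent with a timeout keeps the whole review inside a latency budget, and returning an error record instead of raising lets the synthesizer report which agents were skipped.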
Layer 2: Specialized Agents
Each agent has a focused prompt and tool set. The security agent might call a dependency scanner and a SAST tool, then feed results into its LLM review. The architecture agent reads module boundaries and cross-file dependencies. The quality agent checks for patterns against the project’s style guide.
Pasi Huuhka’s detailed build log of a PR reviewer using OpenCode demonstrated that specialized agents catch 3-5x more actionable issues than a general-purpose reviewer [4]. His key insight: give each agent code-reading tools (grep, file lookup, AST parsing) rather than expecting the LLM to infer structure from the diff alone.
# run_sast_scan(), check_dependency_vulns(), llm, and SecurityFindings are
# placeholders for your own scanner wrappers, LLM client, and response schema.
async def security_agent(task: ReviewTask) -> ReviewResult:
    """Security-focused code review agent."""
    vuln_scan = await run_sast_scan(task.diff_content)
    dep_check = await check_dependency_vulns(task.context_files)

    prompt = f"""Review this diff for security issues:

Diff:
{task.diff_content}

SAST findings:
{vuln_scan}

Dependency advisories:
{dep_check}

Focus on:
1. Injection vulnerabilities (SQL, command, template)
2. Authentication/authorization gaps
3. Data exposure in error messages
4. Unsafe deserialization
5. Secrets in code or config

For each finding, include: file, line, severity (CRITICAL/HIGH/MEDIUM/LOW), and a specific fix suggestion."""

    # Use structured output for parseable results
    llm_response = await llm.call(prompt, response_model=SecurityFindings)
    return ReviewResult(
        agent_type="security",
        findings=llm_response.findings,
        confidence=llm_response.confidence
    )
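The code-reading tools mentioned above (grep, file lookup, AST parsing) can be quite small. The sketch below is illustrative only: `grep_repo` and `list_definitions` are hypothetical names, not functions from OpenCode or from the build log cited in [4], and the AST helper handles Python source only.

```python
import ast
import re
from pathlib import Path
from typing import List


def grep_repo(root: str, pattern: str, glob: str = "*.py", max_hits: int = 50) -> List[dict]:
    """Return up to max_hits lines under root that match a regex."""
    regex = re.compile(pattern)
    hits: List[dict] = []
    for path in sorted(Path(root).rglob(glob)):
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable entries
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append({
                    "file": path.relative_to(root).as_posix(),
                    "line": lineno,
                    "text": line.strip(),
                })
                if len(hits) >= max_hits:
                    return hits
    return hits


def list_definitions(source: str) -> List[dict]:
    """List functions and classes in Python source with their line numbers."""
    defs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            defs.append({"name": node.name, "kind": kind, "line": node.lineno})
    return sorted(defs, key=lambda d: d["line"])
```

An agent that sees a changed function signature in the diff could call grep_repo with the function name to find callers in other files, which is the kind of cross-file breakage a diff-only review misses.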
Layer 3: Synthesis and Prioritization
Raw findings from multiple agents need deduplication and prioritization. The synthesizer merges overlapping findings (same line, same issue), assigns overall severity, and formats the output for a PR comment.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


def synthesize(results: List[ReviewResult], repo_meta: Dict) -> Dict:
    """Merge and prioritize findings from all agents."""
    all_findings = []
    for result in results:
        for f in result.findings:
            # Copy so the agent's own result isn't mutated
            all_findings.append({**f, "source": result.agent_type})

    # Sort most severe first; missing or unknown severities rank as LOW
    all_findings.sort(
        key=lambda f: SEVERITY_RANK.get(f.get("severity", "LOW"), SEVERITY_RANK["LOW"])
    )

    # Deduplicate by file+line+issue_type, keeping the most severe report
    seen = set()
    unique = []
    for f in all_findings:
        key = (f.get("file"), f.get("line"), f.get("issue_type"))
        if key not in seen:
            seen.add(key)
            unique.append(f)

    critical = sum(1 for f in unique if f.get("severity") == "CRITICAL")
    return {
        "summary": f"Found {len(unique)} issues ({critical} critical)",
        "findings": unique[:repo_meta.get("max_findings", 20)],
    }
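The last step, formatting the output for a PR comment, is not shown in the article. A minimal sketch follows, assuming the summary/findings dictionary returned by synthesize. The function name `format_review_comment` and the optional `message` field are hypothetical.

```python
from typing import Dict, List

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]


def format_review_comment(review: Dict) -> str:
    """Render a {'summary', 'findings'} dict as a Markdown comment grouped by severity."""
    groups: Dict[str, List[dict]] = {s: [] for s in SEVERITY_ORDER}
    for f in review["findings"]:
        sev = f.get("severity", "LOW")
        groups[sev if sev in groups else "LOW"].append(f)  # unknown severities go under LOW

    lines = ["## AI Code Review", "", review["summary"], ""]
    for severity in SEVERITY_ORDER:
        group = groups[severity]
        if not group:
            continue
        lines.append(f"### {severity} ({len(group)})")
        for f in group:
            # `message` is a hypothetical field; fall back to issue_type if absent
            text = f.get("message") or f.get("issue_type", "issue")
            lines.append(f"- `{f['file']}:{f['line']}` [{f.get('source', 'agent')}] {text}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Grouping by severity puts critical findings at the top of the comment, so a developer who reads only the first few lines still sees what matters most.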
Integration with GitHub Actions
A production code review agent needs to run automatically on every pull request. GitHub Actions provides a natural home, with a workflow that triggers on pull_request events, calls the agent, and posts results as a comment.
The Ivern AI team’s 2026 benchmark showed that a two-agent pipeline (Gemini CLI for broad analysis + Claude Haiku for detailed review) completes in 30-60 seconds at under $0.02/PR [2]. Here’s a workflow that runs the review on every pull request; your-org/ai-code-review-action is a placeholder for an action that wraps the multi-agent pipeline:
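name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
      checks: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so the base branch is available to diff against
      - name: Get PR diff and context
        id: context
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > diff.txt
          echo "files_changed=$(git diff --name-only origin/${{ github.base_ref }}...HEAD | wc -l)" >> "$GITHUB_OUTPUT"
      - name: Run AI Code Review
        id: review
        uses: your-org/ai-code-review-action@v1
        with:
          model: "claude-sonnet-4-2026"
          api-key: ${{ secrets.AI_API_KEY }}
          diff-file: "diff.txt"
          severity-threshold: "MEDIUM"
          max-comments: 20
      - name: Post Review Comment
        uses: actions/github-script@v7
        env:
          # Pass the output through the environment rather than interpolating
          # it into the script, so its content can't be executed as code
          REVIEW_JSON: ${{ steps.review.outputs.review-json }}
        with:
          script: |
            const review = JSON.parse(process.env.REVIEW_JSON);
            // Minimal formatter; assumes the summary/findings shape returned by synthesize()
            const formatReviewComment = (r) => [
              '## AI Code Review',
              r.summary,
              ...r.findings.map(f => `- **${f.severity}** \`${f.file}:${f.line}\``),
            ].join('\n');
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: formatReviewComment(review),
            });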
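Findings carry a file and a line, so it helps to know which lines the PR actually touched, for example to drop findings that point outside the diff. The sketch below parses the unified diff that the workflow writes to diff.txt. It assumes `git diff` output with "diff --git" headers, and `added_lines` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Optional

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text: str) -> Dict[str, List[int]]:
    """Map each changed file to the new-file line numbers of its added lines."""
    result: Dict[str, List[int]] = {}
    current: Optional[str] = None  # path of the file whose hunks are being read
    new_lineno = 0                 # line number in the new version of that file
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            current, in_hunk = None, False
        elif not in_hunk and line.startswith("+++ "):
            path = line[4:].strip()
            if path == "/dev/null":
                current = None  # file was deleted; nothing to annotate
            else:
                current = path[2:] if path.startswith("b/") else path
        elif line.startswith("@@ "):
            match = HUNK_RE.match(line)
            if match and current is not None:
                new_lineno = int(match.group(1))
                in_hunk = True
        elif in_hunk:
            if line.startswith("+"):
                result.setdefault(current, []).append(new_lineno)
                new_lineno += 1
            elif line.startswith("-") or line.startswith("\\"):
                continue  # removed lines and "\ No newline" markers don't advance
            else:
                new_lineno += 1  # context line
    return result
```

A filter such as `f["line"] in added_lines(diff).get(f["file"], [])` is one simple way to keep comments anchored to code the author changed.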
Evaluation: The Critical Loop
The most overlooked component of AI code review systems is the evaluation loop. Without it, you can’t tell if your agent is improving or regressing. A rigorous evaluation pipeline has three parts:
Labeled dataset. Collect 50-100 past PRs with manually tagged issues (what the human reviewer found). The Augment team found that evaluation datasets need at least 80 PRs for statistical significance [1].
Regression suite. Run every agent version against the labeled dataset. Track precision (are flagged issues real?) and recall (did the agent miss known issues?). Flag any version that drops either metric.
Human feedback loop. For every PR comment the agent posts, ask developers to react with a thumbs-up or thumbs-down. Collect these reactions weekly and use them to tune prompts and severity thresholds.
from typing import List

from pydantic import BaseModel


class EvalExample(BaseModel):
    pr_number: int
    diff: str
    known_issues: List[dict]  # Ground truth


def evaluate_agent(agent, examples: List[EvalExample]) -> dict:
    """Score an agent against labeled PRs; agent.review(diff) must return {"findings": [...]}."""
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for ex in examples:
        result = agent.review(ex.diff)
        agent_issues = {(f["file"], f["line"], f["issue_type"])
                        for f in result["findings"]}
        known = {(f["file"], f["line"], f["issue_type"])
                 for f in ex.known_issues}
        true_positives += len(agent_issues & known)
        false_positives += len(agent_issues - known)
        false_negatives += len(known - agent_issues)

    # Guard against empty denominators (nothing flagged, or no known issues)
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
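The regression suite described above needs a rule for flagging a version that drops either metric. A small sketch of such a gate follows, taking two of the metric dictionaries returned by evaluate_agent. The function name `regression_check` and the `tolerance` default are assumptions, chosen only to absorb small run-to-run variation.

```python
from typing import Dict, List


def regression_check(baseline: Dict[str, float], candidate: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return one message per tracked metric that dropped by more than `tolerance`."""
    failures = []
    for metric in ("precision", "recall"):
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance:
            failures.append(
                f"{metric} dropped {drop:.3f} "
                f"({baseline[metric]:.3f} -> {candidate[metric]:.3f})"
            )
    return failures
```

In CI, a non-empty return value would fail the build for the new prompt or agent version, so a precision gain can't quietly hide a recall loss.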
Choosing a Framework
By mid-2026, the landscape of tools for building code review agents has matured. Here’s how the main options stack up:
| Framework | Strengths | Best for | Reference |
|---|---|---|---|
| LangGraph | Stateful graphs, human-in-the-loop, built-in persistence | Complex multi-agent review with approval steps | [6] |
| OpenCode | Fast, open-source, pluggable tools | Individual agent per review type | [4] |
| Custom (FastAPI + LLM) | Full control, no framework overhead | Single-purpose, well-scoped reviews | – |
| PR-Agent (Qodo, formerly CodiumAI) | Pre-built, GitHub-native | Teams wanting an out-of-the-box solution | [5] |
LangGraph’s state graph model makes it particularly well-suited for code review, where you need to maintain PR context across multiple agent invocations and optionally pause for human approval before posting critical findings [6].
Putting It Together: A Complete Minimal Agent
Here’s a minimal end-to-end implementation using LangGraph that chains a security review, a stubbed quality review, and a synthesis step over a PR diff:
import json
from pathlib import Path
from typing import List, TypedDict

import openai
from langgraph.graph import StateGraph, END


class ReviewState(TypedDict):
    diff: str
    context: dict
    security_findings: List[dict]
    quality_findings: List[dict]
    report: str


def security_review(state: ReviewState) -> ReviewState:
    """Security review node."""
    # Placeholder model name: the openai client needs an OpenAI-compatible
    # endpoint that serves whichever model you choose.
    response = openai.chat.completions.create(
        model="claude-sonnet-4-2026",
        messages=[{
            "role": "system",
            "content": "Review this diff for security issues. "
                       "Focus on OWASP Top 10 vulnerabilities. "
                       # JSON mode expects the prompt to ask for JSON explicitly
                       'Respond with a JSON object containing a "findings" list.'
        }, {
            "role": "user",
            "content": state["diff"]
        }],
        response_format={"type": "json_object"}
    )
    findings = json.loads(response.choices[0].message.content)
    return {**state, "security_findings": findings.get("findings", [])}


def quality_review(state: ReviewState) -> ReviewState:
    """Code quality review node."""
    # Similar structure targeting code style, test coverage, edge cases.
    # Stubbed to return no findings so the graph runs end to end.
    return {**state, "quality_findings": []}


def format_report(findings: List[dict]) -> str:
    """Minimal report formatter: one JSON line per finding."""
    if not findings:
        return "No issues found."
    return "\n".join(json.dumps(f) for f in findings)


def synthesize(state: ReviewState) -> ReviewState:
    """Synthesize findings into a report."""
    all_findings = state["security_findings"] + state["quality_findings"]
    report = format_report(all_findings)
    return {**state, "report": report}


# Build the graph
graph = StateGraph(ReviewState)
graph.add_node("security", security_review)
graph.add_node("quality", quality_review)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("security")
graph.add_edge("security", "quality")
graph.add_edge("quality", "synthesize")
graph.add_edge("synthesize", END)
app = graph.compile()

pr_diff = Path("diff.txt").read_text()  # e.g. the diff written by the workflow above
result = app.invoke({"diff": pr_diff, "context": {}})
print(result["report"])
Pitfalls to Watch For
Noise kills adoption. If your agent flags 50 issues per PR, developers will ignore every single one. Start with critical-only findings and expand as precision proves out. The Sourcegraph comparison of 13 code review tools found that the most-used tools cap at 5-10 comments per PR [5].
Context is a cost multiplier. Fetching 20 context files for every diff analysis burns tokens. Use smart context selection — only pull files that the diff actually references (imports, shared types, called functions) rather than loading the whole repo [1].
Evaluation requires ground truth. Without a labeled dataset of real PR issues, you’re flying blind. Build this dataset early by tagging issues from past human reviews. Zylos Research recommends starting with 50 annotated PRs and expanding quarterly [3].
Latency matters. A review that takes 5 minutes is effectively useless for developer workflow. Target sub-60-second reviews to integrate into the existing PR cycle [2].
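The smart context selection described under "Context is a cost multiplier" can start very simply. The sketch below picks only the repo files that the diff's added lines import. It handles Python imports only, ignores relative imports, and uses a hypothetical helper name (`select_context_files`) and `limit` parameter. A production version would also follow called functions and shared types.

```python
import re
from typing import Iterable, List

IMPORT_RE = re.compile(r"^\+\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))")


def select_context_files(diff_text: str, repo_files: Iterable[str], limit: int = 10) -> List[str]:
    """Pick repo files imported by added lines in the diff (Python imports only)."""
    repo = set(repo_files)
    selected: List[str] = []
    for line in diff_text.splitlines():
        if line.startswith("+++"):
            continue  # file header, not an added line
        match = IMPORT_RE.match(line)
        if not match:
            continue
        module = match.group(1) or match.group(2)
        base = module.replace(".", "/")
        # A module maps to either pkg/mod.py or pkg/mod/__init__.py
        for candidate in (f"{base}.py", f"{base}/__init__.py"):
            if candidate in repo and candidate not in selected:
                selected.append(candidate)
        if len(selected) >= limit:
            break
    return selected[:limit]
```

Standard-library and third-party imports fall out naturally because they don't match any path in the repo, so only first-party code is sent to the model.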
The Bottom Line for Teams
Building an AI code review agent in 2026 is less about the model and more about the architecture. The teams getting real value — 60-85% issue capture rates with under 20% false positives — share three patterns:
- Multi-agent specialization rather than single-prompt generalists
- Structured evaluation loops with labeled ground truth datasets
- Smart context retrieval that pulls relevant code beyond the diff
Start with a single specialized agent (security is the easiest to evaluate objectively), build your evaluation dataset, and expand only when you can measure improvement.
Sources
[1] Augment Code, “How we built a high-quality AI code review agent,” March 2026. https://www.augmentcode.com/blog/how-we-built-high-quality-ai-code-review-agent
[2] Ivern AI, “AI Agent Code Review Automation: Automate 100% of PRs in 60 Sec,” 2026. https://ivern.ai/blog/ai-agent-code-review-automation
[3] Zylos Research, “Autonomous Code Review: Multi-Agent Approaches to Pull Request Analysis,” April 2026. https://zylos.ai/research/2026-04-22-autonomous-code-review-multi-agent-pr-analysis
[4] Pasi Huuhka, “Building your own PR reviewer with coding agents,” March 2026. https://www.huuhka.net/building-your-own-pr-reviewer-with-coding-agents/
[5] Sourcegraph, “13 Best Automated Code Review Tools in 2026,” May 2026. https://sourcegraph.com/blog/automated-code-review-tools
[6] LangChain, “LangGraph: Build resilient agents,” https://github.com/langchain-ai/langgraph

