Prompt Caching in Production: A Provider-by-Provider Implementation Guide

The bottom line: Prompt caching is the highest-leverage cost optimization available to production LLM systems in 2026. Each major provider implements it differently — OpenAI auto-caches prefixes above a threshold, Anthropic uses explicit breakpoints you control, Google offers both implicit and explicit modes. This guide covers the implementation details, code patterns, and cost data for each, plus a unified approach when routing across multiple providers.
The Caching Landscape
Every LLM request starts with a prefill step: the model computes key-value (KV) tensors for every token in the prompt before it generates any output. When the same prefix appears in successive requests (system prompt, tool definitions, few-shot examples), that work is repeated on every call unless it is cached. Prompt caching keeps the prefix's KV tensors alive between requests, so subsequent calls with the same prefix skip the prefill computation for the cached tokens.
The result is dramatic: 35–85% latency reduction and 30–90% input token cost savings depending on provider and prompt structure [1]. For production pipelines where the same system prompt fires across thousands of agent turns, the savings compound fast.
Each provider takes a different approach:
| Provider | Mechanism | Cache Duration | Cache Size Threshold | Cost Savings (Cache Hit) |
|---|---|---|---|---|
| OpenAI | Automatic prefix matching | 5–10 min inactivity | 1,024 tokens minimum | 50% off input tokens |
| Anthropic | Explicit cache_control breakpoints | 5 min inactivity | 1,024 tokens minimum (breakpoint) | Up to 90% off input tokens |
| Google Gemini (explicit) | CachedContent API with TTL | Configurable (1h default) | 32,768 tokens minimum | Up to 80% off input tokens |
| Google Gemini (implicit) | Automatic (Gemini 2.5+) | Per-session | No minimum | No cost guarantee |
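To see how the discount column translates into spend, here is a minimal sketch of the expected input cost of one request. The function name, the hit-rate parameter, and the example price of $2.50 per million tokens are illustrative assumptions, not provider figures. The sketch also ignores cache-write premiums and storage fees, which some providers charge.

```python
def blended_input_cost(
    prefix_tokens: int,
    dynamic_tokens: int,
    hit_rate: float,
    cached_discount: float,
    price_per_million: float,
) -> float:
    """Expected input cost in dollars for a single request.

    hit_rate is the fraction of requests whose prefix is served from cache;
    cached_discount is the fractional discount on cached tokens (0.5 = 50% off).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    per_token = price_per_million / 1_000_000
    # On a hit the prefix is billed at the discounted rate, on a miss at full price.
    prefix_cost = prefix_tokens * per_token * (1 - hit_rate * cached_discount)
    # Tokens after the stable prefix are always billed at full price.
    return prefix_cost + dynamic_tokens * per_token


# Hypothetical workload: 3,000-token prefix, 500 varying tokens, 80% hit rate.
with_cache = blended_input_cost(3000, 500, 0.8, 0.5, 2.50)
without_cache = blended_input_cost(3000, 500, 0.0, 0.5, 2.50)
print(f"saving per request: {1 - with_cache / without_cache:.0%}")
```

The saving on the whole request is smaller than the headline discount because only the prefix is discounted and only on hits, which is why a long stable prefix and a high hit rate matter more than the nominal percentage.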
OpenAI: Automatic Prefix Caching
OpenAI’s approach is the simplest to implement because it requires zero code changes. Caching is automatic — the API identifies repeated prefixes at inference time and reuses computed KV tensors when a matching prefix is found [2].
How it Works
- Cache eligibility starts at 1,024 tokens. Prompts under that threshold never qualify.
- The system checks prefixes of the full prompt for matches in a shared cache pool (5–10 minute inactivity window).
- Structured output schemas are included in the prefix, so they improve cache hit rates rather than degrading them [3].
- Cache hits return a 50% discount on input token pricing. [1]
Implementation Strategy
The most important thing you can do for OpenAI prompt caching is prefix stability. Every agent turn should share an identical prefix — system prompt, tool definitions, and structured output schema — with only the conversation history varying.
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# The stable prefix: system prompt + tools never change between agent turns
SYSTEM_PROMPT = """You are a code review assistant. Analyze pull requests for:
1. Security vulnerabilities
2. Performance regressions
3. Code style violations
4. Test coverage gaps
Provide specific line-level feedback with severity ratings."""
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search codebase for patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string"},
                    "path": {"type": "string"},
                },
                "required": ["pattern"],
            },
        },
    },
    # ... additional tool definitions remain constant
]
def agent_turn(messages: list[dict]) -> str:
    """Single agent turn with stable prefix for cache hits."""
    # TOOLS uses the Chat Completions tool schema (nested "function" key),
    # so the call goes through chat.completions rather than responses.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *messages,  # Only this section varies between turns
        ],
        tools=TOOLS,
        tool_choice="auto",
    )
    # content is None when the model replies with a tool call instead of text
    return response.choices[0].message.content or ""
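The key optimization: in a real deployment the system prompt and tools array form a stable prefix of typically 2,000–4,000+ tokens, well above the 1,024-token threshold (the abbreviated example above is far shorter). Every agent turn after the first is then eligible for a cache hit on that prefix, provided requests arrive within the inactivity window.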
Caveats
- There’s no way to force a cache hit. If the prefix varies even slightly (whitespace, ordering), the hash changes and you get a cache miss.
- The 5–10 minute window resets on each cache hit. Sustained traffic keeps the cache warm.
- For sporadic workloads (one request every 15+ minutes), the cache is typically cold and you won’t see savings.
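The first caveat is easy to trip over in practice. The sketch below uses a character-level comparison as a rough proxy for prefix matching (real caches match on tokens, and the helper names here are hypothetical) to show how placing a per-request value ahead of the stable text removes almost all of the shared prefix.

```python
def shared_prefix_chars(a: str, b: str) -> int:
    """Number of leading characters two strings have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def render(system: str, request_id: str, id_first: bool) -> str:
    """Serialize a prompt; id_first puts the per-request value before the stable text."""
    tag = f"request_id={request_id}"
    parts = [tag, system] if id_first else [system, tag]
    return "\n".join(parts)


SYSTEM = "You are a code review assistant. " * 50  # stand-in for a long system prompt

# Stable text first: the two requests share the whole system prompt.
stable = shared_prefix_chars(render(SYSTEM, "a1", False), render(SYSTEM, "b2", False))
# Per-request value first: the requests diverge almost immediately.
unstable = shared_prefix_chars(render(SYSTEM, "a1", True), render(SYSTEM, "b2", True))

print(stable, unstable)
```

The same reasoning applies to timestamps, user names, or session IDs interpolated into the system prompt: moving them to the end of the prompt keeps everything before them cacheable.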
Anthropic: Explicit Cache Breakpoints
Anthropic takes the opposite approach from OpenAI: caching is opt-in with explicit cache_control breakpoints placed wherever you want the cached prefix to end [4].
How it Works
You mark specific content blocks in your prompt with {"cache_control": {"type": "ephemeral"}}. Everything from the start of the block list up to that breakpoint is cached. The cache persists for 5 minutes of inactivity and is isolated at the workspace level (since February 2026).
Pricing depends on how each input token is handled, not on the type of content:
| Token category | Price relative to base input |
|---|---|
| Cache write (5-minute cache) | 1.25× (25% premium) |
| Cache read (hit) | 0.1× (~90% discount) |
| Uncached input | 1× (no discount) |
Implementation
from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
def agent_turn(user_message: str, history: list[dict]) -> str:
    """Agent turn with cache breakpoints on system prompt and tools."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        # Abbreviated prompt: a real prefix must meet the minimum cacheable length
        system=[
            {
                "type": "text",
                "text": """You are a code review assistant. Analyze pull requests for:
1. Security vulnerabilities
2. Performance regressions
3. Code style violations
4. Test coverage gaps
Provide specific line-level feedback with severity ratings.""",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        tools=[
            {
                "name": "search_code",
                "description": "Search codebase for patterns",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "pattern": {"type": "string"},
                        "path": {"type": "string"},
                    },
                    "required": ["pattern"],
                },
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            *history,
            {"role": "user", "content": user_message},
        ],
    )
    # The reply may contain tool_use blocks; return the first text block, if any
    return next((block.text for block in response.content if block.type == "text"), "")
The cache_control marker marks the end of a cacheable prefix: everything up to and including the marked block is written to the cache. On subsequent requests with identical prefix content, the server reuses the cached values instead of recomputing them. Without the marker, no caching occurs regardless of prompt length.
Breakpoint Strategy
Place breakpoints strategically: one at the end of the system prompt, one at the end of tool definitions, and optionally one at key points in conversation history for long-running agent sessions. A breakpoint carries no charge of its own; the cost comes from writing the tokens in front of it to the cache. Anthropic allows up to four breakpoints per request, so spend them on the largest stable regions.
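One way to apply this placement rule without hand-editing every request is a small helper that marks the last block of a list. This is a sketch operating on plain dicts; the helper names and the `MAX_BREAKPOINTS` constant (set to four on the assumption that this is the per-request limit) are hypothetical and not part of the Anthropic SDK.

```python
import copy

MAX_BREAKPOINTS = 4  # assumption: per-request breakpoint limit


def with_breakpoint(blocks: list[dict]) -> list[dict]:
    """Return a copy of blocks with cache_control set on the final block."""
    if not blocks:
        return []
    marked = copy.deepcopy(blocks)  # leave the caller's definitions untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


def count_breakpoints(*block_lists: list[dict]) -> int:
    """Count blocks carrying a cache_control marker across several lists."""
    return sum(1 for blocks in block_lists for b in blocks if "cache_control" in b)


system_blocks = with_breakpoint([{"type": "text", "text": "You are a code review assistant."}])
tool_blocks = with_breakpoint([
    {"name": "search_code", "description": "Search codebase for patterns"},
    {"name": "read_file", "description": "Read a file from the repository"},
])

# Guard against exceeding the assumed limit before sending the request.
assert count_breakpoints(system_blocks, tool_blocks) <= MAX_BREAKPOINTS
```

Marking only the final block of each list keeps the marker position stable when tools are added in the middle of the list, although any such change still alters the prefix and causes a fresh cache write.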
Caveats
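- Cache writes are billed at a premium over the base input price (25% for the 5-minute cache), so a prefix that is written but never read costs more than not caching at all. Prefixes shorter than the model’s minimum cacheable length (1,024 tokens on most models) are not cached, so breakpoints on shorter prompts have no effect.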
- Citation blocks in responses cannot be cached directly. The source documents they cite can be cached by placing cache_control on the document content block.
- If you change the system prompt text, the prefix no longer matches: the request misses the cache and writes a new entry, and the old entry simply expires. Only byte-identical prefixes are reused.
- Workspace-level isolation means caches from different workspaces don’t interfere (or share benefits).
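Whether the write premium is worth paying comes down to simple arithmetic. The sketch below assumes cache writes are billed at 1.25× and cache reads at 0.1× the base input price, that the first request writes the cache, and that every later request hits it. The multipliers are exposed as parameters because they are assumptions to check against current pricing.

```python
def cost_with_cache(
    n_requests: int,
    prefix_tokens: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Prefix cost in base-price token units: one write, then n - 1 reads."""
    if n_requests < 1:
        return 0.0
    return prefix_tokens * (write_multiplier + (n_requests - 1) * read_multiplier)


def cost_without_cache(n_requests: int, prefix_tokens: int) -> float:
    """Prefix cost in base-price token units when every request pays full price."""
    return float(n_requests * prefix_tokens)


def break_even_requests(write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> int:
    """Smallest number of requests for which caching is strictly cheaper."""
    if read_multiplier >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as uncached input")
    n = 1
    while write_multiplier + (n - 1) * read_multiplier >= n:
        n += 1
    return n


print(break_even_requests())          # requests needed before caching wins
print(cost_with_cache(10, 2000))      # 10 requests sharing a 2,000-token prefix
print(cost_without_cache(10, 2000))
```

Under these assumptions caching is already cheaper on the second request within the window, so the practical risk is a prefix that is written once and never read again.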
Google Gemini: Explicit Context Caching
Google offers two modes. Implicit caching is automatic on Gemini 2.5+ but carries no cost guarantee. Explicit caching via the CachedContent API is the recommended approach for production workloads [5].
How it Works
You create a CachedContent object with your prefix content, set a TTL, and reference it in subsequent generateContent calls. The cache is a named resource that persists until the TTL expires (default 1 hour, max configurable) or until it’s manually deleted.
Requirements:
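- Minimum 32,768 tokens for the cached content (the figure documented for Gemini 1.5; newer models may accept smaller caches, so check the current limit for your model)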
- Supported models: Gemini 1.5 Pro/Flash, Gemini 2.0 Flash, Gemini 2.5 Pro/Flash
Implementation
import datetime
import google.generativeai as genai
from google.generativeai import caching
genai.configure(api_key="YOUR_API_KEY")
# Step 1: Create a cached context (one-time setup)
system_content = """You are a code review assistant. Analyze pull requests for:
1. Security vulnerabilities
2. Performance regressions
3. Code style violations
4. Test coverage gaps
Provide specific line-level feedback with severity ratings."""
# NOTE: this prompt is abbreviated. A real cache must meet the minimum token
# count, typically by placing large documents or examples in `contents`.
cached_content = caching.CachedContent.create(
    model="models/gemini-2.5-pro",
    display_name="code-review-system-prompt",
    system_instruction=system_content,
    contents=[],
    ttl=datetime.timedelta(hours=1),  # 1 hour cache lifetime
)
cache_name = cached_content.name
print(f"Created cache: {cache_name}")
# Step 2: Reference the cache in inference calls
model = genai.GenerativeModel.from_cached_content(
    cached_content=cached_content
)
def agent_turn(user_message: str) -> str:
    """Send one user message against the cached prefix."""
    response = model.generate_content(user_message)
    return response.text
Explicit caching has two cost components: storage is billed per token-hour for as long as the cache exists, while cache hits save roughly 80% on input token pricing compared to full computation. [5]
Multi-Turn Pattern
Cached content is immutable once created; only its expiry can be changed. For long-running agent sessions, extend the TTL so the prefix stays alive:
import datetime
# Extend the cache lifetime by another hour (only ttl or expire_time can be updated)
cached_content.update(ttl=datetime.timedelta(hours=1))
To make conversation history part of the cached prefix, create a new CachedContent that includes it and delete the old one. A new cache incurs its own storage cost, so this pays off only when the history is reused across many subsequent calls.
Caveats
- The 32,768 token minimum means explicit caching only makes sense for medium-to-large prompts.
- The CachedContent API is a separate resource with its own costs (storage per hour), billed whether or not any request uses the cache.
- TTL management is manual: if you don’t update or extend the TTL, the cache expires.
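Because storage is billed by the hour, explicit caching has a break-even request rate. The sketch below estimates it under stated assumptions: the prices in the usage example are placeholders rather than quoted Gemini rates, every request is a cache hit, and the one-time cost of creating the cache is ignored.

```python
def break_even_requests_per_hour(
    input_price_per_million: float,
    storage_price_per_million_hour: float,
    cached_discount: float,
) -> float:
    """Requests per hour at which per-hit savings equal the hourly storage fee.

    The cached token count cancels out: both the saving per request and the
    storage fee per hour scale linearly with the size of the cached prefix.
    """
    saving_per_million_per_request = input_price_per_million * cached_discount
    if saving_per_million_per_request <= 0:
        raise ValueError("price and discount must be positive")
    return storage_price_per_million_hour / saving_per_million_per_request


# Placeholder prices: $1.25 per million input tokens, $4.50 per million
# tokens per hour of storage, and an 80% discount on cached tokens.
rate = break_even_requests_per_hour(1.25, 4.50, 0.80)
print(f"break-even at about {rate:.1f} requests per hour")
```

Below that rate the storage fee outweighs the savings, which is a reason to delete caches or shorten the TTL for prefixes that see little traffic.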
A Unified Multi-Provider Strategy
When routing across providers, you need a strategy that works for all three. The common pattern is prefix normalization: ensure the system prompt, tool definitions, and output schema are byte-identical across requests for a given task, regardless of which provider handles it.
from dataclasses import dataclass
@dataclass
class StablePrefix:
    """A normalized, cacheable prefix shared across providers."""
    system_prompt: str
    tools_definitions: str  # serialized JSON
    output_schema: str | None = None
    @property
    def normalized_system(self) -> str:
        """Return byte-identical system prompt for cache stability."""
        return self.system_prompt.strip()
    def token_estimate(self, model: str) -> int:
        """Rough token count for cache eligibility checks."""
        # ~4 chars per token for English text
        total = len(self.normalized_system) // 4
        total += len(self.tools_definitions) // 3  # JSON is denser
        if self.output_schema:
            total += len(self.output_schema) // 3
        return total
The optimization targets:
- System prompt stability — Same prompt text across all provider calls for a task. Even a single changed word invalidates the cache prefix.
- Tool definition normalization — Serialize tools in a consistent order (alphabetical by name) and format across all providers.
- Agent turn batching — Group calls by task type so identical prefixes cluster together in time, keeping caches warm.
- Cache-aware deployment — Route similar tasks to the same model instance when possible to maximize cache locality.
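The first two targets can be enforced mechanically. The following sketch serializes tools in a canonical form and fingerprints the prefix so that drift shows up in logs. It assumes provider-neutral tool dicts with a top-level `name` key, and the function names are hypothetical. The fingerprint is for your own monitoring only, since providers use their own matching internally.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> str:
    """Serialize tools deterministically: sorted by name, sorted keys, fixed separators."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """SHA-256 of the normalized prefix; a changed value signals a changed prefix."""
    payload = system_prompt.strip() + "\n" + canonical_tools(tools)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools_a = [
    {"name": "search_code", "description": "Search codebase for patterns"},
    {"name": "read_file", "description": "Read a file"},
]
# Same tools, different list order and key order.
tools_b = [
    {"description": "Read a file", "name": "read_file"},
    {"description": "Search codebase for patterns", "name": "search_code"},
]

same = prefix_fingerprint("You review code.", tools_a) == prefix_fingerprint("You review code.", tools_b)
changed = prefix_fingerprint("You review code.", tools_a) != prefix_fingerprint("You audit code.", tools_a)
print(same, changed)
```

Logging the fingerprint alongside each request makes it straightforward to correlate a drop in cache hits with a deploy that changed the prompt or tool list.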
Real-World Savings
Published figures from production deployments in 2026 show consistent results:
- An e-commerce support agent using Anthropic prompt caching reduced input token costs by roughly 70% and latency by 60% after adding breakpoints to the system prompt and tool definitions [1].
- A code analysis pipeline using OpenAI’s automatic prefix caching reported approximately 50% savings on input tokens for sustained-review workloads where the same system prompt was used across hundreds of pull requests per hour [2].
- A multi-model RAG pipeline using Gemini explicit caching cut input costs by roughly 75% on cached prefixes, with cache storage costs adding less than 5% overhead [6].
The common thread: any sustained production workload with a stable system prompt and repeated invocations will benefit. The savings are trivial for one-off requests and significant for pipelines processing hundreds or thousands of calls per hour.
When Not to Cache
Prompt caching is not a silver bullet. Skip it when:
- Short-lived sessions — If each user session makes 1–3 calls and there’s no shared system prompt across sessions, the cache is unlikely to hit.
- Dynamic system prompts — If every request uses a different system instruction (e.g., user-provided prompts), there’s no stable prefix to cache.
- Throughput below threshold — Fewer than one request per 5–10 minutes (per provider window) means the cache expires between calls.
- Prompt under threshold — OpenAI needs 1,024 tokens minimum; Gemini explicit needs 32,768 tokens. Short prompts get no benefit.
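The threshold and throughput checks in this list can be folded into a quick pre-flight test. This is a sketch only: the `CachePolicy` type and function name are hypothetical, and the numbers are the thresholds and windows quoted in this guide (using the lower bound of OpenAI's window and Gemini's default TTL), which should be confirmed against current provider documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    min_prefix_tokens: int   # smallest prefix the provider will cache
    window_seconds: float    # how long an unused cache entry survives


# Values as quoted in this guide; treat them as assumptions.
POLICIES = {
    "openai": CachePolicy(min_prefix_tokens=1024, window_seconds=5 * 60),
    "anthropic": CachePolicy(min_prefix_tokens=1024, window_seconds=5 * 60),
    "gemini_explicit": CachePolicy(min_prefix_tokens=32768, window_seconds=60 * 60),
}


def caching_likely_to_help(
    prefix_tokens: int,
    seconds_between_requests: float,
    policy: CachePolicy,
) -> bool:
    """True if the prefix is long enough and requests arrive before the cache expires."""
    long_enough = prefix_tokens >= policy.min_prefix_tokens
    warm = seconds_between_requests < policy.window_seconds
    return long_enough and warm


print(caching_likely_to_help(3000, 60, POLICIES["openai"]))           # steady traffic
print(caching_likely_to_help(3000, 900, POLICIES["openai"]))          # sporadic traffic
print(caching_likely_to_help(3000, 60, POLICIES["gemini_explicit"]))  # prefix too short
```

A check like this does not cover dynamic system prompts, which need the prefix comparison described earlier, but it filters out workloads that cannot benefit regardless of prompt design.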
Next Steps
- Audit your prompts: Measure the token count of your stable prefix (system prompt + tools). Anything above 1,024 tokens is a candidate.
- Add cache markers: For Anthropic, add cache_control breakpoints. For OpenAI, ensure prefix stability. For Gemini, evaluate the 32K threshold.
- Measure before and after: Track usage.cache_creation_input_tokens and usage.cache_read_input_tokens in Anthropic responses. OpenAI reports cached prompt tokens in usage.prompt_tokens_details.cached_tokens (usage.input_tokens_details.cached_tokens in the Responses API). Gemini reports them in usage_metadata.cached_content_token_count.
- Monitor cache hit rate: A hit rate below 60% usually means prefix instability or low request density. Diagnose and fix before investing in more caching. [3]
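For the measurement step, a provider-neutral view of cache usage helps when traffic is split across APIs. The sketch below works on plain dicts shaped like each provider's usage object (for example, the result of dumping an SDK response to a dict). The function names are hypothetical, and the metric is the share of input tokens served from cache, which is related to but not the same as a per-request hit rate.

```python
def cached_and_total(provider: str, usage: dict) -> tuple[int, int]:
    """Return (cached_input_tokens, total_input_tokens) for one response."""
    if provider == "anthropic":
        # input_tokens excludes cached tokens, so the total is the sum of all three.
        read = usage.get("cache_read_input_tokens") or 0
        written = usage.get("cache_creation_input_tokens") or 0
        return read, read + written + (usage.get("input_tokens") or 0)
    if provider == "openai":
        # prompt_tokens already includes the cached portion.
        details = usage.get("prompt_tokens_details") or {}
        return details.get("cached_tokens") or 0, usage.get("prompt_tokens") or 0
    if provider == "gemini":
        # prompt_token_count already includes the cached portion.
        return usage.get("cached_content_token_count") or 0, usage.get("prompt_token_count") or 0
    raise ValueError(f"unknown provider: {provider}")


def token_hit_rate(records: list[tuple[str, dict]]) -> float:
    """Fraction of all input tokens that were served from cache."""
    cached = total = 0
    for provider, usage in records:
        c, t = cached_and_total(provider, usage)
        cached += c
        total += t
    return cached / total if total else 0.0


records = [
    ("anthropic", {"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 1800}),
    ("openai", {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1024}}),
]
print(f"{token_hit_rate(records):.1%} of input tokens came from cache")
```

Tracking this figure per task type, rather than globally, makes it easier to spot which prompt is responsible when the rate drops.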
Prompt caching won’t save you from bad architecture, but it will cut your LLM bill by a meaningful margin for any production system with a stable prompt. The implementation is simple once you understand each provider’s model, and the payoff compounds with every request your system processes.
Sources
[1] “Prompt Caching Infrastructure: Reducing LLM Costs and Latency” — Introl, March 2026. https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
[2] “Prompt caching” — OpenAI API docs. https://developers.openai.com/api/docs/guides/prompt-caching
[3] “Prompt Caching 201” — OpenAI Developers, February 2026. https://developers.openai.com/cookbook/examples/prompt_caching_201
[4] “Prompt caching” — Anthropic Claude API Docs. https://platform.claude.com/docs/en/build-with-claude/prompt-caching
[5] “Context caching - generateContent API” — Google AI for Developers. https://ai.google.dev/gemini-api/docs/caching
[6] “Prompt Caching in 2026: Cut LLM Costs, Keep Quality” — Digital Applied, June 2026. https://www.digitalapplied.com/blog/prompt-caching-2026-cut-llm-costs-engineering-guide


