Prompt Caching in Production: A Provider-by-Provider Implementation Guide

The bottom line: Prompt caching is the highest-leverage cost optimization available to production LLM systems in 2026. Each major provider implements it differently — OpenAI auto-caches prefixes above a threshold, Anthropic uses explicit breakpoints you control, Google offers both implicit and explicit modes. This guide covers the implementation details, code patterns, and cost data for each, plus a unified approach when routing across multiple providers.

The Caching Landscape

Every LLM request pays a prefill cost: before generating anything, the model computes key-value (KV) tensors for every token in the prompt. When the same prefix appears in successive requests (system prompt, tool definitions, few-shot examples), recomputing those tensors from scratch is wasted work. Prompt caching keeps the prefix's KV tensors alive between requests, so subsequent calls with the same prefix skip the most expensive part of prefill.

The result is dramatic: 35–85% latency reduction and 30–90% input token cost savings depending on provider and prompt structure [1]. For production pipelines where the same system prompt fires across thousands of agent turns, the savings compound fast.

Each provider takes a different approach:

Provider	Mechanism	Cache Duration	Cache Size Threshold	Cost Savings (Cache Hit)
OpenAI	Automatic prefix matching	5–10 min inactivity	1,024 tokens minimum	50% off input tokens
Anthropic	Explicit `cache_control` breakpoints	5 min inactivity	1,024 tokens minimum (breakpoint)	Up to 90% off input tokens
Google Gemini (explicit)	CachedContent API with TTL	Configurable (1h default)	32,768 tokens minimum	Up to 80% off input tokens
Google Gemini (implicit)	Automatic (Gemini 2.5+)	Per-session	No minimum	No cost guarantee
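
The thresholds in the table can be captured as data so a router can check eligibility before counting on a cache. This is a minimal sketch: `CacheProfile`, `PROFILES`, and `eligible_providers` are hypothetical names, and the numbers are copied from the table above rather than fetched from any provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheProfile:
    """One row of the comparison table (values copied from it, not from an API)."""
    provider: str
    explicit: bool   # True when the caller must opt in to caching
    min_tokens: int  # smallest prefix the provider will cache


PROFILES = [
    CacheProfile("openai", explicit=False, min_tokens=1_024),
    CacheProfile("anthropic", explicit=True, min_tokens=1_024),
    CacheProfile("gemini-explicit", explicit=True, min_tokens=32_768),
]


def eligible_providers(prefix_tokens: int) -> list[str]:
    """Return the providers whose minimum prefix size is met."""
    return [p.provider for p in PROFILES if prefix_tokens >= p.min_tokens]
```

Keeping the thresholds in one place also makes them easy to update when a provider changes its limits.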

OpenAI: Automatic Prefix Caching

OpenAI’s approach is the simplest to implement because it requires zero code changes. Caching is automatic — the API identifies repeated prefixes at inference time and reuses computed KV tensors when a matching prefix is found [2].

How it Works

Cache eligibility starts at 1,024 tokens. Prompts under that threshold never qualify.
The system checks prefixes of the full prompt for matches in a shared cache pool (5–10 minute inactivity window).
Structured output schemas are included in the prefix, so they improve cache hit rates rather than degrading them [3].
Cache hits return a 50% discount on input token pricing. [1]
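
The discount applies only to the cached portion of the prompt, so the saving on a whole request is smaller than the headline figure. A rough worked example, assuming the 50% discount stated above; `blended_input_cost` is a hypothetical helper and the $2.00 per million tokens price is a placeholder, not a quoted rate.

```python
def blended_input_cost(
    input_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.5,
) -> float:
    """Input cost in dollars when `cached_tokens` of the prompt hit the cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    # Cached tokens are billed at (1 - discount) of the normal rate.
    billable = uncached + cached_tokens * (1 - cached_discount)
    return billable * price_per_million / 1_000_000


# 4,000-token prompt with a 3,000-token cached prefix, placeholder price.
with_cache = blended_input_cost(4_000, 3_000, price_per_million=2.00)
without_cache = blended_input_cost(4_000, 0, price_per_million=2.00)
saving = 1 - with_cache / without_cache
```

With these inputs the request-level saving is 37.5%, because the 1,000 uncached tokens are still billed at the full rate.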

Implementation Strategy

The most important thing you can do for OpenAI prompt caching is prefix stability. Every agent turn should share an identical prefix — system prompt, tool definitions, and structured output schema — with only the conversation history varying.

from openai import OpenAI

client = OpenAI()

# The stable prefix: system prompt + tools never change between agent turns
SYSTEM_PROMPT = """You are a code review assistant. Analyze pull requests for:
1. Security vulnerabilities
2. Performance regressions
3. Code style violations
4. Test coverage gaps

Provide specific line-level feedback with severity ratings."""

# The Responses API expects function tools in a flat shape: name, description,
# and parameters sit next to "type" rather than under a nested "function" key.
TOOLS = [
    {
        "type": "function",
        "name": "search_code",
        "description": "Search codebase for patterns",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string"},
                "path": {"type": "string"}
            },
            "required": ["pattern"]
        }
    },
    # ... additional tool definitions remain constant
]

def agent_turn(messages: list[dict]) -> str:
    """Single agent turn with stable prefix for cache hits."""
    response = client.responses.create(
        model="gpt-4o",
        input=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *messages  # Only this section varies between turns
        ],
        tools=TOOLS,
        tool_choice="auto"
    )
    return response.output_text

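The key optimization: in production, the system prompt and tools array typically form a stable prefix of 2,000–4,000+ tokens, well above the 1,024-token threshold (the shortened example above would not reach it on its own). Every agent turn after the first is then eligible for a cache hit on that prefix, provided it arrives while the cache is still warm.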

Caveats

There’s no way to force a cache hit. If the prefix varies even slightly (whitespace, ordering), the hash changes and you get a cache miss.
The 5–10 minute window resets on each cache hit. Sustained traffic keeps the cache warm.
For sporadic workloads (one request every 15+ minutes), the cache is typically cold and you won’t see savings.
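
Since a single changed character causes a miss, it can help to fingerprint the prefix and log it with each request, so drift shows up as a changed hash. A minimal sketch using only the standard library; `prefix_fingerprint` is a hypothetical helper, and it deliberately does not sort keys, so reordered fields count as drift.

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Short hash of the stable prefix; any byte-level change alters it."""
    # Dict insertion order is preserved on purpose: reordered keys or tools
    # change what is sent to the provider, so they should change the hash.
    serialized_tools = json.dumps(tools, separators=(",", ":"), ensure_ascii=False)
    payload = system_prompt + "\n" + serialized_tools
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


base = prefix_fingerprint("You are a reviewer.", [{"name": "search_code"}])
same = prefix_fingerprint("You are a reviewer.", [{"name": "search_code"}])
trailing_space = prefix_fingerprint("You are a reviewer. ", [{"name": "search_code"}])
```

Here `base` and `same` match, while `trailing_space` differs. A fingerprint that changes between deployments is a likely explanation for a sudden drop in cache hits.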

Anthropic: Explicit Cache Breakpoints

Anthropic takes the opposite approach from OpenAI: caching is opt-in with explicit cache_control breakpoints placed wherever you want the cached prefix to end [4].

How it Works

You mark specific content blocks in your prompt with {"cache_control": {"type": "ephemeral"}}. Everything from the start of the block list up to that breakpoint is cached. The cache persists for 5 minutes of inactivity and is isolated at the workspace level (since February 2026).

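Pricing depends on whether tokens are written to or read from the cache, not on the type of content:

Token Category	Price Relative to Base Input
Cache read (hit)	~0.1x (about 90% discount)
Cache write (5-minute TTL)	~1.25x (25% premium)
Uncached input (after the last breakpoint)	1x (no discount)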

Implementation

from anthropic import Anthropic

client = Anthropic()

def agent_turn(user_message: str, history: list[dict]) -> str:
    """Agent turn with cache breakpoints on system prompt and tools."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": """You are a code review assistant. Analyze pull requests for:
1. Security vulnerabilities
2. Performance regressions
3. Code style violations
4. Test coverage gaps

Provide specific line-level feedback with severity ratings.""",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        tools=[
            {
                "name": "search_code",
                "description": "Search codebase for patterns",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "pattern": {"type": "string"},
                        "path": {"type": "string"}
                    },
                    "required": ["pattern"]
                },
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            *history,
            {"role": "user", "content": user_message}
        ]
    )
    return response.content[0].text

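The cache_control marker tells Anthropic where the cacheable prefix ends: everything up to and including the marked block is written to the cache. On subsequent requests with identical prefix content, the server reads the cached values instead of recomputing them. Without a marker, no caching occurs regardless of prompt length. The shortened prompt in this example is below the minimum cacheable length; a real system prompt and tool set must exceed it for the markers to take effect.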

Breakpoint Strategy

Place breakpoints strategically: one at the end of the system prompt, one at the end of tool definitions, and optionally one at key points in conversation history for long-running agent sessions. The markers themselves add no token cost; what you pay for is the cache write covering the content in front of them. A request can carry up to four breakpoints.
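
For the conversation-history case, the breakpoint goes on the last content block of the most recent message. A minimal sketch of that transformation on plain message dicts; `with_history_breakpoint` is a hypothetical helper, and it assumes every message has non-empty content.

```python
import copy


def with_history_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with a cache breakpoint on the final block."""
    if not messages:
        return []
    marked = copy.deepcopy(messages)  # never mutate the caller's history
    last = marked[-1]
    # String content has to become block form before it can carry cache_control.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "Review PR 1"},
    {"role": "assistant", "content": "Found two issues."},
]
marked_history = with_history_breakpoint(history)
```

The result can be passed as the `messages` argument in place of the raw history. Only the final message is marked, so the breakpoint moves forward as the conversation grows.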

Caveats

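Cache writes are billed at a premium over normal input tokens, so a prefix that is written but never read back costs more than not caching it. Prefixes shorter than the model's minimum cacheable length (1,024 tokens on most models) are not cached at all.
Citation blocks in a response cannot themselves be cached, although the source documents they cite can be.
Any change to the cached prefix, even a single character in the system prompt, produces a miss and a fresh cache write; the old entry simply expires. Because tools come before the system prompt in the prefix, changing a tool definition also invalidates the system-prompt cache.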
Workspace-level isolation means caches from different workspaces don’t interfere (or share benefits).
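
Whether caching pays off depends on how many requests reuse the prefix before it expires. A rough break-even sketch, assuming a cache write is billed at a premium multiplier and a cache read at a discounted one. The defaults of 1.25 and 0.1 are assumptions to replace with current pricing, and both function names are hypothetical.

```python
import math


def break_even_requests(write_multiplier: float = 1.25, read_multiplier: float = 0.1) -> int:
    """Smallest request count at which caching the prefix is strictly cheaper."""
    if not 0 <= read_multiplier < 1:
        raise ValueError("read_multiplier must be in [0, 1)")
    # Uncached prefix cost over n requests: n.
    # Cached: one write plus (n - 1) reads = write + (n - 1) * read.
    # Cached is cheaper when n > (write - read) / (1 - read).
    threshold = (write_multiplier - read_multiplier) / (1 - read_multiplier)
    return max(1, math.floor(threshold) + 1)


def relative_prefix_cost(n_requests: int, write_multiplier: float = 1.25,
                         read_multiplier: float = 0.1) -> float:
    """Cached prefix cost as a fraction of the uncached cost over n requests."""
    cached = write_multiplier + (n_requests - 1) * read_multiplier
    return cached / n_requests
```

Under these assumed multipliers caching is already cheaper on the second request, and over ten requests the prefix costs about 21.5% of the uncached total. This assumes every request after the first is a hit.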

Google Gemini: Explicit Context Caching

Google offers two modes. Implicit caching is automatic on Gemini 2.5+ but carries no cost guarantee. Explicit caching via the CachedContent API is the recommended approach for production workloads [5].

How it Works

You create a CachedContent object with your prefix content, set a TTL, and reference it in subsequent generateContent calls. The cache is a named resource that persists until the TTL expires (1 hour if you do not set one) or until it is manually deleted.

Requirements:

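Minimum 32,768 tokens for the cached content (the figure used throughout this guide; Google has lowered the minimum on newer models, so confirm the value for your model in the current docs [5])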
Supported models: Gemini 1.5 Pro/Flash, Gemini 2.0 Flash, Gemini 2.5 Pro/Flash

Implementation

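import datetime

import google.generativeai as genai
from google.generativeai import caching  # CachedContent lives in this submodule

# NOTE: this is the legacy google-generativeai package; the newer google-genai
# SDK exposes the same feature through client.caches.
genai.configure(api_key="YOUR_API_KEY")

# Step 1: Create a cached context (one-time setup)
system_content = """You are a code review assistant. Analyze pull requests for:
1. Security vulnerabilities
2. Performance regressions
3. Code style violations
4. Test coverage gaps

Provide specific line-level feedback with severity ratings."""

cached_content = caching.CachedContent.create(
    model="models/gemini-2.5-pro",
    display_name="code-review-system-prompt",
    system_instruction=system_content,
    # In practice, put the large shared context here (documents, examples);
    # creation fails if the total is below the model's minimum token count.
    contents=[],
    ttl=datetime.timedelta(hours=1),  # 1 hour cache lifetime
)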

cache_name = cached_content.name
print(f"Created cache: {cache_name}")

# Step 2: Reference the cache in inference calls
model = genai.GenerativeModel.from_cached_content(
    cached_content=cached_content
)

def agent_turn(user_message: str) -> str:
    response = model.generate_content(user_message)
    return response.text

The cost structure is clear: cache storage costs are fixed per token-hour, while cache hits save roughly 80% on input token pricing compared to full computation. [2]
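
Because storage is billed by time and savings accrue per hit, there is a minimum hit rate below which an explicit cache loses money. A rough worked example, assuming the roughly 80% hit discount stated above; `min_hits_per_hour` is a hypothetical helper and both prices are placeholders, not quoted rates.

```python
def min_hits_per_hour(
    input_price_per_million: float,
    storage_price_per_million_hour: float,
    hit_discount: float = 0.8,
) -> float:
    """Cache hits per hour needed for read savings to cover storage cost."""
    if not 0 < hit_discount <= 1:
        raise ValueError("hit_discount must be in (0, 1]")
    # Each hit saves (price * discount) per million cached tokens, while
    # storage costs a flat amount per million cached tokens per hour.
    saving_per_hit = input_price_per_million * hit_discount
    return storage_price_per_million_hour / saving_per_hit


# Placeholder prices: $1.00 per million input tokens, $4.00 per million
# cached tokens per hour of storage.
required = min_hits_per_hour(1.00, 4.00)
```

With these placeholder prices the cache needs about five hits per hour to break even. The cached token count cancels out, so the threshold depends only on the ratio of the storage price to the per-hit saving.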

Multi-Turn Pattern

For long-running agent sessions, you can keep a cache alive by extending its expiry. The cached content itself is immutable; only the TTL or expiry time can be updated:

import datetime

# Extend the cache lifetime by another hour (contents cannot be changed in place)
cached_content.update(ttl=datetime.timedelta(hours=1))

To make accumulated conversation history part of the cached prefix, create a new CachedContent that includes it and delete the old one. Each new cache is billed as a fresh write plus storage, so this pays off only when many later turns reuse the longer prefix.

Caveats

The 32,768 token minimum means explicit caching only makes sense for medium-to-large prompts.
A CachedContent resource carries its own storage cost, billed per token-hour for as long as it exists, whether or not any request reads from it.
TTL management is manual — if you don’t update or extend the TTL, the cache evaporates.
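
Manual TTL management usually comes down to a periodic check that extends the cache shortly before it lapses. A minimal sketch of that decision; `needs_refresh` is a hypothetical helper, and it assumes the caller reads the expiry timestamp from the cached-content resource and passes it in.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(
    expire_time: datetime,
    now: datetime,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """True when the cache expires within `margin` of `now`, or already has."""
    return expire_time - now <= margin


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
expiring_soon = needs_refresh(now + timedelta(minutes=3), now)
still_fresh = needs_refresh(now + timedelta(minutes=30), now)
```

Passing `now` explicitly keeps the check deterministic in tests. A background job can call it once a minute and extend the TTL only when it returns True.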

A Unified Multi-Provider Strategy

When routing across providers, you need a strategy that works for all three. The common pattern is prefix normalization: ensure the system prompt, tool definitions, and output schema are byte-identical across requests for a given task, regardless of which provider handles it.

from dataclasses import dataclass

@dataclass
class StablePrefix:
    """A normalized, cacheable prefix shared across providers."""
    system_prompt: str
    tools_definitions: str  # serialized JSON
    output_schema: str | None = None

    @property
    def normalized_system(self) -> str:
        """Return byte-identical system prompt for cache stability."""
        return self.system_prompt.strip()

    def token_estimate(self, model: str) -> int:
        """Rough token count for cache eligibility checks."""
        # ~4 chars per token for English text
        total = len(self.normalized_system) // 4
        total += len(self.tools_definitions) // 3  # JSON is denser
        if self.output_schema:
            total += len(self.output_schema) // 3
        return total

The optimization targets:

System prompt stability — Same prompt text across all provider calls for a task. Even a single changed word invalidates the cache prefix.
Tool definition normalization — Serialize tools in a consistent order (alphabetical by name) and format across all providers.
Agent turn batching — Group calls by task type so identical prefixes cluster together in time, keeping caches warm.
Cache-aware deployment — Route similar tasks to the same model instance when possible to maximize cache locality.
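
Tool definition normalization can be sketched as one provider-neutral spec list rendered into each provider's shape in a fixed order. The converters below are hypothetical helpers. They emit the flat function-tool shape used by OpenAI's Responses API (Chat Completions nests the same fields under a "function" key) and the `input_schema` shape Anthropic uses. `list_files` is an invented second tool for illustration.

```python
def _sorted_specs(specs: list[dict]) -> list[dict]:
    """Order tools alphabetically by name so serialization is stable."""
    return sorted(specs, key=lambda spec: spec["name"])


def to_openai_tools(specs: list[dict]) -> list[dict]:
    """Render neutral specs as flat Responses-API function tools."""
    return [
        {
            "type": "function",
            "name": spec["name"],
            "description": spec["description"],
            "parameters": spec["parameters"],
        }
        for spec in _sorted_specs(specs)
    ]


def to_anthropic_tools(specs: list[dict]) -> list[dict]:
    """Render neutral specs as Anthropic tools (schema under input_schema)."""
    return [
        {
            "name": spec["name"],
            "description": spec["description"],
            "input_schema": spec["parameters"],
        }
        for spec in _sorted_specs(specs)
    ]


SPECS = [
    {
        "name": "search_code",
        "description": "Search codebase for patterns",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
    {
        "name": "list_files",  # hypothetical tool, for illustration only
        "description": "List files under a path",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]
```

Because both converters sort first, the order in which tools are registered at runtime no longer affects the serialized prefix.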

Real-World Savings

Published figures from production deployments in 2026 show consistent results:

An e-commerce support agent using Anthropic prompt caching reduced input token costs by roughly 70% and latency by 60% after adding breakpoints to the system prompt and tool definitions [1].
A code analysis pipeline using OpenAI’s automatic prefix caching reported approximately 50% savings on input tokens for sustained-review workloads where the same system prompt was used across hundreds of pull requests per hour [2].
A multi-model RAG pipeline using Gemini explicit caching cut input costs by roughly 75% on cached prefixes, with cache storage costs adding less than 5% overhead [6].

The common thread: any sustained production workload with a stable system prompt and repeated invocations will benefit. The savings are trivial for one-off requests and significant for pipelines processing hundreds or thousands of calls per hour.

When Not to Cache

Prompt caching is not a silver bullet. Skip it when:

Short-lived sessions — If each user session makes 1–3 calls and there’s no shared system prompt across sessions, the cache is unlikely to hit.
Dynamic system prompts — If every request uses a different system instruction (e.g., user-provided prompts), there’s no stable prefix to cache.
Throughput below threshold — Fewer than one request per 5–10 minutes (per provider window) means the cache expires between calls.
Prompt under threshold — OpenAI needs 1,024 tokens minimum; Gemini explicit needs 32,768 tokens. Short prompts get no benefit.
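
The throughput criterion can be made quantitative with a simple arrival model. The sketch below assumes requests arrive as a Poisson process, which is a modeling assumption and not something any provider documents, and estimates the chance that the next request lands before the inactivity window closes. `warm_probability` is a hypothetical helper.

```python
import math


def warm_probability(requests_per_hour: float, window_minutes: float) -> float:
    """P(next request arrives within the window), assuming Poisson arrivals."""
    if requests_per_hour < 0 or window_minutes < 0:
        raise ValueError("rates and windows must be non-negative")
    rate_per_minute = requests_per_hour / 60
    # For a Poisson process, the gap to the next arrival is exponential.
    return 1 - math.exp(-rate_per_minute * window_minutes)


sparse = warm_probability(requests_per_hour=4, window_minutes=5)
busy = warm_probability(requests_per_hour=60, window_minutes=5)
```

Under this model, four requests per hour against a 5-minute window finds the cache warm about 28% of the time, while sixty per hour finds it warm more than 99% of the time.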

Next Steps

Audit your prompts — Measure the token count of your stable prefix (system prompt + tools). Anything above 1,024 tokens is a candidate.
Add cache markers — For Anthropic, add cache_control breakpoints. For OpenAI, ensure prefix stability. For Gemini, evaluate the 32K threshold.
Measure before and after — Track usage.cache_creation_input_tokens and usage.cache_read_input_tokens in Anthropic responses. OpenAI reports cached prompt tokens as usage.input_tokens_details.cached_tokens in the Responses API (usage.prompt_tokens_details.cached_tokens in Chat Completions). Gemini reports them as usage_metadata.cached_content_token_count.
Monitor cache hit rate — A hit rate below 60% usually means prefix instability or low request density. Diagnose and fix before investing in more caching. [3]
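
Hit rate can be computed directly from logged usage records. A minimal sketch for Anthropic-shaped usage dicts, using the field names mentioned above. It assumes `input_tokens` counts only the uncached tokens, so total input is the sum of all three fields. `aggregate_hit_rate` and `is_below_target` are hypothetical helpers, and the 0.6 default mirrors the 60% guideline.

```python
def aggregate_hit_rate(usages: list[dict]) -> float:
    """Fraction of all input tokens that were read from cache."""
    read = written = uncached = 0
    for usage in usages:
        # Fields may be missing or null on some responses; treat both as 0.
        read += usage.get("cache_read_input_tokens") or 0
        written += usage.get("cache_creation_input_tokens") or 0
        uncached += usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


def is_below_target(usages: list[dict], target: float = 0.6) -> bool:
    """True when the aggregate hit rate falls short of `target`."""
    return aggregate_hit_rate(usages) < target


first_call = {"input_tokens": 100, "cache_creation_input_tokens": 900, "cache_read_input_tokens": 0}
later_call = {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 900}
```

Aggregating by tokens, not by request count, weights large prompts appropriately. The first call in any window always registers as a miss, so judge the rate over a long enough sample.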

Prompt caching won’t save you from bad architecture, but it will cut your LLM bill by a meaningful margin for any production system with a stable prompt. The implementation is simple once you understand each provider’s model — and the payoff compounds with every request your system processes.

Sources

[1] “Prompt Caching Infrastructure: Reducing LLM Costs and Latency” — Introl, March 2026. https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025

[2] “Prompt caching” — OpenAI API docs. https://developers.openai.com/api/docs/guides/prompt-caching

[3] “Prompt Caching 201” — OpenAI Developers, February 2026. https://developers.openai.com/cookbook/examples/prompt_caching_201

[4] “Prompt caching” — Anthropic Claude API Docs. https://platform.claude.com/docs/en/build-with-claude/prompt-caching

[5] “Context caching - generateContent API” — Google AI for Developers. https://ai.google.dev/gemini-api/docs/caching

[6] “Prompt Caching in 2026: Cut LLM Costs, Keep Quality” — Digital Applied, June 2026. https://www.digitalapplied.com/blog/prompt-caching-2026-cut-llm-costs-engineering-guide
