Cross-Provider Structured Outputs: A Production Guide for OpenAI, Anthropic, and Gemini
The bottom line: Every major LLM provider now supports native structured outputs — schema-guaranteed JSON that doesn’t need regex parsing or prompt engineering. This guide covers three approaches — native APIs, library-based generation with Instructor, and grammar-constrained decoding with Outlines — with working Python code you can drop into production today.
The Problem: Prompting for JSON Is Not a Strategy
If you’ve ever told an LLM “respond in JSON format” and then written a regex to handle the inconsistent results, you know the pain. OpenAI’s own guidance now explicitly states that JSON mode (type: "json_object") is considered legacy — it guarantees valid JSON syntax but not schema adherence [1]. A model can return valid JSON that still has wrong field names, wrong types, or missing fields.
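A small sketch of that gap, using only the standard library: the payload below parses cleanly as JSON yet fails a hand-written field check. `EXPECTED_FIELDS` and `schema_problems` are illustrative names, not part of any provider SDK.

```python
import json

# Hypothetical contract: field name -> accepted Python type(s).
EXPECTED_FIELDS = {"claim_text": str, "confidence": (int, float), "category": str}

def schema_problems(raw: str) -> list[str]:
    """Return human-readable schema violations for a JSON string."""
    data = json.loads(raw)  # JSON mode only guarantees that this line succeeds
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected_type):
            # bool is excluded explicitly because it is a subclass of int
            problems.append(f"wrong type for {name}: {type(data[name]).__name__}")
    for name in data.keys() - EXPECTED_FIELDS.keys():
        problems.append(f"unexpected field: {name}")
    return problems

# Valid JSON, but one field is renamed and one has the wrong type.
model_output = '{"claim": "Sales rose 4%", "confidence": "high", "category": "finance"}'
problems = schema_problems(model_output)
print(problems)
```

Every item in `problems` is something JSON mode would have let through.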
The production-grade approaches in 2026 are:
- Native structured outputs: each provider’s own schema-enforced API (OpenAI response_format, Anthropic output_config, Gemini response_schema)
- Library-based generation: Instructor patches the provider client to enforce Pydantic schemas with automatic retries
- Grammar-constrained decoding: Outlines and XGrammar constrain the model’s token generation to only produce valid JSON matching a schema
Each approach has different tradeoffs for latency, cost, provider support, and flexibility. This guide covers all three with production patterns.
Approach 1: Native Structured Outputs by Provider
OpenAI: Strict JSON Schema Enforcement
OpenAI’s structured outputs use constrained decoding at the token level — the model literally cannot generate tokens that violate the schema. This is available on GPT-4o, GPT-4o-mini, and o-series models [1].
```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ExtractedClaim(BaseModel):
    claim_text: str
    confidence: float
    category: str
    source_reference: str | None = None

text = "..."  # placeholder: the document to extract claims from

# JSON Schema is derived from the Pydantic model automatically.
# Newer SDK releases also expose this helper as client.chat.completions.parse.
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract claims from the following text."},
        {"role": "user", "content": text},
    ],
    response_format=ExtractedClaim,
)

claim = completion.choices[0].message.parsed
print(f"Claim: {claim.claim_text}, Confidence: {claim.confidence}")
```
Key points about OpenAI’s implementation:
- response_format accepts either a JSON Schema directly or any Pydantic BaseModel when using the parse() helper
- The model cannot produce tokens that violate the schema; this is enforced at generation time, not validated afterwards
- Schema support includes nested objects, arrays, enums, nullable fields, and anyOf. Strict mode requires every property to be listed as required, so an optional field is expressed as a union with null, and allOf is not supported
- Refusal detection: if the model refuses, the refusal field is set on the message; otherwise it’s None
- Supported models: GPT-4o from gpt-4o-2024-08-06 onward, GPT-4o-mini, o1, o3, and o4-mini
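The refusal behaviour in the key points can be handled with a small guard before the parsed object is used. This is a sketch under assumptions: `unwrap_parsed` and `ModelRefused` are hypothetical names, and `SimpleNamespace` objects stand in for `completion.choices[0].message` so the example runs without network access.

```python
from types import SimpleNamespace

class ModelRefused(Exception):
    """Raised when the model declined to answer instead of filling the schema."""

def unwrap_parsed(message):
    """Return message.parsed, or raise if the refusal field is set."""
    refusal = getattr(message, "refusal", None)
    if refusal is not None:
        raise ModelRefused(refusal)
    if message.parsed is None:
        raise ValueError("no parsed payload and no refusal; check finish_reason")
    return message.parsed

# Stand-ins for completion.choices[0].message, so the sketch runs offline.
ok_message = SimpleNamespace(refusal=None, parsed={"claim_text": "Sales rose 4%"})
refused_message = SimpleNamespace(refusal="I can't help with that.", parsed=None)

print(unwrap_parsed(ok_message))
try:
    unwrap_parsed(refused_message)
except ModelRefused as exc:
    print(f"refused: {exc}")
```

Raising a dedicated exception keeps refusals out of the retry path, since re-sending the same prompt is unlikely to change the outcome.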
For replay and auditing, create the completion with store=True and fetch it later by ID with client.chat.completions.retrieve(). Keep the schema version you sent in your own logs so that a replay uses the same contract.
Anthropic Claude: Output Configuration with Grammars
Anthropic added structured outputs to Claude Sonnet 4.5 and Opus 4.1 in late 2025; they are configured through the output_config parameter. Anthropic compiles the schema into a grammar that constrains decoding, and that grammar applies only to Claude’s direct output, not to tool use calls or thinking tags [2].
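```python
import anthropic

client = anthropic.Anthropic()

text = "..."  # placeholder: the passage to analyse

response = client.messages.create(
    model="claude-sonnet-4-5",  # structured outputs need Sonnet 4.5 / Opus 4.1 or newer
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract the key entities from: " + text}],
    # Shape as understood from the structured outputs docs [2]; verify against your SDK version.
    output_config={
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "entities": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "type": {"type": "string", "enum": ["person", "org", "location", "product"]},
                                # Numeric bounds (minimum/maximum) are not enforced by the grammar,
                                # so the range lives in the description and is checked downstream.
                                "confidence": {"type": "number", "description": "Between 0 and 1"},
                            },
                            "required": ["name", "type", "confidence"],
                            "additionalProperties": False,
                        },
                    }
                },
                "required": ["entities"],
                "additionalProperties": False,
            },
        }
    },
)
print(response.content[0].text)
```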
Key points about Anthropic’s implementation:
- The grammar resets between sections (thinking, tool calls, final response), allowing Claude to reason freely in extended thinking mode while still producing structured final output
- To use with extended thinking, you configure the thinking parameter alongside output_config; the thinking block is unconstrained, and only the final response is grammar-enforced
- Citation support is incompatible with output_config (returns 400)
- Prefix-filling (prefilling the assistant response) is incompatible with JSON outputs
- Returns a 400 error if the schema is invalid or incompatible with model capabilities
Google Gemini: response_schema with JSON Schema
Gemini’s structured outputs support JSON Schema natively, and recent improvements (November 2025) added full JSON Schema compatibility including Pydantic and Zod integration [3].
```python
from google import genai
from google.genai.types import GenerateContentConfig
from pydantic import BaseModel

client = genai.Client(api_key="YOUR_API_KEY")

class AnalysisResult(BaseModel):
    summary: str
    key_findings: list[str]
    risk_score: float
    recommended_actions: list[str]

text = "..."  # placeholder: the document to analyse

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=text,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=AnalysisResult,
    ),
)

# response.parsed holds an AnalysisResult instance built from the JSON text
result = response.parsed
print(f"Risk score: {result.risk_score}")
```
Key points about Gemini’s implementation:
- Set response_mime_type to "application/json" and pass your schema to response_schema
- Accepts Pydantic BaseModel, dataclass, or plain dict/JSON Schema
- JSON Schema support includes nested objects, arrays, enums, $ref, allOf, and oneOf
- Works with both generate_content and generate_content_stream (streamGenerateContent in the REST API)
- Gemini 2.5 Pro, 2.5 Flash, and 2.0 Flash all support structured outputs
- Batch mode (.jsonl files) also supports inline response schemas per line
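When streaming, each chunk carries only a fragment of the JSON text, so the fragments need to be joined before parsing. A minimal sketch, with a plain list of strings standing in for the text of each streamed chunk; `collect_streamed_json` is a hypothetical helper name.

```python
import json
from typing import Iterable

def collect_streamed_json(text_chunks: Iterable[str]) -> dict:
    """Join streamed text fragments and parse them once the stream ends."""
    buffer = []
    for chunk in text_chunks:
        if chunk:  # some chunks may carry no text
            buffer.append(chunk)
    return json.loads("".join(buffer))

# Stand-in for the text of each streamed chunk; no single fragment is valid JSON.
fake_chunks = ['{"summary": "Stable', ' outlook", "risk_', 'score": 0.2}']
result = collect_streamed_json(fake_chunks)
print(result["risk_score"])
```

In a real call the list would be replaced by a generator that yields the text of each chunk as it arrives.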
Approach 2: Library-Based Structured Outputs with Instructor
Instructor is a Python library that wraps any provider’s client and enforces Pydantic schema compliance with automatic retries, validation, and streaming support. It is built on Pydantic, maintained by Jason Liu and the 567-Labs contributors, and supports OpenAI, Anthropic, Gemini, Cohere, Mistral, and any OpenAI-compatible endpoint [4].
```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# Patch any OpenAI-compatible client
client = instructor.from_openai(OpenAI())

class MedicalRecord(BaseModel):
    patient_id: str
    diagnosis: str
    medications: list[str]
    follow_up_date: str | None = None
    severity: str = Field(description="one of: low, medium, high, critical")

raw_clinical_notes = "..."  # placeholder: free-text notes to extract from

# Just call with response_model; Instructor handles the rest
record = client.chat.completions.create(
    model="gpt-4o",
    response_model=MedicalRecord,
    messages=[
        {"role": "system", "content": "Extract structured medical record data."},
        {"role": "user", "content": raw_clinical_notes},
    ],
)
print(f"Diagnosis: {record.diagnosis}, Severity: {record.severity}")
```
Instructor automatically:
- Converts the Pydantic model into the right schema format for your provider
- Validates the response against the Pydantic model after generation (or incrementally when streaming)
- Retries automatically on validation failure (configurable max retries)
- Handles nested models, Union types, and Optional fields
- Supports streaming mode for real-time partial parsing
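The validate-then-retry behaviour in this list can be pictured as a loop that appends the validation errors to the conversation and asks again. The sketch below is a conceptual illustration with a fake model, not Instructor’s actual implementation; `extract_with_reask`, `validate_record`, and `fake_llm` are hypothetical names, and a hand-written check stands in for Pydantic.

```python
import json
from typing import Callable

def validate_record(data: dict) -> list[str]:
    """Hand-rolled stand-in for Pydantic validation; returns error messages."""
    errors = []
    if not isinstance(data.get("diagnosis"), str):
        errors.append("diagnosis must be a string")
    if data.get("severity") not in {"low", "medium", "high", "critical"}:
        errors.append("severity must be one of: low, medium, high, critical")
    return errors

def extract_with_reask(call_llm: Callable[[list[dict]], str],
                       messages: list[dict], max_retries: int = 2) -> dict:
    """Call the model, validate, and re-ask with the errors appended on failure."""
    history = list(messages)
    errors: list[str] = []
    for _ in range(max_retries + 1):
        raw = call_llm(history)
        try:
            data = json.loads(raw)
            errors = validate_record(data) if isinstance(data, dict) else ["expected an object"]
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc}"]
        if not errors:
            return data
        # Show the model its own output plus what was wrong with it.
        history.append({"role": "assistant", "content": raw})
        history.append({"role": "user",
                        "content": "Fix these validation errors: " + "; ".join(errors)})
    raise ValueError(f"validation failed after {max_retries + 1} attempts: {errors}")

# Fake model: wrong severity first, corrected on the second call.
replies = iter(['{"diagnosis": "flu", "severity": "severe"}',
                '{"diagnosis": "flu", "severity": "high"}'])
calls = []

def fake_llm(history: list[dict]) -> str:
    calls.append(len(history))  # record how much context each call received
    return next(replies)

record = extract_with_reask(fake_llm, [{"role": "user", "content": "Extract the record."}])
print(record, calls)
```

The second call sees three messages instead of one, which is why each retry costs more input tokens than the first attempt.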
For Anthropic specifically:
```python
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

# MedicalRecord and raw_clinical_notes are the same as in the OpenAI example
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    response_model=MedicalRecord,
    messages=[{"role": "user", "content": raw_clinical_notes}],
)
```
The advantage of Instructor over native APIs is provider portability: the same Pydantic model and the same calling pattern work with any supported provider. If you need to switch from OpenAI to Anthropic, you change the import, the client initialization, and the model name (Anthropic also requires max_tokens).
Approach 3: Grammar-Constrained Decoding with Outlines
Outlines takes a fundamentally different approach. Instead of wrapping a provider’s API, it constrains the token generation process directly using a grammar. This works at the token sampling level — the model’s probability distribution is masked so that tokens that would violate the schema have probability zero [5].
```python
# Pre-1.0 Outlines interface; 1.x releases load models with outlines.from_transformers(...) instead
from outlines import models, generate
from pydantic import BaseModel

# Load any model via Transformers, vLLM, or Ollama
model = models.transformers("microsoft/Phi-3-medium-4k-instruct")

class CodeReview(BaseModel):
    file_path: str
    issues: list[dict]
    severity: str
    suggestion: str

# Constrain generation to match the Pydantic schema
generator = generate.json(model, CodeReview)
result = generator(
    "Review this Python file for memory leaks and threading bugs."
)
print(result.issues)
```
Outlines supports:
- Local models via Transformers, vLLM, ExLlamaV2, and llama.cpp
- Remote models via OpenAI, Anthropic, and any OpenAI-compatible endpoint
- Multiple modes: JSON schema, multiple choice, regular expressions, and context-free grammars
- Batch processing for N-samples-per-prompt
The killer use case for Outlines is local structured generation with quantized models. If you’re running a 7B or 13B model on-premises and need guaranteed JSON output, the usual choices are Outlines itself [5] or vLLM’s built-in guided decoding with the XGrammar backend. XGrammar is a separate grammar engine, and its authors report up to 3.5x faster JSON Schema generation than alternative grammar engines.
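To make the masking idea from this section concrete, here is a toy greedy decoder over a seven-token vocabulary. Everything in it is made up for illustration (the vocabulary, the hand-written grammar, the scores); real engines compile a JSON Schema into a grammar over the model’s actual tokenizer.

```python
import math

VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]

def allowed_next(prefix: list[str]) -> set[str]:
    """Tiny hand-written grammar accepting {"ok":true} or {"ok":false}."""
    order = [{"{"}, {'"ok"'}, {":"}, {"true", "false"}, {"}"}]
    return order[len(prefix)] if len(prefix) < len(order) else set()

def constrained_greedy(logits_per_step: list[list[float]]) -> str:
    """Greedy decoding where disallowed tokens get a logit of -inf."""
    output: list[str] = []
    for logits in logits_per_step:
        allowed = allowed_next(output)
        if not allowed:
            break  # grammar is complete
        # A logit of -inf becomes probability zero after softmax.
        masked = [score if tok in allowed else -math.inf
                  for tok, score in zip(VOCAB, logits)]
        output.append(VOCAB[masked.index(max(masked))])
    return "".join(output)

# Made-up scores in which the unconstrained favourite is always "hello".
fake_logits = [[0.1, 0.0, 0.2, 0.0, 0.3, 0.4, 9.0] for _ in range(5)]
print(constrained_greedy(fake_logits))  # prints {"ok":false}
```

Without the mask the decoder would emit "hello" five times; with it, the highest-scoring token that the grammar permits wins at every step.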
Production Patterns
Pattern 1: Unified Router Across Providers
For production systems, you want a single interface that routes to the best available structured output method:
```python
from enum import Enum
from typing import Protocol

from pydantic import BaseModel

class Provider(Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GEMINI = "gemini"
    LOCAL_OUTLINES = "local"

class StructuredOutputProvider(Protocol):
    def extract(self, model: str, prompt: str, schema: type[BaseModel]) -> BaseModel:
        ...

class Router:
    def __init__(self):
        self.providers: dict[Provider, StructuredOutputProvider] = {}

    def route(
        self,
        schema: type[BaseModel],
        prompt: str,
        prefer: Provider | None = None
    ) -> BaseModel:
        # Try preferred provider, fall back to alternatives
        providers_to_try = (
            [prefer] + [p for p in Provider if p != prefer]
            if prefer else list(Provider)
        )
        last_error = None
        for provider in providers_to_try:
            try:
                impl = self.providers.get(provider)
                if not impl:
                    continue
                return impl.extract("default", prompt, schema)
            except Exception as e:
                last_error = e
                continue
        raise RuntimeError(f"All providers failed: {last_error}")
```
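The fallback behaviour is easier to see with stub providers. The sketch below is a dependency-free variant of the same idea, not the Router class itself: a dataclass stands in for the Pydantic schema, and `FailingProvider`, `StubProvider`, and `route` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    """Stand-in for a Pydantic schema so the sketch has no dependencies."""
    total: float

class FailingProvider:
    def extract(self, model, prompt, schema):
        raise TimeoutError("provider unavailable")

class StubProvider:
    def extract(self, model, prompt, schema):
        return schema(total=42.0)

def route(providers: dict, order: list[str], prompt: str, schema):
    """Try providers in order and return the first successful result."""
    errors = {}
    for name in order:
        impl = providers.get(name)
        if impl is None:
            continue  # provider not registered
        try:
            return name, impl.extract("default", prompt, schema)
        except Exception as exc:
            errors[name] = exc  # keep every failure, not only the last one
    raise RuntimeError(f"All providers failed: {errors}")

providers = {"openai": FailingProvider(), "gemini": StubProvider()}
used, invoice = route(providers, ["openai", "anthropic", "gemini"], "Total due: 42", Invoice)
print(used, invoice)
```

Here the first provider times out, the second is not registered, and the third answers. Returning the provider name alongside the result makes it possible to log which backend actually served each request.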
Pattern 2: Validation Chain with Retries
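Schema enforcement guarantees the shape of the output, not its correctness: values can still be wrong or implausible, especially in complex nested schemas, and a response cut off at the token limit may not parse at all. Implement a validation chain: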
```python
from pydantic import BaseModel, ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential

class Extractor:
    MAX_RETRIES = 3

    def _call_llm(self, prompt: str, model: type[BaseModel]) -> dict:
        # Provider-specific call returning the raw parsed JSON; implement for your client
        raise NotImplementedError

    @retry(
        stop=stop_after_attempt(MAX_RETRIES),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,  # surface the last real error instead of tenacity.RetryError
    )
    def extract_with_retry(
        self,
        prompt: str,
        model: type[BaseModel]
    ) -> BaseModel:
        raw = self._call_llm(prompt, model)
        # Validate the parsed output
        try:
            return model.model_validate(raw)
        except ValidationError as e:
            # Feed validation errors back, keeping the original prompt for context
            retry_prompt = (
                f"{prompt}\n\n"
                f"Previous output failed validation: {e.errors()}\n"
                f"Please regenerate with correct schema."
            )
            raw = self._call_llm(retry_prompt, model)
            return model.model_validate(raw)
```
Pattern 3: Cost and Latency Tracking
Track structured output costs per schema complexity:
```python
import time
from dataclasses import dataclass

from openai import OpenAI
from pydantic import BaseModel

@dataclass
class StructuredOutputMetrics:
    schema_name: str
    provider: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    duration_ms: float = 0.0
    retries: int = 0
    cost_usd: float = 0.0

class MonitoredExtractor:
    def __init__(self, client: OpenAI | None = None):
        self.client = client or OpenAI()

    def extract(self, schema: type[BaseModel], text: str) -> tuple[BaseModel, StructuredOutputMetrics]:
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        metrics = StructuredOutputMetrics(
            schema_name=schema.__name__,
            provider="openai",
            model="gpt-4o",
        )
        result = self.client.beta.chat.completions.parse(
            model="gpt-4o",
            response_format=schema,
            messages=[{"role": "user", "content": text}],
        )
        duration = (time.perf_counter() - start) * 1000
        usage = result.usage
        metrics.duration_ms = duration
        metrics.input_tokens = usage.prompt_tokens
        metrics.output_tokens = usage.completion_tokens
        # GPT-4o: $2.50/M input, $10.00/M output (as of June 2026) [1]; check current pricing
        metrics.cost_usd = (
            usage.prompt_tokens / 1_000_000 * 2.50 +
            usage.completion_tokens / 1_000_000 * 10.00
        )
        return result.choices[0].message.parsed, metrics
```
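The cost line is plain arithmetic: tokens divided by one million, times the per-million price, summed over input and output. A standalone sketch with the prices passed in as parameters, so they are not hard-coded; `estimate_cost_usd` is a hypothetical helper, and the token counts are invented.

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost = tokens / 1e6 * price-per-million, summed over input and output."""
    return (prompt_tokens / 1_000_000 * input_price_per_m
            + completion_tokens / 1_000_000 * output_price_per_m)

# 1,200 input tokens and 300 output tokens at the prices quoted above.
cost = estimate_cost_usd(1_200, 300, input_price_per_m=2.50, output_price_per_m=10.00)
print(f"${cost:.4f}")  # prints $0.0060
```

Both halves contribute $0.003 here: output tokens are a quarter of the volume but four times the price, so verbose schemas with long field names can move the total more than longer prompts do.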
Comparison: Which Approach When
| Situation | Best approach | Reason |
|---|---|---|
| Single provider, need speed | Native API | No overhead, provider-optimized constraint |
| Multi-provider fallback | Instructor | Same Pydantic model, same call pattern |
| Local/on-premise model | Outlines + XGrammar | Grammar-constrained decoding works offline |
| Complex nested schemas | Instructor + Native | Native constraint guarantees the shape; Instructor’s Pydantic validators check nested values and retry |
| Streaming structured output | Instructor or Outlines | Both support incremental parsing |
| Schema changes frequently | Native API (JSON Schema) | No code change needed, just update schema |
| Cost-sensitive batch processing | Outlines (local) | Zero API cost after model download |
Migration Path: From Prompt-Only JSON to Structured Outputs
If you’re currently using prompt-only JSON (asking the model nicely to return JSON), migrate in stages:
- Stage 1: Add JSON mode (response_format={"type": "json_object"} or equivalent) for syntactic validation. This catches malformed JSON on the wire.
- Stage 2: Add downstream schema validation (Pydantic model_validate) and log schema violations. Measure how often the model returns wrong field names or types.
- Stage 3: Switch to structured outputs with full JSON Schema. The provider enforces the schema at generation time, eliminating most validation failures.
- Stage 4: Add automatic retries with the validation error as context. Handle the remaining 0.5-1% of cases where the model refuses or truncates.
Each stage reduces the error rate by roughly an order of magnitude without requiring a full rewrite.
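For the measurement step in Stage 2, a tally of violation kinds over logged responses is usually enough to decide when to move on. A minimal standard-library sketch; `REQUIRED`, `classify`, and the sample responses are invented for illustration, and in practice Pydantic’s model_validate would replace the hand-written check.

```python
import json
from collections import Counter

REQUIRED = {"name": str, "type": str}  # hypothetical contract

def classify(raw: str) -> str:
    """Bucket one model response: ok, invalid_json, missing_field, or wrong_type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(data, dict):
        return "wrong_type"
    for field, expected in REQUIRED.items():
        if field not in data:
            return "missing_field"
        if not isinstance(data[field], expected):
            return "wrong_type"
    return "ok"

logged_responses = [
    '{"name": "Acme", "type": "org"}',   # fine
    '{"name": "Acme"}',                  # field missing
    '{"name": "Acme", "type": 3}',       # wrong type
    '{"name": "Acme", "type": "org"',    # truncated, so not valid JSON
]
tally = Counter(classify(r) for r in logged_responses)
violation_rate = 1 - tally["ok"] / len(logged_responses)
print(tally, f"violation rate: {violation_rate:.0%}")
```

Tracking the buckets separately shows which stage will help most: invalid JSON points at Stage 1, while missing fields and wrong types point at Stage 3.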
References
[1] OpenAI, “Structured Outputs API Guide,” developers.openai.com, 2026. https://developers.openai.com/api/docs/guides/structured-outputs
[2] Anthropic, “Structured Outputs — Claude API Docs,” docs.anthropic.com, 2026. https://docs.anthropic.com/en/docs/build-with-claude/structured-outputs
[3] Google, “Structured Outputs — Gemini API Docs,” ai.google.dev, 2026. https://ai.google.dev/gemini-api/docs/structured-output
[4] 567-Labs, “Instructor: Structured Outputs for LLMs,” GitHub, 2026. https://github.com/567-labs/instructor
[5] Dottxt, “Outlines: Structured Text Generation,” GitHub, 2026. https://github.com/dottxt-ai/outlines