Cross-Provider Structured Outputs: A Production Guide for OpenAI, Anthropic, and Gemini

The bottom line: Every major LLM provider now supports native structured outputs — schema-guaranteed JSON that doesn’t need regex parsing or prompt engineering. This guide covers three approaches — native APIs, library-based generation with Instructor, and grammar-constrained decoding with Outlines — with working Python code you can drop into production today.


The Problem: Prompting for JSON Is Not a Strategy

If you’ve ever told an LLM “respond in JSON format” and then written a regex to handle the inconsistent results, you know the pain. OpenAI’s own guidance now explicitly states that JSON mode (type: "json_object") is considered legacy — it guarantees valid JSON syntax but not schema adherence [1]. A model can return valid JSON that still has wrong field names, wrong types, or missing fields.
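To make the failure mode concrete, here is a minimal sketch using only the standard library. The field names and the `schema_problems` helper are hypothetical, chosen for illustration; the point is only that `json.loads` succeeds on output that a schema would reject.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EXPECTED_FIELDS = {"claim_text": str, "confidence": float}

def schema_problems(raw: str) -> list[str]:
    """Return schema violations for a JSON string (empty list means OK)."""
    data = json.loads(raw)  # raises json.JSONDecodeError only on malformed JSON
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}: {type(data[name]).__name__}")
    return problems

# Syntactically valid JSON, but one field is renamed and one has the wrong type.
model_output = '{"claim": "The sky is blue", "confidence": "high"}'
print(schema_problems(model_output))
```

JSON mode would accept this output as-is; only a schema check catches the renamed field and the string where a number was expected.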

The production-grade approaches in 2026 are:

  1. Native structured outputs — Each provider’s own schema-enforced API (OpenAI response_format, Anthropic output_config, Gemini response_schema)
  2. Library-based generation — Instructor patches the provider client to enforce Pydantic schemas with automatic retries
  3. Grammar-constrained decoding — Outlines and XGrammar constrain the model’s token generation to only produce valid JSON matching a schema

Each approach has different tradeoffs for latency, cost, provider support, and flexibility. This guide covers all three with production patterns.


Approach 1: Native Structured Outputs by Provider

OpenAI: Strict JSON Schema Enforcement

OpenAI’s structured outputs use constrained decoding at the token level — the model literally cannot generate tokens that violate the schema. This is available on GPT-4o, GPT-4o-mini, and o-series models [1].

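from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder input; replace with the document you want to process
text = "The Eiffel Tower was completed in 1889."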

class ExtractedClaim(BaseModel):
    claim_text: str
    confidence: float
    category: str
    source_reference: str | None = None

# JSON Schema is derived from the Pydantic model automatically
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract claims from the following text."},
        {"role": "user", "content": text}
    ],
    response_format=ExtractedClaim,
)

claim = completion.choices[0].message.parsed
print(f"Claim: {claim.claim_text}, Confidence: {claim.confidence}")

Key points about OpenAI’s implementation:

  • response_format accepts either a JSON Schema directly or any Pydantic BaseModel when using the parse() helper
  • The model cannot produce tokens that violate the schema — this is enforced at generation time, not validated after
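  • Schema support includes nested objects, arrays, enums, and anyOf. In strict mode every property must be listed in required, so an optional field is expressed as a union with null; allOf is not supported [1]
  • Refusal detection: if the model refuses, the refusal field is set on the message; otherwise it’s None
  • Supported models: GPT-4o (gpt-4o-2024-08-06 and later snapshots), GPT-4o-mini, o1, o3, and o4-mini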
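The refusal check in the list above can be wrapped in a small helper. This is a sketch, not part of the OpenAI SDK: `unwrap_parsed` and `ModelRefusedError` are hypothetical names, and the `SimpleNamespace` objects stand in for `completion.choices[0].message` so the example runs without an API call.

```python
from types import SimpleNamespace

class ModelRefusedError(RuntimeError):
    """Raised when the model declined to answer instead of returning data."""

def unwrap_parsed(message):
    """Return message.parsed, or raise if the model refused or nothing was parsed."""
    if getattr(message, "refusal", None):
        raise ModelRefusedError(message.refusal)
    if getattr(message, "parsed", None) is None:
        raise ValueError("no parsed output on message")
    return message.parsed

# Stand-ins for completion.choices[0].message; no network access needed.
ok = SimpleNamespace(refusal=None, parsed={"claim_text": "x"})
refused = SimpleNamespace(refusal="I can't help with that.", parsed=None)

print(unwrap_parsed(ok))
```

Raising on refusal keeps downstream code from treating a `None` result as an empty extraction.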

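For replay and auditing, log the schema you sent alongside each response. If you create requests with store=True, the stored completion can later be fetched by ID with client.chat.completions.retrieve(); check [1] for what the stored object includes.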

Anthropic Claude: Output Configuration with Grammars

Anthropic added structured outputs to Claude Sonnet 4.5 and Opus 4.1 in late 2025 via their output_config parameter. The key difference from OpenAI is that Anthropic uses a grammar-based approach — the grammar applies only to Claude’s direct output, not to tool use calls or thinking tags [2].

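import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder input; replace with the document you want to process
text = "Ada Lovelace worked with Charles Babbage in London."

# The output_config shape below follows the structured outputs docs [2] as we
# understand them; verify it against the API reference for your SDK version.
response = client.messages.create(
    model="claude-sonnet-4-5",  # a model that supports structured outputs
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract the key entities from: " + text}],
    output_config={
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "entities": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "type": {"type": "string", "enum": ["person", "org", "location", "product"]},
                                # Numeric bounds (minimum/maximum) are omitted here;
                                # validate the 0-1 range in your own code instead
                                "confidence": {"type": "number"}
                            },
                            "required": ["name", "type", "confidence"],
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["entities"],
                "additionalProperties": False
            }
        }
    }
)

print(response.content[0].text)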

Key points about Anthropic’s implementation:

  • The grammar resets between sections (thinking, tool calls, final response), allowing Claude to reason freely in extended thinking mode while still producing structured final output
  • To use with extended thinking, you configure the thinking parameter alongside output_config — the thinking block is unconstrained, only the final response is grammar-enforced
  • Citation support is incompatible with output_config (returns 400)
  • Prefilling the assistant response is incompatible with structured JSON outputs
  • Returns a 400 error if the schema is invalid or incompatible with model capabilities
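Even with grammar enforcement, the text block still has to be parsed, and a response that stops early may not contain complete JSON. The sketch below is one way to guard that step. The `parse_structured_text` helper is hypothetical, and the `SimpleNamespace` objects imitate the `content` and `stop_reason` attributes of a Messages API response so the code runs offline.

```python
import json
from types import SimpleNamespace

def parse_structured_text(response) -> dict:
    """Parse the first text block of a Messages-style response as JSON."""
    if response.stop_reason == "max_tokens":
        raise ValueError("output truncated at max_tokens; JSON may be incomplete")
    if response.stop_reason == "refusal":
        raise ValueError("model refused; output may not match the schema")
    # Skip non-text blocks (for example thinking blocks) and take the first text block.
    text_block = next(block for block in response.content if block.type == "text")
    return json.loads(text_block.text)

# Stand-in response objects for illustration only.
complete = SimpleNamespace(
    stop_reason="end_turn",
    content=[
        SimpleNamespace(type="thinking", thinking="..."),
        SimpleNamespace(type="text", text='{"entities": []}'),
    ],
)
truncated = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(type="text", text='{"entities": [')],
)

print(parse_structured_text(complete))
```

Checking `stop_reason` before parsing turns a confusing JSON decode error into an actionable message about raising `max_tokens`.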

Google Gemini: response_schema with JSON Schema

Gemini’s structured outputs support JSON Schema natively, and recent improvements (November 2025) added full JSON Schema compatibility including Pydantic and Zod integration [3].

from google import genai
from google.genai.types import GenerateContentConfig
from pydantic import BaseModel

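client = genai.Client(api_key="YOUR_API_KEY")  # placeholder; load the real key from the environment

# Placeholder input; replace with the document you want to analyze
text = "Quarterly revenue fell 12% while support tickets doubled."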

class AnalysisResult(BaseModel):
    summary: str
    key_findings: list[str]
    risk_score: float
    recommended_actions: list[str]

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=text,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=AnalysisResult,
    )
)

result = response.parsed
print(f"Risk score: {result.risk_score}")

Key points about Gemini’s implementation:

  • Set response_mime_type to "application/json" and pass your schema to response_schema
  • Accepts Pydantic BaseModel, dataclass, or plain dict/JSON Schema
  • JSON Schema support includes nested objects, arrays, enums, $ref, allOf, and oneOf
  • Works with both generate_content and generate_content_stream (the REST equivalent is streamGenerateContent)
  • Gemini 2.5 Pro, 2.5 Flash, and 2.0 Flash all support structured outputs
  • Batch mode (.jsonl files) also supports inline response schemas per-line
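When streaming, each chunk carries only a fragment of the JSON document, so the fragments need to be joined before parsing. A minimal sketch, assuming each chunk exposes a `text` attribute that may be empty; `collect_streamed_json` is a hypothetical helper and the fake stream stands in for the iterator returned by the SDK.

```python
import json
from types import SimpleNamespace

def collect_streamed_json(chunks) -> dict:
    """Concatenate the text of streamed chunks, then parse the result once."""
    parts = [chunk.text for chunk in chunks if chunk.text]
    return json.loads("".join(parts))

# Stand-in for a streaming response; the middle chunk has no text.
fake_stream = [
    SimpleNamespace(text='{"summary": "ok", '),
    SimpleNamespace(text=None),
    SimpleNamespace(text='"risk_score": 0.2}'),
]

result = collect_streamed_json(fake_stream)
print(result["risk_score"])
```

Parsing once at the end is the simplest option; if you need partial results while the stream is still open, a library with incremental parsing is a better fit.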

Approach 2: Library-Based Structured Outputs with Instructor

Instructor is a Python library that wraps a provider’s client and enforces Pydantic schema compliance with automatic retries, validation, and streaming support. It is an independent open-source project hosted under the 567-labs GitHub organization; it builds on Pydantic but is not maintained by the Pydantic team. It supports OpenAI, Anthropic, Gemini, Cohere, Mistral, and any OpenAI-compatible endpoint [4].

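import instructor
from typing import Literal
from openai import OpenAI
from pydantic import BaseModel

# Patch any OpenAI-compatible client
client = instructor.from_openai(OpenAI())

# Placeholder input; replace with real clinical notes
raw_clinical_notes = "Patient 1042 presents with seasonal flu. Prescribed oseltamivir."

class MedicalRecord(BaseModel):
    patient_id: str
    diagnosis: str
    medications: list[str]
    follow_up_date: str | None = None
    # A Literal makes the allowed values part of the schema, not just a hint
    severity: Literal["low", "medium", "high", "critical"]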

# Just call with response_model — Instructor handles the rest
record = client.chat.completions.create(
    model="gpt-4o",
    response_model=MedicalRecord,
    messages=[
        {"role": "system", "content": "Extract structured medical record data."},
        {"role": "user", "content": raw_clinical_notes}
    ],
)

print(f"Diagnosis: {record.diagnosis}, Severity: {record.severity}")

Instructor automatically:

  1. Converts the Pydantic model into the right schema format for your provider
  2. Validates the response against the Pydantic model after generation (or incrementally while streaming)
  3. Retries automatically on validation failure (configurable max retries)
  4. Handles nested models, Union types, and Optional fields
  5. Supports streaming mode for real-time partial parsing
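The generate, validate, and retry cycle in the list above can be sketched without any library. This is a conceptual illustration, not Instructor’s actual implementation: `generate_validated`, the `required` mapping, and the fake model are all hypothetical stand-ins.

```python
import json
from typing import Callable

def generate_validated(ask: Callable[[str], str], prompt: str,
                       required: dict[str, type], max_retries: int = 2) -> dict:
    """Call `ask`, validate the JSON reply, and re-ask with the errors on failure."""
    current_prompt = prompt
    errors: list[str] = []
    for _ in range(max_retries + 1):
        reply = ask(current_prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        else:
            if not isinstance(data, dict):
                errors = ["expected a JSON object"]
            else:
                errors = [f"{key}: expected {kind.__name__}"
                          for key, kind in required.items()
                          if not isinstance(data.get(key), kind)]
            if not errors:
                return data
        # Keep the original task and append what went wrong.
        current_prompt = f"{prompt}\n\nYour previous reply was rejected: {errors}. Try again."
    raise ValueError(f"validation failed after {max_retries + 1} attempts: {errors}")

# Fake model: the first reply has the wrong type, the second is valid.
replies = iter(['{"diagnosis": 42}', '{"diagnosis": "flu"}'])
prompts_seen: list[str] = []

def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)

record = generate_validated(fake_model, "Extract the diagnosis.", {"diagnosis": str})
print(record, len(prompts_seen))
```

The second prompt contains the validation errors from the first attempt, which is what gives the model a chance to correct itself rather than repeat the same mistake.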

For Anthropic specifically:

import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    response_model=MedicalRecord,
    messages=[{"role": "user", "content": raw_clinical_notes}],
)

The advantage of Instructor over native APIs is provider portability: the same Pydantic model and the same calling pattern work with any supported provider. To switch from OpenAI to Anthropic, you change the import, the client initialization, and the model name (plus provider-specific arguments such as max_tokens).


Approach 3: Grammar-Constrained Decoding with Outlines

Outlines takes a fundamentally different approach. Instead of wrapping a provider’s API, it constrains the token generation process directly using a grammar. This works at the token sampling level — the model’s probability distribution is masked so that tokens that would violate the schema have probability zero [5].

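# Note: this uses the pre-1.0 Outlines API (outlines.models / outlines.generate).
# Newer releases changed these entry points, so check the docs for your version [5].
from outlines import models, generate
from pydantic import BaseModel

# Load any model via Transformers, vLLM, or Ollama
model = models.transformers("microsoft/Phi-3-medium-4k-instruct")

class Issue(BaseModel):
    line: int
    description: str

class CodeReview(BaseModel):
    file_path: str
    issues: list[Issue]  # typed items give the grammar a concrete shape to enforce
    severity: str
    suggestion: str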

# Constrain generation to match the Pydantic schema
generator = generate.json(model, CodeReview)

result = generator(
    "Review this Python file for memory leaks and threading bugs."
)

print(result.issues)
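The masking step described at the start of this section can be shown with a toy vocabulary. This is a simplified sketch of the idea, not Outlines’ implementation: real engines compute the allowed set from a compiled grammar at every step, whereas here the set is supplied by hand.

```python
import math

def masked_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Softmax over token logits after setting disallowed tokens to -infinity."""
    if not allowed & logits.keys():
        raise ValueError("no allowed token is present in the vocabulary")
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# At the start of a JSON object, only "{" is a legal first token.
logits = {"{": 1.0, "Sure": 3.0, "[": 0.5}
probs = masked_softmax(logits, allowed={"{"})
print(probs)
```

Although "Sure" has the highest raw logit, it ends up with probability zero, so sampling can never pick it. That is why constrained decoding guarantees the shape of the output rather than merely making it likely.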

Outlines supports:

  • Local models via Transformers, vLLM, ExLlamaV2, and llama.cpp
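  • Remote models via OpenAI, Anthropic, and any OpenAI-compatible endpoint (remote APIs do not expose logits, so Outlines cannot mask tokens there and depends on whatever structured output support the provider offers)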
  • Multiple modes: JSON schema, CSV, regular expressions, and context-free grammars
  • Batch processing for N-samples-per-prompt

The killer use case for grammar-constrained decoding is local structured generation with quantized models. If you’re running a 7B or 13B model on-premises and need guaranteed JSON output, the standard approach is a grammar engine inside the inference server: vLLM, for example, can use XGrammar or Outlines as its structured-output backend, and XGrammar is reported to deliver up to 3.5x faster JSON generation than alternative grammar engines [5].


Production Patterns

Pattern 1: Unified Router Across Providers

For production systems, you want a single interface that routes to the best available structured output method:

from enum import Enum
from typing import Protocol
from pydantic import BaseModel

class Provider(Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GEMINI = "gemini"
    LOCAL_OUTLINES = "local"

class StructuredOutputProvider(Protocol):
    def extract(self, model: str, prompt: str, schema: type[BaseModel]) -> BaseModel:
        ...

class Router:
    def __init__(self):
        self.providers: dict[Provider, StructuredOutputProvider] = {}
    
    def route(
        self, 
        schema: type[BaseModel], 
        prompt: str, 
        prefer: Provider | None = None
    ) -> BaseModel:
        # Try preferred provider, fall back to alternatives
        providers_to_try = (
            [prefer] + [p for p in Provider if p != prefer]
            if prefer else list(Provider)
        )
        
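        errors: list[Exception] = []
        for provider in providers_to_try:
            impl = self.providers.get(provider)
            if impl is None:
                continue  # provider not registered
            try:
                return impl.extract("default", prompt, schema)
            except Exception as e:  # broad on purpose: any failure triggers fallback
                errors.append(e)

        if not errors:
            raise RuntimeError("No providers registered")
        raise RuntimeError(f"All providers failed: {errors}") from errors[-1]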
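To see the fallback behavior in isolation, the sketch below reduces the router to a single function and drives it with stub providers. Everything here is hypothetical scaffolding for illustration: `first_success`, `AlwaysFails`, `Echo`, and the `Note` dataclass are not part of any library, and a dataclass stands in for a Pydantic model to keep the example dependency-free.

```python
from dataclasses import dataclass

@dataclass
class Note:
    text: str

class AlwaysFails:
    """Stub provider that simulates an outage."""
    def extract(self, model, prompt, schema):
        raise ConnectionError("provider unavailable")

class Echo:
    """Stub provider that wraps the prompt in the requested schema."""
    def extract(self, model, prompt, schema):
        return schema(text=prompt)

def first_success(providers, prompt, schema):
    """Try each (name, provider) pair in order; return the first result."""
    errors = {}
    for name, provider in providers:
        try:
            return name, provider.extract("default", prompt, schema)
        except Exception as exc:  # any failure moves on to the next provider
            errors[name] = exc
    raise RuntimeError(f"All providers failed: {errors}")

winner, result = first_success(
    [("openai", AlwaysFails()), ("anthropic", Echo())], "hello", Note
)
print(winner, result)
```

Recording every provider’s error, not just the last one, makes it much easier to tell an outage at one vendor from a schema that no provider accepts.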

Pattern 2: Validation Chain with Retries

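Schema enforcement guarantees the shape of the output, not its correctness: values can still be wrong, and a response that is cut off at the token limit or ends in a refusal may not match the schema at all. Implement a validation chain: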

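from pydantic import BaseModel, ValidationError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class Extractor:
    MAX_RETRIES = 3

    def _call_llm(self, prompt: str, model: type[BaseModel]) -> dict:
        # Placeholder: call your provider here and return the raw parsed JSON
        raise NotImplementedError

    def extract_with_retry(
        self,
        prompt: str,
        model: type[BaseModel]
    ) -> BaseModel:
        feedback = ""  # validation errors from the previous attempt, if any

        @retry(
            stop=stop_after_attempt(self.MAX_RETRIES),
            wait=wait_exponential(multiplier=1, min=1, max=10),
            retry=retry_if_exception_type(ValidationError),
            reraise=True,  # surface the last ValidationError, not a RetryError
        )
        def attempt() -> BaseModel:
            nonlocal feedback
            raw = self._call_llm(prompt + feedback, model)
            try:
                return model.model_validate(raw)
            except ValidationError as e:
                # Keep the original prompt and append the errors for the next attempt
                feedback = (
                    f"\n\nPrevious output failed validation: {e.errors()}\n"
                    f"Please regenerate with the correct schema."
                )
                raise

        return attempt()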

Pattern 3: Cost and Latency Tracking

Track structured output costs per schema complexity:

import time
from dataclasses import dataclass

from openai import OpenAI
from pydantic import BaseModel

@dataclass
class StructuredOutputMetrics:
    schema_name: str
    provider: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    duration_ms: float = 0.0
    retries: int = 0
    cost_usd: float = 0.0

class MonitoredExtractor:
    def __init__(self, client: OpenAI | None = None):
        self.client = client or OpenAI()

    def extract(self, schema: type[BaseModel], text: str) -> tuple[BaseModel, StructuredOutputMetrics]:
        start = time.perf_counter()  # monotonic clock, safe for measuring durations
        metrics = StructuredOutputMetrics(
            schema_name=schema.__name__,
            provider="openai",
            model="gpt-4o",
        )
        
        result = self.client.beta.chat.completions.parse(
            model="gpt-4o",
            response_format=schema,
            messages=[{"role": "user", "content": text}],
        )
        
        duration = (time.perf_counter() - start) * 1000
        usage = result.usage
        metrics.duration_ms = duration
        metrics.input_tokens = usage.prompt_tokens
        metrics.output_tokens = usage.completion_tokens
        
        # GPT-4o: $2.50/M input, $10.00/M output (as of June 2026) [1]
        metrics.cost_usd = (
            usage.prompt_tokens / 1_000_000 * 2.50 +
            usage.completion_tokens / 1_000_000 * 10.00
        )
        
        return result.choices[0].message.parsed, metrics
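As a worked example of the cost formula, the sketch below pulls the arithmetic into a standalone function. The default prices are the GPT-4o figures quoted in the code above and should be treated as assumptions to update; `estimate_cost_usd` is a hypothetical helper name.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 2.50,    # USD per million input tokens (assumed)
    output_price_per_m: float = 10.00,  # USD per million output tokens (assumed)
) -> float:
    """Estimate the cost of one call from its token usage."""
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )

# 1,200 input tokens and 300 output tokens:
#   1200 / 1e6 * 2.50  = 0.003
#    300 / 1e6 * 10.00 = 0.003
cost = estimate_cost_usd(1200, 300)
print(f"${cost:.4f} per call, ${cost * 10_000:.2f} per 10,000 calls")
```

Output tokens cost four times as much as input tokens at these prices, so a verbose schema with long field names or free-text fields can dominate the bill even when prompts are short.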

Comparison: Which Approach When

Situation | Best approach | Reason
--- | --- | ---
Single provider, need speed | Native API | No overhead, provider-optimized constraint
Multi-provider fallback | Instructor | Same Pydantic model, same call pattern
Local/on-premise model | Outlines + XGrammar | Grammar-constrained decoding works offline
Complex nested schemas | Instructor + Native | Instructor validates recursively, native constraints top-level
Streaming structured output | Instructor or Outlines | Both support incremental parsing
Schema changes frequently | Native API (JSON Schema) | No code change needed, just update schema
Cost-sensitive batch processing | Outlines (local) | Zero API cost after model download

Migration Path: From Prompt-Only JSON to Structured Outputs

If you’re currently using prompt-only JSON (asking the model nicely to return JSON), migrate in stages:

  1. Stage 1 — Add JSON mode (response_format={"type": "json_object"} or equivalent) for syntactic validation. This catches malformed JSON on the wire.
  2. Stage 2 — Add downstream schema validation (Pydantic model_validate) and log schema violations. Measure how often the model returns wrong field names or types.
  3. Stage 3 — Switch to structured outputs with full JSON Schema. The provider enforces the schema at generation time, eliminating most validation failures.
  4. Stage 4: Add automatic retries with the validation error as context. Handle the small remainder of cases where the model refuses or the output is truncated.

Each stage should cut the error rate substantially without requiring a full rewrite. Use the violation logs from Stage 2 to measure the actual improvement on your own traffic.
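The Stage 2 measurement can start as a simple tally of failure categories. A minimal sketch, using a hypothetical two-field schema and the standard library in place of Pydantic; `classify` and `violation_report` are illustrative names.

```python
import json
from collections import Counter

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED = {"name": str, "confidence": float}

def classify(raw: str) -> str:
    """Label one raw model output with its first failure category, or "ok"."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed_json"
    if not isinstance(data, dict):
        return "not_an_object"
    for field, expected in REQUIRED.items():
        if field not in data:
            return "missing_field"
        if not isinstance(data[field], expected):
            return "wrong_type"
    return "ok"

def violation_report(raw_outputs) -> Counter:
    """Count how many outputs fall into each category."""
    return Counter(classify(raw) for raw in raw_outputs)

sample = [
    '{"name": "Acme", "confidence": 0.9}',
    '{"name": "Acme"}',
    '{"name": "Acme", "confidence": "high"}',
    '{"name": "Acme", ',
]
report = violation_report(sample)
print(report)
```

Tracking these counts before and after each stage shows which failure class a given change actually removed: JSON mode should eliminate malformed output, while the missing-field and wrong-type counts only drop once schema enforcement is on.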


References

[1] OpenAI, “Structured Outputs API Guide,” developers.openai.com, 2026. https://developers.openai.com/api/docs/guides/structured-outputs

[2] Anthropic, “Structured Outputs — Claude API Docs,” docs.anthropic.com, 2026. https://docs.anthropic.com/en/docs/build-with-claude/structured-outputs

[3] Google, “Structured Outputs — Gemini API Docs,” ai.google.dev, 2026. https://ai.google.dev/gemini-api/docs/structured-output

[4] 567-Labs, “Instructor: Structured Outputs for LLMs,” GitHub, 2026. https://github.com/567-labs/instructor

[5] Dottxt, “Outlines: Structured Text Generation,” GitHub, 2026. https://github.com/dottxt-ai/outlines
