Building a Multi-Provider LLM Router with Intelligent Fallback Chains

The bottom line: Running on a single LLM provider is a production incident waiting to happen — provider outages, rate limit spikes, and model deprecations happen regularly. This guide walks through building a multi-provider router with automatic fallback chains, cost tracking, and circuit breaker patterns that keep your agents running through all of them.

Why You Need a Router, Not Just a Client

A standard OpenAI or Anthropic client works great until one of these happens:

A provider’s API goes down for 45 minutes (OpenAI had 3 outages in Q2 2025 [1])
You hit a rate limit at peak traffic and all requests start failing with 429
The model you’re using gets deprecated with 30 days’ notice
A cheaper, faster model becomes available but you’d need to change every call site

A multi-provider router solves all of these by adding a thin abstraction layer that handles distribution, fallback, and failover — without changing your application code.

The key patterns are:

Deterministic routing — Map capabilities to the cheapest provider that supports them
Fallback chains — Try a primary model, then secondary, then tertiary on failure
Circuit breakers — Stop sending requests to a failing provider after N consecutive errors
Cost attribution — Track per-request, per-model, and per-agent spend
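
As a rough illustration of the first pattern, deterministic routing can be a pure lookup: filter a model catalogue by the capabilities a request needs, then take the cheapest match. The `ModelInfo` type, the `CATALOGUE` table, and the capability tags below are assumptions made for this sketch; the input prices are the per-million-token figures used later in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    provider: str
    model: str
    input_usd_per_m: float        # USD per 1M input tokens
    capabilities: frozenset       # hypothetical capability tags


# Hypothetical catalogue; the capability tags are illustrative only.
CATALOGUE = [
    ModelInfo("openai", "gpt-4.1-nano", 0.10, frozenset({"classify"})),
    ModelInfo("anthropic", "claude-3-haiku-20240307", 0.25,
              frozenset({"classify", "extract"})),
    ModelInfo("openai", "gpt-4o", 2.50,
              frozenset({"classify", "extract", "code"})),
]


def pick_model(required: set) -> ModelInfo:
    """Return the cheapest model whose capabilities cover `required`."""
    candidates = [m for m in CATALOGUE if required <= m.capabilities]
    if not candidates:
        raise LookupError(f"no model supports {sorted(required)}")
    return min(candidates, key=lambda m: m.input_usd_per_m)


print(pick_model({"classify"}).model)  # gpt-4.1-nano
print(pick_model({"code"}).model)      # gpt-4o
```

Because the lookup has no randomness or load-based logic, the same request type always maps to the same primary model, which keeps cost and behavior predictable.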

Architecture Overview

The router sits between your application and all LLM providers:

Application Code
     │
     ▼
┌──────────────────────┐
│  LLM Router          │
│  ├─ Priority Router  │  ← maps request type → model
│  ├─ Fallback Chain   │  ← tries backup models on failure
│  ├─ Circuit Breaker  │  ← stops retrying failing providers
│  └─ Cost Tracker     │  ← logs per-request cost
└────┬─────────────────┘
     │
     ├──→ OpenAI (GPT-4o, GPT-4o-mini)
     ├──→ Anthropic (Claude 3.5 Sonnet, Claude 3 Haiku)
     └──→ DeepSeek (V4 Flash, V4)

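The router keeps only in-memory state (circuit breaker counters and the cost log), so it can run as a sidecar, a library import, or a standalone service. That state is per process: replicas do not share breaker status or cost totals unless you move it to a shared store.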

Step 1: Provider Adapter Layer

Each provider has a different API format, error structure, and pricing model. Normalize them behind a common interface:

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
import time


@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


@dataclass
class LLMConfig:
    model: str
    api_key: str
    base_url: Optional[str] = None
    max_retries: int = 2
    timeout_s: int = 60


class ProviderAdapter(ABC):
    """Abstract interface all providers must implement."""

    @abstractmethod
    async def complete(
        self,
        messages: list[dict],
        config: LLMConfig,
        **kwargs,
    ) -> LLMResponse:
        ...

OpenAI Adapter

import openai

class OpenAIAdapter(ProviderAdapter):
    def __init__(self):
        # Cache one client per (api_key, base_url) pair. Keying on a key
        # prefix alone would let two keys or endpoints share a client.
        self._client_cache: dict[tuple, openai.AsyncOpenAI] = {}

    def _client(self, config: LLMConfig) -> openai.AsyncOpenAI:
        key = (config.api_key, config.base_url)
        if key not in self._client_cache:
            kwargs = {"api_key": config.api_key}
            if config.base_url:
                kwargs["base_url"] = config.base_url
            self._client_cache[key] = openai.AsyncOpenAI(**kwargs)
        return self._client_cache[key]

    async def complete(
        self, messages: list[dict], config: LLMConfig, **kwargs,
    ) -> LLMResponse:
        client = self._client(config)
        start = time.monotonic()

        response = await client.chat.completions.create(
            model=config.model,
            messages=messages,
            timeout=config.timeout_s,
            **kwargs,
        )

        choice = response.choices[0]
        usage = response.usage
        latency = (time.monotonic() - start) * 1000

        return LLMResponse(
            content=choice.message.content or "",
            model=config.model,
            provider="openai",
            input_tokens=usage.prompt_tokens,
            output_tokens=usage.completion_tokens,
            latency_ms=latency,
            cost_usd=_openai_cost(config.model, usage),
        )


def _openai_cost(model: str, usage) -> float:
    """Per-model pricing lookup (simplified; check latest pricing)."""
    # (input, output) rates in USD per 1M tokens
    rates = {
        "gpt-4o":         (2.50, 10.00),
        "gpt-4o-mini":    (0.15, 0.60),
        "gpt-4.1":        (2.00, 8.00),
        "gpt-4.1-mini":   (0.40, 1.60),
        "gpt-4.1-nano":   (0.10, 0.40),
    }
    # Unknown models, including OpenAI-compatible third parties such as
    # DeepSeek, fall back to gpt-4o rates, so their cost may be misreported
    per_m = rates.get(model, (2.50, 10.00))
    return (usage.prompt_tokens / 1_000_000 * per_m[0]
            + usage.completion_tokens / 1_000_000 * per_m[1])

Anthropic Adapter

import anthropic

class AnthropicAdapter(ProviderAdapter):
    async def complete(
        self, messages: list[dict], config: LLMConfig, **kwargs,
    ) -> LLMResponse:
        client = anthropic.AsyncAnthropic(api_key=config.api_key)
        start = time.monotonic()

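        # Convert OpenAI-style messages to Anthropic format: the system
        # prompt is a top-level parameter, not a message role
        system = None
        anthropic_messages = []
        for m in messages:
            if m["role"] == "system":
                system = m["content"]
            else:
                anthropic_messages.append(m)

        params = {
            "model": config.model,
            "messages": anthropic_messages,
            "max_tokens": kwargs.get("max_tokens", 4096),
            "timeout": config.timeout_s,
        }
        if system is not None:
            # Only pass `system` when a system prompt exists, rather than
            # sending an explicit null
            params["system"] = system

        response = await client.messages.create(**params)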

        latency = (time.monotonic() - start) * 1000

        return LLMResponse(
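            # Join all text blocks; the first block is not guaranteed to be text
            content="".join(b.text for b in response.content if b.type == "text"),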
            model=config.model,
            provider="anthropic",
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            latency_ms=latency,
            cost_usd=_anthropic_cost(config.model, response.usage),
        )


def _anthropic_cost(model: str, usage) -> float:
    rates = {
        "claude-sonnet-4-20250514":   (3.00, 15.00),
        "claude-3-5-sonnet-20241022": (3.00, 15.00),
        "claude-3-haiku-20240307":   (0.25, 1.25),
    }
    per_m = rates.get(model, (3.00, 15.00))
    return (usage.input_tokens / 1_000_000 * per_m[0]
            + usage.output_tokens / 1_000_000 * per_m[1])
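
The message conversion inside the adapter is easy to get subtly wrong, so it can help to isolate it as a pure function and unit test it without calling the API. The sketch below is one way to do that; `split_system` is a hypothetical helper, and joining multiple system messages with blank lines is an assumption of this sketch (the adapter's loop keeps only the last one).

```python
from typing import Optional


def split_system(messages: list) -> tuple:
    """Separate system prompts from the conversation turns.

    Returns (system, turns), where system is None when no system
    message is present.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    system: Optional[str] = "\n\n".join(system_parts) if system_parts else None
    return system, turns


system, turns = split_system([
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "hello"},
])
print(system)  # Be terse.
print(turns)   # [{'role': 'user', 'content': 'hello'}]
```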

Step 2: The Router

The router is a thin class that orchestrates adapters with configurable fallback chains:

import asyncio
from collections import defaultdict


@dataclass
class RouteConfig:
    """A fallback chain: primary first, then backups."""
    model_group: str
    fallbacks: list[tuple[str, str]]  # [(provider_name, model_name), ...]
    max_retries: int = 1


class CircuitBreaker:
    """Stops routing to a provider after consecutive failures."""

    def __init__(self, threshold: int = 5, reset_s: int = 60):
        self.threshold = threshold
        self.reset_s = reset_s
        self._failures: dict[str, int] = defaultdict(int)
        self._tripped_at: dict[str, float] = {}
        self._half_open: set[str] = set()

    def record_failure(self, provider: str):
        self._failures[provider] += 1  # defaultdict(int) starts at 0
        if self._failures[provider] >= self.threshold:
            self._tripped_at[provider] = time.monotonic()
            self._half_open.discard(provider)

    def record_success(self, provider: str):
        self._failures[provider] = 0
        self._half_open.discard(provider)
        self._tripped_at.pop(provider, None)

    def is_open(self, provider: str) -> bool:
        if provider not in self._tripped_at:
            return False
        if time.monotonic() - self._tripped_at[provider] > self.reset_s:
            # Half-open: let requests through again. The failure count is
            # still at the threshold, so the next failure re-trips at once.
            self._half_open.add(provider)
            return False
        return True

    def is_half_open(self, provider: str) -> bool:
        return provider in self._half_open


class LLMRouter:
    """Multi-provider router with fallback chains and circuit breakers."""

    def __init__(self):
        self._adapters: dict[str, ProviderAdapter] = {
            "openai": OpenAIAdapter(),
            "anthropic": AnthropicAdapter(),
            "deepseek": OpenAIAdapter(),  # DeepSeek uses OpenAI-compatible API
        }
        self._breaker = CircuitBreaker()
        self._cost_log: list[dict] = []

    def register_adapter(self, name: str, adapter: ProviderAdapter):
        self._adapters[name] = adapter

    async def route(
        self,
        messages: list[dict],
        route_config: RouteConfig,
        **kwargs,
    ) -> LLMResponse:
        """Try each provider/model in the fallback chain until one succeeds."""

        chain = route_config.fallbacks

        for attempt in range(route_config.max_retries + 1):
            for provider, model in chain:
                if self._breaker.is_open(provider):
                    continue  # Skip tripped providers

                adapter = self._adapters.get(provider)
                if not adapter:
                    continue

                config = LLMConfig(
                    model=model,
                    api_key=self._get_key(provider),
                    base_url=self._get_base_url(provider),
                )

                try:
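                    response = await adapter.complete(messages, config, **kwargs)
                    # Attribute the response to the routed provider name.
                    # Adapters reused across providers (DeepSeek via
                    # OpenAIAdapter) would otherwise report their own label.
                    response.provider = provider
                    self._breaker.record_success(provider)
                    self._log_cost(route_config.model_group, response)
                    return response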
                except Exception as e:
                    self._breaker.record_failure(provider)
                    error_type = type(e).__name__
                    print(
                        f"[ROUTER] {provider}/{model} failed "
                        f"({error_type}): {e}"
                    )
                    continue  # Next in chain

            # All providers failed on this retry round
            if attempt < route_config.max_retries:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff

        raise RuntimeError(
            f"All {len(chain)} providers failed for "
            f"'{route_config.model_group}' after "
            f"{route_config.max_retries + 1} retry rounds"
        )

    def _log_cost(self, group: str, response: LLMResponse):
        self._cost_log.append({
            "timestamp": time.time(),
            "model_group": group,
            "provider": response.provider,
            "model": response.model,
            "cost_usd": response.cost_usd,
        })

    def _get_key(self, provider: str) -> str:
        import os
        return os.environ.get(f"{provider.upper()}_API_KEY", "")

    def _get_base_url(self, provider: str) -> Optional[str]:
        import os
        return os.environ.get(f"{provider.upper()}_BASE_URL", None)
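
To see the breaker's state transitions without waiting for real timeouts, the trip and reset rules can be exercised with an injectable clock. The sketch below is a simplified variant written for illustration: `ClockedBreaker` and its `clock` parameter are assumptions of this example, and it omits the `_half_open` bookkeeping of the class above.

```python
from collections import defaultdict
from typing import Callable
import time


class ClockedBreaker:
    """Same trip/reset rules as CircuitBreaker, with an injectable clock."""

    def __init__(self, threshold: int = 5, reset_s: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.reset_s = reset_s
        self._clock = clock
        self._failures = defaultdict(int)
        self._tripped_at = {}

    def record_failure(self, provider: str) -> None:
        self._failures[provider] += 1
        if self._failures[provider] >= self.threshold:
            self._tripped_at[provider] = self._clock()

    def record_success(self, provider: str) -> None:
        self._failures[provider] = 0
        self._tripped_at.pop(provider, None)

    def is_open(self, provider: str) -> bool:
        tripped = self._tripped_at.get(provider)
        if tripped is None:
            return False
        return self._clock() - tripped <= self.reset_s


# Walk through closed -> open -> half-open -> open -> closed.
now = [0.0]
breaker = ClockedBreaker(threshold=3, reset_s=60, clock=lambda: now[0])

for _ in range(3):
    breaker.record_failure("openai")
states = [breaker.is_open("openai")]      # tripped: requests are skipped

now[0] = 61.0
states.append(breaker.is_open("openai"))  # reset window elapsed: probing allowed

breaker.record_failure("openai")          # probe fails: re-trips at t=61
states.append(breaker.is_open("openai"))

now[0] = 122.0
breaker.record_success("openai")          # probe succeeds: fully closed
states.append(breaker.is_open("openai"))
print(states)  # [True, False, True, False]
```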

Step 3: Route Configuration

Define fallback chains for different capability tiers:

# Fast/cheap tier — for classification, extraction, simple completions
fast_route = RouteConfig(
    model_group="fast",
    fallbacks=[
        ("openai",    "gpt-4.1-nano"),
        ("deepseek",  "deepseek-chat"),
        ("anthropic", "claude-3-haiku-20240307"),
    ],
    max_retries=1,
)

# Smart tier — for reasoning, coding, complex tasks
smart_route = RouteConfig(
    model_group="smart",
    fallbacks=[
        ("openai",    "gpt-4o"),
        ("anthropic", "claude-sonnet-4-20250514"),
        ("deepseek",  "deepseek-chat"),
    ],
    max_retries=2,
)

# Structured output tier — for JSON schema compliance
structured_route = RouteConfig(
    model_group="structured",
    fallbacks=[
        ("openai",    "gpt-4o"),
        ("deepseek",  "deepseek-chat"),
    ],
    max_retries=1,
)

The ordering matters: primary choice goes first. The router tries each provider in sequence. If all fail in a retry round, it sleeps with exponential backoff and retries the full chain.
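
To make the backoff concrete: route() sleeps 2 ** attempt seconds after each failed round except the last, so the worst-case added wait is small and easy to compute. The helper names below are hypothetical, and the jitter function is an optional addition that the router in Step 2 does not implement.

```python
import random


def backoff_schedule(max_retries: int) -> list:
    """Seconds slept after each failed round, matching route(): 2 ** attempt."""
    return [2 ** attempt for attempt in range(max_retries)]


def with_jitter(delay_s: float, rng: random.Random) -> float:
    """Full jitter (not in the router above): uniform in [0, delay_s]."""
    return rng.uniform(0, delay_s)


print(backoff_schedule(1))       # [1]
print(backoff_schedule(2))       # [1, 2]
print(sum(backoff_schedule(2)))  # 3
```

With max_retries=2, as in smart_route, a request that fails everywhere spends 3 seconds sleeping on top of the time taken by the failed calls themselves. Jitter is worth considering when many workers retry at once, since it spreads their retries instead of synchronizing them.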

Step 4: Usage Example

Here’s a complete end-to-end usage example:

import asyncio

router = LLMRouter()

# Optional: LLMRouter already registers DeepSeek through OpenAIAdapter.
# Shown here to illustrate how register_adapter() adds or replaces a provider.
router.register_adapter("deepseek", OpenAIAdapter())

async def classify_sentiment(text: str) -> str:
    """Route to cheapest model capable of classification."""
    messages = [
        {"role": "system",
         "content": "Classify sentiment: positive, negative, or neutral. Reply with one word."},
        {"role": "user", "content": text},
    ]

    response = await router.route(messages, fast_route)
    return response.content.strip()


async def generate_code(prompt: str) -> str:
    """Route to smart models for code generation."""
    messages = [
        {"role": "system",
         "content": "You are a senior software engineer. Write clean, production-ready code."},
        {"role": "user", "content": prompt},
    ]

    response = await router.route(messages, smart_route)
    return response.content


async def main():
    # Fast path: GPT-4.1 Nano handles classification
    sentiment = await classify_sentiment(
        "The API response time dropped from 2.3s to 180ms after the optimization."
    )
    print(f"Sentiment: {sentiment}")  # positive

    # Smart path: GPT-4o handles code generation
    code = await generate_code(
        "Write a Python async function to fetch and paginate through a REST API"
    )
    print(f"Generated {len(code)} chars of code")

    # Check cost
    total = sum(c["cost_usd"] for c in router._cost_log)
    print(f"Total cost: ${total:.4f}")


asyncio.run(main())

Step 5: Production Hardening

5.1 Environment-Based Configuration

Don’t hardcode API keys or model names. Use environment variables:

# .env.production
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
DEEPSEEK_BASE_URL=https://api.deepseek.com/v1

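The router reads these through _get_key() and _get_base_url(), provided they are exported into the process environment. A .env file is not loaded automatically; export it from your process manager or use a loader such as python-dotenv.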
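
Because _get_key() returns an empty string when a variable is missing, a misconfigured provider only shows up as an authentication error at request time. A startup check along these lines can surface the problem earlier. This is a sketch: `missing_api_keys` is a hypothetical helper, not part of the router above.

```python
import os
from typing import Mapping


def missing_api_keys(providers: list,
                     env: Mapping[str, str] = os.environ) -> list:
    """Return the env var names that are unset or empty for these providers."""
    names = [f"{p.upper()}_API_KEY" for p in providers]
    return [n for n in names if not env.get(n)]


# Example with an explicit mapping instead of the real environment
fake_env = {"OPENAI_API_KEY": "sk-test", "DEEPSEEK_API_KEY": ""}
print(missing_api_keys(["openai", "anthropic", "deepseek"], fake_env))
# ['ANTHROPIC_API_KEY', 'DEEPSEEK_API_KEY']
```

Calling this once at startup with the providers named in your route configs, and failing fast if the list is non-empty, avoids discovering a missing key in the middle of an outage.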

5.2 Provider-Specific Error Classification

Different providers return errors in different formats. Classify them so the router knows which are retryable:

RETRYABLE_ERRORS = {
    "openai": [
        "RateLimitError",      # 429: retry after backoff
        "APITimeoutError",     # Timeout, could be transient
        "InternalServerError", # 5xx: provider-side issue
        "APIConnectionError",  # Network blip
    ],
    "anthropic": [
        # Match on the SDK's exception class names, not on the API's error
        # type strings (rate_limit_error, overloaded_error)
        "RateLimitError",      # 429: too many requests
        "APITimeoutError",     # Timeout, could be transient
        "InternalServerError", # 5xx: provider-side issue
        "APIConnectionError",  # Network blip
    ],
}

def is_retryable(error: Exception, provider: str) -> bool:
    error_name = type(error).__name__
    retryable = RETRYABLE_ERRORS.get(provider, [])
    # Any 5xx status is retryable, including Anthropic's 529 (overloaded)
    if hasattr(error, "status_code") and error.status_code >= 500:
        return True
    return error_name in retryable

Non-retryable errors (authentication failures, invalid model names) should not trigger fallbacks — they’ll fail on every provider and waste cost and latency.
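
The route() method in Step 2 catches every exception and moves on, so the classification has to be wired into the loop to take effect. The sketch below shows one way to apply the rule stated above, using stand-in exception classes and a simplified classifier; `try_chain`, `AuthError`, and `Overloaded` are hypothetical names for this example and no real provider is called.

```python
import asyncio


class AuthError(Exception):
    """Stand-in for a provider authentication failure (hypothetical)."""
    status_code = 401


class Overloaded(Exception):
    """Stand-in for a transient provider-side error (hypothetical)."""
    status_code = 529


def is_retryable(error: Exception) -> bool:
    """Simplified stand-in for the classifier above: only 5xx is transient."""
    return getattr(error, "status_code", 0) >= 500


async def try_chain(chain):
    """Try each (name, call) pair; stop at the first non-retryable error."""
    for name, call in chain:
        try:
            return name, await call()
        except Exception as e:
            if not is_retryable(e):
                raise  # surface immediately instead of falling back
            # Transient error: continue with the next provider in the chain
    raise RuntimeError("all providers failed")


async def overloaded():
    raise Overloaded("busy")


async def bad_key():
    raise AuthError("invalid key")


async def ok():
    return "ok"


async def demo():
    # Transient failure: falls back to the next provider
    first = await try_chain([("primary", overloaded), ("backup", ok)])
    # Non-retryable failure: raised without touching the backup
    try:
        await try_chain([("primary", bad_key), ("backup", ok)])
        second = "fell back"
    except AuthError:
        second = "raised AuthError"
    return first, second


result = asyncio.run(demo())
print(result)  # (('backup', 'ok'), 'raised AuthError')
```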

5.3 Adding a Cost Dashboard

def cost_report(log: list[dict], group: str | None = None) -> dict:
    """Aggregate cost by provider and model_group.

    If `group` is given, only entries for that model_group are counted.
    """
    from collections import defaultdict

    by_group = defaultdict(lambda: defaultdict(float))
    by_provider = defaultdict(float)

    for entry in log:
        if group is not None and entry["model_group"] != group:
            continue
        by_group[entry["model_group"]][entry["provider"]] += entry["cost_usd"]
        by_provider[entry["provider"]] += entry["cost_usd"]

    return {
        "total_usd": sum(by_provider.values()),
        # Convert nested defaultdicts to plain dicts for clean serialization
        "by_group": {g: dict(p) for g, p in by_group.items()},
        "by_provider": dict(by_provider),
    }

This lets you see at a glance what share of spend goes to each provider and model tier — and whether your fast route is actually hitting the cheapest model.
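
As a small illustration of the "share of spend" view, the per-provider totals can be normalized into fractions. The `provider_share` helper and the sample log are made up for this sketch; the entries use the same `provider` and `cost_usd` fields that _log_cost() writes.

```python
from collections import defaultdict


def provider_share(log: list) -> dict:
    """Fraction of total spend per provider (empty dict if nothing was spent)."""
    totals = defaultdict(float)
    for entry in log:
        totals[entry["provider"]] += entry["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {p: cost / grand_total for p, cost in totals.items()}


# Illustrative log entries, not real measurements
sample_log = [
    {"model_group": "fast", "provider": "openai", "cost_usd": 0.25},
    {"model_group": "fast", "provider": "anthropic", "cost_usd": 0.25},
    {"model_group": "smart", "provider": "openai", "cost_usd": 0.50},
]
print(provider_share(sample_log))  # {'openai': 0.75, 'anthropic': 0.25}
```

A sudden rise in a backup provider's share is also a cheap signal that the primary has been failing over more often than expected.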

5.4 Using LiteLLM for Production

If you’d rather not build from scratch, LiteLLM has a production router with built-in fallback support [2]:

from litellm import Router

model_list = [
    {"model_name": "gpt-4o", "litellm_params": {"model": "openai/gpt-4o"}},
    {"model_name": "claude-sonnet", "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514"}},
    {"model_name": "deepseek-chat", "litellm_params": {"model": "deepseek/deepseek-chat"}},
]

router = Router(model_list=model_list)

# Configure fallback: try gpt-4o, then claude-sonnet, then deepseek
response = await router.acompletion(
    model="gpt-4o",
    messages=messages,
    fallbacks=["claude-sonnet", "deepseek-chat"],
)

LiteLLM handles provider normalization, error classification, and retry logic out of the box. The build-from-scratch approach above is useful when you need custom routing logic or can’t add a dependency [2].

Step 6: Testing Fallback Behavior

Test that fallback chains actually work by simulating provider failures:

class FailingAdapter(ProviderAdapter):
    """Test double that fails its first `fail_after` calls, then succeeds."""

    def __init__(self, fail_after: int = 0):
        self.calls = 0
        self.fail_after = fail_after

    async def complete(self, messages, config, **kwargs) -> LLMResponse:
        self.calls += 1
        if self.calls <= self.fail_after:
            raise RuntimeError("Simulated failure")
        return LLMResponse(
            content="ok", model=config.model, provider="test",
            input_tokens=10, output_tokens=5,
            latency_ms=50, cost_usd=0.0,
        )


async def test_fallback():
    router = LLMRouter()
    router.register_adapter("test-primary", FailingAdapter(fail_after=999))
    router.register_adapter("test-backup", FailingAdapter(fail_after=0))

    route = RouteConfig(
        model_group="test",
        fallbacks=[
            ("test-primary", "model-x"),
            ("test-backup",  "model-y"),
        ],
    )

    response = await router.route(
        [{"role": "user", "content": "hello"}],
        route,
    )
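    # FailingAdapter labels every response provider="test", so check the
    # model instead: only the backup entry in the chain serves model-y
    assert response.model == "model-y"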
    print("Fallback works: primary failed, backup succeeded")

Cost Analysis: What You Save

Running a single provider without fallbacks means you pay full price during outages (retries keep hitting the same expensive provider) or accept service degradation.

With the router:

50–70% of simple requests land on GPT-4.1 Nano or Claude Haiku instead of GPT-4o [3]
Fallback avoids 92% of provider outage impact: during the May 2025 Anthropic API degradation, users with fallback chains saw 0 downtime vs 100% failure on direct Anthropic calls [4]
Provider cost differences add up — running 1M classification requests on GPT-4.1 Nano ($0.10/M input tokens) instead of GPT-4o ($2.50/M) saves ~96% on that tier [5]
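
The ~96% figure can be checked with a few lines of arithmetic. The rates are the ones listed in _openai_cost(); the workload (1M requests at 200 input and 5 output tokens each) is an assumption chosen only to make the numbers concrete.

```python
# Rates in USD per 1M tokens (input, output), as listed in _openai_cost()
GPT_4O = (2.50, 10.00)
GPT_41_NANO = (0.10, 0.40)

# Assumed workload, for illustration only
REQUESTS = 1_000_000
INPUT_TOKENS = 200
OUTPUT_TOKENS = 5


def tier_cost_usd(rates: tuple) -> float:
    """Total cost of the assumed workload at the given rates."""
    input_rate, output_rate = rates
    per_request = (INPUT_TOKENS * input_rate
                   + OUTPUT_TOKENS * output_rate) / 1_000_000
    return per_request * REQUESTS


gpt4o_cost = tier_cost_usd(GPT_4O)      # about $550
nano_cost = tier_cost_usd(GPT_41_NANO)  # about $22
savings = 1 - nano_cost / gpt4o_cost
print(f"{savings:.0%}")  # 96%
```

Both the input and output rates differ by the same factor of 25, so the saving stays at 96% whatever the token mix; only the absolute dollar amounts depend on the assumed workload.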

The Verdict

Every AI application running in production on a single provider has a built-in single point of failure. The router pattern (provider adapters + fallback chains + circuit breakers) is about 200 lines of core logic that removes that dependency, as long as the providers in a chain do not fail at the same time.

Start with: The adapter layer for your primary provider and one backup. Define a single fallback chain for your most critical endpoint. Test failure scenarios with the FailingAdapter before deploying.

Scale to: Multiple capability tiers (fast, smart, structured), cost dashboards, and LiteLLM for advanced routing once your traffic justifies it.

References

[1] OpenAI Status History — Q2 2025 uptime report. https://status.openai.com/uptime — Documents 3 API outages in Q2 2025, including a 45-minute degradation event on April 22.

[2] LiteLLM Documentation — Router fallback configuration and model list setup. https://docs.litellm.ai/docs/routing — Official reference for model routing groups, fallback strategies, and load balancing in LiteLLM.

[3] Maxim AI. “Top 5 LLM Gateways in 2026: A Production-Ready Comparison.” Apr 2026. https://www.getmaxim.ai/articles/top-5-llm-gateways-in-2026-a-production-ready-comparison/ — Benchmarks cost savings from intelligent routing across providers.

[4] Bastaner, T. “Beyond Model Fallbacks: Building Provider-Level Resilience for AI Systems.” Medium, Oct 2025. https://medium.com/@tombastaner/beyond-model-fallbacks-building-provider-level-resilience-for-ai-systems-e1d00f3b016d — Analysis of provider outage impact and fallback effectiveness during the May 2025 Anthropic degradation.

[5] OpenAI Pricing Page. 2026. https://openai.com/api/pricing/ — Per-model token pricing used in cost calculations across all OpenAI models.

[6] Bifrost AI Gateway — Multi-provider fallback and load balancing. https://github.com/maximhq/bifrost — Reference for open-source LLM gateway architecture with automatic failover routing.
