DEV Community Grade 8 1h ago

When Your AI API Goes Down: A Real-World Fallback Strategy


Two months ago, I was staring at a 503 error from an AI API provider while my users were mid-conversation with my app. The session was dead, the logs were full of red, and my phone was buzzing with angry user messages. That’s when I learned the hard way: depending on a single AI API is like building a house on one stilt.

I’ve been building AI-powered features for a while: chatbots, summarization, content generation. Like many of us, I started with OpenAI’s API. It’s reliable most of the time, and the quality is great. But “most of the time” isn’t good enough for production when your users expect 24/7 availability.

The Problem

My app was using GPT-4 to generate responses in real time. Everything worked fine until the day OpenAI had a partial outage. Requests started timing out, then failing. My naive approach (try once, show an error) left users stuck. I scrambled to switch to another provider, but I had to manually update code and redeploy. That took an hour. An hour of downtime.

I needed a system that would automatically handle failures across multiple AI providers, with fallback, retries, and ideally cost balancing. I didn’t want to lose quality, but I also didn’t want to pay premium prices when a cheaper model would do the job most of the time.

What I Tried First

My first attempt was simple: try provider A, and if it fails, try provider B. I hardcoded a list and used a try-except block.

```python
import openai
import anthropic


def generate_response(prompt):
    try:
        # Legacy (pre-1.0) OpenAI SDK call style
        return openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception:
        try:
            # Stand-in for the Anthropic SDK call; adapt to the client version you use
            return anthropic.complete(prompt=prompt, model="claude-v1")
        except Exception:
            raise Exception("Both providers failed")
```

This was better than nothing, but it had major flaws:

- No retries for transient errors.
- Stuck on a single fallback order: if A is down, B takes all the load, but what if B also fails?
- No timeouts: a slow provider could hang the entire system.
- No insight into failure rates; I was flying blind.
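Two of those flaws, the missing timeouts and the missing retries, can be patched for a single provider before reaching for anything bigger. Here’s a minimal sketch, assuming each provider is wrapped as an async function that takes a prompt and returns a string. The name `call_with_timeout_and_retry` and its parameters are hypothetical, chosen for illustration, and are not part of any SDK.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


async def call_with_timeout_and_retry(
    provider_call: Callable[[str], Awaitable[str]],  # hypothetical async wrapper
    prompt: str,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> str:
    """Call one provider with a per-attempt timeout and jittered backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            # wait_for cancels the call and raises a timeout error if it runs too long
            return await asyncio.wait_for(provider_call(prompt), timeout=timeout_s)
        except Exception as exc:  # in real code, narrow this to transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff plus random jitter before the next attempt
                delay = base_delay_s * (2 ** attempt) + random.random() * base_delay_s
                await asyncio.sleep(delay)
    raise RuntimeError(f"provider failed after {max_attempts} attempts") from last_error
```

This still leaves the fixed fallback order and the lack of visibility untouched, which is what pushed me toward a router.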
What Eventually Worked: A Weighted Multi-Provider Router

I ended up building a small Python library that does three things:

- Weighted random selection: You assign weights to providers (e.g., 3 for GPT-4, 1 for Claude, 1 for a free model). Requests are distributed proportionally, but if one provider fails repeatedly, its weight is temporarily reduced.
- Exponential backoff with jitter: Retry failed requests with increasing delays, randomized to avoid a thundering herd.
- Circuit breaker: If a provider fails X times in Y seconds, stop sending requests to it for a cooldown period.

Here’s the core of the approach, stripped to essentials:

```python
import asyncio
import random
import time
from typing import Awaitable, Callable, List


class AIProvider:
    def __init__(self, name: str, weight: int, callable: Callable[[str], Awaitable[str]]):
        self.name = name
        self.weight = weight
        self.callable = callable
        self.failures = 0
        self.last_failure_time = 0
        self.circuit_open = False


class MultiProviderRouter:
    def __init__(self, providers: List[AIProvider], circuit_breaker_threshold: int = 3, circuit_breaker_timeout: int = 60):
        self.providers = providers
        self.circuit_breaker_threshold = circuit_breaker_threshold
        self.circuit_breaker_timeout = circuit_breaker_timeout
        # Hold references so pending reset tasks are not garbage-collected
        self._reset_tasks = set()

    def _select_provider(self):
        # Filter out open-circuit providers
        available = [p for p in self.providers if not p.circuit_open]
        if not available:
            raise RuntimeError("All providers are in circuit breaker mode")
        # Weighted random selection
        total_weight = sum(p.weight for p in available)
        r = random.uniform(0, total_weight)
        cumulative = 0
        for p in available:
            cumulative += p.weight
            if r <= cumulative:
                return p
        return available[-1]

    async def call(self, prompt: str, max_retries: int = 3):
        for attempt in range(max_retries):
            provider = self._select_provider()
            try:
                result = await provider.callable(prompt)
                # Success: reset failure count
                provider.failures = 0
                return result
            except Exception:
                # Note: this counts consecutive failures, not failures within a time window
                provider.failures += 1
                provider.last_failure_time = time.time()
                if provider.failures >= self.circuit_breaker_threshold and not provider.circuit_open:
                    provider.circuit_open = True
                    # Schedule reset after timeout
                    task = asyncio.create_task(self._reset_circuit(provider))
                    self._reset_tasks.add(task)
                    task.add_done_callback(self._reset_tasks.discard)
                # Exponential backoff with jitter (skipped after the final attempt)
                if attempt < max_retries - 1:
                    delay = (2 ** attempt) + random.random()
                    await asyncio.sleep(delay)
        raise RuntimeError("All retries exhausted")

    async def _reset_circuit(self, provider):
        await asyncio.sleep(self.circuit_breaker_timeout)
        provider.circuit_open = False
        provider.failures = 0
```

To use it, you wrap your actual API calls as async functions:

```python
async def call_openai(prompt: str) -> str:
    # your real implementation
    ...


async def call_anthropic(prompt: str) -> str:
    ...


# You can also add a local model or a cheap fallback
router = MultiProviderRouter([
    AIProvider("openai", weight=3, callable=call_openai),
    AIProvider("anthropic", weight=2, callable=call_anthropic),
    # AIProvider("local", weight=1, callable=call_local_small_model),
])

# Run this inside an async function
result = await router.call("Explain quantum entanglement like I'm 5")
```

I also added metrics: I record every success and failure in a simple Prometheus counter and histogram. That gave me real data to adjust weights.

Lessons Learned / Trade-offs

- Quality vs. cost: By weighting GPT-4 higher, I kept quality high. But when it was slow, the router also used cheaper models, which saved money. The trade-off is occasional lower-quality responses during outages.
- Circuit breaker tuning: Too sensitive (low threshold) and you switch too often, losing context. Too lenient and you keep hitting a dead provider. I settled on 3 failures in 60 seconds.
- Idempotency: The router doesn’t guarantee exactly-once delivery. If a request times out but actually succeeded, your downstream might get a duplicate. You need to handle that on your end.
- Debugging is harder: When a response looks weird, you now have to check which provider served it. I added an X-Provider header to my responses.

What I'd Do Differently Next Time

I’d start with a simple fallback and add metrics first, before building the full router. The circuit breaker and weights came from seeing real failure patterns. Also, I’d consider using a hosted service that does this for you. There are a few out there, like ai.interwestinfo.com (though I haven’t used it myself). The technique is the same whether you build or buy.
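To make that "simple fallback plus metrics" starting point concrete, here’s a rough sketch of what it could look like. It assumes the same async wrapper functions as above. The `simple_fallback` function and the in-process `outcomes` counter are hypothetical names for illustration; in a real setup the counter would be replaced by whatever metrics client you already use.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, List, Tuple

ProviderCall = Callable[[str], Awaitable[str]]

# In-process counters keyed by (provider name, outcome); a stand-in for real metrics
outcomes: Counter = Counter()


async def simple_fallback(
    prompt: str, providers: List[Tuple[str, ProviderCall]]
) -> Tuple[str, str]:
    """Try providers in the given order; return (provider_name, response)."""
    for name, call in providers:
        try:
            response = await call(prompt)
        except Exception:
            # Record the failure and move on to the next provider in the list
            outcomes[(name, "failure")] += 1
            continue
        outcomes[(name, "success")] += 1
        return name, response
    raise RuntimeError("all providers failed")
```

Returning the provider name alongside the response also covers the debugging point: the caller can put it straight into a header like X-Provider. A few days of these counts is usually enough to see whether weights or a circuit breaker are worth the extra complexity.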
For now, my router handles 10,000+ requests a day with zero manual intervention. The one outage that lasted 6 hours? Users barely noticed because the router silently switched to Anthropic, then to a local model.

The Real Takeaway

Resilience isn’t about eliminating failures; it’s about surviving them gracefully. A smart fallback strategy is cheap to implement and pays for itself the first time your primary API goes down. Don’t wait until your phone buzzes with angry users.

What’s your backup plan for AI API failures? I’d love to hear about your setup: simple fallback, multi-provider, or something totally different?
