When Your AI API Goes Down: A Real-World Fallback Strategy
Two months ago, I was staring at a 503 error from an AI API provider while my users were mid-conversation with my app. The session was dead, the logs were full of red, and my phone was buzzing with angry user messages. That's when I learned the hard way: depending on a single AI API is like building a house on one stilt.

I've been building AI-powered features for a while: chatbots, summarization, content generation. Like many of us, I started with OpenAI's API. It's reliable most of the time, and the quality is great. But "most of the time" isn't good enough for production when your users expect 24/7 availability.

The Problem

My app was using GPT-4 to generate responses in real time. Everything worked fine until the day OpenAI had a partial outage. Requests started timing out, then failing. My naive approach (try once, show an error) left users stuck. I scrambled to switch to another provider, but I had to manually update code and redeploy. That took an hour. An hour of downtime.

I needed a system that would automatically handle failures across multiple AI providers, with fallback, retries, and ideally cost balancing. I didn't want to lose quality, but I also didn't want to go bankrupt paying for a premium model when a cheaper one would work most of the time.

What I Tried First

My first attempt was simple: try provider A, and if it fails, try provider B. I hardcoded a list and used a try-except block.

    import openai
    import anthropic

    def generate_response(prompt):
        try:
            return openai.ChatCompletion.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
            )
        except:
            try:
                return anthropic.complete(prompt=prompt, model="claude-v1")
            except:
                raise Exception("Both providers failed")

This was better than nothing, but it had major flaws:

- No retries for transient errors.
- Stuck on a single fallback order: if A is down, B takes all load, but what if B also fails?
- No timeouts: a slow provider could hang the entire system.
- No insight into failure rates; I was flying blind.
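Of those flaws, the missing timeout is probably the cheapest to fix. Here is a minimal sketch, assuming each provider call is wrapped in an async function that takes a prompt and returns a string. The helper name `call_with_timeout` and the 10-second default are illustrative assumptions, not part of the code above.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(
    provider_call: Callable[[str], Awaitable[str]],  # hypothetical async wrapper around one provider
    prompt: str,
    timeout_s: float = 10.0,  # assumed default; tune per provider
) -> str:
    """Run one provider call; raise asyncio.TimeoutError if it exceeds timeout_s."""
    # wait_for cancels the underlying call when the deadline passes,
    # so a hung provider cannot block the caller indefinitely.
    return await asyncio.wait_for(provider_call(prompt), timeout=timeout_s)
```

A timeout raised this way can be treated like any other provider failure, so it can feed the same retry or fallback path.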
What Eventually Worked: A Weighted Multi-Provider Router

I ended up building a small Python library that does three things:

- Weighted selection: You assign weights to providers (e.g., 3 for GPT-4, 1 for Claude, 1 for a free model). Requests are distributed in proportion to those weights (the stripped-down code below uses a weighted random pick), and if one provider fails repeatedly, its weight is temporarily reduced.
- Exponential backoff with jitter: Retry failed requests with increasing delays, randomized to avoid a thundering herd.
- Circuit breaker: If a provider fails X times in Y seconds, stop sending requests to it for a cooldown period.

Here's the core of the approach, stripped to essentials:

    import asyncio
    import random
    import time
    from typing import Awaitable, Callable, List


    class AIProvider:
        def __init__(self, name: str, weight: int, callable: Callable[[str], Awaitable[str]]):
            self.name = name
            self.weight = weight
            self.callable = callable
            self.failures = 0  # consecutive failures since the last success
            self.last_failure_time = 0.0
            self.circuit_open = False


    class MultiProviderRouter:
        def __init__(self, providers: List[AIProvider],
                     circuit_breaker_threshold: int = 3,
                     circuit_breaker_timeout: int = 60):
            self.providers = providers
            self.circuit_breaker_threshold = circuit_breaker_threshold
            self.circuit_breaker_timeout = circuit_breaker_timeout
            # Keep references to reset tasks so they are not garbage-collected early
            self._reset_tasks = set()

        def _select_provider(self) -> AIProvider:
            # Filter out open-circuit providers
            available = [p for p in self.providers if not p.circuit_open]
            if not available:
                raise RuntimeError("All providers are in circuit breaker mode")
            # Weighted random selection
            total_weight = sum(p.weight for p in available)
            r = random.uniform(0, total_weight)
            cumulative = 0
            for p in available:
                cumulative += p.weight
                if r <= cumulative:
                    return p
            return available[-1]

        async def call(self, prompt: str, max_retries: int = 3) -> str:
            for attempt in range(max_retries):
                provider = self._select_provider()
                try:
                    result = await provider.callable(prompt)
                    # Success: reset failure count
                    provider.failures = 0
                    return result
                except Exception:
                    provider.failures += 1
                    provider.last_failure_time = time.time()
                    if (provider.failures >= self.circuit_breaker_threshold
                            and not provider.circuit_open):
                        provider.circuit_open = True
                        # Schedule reset after timeout
                        task = asyncio.create_task(self._reset_circuit(provider))
                        self._reset_tasks.add(task)
                        task.add_done_callback(self._reset_tasks.discard)
                    # Exponential backoff with jitter (skipped after the final attempt)
                    if attempt < max_retries - 1:
                        delay = (2 ** attempt) + random.random()
                        await asyncio.sleep(delay)
            raise RuntimeError("All retries exhausted")

        async def _reset_circuit(self, provider: AIProvider):
            await asyncio.sleep(self.circuit_breaker_timeout)
            provider.circuit_open = False
            provider.failures = 0

To use it, you wrap your actual API calls as async functions:

    async def call_openai(prompt: str) -> str:
        # your real implementation
        ...

    async def call_anthropic(prompt: str) -> str:
        ...

    # You can also add a local model or a cheap fallback
    router = MultiProviderRouter([
        AIProvider("openai", weight=3, callable=call_openai),
        AIProvider("anthropic", weight=2, callable=call_anthropic),
        # AIProvider("local", weight=1, callable=call_local_small_model),
    ])

    async def main():
        result = await router.call("Explain quantum entanglement like I'm 5")
        print(result)

    asyncio.run(main())

I also added metrics: I log every success/failure to a simple Prometheus counter and histogram. That gave me real data to adjust weights.

Lessons Learned / Trade-offs

- Quality vs. cost: By weighting GPT-4 higher, I kept quality high. But when it was slow, the router also used cheaper models, which saved money. The trade-off is occasional lower-quality responses during outages.
- Circuit breaker tuning: Too sensitive (low threshold) and you switch too often, losing context. Too lenient and you keep hitting a dead provider. I settled on 3 failures in 60 seconds.
- Idempotency: The router doesn't guarantee exactly-once delivery. If a request times out but actually succeeded, your downstream might get a duplicate. You need to handle that on your end.
- Debugging is harder: When a response looks weird, you now have to check which provider served it. I added an X-Provider header in my responses.

What I'd Do Differently Next Time

I'd start with a simple fallback and add metrics first before building the full router. The circuit breaker and weights came from seeing real failure patterns. Also, I'd consider using a hosted service that does this for you. There are a few out there, like ai.interwestinfo.com (though I haven't used it myself). The technique is the same whether you build or buy.
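To make that "simple fallback plus metrics" starting point concrete, here is a rough sketch rather than the code I actually ran. It assumes the same kind of async provider wrappers as before. The names `generate_with_fallback`, `outcomes`, and `latencies` are hypothetical, and the in-memory counters only stand in for a real metrics backend such as a Prometheus counter and histogram.

```python
import asyncio
import time
from collections import Counter, defaultdict
from typing import Awaitable, Callable, Dict, List, Tuple

ProviderCall = Callable[[str], Awaitable[str]]

# In-memory stand-ins for real metrics (assumption: swap in your metrics client).
outcomes: Counter = Counter()  # (provider, "success" | "failure") -> count
latencies: Dict[str, List[float]] = defaultdict(list)  # provider -> seconds per call


async def generate_with_fallback(
    prompt: str,
    providers: List[Tuple[str, ProviderCall]],  # tried in the order given
    timeout_s: float = 10.0,  # assumed default; tune per provider
) -> Tuple[str, str]:
    """Return (provider_name, response) from the first provider that succeeds."""
    for name, call in providers:
        start = time.monotonic()
        try:
            result = await asyncio.wait_for(call(prompt), timeout=timeout_s)
        except Exception:
            # Timeouts and provider errors both count as failures; try the next one.
            outcomes[(name, "failure")] += 1
            continue
        finally:
            # Record latency for every attempt, successful or not.
            latencies[name].append(time.monotonic() - start)
        outcomes[(name, "success")] += 1
        return name, result
    raise RuntimeError("All providers failed")
```

Returning the provider name alongside the response is what makes something like the X-Provider header possible, and a few days of `outcomes` data should be enough to judge whether weights or a circuit breaker are worth the extra complexity.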
But for now, my router handles 10,000+ requests a day with zero manual intervention. The one outage that lasted 6 hours? Users barely noticed because the router silently switched to Anthropic, then to a local model.

The Real Takeaway

Resilience isn't about eliminating failures; it's about surviving them gracefully. A smart fallback strategy is cheap to implement and pays for itself the first time your primary API goes down. Don't wait until your phone buzzes with angry users.

What's your backup plan for AI API failures? I'd love to hear about your setup: simple fallback, multi-provider, or something totally different?