Why Your Microservices Need Circuit Breakers (And How to Add Them)

The Cascading Failure That Took Down Everything

Our payment service went down for 3 minutes. No big deal, right? Except every service that called payments kept retrying, and the resulting retry storm consumed all available connections. Within 10 minutes, all 12 services were down. A 3-minute failure in one service became a 45-minute total outage. Circuit breakers are designed to prevent exactly this kind of cascade.

How Circuit Breakers Work

State Machine:

CLOSED ──(failures exceed threshold)──→ OPEN
   ↑                                       │
   │                                       │
   └──(success)──← HALF-OPEN ←──(timeout)──┘
  • CLOSED: Normal operation. Requests pass through. Track failure rate.
  • OPEN: Requests fail immediately (fast failure). No traffic to the struggling service. Wait for timeout period.
  • HALF-OPEN: Allow test requests through. Once enough of them succeed → CLOSED. If one fails → OPEN.
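
The three states and their transitions can be sketched as a small lookup table. This is only an illustration of the state machine: the event names (`threshold_exceeded`, `timeout_elapsed`, `trial_succeeded`, `trial_failed`) are hypothetical labels, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state.
# Any pair that is not listed leaves the state unchanged.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "trial_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "trial_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)
```

Writing the transitions out this way makes it easy to check that there is no direct path from OPEN back to CLOSED: recovery always goes through HALF-OPEN.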

Implementation in Python

import time
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
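    HALF_OPEN = "half_open"

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker: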
    def __init__(self, failure_threshold=5, recovery_timeout=30, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """Run func through the breaker; fail fast while the circuit is open."""
        with self.lock:
            if self.state == CircuitState.OPEN:
                # monotonic() is unaffected by wall-clock adjustments.
                elapsed = time.monotonic() - self.last_failure_time
                if elapsed > self.recovery_timeout:
                    # Recovery window has passed: let trial requests through.
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    remaining = self.recovery_timeout - elapsed
                    raise CircuitBreakerOpenError(
                        f"Circuit breaker is OPEN. Retry after "
                        f"{remaining:.0f}s"
                    )
        # The lock is released before calling func so a slow call does not block others.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
            # Any success resets the count, so only consecutive failures trip the breaker.
            self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial request in HALF_OPEN reopens the circuit immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def process_payment(order):
    try:
        return payment_breaker.call(payment_service.charge, order)
    except CircuitBreakerOpenError:
        return queue_for_retry(order)  # Graceful degradation
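
If many call sites need the same try/except pattern, one option is to fold it into a decorator. The sketch below is an assumption about how that could look, not part of the implementation above: `with_breaker` and `AlwaysOpenBreaker` are hypothetical names, and the local `CircuitBreakerOpenError` stands in for the error the breaker raises. It works with any object that exposes a `call(func, *args, **kwargs)` method.

```python
import functools


class CircuitBreakerOpenError(Exception):
    """Stand-in for the error the breaker raises while it is open."""


def with_breaker(breaker, fallback):
    """Route calls through breaker.call and use fallback when the circuit is open."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return breaker.call(func, *args, **kwargs)
            except CircuitBreakerOpenError:
                # Degrade gracefully instead of propagating the rejection.
                return fallback(*args, **kwargs)
        return wrapper
    return decorator


class AlwaysOpenBreaker:
    """Demo-only stub: rejects every call as if the circuit were open."""
    def call(self, func, *args, **kwargs):
        raise CircuitBreakerOpenError("circuit is open")


@with_breaker(AlwaysOpenBreaker(), fallback=lambda order: f"queued:{order}")
def charge(order):
    return f"charged:{order}"


print(charge("order-42"))  # prints "queued:order-42"
```

Errors raised by the wrapped function itself still propagate to the caller; only rejections from the open circuit are redirected to the fallback.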

What to Do When the Circuit Opens

The circuit breaker buys you time. Use it wisely:

def handle_open_circuit(service_name, request):
    strategies = {
        'payment': lambda r: queue_for_retry(r),           # Retry later
        'recommendations': lambda r: return_cached(r),     # Serve stale data
        'analytics': lambda r: drop_silently(r),           # Non-critical, skip
        'auth': lambda r: allow_with_cached_token(r),      # Cached auth
        'search': lambda r: return_popular_results(r),     # Fallback results
    }
    return strategies.get(service_name, lambda r: return_error(r))(request)
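
The helper functions in that table are not defined in this article. As one illustration, a "serve stale data" fallback such as `return_cached` could sit on top of a small cache of last known good responses. The `StaleCache` class, its method names, and the 300-second default below are all assumptions made for the sketch.

```python
import time


class StaleCache:
    """Remembers the last good response per key so it can be served during an outage."""

    def __init__(self, max_age_seconds=300):
        self.max_age_seconds = max_age_seconds
        self._entries = {}  # key -> (stored_at, value)

    def store(self, key, value, now=None):
        """Record a fresh response; `now` can be injected for testing."""
        stored_at = time.monotonic() if now is None else now
        self._entries[key] = (stored_at, value)

    def get_stale(self, key, now=None):
        """Return the cached value if it is no older than max_age_seconds, else None."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        now = time.monotonic() if now is None else now
        if now - stored_at > self.max_age_seconds:
            return None
        return value


cache = StaleCache(max_age_seconds=60)
cache.store("user-1", ["book", "lamp"], now=0)
print(cache.get_stale("user-1", now=30))   # still within the 60s window
print(cache.get_stale("user-1", now=120))  # too old, so None
```

The age limit matters: stale recommendations are usually harmless, but the acceptable staleness differs per service and should be decided deliberately.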

Monitoring Circuit Breakers

circuit_breaker_metrics:
  - name: circuit_breaker_state
    type: gauge                # 0=closed, 1=open, 2=half_open
    labels: [service, target]
  - name: circuit_breaker_failures_total
    type: counter
    labels: [service, target]
  - name: circuit_breaker_rejected_total
    type: counter
    labels: [service, target]  # Requests rejected while circuit is open

alerts:
  - alert: CircuitBreakerOpen
    expr: circuit_breaker_state == 1
    for: 1m
    severity: warning
    message: "Circuit breaker for {{ $labels.target }} is OPEN"
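
To feed those metrics, the breaker has to report state changes, failures, and rejections somewhere. The sketch below uses a plain in-memory recorder so that it runs without any dependencies; `BreakerMetrics` and its method names are hypothetical, and in practice these calls would be forwarded to whatever metrics client you already use.

```python
from collections import Counter

# Gauge encoding from the metric definition: 0=closed, 1=open, 2=half_open.
STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}


class BreakerMetrics:
    """In-memory stand-in for a metrics client, keyed by (service, target)."""

    def __init__(self):
        self.state = {}            # circuit_breaker_state
        self.failures = Counter()  # circuit_breaker_failures_total
        self.rejected = Counter()  # circuit_breaker_rejected_total

    def set_state(self, service, target, state_name):
        self.state[(service, target)] = STATE_VALUES[state_name]

    def record_failure(self, service, target):
        self.failures[(service, target)] += 1

    def record_rejection(self, service, target):
        self.rejected[(service, target)] += 1


metrics = BreakerMetrics()
metrics.set_state("checkout", "payment", "open")
metrics.record_rejection("checkout", "payment")
```

A natural place to call `record_rejection` is where the breaker raises its open-circuit error, and `set_state` wherever the state is reassigned.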

The Configuration That Matters

circuit_breakers:
  payment-service:
    failure_threshold: 5
    recovery_timeout: 30s
    success_threshold: 3
    timeout_per_request: 5s

  search-service:
    failure_threshold: 10   # More tolerant
    recovery_timeout: 15s   # Recover faster
    success_threshold: 2
    timeout_per_request: 2s

  auth-service:
    failure_threshold: 3    # Less tolerant (critical)
    recovery_timeout: 10s   # Recover very fast
    success_threshold: 1
    timeout_per_request: 1s

Critical services get lower thresholds (less tolerance) and faster recovery.
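
Values such as `30s` need to be converted to numbers before they can be handed to a breaker. A minimal sketch of that step follows, assuming the config has already been loaded into a dict by a YAML parser; `BreakerConfig` and `parse_seconds` are hypothetical names, and only a trailing `s` (seconds) is handled.

```python
from dataclasses import dataclass


def parse_seconds(value):
    """Convert a duration such as '30s' (or a bare number) to float seconds."""
    text = str(value).strip()
    if text.endswith("s"):
        text = text[:-1]
    return float(text)


@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: int
    recovery_timeout: float
    success_threshold: int
    timeout_per_request: float

    @classmethod
    def from_dict(cls, raw):
        """Build a typed config from one service's entry in the config file."""
        return cls(
            failure_threshold=int(raw["failure_threshold"]),
            recovery_timeout=parse_seconds(raw["recovery_timeout"]),
            success_threshold=int(raw["success_threshold"]),
            timeout_per_request=parse_seconds(raw["timeout_per_request"]),
        )


# Mirrors the auth-service entry, as a YAML loader would return it.
auth = BreakerConfig.from_dict({
    "failure_threshold": 3,
    "recovery_timeout": "10s",
    "success_threshold": 1,
    "timeout_per_request": "1s",
})
```

Note that the `CircuitBreaker` class shown earlier takes no `timeout_per_request` argument, so that value would have to be enforced by the HTTP client or whatever function the breaker wraps.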

If you want AI-powered circuit breaker tuning and cascading failure prevention, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops.
https://novaaiops.com
