DEV Community 1h ago

Your Kubernetes Controller Is Silently Dead and Nobody Knows

The Problem

Kubernetes controllers built on client-go have a silent failure mode that's easy to miss. When a ServiceAccount token goes stale, your controller keeps running, passes health checks, shows Ready, and does absolutely nothing.

client-go retries every watch error with exponential backoff - including 401 Unauthorized.

// client-go's default: log and retry. Forever.
func DefaultWatchErrorHandler(ctx context.Context, r *Reflector, err error) {
    // logs the error
    // returns (triggering a retry with backoff)
}

This makes sense for transient errors like network blips or 503s. But a 401 means your identity is invalid. No amount of retrying will fix it. Only a pod restart mounts a fresh token.
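The distinction between a transient error and a terminal one can be sketched with the standard library alone. This is an illustration, not client-go's code: `isTerminal` and `retryUntilTerminal` are hypothetical names, only 401 is treated as terminal, and the backoff sleeps are left out for brevity.

```go
package main

import (
	"fmt"
	"net/http"
)

// isTerminal reports whether a status means that retrying with the same
// credentials cannot succeed. Assumption: only 401 counts in this sketch.
func isTerminal(status int) bool {
	return status == http.StatusUnauthorized
}

// retryUntilTerminal calls attempt up to maxAttempts times and stops early
// on success (200) or on a terminal status. It returns the last status seen
// and the number of attempts made.
func retryUntilTerminal(attempt func() int, maxAttempts int) (status, attempts int) {
	for attempts < maxAttempts {
		attempts++
		status = attempt()
		if status == http.StatusOK || isTerminal(status) {
			return status, attempts
		}
	}
	return status, attempts
}

func main() {
	// A 503 followed by a 200: retrying pays off.
	seq := []int{http.StatusServiceUnavailable, http.StatusOK}
	i := 0
	status, n := retryUntilTerminal(func() int { s := seq[i]; i++; return s }, 5)
	fmt.Println(status, n)

	// A 401: the loop stops after one attempt instead of spending the budget.
	status, n = retryUntilTerminal(func() int { return http.StatusUnauthorized }, 5)
	fmt.Println(status, n)
}
```

A retry loop without the `isTerminal` check would treat the 401 like the 503 and keep going, which is the behavior described above.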

This is the default behavior in client-go, which sits underneath controller-runtime, so controllers built on either one do not handle 401 out of the box. controller-runtime recently added the ability to set a custom watch error handler (SetWatchErrorHandler, #3149), but it's opt-in.

Where I Found It

During a Helm upgrade via ArgoCD on KAI-Scheduler (CNCF Sandbox GPU scheduler), a Config CR got deleted and recreated. This caused the operator to recreate scheduler pods whose projected ServiceAccount tokens became invalid.

The scheduler kept retrying 401s indefinitely while showing Running/Ready. The error was logged, but the process never exited and the pod never restarted. ArgoCD reported the application as Synced/Healthy the entire time.

Full incident: #1751. The 401 handling gap was tracked separately in #1817.

The Fix

I submitted a PR that wraps the HTTP transport on rest.Config:

type unauthorizedRoundTripper struct {
    rt http.RoundTripper
}

func (t *unauthorizedRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := t.rt.RoundTrip(req)
    if err == nil && resp.StatusCode == http.StatusUnauthorized {
        log.InfraLogger.Errorf("API server returned 401 Unauthorized, exiting to trigger pod restart")
        os.Exit(1)
    }
    return resp, err
}

One call in the startup path:

config := clientconfig.GetConfigOrDie()
config.Wrap(wrapExitOnUnauthorized)

Every client created from this config goes through the wrapper. On the first 401, the process exits, the kubelet restarts it, and a fresh token is mounted.
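The snippets above don't show `wrapExitOnUnauthorized` itself. Here is a minimal, standard-library-only sketch of what it could look like, assuming it has the `func(http.RoundTripper) http.RoundTripper` shape that `rest.Config.Wrap` accepts. The `exitFunc` variable and the `exitsOn` helper are hypothetical additions that let the behavior be observed without killing the process, and the project logger is replaced with a plain write to stderr.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
)

// exitFunc is what the wrapper calls on a 401. Keeping it in a variable is
// an assumption of this sketch so that it can be swapped out in experiments.
var exitFunc = os.Exit

type unauthorizedRoundTripper struct {
	rt http.RoundTripper
}

func (t *unauthorizedRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.rt.RoundTrip(req)
	if err == nil && resp.StatusCode == http.StatusUnauthorized {
		fmt.Fprintln(os.Stderr, "API server returned 401 Unauthorized, exiting to trigger pod restart")
		exitFunc(1)
	}
	return resp, err
}

// wrapExitOnUnauthorized wraps any RoundTripper with the 401 check.
func wrapExitOnUnauthorized(rt http.RoundTripper) http.RoundTripper {
	return &unauthorizedRoundTripper{rt: rt}
}

// exitsOn starts a stub server that always answers with status, sends one
// request through the wrapper, and reports whether exitFunc was called.
func exitsOn(status int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()

	called := false
	saved := exitFunc
	exitFunc = func(int) { called = true }
	defer func() { exitFunc = saved }()

	client := &http.Client{Transport: wrapExitOnUnauthorized(http.DefaultTransport)}
	if resp, err := client.Get(srv.URL); err == nil {
		resp.Body.Close()
	}
	return called
}

func main() {
	fmt.Println(exitsOn(http.StatusUnauthorized))       // exit path taken
	fmt.Println(exitsOn(http.StatusServiceUnavailable)) // left to normal retries
}
```

Because the check sits in the transport, it covers informers, listers, and direct client calls alike, with no per-controller wiring.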

Why `os.Exit` and Not Graceful Shutdown?

If your token is invalid, you can't release your leader election lease either (that's an API call too). The lease expires on its own in 15 seconds. os.Exit matches what kube-scheduler upstream does for unrecoverable errors.

If you build controllers on client-go, it's worth checking whether yours handles 401 at the transport or informer level. The error does get logged, but without an exit or a failed health probe, the pod stays Running and nothing self-heals.
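For the health-probe route, one option is to record the 401 in the transport and let the health endpoint fail afterwards. This is a sketch of an alternative and is not what the PR above does. `sawUnauthorized`, `flaggingRoundTripper`, `authCheck`, and `probeAfter` are hypothetical names. `authCheck` uses a `func(*http.Request) error` signature so it could plausibly be registered as a controller-runtime health check, though that wiring is not shown and is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// sawUnauthorized is set once any API response comes back as 401.
var sawUnauthorized atomic.Bool

type flaggingRoundTripper struct {
	rt http.RoundTripper
}

func (t *flaggingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.rt.RoundTrip(req)
	if err == nil && resp.StatusCode == http.StatusUnauthorized {
		sawUnauthorized.Store(true)
	}
	return resp, err
}

// authCheck returns an error once a 401 has been observed.
func authCheck(_ *http.Request) error {
	if sawUnauthorized.Load() {
		return errors.New("API server rejected credentials with 401")
	}
	return nil
}

// healthz turns the check result into an HTTP status for a probe.
func healthz(w http.ResponseWriter, r *http.Request) {
	if err := authCheck(r); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probeAfter resets the flag, sends one request through the wrapper to a
// stub API server answering with status, and returns the code that healthz
// reports afterwards.
func probeAfter(status int) int {
	sawUnauthorized.Store(false)
	api := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	defer api.Close()

	client := &http.Client{Transport: &flaggingRoundTripper{rt: http.DefaultTransport}}
	if resp, err := client.Get(api.URL); err == nil {
		resp.Body.Close()
	}

	rec := httptest.NewRecorder()
	healthz(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(probeAfter(http.StatusOK))
	fmt.Println(probeAfter(http.StatusUnauthorized))
}
```

This only leads to a restart if the endpoint backs a liveness probe. A failing readiness probe marks the pod as not Ready but leaves the container running.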
