DEV Community
The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations
On June 8, 2021, a Fastly CDN configuration change triggered a global outage that took down the UK government website, the New York Times, Reddit, and hundreds of other major internet properties for approximately one hour. The triggering event was a configuration push. The propagation mechanism was automated. The time between the configuration being pushed and the global impact becoming visible was under a minute. Identifying the cause and restoring most of the network took human operators roughly forty-nine minutes.

The Fastly incident is not primarily a story about automation failure. It is a story about the speed asymmetry between automated propagation and human response, and about what happens when the automation layer between a human decision and its production consequence moves faster than the accountability layer designed to govern it.

This asymmetry is the defining operational challenge of AI-assisted SRE. The capability to automate incident detection, root cause hypothesis generation, and even remediation is now accessible at costs and latencies that were unavailable five years ago. The operational risk is not that this capability will be under-used. The risk is that it will be deployed without a rigorous escalation policy: a formal framework that defines exactly where automated execution ends and human judgement begins, under what conditions the boundary shifts, and how accountability is preserved for every action the AI takes on behalf of an operator who may not have been in the room when it was taken.

The Human-in-the-Loop Spectrum

AI-assisted SRE operations do not exist at a single point on the autonomy spectrum. They exist across a range, and the appropriate position on that range is a function of confidence, blast radius, novelty, and regulatory context, not of how sophisticated the AI system is.

THE AUTOMATION AUTONOMY SPECTRUM
────────────────────────────────────────────────────────────────────────────
LEVEL 0: MANUAL
  AI generates no recommendations. Human observes raw telemetry and decides.
  Appropriate when: AI system is unavailable, untrusted, or context is
  outside AI training distribution entirely.

LEVEL 1: ASSISTED
  AI surfaces relevant context, correlated signals, and historical patterns.
  Human makes all decisions. AI does not recommend actions.
  Appropriate when: novel failure pattern; first occurrence of incident
  type; regulated change requiring documented human judgement.

LEVEL 2: SUPERVISED
  AI recommends specific actions with confidence scores. Human approves
  each action before execution. AI does not execute autonomously.
  Appropriate when: high blast radius; unfamiliar but not novel pattern;
  action is reversible but consequential.

LEVEL 3: CONDITIONAL AUTONOMOUS
  AI executes actions autonomously within pre-approved policy boundaries.
  Human is notified after execution. Human can abort within a defined window.
  Appropriate when: well-characterised failure pattern; low blast radius;
  action is fully reversible; pattern seen > N times with consistent outcome.

LEVEL 4: AUTONOMOUS
  AI executes and verifies remediation without human notification unless
  verification fails. Audit trail maintained.
  Appropriate when: toil pattern fully characterised; action is idempotent;
  blast radius is bounded to a single service; recurrence rate justifies
  zero-latency response.
────────────────────────────────────────────────────────────────────────────
CRITICAL CONSTRAINT: No action may exist permanently at Level 4.
Every Level 4 automation must have a scheduled re-qualification review that
reassesses whether the failure pattern is still well-characterised and the
blast radius assumption still holds.
────────────────────────────────────────────────────────────────────────────

The critical constraint, that no action may exist permanently at Level 4, is not conservatism.
It is the engineering response to a specific failure mode: automation that was correctly calibrated at deployment time and has silently drifted out of calibration as the system evolved. An OOM restart automation that was safe when first deployed becomes unsafe the moment the underlying cause shifts from a memory leak to a data corruption event that is triggering the same symptom. The re-qualification review is the mechanism that catches this drift before it produces an incident.

The Four Escalation Triggers

Every escalation policy is built from four primitive triggers. Each trigger defines a condition under which the automation level must shift upward, toward more human involvement, not less.

Trigger 1: Confidence Threshold Breach

The AI system's confidence in its diagnosis or recommended action has fallen below a defined threshold. In the context of LLM-based operations (HolmesGPT, LiteLLM Proxy routing), confidence is expressed as a combination of model-reported token probability distributions and domain-specific heuristics applied to the recommendation output. A low-confidence diagnosis means the AI has identified a plausible pattern match but lacks sufficient corroborating signal to recommend action without human review. Executing actions based on low-confidence diagnoses is the operational equivalent of acting on a single data point in a monitoring dashboard: occasionally correct, reliably dangerous as a policy.

Trigger 2: Blast Radius Threshold

The proposed action affects more infrastructure than the policy authorises for autonomous execution. Blast radius is assessed across three dimensions: service count (how many services are affected), traffic fraction (what percentage of user requests are served by the affected infrastructure), and reversibility (can the action be undone in under five minutes with a single command). High blast radius is not a disqualifying condition for automation.
It is a condition that requires the automation level to shift to at least Level 2 (supervised) regardless of confidence score.

Trigger 3: Novelty Detection

The failure pattern does not match any pattern in the AI system's training corpus or historical incident database. Novelty is the most dangerous condition for autonomous execution because it is precisely the condition where the AI's pattern-matching capability provides the least value, and where a confident-sounding but incorrect recommendation carries the highest operational cost. Novelty detection is the hardest trigger to implement well, because it requires the AI system to accurately assess the boundaries of its own knowledge. A system that cannot reliably distinguish "I have seen this pattern and am confident" from "I have seen a superficially similar pattern and am extrapolating" should not be operating at Level 3 or Level 4.

Trigger 4: Regulatory Boundary

The proposed action would touch a regulated asset, require a documented change record, affect a system subject to NERC CIP, PCI-DSS, HIPAA, or equivalent obligations, or generate a compliance event. In regulated environments, no automated action may bypass the change management governance framework, regardless of confidence score or blast radius. This trigger is absolute. It does not have a confidence threshold exception. An AI system that correctly diagnoses a production issue with 99% confidence and proposes a remediation that would constitute an undocumented change to a regulated asset must escalate to Level 2 and generate a change record, even if the remediation would restore service faster without it.

Designing the Escalation Policy Document

The escalation policy is an operational governance document, not a configuration file. It must be version-controlled, reviewed and approved by SRE leadership and compliance, and referenced in every AI-assisted automation's runtime configuration.
Its authority derives from human review, not from the AI system that consults it.

ESCALATION POLICY: AI-ASSISTED INCIDENT RESPONSE
────────────────────────────────────────────────────────────────────────────
Service:        production-platform (all services)
AI System:      HolmesGPT + LiteLLM Proxy + Ollama / GitHub Models
Policy Version: v1.3 | Approved: SRE Lead + VP Engineering
Last Reviewed:  2025-Q1 | Next Review: 2025-Q2
────────────────────────────────────────────────────────────────────────────

SECTION 1: AUTONOMOUS EXECUTION AUTHORISED (Level 4)
  Conditions required (ALL must be true):
    ✓ Confidence score ≥ 0.85 (model-reported + heuristic composite)
    ✓ Pattern seen ≥ 10 times in incident history with consistent outcome
    ✓ Blast radius: single service, single namespace, ≤ 20% of replicas
    ✓ Action is idempotent and fully reversible in ≤ 5 minutes
    ✓ No regulated asset in scope
    ✓ Error budget > 25% remaining (not in Tier 3 freeze)
  Authorised actions at Level 4:
    → Rolling restart of single stateless deployment (OOM, deadlock)
    → Scale-up of single HPA-managed deployment by ≤ 2 replicas
    → Certificate rotation on non-production workloads
    → Log pipeline gateway restart (telemetry outage, no production impact)
  Required logging: structured Splunk event per action (mandatory)
  Re-qualification: every 90 days or after any incident where autonomous
                    action was taken and outcome was suboptimal

SECTION 2: SUPERVISED EXECUTION (Level 2: Human Approval Required)
  Conditions triggering Level 2 (ANY is sufficient):
    ⚠ Confidence score 0.60–0.84
    ⚠ Blast radius: > 20% of replicas OR > 1 service OR cross-namespace
    ⚠ First or second occurrence of this failure pattern
    ⚠ Error budget between 25–75% (Tier 2 degraded)
    ⚠ Action affects shared infrastructure (Argo CD, Prometheus, Istio)
  Approval mechanism: Slack approval button with 10-minute timeout
  Timeout behaviour: escalate to on-call if no response in 10 minutes
  Required logging: recommendation + approval/rejection + outcome

SECTION 3: ASSISTED ONLY (Level 1: No Action Authorised)
  Conditions triggering Level 1 (ANY is sufficient):
    ✗ Confidence score 0.15 OR holmesgpt:false_positive_rate:rate7d > 0.20
  for: 1d
  labels:
    severity: ticket
    domain: ai_ops_quality
  annotations:
    summary: >
      HolmesGPT recom
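
As a rough illustration of how the Section 1 and Section 2 conditions and the regulatory trigger might be evaluated at runtime, here is a minimal Python sketch. It is not taken from HolmesGPT or any other tool named in this article. The `Proposal` fields, the `Level` enum, and the `decide` function are hypothetical names, and the two rules marked ASSUMPTION in the comments (treating confidence below the Section 2 band as Level 1, and falling back to human approval when a Level 4 condition is unmet) are choices made for the sketch rather than part of the policy text.

```python
from dataclasses import dataclass, replace
from enum import IntEnum


class Level(IntEnum):
    ASSISTED = 1     # context only, no action authorised
    SUPERVISED = 2   # human approves each action before execution
    AUTONOMOUS = 4   # executes and verifies without prior approval


@dataclass(frozen=True)
class Proposal:
    """Hypothetical description of one proposed remediation."""
    confidence: float              # composite score, 0.0-1.0
    times_seen: int                # occurrences of this pattern, including this one
    services_affected: int
    cross_namespace: bool
    replica_fraction: float        # share of replicas touched, 0.0-1.0
    idempotent: bool
    rollback_minutes: float        # time needed to fully reverse the action
    regulated_asset: bool
    error_budget_remaining: float  # 0.0-1.0
    shared_infrastructure: bool    # e.g. Argo CD, Prometheus, Istio
    novel: bool                    # no match in incident history


def decide(p: Proposal) -> tuple[Level, list[str]]:
    """Return the highest permitted autonomy level and the reasons for escalating."""
    # Trigger 3 (novelty) and confidence below the Level 2 band.
    # ASSUMPTION: 0.60 is read off the lower edge of the Section 2 band.
    floor = []
    if p.novel:
        floor.append("novel failure pattern")
    if p.confidence < 0.60:
        floor.append("confidence below 0.60")
    if floor:
        return Level.ASSISTED, floor

    # Section 2: ANY of these forces human approval.
    triggers = {
        "confidence in 0.60-0.84 band": p.confidence < 0.85,
        "blast radius above autonomous limit": (
            p.replica_fraction > 0.20 or p.services_affected > 1 or p.cross_namespace
        ),
        "first or second occurrence": p.times_seen <= 2,
        "error budget between 25% and 75%": 0.25 <= p.error_budget_remaining <= 0.75,
        "shared infrastructure in scope": p.shared_infrastructure,
        # Trigger 4 is absolute: no confidence score overrides it.
        "regulated asset: change record required": p.regulated_asset,
    }
    fired = [name for name, hit in triggers.items() if hit]
    if fired:
        return Level.SUPERVISED, fired

    # Section 1: ALL must hold. Confidence, blast radius, and regulated scope
    # were already checked by the triggers above, so they are not repeated.
    required = {
        "pattern seen at least 10 times": p.times_seen >= 10,
        "idempotent": p.idempotent,
        "reversible within 5 minutes": p.rollback_minutes <= 5,
        "error budget above 25%": p.error_budget_remaining > 0.25,
    }
    unmet = [f"Level 4 condition not met: {name}" for name, ok in required.items() if not ok]
    if unmet:
        # ASSUMPTION: anything short of Level 4 falls back to human approval.
        return Level.SUPERVISED, unmet
    return Level.AUTONOMOUS, []


routine = Proposal(
    confidence=0.92, times_seen=14, services_affected=1, cross_namespace=False,
    replica_fraction=0.10, idempotent=True, rollback_minutes=2.0,
    regulated_asset=False, error_budget_remaining=0.90,
    shared_infrastructure=False, novel=False,
)
print(decide(routine))
print(decide(replace(routine, confidence=0.99, regulated_asset=True)))
```

The escalation checks run before the Level 4 checks so that any single trigger can only move a proposal toward more human involvement, which mirrors the ANY/ALL split between Section 2 and Section 1. Returning the reasons alongside the level gives the approval request and the audit log something concrete to record.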