AI Anomaly Detection in Grafana: 3 Mistakes We Made
Context - Why We Tried AI Anomaly Detection in the First Place
Our alerting stack before this experiment was a graveyard of static thresholds. CPU above 80%? Alert. Memory above 75%? Alert. P95 latency above 500ms? Alert. Every one of those numbers was picked by a human, at a point in time, for a service that has since changed completely.
The result was a Prometheus setup with roughly 200 alert rules that the on-call rotation had learned to mostly ignore. Alert fatigue was real and documented - we had a Slack channel called #alerts-noise that received more traffic than #incidents.
The incident that finally pushed us to act was a gradual memory leak in a Go microservice. The leak was slow - about 12MB per hour. It stayed under every static threshold for six full hours. No alert fired. Then the service hit the container memory limit, got OOM-killed, and the cascade took down three downstream services before we caught it.
The postmortem was uncomfortable. We had all the metrics. Prometheus had scraped every data point. We just had no rule that would have caught a slow, sustained drift rather than a sharp spike.
The target stack we built toward: Grafana 10.2.3, Prometheus 2.48, and a Python-based anomaly model using Isolation Forest from scikit-learn 1.4.0, deployed as a sidecar service on Kubernetes 1.28. The model would score incoming metric streams and expose those scores back to Prometheus, where Grafana alert rules would evaluate them. Clean in theory. Painful in practice.
Mistake 1 - We Trusted the Model Out of the Box Without Baseline Training Data
The first version of the model was trained on two weeks of Prometheus metrics pulled via the Prometheus HTTP API. We ran the training during a quiet period - post-holiday, low traffic, no deployments. The model learned what "normal" looked like during the quietest two weeks of our entire year.
Monday morning arrived. Traffic ramped up as users came back online. The model had never seen a Monday morning traffic pattern. It flagged every single ramp as an anomaly. We got 400 Grafana alerts in the first week. Most of them were garbage.
The subtler problem was the contamination parameter. We ran with contamination=0.1, the value scikit-learn used as its default before version 0.22 (current releases default to "auto"). What that means in practice: the Isolation Forest places its decision threshold so that 10% of the training points are labeled anomalies, regardless of whether your data actually contains 10% anomalies. In a healthy, stable service, you might have 0.5% genuinely anomalous points. Setting contamination=0.1 forces the model to invent the other 9.5%. It will find them. It will call your Monday morning traffic an anomaly. It will call your weekly deployment window an anomaly. It will call a lot of things anomalies, because you told it to.
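To make the effect concrete, here is a minimal sketch of how a fixed contamination value becomes a decision threshold: the cutoff lands at the matching quantile of the training scores, so that share of points is flagged whether or not anything is wrong. It uses synthetic Gaussian scores rather than a real Isolation Forest, and flag_fraction is a hypothetical helper written for this illustration.

```python
import random


def flag_fraction(scores, contamination):
    """Flag the top `contamination` share of scores, mimicking how a fixed
    contamination value places the decision threshold on training data."""
    ranked = sorted(scores, reverse=True)            # highest score = most anomalous
    n_flagged = max(1, round(len(ranked) * contamination))
    threshold = ranked[n_flagged - 1]                # score of the last flagged point
    flagged = [s for s in scores if s >= threshold]
    return threshold, len(flagged) / len(scores)


rng = random.Random(42)
# Synthetic "healthy" scores: one well-behaved cluster, no real anomalies.
healthy_scores = [rng.gauss(0.45, 0.03) for _ in range(10_000)]

for c in (0.1, 0.02):
    threshold, share = flag_fraction(healthy_scores, c)
    print(f"contamination={c}: threshold={threshold:.3f}, flagged={share:.1%}")
```

Even though every point comes from the same healthy distribution, the flagged share tracks the contamination value, not the data.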
The fix took about three weeks to implement properly. We retrained on 90 days of data including at least two full deploy cycles and multiple traffic peaks. We tuned contamination down to 0.02 - roughly matching our observed real-incident rate. We added hour_of_week as an explicit feature dimension so the model could learn that Monday 9am is structurally different from Sunday 3am. The false-positive rate dropped by roughly 80% after retraining.
Watch out for: the contamination value. A setting like 0.1 is almost certainly wrong for your workload. Always benchmark it against your actual historical incident rate before deploying to production.
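One rough way to do that benchmarking is to measure how much of the training window was covered by known incidents. The sketch below is an assumption about how such a check could look, not the script we actually ran. estimate_contamination and its floor and cap arguments are hypothetical, and the cap of 0.5 reflects the upper bound scikit-learn accepts for contamination.

```python
from datetime import datetime, timedelta


def estimate_contamination(incidents, window_start, window_end, floor=0.001, cap=0.5):
    """Share of the training window covered by known incidents.

    `incidents` is a list of (start, end) datetimes. Overlapping incidents
    are merged first so the same minutes are not counted twice.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    # Clip each incident to the window and sort by start time.
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in incidents
        if e > window_start and s < window_end
    )
    covered = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += (cur_end - cur_start).total_seconds()
    return min(cap, max(floor, covered / total))


# Hypothetical incident log for a 90-day training window.
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=90)
incident_log = [
    (datetime(2024, 1, 15, 8), datetime(2024, 1, 16, 8)),
    (datetime(2024, 2, 20, 2), datetime(2024, 2, 20, 14)),
]
print(f"suggested contamination: {estimate_contamination(incident_log, window_start, window_end):.4f}")
```

The result is only a starting point for tuning, since incident logs usually undercount short anomalies that never became tickets.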
Here is the training script we use now, run weekly via a Kubernetes CronJob scheduled at 0 2 * * 0 (Sunday 2am UTC):
# anomaly_model/train.py
# Trains Isolation Forest on Prometheus metric data and serializes model + scaler
# Run via Kubernetes CronJob weekly; outputs versioned artifacts to /models/
import os
from datetime import datetime, timedelta, timezone

import joblib
import pandas as pd
import requests
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

PROMETHEUS_URL = os.getenv("PROMETHEUS_URL", "http://prometheus-svc:9090")
MODEL_OUTPUT_DIR = os.getenv("MODEL_OUTPUT_DIR", "/models")
LOOKBACK_DAYS = 90
STEP = "60s"
# Prometheus rejects range queries that would return more than 11,000 points
# per series, so the lookback is fetched in 7-day chunks (10,081 points at 60s).
CHUNK = timedelta(days=7)

METRICS = {
    "latency_p95": 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))',
    "error_rate": 'rate(http_requests_total{status=~"5.."}[5m])',
    "cpu_throttle": 'rate(container_cpu_cfs_throttled_seconds_total[5m])',
}


def fetch_metric(query: str, start: datetime, end: datetime) -> pd.Series:
    """Query the Prometheus range API in chunks and return a UTC time-indexed Series."""
    points = {}
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + CHUNK, end)
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query_range", params={
            "query": query,
            "start": chunk_start.timestamp(),
            "end": chunk_end.timestamp(),
            "step": STEP,
        }, timeout=30)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        if result:
            # Only the first series is used: each query is assumed to resolve
            # to a single series for the service being modeled.
            for t, v in result[0]["values"]:  # [[timestamp, value], ...]
                points[float(t)] = float(v)   # dict keys de-duplicate chunk boundaries
        chunk_start = chunk_end
    if not points:
        raise ValueError(f"No data returned for query: {query}")
    ts = pd.Series(points, name=query[:40]).sort_index()
    # Explicit UTC index so hour-of-week features do not depend on the pod's local timezone
    ts.index = pd.to_datetime(ts.index, unit="s", utc=True)
    return ts


def build_feature_matrix(end: datetime) -> pd.DataFrame:
    """Fetch all metrics and assemble feature matrix with engineered features."""
    start = end - timedelta(days=LOOKBACK_DAYS)
    frames = {}
    for name, query in METRICS.items():
        frames[name] = fetch_metric(query, start, end)
    df = pd.DataFrame(frames).dropna()
    # Engineered features: hour-of-week captures weekly seasonality
    df["hour_of_week"] = df.index.dayofweek * 24 + df.index.hour
    # Binary flag: mark known deploy windows (weekdays 10-12 UTC)
    df["is_deploy_window"] = (
        (df.index.dayofweek < 5) & (df.index.hour >= 10) & (df.index.hour < 12)
    ).astype(int)
    return df


def train_and_save():
    # Timezone-aware "now": a naive utcnow() would be misread as local time by .timestamp()
    end = datetime.now(timezone.utc)
    df = build_feature_matrix(end)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df.values)
    model = IsolationForest(
        n_estimators=200,
        contamination=0.02,  # tuned: ~2% expected anomaly rate in production
        random_state=42,     # pinned to prevent score drift on rebuild
        n_jobs=-1,
    )
    model.fit(X_scaled)
    # Version tag: YYYYMMDD for traceability
    version = end.strftime("%Y%m%d")
    joblib.dump(model, f"{MODEL_OUTPUT_DIR}/isolation_forest_{version}.pkl")
    joblib.dump(scaler, f"{MODEL_OUTPUT_DIR}/scaler_{version}.pkl")
    # Symlink "latest" for the serving layer to pick up without restart
    for artifact, name in [
        (f"isolation_forest_{version}.pkl", "model_latest.pkl"),
        (f"scaler_{version}.pkl", "scaler_latest.pkl"),
    ]:
        link = os.path.join(MODEL_OUTPUT_DIR, name)
        tmp_link = link + ".tmp"
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        os.symlink(os.path.join(MODEL_OUTPUT_DIR, artifact), tmp_link)
        # Atomic swap: the serving layer never observes a missing "latest" link
        os.replace(tmp_link, link)
    print(f"[train] Model version {version} saved. Samples trained: {len(df)}")


if __name__ == "__main__":
    train_and_save()
One thing worth noting: the StandardScaler fitted on training data must be serialized alongside the model using joblib.dump and loaded together at inference time. Forgetting to save the scaler is one of the most common sources of silent score corruption I've seen. The model loads fine, inference runs without errors, and the scores are completely wrong. There is no exception thrown. You will not know unless you are actively monitoring the score distribution.
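A related guard is to refuse to serve when the model and scaler on disk do not come from the same training run. The sketch below assumes the file naming used by the training script (isolation_forest_YYYYMMDD.pkl, scaler_YYYYMMDD.pkl, plus the two "latest" symlinks). resolve_artifact_pair is a hypothetical helper, not part of our serving code as published here.

```python
import os
import re

# Matches the YYYYMMDD tag the training script puts in artifact file names.
VERSION_RE = re.compile(r"_(\d{8})\.pkl$")


def resolve_artifact_pair(model_dir):
    """Resolve the 'latest' symlinks and raise unless the model and scaler
    carry the same version tag."""
    paths = {}
    for key, link_name in (("model", "model_latest.pkl"), ("scaler", "scaler_latest.pkl")):
        link = os.path.join(model_dir, link_name)
        if not os.path.exists(link):  # also False for a dangling symlink
            raise FileNotFoundError(f"missing {key} artifact: {link}")
        paths[key] = os.path.realpath(link)

    versions = {}
    for key, path in paths.items():
        match = VERSION_RE.search(os.path.basename(path))
        if not match:
            raise ValueError(f"cannot read version tag from {path}")
        versions[key] = match.group(1)

    if versions["model"] != versions["scaler"]:
        raise RuntimeError(f"model/scaler version mismatch: {versions}")
    return paths["model"], paths["scaler"], versions["model"]
```

The serving layer would call this before joblib.load on either path, so a missing or stale scaler becomes a startup failure instead of a stream of plausible-looking wrong scores.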
Mistake 2 - We Wired the Model Output Directly into PagerDuty Without a Confidence Gate
Even after fixing the training data problem, the alerting integration was still a disaster. The first version piped raw anomaly scores directly into a Grafana Alert Rule with a simple threshold: if anomaly_score > 0.6, fire. No smoothing. No consecutive-breach requirement. No pending state window. A single anomalous data point - one 60-second scrape interval - could trigger a full PagerDuty incident.
We had a case where a transient network blip caused a 30-second latency spike. The anomaly score spiked to 0.72 for exactly one evaluation cycle. PagerDuty fired. Someone got woken up. By the time they opened their laptop, the score was back at 0.15 and every metric was green. That is not a monitoring system. That is a random number generator with a pager.
The problem was that we had omitted the for: duration in the Grafana alert rule YAML. In Grafana 10.2 alerting, the for: field controls how long a condition must be continuously true before the alert transitions from Pending to Firing. Without it, the alert fires on the first breach. We found through painful trial and error that for: 3m - meaning three consecutive one-minute evaluation cycles above the threshold - was the minimum viable duration to suppress transient spikes without masking real incidents.
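The effect of the for: field is easier to see as a small state machine. The sketch below is a simplified model of the Pending to Firing transition, not Grafana's actual implementation. ForGate is a hypothetical class, and timestamps are plain seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForGate:
    """Simplified Pending -> Firing gate: the condition must hold
    continuously for `for_seconds` before the alert fires."""
    threshold: float
    for_seconds: int
    pending_since: Optional[int] = None

    def evaluate(self, now: int, value: float) -> str:
        if value <= self.threshold:
            self.pending_since = None  # any recovery resets the timer
            return "Normal"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "Firing"
        return "Pending"


# The transient blip described above: one evaluation at 0.72, then back to 0.15.
blip = [0.15, 0.72, 0.15]

no_gate = ForGate(threshold=0.6, for_seconds=0)
with_gate = ForGate(threshold=0.6, for_seconds=180)

for minute, score in enumerate(blip):
    now = minute * 60
    print(minute, score, no_gate.evaluate(now, score), with_gate.evaluate(now, score))
```

With no hold time the single breach goes straight to Firing. With a 3-minute hold it only reaches Pending and resets when the score recovers.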
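We also added a Prometheus recording rule that pre-aggregates the raw score into a 5-minute rolling average, recorded as job:anomaly_score:avg5m, so Grafana never evaluates the raw series. The provisioned Grafana alert rules that read the recorded series look like this: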
# /etc/grafana/provisioning/alerting/anomaly-rules.yaml
# Grafana 10.2 alert provisioning - two-stage anomaly alert with severity routing
# Apply: restart Grafana pod or POST /api/admin/provisioning/alerting/reload
apiVersion: 1
groups:
  - orgId: 1
    name: anomaly_detection
    folder: AI Monitoring
    interval: 1m  # evaluation cadence - matches Prometheus scrape interval
    rules:
      # Stage 1 - WARNING: smoothed score crosses lower threshold
      - uid: anomaly-warn-001
        title: "Anomaly Score Warning - Elevated"
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus-ds
            model:
              expr: job:anomaly_score:avg5m  # recording rule pre-aggregated value
              intervalMs: 60000
              maxDataPoints: 43200
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              conditions:
                - evaluator:
                    params: [0.65]
                    type: gt
                  query:
                    params: [A]
        for: 3m  # must stay above threshold for 3 consecutive evals before firing
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Anomaly score elevated on {{ $labels.job }}"
          description: >
            Rolling 5m anomaly score is {{ $values.A.Value | printf "%.3f" }},
            above warning threshold 0.65. Check Grafana dashboard: AI Anomaly Overview.
      # Stage 2 - CRITICAL: high-confidence anomaly, pages on-call
      - uid: anomaly-crit-001
        title: "Anomaly Score Critical - Page On-Call"
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus-ds
            model:
              expr: job:anomaly_score:avg5m
              intervalMs: 60000
              maxDataPoints: 43200
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              conditions:
                - evaluator:
                    params: [0.85]
                    type: gt
                  query:
                    params: [A]
        for: 3m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "HIGH CONFIDENCE anomaly on {{ $labels.job }} - investigate now"
          description: >
            Score {{ $values.A.Value | printf "%.3f" }} exceeds critical threshold 0.85.
            Model version: {{ $labels.model_version }}.
The two-stage severity split - warning at 0.65 routed to Slack, critical at 0.85 routed to PagerDuty - was a deliberate choice. It gives the team visibility into developing situations without immediately escalating to an incident. In practice, the warning tier catches about 70% of real anomalies before they reach the critical threshold, giving engineers time to investigate during business hours rather than at 3am.
Watch out for: using raw anomaly_score directly as the Grafana alert expression instead of a smoothed recording rule. This causes alert flapping and burns your PagerDuty incident quota faster than almost anything else. Always pre-aggregate.
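To see why pre-aggregation matters, the sketch below replays the transient-blip scenario through a trailing 5-sample window. It is plain Python standing in for what the recording rule does inside Prometheus, and the rolling helper is hypothetical. Both a mean and a median are shown, since either keeps a one-sample spike well under the 0.65 warning threshold.

```python
from collections import deque
from statistics import fmean, median


def rolling(values, window, agg):
    """Apply `agg` over a trailing window of up to `window` samples."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(agg(buf))
    return out


# One-minute samples: a single 0.72 spike in an otherwise quiet 0.15 series.
raw = [0.15] * 5 + [0.72] + [0.15] * 5

avg5 = rolling(raw, 5, fmean)
med5 = rolling(raw, 5, median)

print(f"raw peak:    {max(raw):.3f}")
print(f"5m avg peak: {max(avg5):.3f}")
print(f"5m med peak: {max(med5):.3f}")
```

The trade-off is latency: a real, sustained anomaly takes a few samples to pull the smoothed value over the threshold, which adds to the for: delay.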
Mistake 3 - The Model Serving Layer Had No Versioning, Rollback, or Drift Detection
This one was the most expensive mistake. We treated the ML model like a static config file. It lived in a Docker image tagged latest. No version history. No rollback procedure. No metrics about the model's own behavior.
A routine base image rebuild bumped scikit-learn from 1.3.x to 1.4.0. We did not pin the version in requirements.txt. The change was silent - no build error, no test failure, no deployment alert. What changed internally was Isolation Forest's random seed behavior and tree-building logic. Score distributions shifted downward across the board: every metric that previously scored in the [0.5, 0.7] range now scored in the [0.0, 0.4] range, well below both alert thresholds. No alerts fired. For eleven days.
During that eleven-day window, we had a real incident: Redis connection pool exhaustion caused roughly 40 minutes of degraded checkout latency in production. The anomaly model was running the entire time. It saw the metrics. It scored everything below 0.3. Nobody got paged. We found out about the degradation through a customer support ticket.
The insidious part is that there was no error message. The model did not crash. The serving endpoint returned HTTP 200 on every inference request. The scores were just wrong, and we had no way to know that without observing the score distribution over time. The silent failure mode for model drift is: scores return, but they are all low. You will not see an exception. You will see nothing, until a real incident goes undetected.
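One way to turn that silence into a loud failure is a startup check that compares the installed library version with the version the model was trained under. This is a sketch under assumptions: installed_version and assert_pinned are hypothetical helpers, and the expected version would have to be recorded next to the model artifact at training time.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def assert_pinned(package: str, expected: str, installed: Optional[str]) -> None:
    """Fail fast at startup instead of serving scores from an unexpected build."""
    if installed != expected:
        raise RuntimeError(
            f"{package} version mismatch: expected {expected}, found {installed}"
        )


# At service startup, with TRAINED_WITH read from metadata saved by the training job:
# assert_pinned("scikit-learn", TRAINED_WITH, installed_version("scikit-learn"))
```

A crash loop on deploy is noisy, but it is visible in a way that eleven days of low scores is not.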
The fix required treating the model as a first-class production service, not a background utility. We now pin scikit-learn==1.4.0 explicitly in requirements.txt. Docker images are tagged with the convention anomaly-model:YYYYMMDD-GITHASH and stored in ECR. The previous version is always retained for instant rollback via kubectl set image deployment/anomaly-model anomaly-model=<ECR_URI>:<previous_tag>. And the model service now exposes a /metrics endpoint that Prometheus scrapes, publishing anomaly_model_score_p95, anomaly_rate_5m, and model_version_timestamp. If anomaly_rate_5m drops below 0.005 for more than 30 minutes during peak traffic hours, we get a separate alert: "model may be silently underscoring - check for drift." That alert has fired twice since we added it. Both times, something real was wrong with the model configuration.
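The self-monitoring piece can be sketched in a few lines. The code below renders the three gauges in the Prometheus text exposition format and models the "silently underscoring" check. It is an illustration, not our service code: render_metrics and UnderscoreWatch are hypothetical names, the HELP strings are guesses at what each gauge means, and peak-hour detection is left to the caller.

```python
from typing import Optional


def render_metrics(score_p95: float, anomaly_rate_5m: float, model_version_ts: int) -> str:
    """Render the three gauges in the Prometheus text exposition format."""
    gauges = [
        ("anomaly_model_score_p95", "95th percentile of recent anomaly scores", score_p95),
        ("anomaly_rate_5m", "Share of points flagged anomalous over 5 minutes", anomaly_rate_5m),
        ("model_version_timestamp", "Unix time the loaded model was trained", model_version_ts),
    ]
    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class UnderscoreWatch:
    """Flags when anomaly_rate_5m stays below `floor` for more than
    `hold_seconds` while traffic is at peak."""

    def __init__(self, floor: float = 0.005, hold_seconds: int = 30 * 60):
        self.floor = floor
        self.hold_seconds = hold_seconds
        self.low_since: Optional[int] = None

    def observe(self, now: int, anomaly_rate_5m: float, is_peak: bool) -> bool:
        if not is_peak or anomaly_rate_5m >= self.floor:
            self.low_since = None  # healthy rate or off-peak: reset the timer
            return False
        if self.low_since is None:
            self.low_since = now
        return now - self.low_since > self.hold_seconds


print(render_metrics(score_p95=0.41, anomaly_rate_5m=0.012, model_version_ts=1704592800))
```

In production the same condition would more naturally live in an alert rule over the scraped anomaly_rate_5m series. The class only makes the reset-on-recovery behavior explicit.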
What We Do Differently Now - The Architecture That Actually Works
After three painful rounds of iteration, the current setup is stable. Here is what changed: