Observability Practices: A Hands-On Guide with Prometheus and Grafana
What is Observability?
Observability is the ability to understand the internal state of a system just by looking at its external outputs. Unlike traditional monitoring, which tells you whether something is wrong, observability helps you understand why it's wrong. It's built on three pillars:
- Logs: discrete, timestamped events (e.g., "User 123 logged in").
- Metrics: numeric measurements over time (e.g., request latency, memory usage).
- Traces: the path a request takes through a distributed system.
In this article, I'll walk through a real example: instrumenting a Node.js API with Prometheus for metrics collection and Grafana for visualization. Both tools are free, open-source, and widely used in production.
Why This Stack?
Prometheus and Grafana are a great starting point because:
- They're free and open-source.
- Prometheus uses a pull-based model, scraping metrics from your app at intervals.
- Grafana turns those metrics into readable dashboards.
- The combination is a de facto standard for open-source metrics and dashboards, and many teams run it alongside or in place of commercial tools like Datadog or New Relic.
Step 1: Instrumenting a Node.js App
We'll build a small Express API and expose custom metrics using prom-client, a widely used, community-maintained Prometheus client library for Node.js.
npm install express prom-client
// server.js
const express = require('express');
const client = require('prom-client');
const app = express();
const register = new client.Registry();
// Collect default Node.js metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ register });
// Custom metric: counts total HTTP requests by method, route, and status code
const httpRequestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register], // register on our registry only, not the global default
});
// Custom metric: measures request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register],
});
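// Middleware to track every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // Label with the matched route pattern rather than the raw path, so that
    // unmatched or parameterized URLs cannot create unbounded label values.
    const route = req.route ? req.route.path : 'unmatched';
    const labels = { method: req.method, route, status_code: res.statusCode };
    httpRequestCounter.inc(labels);
    end(labels);
  });
  next();
});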
// Sample endpoints
app.get('/', (req, res) => {
  res.send('Welcome to the Observability Demo API');
});
app.get('/slow', async (req, res) => {
  // Simulate a slow endpoint
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 2000));
  res.send('This endpoint is intentionally slow');
});
app.get('/error', (req, res) => {
  res.status(500).send('Something went wrong');
});
// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
app.listen(3000, () => console.log('Server running on http://localhost:3000'));
This exposes a /metrics endpoint in the Prometheus text format, which looks something like this:
http_requests_total{method="GET",route="/",status_code="200"} 12
http_request_duration_seconds_bucket{le="0.5",method="GET",route="/slow",status_code="200"} 3
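To make the structure of those lines concrete, here is a minimal sketch that splits one sample line into metric name, labels, and value. The function name parseSampleLine is hypothetical, and the parser is deliberately simplified: it assumes no escaped quotes inside label values, no trailing timestamp, and a plain numeric value. It is not how Prometheus itself parses the format.

```javascript
// Parse one sample line of the Prometheus text format into its parts.
// Simplified sketch: no escaped quotes in label values, no trailing
// timestamp, and special values such as +Inf or NaN are not handled.
function parseSampleLine(line) {
  const match = line
    .trim()
    .match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/);
  if (!match) {
    throw new Error(`Not a sample line: ${line}`);
  }
  const [, name, rawLabels = '', rawValue] = match;

  // Each label has the shape key="value"
  const labels = {};
  for (const pair of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
    labels[pair[1]] = pair[2];
  }
  return { name, labels, value: Number(rawValue) };
}

const sample = parseSampleLine(
  'http_requests_total{method="GET",route="/",status_code="200"} 12'
);
console.log(sample.name, sample.labels, sample.value);
```

Each unique combination of label values is its own time series, which is why the labels are parsed into a separate object rather than treated as part of the name.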
Step 2: Configuring Prometheus
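Prometheus needs to know where to scrape metrics from. Here's a minimal prometheus.yml. Because Prometheus runs in a container in this setup, the target is host.docker.internal rather than localhost, which inside the container would refer to the container itself:
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: 'node-app'
    static_configs:
      - targets: ['host.docker.internal:3000']
Run Prometheus with Docker (on Linux, also pass --add-host=host.docker.internal:host-gateway so that hostname resolves):
docker run -d -p 9090:9090 -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus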
Now visit http://localhost:9090 and query http_requests_total - you'll see live data flowing in from your app.
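The same query can also be run from code through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below is one possible way to do that, not part of the original setup. The helper names (buildQueryUrl, extractSamples, queryPrometheus) are hypothetical, and it assumes Node.js 18 or later for the global fetch and a Prometheus instance reachable at http://localhost:9090.

```javascript
// Build the URL for an instant query against the Prometheus HTTP API.
function buildQueryUrl(baseUrl, promql) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql);
  return url.toString();
}

// Flatten an instant-vector response into { labels, value } rows.
function extractSamples(body) {
  if (body.status !== 'success' || body.data.resultType !== 'vector') {
    throw new Error('Unexpected response from Prometheus');
  }
  return body.data.result.map(({ metric, value }) => ({
    labels: metric,
    value: Number(value[1]), // value is [unixTimestamp, "stringValue"]
  }));
}

// Assumption: Node.js 18+ (global fetch) and Prometheus on localhost:9090.
async function queryPrometheus(promql, baseUrl = 'http://localhost:9090') {
  const response = await fetch(buildQueryUrl(baseUrl, promql));
  if (!response.ok) {
    throw new Error(`Prometheus returned HTTP ${response.status}`);
  }
  return extractSamples(await response.json());
}

// Usage (requires the app and Prometheus from this article to be running):
// queryPrometheus('http_requests_total').then(console.table);
```

Keeping URL building and response parsing separate from the network call makes both easy to check without a running Prometheus.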
Step 3: Visualizing with Grafana
Run Grafana alongside Prometheus:
docker run -d -p 3001:3000 grafana/grafana
- Open http://localhost:3001 (default login: admin/admin).
- Add Prometheus as a data source, pointing to http://host.docker.internal:9090.
- Create a new dashboard with panels for:
  - Request rate: rate(http_requests_total[1m])
  - Error rate: rate(http_requests_total{status_code="500"}[1m])
  - p95 latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Within minutes, you have a live dashboard showing traffic, errors, and latency - the core signals every team needs to detect issues before users complain.
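For intuition about what the rate(...) and histogram_quantile(...) panel queries compute, here is a simplified sketch of both ideas in plain JavaScript. The function names simpleRate and estimateQuantile and the sample numbers are made up for illustration. The real Prometheus functions also handle counter resets and window extrapolation, which this sketch ignores. The quantile estimate uses linear interpolation inside the bucket that contains the target rank, the approach Prometheus documents for classic histograms.

```javascript
// Simplified per-second rate of a counter over a window of samples.
// samples: [{ t: seconds, v: counterValue }, ...] sorted by time.
// Ignores counter resets and the extrapolation Prometheus applies.
function simpleRate(samples) {
  if (samples.length < 2) return NaN;
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.v - first.v) / (last.t - first.t);
}

// Estimate a quantile (0 < q <= 1) from cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }, ...] sorted by le,
// with a final bucket whose le is Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  if (inBucket === 0) return lowerBound;
  // Linear interpolation between the bucket's lower and upper bound.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Hypothetical cumulative counts for the bucket boundaries used in server.js.
const exampleBuckets = [
  { le: 0.05, count: 10 }, { le: 0.1, count: 30 }, { le: 0.3, count: 60 },
  { le: 0.5, count: 80 }, { le: 1, count: 90 }, { le: 2, count: 98 },
  { le: 5, count: 100 }, { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, exampleBuckets)); // roughly 1.625 seconds
console.log(simpleRate([{ t: 0, v: 100 }, { t: 60, v: 160 }])); // 1 request per second
```

Because the estimate is interpolated within a bucket, its accuracy depends on the bucket boundaries: the p95 here can only be located somewhere between 1 and 2 seconds, so buckets are worth choosing around the latencies you care about.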
Why This Matters in Practice
Imagine your /slow endpoint starts timing out under load. Without observability, you'd only find out when a user complains. With metrics like p95 latency and error rate visible on a dashboard (and alerts configured on top of them), your team can catch the regression within minutes of deployment - often before it affects most users.
This same pattern - instrument, expose, scrape, visualize - applies whether you're using Prometheus/Grafana, Datadog, New Relic, or Azure Monitor. The tools differ, but the principle is the same: you can't fix what you can't see.
Key Takeaways
- Observability goes beyond monitoring: it helps you answer why, not just what.
- Instrumenting code with metrics (counters, histograms) is a low-effort, high-value practice.
- Prometheus + Grafana is a free, production-grade way to get started.
- The same principles apply across any observability platform.