DEV Community 2h ago

Observability Practices: A Hands-On Guide with Prometheus and Grafana

What is Observability?

Observability is the ability to understand the internal state of a system just by looking at its external outputs. Unlike traditional monitoring, which tells you whether something is wrong, observability helps you understand why it's wrong. It's built on three pillars:

Logs: discrete, timestamped events (e.g., "User 123 logged in").
Metrics: numeric measurements over time (e.g., request latency, memory usage).
Traces: the path a request takes through a distributed system.

In this article, I'll walk through a real example: instrumenting a Node.js API with Prometheus for metrics collection and Grafana for visualization. Both tools are free, open-source, and widely used in production.

Why This Stack?

Prometheus and Grafana are a great starting point because:

They're free and open-source.
Prometheus uses a pull-based model, scraping metrics from your app at intervals.
Grafana turns those metrics into readable dashboards.
The combination is a de facto standard for open-source monitoring and often runs alongside commercial tools such as Datadog or New Relic.
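To make the pull-based model concrete, here is a minimal in-process sketch. It is a toy simulation, not how Prometheus itself is implemented: createTarget and scrape are hypothetical names, and the 5-second gap is just an illustrative scrape interval. The point is that the app only keeps its counters up to date, while the scraper decides when to read them.

```javascript
// A toy "target": it owns its counters and renders them only when asked.
function createTarget() {
  const counts = new Map(); // label string -> running count
  return {
    recordRequest(route, statusCode) {
      const key = `route="${route}",status_code="${statusCode}"`;
      counts.set(key, (counts.get(key) || 0) + 1);
    },
    // Plays the role of a /metrics handler: render the current state as text.
    renderMetrics() {
      return [...counts]
        .map(([labels, value]) => `http_requests_total{${labels}} ${value}`)
        .join('\n');
    },
  };
}

// A toy "scraper": it chooses when to pull and timestamps each pull.
function scrape(target, nowSeconds) {
  return { scrapedAt: nowSeconds, body: target.renderMetrics() };
}

const target = createTarget();
target.recordRequest('/', 200);
target.recordRequest('/', 200);
const first = scrape(target, 0);

target.recordRequest('/error', 500);
const second = scrape(target, 5); // a later pull sees the updated totals

console.log(first.body);
console.log(second.body);
```

Because the app never pushes anything, a missed scrape loses no data: the counters keep growing and the next pull picks up the latest totals.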

Step 1: Instrumenting a Node.js App

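We'll build a small Express API and expose custom metrics using prom-client, the most widely used Prometheus client library for Node.js. It is community-maintained; the Prometheus project lists Node.js among its third-party client libraries rather than its official ones.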

npm install express prom-client

// server.js
const express = require('express');
const client = require('prom-client');
const app = express();
const register = new client.Registry();

// Collect default Node.js metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ register });

// Custom metric: counts total HTTP requests by route and status code
const httpRequestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});
register.registerMetric(httpRequestCounter);

// Custom metric: measures request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

// Middleware to track every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
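    // Label by the matched route pattern (e.g. '/users/:id'), not the raw path,
    // so IDs and unknown URLs cannot create an unbounded number of time series.
    const route = req.route ? req.route.path : 'unmatched';
    const labels = { method: req.method, route, status_code: res.statusCode };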
    httpRequestCounter.inc(labels);
    end(labels);
  });
  next();
});

// Sample endpoints
app.get('/', (req, res) => {
  res.send('Welcome to the Observability Demo API');
});

app.get('/slow', async (req, res) => {
  // Simulate a slow endpoint
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 2000));
  res.send('This endpoint is intentionally slow');
});

app.get('/error', (req, res) => {
  res.status(500).send('Something went wrong');
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000, () => console.log('Server running on http://localhost:3000'));

This exposes a /metrics endpoint in the Prometheus text format, which looks something like this:

http_requests_total{method="GET",route="/",status_code="200"} 12
http_request_duration_seconds_bucket{le="0.5",method="GET",route="/slow",status_code="200"} 3
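The le label in the bucket line means "less than or equal to": each bucket counts every observation at or below its upper bound, so the buckets are cumulative. The sketch below mimics that bookkeeping in plain JavaScript. It is an illustration only, not prom-client's internals; createHistogram and observe are hypothetical helpers, and the four durations are made up.

```javascript
// Upper bounds copied from the histogram definition in server.js.
const BUCKETS = [0.05, 0.1, 0.3, 0.5, 1, 2, 5];

function createHistogram(upperBounds) {
  return {
    upperBounds,
    bucketCounts: upperBounds.map(() => 0), // cumulative count per "le" bound
    count: 0, // total observations, which is also the le="+Inf" bucket
    sum: 0, // sum of all observed values
  };
}

function observe(histogram, seconds) {
  // An observation increments every bucket whose bound it fits under.
  histogram.upperBounds.forEach((bound, i) => {
    if (seconds <= bound) histogram.bucketCounts[i] += 1;
  });
  histogram.count += 1;
  histogram.sum += seconds;
}

const h = createHistogram(BUCKETS);
[0.02, 0.2, 0.4, 1.5].forEach((duration) => observe(h, duration));

BUCKETS.forEach((bound, i) => {
  console.log(`le="${bound}" ${h.bucketCounts[i]}`);
});
console.log(`le="+Inf" ${h.count}`);
```

With these four sample durations, three of them are at or below 0.5 seconds, so the le="0.5" bucket reads 3, the same shape as the sample line above.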

Step 2: Configuring Prometheus

Prometheus needs to know where to scrape metrics from. Here's a minimal prometheus.yml:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'node-app'
    static_configs:
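      # Prometheus runs inside Docker (see below), where 'localhost' is the
      # container itself. host.docker.internal reaches the app on the host.
      - targets: ['host.docker.internal:3000']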

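Run Prometheus with Docker. The --add-host flag makes the name host.docker.internal resolve to the host machine on Linux; Docker Desktop on macOS and Windows already provides it:

docker run -d -p 9090:9090 --add-host=host.docker.internal:host-gateway -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus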

Now visit http://localhost:9090 and query http_requests_total - you'll see live data flowing in from your app.
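The same query can also be run from code through Prometheus's HTTP API, which serves instant queries at /api/v1/query. The sketch below assumes Node.js 18 or later (for the built-in fetch) and a Prometheus instance on localhost:9090; buildQueryUrl and instantQuery are hypothetical helper names.

```javascript
// Build the URL for an instant query; the PromQL expression is URL-encoded.
function buildQueryUrl(baseUrl, promql) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql);
  return url.toString();
}

// Run the query and return the result array from the JSON response.
async function instantQuery(baseUrl, promql) {
  const response = await fetch(buildQueryUrl(baseUrl, promql));
  if (!response.ok) {
    throw new Error(`Prometheus returned HTTP ${response.status}`);
  }
  const payload = await response.json();
  if (payload.status !== 'success') {
    throw new Error(`Query failed: ${payload.error}`);
  }
  return payload.data.result; // one entry per matching time series
}

// Usage, with Prometheus running locally:
// instantQuery('http://localhost:9090', 'http_requests_total').then(console.log);

console.log(buildQueryUrl('http://localhost:9090', 'http_requests_total'));
```

Each entry in the result carries a metric object with the label set and a value pair of timestamp and sample value.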

Step 3: Visualizing with Grafana

Run Grafana alongside Prometheus:

docker run -d -p 3001:3000 --add-host=host.docker.internal:host-gateway grafana/grafana

Open http://localhost:3001 (default login: admin/admin).
Add Prometheus as a data source, pointing to http://host.docker.internal:9090.
Create a new dashboard with panels for:
- Request rate: sum(rate(http_requests_total[1m]))
- Error rate: sum(rate(http_requests_total{status_code=~"5.."}[1m]))
- p95 latency: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
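To build intuition for what rate() and histogram_quantile() compute in the panels above, here is a simplified sketch. It is not Prometheus's implementation: simpleRate ignores counter resets and the extrapolation that the real rate() performs, and histogramQuantile assumes 0 < q <= 1, cumulative buckets sorted by bound, at least one finite bucket, and a final +Inf bucket. All sample numbers are invented.

```javascript
// Per-second increase of a counter between the first and last sample.
function simpleRate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.value - first.value) / (last.t - first.t);
}

// Estimate a quantile from cumulative buckets: [{ le, count }, ...].
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total; // how many observations lie at or below the quantile
  const i = buckets.findIndex((b) => b.count >= rank);
  // A quantile that lands in +Inf is reported as the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  // Linear interpolation: assume observations are spread evenly in the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

const requestsPerSecond = simpleRate([
  { t: 0, value: 100 },
  { t: 60, value: 160 },
]);

const p95 = histogramQuantile(0.95, [
  { le: 0.1, count: 10 },
  { le: 0.5, count: 60 },
  { le: 1, count: 90 },
  { le: 2, count: 100 },
  { le: Infinity, count: 100 },
]);

console.log(requestsPerSecond, p95);
```

In this made-up data the 95th observation out of 100 falls halfway through the bucket between 1 and 2 seconds, so the estimate is about 1.5 seconds. The interpolation is also why bucket boundaries matter: the estimate can only be as precise as the bucket the quantile lands in.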

Within minutes, you have a live dashboard showing traffic, errors, and latency - the core signals every team needs to detect issues before users complain.

Why This Matters in Practice

Imagine your /slow endpoint starts timing out under load. Without observability, you'd only find out when a user complains. With metrics like p95 latency and error rate visible on a dashboard (and alerts configured on top of them), your team can catch the regression within minutes of deployment - often before it affects most users.
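As a rough illustration of what an alert on top of these metrics checks, the sketch below computes an error ratio and fires only when the ratio stays above a threshold for several consecutive evaluations. It loosely mirrors the idea of requiring a condition to hold for a while before alerting; it is not Prometheus's alerting engine, and errorRatio, shouldFire, and the 5% threshold are all hypothetical.

```javascript
// Share of requests that failed, given per-second rates.
function errorRatio(errorRate, totalRate) {
  return totalRate === 0 ? 0 : errorRate / totalRate;
}

// Fire only if the most recent N evaluations all breached the threshold,
// which filters out one-off spikes.
function shouldFire(ratios, threshold, requiredConsecutive) {
  if (ratios.length < requiredConsecutive) return false;
  return ratios.slice(-requiredConsecutive).every((ratio) => ratio > threshold);
}

// Hypothetical evaluations: [errors per second, requests per second].
const evaluations = [
  [0.1, 10],
  [2, 10],
  [3, 10],
];
const ratios = evaluations.map(([errors, total]) => errorRatio(errors, total));

console.log(shouldFire(ratios, 0.05, 2));
```

Here the last two evaluations have 20% and 30% errors, both above the 5% threshold, so the check fires.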

This same pattern - instrument, expose, scrape, visualize - applies whether you're using Prometheus/Grafana, Datadog, New Relic, or Azure Monitor. The tools differ, but the principle is the same: you can't fix what you can't see.

Key Takeaways

Observability goes beyond monitoring: it helps you answer why, not just what.
Instrumenting code with metrics (counters, histograms) is a low-effort, high-value practice.
Prometheus + Grafana is a free, production-grade way to get started.
The same principles apply across any observability platform.
