Observability Practices: A Hands-On Guide with Prometheus and Grafana

What is Observability?

Observability is the ability to understand the internal state of a system just by looking at its external outputs. Unlike traditional monitoring, which tells you whether something is wrong, observability helps you understand why it's wrong. It's built on three pillars:

  • Logs: discrete, timestamped events (e.g., "User 123 logged in").
  • Metrics: numeric measurements over time (e.g., request latency, memory usage).
  • Traces: the path a request takes through a distributed system.

In this article, I'll walk through a real example: instrumenting a Node.js API with Prometheus for metrics collection and Grafana for visualization. Both tools are free, open-source, and widely used in production.

Why This Stack?

Prometheus and Grafana are a great starting point because:

  • They're free and open-source.
  • Prometheus uses a pull-based model, scraping metrics from your app at intervals.
  • Grafana turns those metrics into readable dashboards.
  • The combination is an industry standard, used alongside commercial tools like Datadog or New Relic.

Step 1: Instrumenting a Node.js App

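We'll build a small Express API and expose custom metrics using prom-client, a widely used Prometheus client library for Node.js. It is community-maintained rather than an official Prometheus project, and it appears in the Prometheus documentation's list of third-party clients.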

npm install express prom-client

// server.js
const express = require('express');
const client = require('prom-client');
const app = express();
const register = new client.Registry();

// Collect default Node.js metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ register });

// Custom metric: counts total HTTP requests by method, route, and status code
const httpRequestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});
register.registerMetric(httpRequestCounter);

// Custom metric: measures request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

// Middleware to track every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
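    // Use the matched route pattern (e.g. "/users/:id") rather than the raw path,
    // so arbitrary URLs cannot create an unbounded number of label values.
    const route = req.route ? req.route.path : 'unmatched';
    const labels = { method: req.method, route, status_code: res.statusCode };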
    httpRequestCounter.inc(labels);
    end(labels);
  });
  next();
});

// Sample endpoints
app.get('/', (req, res) => {
  res.send('Welcome to the Observability Demo API');
});

app.get('/slow', async (req, res) => {
  // Simulate a slow endpoint
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 2000));
  res.send('This endpoint is intentionally slow');
});

app.get('/error', (req, res) => {
  res.status(500).send('Something went wrong');
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000, () => console.log('Server running on http://localhost:3000'));

This exposes a /metrics endpoint in the Prometheus text format, which looks something like this:

http_requests_total{method="GET",route="/",status_code="200"} 12
http_request_duration_seconds_bucket{le="0.5",method="GET",route="/slow",status_code="200"} 3

Step 2: Configuring Prometheus

Prometheus needs to know where to scrape metrics from. Here's a minimal prometheus.yml:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'node-app'
    static_configs:
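      # Prometheus runs inside a container here, so 'localhost' would point at
      # the container itself; host.docker.internal reaches the app on the host.
      - targets: ['host.docker.internal:3000']

Run Prometheus with Docker (the --add-host flag makes host.docker.internal resolve on Linux; Docker Desktop on macOS and Windows provides that hostname already):

docker run -d -p 9090:9090 --add-host=host.docker.internal:host-gateway -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus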

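Now visit http://localhost:9090, confirm on the targets page (http://localhost:9090/targets) that the node-app job is UP, and query http_requests_total. Once the app has served a few requests, you'll see live data flowing in.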
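
The same query can also be run from code through the Prometheus HTTP API, which serves instant queries at /api/v1/query. The following is a minimal sketch rather than a full client: buildQueryUrl, extractSamples, and instantQuery are hypothetical helper names, the base URL assumes the Docker setup above, and the example response body is hand-written to show the shape of an instant-vector result, not captured from a real server.

```javascript
// Assumption: Prometheus is reachable at the address used in this article.
const PROMETHEUS_URL = 'http://localhost:9090';

// Build the URL for an instant query against the Prometheus HTTP API.
function buildQueryUrl(baseUrl, expr) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', expr); // takes care of encoding the PromQL text
  return url.toString();
}

// Flatten an instant-vector response into { labels, value } pairs.
// Sample values arrive as strings, so they are converted to numbers here.
function extractSamples(body) {
  if (body.status !== 'success' || body.data.resultType !== 'vector') {
    throw new Error('expected a successful instant-vector response');
  }
  return body.data.result.map(({ metric, value }) => ({
    labels: metric,
    value: Number(value[1]), // value is [unixTimestamp, "stringValue"]
  }));
}

// Hypothetical wrapper; relies on the global fetch() available in Node 18+.
async function instantQuery(expr) {
  const res = await fetch(buildQueryUrl(PROMETHEUS_URL, expr));
  return extractSamples(await res.json());
}

// Illustrative response body (hand-written for this sketch).
const exampleBody = {
  status: 'success',
  data: {
    resultType: 'vector',
    result: [
      {
        metric: { __name__: 'http_requests_total', job: 'node-app', method: 'GET', route: '/', status_code: '200' },
        value: [1700000000, '12'],
      },
    ],
  },
};
console.log(extractSamples(exampleBody));
```

With Prometheus running, instantQuery('http_requests_total').then(console.log) should print one entry per label combination the app has recorded so far.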

Step 3: Visualizing with Grafana

Run Grafana alongside Prometheus:

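docker run -d -p 3001:3000 --add-host=host.docker.internal:host-gateway grafana/grafana

  • Open http://localhost:3001 (default login: admin/admin).
  • Add Prometheus as a data source, pointing to http://host.docker.internal:9090 (the --add-host flag in the command above makes that hostname resolve on Linux).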
  • Create a new dashboard with panels for:
    • Request rate: sum(rate(http_requests_total[1m]))
    • Error rate (5xx responses per second): sum(rate(http_requests_total{status_code=~"5.."}[1m]))
    • p95 latency: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
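
To make the rate() calls in these panels less of a black box, here is a simplified sketch of what the function computes: the per-second increase of a counter over a window, with counter resets (for example, after a process restart) taken into account. simpleRate is a hypothetical helper, and it deliberately skips the extrapolation to the window edges that the real PromQL rate() performs, so treat it as an approximation of the idea rather than a reimplementation.

```javascript
// Hypothetical helper: approximate per-second rate of a counter from raw samples.
// Each sample is { t: unix seconds, v: counter value }, sorted by time.
function simpleRate(samples) {
  if (samples.length < 2) return null; // at least two points are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset, so the new value counts from zero.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

// Made-up counter values scraped every 5s, with a restart between t=10 and t=15.
const samples = [
  { t: 0, v: 100 },
  { t: 5, v: 110 },
  { t: 10, v: 120 },
  { t: 15, v: 4 },
  { t: 20, v: 14 },
];
console.log(simpleRate(samples));
```

The increases are 10, 10, 4 (after the reset), and 10, for a total of 34 requests over 20 seconds, so this prints 1.7 requests per second. Without the reset handling, the drop from 120 to 4 would show up as a large negative rate.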

Within minutes, you have a live dashboard showing traffic, errors, and latency - the core signals every team needs to detect issues before users complain.
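
The latency panel deserves a closer look, because histogram_quantile does not have the raw request durations to work with. It only sees cumulative bucket counts, finds the bucket that contains the requested rank, and interpolates linearly inside it. The sketch below follows that approach under a few assumptions: histogramQuantile is a hypothetical helper, 0 <= q <= 1, the buckets are sorted with a final +Inf bucket, and the counts are made up. In Prometheus the inputs would be per-bucket rates rather than raw counts, but the interpolation idea is the same.

```javascript
// Hypothetical helper: estimate a quantile from cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], ascending, last le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN; // no observations, nothing to estimate
  const rank = q * total;
  // First bucket whose cumulative count reaches the requested rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // If the rank falls in the +Inf bucket, the best answer is the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const prevCount = i === 0 ? 0 : buckets[i - 1].count;
  const upper = buckets[i].le;
  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - prevCount) / (buckets[i].count - prevCount));
}

// Made-up cumulative counts using the bucket bounds from server.js.
const buckets = [
  { le: 0.05, count: 0 },
  { le: 0.1, count: 10 },
  { le: 0.3, count: 40 },
  { le: 0.5, count: 70 },
  { le: 1, count: 90 },
  { le: 2, count: 100 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets));
```

Here the 95th request out of 100 lands in the 1s to 2s bucket, halfway between its cumulative counts of 90 and 100, so the estimate is about 1.5 seconds. The true p95 could be anywhere in that bucket, which is why bucket bounds should be chosen around the latencies you care about.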

Why This Matters in Practice

Imagine your /slow endpoint starts timing out under load. Without observability, you'd only find out when a user complains. With metrics like p95 latency and error rate visible on a dashboard (and alerts configured on top of them), your team can catch the regression within minutes of deployment - often before it affects most users.

This same pattern - instrument, expose, scrape, visualize - applies whether you're using Prometheus/Grafana, Datadog, New Relic, or Azure Monitor. The tools differ, but the principle is the same: you can't fix what you can't see.

Key Takeaways

  • Observability goes beyond monitoring: it helps you answer why, not just what.
  • Instrumenting code with metrics (counters, histograms) is a low-effort, high-value practice.
  • Prometheus + Grafana is a free, production-grade way to get started.
  • The same principles apply across any observability platform.
