DEV Community 5h ago

SIP Telephony Monitoring with eBPF: Full Observability for VoIP Infrastructure

How It Works

eBPF (extended Berkeley Packet Filter) runs small sandboxed programs directly in the Linux kernel. Before a program is loaded, the eBPF verifier checks that it is safe: it cannot access memory outside its allocation, cannot loop indefinitely, and cannot modify the kernel.

My approach is an eBPF socket filter attached to an AF_PACKET socket. This is passive network traffic observation:

SIP Traffic → NIC → eBPF filter → AF_PACKET socket → Go → SIP Parser → Prometheus Metrics

Key point: the eBPF filter is a socket filter, not a tc/XDP filter. It only decides whether to copy a packet to the application. The packet continues through the network stack to its destination regardless. The filter cannot modify, block, or redirect traffic. Zero impact on call delivery.

The entire filter is 100 lines of C. Ports are configurable from Go code via a BPF map; the defaults are 5060 and 5061. The filter drops 99% of traffic in the kernel, so only packets on the configured SIP ports reach userspace.
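The real filter is C running in the kernel, but the decision it makes can be illustrated in userspace Go. The sketch below is not the project's code: it assumes Ethernet II framing with IPv4 and no VLAN tag, assumes that both source and destination ports are checked, and uses hypothetical names (`sipPorts`, `isSIPCandidate`, `buildFrame`).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sipPorts stands in for the BPF map of monitored ports (article defaults).
var sipPorts = map[uint16]bool{5060: true, 5061: true}

// isSIPCandidate reports whether an Ethernet II / IPv4 frame carries TCP or UDP
// with a source or destination port listed in sipPorts.
func isSIPCandidate(frame []byte) bool {
	const ethLen = 14
	if len(frame) < ethLen+20 {
		return false // too short for Ethernet + minimal IPv4 header
	}
	if binary.BigEndian.Uint16(frame[12:14]) != 0x0800 {
		return false // EtherType is not IPv4
	}
	ip := frame[ethLen:]
	ihl := int(ip[0]&0x0f) * 4 // IPv4 header length in bytes
	if ip[0]>>4 != 4 || ihl < 20 || len(ip) < ihl+4 {
		return false
	}
	if proto := ip[9]; proto != 6 && proto != 17 {
		return false // neither TCP (6) nor UDP (17)
	}
	if binary.BigEndian.Uint16(ip[6:8])&0x1fff != 0 {
		return false // non-first fragment: no L4 header to inspect
	}
	src := binary.BigEndian.Uint16(ip[ihl : ihl+2])
	dst := binary.BigEndian.Uint16(ip[ihl+2 : ihl+4])
	return sipPorts[src] || sipPorts[dst]
}

// buildFrame assembles a minimal frame (headers only) for demonstration.
func buildFrame(proto byte, src, dst uint16) []byte {
	f := make([]byte, 14+20+4)
	binary.BigEndian.PutUint16(f[12:14], 0x0800) // EtherType IPv4
	f[14] = 0x45                                 // version 4, IHL 5
	f[14+9] = proto
	binary.BigEndian.PutUint16(f[34:36], src)
	binary.BigEndian.PutUint16(f[36:38], dst)
	return f
}

func main() {
	fmt.Println(isSIPCandidate(buildFrame(17, 40000, 5060))) // UDP towards 5060
	fmt.Println(isSIPCandidate(buildFrame(6, 443, 51000)))   // unrelated TCP
}
```

Everything that returns false here is what never leaves the kernel in the real setup; only the matching packets are copied to the AF_PACKET socket.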

The Full Metrics Stack

The exporter provides not just RFC 6076 metrics, but a complete observability stack for SIP infrastructure.

Real-time traffic

14 SIP request counters: INVITE, BYE, REGISTER, OPTIONS, CANCEL, ACK, SUBSCRIBE, NOTIFY, PUBLISH, INFO, PRACK, UPDATE, MESSAGE, REFER
30 response code counters: 100, 180, 181, 182, 183, 200, 202, 300, 302, 400, 401, 403, 404, 405, 407, 408, 480, 481, 486, 487, 488, 500, 501, 502, 503, 504, 600, 603, 604, 606
Active sessions gauge - current number of active SIP dialogs. A dialog is created on a 200 OK to INVITE and removed on a 200 OK to BYE or on a Session-Expires timeout (default 30 minutes)
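The dialog lifecycle behind the active sessions gauge can be sketched as a map keyed by Call-ID. This is an illustration only: the type and method names are hypothetical, a single default timeout is assumed for every dialog (a real implementation might honour the per-call Session-Expires value), and locking is omitted.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTTL is the default session timeout mentioned in the article.
const defaultTTL = 30 * time.Minute

// demoStart is a fixed reference time so the example is deterministic.
var demoStart = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// dialogTracker holds one expiry deadline per active dialog, keyed by Call-ID.
// It is not safe for concurrent use; a real exporter would need locking.
type dialogTracker struct {
	ttl      time.Duration
	deadline map[string]time.Time
}

func newDialogTracker(ttl time.Duration) *dialogTracker {
	return &dialogTracker{ttl: ttl, deadline: make(map[string]time.Time)}
}

// onInviteOK records a dialog when a 200 OK to INVITE is seen.
func (t *dialogTracker) onInviteOK(callID string, now time.Time) {
	t.deadline[callID] = now.Add(t.ttl)
}

// onByeOK removes a dialog when a 200 OK to BYE is seen.
// It reports whether the dialog was known.
func (t *dialogTracker) onByeOK(callID string) bool {
	_, ok := t.deadline[callID]
	delete(t.deadline, callID)
	return ok
}

// sweep drops dialogs whose deadline has passed and returns how many were removed.
func (t *dialogTracker) sweep(now time.Time) int {
	removed := 0
	for id, d := range t.deadline {
		if !now.Before(d) {
			delete(t.deadline, id)
			removed++
		}
	}
	return removed
}

// active is the value the sessions gauge would report.
func (t *dialogTracker) active() int { return len(t.deadline) }

func main() {
	t := newDialogTracker(defaultTTL)
	t.onInviteOK("call-a", demoStart)
	t.onInviteOK("call-b", demoStart.Add(10*time.Minute))
	t.onByeOK("call-a")
	fmt.Println("active:", t.active())
	fmt.Println("expired:", t.sweep(demoStart.Add(45*time.Minute)))
}
```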

Connection quality - RFC 6076

RFC 6076 defines standard SIP performance metrics. All metrics are cumulative, computed from atomic counters, updated on every scrape.

SER (Session Establishment Ratio) - percentage of successfully established sessions:

SER = (INVITE → 200 OK) / (Total INVITE - INVITE → 3xx) × 100

3xx (redirect) are excluded from the denominator - they are neither success nor failure, but a routing instruction. SER = 100 means all non-redirect INVITEs received 200 OK.

SEER (Session Establishment Effectiveness Ratio) - percentage of "effective" responses:

SEER = (INVITE → 200, 480, 486, 600, 603) / (Total INVITE - INVITE → 3xx) × 100

The numerator includes responses with a clear outcome: 200 OK (session established), 480 (temporarily unavailable), 486 (busy), 600 (busy everywhere), 603 (declined). SEER is always ≥ SER.

ISA (Ineffective Session Attempts) - percentage of infrastructure errors:

ISA = (INVITE → 408, 500, 503, 504) / Total INVITE × 100

408 (request timeout), 500 (internal error), 503 (service unavailable), 504 (server timeout) - responses that point to infrastructure failures rather than user decisions. A rising ISA means the infrastructure is degrading. Unlike SER/SEER, 3xx are NOT excluded from the denominator.

SCR (Session Completion Ratio) - percentage of fully completed sessions:

SCR = (Completed Sessions) / Total INVITE × 100

A completed session = INVITE → 200 OK → BYE → 200 OK (or Session-Expires timeout). SCR ≤ SER always: not all established sessions terminate correctly.

ASR (Answer Seizure Ratio) - classic telephony metric (ITU-T E.411):

ASR = (INVITE → 200 OK) / Total INVITE × 100

Unlike SER, 3xx are NOT excluded. ASR ≤ SER always, and the two are equal when no redirect responses are present.

NER (Network Effectiveness Ratio) - network quality (GSMA IR.42):

NER = 100 − ISA

NER = 100 means no infrastructure errors. NER < 95 - time to worry.
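The six ratios above all derive from the same handful of counters, which a short Go sketch makes concrete. The struct and method names are hypothetical, the counter values are made up, and returning 0 for an empty denominator is an assumption rather than documented exporter behaviour.

```go
package main

import "fmt"

// inviteCounters holds cumulative counts of INVITE outcomes (hypothetical struct).
type inviteCounters struct {
	Total       uint64 // all INVITE requests
	OK          uint64 // INVITE → 200 OK
	Redirect    uint64 // INVITE → 3xx
	UserOutcome uint64 // INVITE → 480, 486, 600, 603
	Ineffective uint64 // INVITE → 408, 500, 503, 504
	Completed   uint64 // INVITE → 200 OK → BYE → 200 OK (or session timeout)
}

// pct returns 100*num/den, or 0 when den is 0 (an assumption for the empty case).
func pct(num, den uint64) float64 {
	if den == 0 {
		return 0
	}
	return 100 * float64(num) / float64(den)
}

// nonRedirect is the SER/SEER denominator: total INVITEs minus redirected ones.
func (c inviteCounters) nonRedirect() uint64 {
	if c.Redirect >= c.Total {
		return 0
	}
	return c.Total - c.Redirect
}

func (c inviteCounters) SER() float64  { return pct(c.OK, c.nonRedirect()) }
func (c inviteCounters) SEER() float64 { return pct(c.OK+c.UserOutcome, c.nonRedirect()) }
func (c inviteCounters) ISA() float64  { return pct(c.Ineffective, c.Total) }
func (c inviteCounters) SCR() float64  { return pct(c.Completed, c.Total) }
func (c inviteCounters) ASR() float64  { return pct(c.OK, c.Total) }
func (c inviteCounters) NER() float64  { return 100 - c.ISA() }

func main() {
	c := inviteCounters{Total: 1000, OK: 760, Redirect: 50, UserOutcome: 95, Ineffective: 40, Completed: 700}
	fmt.Printf("SER=%.1f SEER=%.1f ISA=%.1f SCR=%.1f ASR=%.1f NER=%.1f\n",
		c.SER(), c.SEER(), c.ISA(), c.SCR(), c.ASR(), c.NER())
}
```

With these sample counters the non-redirect denominator is 950, so SER is 760/950 = 80% while ASR is 760/1000 = 76%, which shows the ordering SCR ≤ ASR ≤ SER ≤ SEER described in the text.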

Latency at every stage

Five histograms cover all SIP transaction phases:

RRD - Registration delay: REGISTER → 200 OK
TTR - Time to first response: INVITE → first 1xx
SPD - Session duration: INVITE 200 OK → BYE 200 OK
ORD - OPTIONS response delay: OPTIONS → any response
LRD - Registration redirect delay: REGISTER → 3xx
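Each of these histograms follows the same pattern: remember when the request was seen, then observe the elapsed time when the matching response arrives. The Go sketch below illustrates that pattern with cumulative buckets in the style of Prometheus `le` buckets. The names, the bucket bounds, and the transaction key format are all assumptions, not the exporter's actual choices.

```go
package main

import (
	"fmt"
	"time"
)

// demoBounds and demoStart are illustrative values, not the exporter's settings.
var (
	demoBounds = []time.Duration{10 * time.Millisecond, 100 * time.Millisecond, time.Second}
	demoStart  = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
)

// delayHistogram measures request → response delays and counts them into
// cumulative buckets.
type delayHistogram struct {
	bounds  []time.Duration      // upper bounds, ascending; +Inf is implicit
	counts  []uint64             // counts[i] = observations ≤ bounds[i]; last = all
	pending map[string]time.Time // request timestamps by transaction key
}

func newDelayHistogram(bounds []time.Duration) *delayHistogram {
	return &delayHistogram{
		bounds:  bounds,
		counts:  make([]uint64, len(bounds)+1),
		pending: make(map[string]time.Time),
	}
}

// onRequest remembers when a request (e.g. REGISTER) was seen.
func (h *delayHistogram) onRequest(key string, at time.Time) { h.pending[key] = at }

// onResponse observes the delay for a matching request.
// Responses without a pending request are ignored.
func (h *delayHistogram) onResponse(key string, at time.Time) (time.Duration, bool) {
	start, ok := h.pending[key]
	if !ok {
		return 0, false
	}
	delete(h.pending, key)
	d := at.Sub(start)
	for i, b := range h.bounds {
		if d <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // the +Inf bucket counts every observation
	return d, true
}

func main() {
	h := newDelayHistogram(demoBounds)
	h.onRequest("call-1|1 REGISTER", demoStart)
	d, _ := h.onResponse("call-1|1 REGISTER", demoStart.Add(42*time.Millisecond))
	fmt.Println(d, h.counts)
}
```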

All histograms support histogram_quantile() for percentile-based alerting: p50, p95, p99.

Example for VictoriaMetrics / Prometheus:

95th percentile registration delay:

histogram_quantile(0.95, sum(rate(sip_exporter_rrd_bucket[5m])) by (le))

99th percentile session duration (specific carrier and device type):

histogram_quantile(0.99, sum(rate(sip_exporter_spd_bucket{carrier="mobile-operator-a",ua_type="yealink"}[5m])) by (le))

Additional metrics

ISS (Ineffective Session Severity) - absolute count of INVITE→408/500/503/504 responses. Unlike ISA (percentage), ISS enables alerting on absolute error volume: rate(sip_exporter_iss_total[5m]) > 20
SDC (Session Duration Counter) - Prometheus Counter of completed sessions. Useful for rate queries: rate(sip_exporter_sdc_total[5m])
sip_exporter_packets_total - total parsed SIP packets

Per-Carrier: Metrics by Traffic Source

Aggregated metrics hide problems of specific traffic sources. If SER = 85%, it's unclear - are all sources at 85%, or is one at 50% while others are at 95%? The exporter solves this via CIDR mapping: IP subnets → source name → carrier label on every metric.

Configuration:

carriers:
  - name: "mobile-operator-a"
    cidrs:
      - "10.1.0.0/16"
      - "10.2.0.0/16"
      - "172.16.0.0/12"
  - name: "sip-trunk-provider"
    cidrs:
      - "192.168.10.0/24"
      - "192.168.11.0/24"
      - "192.168.12.0/24"

How it works: Carrier is determined at request time (INVITE/REGISTER/OPTIONS) by source IP. If INVITE came from 10.1.5.20 - the exporter finds this IP belongs to 10.1.0.0/16 and labels all metrics for this call (including responses and dialog termination) with carrier="mobile-operator-a". Responses come from a different IP (the SIP server), but carrier is inherited from the tracker by Call-ID, not determined by response IP. This is correct: metrics belong to the call initiator, not the server.
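The lookup and the Call-ID inheritance can be sketched in Go with the standard `net/netip` package. This is an illustration under assumptions: the type and function names are hypothetical, the carrier names follow the example labels used in the text, and a linear scan stands in for whatever lookup structure the exporter actually uses.

```go
package main

import (
	"fmt"
	"net/netip"
)

type carrier struct {
	name     string
	prefixes []netip.Prefix
}

// newCarrier parses CIDR strings; it panics on invalid input to keep the sketch short.
func newCarrier(name string, cidrs ...string) carrier {
	c := carrier{name: name}
	for _, s := range cidrs {
		c.prefixes = append(c.prefixes, netip.MustParsePrefix(s))
	}
	return c
}

// labeler resolves carriers by source IP and remembers the result per Call-ID.
type labeler struct {
	carriers []carrier
	byCallID map[string]string
}

func newLabeler(carriers ...carrier) *labeler {
	return &labeler{carriers: carriers, byCallID: make(map[string]string)}
}

// carrierFor returns the first carrier whose prefix contains ip, or "other".
func (l *labeler) carrierFor(ip netip.Addr) string {
	for _, c := range l.carriers {
		for _, p := range c.prefixes {
			if p.Contains(ip) {
				return c.name
			}
		}
	}
	return "other"
}

// onRequest determines the carrier from the request's source IP and stores it.
func (l *labeler) onRequest(callID, srcIP string) string {
	name := "other" // fallback for unparsable addresses
	if ip, err := netip.ParseAddr(srcIP); err == nil {
		name = l.carrierFor(ip)
	}
	l.byCallID[callID] = name
	return name
}

// onResponse returns the carrier inherited from the request with the same Call-ID,
// regardless of which IP the response came from.
func (l *labeler) onResponse(callID string) string {
	if name, ok := l.byCallID[callID]; ok {
		return name
	}
	return "other"
}

func main() {
	l := newLabeler(
		newCarrier("mobile-operator-a", "10.1.0.0/16", "10.2.0.0/16", "172.16.0.0/12"),
		newCarrier("sip-trunk-provider", "192.168.10.0/24"),
	)
	fmt.Println(l.onRequest("call-1", "10.1.5.20")) // INVITE from the carrier
	fmt.Println(l.onResponse("call-1"))             // 200 OK from the SIP server
}
```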

Result:

sip_exporter_invite_total{carrier="mobile-operator-a",ua_type="other"} 1523
sip_exporter_ser{carrier="mobile-operator-a",ua_type="other"} 95.2
sip_exporter_ser{carrier="sip-trunk-provider",ua_type="other"} 87.4

Now it's clear: the trunk provider has SER = 87.4%, while the mobile operator has 95.2%. You can build separate dashboards and alerts for each traffic source. IPs not matching any CIDR subnet get carrier="other".

Per-UA-Type: Metrics by Device Type

Carrier shows who is calling, but not with what. And device type is often the key factor in problems. If Yealink phones start getting 408 timeouts while Grandstream works fine - without the ua_type label it would look like a general quality drop. With it - the problem is clearly localized to a specific device type.

Configuration:

user_agents:
  - regex: '(?i)^Yealink'
    label: yealink
  - regex: '(?i)^Grandstream'
    label: grandstream

How it works: The User-Agent header is extracted from each SIP request and matched against regex patterns. When a phone with User-Agent: Yealink SIP-T46S 66.15.0.10 sends an INVITE - the exporter matches ^Yealink and labels all call metrics with ua_type="yealink". Like carrier, ua_type is determined at request time and inherited by responses through the tracker by Call-ID.
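A minimal Go sketch of this matching step, using the standard `regexp` package with the two patterns from the configuration above. The names are hypothetical, and "first matching rule wins" with a fallback of "other" is an assumption about rule ordering rather than confirmed exporter behaviour.

```go
package main

import (
	"fmt"
	"regexp"
)

// uaRule pairs a compiled pattern with the ua_type label it produces.
type uaRule struct {
	re    *regexp.Regexp
	label string
}

// mustRule compiles a pattern; it panics on an invalid regex to keep the sketch short.
func mustRule(pattern, label string) uaRule {
	return uaRule{re: regexp.MustCompile(pattern), label: label}
}

// uaType returns the label of the first rule matching the User-Agent value,
// or "other" when no rule matches.
func uaType(rules []uaRule, userAgent string) string {
	for _, r := range rules {
		if r.re.MatchString(userAgent) {
			return r.label
		}
	}
	return "other"
}

func main() {
	rules := []uaRule{
		mustRule(`(?i)^Yealink`, "yealink"),
		mustRule(`(?i)^Grandstream`, "grandstream"),
	}
	fmt.Println(uaType(rules, "Yealink SIP-T46S 66.15.0.10"))
	fmt.Println(uaType(rules, "Zoiper rv2.10"))
}
```

Because the patterns are anchored with `^`, a User-Agent that merely contains the vendor name somewhere in the middle falls through to "other".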

Result:

sip_exporter_invite_total{carrier="mobile-operator-a",ua_type="yealink"} 1523
sip_exporter_ser{carrier="mobile-operator-a",ua_type="yealink"} 95.2
sip_exporter_ser{carrier="mobile-operator-a",ua_type="grandstream"} 87.4

Combined queries - both labels work together for two-dimensional analysis:

SER for Yealink phones on a specific carrier: sip_exporter_ser{carrier="mobile-operator-a",ua_type="yealink"}
Active sessions by device type: sum by (ua_type) (sip_exporter_sessions)
INVITE rate by carrier and device type: sum by (carrier, ua_type) (rate(sip_exporter_invite_total[5m]))

Performance

Load testing was done with SIPp via testcontainers-go - real SIP traffic, not mocks. Test environment: Debian 12, Linux kernel 6.x, Docker 29.3.1, Intel i7-8665U (4 cores / 8 threads), Go 1.25.9.

Full call lifecycle - each call is a complete SIP dialog: INVITE → 100 Trying → 180 Ringing → 200 OK → ACK → BYE → 200 OK. On loopback each packet is duplicated (send + receive), so 7 messages → 14 packets per call.

With GOMAXPROCS=8 (all cores):

CPS 1,000 → ~11,800 PPS → 8.7% CPU peak, 13 MB RAM, 0% packet loss
CPS 2,000 → ~23,600 PPS → 12.2% CPU peak, 15 MB RAM, 0% packet loss

With GOMAXPROCS=1 (single core):

CPS 2,000 → ~23,600 PPS → 9.2% CPU peak, 12 MB RAM, 0% packet loss

In short: 2,000 CPS with 0% packet loss, about 12% peak CPU, and ~15 MB RAM.

Scrape performance under 2,000 CPS load (~23,600 PPS):

Min: 1.7 ms | Avg: 4.2 ms | P95: 6.4 ms | Max: 8.4 ms

Scraping doesn't interfere with packet processing. You can scrape every 5-10 seconds even at maximum load.

Why it's fast:

eBPF drops 99% of traffic in kernel - only SIP packets on ports 5060/5061 reach userspace
4 MB socket buffer - fits ~420ms of traffic at 28,000 PPS
Go GC pauses <1ms - 400x smaller than buffer capacity, packets never lost due to GC
SIP parsing ~1μs - microbenchmarks: INVITE 1.1μs, BYE 860ns, 200 OK 2.0μs
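The buffer figure is simple arithmetic, sketched below in Go. The average packet size of 357 bytes is not stated in the article; it is an assumption back-derived from the 4 MB, 28,000 PPS, and ~420 ms figures, and the function name is hypothetical. The estimate also ignores per-packet kernel bookkeeping overhead, so real headroom would be somewhat lower.

```go
package main

import (
	"fmt"
	"time"
)

// bufferHeadroom estimates how long a socket buffer can absorb traffic while
// the reader is stalled, ignoring per-packet kernel bookkeeping overhead.
func bufferHeadroom(bufBytes int, pps, avgPacketBytes float64) time.Duration {
	if pps <= 0 || avgPacketBytes <= 0 {
		return 0
	}
	seconds := float64(bufBytes) / (pps * avgPacketBytes)
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	const bufBytes = 4 << 20 // 4 MiB, as in the article

	// 357 bytes per packet is an assumed average, not a measured value.
	h := bufferHeadroom(bufBytes, 28000, 357)
	fmt.Println("headroom:", h.Round(time.Millisecond))
	fmt.Println("multiples of a 1ms GC pause:", int(h/time.Millisecond))
}
```

The same function shows how the margin shrinks with load: doubling the packet rate halves the time the buffer can cover.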

System requirements:

Up to 500 CPS → 1 core, 128 MB RAM
Up to 1,000 CPS → 1 core, 128 MB RAM
Up to 2,000 CPS → 2 cores, 256 MB RAM
Above 2,000 CPS → 4 cores, 512 MB RAM

Security: Why `--privileged` Is Safe

The container requires --privileged and network_mode: host. Here's why this is safe.

What capabilities are needed:

CAP_BPF - loading eBPF program into kernel via bpf() syscall
CAP_NET_RAW - creating AF_PACKET raw socket for reading packets
CAP_NET_ADMIN - binding eBPF filter to socket, configuring buffer

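These are three specific capabilities for specific operations, although --privileged itself grants the container the full capability set rather than only these three. Other eBPF tools (Cilium, Falco, Pixie) also need elevated capabilities to load programs - this is a Linux kernel requirement, not a container one.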

What the container does:

Loads eBPF socket filter into kernel (once, at startup)
Creates AF_PACKET raw socket bound to network interface
Reads packets from socket into Go channel (10,000 buffer)
Parses SIP headers
Exports metrics via /metrics endpoint
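The hand-off between the socket reader and the parser (step 3) can be sketched as a bounded Go channel. Only the buffer size of 10,000 comes from the article; the names are hypothetical, and dropping with a counter when the channel is full is an assumption, since the article does not say whether the reader blocks or drops.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// packetQueue is a bounded hand-off between the socket reader and the parser.
type packetQueue struct {
	ch      chan []byte
	dropped atomic.Uint64
}

func newPacketQueue(size int) *packetQueue {
	return &packetQueue{ch: make(chan []byte, size)}
}

// offer enqueues a packet without blocking. When the channel is full the packet
// is counted as dropped, so the reader never stalls on a slow parser.
func (q *packetQueue) offer(pkt []byte) bool {
	select {
	case q.ch <- pkt:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

func main() {
	q := newPacketQueue(10000) // buffer size from the article

	done := make(chan int)
	go func() { // parser side: drain until the queue is closed
		n := 0
		for range q.ch {
			n++
		}
		done <- n
	}()

	for i := 0; i < 5; i++ {
		q.offer([]byte("OPTIONS sip:example.invalid SIP/2.0\r\n"))
	}
	close(q.ch)
	fmt.Println("parsed:", <-done, "dropped:", q.dropped.Load())
}
```

Exposing the drop counter as a metric would make queue saturation visible instead of silent.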

What the container does NOT do:

Does not modify packets - eBPF filter is passive (read-only)
Does not send SIP traffic - purely a listener
Does not write to host filesystem - all volumes are :ro
Does not access other containers, processes, or system resources
Does not open ports except /metrics (default 2112)
Does not establish outbound connections

The entire eBPF filter is 100 lines of C - fully auditable. Automated vulnerability scanning (govulncheck + Trivy) runs on every push. Current status: 0 vulnerabilities in code and image.

Quick Start

docker run --privileged --network host \
  -e SIP_EXPORTER_INTERFACE=eth0 \
  frzq/sip-exporter:latest

curl http://localhost:2112/metrics

Or with docker-compose:

services:
  sip-exporter:
    image: frzq/sip-exporter:latest
    privileged: true
    network_mode: host
    environment:
      - SIP_EXPORTER_INTERFACE=eth0
      # Optional: per-carrier metrics
      # - SIP_EXPORTER_CARRIERS_CONFIG=/etc/sip-exporter/carriers.yaml
      # Optional: per-device-type metrics
      # - SIP_EXPORTER_USER_AGENTS_CONFIG=/etc/sip-exporter/user_agents.yaml
    # volumes:
    #   - ./carriers.yaml:/etc/sip-exporter/carriers.yaml:ro
    #   - ./user_agents.yaml:/etc/sip-exporter/user_agents.yaml:ro

Compatible with Prometheus, VictoriaMetrics, and Grafana Cloud - any scraper supporting Prometheus exposition format.

Project: github.com/aibudaevv/sip-exporter (AGPL-3.0)
