SIP Telephony Monitoring with eBPF: Full Observability for VoIP Infrastructure
How It Works
eBPF (extended Berkeley Packet Filter) allows running small programs directly in the Linux kernel. The eBPF verifier guarantees safety: the program cannot exceed allocated memory, cannot loop indefinitely, cannot modify the kernel.
My approach - eBPF socket filter on AF_PACKET. This is passive network traffic observation:
SIP Traffic → NIC → eBPF filter → AF_PACKET socket → Go → SIP Parser → Prometheus Metrics
Key point: the eBPF filter is a socket filter, not a tc/XDP filter. It only decides whether to copy a packet to the application. The packet continues through the network stack to its destination regardless. The filter cannot modify, block, or redirect traffic. Zero impact on call delivery.
The entire filter is 100 lines of C. Ports are configurable from Go code via BPF map, defaults are 5060/5061. eBPF drops 99% of traffic in kernel - only SIP packets on the right ports reach userspace.
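The real filter is C compiled to eBPF and runs in the kernel. To make its decision concrete, here is a minimal userspace sketch in Go of the same kind of check: look at the headers and keep the packet only if a port matches. It assumes untagged Ethernet II, IPv4, and UDP only; the names `isSIPPacket` and `buildFrame` are hypothetical and not taken from the project.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	ethHeaderLen  = 14     // Ethernet II header, no VLAN tag (assumption)
	etherTypeIPv4 = 0x0800 // EtherType for IPv4
	protoUDP      = 17     // IP protocol number for UDP
)

// isSIPPacket reports whether frame is an IPv4/UDP packet whose source or
// destination port is in ports. Anything else is rejected, which is the
// "drop in kernel" branch of a socket filter.
func isSIPPacket(frame []byte, ports map[uint16]bool) bool {
	if len(frame) < ethHeaderLen+20 {
		return false
	}
	if binary.BigEndian.Uint16(frame[12:14]) != etherTypeIPv4 {
		return false
	}
	ip := frame[ethHeaderLen:]
	if ip[0]>>4 != 4 {
		return false
	}
	ihl := int(ip[0]&0x0f) * 4 // IPv4 header length in bytes
	if ihl < 20 || len(ip) < ihl+8 {
		return false
	}
	if ip[9] != protoUDP {
		return false
	}
	udp := ip[ihl:]
	src := binary.BigEndian.Uint16(udp[0:2])
	dst := binary.BigEndian.Uint16(udp[2:4])
	return ports[src] || ports[dst]
}

// buildFrame assembles a minimal Ethernet+IPv4+UDP frame for the demo.
func buildFrame(srcPort, dstPort uint16) []byte {
	f := make([]byte, ethHeaderLen+20+8)
	binary.BigEndian.PutUint16(f[12:14], etherTypeIPv4)
	f[ethHeaderLen] = 0x45 // version 4, header length 5 words
	f[ethHeaderLen+9] = protoUDP
	binary.BigEndian.PutUint16(f[ethHeaderLen+20:], srcPort)
	binary.BigEndian.PutUint16(f[ethHeaderLen+22:], dstPort)
	return f
}

func main() {
	ports := map[uint16]bool{5060: true, 5061: true} // defaults from the text
	fmt.Println(isSIPPacket(buildFrame(40000, 5060), ports)) // true
	fmt.Println(isSIPPacket(buildFrame(40000, 443), ports))  // false
}
```

The port set plays the role of the BPF map that the Go side fills in: changing the ports means changing the map contents, not the filter.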
The Full Metrics Stack
The exporter provides not just RFC 6076 metrics, but a complete observability stack for SIP infrastructure.
Real-time traffic
- 14 SIP request counters: INVITE, BYE, REGISTER, OPTIONS, CANCEL, ACK, SUBSCRIBE, NOTIFY, PUBLISH, INFO, PRACK, UPDATE, MESSAGE, REFER
- 30 response code counters: 100, 180, 181, 182, 183, 200, 202, 300, 302, 400, 401, 403, 404, 405, 407, 408, 480, 481, 486, 487, 488, 500, 501, 502, 503, 504, 600, 603, 604, 606
- Active sessions gauge - current number of active SIP dialogs. A dialog is created on 200 OK to INVITE and removed on 200 OK to BYE or on Session-Expires timeout (default 30 minutes)
Connection quality - RFC 6076
RFC 6076 defines standard SIP performance metrics. All metrics are cumulative, computed from atomic counters, updated on every scrape.
SER (Session Establishment Ratio) - percentage of successfully established sessions:
SER = (INVITE → 200 OK) / (Total INVITE - INVITE → 3xx) × 100
3xx (redirect) are excluded from the denominator - they are neither success nor failure, but a routing instruction. SER = 100 means all non-redirect INVITEs received 200 OK.
SEER (Session Establishment Effectiveness Ratio) - percentage of "effective" responses:
SEER = (INVITE → 200, 480, 486, 600, 603) / (Total INVITE - INVITE → 3xx) × 100
The numerator includes responses with a clear outcome: 200 OK (session established), 480 (temporarily unavailable), 486 (busy), 600 (busy everywhere), 603 (declined). SEER is always ≥ SER.
ISA (Ineffective Session Attempts) - percentage of infrastructure errors:
ISA = (INVITE → 408, 500, 503, 504) / Total INVITE × 100
408 (timeout), 500 (internal error), 503 (unavailable), 504 (gateway timeout) - server errors. ISA rising means infrastructure is degrading. Unlike SER/SEER, 3xx are NOT excluded from the denominator.
SCR (Session Completion Ratio) - percentage of fully completed sessions:
SCR = (Completed Sessions) / Total INVITE × 100
A completed session = INVITE → 200 OK → BYE → 200 OK (or Session-Expires timeout). SCR ≤ SER always: not all established sessions terminate correctly.
ASR (Answer Seizure Ratio) - classic telephony metric (ITU-T E.411):
ASR = (INVITE → 200 OK) / Total INVITE × 100
Unlike SER, 3xx are NOT excluded. ASR ≤ SER when redirect responses are present.
NER (Network Effectiveness Ratio) - network quality (GSMA IR.42):
NER = 100 − ISA
NER = 100 means no infrastructure errors. NER < 95 - time to worry.
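The formulas above translate directly into code. The following Go sketch computes all six ratios from a set of counters; the struct and function names are hypothetical, and returning 0 for an empty denominator is an assumption of this sketch, not something the text specifies.

```go
package main

import "fmt"

// inviteCounters holds the cumulative counts the formulas need.
type inviteCounters struct {
	Invites     float64 // total INVITE requests
	Redirects   float64 // INVITE -> 3xx
	OK          float64 // INVITE -> 200 OK
	Effective   float64 // INVITE -> 200, 480, 486, 600, 603
	Ineffective float64 // INVITE -> 408, 500, 503, 504
	Completed   float64 // sessions that ended with BYE/200 OK or timeout
}

type sessionMetrics struct {
	SER, SEER, ISA, SCR, ASR, NER float64
}

// percent returns num/den as a percentage, or 0 when den is not positive
// (assumption: an idle exporter reports 0 rather than NaN).
func percent(num, den float64) float64 {
	if den <= 0 {
		return 0
	}
	return num * 100 / den
}

func compute(c inviteCounters) sessionMetrics {
	nonRedirect := c.Invites - c.Redirects // denominator for SER and SEER
	isa := percent(c.Ineffective, c.Invites)
	return sessionMetrics{
		SER:  percent(c.OK, nonRedirect),
		SEER: percent(c.Effective, nonRedirect),
		ISA:  isa,
		SCR:  percent(c.Completed, c.Invites),
		ASR:  percent(c.OK, c.Invites),
		NER:  100 - isa,
	}
}

func main() {
	// Made-up numbers, chosen only to exercise every formula.
	m := compute(inviteCounters{
		Invites: 1000, Redirects: 50, OK: 800,
		Effective: 900, Ineffective: 30, Completed: 780,
	})
	fmt.Printf("SER=%.1f SEER=%.1f ISA=%.1f SCR=%.1f ASR=%.1f NER=%.1f\n",
		m.SER, m.SEER, m.ISA, m.SCR, m.ASR, m.NER)
	// SER=84.2 SEER=94.7 ISA=3.0 SCR=78.0 ASR=80.0 NER=97.0
}
```

With these sample counters the orderings stated above hold: SEER ≥ SER, SCR ≤ SER, and ASR ≤ SER because 50 redirects shrink the SER denominator.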
Latency at every stage
Five histograms cover all SIP transaction phases:
- RRD - Registration delay: REGISTER → 200 OK
- TTR - Time to first response: INVITE → first 1xx
- SPD - Session duration: INVITE 200 OK → BYE 200 OK
- ORD - OPTIONS response delay: OPTIONS → any response
- LRD - Registration redirect delay: REGISTER → 3xx
All histograms support histogram_quantile() for percentile-based alerting: p50, p95, p99.
Example for VictoriaMetrics / Prometheus:
95th percentile registration delay:
histogram_quantile(0.95, sum(rate(sip_exporter_rrd_bucket[5m])) by (le))
99th percentile session duration (specific carrier and device type):
histogram_quantile(0.99, sum(rate(sip_exporter_spd_bucket{carrier="mobile-operator-a",ua_type="yealink"}[5m])) by (le))
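For intuition about what these queries return, here is a small Go sketch of a cumulative-bucket histogram and a quantile estimate that interpolates linearly inside the selected bucket, which is roughly how histogram_quantile() behaves. The bucket bounds and observations are made up, the type is hypothetical, and the sketch works on raw cumulative counts rather than on rate() output.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// histogram stores cumulative counts per upper bound ("le"), as in the
// Prometheus exposition format. The last count is the implicit +Inf bucket.
type histogram struct {
	bounds []float64 // ascending upper bounds in seconds, must be non-empty
	counts []uint64  // len(bounds)+1 cumulative counts
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe increments every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	for i := sort.SearchFloat64s(h.bounds, v); i < len(h.counts); i++ {
		h.counts[i]++
	}
}

// quantile estimates the q-quantile (0 < q <= 1) by linear interpolation
// inside the first bucket whose cumulative count reaches q*total.
func (h *histogram) quantile(q float64) float64 {
	total := h.counts[len(h.counts)-1]
	if total == 0 {
		return math.NaN()
	}
	rank := q * float64(total)
	i := 0
	for i < len(h.counts)-1 && float64(h.counts[i]) < rank {
		i++
	}
	if i == len(h.bounds) {
		// Rank falls in +Inf: the best available answer is the highest
		// finite bound.
		return h.bounds[len(h.bounds)-1]
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = h.bounds[i-1], float64(h.counts[i-1])
	}
	inBucket := float64(h.counts[i]) - below
	if inBucket == 0 {
		return lower
	}
	return lower + (h.bounds[i]-lower)*(rank-below)/inBucket
}

func main() {
	h := newHistogram(0.05, 0.1, 0.25, 0.5, 1)
	samples := map[float64]int{0.03: 60, 0.08: 30, 0.2: 8, 0.7: 2}
	for v, n := range samples {
		for j := 0; j < n; j++ {
			h.observe(v)
		}
	}
	fmt.Printf("p50=%.4fs p95=%.4fs p99=%.4fs\n",
		h.quantile(0.50), h.quantile(0.95), h.quantile(0.99))
}
```

The estimate is only as fine as the bucket layout: every observation in a bucket is treated as evenly spread between its bounds, so tighter buckets around the latencies you alert on give more accurate percentiles.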
Additional metrics
- ISS (Ineffective Session Severity) - absolute count of INVITE → 408/500/503/504 responses. Unlike ISA (percentage), ISS enables alerting on absolute error volume: rate(sip_exporter_iss_total[5m]) > 20
- SDC (Session Duration Counter) - Prometheus Counter of completed sessions. Useful for rate queries: rate(sip_exporter_sdc_total[5m])
- sip_exporter_packets_total - total parsed SIP packets
Per-Carrier: Metrics by Traffic Source
Aggregated metrics hide problems of specific traffic sources. If SER = 85%, it's unclear - are all sources at 85%, or is one at 50% while others are at 95%? The exporter solves this via CIDR mapping: IP subnets → source name → carrier label on every metric.
Configuration:
carriers:
  - name: "mobile-operator-a"
    cidrs:
      - "10.1.0.0/16"
      - "10.2.0.0/16"
      - "172.16.0.0/12"
  - name: "sip-trunk-provider"
    cidrs:
      - "192.168.10.0/24"
      - "192.168.11.0/24"
      - "192.168.12.0/24"
How it works: Carrier is determined at request time (INVITE/REGISTER/OPTIONS) by source IP. If INVITE came from 10.1.5.20 - the exporter finds this IP belongs to 10.1.0.0/16 and labels all metrics for this call (including responses and dialog termination) with carrier="mobile-operator-a". Responses come from a different IP (the SIP server), but carrier is inherited from the tracker by Call-ID, not determined by response IP. This is correct: metrics belong to the call initiator, not the server.
Result:
sip_exporter_invite_total{carrier="mobile-operator-a",ua_type="other"} 1523
sip_exporter_ser{carrier="mobile-operator-a",ua_type="other"} 95.2
sip_exporter_ser{carrier="sip-trunk-provider",ua_type="other"} 87.4
Now it's clear: the trunk provider has SER = 87.4%, while the mobile operator has 95.2%. You can build separate dashboards and alerts for each traffic source. IPs not matching any CIDR subnet get carrier="other".
Per-UA-Type: Metrics by Device Type
Carrier shows who is calling, but not with what. And device type is often the key factor in problems. If Yealink phones start getting 408 timeouts while Grandstream works fine - without the ua_type label it would look like a general quality drop. With it - the problem is clearly localized to a specific device type.
Configuration:
user_agents:
  - regex: '(?i)^Yealink'
    label: yealink
  - regex: '(?i)^Grandstream'
    label: grandstream
How it works: The User-Agent header is extracted from each SIP request and matched against regex patterns. When a phone with User-Agent: Yealink SIP-T46S 66.15.0.10 sends an INVITE - the exporter matches ^Yealink and labels all call metrics with ua_type="yealink". Like carrier, ua_type is determined at request time and inherited by responses through the tracker by Call-ID.
Result:
sip_exporter_invite_total{carrier="mobile-operator-a",ua_type="yealink"} 1523
sip_exporter_ser{carrier="mobile-operator-a",ua_type="yealink"} 95.2
sip_exporter_ser{carrier="mobile-operator-a",ua_type="grandstream"} 87.4
Combined queries - both labels work together for two-dimensional analysis:
- SER for Yealink phones on a specific carrier: sip_exporter_ser{carrier="mobile-operator-a",ua_type="yealink"}
- Active sessions by device type: sum by (ua_type) (sip_exporter_sessions)
- INVITE rate by carrier and device type: sum by (carrier, ua_type) (rate(sip_exporter_invite_total[5m]))
Performance
Load testing was done with SIPp via testcontainers-go - real SIP traffic, not mocks. Test environment: Debian 12, Linux kernel 6.x, Docker 29.3.1, Intel i7-8665U (4 cores / 8 threads), Go 1.25.9.
Full call lifecycle - each call is a complete SIP dialog: INVITE → 100 Trying → 180 Ringing → 200 OK → ACK → BYE → 200 OK. On loopback each packet is duplicated (send + receive), so 7 messages → 14 packets per call.
With GOMAXPROCS=8 (all cores):
- CPS 1,000 → ~11,800 PPS → 8.7% CPU peak, 13 MB RAM, 0% packet loss
- CPS 2,000 → ~23,600 PPS → 12.2% CPU peak, 15 MB RAM, 0% packet loss
With GOMAXPROCS=1 (single core):
- CPS 2,000 → ~23,600 PPS → 9.2% CPU peak, 12 MB RAM, 0% packet loss
In short: 2,000 CPS with 0% packet loss, about 12% peak CPU, and ~15 MB RAM.
Scrape performance under 2,000 CPS load (~23,600 PPS):
- Min: 1.7 ms | Avg: 4.2 ms | P95: 6.4 ms | Max: 8.4 ms
Scraping doesn't interfere with packet processing. You can scrape every 5-10 seconds even at maximum load.
Why it's fast:
- eBPF drops 99% of traffic in kernel - only SIP packets on ports 5060/5061 reach userspace
- 4 MB socket buffer - fits ~420ms of traffic at 28,000 PPS
- Go GC pauses <1ms - 400x smaller than buffer capacity, packets never lost due to GC
- SIP parsing ~1μs - microbenchmarks: INVITE 1.1μs, BYE 860ns, 200 OK 2.0μs
System requirements:
- Up to 500 CPS → 1 core, 128 MB RAM
- Up to 1,000 CPS → 1 core, 128 MB RAM
- Up to 2,000 CPS → 2 cores, 256 MB RAM
- Above 2,000 CPS → 4 cores, 512 MB RAM
Security: Why --privileged Is Safe
The container requires --privileged and network_mode: host. Here's why this is safe.
What capabilities are needed:
- CAP_BPF - loading eBPF program into kernel via bpf() syscall
- CAP_NET_RAW - creating AF_PACKET raw socket for reading packets
- CAP_NET_ADMIN - binding eBPF filter to socket, configuring buffer
These are three specific capabilities for specific operations. All eBPF tools (Cilium, Falco, Pixie) require the same - this is a Linux kernel limitation, not a container one.
What the container does:
- Loads eBPF socket filter into kernel (once, at startup)
- Creates AF_PACKET raw socket bound to network interface
- Reads packets from socket into Go channel (10,000 buffer)
- Parses SIP headers
- Exports metrics via /metrics endpoint
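The read, parse, and count steps above can be sketched as a small Go pipeline. This is a simplified illustration, not the exporter's code: the socket reader is replaced by direct channel sends, parsing looks only at the start line, and the function names are hypothetical. Only the channel capacity of 10,000 comes from the text.

```go
package main

import (
	"fmt"
	"strings"
)

// startLine classifies a SIP message by its first line: it returns the
// method for a request, the status code for a response, or "" if the line
// does not look like SIP.
func startLine(msg string) string {
	line, _, _ := strings.Cut(msg, "\r\n")
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return ""
	}
	if fields[0] == "SIP/2.0" {
		return fields[1] // response, e.g. "SIP/2.0 200 OK"
	}
	if fields[len(fields)-1] == "SIP/2.0" {
		return fields[0] // request, e.g. "INVITE sip:bob@example.com SIP/2.0"
	}
	return ""
}

// countMessages drains the channel until it is closed and tallies messages
// by method or status code, standing in for the metric counters.
func countMessages(packets <-chan string) map[string]int {
	counts := make(map[string]int)
	for p := range packets {
		if key := startLine(p); key != "" {
			counts[key]++
		}
	}
	return counts
}

func main() {
	// The buffered channel decouples the socket reader from the parser, so
	// short parser stalls do not block reads.
	packets := make(chan string, 10000)
	packets <- "INVITE sip:bob@example.com SIP/2.0\r\nCall-ID: a1\r\n\r\n"
	packets <- "SIP/2.0 100 Trying\r\nCall-ID: a1\r\n\r\n"
	packets <- "SIP/2.0 200 OK\r\nCall-ID: a1\r\n\r\n"
	packets <- "not a sip packet"
	close(packets)

	counts := countMessages(packets)
	fmt.Println(counts["INVITE"], counts["100"], counts["200"]) // 1 1 1
}
```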
What the container does NOT do:
- Does not modify packets - eBPF filter is passive (read-only)
- Does not send SIP traffic - purely a listener
- Does not write to host filesystem - all volumes are :ro
- Does not access other containers, processes, or system resources
- Does not open ports except /metrics (default 2112)
- Does not establish outbound connections
The entire eBPF filter is 100 lines of C - fully auditable. Automated vulnerability scanning (govulncheck + Trivy) runs on every push. Current status: 0 vulnerabilities in code and image.
Quick Start
docker run --privileged --network host \
  -e SIP_EXPORTER_INTERFACE=eth0 \
  frzq/sip-exporter:latest
curl http://localhost:2112/metrics
Or with docker-compose:
services:
  sip-exporter:
    image: frzq/sip-exporter:latest
    privileged: true
    network_mode: host
    environment:
      - SIP_EXPORTER_INTERFACE=eth0
      # Optional: per-carrier metrics
      # - SIP_EXPORTER_CARRIERS_CONFIG=/etc/sip-exporter/carriers.yaml
      # Optional: per-device-type metrics
      # - SIP_EXPORTER_USER_AGENTS_CONFIG=/etc/sip-exporter/user_agents.yaml
    # volumes:
    #   - ./carriers.yaml:/etc/sip-exporter/carriers.yaml:ro
    #   - ./user_agents.yaml:/etc/sip-exporter/user_agents.yaml:ro
Compatible with Prometheus, VictoriaMetrics, and Grafana Cloud - any scraper supporting Prometheus exposition format.
Project: github.com/aibudaevv/sip-exporter (AGPL-3.0)