eBPF for Observability: From Theory to Practice Without a PhD
eBPF has moved from a kernel hacker's toy to a mainstream observability tool. If you run Linux in production, you already benefit from it — Cilium, Falco, and most modern profilers use eBPF under the hood. This article explains what eBPF does, why it matters for observability, and how to start using it without writing a single line of C.
What eBPF Actually Does
eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without modifying kernel source code or loading kernel modules. The kernel verifier ensures these programs cannot crash the system, access arbitrary memory, or run indefinitely.
Think of it as a programmable event system built into the kernel. You attach small programs to kernel events — a function being called, a packet arriving, a syscall being made — and those programs can collect data, make decisions, or modify behavior.
The key hook points for observability:
- kprobes/kretprobes — Attach to any kernel function entry/exit. Use for tracing TCP connections, file I/O, scheduler events.
- tracepoints — Stable kernel instrumentation points. Preferred over kprobes when available because they survive kernel upgrades.
- uprobes — Attach to userspace function entry/exit. Trace application code without recompilation.
- XDP (eXpress Data Path) — Process network packets before they reach the network stack. Used for load balancing and DDoS mitigation.
- perf events — CPU performance counters, hardware events. Used for continuous profiling.
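In bpftrace (used later in this article), each of these hook types has its own probe syntax. A minimal sketch; the attach targets shown are illustrative examples, not recommendations:

```python
# Map each eBPF hook type to an example of its bpftrace probe syntax.
# The specific attach targets are illustrative, not prescriptive.
probe_syntax = {
    "kprobe":     "kprobe:tcp_v4_connect",              # kernel function entry
    "kretprobe":  "kretprobe:tcp_v4_connect",           # kernel function return
    "tracepoint": "tracepoint:tcp:tcp_retransmit_skb",  # stable instrumentation point
    "uprobe":     "uprobe:/usr/bin/bash:readline",      # userspace function entry
    "perf event": "profile:hz:97",                      # timer-based CPU sampling
}

for hook, probe in probe_syntax.items():
    print(f"{hook:11s} -> {probe}")
```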
Practical Tools You Can Use Today
Cilium Hubble: Network Observability
If you run Cilium as your CNI (as a growing share of production Kubernetes clusters do), Hubble gives you network-level observability for free. It captures every network flow between pods with metadata: source, destination, protocol, HTTP method, response code, latency.
hubble observe --namespace production \
  --protocol http \
  --http-status 5+ \
  --since 1h
This replaces the need for dedicated service mesh observability in many cases. No sidecar injection, no application changes, no performance overhead beyond what Cilium already introduces as a CNI.
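The same filtering is scriptable: `hubble observe -o json` emits one JSON object per flow. A minimal sketch over hand-written sample lines; the exact field layout (`flow.l7.http.code`) is an assumption based on Hubble's L7 flow format and may differ across versions:

```python
import json

# Hand-written sample lines in the shape of `hubble observe -o json`
# output (assumed layout, trimmed to the fields used below).
raw = """
{"flow": {"verdict": "FORWARDED", "l7": {"type": "RESPONSE", "http": {"code": 200, "method": "GET"}}}}
{"flow": {"verdict": "FORWARDED", "l7": {"type": "RESPONSE", "http": {"code": 503, "method": "GET"}}}}
{"flow": {"verdict": "FORWARDED", "l7": {"type": "RESPONSE", "http": {"code": 500, "method": "POST"}}}}
""".strip().splitlines()

responses = [json.loads(line)["flow"] for line in raw]
errors = [f for f in responses if 500 <= f["l7"]["http"]["code"] <= 599]

print(f"{len(errors)}/{len(responses)} responses were 5xx")
```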
Pixie: Auto-Instrumented Application Traces
Pixie captures application-level traces using eBPF uprobes — no code changes, no SDK integration, no sidecar. It intercepts TLS traffic at the OpenSSL/BoringSSL boundary, so it sees unencrypted HTTP/gRPC/MySQL/PostgreSQL protocol data even in mTLS environments.
# Query HTTP latency and error rate by service
import px

df = px.DataFrame(table='http_events', start_time='-5m')
df.failure = df.resp_status >= 500
df = df.groupby(['service', 'req_path']).agg(
    latency_quantiles=('latency', px.quantiles),
    error_rate=('failure', px.mean),
)
df.p99_latency = px.pluck_float64(df.latency_quantiles, 'p99')
px.display(df)
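The PxL query above is roughly equivalent to the following plain-Python aggregation; a sketch over made-up records, useful for checking what the query actually computes:

```python
from collections import defaultdict

# Made-up http_events records: (service, req_path, latency_ms, resp_status).
events = [
    ("checkout", "/pay",   120, 200),
    ("checkout", "/pay",    80, 200),
    ("checkout", "/pay",  2400, 503),
    ("catalog",  "/items",  15, 200),
]

groups = defaultdict(list)
for service, path, latency, status in events:
    groups[(service, path)].append((latency, status))

def p99(values):
    # Nearest-rank percentile: adequate for a sketch.
    ordered = sorted(values)
    idx = max(0, round(0.99 * len(ordered)) - 1)
    return ordered[idx]

for (service, path), rows in sorted(groups.items()):
    latencies = [lat for lat, _ in rows]
    error_rate = sum(status >= 500 for _, status in rows) / len(rows)
    print(f"{service} {path}: p99={p99(latencies)}ms error_rate={error_rate:.2f}")
```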
The limitation: Pixie works best with Go, C/C++, and Rust binaries. JVM and Python support exists but is less mature due to how these runtimes handle TLS.
Tetragon: Security Observability
Tetragon (also from the Cilium project) provides kernel-level security observability. It traces process execution, file access, and network connections at the kernel level, emitting structured JSON events that can be shipped to a SIEM or log pipeline.
# TracingPolicy to detect binary execution in containers
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-binary-exec
spec:
  kprobes:
  - call: "security_bprm_check"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
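Tetragon emits matching events as JSON (for example via the `tetra getevents` CLI or its exported event log). A minimal consumer sketch; the sample event is hand-written and trimmed, based on the shape of Tetragon's `process_exec` events:

```python
import json

# Trimmed, hand-written event in the shape of Tetragon's JSON output.
event_line = json.dumps({
    "process_exec": {
        "process": {
            "binary": "/usr/bin/curl",
            "arguments": "http://169.254.169.254/",
            "pod": {"namespace": "production", "name": "api-7d9f"},
        }
    }
})

event = json.loads(event_line)
if "process_exec" in event:
    proc = event["process_exec"]["process"]
    pod = proc.get("pod", {})
    print(f"exec in {pod.get('namespace')}/{pod.get('name')}: "
          f"{proc['binary']} {proc.get('arguments', '')}")
```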
TCP Retransmit Metrics Without Agents
One of the most practical eBPF use cases is tracking TCP retransmissions. Retransmits indicate network quality issues and directly impact application latency. Traditional monitoring requires packet capture (expensive) or application-level instrumentation (incomplete).
With eBPF, you attach to the tcp_retransmit_skb tracepoint and get per-connection retransmit counts with negligible overhead:
# Using a bpftrace one-liner
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
  @retransmits[ntop(args->saddr), ntop(args->daddr)] = count();
}'
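The bpftrace map gives per-connection counts. For a single fleet-wide ratio you can also read the kernel's aggregate counters from /proc/net/snmp (RetransSegs over OutSegs). A sketch that parses a trimmed, made-up sample of that file; the real Tcp section has more columns, but the header/value pairing is the same:

```python
def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Compute RetransSegs / OutSegs from /proc/net/snmp-style text."""
    lines = snmp_text.splitlines()
    # /proc/net/snmp pairs a header line with a value line per protocol.
    for header, values in zip(lines, lines[1:]):
        if header.startswith("Tcp:") and values.startswith("Tcp:"):
            stats = dict(zip(header.split()[1:], map(int, values.split()[1:])))
            return stats["RetransSegs"] / stats["OutSegs"]
    raise ValueError("no Tcp section found")

# Trimmed sample; real files have the full fixed column set.
sample = (
    "Tcp: ActiveOpens PassiveOpens OutSegs RetransSegs\n"
    "Tcp: 4523 901 8712345 1043\n"
)
print(f"retransmit ratio: {tcp_retransmit_ratio(sample):.4%}")
```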
Continuous Profiling
eBPF-based continuous profilers (Parca, Grafana Pyroscope) sample CPU stack traces at low frequency (typically a prime rate such as 19Hz or 97Hz, chosen to avoid sampling in lock-step with periodic work) with overhead under 1%. This gives you always-on flame graphs for every service in production.
The insight density is remarkable. We identified a 15% CPU regression in a Go service within hours of deployment — the flame graph showed unexpected time in runtime.mallocgc due to a loop that allocated slices instead of reusing a buffer. Without continuous profiling, this would have gone unnoticed until capacity planning flagged the trend weeks later.
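Under the hood, a profiler turns those sampled stacks into the "folded" format that flame graphs are built from: one line per unique stack, semicolon-separated frames, sample count at the end. A sketch with made-up Go-style stacks echoing the runtime.mallocgc example:

```python
from collections import Counter

# Made-up sampled stacks (root -> leaf), echoing the mallocgc regression.
samples = [
    ("main.main", "main.handle", "runtime.mallocgc"),
    ("main.main", "main.handle", "runtime.mallocgc"),
    ("main.main", "main.handle", "main.encode"),
    ("main.main", "main.serve"),
]

# Folded format understood by flamegraph.pl, speedscope, and friends.
folded = Counter(";".join(stack) for stack in samples)
for stack, count in folded.most_common():
    print(f"{stack} {count}")

# Share of samples whose leaf frame is runtime.mallocgc.
leaf_hits = sum(c for s, c in folded.items() if s.endswith("runtime.mallocgc"))
print(f"mallocgc leaf share: {leaf_hits / sum(folded.values()):.0%}")
```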
Getting Started
Start with the tools, not the kernel programming. Install Cilium with Hubble enabled. Deploy Grafana Pyroscope for continuous profiling. If you need deeper introspection, use bpftrace for ad-hoc investigation — it is the awk of kernel tracing. Only write custom eBPF programs when existing tools genuinely cannot solve your problem.