# Zero-Downtime O11y (eBPF)
## Prelude

Mission-critical environments managed by forward-thinking Site Reliability Engineering (SRE) teams, Security Operations Centers (SOCs), and Tier-1 financial institutions face a fundamental paradox: you need deep observability to maintain stability, but deploying traditional observability agents compromises that stability.

KloudMate's eBPF approach solves this paradox. It provides a drop-in APM observability component that lives in the Linux kernel, dynamically observing the behavior of the entire system without altering user-level applications. By deploying the KloudMate agent with the eBPF receiver enabled, you unlock instant, out-of-the-box observability for your entire infrastructure, without code changes, manual configuration, or application restarts.

A paradigm shift: if your application can run on Linux, it can be fully observed by KloudMate. Enjoy frictionless operations and complete network visibility with absolute zero downtime.

## Out-of-the-Box Value

The eBPF receiver leverages the extended Berkeley Packet Filter to securely extract vast amounts of actionable metrics. From day one, your SOC and SRE teams receive:

1. **Universal RED metrics**
   - Automatically generates and exports comprehensive Request rate, Error rate, and Duration (latency) metrics across all observed services.
   - Protocol-agnostic: automatically supports HTTP/HTTP2, gRPC, MySQL, PostgreSQL, Redis, MongoDB, Kafka, and Elasticsearch.
   - Operational context: every metric includes critical operational markers such as exact HTTP status codes, gRPC status flags, and database query behaviors.
2. **Auto distributed tracing**
   - Intelligently links incoming network requests to outgoing dependency calls instantaneously (e.g., an HTTP handler querying a database instance).
   - Generates strictly compliant OpenTelemetry trace spans, producing a seamless service graph representing your living architecture.
3. **Dynamic service inventory & metadata**
   - Automatically identifies the runtime language (`km.apm.runtime.language`)
of your running applications without touching the binary.
   - Enriches traces with host IDs, process IDs, and cloud-provider metadata tags.
   - When running under Kubernetes, it dynamically correlates metrics with K8s attributes (`namespace`, `pod name`, `deployment`, `node name`).
4. **Granular network observability**
   - Transparently captures L3/L4 network flow metrics, including full bytes transferred, TCP retransmits, state changes, and packet drops between isolated services.
   - Empowers security teams (SOC) with definitive visibility into service-to-service communication dependencies and anomalous traffic mapping.

## The KloudMate APM Edge vs. Traditional Telemetry

While traditional OpenTelemetry relies heavily on language-specific SDKs (manual instrumentation) or runtime dependencies (auto-instrumentation) attached to every service, the KloudMate eBPF receiver is far better suited to high-risk environments:

| Feature | KM eBPF Receiver | Traditional OpenTelemetry |
| --- | --- | --- |
| Code changes | None; deploy the agent to the node | Requires SDK dependencies or attaching agents |
| Application restarts | No restarts required | Requires rolling restarts to inject instrumentation |
| Setup complexity | Low: single DaemonSet / VM process per host | High: per-service configuration, library updates |
| Language support | Universal: Go, Rust, C++, Python, Java, Node.js, Ruby | Requires per-language SDKs; compiled languages are difficult |
| Performance overhead | Extremely low; runs in kernel space | Higher, especially with rich auto-instrumentation |
| Missing services | Impossible: the kernel sees every packet | Un-instrumented services remain blind spots |

## Deploying APM Strategically

**Start with eBPF.** The eBPF receiver serves as the foundational layer for enterprise observability. For massive deployments such as Tier-1 banks or multi-tenant SaaS platforms, deploy the KM agent to gain immediate service graphs, rigorous network insights, and universal APM metrics across every language and stack you run.

**Enrich where necessary.** Once eBPF has
illuminated the vast majority of your distributed architecture, strategically deploy docid\ jgzovknht8vf3phbc3dud SDKs into the specific handful of applications requiring bespoke business-logic mapping (such as tracing a specific user ID or capturing unique transaction states). The KloudMate agent seamlessly merges eBPF kernel intelligence with your applications' telemetry data, giving your SRE teams a unified view of reality.

## Configuration & Setup

The KloudMate eBPF agent is a minimal-configuration observability agent designed to run as a DaemonSet within your Kubernetes cluster or as a standalone process on a Linux host. This section details how to configure the agent for full-stack visibility.

### Prerequisites

- A Linux host with kernel version >= 5.8 (highly recommended for full feature support, though some features work on 4.18+).
- Root privileges, or the `CAP_SYS_ADMIN` and `CAP_BPF` capabilities, for the agent process.
- If running in Kubernetes, the agent requires specific RBAC permissions (viewer access to Nodes, Pods, Services, and ReplicaSets) to enrich the metrics automatically.

### Basic Configuration

To enable the eBPF agent's core features (application RED metrics, network observability, and OpenTelemetry exporting), add the following to your receiver's definition. This standard configuration example enables the most common telemetry features across the entire host/node automatically, pointing data to a local OTel Collector or directly to the KloudMate backend:

```yaml
# ============================================================================
# Metrics & observability features
# ============================================================================
metrics:
  features:
    - application               # Standard RED (request rate, error rate, duration) metrics for HTTP, gRPC, DB, and messaging requests at the service level
    - application_host          # RED metrics aggregated at the host/node level, providing a node-centric view of application performance
    - network                   # Granular L3/L4 network flow metrics (TCP/UDP bytes, drops, retransmits) between connection endpoints
    - application_process       # RED metrics aggregated down to the specific process ID (PID) instance level for deeper isolation
    - network_inter_zone        # Network flow metrics aggregated by traffic crossing defined geographic or logical zones (CIDRs)
    - application_span          # OpenTelemetry span-based metrics (span metrics) derived directly from the generated trace spans
    - application_service_graph # Dependency metrics (service_graph_request_total, etc.) used to build visual topological service maps

# ============================================================================
# Target discovery
# ============================================================================
discovery:
  kubernetes:
    enable: true
  services:
    # Monitor everything on the node automatically
    - name: all-services
      namespace: default, kube-system, my-app
      open_ports: '80,443,8080,8443,5432,3306,6379,9092,27017'
      # k8s_pod_name: '.*'   # Optional: specifically match all Kubernetes pods

# ============================================================================
# Network observability
# ============================================================================
network:
  enable: true
  source: tc
  direction: both

# ============================================================================
# Kubernetes enrichment
# ============================================================================
attributes:
  kubernetes:
    enable: true

# ============================================================================
# OpenTelemetry export settings
# ============================================================================
```

### Advanced Fine-Tuning

The agent allows very specific overrides, feature flags, and granular attribute filtering. You can tune cache sizes and limit exactly which attributes are exported per signal via the `attributes.select` map:

```yaml
ebpf:
  # Enable heuristic detection for internal DB protocols
  heuristic_sql_detect: true
  # Configure maximum DB operation caches (important for high load)
  mysql_prepared_statements_cache_size: 1024
  postgres_prepared_statements_cache_size: 1024

attributes:
  # Explicit filtering of the attributes appended to your telemetry.
  # For example, only send specific K8s tags on standard HTTP traffic.
  select:
    http_server_request_duration:
      include:
        - http.request.method
        - http.response.status_code
        - url.path
        - k8s.namespace
        - k8s.pod.name
        - service.name
    db_client_operation_duration:
      include:
        - db.operation.name
        - db.system.name
        - server.address
        - k8s.pod.name
        - service.name
```

### Running the Agent with eBPF in Kubernetes

When installing via the official KloudMate Helm chart (as outlined in the installation guide), eBPF support is automatically configured with the necessary volume mounts and permissions. You do not need to manually create DaemonSets or map volume mounts. Simply ensure that the `discovery` block in your `receivers.ebpfreceiver` section has `kubernetes.enable: true` and that the correct namespace lists are provided to begin monitoring traffic immediately.

### Trace Context Propagation

Enable the `application_span` feature in the `metrics` section of the `receivers.ebpfreceiver` configuration. You must also ensure that `network.source` is set to `socket_filter`.
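The trace-context requirements described above can be collected into a single receiver fragment. This is a minimal sketch assuming the `receivers.ebpfreceiver` key layout used throughout this guide; verify key names against your agent version:

```yaml
receivers:
  ebpfreceiver:
    metrics:
      features:
        - application        # base RED metrics
        - application_span   # required for trace context propagation
    network:
      enable: true
      source: socket_filter  # trace context propagation requires socket_filter, not tc
```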
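For clusters where you manage RBAC yourself rather than through the Helm chart, the viewer permissions listed in the prerequisites might look like the following sketch. The ClusterRole, ServiceAccount, and namespace names here are hypothetical; the Helm chart creates equivalents for you:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: km-ebpf-agent-viewer      # hypothetical name
rules:
  # Read-only access to the objects the agent uses for metric enrichment
  - apiGroups: [""]
    resources: ["nodes", "pods", "services"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: km-ebpf-agent-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: km-ebpf-agent-viewer
subjects:
  - kind: ServiceAccount
    name: km-agent                # hypothetical: the agent's ServiceAccount
    namespace: kloudmate          # hypothetical namespace
```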
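Similarly, if you run the agent from a hand-written DaemonSet instead of the Helm chart, the capability prerequisites translate to a container `securityContext` along these lines. A sketch only: running the container fully privileged also satisfies the requirement but grants far more than needed:

```yaml
# Partial pod spec (sketch): grant only the capabilities the
# prerequisites call for instead of running fully privileged.
securityContext:
  capabilities:
    add:
      - BPF        # load and attach eBPF programs (kernel >= 5.8)
      - SYS_ADMIN  # broader capability some probes still require
```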