Kubernetes Monitoring
Monitoring metrics and logs within a Kubernetes cluster is critical to ensure workloads are healthy, resources are used efficiently, and the overall cluster remains reliable.

Note: If you're using a managed Kubernetes service such as AKS (Azure Kubernetes Service), EKS (Elastic Kubernetes Service), or GKE (Google Kubernetes Engine), you can follow the same steps outlined in this guide for monitoring. The KloudMate Kubernetes agent and the OpenTelemetry-based setup work uniformly across these platforms.

By continuously monitoring key components and their performance indicators, you can detect issues early, optimize CPU and memory usage, and keep your Kubernetes environment running smoothly.

In this guide, we focus on the main Kubernetes components to monitor, the signals they expose, and how the KloudMate Kubernetes agent uses OpenTelemetry to collect these signals and forward them to the KloudMate backend.

Kubernetes Components to Monitor

The KloudMate Kubernetes agent uses multiple OpenTelemetry receivers under the hood to capture cluster metrics and logs. Key components include:
• kubeletstats – collects node-level metrics from the kubelet
• k8s_cluster – collects cluster-level metrics (namespaces, deployments, pods, etc.)
• filelog – collects container logs from pods running in the cluster
• k8s_events – collects Kubernetes event logs

Together, these components provide full visibility into node health, cluster state, container behavior, and control plane events. For details on installing the KloudMate Kubernetes agent, see the installation document.

Final Agent Configuration

Once the agent is installed (either via the UI wizard or Helm), it runs with a default configuration that you can view and customize. A representative snippet of the agent configuration (also visible in the UI) looks like:

deployment config exporters debug verbosity detailed otlphttp endpoint ${env\ km collector endpoint} headers authorization ${env\ km api key} extensions health check endpoint 0 0 0 
0 13133 processors filter/drop unwanted error mode ignore spans exclude match type strict span names \ fs \ dns \ net batch send batch size 10000 timeout 10s k8sattributes auth type serviceaccount extract metadata \ k8s pod name \ k8s pod uid \ k8s deployment name \ k8s namespace name \ k8s node name \ k8s pod start time \ k8s statefulset uid \ k8s replicaset uid \ k8s daemonset uid \ k8s deployment uid \ k8s job uid \ k8s pod ip \ k8s daemonset name \ k8s statefulset name \ k8s replicaset name \ k8s cronjob name \ k8s job name passthrough false pod association \ sources \ from resource attribute name k8s pod uid resource attributes \ action upsert from attribute host name key host id \ action upsert key k8s cluster name value ${env\ km cluster name} \ action upsert key km agent version value ${env\ km agent version} resource/cluster attributes \ action upsert key k8s cluster name value ${env\ km cluster name} \ action update from attribute k8s node name key host id \ action update from attribute k8s node name key host name resource/k8s events attributes \ action upsert key service name value k8s events transform/setservicename error mode ignore log statements \ context log statements \ set(log attributes\["service name"], "k8s events") where log attributes\["servicename"] == "github com/open telemetry/opentelemetry collector contrib/receiver/k8seventsreceiver" \ set(attributes\["service name"], "k8s objects") where attributes\["@message object kind"] != nil and attributes\["service name"] == nil transform/loglevels log statements \ context log error mode ignore statements \ set(log cache, extractpatterns(log body, "(?p<0>(\\\\{ \\\\}))")) where isstring(log body) \ merge maps(log attributes, parsejson(log cache\["0"]), "upsert") where ismap(log cache) and log cache\["0"] != nil \ flatten(log attributes) where ismap(log cache) and log cache\["0"] != nil \ merge maps(log attributes, log body, "upsert") where ismap(log body) \ context log error mode ignore conditions 
\ severity number == 0 and severity text == "" statements \ set(log severity text, log attributes\["severity"]) where log attributes\["severity"] != nil \ set(log severity text, log attributes\["level"]) where log attributes\["level"] != nil and log severity text == "" \ set(log severity text, log attributes\["log level"]) where log attributes\["log level"] != nil and log severity text == "" \ set(log severity text, log attributes\["@message"]\["severity"]) where ismap(log attributes\["@message"]) and log attributes\["@message"]\["severity"] != nil and log severity text == "" \ set(log severity text, log attributes\["@message"]\["level"]) where ismap(log attributes\["@message"]) and log attributes\["@message"]\["level"] != nil and log severity text \== "" \ set(log severity text, "warn") where log attributes\["@message"]\["type"] == "deleted" or log attributes\["event"]\["type"] == "deleted" \ set(log severity text, "info") where (log attributes\["@message"]\["type"] == "added" or log attributes\["@message"]\["type"] == "modified" or log attributes\["event"]\["type"] == "added" or log attributes\["event"]\["type"] == "modified") and log severity text == "" \ set(log cache\["substr"], log body string) where isstring(log body) and len(log body string) < 256 and log severity text == "" \ set(log cache\["substr"], substring(log body string, 0, 256)) where isstring(log body) and len(log body string) >= 256 and log severity text == "" \ set(log cache, extractpatterns(log cache\["substr"], "(?i)(?p<0>(alert|crit|emerg|fatal|error|err|warn|notice|debug|dbug|trace))")) where log cache\["substr"] != nil \ set(log severity number, severity number fatal) where ismatch(log cache\["0"], "(?i)(alert|crit|emerg|fatal)") or ismatch(log severity text, "(?i)(alert|crit|emerg|fatal)") \ set(log severity text, "fatal") where log severity number == severity number fatal \ set(log severity number, severity number error) where (ismatch(log cache\["0"], "(?i)(error|err)") or ismatch(log 
severity text, "(?i)(error|err)")) and log severity number == 0 \ set(log severity text, "error") where log severity number == severity number error \ set(log severity number, severity number warn) where (ismatch(log cache\["0"], "(?i)(warn|notice)") or ismatch(log severity text, "(?i)(warn|notice|warning)")) and log severity number == 0 \ set(log severity text, "warn") where log severity number == severity number warn \ set(log severity number, severity number debug) where (ismatch(log cache\["0"], "(?i)(debug|dbug)") or ismatch(log severity text, "(?i)(debug|dbug)")) and log severity number == 0 \ set(log severity text, "debug") where log severity number == severity number debug \ set(log severity number, severity number trace) where (ismatch(log cache\["0"], "(?i)(trace)") or ismatch(log severity text, "(?i)(trace)")) and log severity number \== 0 \ set(log severity text, "trace") where log severity number == severity number trace \ set(log severity number, severity number info) where ismatch(log severity text, "(?i)(info)") and log severity number \== 0 \ set(log severity text, "info") where log severity number == 0 \ set(log severity number, severity number info) where log severity number == 0 \ context log error mode ignore statements \ set(log severity text, convertcase(log severity text, "lower")) transform/add labels log statements \ context log statements \[] transform/copyservicefromlogattributes error mode ignore log statements \ context log statements \ set(resource attributes\["service name"], log attributes\["service name"]) where log attributes\["service name"] != nil and resource attributes\["service name"] == nil receivers k8s cluster allocatable types to report \ cpu \ memory \ ephemeral storage \ storage \ pods auth type serviceaccount collection interval 30s distribution kubernetes metrics k8s container cpu request enabled true k8s container memory request enabled true k8s node condition enabled true node conditions to report \ ready \ 
diskpressure \ memorypressure \ pidpressure \ networkunavailable k8s events auth type serviceaccount namespaces \ km agent \ bookinfo k8sobjects/1 auth type serviceaccount objects \ interval 24h mode pull name pods \ interval 24h mode pull name nodes \ mode pull name replicasets \ mode pull name namespaces \ mode pull name deployments \ mode pull name daemonsets \ mode pull name statefulsets \ mode pull name configmaps k8sobjects/2 auth type serviceaccount objects \ mode watch name pods \ mode watch name nodes \ mode watch name replicasets \ mode watch name namespaces \ mode watch name deployments \ mode watch name daemonsets \ mode watch name statefulsets \ mode watch name configmaps otlp protocols grpc endpoint 0 0 0 0 4317 http endpoint 0 0 0 0 4318 service extensions \ health check pipelines logs/k8s events exporters \ otlphttp processors \ resource/k8s events \ resource \ resource/cluster \ k8sattributes \ transform/loglevels \ batch receivers \ k8s events logs/otlp exporters \ otlphttp processors \ resource \ resource/cluster \ k8sattributes \ transform/copyservicefromlogattributes \ batch receivers \ otlp metrics/k8s cluster exporters \ otlphttp processors \ resource \ k8sattributes \ resource/cluster \ batch receivers \ k8s cluster metrics/otlp exporters \ otlphttp processors \ resource \ k8sattributes \ resource/cluster \ batch receivers \ otlp traces/otlp exporters \ otlphttp processors \ resource \ resource/cluster \ filter/drop unwanted \ k8sattributes \ batch receivers \ otlp daemonset config exporters debug verbosity detailed otlphttp endpoint ${env\ km collector endpoint} headers authorization ${env\ km api key} extensions health check endpoint 0 0 0 0 13133 processors filter/drop unwanted error mode ignore spans exclude match type strict span names \ fs \ dns \ net attributes/metrics actions \ key k8s cluster name value ${env\ km cluster name} action insert attributes/logs actions \ key k8s cluster name from attribute k8s cluster name action upsert 
\ key k8s namespace name from attribute k8s namespace name action upsert \ key k8s deployment name from attribute k8s deployment name action upsert \ key k8s node name from attribute k8s node name action upsert \ key k8s pod name from attribute k8s pod name action upsert \ key service namespace from attribute k8s namespace name action upsert \ key service instance id from attribute k8s pod uid action upsert \ key k8s container name from attribute k8s container name action upsert \ key k8s container image name from attribute k8s container image name action upsert \ key k8s container image tag from attribute k8s container image tag action upsert batch send batch size 100000 timeout 20s cumulativetodelta include match type strict metrics \ system network io \ system disk io \ system disk operations rate \ system network packets rate \ system network errors rate \ system network dropped rate \ k8s pod network io rate \ k8s pod network errors rate \ k8s node network io rate \ k8s node network errors rate deltatorate metrics \ system network io \ system disk io \ system disk operations rate \ system network packets rate \ system network errors rate \ system network dropped rate \ k8s pod network io rate \ k8s pod network errors rate \ k8s node network io rate \ k8s node network errors rate groupbyattrs/filelog keys \ k8s pod uid k8sattributes auth type serviceaccount passthrough false filter node from env var km node name extract metadata \ k8s pod name \ k8s pod uid \ k8s deployment name \ k8s namespace name \ k8s node name \ k8s pod start time \ k8s statefulset uid \ k8s replicaset uid \ k8s daemonset uid \ k8s deployment uid \ k8s job uid \ k8s pod ip \ k8s daemonset name \ k8s statefulset name \ k8s replicaset name \ k8s cronjob name \ k8s job name \ k8s container name pod association \ sources \ from resource attribute name k8s pod uid metricstransform/system transforms \ action insert experimental match labels os type linux include system memory utilization match 
type strict new name system memory utilization consumed operations \ action aggregate label values aggregated values \ used \ cached aggregation type sum label state new value consumed \ action insert experimental match labels os type darwin include system memory utilization match type strict new name system memory utilization consumed operations \ action aggregate label values aggregated values \ used \ inactive aggregation type sum label state new value consumed \ action update include system memory utilization consumed match type strict operations \ action delete label value label state label value free \ action delete label value label state label value buffered \ action delete label value label state label value slab reclaimable \ action delete label value label state label value slab unreclaimable resource attributes \ action upsert from attribute k8s node uid key host id \ action upsert key k8s cluster name value ${env\ km cluster name} resource/add node name attributes \ action upsert key k8s node name value ${env\ km node name} resource/cluster attributes \ action upsert key k8s cluster name value ${env\ km cluster name} \ action update from attribute k8s node name key host name resource/hostmetrics attributes \ action insert key is k8s node value yes resourcedetection detectors \ env \ system \ docker override false system hostname sources \ os resource attributes host ip enabled true timeout 5s transform/addservicename error mode ignore log statements \ context log statements \ set(attributes\["service name"], resource attributes\["k8s container name"]) where resource attributes\["k8s container name"] != nil transform/copyservicefromlogattributes error mode ignore log statements \ context log statements \ set(resource attributes\["service name"], log attributes\["service name"]) where log attributes\["service name"] != nil and resource attributes\["service name"] == nil transform/deleteostype metric statements \ context datapoint statements \ delete 
key(attributes, "os type") transform/ostype metric statements \ context datapoint statements \ set(attributes\["os type"], resource attributes\["os type"]) transform/ratecalculation/copymetric error mode ignore metric statements \ context metric statements \ copy metric(name="system network io rate") where metric name == "system network io" \ copy metric(name="system disk io rate") where metric name == "system disk io" \ copy metric(name="system disk operations rate") where metric name \== "system disk operations" \ copy metric(name="system network packets rate") where metric name \== "system network packets" \ copy metric(name="system network errors rate") where metric name \== "system network errors" \ copy metric(name="system network dropped rate") where metric name \== "system network dropped" \ copy metric(name="k8s pod network io rate") where metric name == "k8s pod network io" \ copy metric(name="k8s pod network errors rate") where metric name \== "k8s pod network errors" \ copy metric(name="k8s node network io rate") where metric name == "k8s node network io" \ copy metric(name="k8s node network errors rate") where metric name \== "k8s node network errors" transform/ratecalculation/sumtogauge error mode ignore metric statements \ context metric statements \ convert sum to gauge() where metric name == "system network io" \ convert sum to gauge() where metric name == "system disk io" \ convert sum to gauge() where metric name == "system disk operations" \ convert sum to gauge() where metric name == "system network packets" \ convert sum to gauge() where metric name == "system network errors" \ convert sum to gauge() where metric name == "system network dropped" \ convert sum to gauge() where metric name == "k8s pod network io" \ convert sum to gauge() where metric name == "k8s pod network errors" \ convert sum to gauge() where metric name == "k8s node network io" \ convert sum to gauge() where metric name == "k8s node network errors" receivers 
filelog/containers exclude \ / / gz \ /var/log/pods/km agent / / log \ ${env\ km xlog paths / no exclude } include \ /var/log/pods/ / / log include file name resolved true include file path true include file path resolved true max log size 1mib operators \ id container parser type container \ id recombine multiline type recombine combine field body is first entry body matches "^(\\\d{4} \\\d{2} \\\d{2}\[t ]\\\d{2} \\\d{2} \\\d{2}|\\\\{|\\\\\[)" combine with "\n" max log size 1048576 overwrite with newest source identifier attributes\["log file path"] \ id parser json type json parser parse from body parse to attributes parsed json on error send \ id extract timestamp json type move from attributes parsed json timestamp to attributes timestamp extracted if attributes parsed json != nil and attributes parsed json timestamp != nil \ id extract timestamp json time type move from attributes parsed json time to attributes timestamp extracted if attributes parsed json != nil and attributes parsed json time != nil and attributes timestamp extracted == nil \ id extract timestamp json ts type move from attributes parsed json ts to attributes timestamp extracted if attributes parsed json != nil and attributes parsed json ts != nil and attributes timestamp extracted == nil \ id extract level json type move from attributes parsed json level to attributes log level if attributes parsed json != nil and attributes parsed json level != nil \ id extract level json severity type move from attributes parsed json severity to attributes log level if attributes parsed json != nil and attributes parsed json severity != nil and attributes log level == nil \ id extract level json loglevel type move from attributes parsed json log level to attributes log level if attributes parsed json != nil and attributes parsed json log level != nil and attributes log level == nil \ id extract message json type move from attributes parsed json message to attributes message if attributes parsed json != nil 
and attributes parsed json message != nil \ id extract message json msg type move from attributes parsed json msg to attributes message if attributes parsed json != nil and attributes parsed json msg != nil and attributes message == nil \ id extract traceid json type move from attributes parsed json trace id to attributes trace id if attributes parsed json != nil and attributes parsed json trace id != nil \ id extract traceid json traceid type move from attributes parsed json traceid to attributes trace id if attributes parsed json != nil and attributes parsed json traceid != nil and attributes trace id == nil \ id extract spanid json type move from attributes parsed json span id to attributes span id if attributes parsed json != nil and attributes parsed json span id != nil \ id extract spanid json spanid type move from attributes parsed json spanid to attributes span id if attributes parsed json != nil and attributes parsed json spanid != nil and attributes span id == nil \ id flatten json type flatten field attributes parsed json if attributes parsed json != nil \ id extract timestamp unix type move from attributes parsed json timestamp to attributes timestamp unix if attributes parsed json != nil and attributes parsed json timestamp != nil and type(attributes parsed json timestamp) == "float64" \ id convert unix timestamp type add field attributes timestamp extracted value expr(unixseconds(int(attributes timestamp unix))) if attributes timestamp unix != nil \ id parser timestamp iso8601 type regex parser parse from body regex ^(?p\<timestamp>\d{4} \d{2} \d{2}\[t ]\d{2} \d{2} \d{2}(? \\ \d+)?(?\ z|\[+ ]\d{2} ?\d{2})?) 
parse to attributes timestamp match on error send if attributes timestamp extracted == nil \ id move timestamp iso8601 type move from attributes timestamp match timestamp to attributes timestamp extracted if attributes timestamp match != nil and attributes timestamp match timestamp != nil \ id parser loglevel plain type regex parser parse from body regex (?i)\b(?p\<level>trace|debug|info|warn|warning|error|fatal|critical)\b parse to attributes level match on error send if attributes log level == nil \ id move loglevel plain type move from attributes level match level to attributes log level if attributes level match != nil and attributes level match level != nil \ id parser traceid plain type regex parser parse from body regex trace\[ ]?id\[= ]\s (?p\<trace id>\[a f0 9]{32}|\[a f0 9]{16}) parse to attributes trace match on error send if attributes trace id == nil \ id move traceid plain type move from attributes trace match trace id to attributes trace id if attributes trace match != nil and attributes trace match trace id != nil \ id time parser type time parser parse from attributes timestamp extracted layout type gotime layout 2006 01 02t15 04 05 999999999z07 00 if attributes timestamp extracted != nil \ id set default loglevel type add field attributes log level value info if attributes log level == nil \ id severity parser type severity parser parse from attributes log level poll interval 10s otlp protocols http endpoint 0 0 0 0 4318 grpc endpoint 0 0 0 0 4317 hostmetrics collection interval 60s scrapers cpu metrics system cpu frequency enabled true system cpu logical count enabled true system cpu utilization enabled true disk metrics system disk io enabled true filesystem exclude fs types fs types \ autofs \ binfmt misc \ bpf \ cgroup2 \ configfs \ debugfs \ devpts \ devtmpfs \ fusectl \ hugetlbfs \ iso9660 \ mqueue \ nsfs \ overlay \ proc \ procfs \ pstore \ rpc pipefs \ securityfs \ selinuxfs \ squashfs \ sysfs \ tracefs match type strict exclude mount 
points match type regexp mount points \ /dev/ \ /proc/ \ /sys/ \ /run/containerd/runc/ \ /run/credentials/ \ /run/k3s/containerd/ \ /var/lib/containers/storage/ \ /var/lib/docker/ \ /var/lib/kubelet/ \ /snap/ metrics system filesystem utilization enabled true load cpu average true memory metrics system memory utilization enabled true network metrics system network io enabled true paging {} process metrics process cpu utilization enabled true mute process cgroup error true mute process exe error true mute process io error true mute process name error true mute process user error true resource attributes process owner enabled true processes {} system metrics system uptime enabled true kubeletstats auth type serviceaccount collection interval 30s endpoint ${env\ km node name} 10250 extra metadata labels \ container id insecure skip verify true k8s api config auth type serviceaccount metric groups \ volume \ node \ pod \ container metrics k8s container cpu limit utilization enabled true k8s container cpu request utilization enabled true k8s container memory limit utilization enabled true k8s container memory request utilization enabled true k8s pod cpu limit utilization enabled true k8s pod cpu request utilization enabled true k8s pod memory limit utilization enabled true k8s node memory usage enabled true k8s pod memory request utilization enabled true k8s pod uptime enabled true service extensions \ health check pipelines logs/containers exporters \ otlphttp processors \ resource \ resource/add node name \ resource/cluster \ attributes/logs \ groupbyattrs/filelog \ k8sattributes \ transform/addservicename \ transform/copyservicefromlogattributes \ batch receivers \ filelog/containers metrics/hostmetrics exporters \ otlphttp processors \ resourcedetection \ resource \ resource/hostmetrics \ resource/cluster \ k8sattributes \ transform/ostype \ attributes/logs \ metricstransform/system \ transform/deleteostype \ attributes/metrics \ transform/ratecalculation/copymetric 
\ cumulativetodelta \ deltatorate \ transform/ratecalculation/sumtogauge \ batch receivers \ hostmetrics \ otlp metrics/kubeletstats exporters \ otlphttp processors \ resourcedetection \ resource/add node name \ resource \ k8sattributes \ resource/cluster \ attributes/logs \ transform/ratecalculation/copymetric \ transform/ratecalculation/sumtogauge \ attributes/metrics \ batch receivers \ kubeletstats traces/otlp exporters \ otlphttp processors \ resource \ resource/cluster \ k8sattributes receivers \ otlp

(This is an example; the actual configuration is generated by the UI and may differ based on your selected preferences, such as container logs, namespaces, and APM auto-instrumentation.)

The agent uses these exporters, processors, and Kubernetes receivers to collect metrics, logs, and events and send them securely to KloudMate for visualization, alerting, and analysis.

Post-Integration Data Validation

Verify that metrics are flowing into KloudMate using the Explore view. Metrics appear there only if the integration is successful.

1. Log in to your KloudMate account.
2. Go to Explore, select OpenTelemetry – Metrics, choose a metric exposed by the integration, and click Run Query to see time-series data confirming that telemetry is available.

Standard Kubernetes Dashboards

Kubernetes monitoring dashboards are available at https://templates.kloudmate.com/. To import and start using these templates, follow the steps described in https://docs.kloudmate.com/creating-a-dashboard#sfk3g.

Kubernetes Default Metrics

Cluster Metrics

| Metric name | Description |
| --- | --- |
| k8s.container.cpu_limit | Maximum resource limit set for the container |
| k8s.container.cpu_request | Resource requested for the container |
| k8s.container.ephemeralstorage_limit | Maximum resource limit set for the container |
| k8s.container.ephemeralstorage_request | Resource requested for the container |
| k8s.container.memory_limit | Maximum resource limit set for the container |
| k8s.container.memory_request | Resource requested for the container |
| k8s.container.ready | Whether a container has passed its readiness probe (0 for no, 1 for yes) |
| k8s.container.restarts | How many times the container has restarted in the recent past. This value is pulled directly from the Kubernetes API; it can go indefinitely high and be reset to 0 at any time, depending on how your kubelet is configured to prune dead containers. Do not rely on the exact value. Treat it as either == 0 (no restarts in the recent past) or > 0 (restarts in the recent past), and do not analyze it beyond that. |
| k8s.container.storage_limit | Maximum resource limit set for the container |
| k8s.container.storage_request | Resource requested for the container |
| k8s.cronjob.active_jobs | The number of actively running jobs for a cronjob |
| k8s.daemonset.current_scheduled_nodes | Number of nodes that are running at least 1 daemon pod and are supposed to run the daemon pod |
| k8s.daemonset.desired_scheduled_nodes | Number of nodes that should be running the daemon pod (including nodes currently running the daemon pod) |
| k8s.daemonset.misscheduled_nodes | Number of nodes that are running the daemon pod, but are not supposed to run the daemon pod |
| k8s.daemonset.ready_nodes | Number of nodes that should be running the daemon pod and have one or more of the daemon pod running and ready |
| k8s.deployment.available | Total number of available pods (ready for at least minReadySeconds) targeted by this deployment |
| k8s.deployment.desired | Number of desired pods in this deployment |

Node & Container Metrics

| Metric name | Description |
| --- | --- |
| container.cpu.time | Total cumulative CPU time (sum of all cores) spent by the container/pod/node since its creation |
| container.cpu.usage | Total CPU usage (sum of all cores per second) averaged over the sample window |
| container.filesystem.available | Container filesystem available |
| container.filesystem.capacity | Container filesystem capacity |
| container.filesystem.usage | Container filesystem usage |
| container.memory.available | Container memory available |
| container.memory.major_page_faults | Container memory major page faults |
| container.memory.page_faults | Container memory page faults |
| container.memory.rss | Container memory RSS |
| container.memory.usage | Container memory usage |
| container.memory.working_set | Container memory working set |
| k8s.node.cpu.time | Total cumulative CPU time (sum of all cores) spent by the container/pod/node since its creation |
| k8s.node.cpu.usage | Total CPU usage (sum of all cores per second) averaged over the sample window |
| k8s.node.filesystem.available | Node filesystem available |
| k8s.node.filesystem.capacity | Node filesystem capacity |
| k8s.node.filesystem.usage | Node filesystem usage |
| k8s.node.memory.available | Node memory available |
| k8s.node.memory.major_page_faults | Node memory major page faults |
| k8s.node.memory.page_faults | Node memory page faults |
| k8s.node.memory.rss | Node memory RSS |
| k8s.node.memory.usage | Node memory usage |
| k8s.node.memory.working_set | Node memory working set |
| k8s.node.network.errors | Node network errors |
| k8s.node.network.io | Node network IO |

These are the expected metrics available from the Kubernetes agent integration. They are collected automatically when Kubernetes monitoring is enabled; no additional setup or configuration is required.

For source reference, see the upstream OpenTelemetry metric definitions:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/k8sclusterreceiver/documentation.md
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md
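
The daemonset configuration in this guide derives rate metrics (for example a network IO rate) by copying a cumulative counter, passing the copy through cumulativetodelta, and then through deltatorate. The arithmetic behind those two steps can be sketched in Python as below. This is only an illustration, not the collector's implementation: the `Sample` type, the function names, and the choice to skip an interval when the counter resets are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """One reading of a cumulative counter (hypothetical type for this sketch)."""
    timestamp_s: float  # time of the reading, in seconds
    value: float        # cumulative total at that time, e.g. bytes sent


def cumulative_to_delta(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn consecutive cumulative readings into (interval_s, delta) pairs.

    Assumption: when the counter goes backwards (a reset), the interval
    is skipped rather than reported.
    """
    deltas = []
    for prev, curr in zip(samples, samples[1:]):
        diff = curr.value - prev.value
        if diff < 0:
            continue  # counter reset between the two readings
        deltas.append((curr.timestamp_s - prev.timestamp_s, diff))
    return deltas


def delta_to_rate(deltas: List[Tuple[float, float]]) -> List[float]:
    """Divide each delta by its interval to get a per-second rate."""
    return [delta / interval for interval, delta in deltas if interval > 0]


def to_rates(samples: List[Sample]) -> List[float]:
    """Cumulative readings -> per-second rates, chaining both steps."""
    return delta_to_rate(cumulative_to_delta(samples))


if __name__ == "__main__":
    readings = [
        Sample(0, 1000),
        Sample(30, 4000),
        Sample(60, 10000),
        Sample(90, 500),    # counter reset, e.g. after a restart
        Sample(120, 2000),
    ]
    print(to_rates(readings))
```

With a 30-second collection interval, a counter that grows by 3000 between two readings yields a rate of 100 per second. Keeping the original cumulative series alongside the derived rate, as the configuration does by copying the metric first, lets dashboards show both totals and throughput.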