Monitoring metrics and logs in a Kubernetes cluster is essential for keeping workloads running well, using resources efficiently, and maintaining the overall health of the cluster.
By observing the different parts of the system and the metrics that describe their performance, you can gain useful insights, troubleshoot problems, and confirm that resources such as CPU and memory are being used wisely.
In this guide, we will look at the main Kubernetes components worth monitoring, the key metrics they expose, the RBAC permissions needed to read them, and how to configure the OpenTelemetry Collector to gather this telemetry and forward it to an observability backend (in this guide, KloudMate).
Kubernetes Components to Monitor:
The OpenTelemetry Collector offers a variety of components to help with monitoring Kubernetes. The most important ones for collecting Kubernetes data are:
kubeletstats receiver
k8sobjects receiver
hostmetrics receiver
k8s_cluster receiver
filelog receiver
prometheus receiver
Together, these receivers cover the metrics and logs you need from a Kubernetes cluster.
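Before diving into the full setup, it helps to see how a receiver fits into a collector pipeline. The sketch below wires the kubeletstats receiver to the debug exporter, which simply prints telemetry to stdout; the endpoint assumes a K8S_NODE_NAME environment variable is injected into the collector pod, as is done later in this guide:

```yaml
receivers:
  kubeletstats:
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"
exporters:
  debug: {}
service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      exporters: [debug]
```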
Following are the steps to integrate OpenTelemetry with your cluster:
Step 1: Install cert-manager.
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
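Before moving on, it is worth confirming that cert-manager's pods are up, since the OpenTelemetry Operator's admission webhook depends on it (the cert-manager namespace is created by the manifest above):

```shell
kubectl get pods -n cert-manager
kubectl wait --for=condition=Available deployment/cert-manager-webhook \
  -n cert-manager --timeout=120s
```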
Step 2: Install the OpenTelemetry Operator.
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
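Likewise, you can verify that the operator is running (the default manifest installs it into the opentelemetry-operator-system namespace):

```shell
kubectl get pods -n opentelemetry-operator-system
```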
Step 3: Install kube-state-metrics and node-exporter so that the Prometheus receiver can scrape Kubernetes resource metrics and node-level metrics.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install ksm prometheus-community/kube-state-metrics -n "default"
helm install nodeexporter prometheus-community/prometheus-node-exporter -n "default"
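The Prometheus scrape configs used later in this guide select these pods by their app.kubernetes.io/name labels, so confirm the pods are running and labelled as expected:

```shell
kubectl get pods -n default -l app.kubernetes.io/name=kube-state-metrics
kubectl get pods -n default -l app.kubernetes.io/name=prometheus-node-exporter
```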
Step 4: Enable the Kubernetes audit log for richer log data. Refer to the Kubernetes auditing documentation for how to enable it.
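Enabling auditing means passing an audit policy file to the kube-apiserver via the --audit-policy-file and --audit-log-path flags (how you do this varies by distribution, and managed clusters often expose it differently). A minimal policy that records request metadata for all requests might look like this:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log request metadata (user, verb, resource) for every request.
  - level: Metadata
```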
Step 5: The manifest below creates the required resources (ServiceAccount, RBAC, ConfigMap, and DaemonSet) and sends Kubernetes telemetry data to KloudMate.
apiVersion: v1
kind: ServiceAccount
metadata:
name: otel-collector
namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otel-collector
namespace: default
rules:
-
apiGroups:
- ""
- apps
- autoscaling
- batch
- extensions
- policy
- rbac.authorization.k8s.io
resources:
- componentstatuses
- configmaps
- nodes/proxy
- daemonsets
- deployments
- events
- cronjobs
- statefulsets
- endpoints
- horizontalpodautoscalers
      - ingresses
- jobs
- limitranges
- namespaces
- nodes
- pods
- nodes/stats
- persistentvolumes
- persistentvolumeclaims
- resourcequotas
- replicasets
- replicationcontrollers
- serviceaccounts
- services
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: otel-collector
namespace: default
subjects:
- kind: ServiceAccount
name: otel-collector
namespace: default
roleRef:
kind: ClusterRole
name: otel-collector
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector
namespace: default
data:
config.yaml: |
receivers:
prometheus:
config:
scrape_configs:
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
job_name: integrations/kubernetes/cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- replacement: kubernetes.default.svc.cluster.local:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
server_name: kubernetes
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
job_name: integrations/kubernetes/kubelet
kubernetes_sd_configs:
- role: node
relabel_configs:
- replacement: kubernetes.default.svc.cluster.local:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$${1}/proxy/metrics
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
server_name: kubernetes
- job_name: integrations/kubernetes/kube-state-metrics
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: kube-state-metrics
source_labels:
- __meta_kubernetes_pod_label_app_kubernetes_io_name
- job_name: integrations/node_exporter
kubernetes_sd_configs:
- namespaces:
names:
- default
role: pod
relabel_configs:
- action: keep
regex: prometheus-node-exporter.*
source_labels:
- __meta_kubernetes_pod_label_app_kubernetes_io_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: instance
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
kubeletstats:
collection_interval: 30s
auth_type: "serviceAccount"
insecure_skip_verify: true
endpoint: "https://${env:K8S_NODE_NAME}:10250"
metric_groups:
- node
- pod
- container
- volume
hostmetrics:
collection_interval: 30s
scrapers:
load:
filesystem:
memory:
network:
disk:
cpu:
k8s_cluster:
collection_interval: 30s
        auth_type: serviceAccount
node_conditions_to_report: [Ready, MemoryPressure, DiskPressure, NetworkUnavailable]
allocatable_types_to_report: [cpu, memory, storage, ephemeral-storage]
k8s_events:
        auth_type: "serviceAccount"
otlp:
protocols:
grpc:
http:
filelog:
include:
- /var/log/pods/*/*/*.log
exclude:
- /var/log/pods/*/otel-collector/*.log
start_at: beginning
include_file_path: true
include_file_name: false
operators:
- type: router
id: get-format
routes:
- output: parser-docker
expr: 'body matches "^\\{"'
- output: parser-crio
expr: 'body matches "^[^ Z]+ "'
- output: parser-containerd
expr: 'body matches "^[^ Z]+Z"'
- type: regex_parser
id: parser-crio
regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
output: extract_metadata_from_filepath
timestamp:
parse_from: attributes.time
layout_type: gotime
layout: '2006-01-02T15:04:05.999999999Z07:00'
- type: regex_parser
id: parser-containerd
regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
output: extract_metadata_from_filepath
timestamp:
parse_from: attributes.time
layout: '%Y-%m-%dT%H:%M:%S.%LZ'
- type: json_parser
id: parser-docker
output: extract_metadata_from_filepath
timestamp:
parse_from: attributes.time
layout: '%Y-%m-%dT%H:%M:%S.%LZ'
- type: move
from: attributes.log
to: body
- type: regex_parser
id: extract_metadata_from_filepath
regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]{36})\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
parse_from: attributes["log.file.path"]
cache:
size: 128
- type: move
from: attributes.stream
to: attributes["log.iostream"]
- type: move
from: attributes.container_name
to: resource["k8s.container.name"]
- type: move
from: attributes.namespace
to: resource["k8s.namespace.name"]
- type: move
from: attributes.pod_name
to: resource["k8s.pod.name"]
- type: move
from: attributes.restart_count
to: resource["k8s.container.restart_count"]
- type: move
from: attributes.uid
to: resource["k8s.pod.uid"]
attributes:
service.name: "kubernetes"
processors:
cumulativetodelta:
batch:
send_batch_size: 10000
timeout: 30s
k8sattributes:
auth_type: "serviceAccount"
passthrough: true
filter:
          node_from_env_var: K8S_NODE_NAME
extract:
metadata:
- k8s.pod.name
- k8s.pod.uid
- k8s.deployment.name
- k8s.namespace.name
- k8s.node.name
- k8s.pod.start_time
labels:
- tag_name: app.label.component
key: app.kubernetes.io/component
from: pod
pod_association:
- sources:
- from: resource_attribute
name: k8s.pod.ip
- sources:
- from: resource_attribute
name: k8s.pod.uid
- sources:
- from: connection
resourcedetection:
detectors:
- env
- ec2
- system
- docker
timeout: 5s
override: false
attributes/logs:
actions:
- key: source
from_attribute: name
action: upsert
- key: source
from_attribute: operator_type
action: upsert
- key: source
from_attribute: log.file.name
action: upsert
- key: source
from_attribute: fluent.tag
action: upsert
- key: source
from_attribute: service.name
action: upsert
- key: source
from_attribute: project.name
action: upsert
- key: source
from_attribute: serviceName
action: upsert
- key: source
from_attribute: projectName
action: upsert
- key: source
from_attribute: pod_name
action: upsert
- key: source
from_attribute: container_name
action: upsert
- key: source
from_attribute: namespace
action: upsert
exporters:
otlphttp:
endpoint: 'https://otel.kloudmate.com:4318'
headers:
Authorization: xxxxxxxxxxx
service:
pipelines:
metrics:
receivers: [prometheus, kubeletstats, hostmetrics, otlp, k8s_cluster]
processors: [batch]
exporters: [otlphttp]
logs:
receivers: [ otlp, filelog, k8s_events]
processors: [k8sattributes, resourcedetection, attributes/logs]
exporters: [otlphttp]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-collector
namespace: default
spec:
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
serviceAccountName: otel-collector
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
args: ["--config=/etc/otel-collector-config/config.yaml"]
env:
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: config-vol
mountPath: /etc/otel-collector-config
- name: varlogpods
mountPath: /var/log/pods
readOnly: true
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: varlogcontainers
mountPath: /var/log/containers
readOnly: true
volumes:
- name: config-vol
configMap:
name: otel-collector
- name: varlogpods
hostPath:
path: /var/log/pods
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: varlogcontainers
hostPath:
path: /var/log/containers
securityContext:
runAsUser: 0
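Save the manifest above to a file (for example, otel-collector.yaml, a name chosen here for illustration), apply it, and check that one collector pod is running per node:

```shell
kubectl apply -f otel-collector.yaml
kubectl get daemonset otel-collector -n default
kubectl logs -n default -l app=otel-collector --tail=20
```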
This OpenTelemetry Collector configuration collects metrics and logs from all the components mentioned above and transmits them to the KloudMate backend.
By pairing the OpenTelemetry Collector with appropriate RBAC authorization, you can effectively gather and export telemetry, gaining valuable insight into resource usage, performance, and the overall health of your cluster.