
SLI Kinds

Step 1 of the SLO wizard picks the SLI — the specific signal you measure compliance against. KloudMate ships six SLI kinds, each with its own conditional form fields. This page walks through them with a “When to use this” callout and the form fields for each.

The kind picker groups the kinds into three families (the same way Datadog frames its SLO types). Pick a family card first, then the specific kind:

| Family | Reliability measured as | Error budget | Kinds |
| --- | --- | --- | --- |
| By Count | A ratio of good ÷ total events. | A count of events. | APM error rate, APM latency, APM request rate, Custom metric |
| By Monitor Uptime | Uptime from incidents or a synthetic monitor. | A duration (downtime). | Incident availability, Synthetic uptime |
| By Time Slices | The share of time a metric meets a condition. | A duration. | Time slices |

Two things changed in how SLIs are sourced, and they’re worth calling out up front:

  • Only Incident availability references a service. Every other kind is workspace-level — it measures a metric or a synthetic monitor, not an Incident-Management service. The service for an incident-based SLO is part of that SLI’s configuration, not a property of the SLO.
  • No raw-log or raw-trace SLIs. KloudMate doesn’t scan raw logs or arbitrary traces for SLIs. If you want reliability based on logs or traces, derive a metric from them in your collector and use Custom metric or Time slices. The APM kinds read pre-aggregated service span-metrics, not raw spans.

For the metric-based kinds (Custom metric, Time slices) the metric-name and attribute pickers suggest values discovered from your workspace’s OpenTelemetry metrics. The APM kinds’ service-name picker suggests services emitting span metrics. All pickers are free-solo, so you can type a value that doesn’t have telemetry yet — useful when you set up an SLO before the service starts emitting.

Incident availability

When to use: “This service should be up at least 99.9% of the time, measured by unresolved incident duration.”

Fields:

  • Service — required. The Incident-Management service whose incidents count against this SLO. This is the only place an SLO chooses a service. If the service is later deleted, it still shows in the picker as “name (deleted)” so the SLO keeps working.
  • Severities (optional) — multi-select, free-text (e.g. critical, high). Leave empty to count incidents of all severities.

The SLI is “share of the window when no matching incident was open on this service.” Time inside an enabled one-shot maintenance window on the service is excluded from the calculation rather than counted as bad. The error budget is a duration (seconds of downtime).
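The calculation described above can be sketched with plain interval arithmetic. This is a minimal illustration, not KloudMate's implementation: the function names, the `(start, end)` tuple representation in seconds, and the choice to count overlapping incidents once are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; hypothetical representation


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so shared time is counted once (assumption)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def clip(intervals: List[Interval], lo: float, hi: float) -> List[Interval]:
    """Keep only the parts of each interval that fall inside [lo, hi]."""
    return [(max(s, lo), min(e, hi)) for s, e in intervals if min(e, hi) > max(s, lo)]


def subtract(intervals: List[Interval], holes: List[Interval]) -> List[Interval]:
    """Remove `holes` from `intervals`; both inputs must be merged and sorted."""
    result: List[Interval] = []
    for start, end in intervals:
        cursor = start
        for hole_start, hole_end in holes:
            if hole_end <= cursor or hole_start >= end:
                continue
            if hole_start > cursor:
                result.append((cursor, hole_start))
            cursor = max(cursor, hole_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def total(intervals: List[Interval]) -> float:
    return sum(end - start for start, end in intervals)


def incident_availability(window: Interval,
                          incidents: List[Interval],
                          maintenance: List[Interval]) -> float:
    """Share of measured time with no matching incident open."""
    lo, hi = window
    maint = merge(clip(maintenance, lo, hi))
    # Incident time inside maintenance is excluded rather than counted as bad.
    downtime = subtract(merge(clip(incidents, lo, hi)), maint)
    measured = (hi - lo) - total(maint)
    if measured <= 0:
        return 1.0  # nothing left to measure; treated as compliant (assumption)
    return 1.0 - total(downtime) / measured
```

For a 1,000-second window with incidents open over 100–200 s and 150–300 s and a maintenance window over 250–350 s, the sketch counts 150 s of downtime against 900 s of measured time.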

Synthetic uptime

When to use: “Track a synthetic monitor’s uptime: the share of time its checks pass, with downtime as the error budget.”

Fields:

  • Synthetic monitor — required. Pick one of the workspace’s synthetic monitors.

Synthetic uptime is a preset, not a separate engine — it’s stored as a Time slices SLI over the monitor’s success signal (kloudmate_synthetic_check_success), with one slice per check (the slice width is set to the monitor’s check frequency) and a “slice is good when the check passed” condition. Slices with no run (monitor paused or not yet running) count as good, so a paused monitor doesn’t burn the budget. The error budget reads as downtime (a duration).
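A small sketch of how the preset scores a window, under the assumption that each slice can be represented as `True` (check passed), `False` (check failed), or `None` (no run). The function name and input shape are hypothetical and only illustrate the "one slice per check, no run counts as good" rule.

```python
from typing import List, Optional, Tuple


def synthetic_uptime(results: List[Optional[bool]],
                     check_interval_s: float) -> Tuple[float, float]:
    """Return (uptime share, downtime in seconds) for one slice per check."""
    if not results:
        raise ValueError("the window contains no slices")
    # Only explicit failures are bad; slices with no run (None) count as good.
    bad = sum(1 for result in results if result is False)
    uptime = 1 - bad / len(results)
    downtime_s = bad * check_interval_s
    return uptime, downtime_s
```

With six slices at a 60-second check frequency, two failures and one paused slice give an uptime of 4/6 and 120 seconds of downtime.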

The By Count kinds express reliability as good ÷ total events, so the error budget is a count.

APM error rate

When to use: “Less than 1% of requests to this service should error.” (RED’s Errors.)

Fields:

  • Service name — required. Picker suggests services emitting span metrics.

The SLI is “share of a service’s spans that did not error.” A span counts as an error when its OpenTelemetry span status is Error. It’s computed from the service’s pre-aggregated request/error span-metrics — KloudMate doesn’t scan raw traces. If you need a custom definition of “error” (e.g. specific HTTP status codes), derive an error metric in your collector and use Custom metric instead.

APM latency

When to use: “At least 99% of this service’s spans should complete under 200 ms.” (RED’s Duration.)

Fields:

  • Service name — required. Picker suggests services emitting span metrics.
  • Latency threshold (ms) — required.

The SLI counts spans that finished at or below the threshold as good and uses all spans as the denominator. It reads the service’s pre-aggregated duration histogram and sums the buckets whose upper bound is within the threshold, so a 200 ms threshold with a 99% target means “at least 99% of spans should finish in 200 ms or less.”
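The bucket summation can be sketched as follows. The example assumes the histogram is available as a list of `(upper_bound_ms, count)` pairs with non-cumulative counts and an infinite upper bound for the overflow bucket; that shape and the function name are illustrative, not KloudMate's storage format.

```python
import math
from typing import List, Optional, Tuple


def latency_sli(buckets: List[Tuple[float, int]],
                threshold_ms: float) -> Optional[float]:
    """Share of spans in buckets whose upper bound is within the threshold."""
    total = sum(count for _, count in buckets)
    if total == 0:
        return None  # no spans recorded, so there is nothing to score
    good = sum(count for upper, count in buckets if upper <= threshold_ms)
    return good / total


# Hypothetical histogram: 1,000 spans spread over five buckets.
BUCKETS = [(50, 700), (100, 200), (200, 90), (500, 8), (math.inf, 2)]
```

In this sketch a 200 ms threshold scores 990 of 1,000 spans as good. A threshold of 150 ms scores only 900, because the 100–200 ms bucket cannot be split; choosing a threshold that lines up with a bucket boundary avoids that loss of precision.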

APM request rate

When to use: “This service should serve at least 60 requests/minute continuously.” (RED’s Rate, time-sliced.)

Fields:

  • Service name — required. Picker suggests services emitting span metrics.
  • Min requests / minute — required. The floor.
  • Bucket size (minutes, optional) — default 1. Controls the time-slicing granularity; larger buckets smooth spiky traffic, smaller buckets detect short stalls.

The window is split into buckets; each bucket counts as good if its request volume is at or above the floor, bad otherwise. Empty buckets count as below the floor (bad) — correct for a traffic floor, but it means a sparse metric reads as catastrophic, so the preview warns when the metric covers little of the window.
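The bucket scoring can be illustrated with a short sketch. It assumes per-minute request counts keyed by minute offset and compares each bucket's average per-minute rate to the floor; how the real engine scales the floor for multi-minute buckets is not stated here, so treat that detail and all names as assumptions.

```python
import math
from typing import Dict


def request_rate_sli(counts_by_minute: Dict[int, int],
                     window_minutes: int,
                     min_per_minute: float,
                     bucket_minutes: int = 1) -> float:
    """Share of buckets whose request volume is at or above the floor."""
    n_buckets = math.ceil(window_minutes / bucket_minutes)
    good = 0
    for bucket in range(n_buckets):
        start = bucket * bucket_minutes
        end = min(start + bucket_minutes, window_minutes)
        # Minutes with no data contribute zero, so an empty bucket is bad.
        volume = sum(counts_by_minute.get(minute, 0) for minute in range(start, end))
        if volume / (end - start) >= min_per_minute:
            good += 1
    return good / n_buckets
```

With counts of 60, 70, 10, 120, none, and 60 over six minutes and a floor of 60, one-minute buckets give 4 good out of 6, while two-minute buckets give 2 good out of 3: the larger bucket absorbs the dip in minute 2 but still flags the gap in minute 4.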

Custom metric

When to use: “Build an SLI from arbitrary OTLP metrics: a ratio of two metric aggregations (good ÷ total).”

A Custom metric SLI is a ratio of two independent metric aggregations:

  • Good events (the numerator) — the metric + aggregation + optional filters for the events that count as a success.
  • Total events (the denominator) — the metric + aggregation + optional filters for all eligible events.

Both sides can use the same metric split by filters (e.g. the metric filtered to a status label of success for good events, over the same metric with no filter for total events) or two different metrics (e.g. http.requests.success over http.requests.total). The error budget is a count of events.

The aggregation for each side adapts to the metric you pick — see Type-aware aggregation below.
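To make the count-based error budget concrete, here is a minimal sketch using the conventional SLO arithmetic: the budget is the share of total events the target allows to fail. The names are hypothetical, and the formula is the standard formulation rather than a confirmed description of KloudMate's engine.

```python
from dataclasses import dataclass


@dataclass
class RatioResult:
    compliance: float        # good / total
    budget_events: float     # events allowed to fail at this target
    bad_events: float        # total - good
    remaining_events: float  # budget left; negative when the SLO is breached


def ratio_sli(good: float, total: float, target: float) -> RatioResult:
    """Score a good/total ratio against a target such as 0.999."""
    if total <= 0:
        raise ValueError("total events must be positive")
    bad = total - good
    budget = (1 - target) * total
    return RatioResult(good / total, budget, bad, budget - bad)
```

For 99,950 good events out of 100,000 at a 99.9% target, the sketch reports 99.95% compliance, a budget of about 100 events, 50 bad events, and about 50 events of budget remaining.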

Time slices

When to use: “A custom uptime definition: the share of time a metric (or a formula across several metrics) meets a condition.”

This is the time-based counterpart to Custom metric: instead of counting events, it splits the window into fixed slices and scores each slice good or bad. The error budget is a duration.

Fields:

  • Queries — one or more metric queries (labelled a, b, …), each a metric + aggregation + optional filters. Use Add query for more.
  • Formula (optional) — combine the queries by id, e.g. $a / $b. A single-query SLI evaluates as $a. Use Add formula to reveal the field.
  • Uptime condition: “A slice is good when the value is < / ≤ / > / ≥ <value>.”
  • Slice width: 1 minute or 5 minutes.
  • No-data policy — how slices with no measured data count: Good (default — a gap isn’t a breach, suits low-traffic or synthetic uptime), Bad (a gap is a real problem, suits always-on metrics), or Excluded (drop the slice from the denominator).

Each slice’s (formula) value is compared against the condition; compliance is the share of good time.
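The per-slice evaluation, including the three no-data policies, can be sketched like this. It assumes each slice arrives as a dict of query values keyed by query id, or `None` when the slice has no data, and that the formula is passed as a Python callable; the comparison operator set and all names are assumptions for the example.

```python
import operator
from typing import Callable, Dict, List, Optional

# Assumed operator set for the uptime condition.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def time_slices_sli(slices: List[Optional[Dict[str, float]]],
                    formula: Callable[[Dict[str, float]], float],
                    op: str,
                    threshold: float,
                    no_data: str = "good") -> Optional[float]:
    """Share of counted slices whose formula value meets the condition."""
    if no_data not in ("good", "bad", "excluded"):
        raise ValueError(f"unknown no-data policy: {no_data}")
    compare = OPS[op]
    good = counted = 0
    for values in slices:
        if values is None:
            if no_data == "excluded":
                continue  # dropped from the denominator
            counted += 1
            if no_data == "good":
                good += 1
            continue
        counted += 1
        if compare(formula(values), threshold):
            good += 1
    return good / counted if counted else None
```

Scoring the formula `$a / $b` against the condition `< 0.05` over four slices, one of which has no data and one of which breaches, gives 0.75 under Good, 0.5 under Bad, and 2/3 under Excluded.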

Type-aware aggregation

For Custom metric and Time slices, the aggregation dropdown and its default adapt to the metric’s instrument type and temporality, captured automatically when you pick the metric (the same way dashboard panels and alert rules work):

| Metric type | Aggregations offered | Default |
| --- | --- | --- |
| Counter (delta) | Sum, Rate, Last | Sum: totals the per-interval counts over the window. |
| Counter (cumulative) | Increase, Rate, Last | Increase: last − first, which counts the events over the window. |
| Gauge | Avg, Sum, Min, Max, Last, Count… | Avg over the window. |
| Histogram | P25 … P99 | P95 (Custom metric only; histograms aren’t supported per-slice). |

A free-typed metric whose type isn’t known yet is treated as a counter (the engine’s default).
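The reason the two counter temporalities get different defaults can be shown with a short sketch: both aggregations are meant to recover the number of events in the window, but from differently shaped data. The function is illustrative only and ignores counter resets, which a real cumulative Increase would need to handle.

```python
from typing import Sequence


def count_events(points: Sequence[float], temporality: str) -> float:
    """Events over the window for a counter, by temporality."""
    if not points:
        return 0.0
    if temporality == "delta":
        # Each point is the count for its own interval, so Sum totals them.
        return float(sum(points))
    if temporality == "cumulative":
        # Each point is a running total, so Increase is last minus first.
        return float(points[-1] - points[0])
    raise ValueError(f"unknown temporality: {temporality}")
```

A delta series of 5, 3, 7 and a cumulative series of 100, 105, 108, 115 both describe 15 events over the window. Summing the cumulative series instead would give 428, which is why the default follows the temporality.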

The optional filter rows on Custom metric and Time slices queries have three columns:

  • Attribute — autocomplete that suggests the metric’s label keys.
  • Operator: Equals / Not equals / In (any of) / Not in.
  • Value — autocomplete that suggests values for the chosen attribute. Multi-select for In / Not in, scalar for Equals / Not equals.

+ Add filter appends a new row; the trash icon removes one.
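The four operators can be sketched as a predicate over a data point's attributes. The tuple shape, the operator spellings, and the choice to combine rows with AND are assumptions for illustration; the behavior for a missing attribute is likewise a guess.

```python
from typing import Any, Dict, List, Tuple

# (attribute, operator, value); value is a list for "in" / "not_in".
Filter = Tuple[str, str, Any]


def matches(attributes: Dict[str, str], filters: List[Filter]) -> bool:
    """Return True when a data point satisfies every filter row."""
    for attribute, operator_name, value in filters:
        actual = attributes.get(attribute)  # None when the attribute is absent
        if operator_name == "equals":
            ok = actual == value
        elif operator_name == "not_equals":
            ok = actual != value
        elif operator_name == "in":
            ok = actual in value
        elif operator_name == "not_in":
            ok = actual not in value
        else:
            raise ValueError(f"unknown operator: {operator_name}")
        if not ok:
            return False
    return True
```

For example, a point with `status = success` and `region = eu-west-1` passes the rows `status Equals success` and `region In (eu-west-1, us-east-1)`, and fails `status Not equals success`.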