Alert Lifecycle & States

Once you’ve created an alert, KloudMate evaluates it on a schedule and moves it through a series of states. The control knobs in Configure evaluation settings (Evaluate every, Pending duration, Recovery period, Alert state if No data, and Alert state if Error) each shape when and how those transitions happen.

This page explains the lifecycle so the behavior never surprises you. If you’ve ever wondered “why did my alert fire instantly?” or “why didn’t No Data wait out my pending duration?”, the answer is here.

  1. KloudMate runs the rule’s queries and expressions on the cadence set by Evaluate every (e.g. every 60s).
  2. A single rule can produce many instances — one per series returned by the query (one per host, function, endpoint, …). Each instance is tracked independently and has its own state.
  3. Every evaluation produces one raw outcome per instance, before any of the timing knobs are applied:

| Outcome | What it means |
| --- | --- |
| Breaching | The alert condition evaluated to true. For a Condition Expression, that’s your threshold being crossed. |
| Normal | The alert condition evaluated to false. |
| No Data | The query returned no data points for the window (or the value was null / NaN). |
| Error | A query failed, or the rule couldn’t be evaluated at all: bad query, source unreachable, or an evaluation timeout. |

These raw outcomes are then fed through the timing knobs to produce the instance’s lifecycle state.
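
As a rough illustration of the raw-outcome step, the sketch below classifies one value per instance. It is not KloudMate's evaluator: the `Outcome` enum, the `classify` function, the `query_failed` flag, and the simple "value > threshold" condition are all assumptions made for the example.

```python
import math
from enum import Enum
from typing import Optional


class Outcome(Enum):
    BREACHING = "Breaching"
    NORMAL = "Normal"
    NO_DATA = "No Data"
    ERROR = "Error"


def classify(value: Optional[float], threshold: float, query_failed: bool = False) -> Outcome:
    """Map one instance's evaluated value to a raw outcome.

    Hypothetical helper: assumes a plain "value > threshold" condition,
    whereas a real rule can use any condition expression.
    """
    if query_failed:
        return Outcome.ERROR
    if value is None or math.isnan(value):
        # Null and NaN both count as "no data" for the window.
        return Outcome.NO_DATA
    return Outcome.BREACHING if value > threshold else Outcome.NORMAL


# One raw outcome per instance (here, one per host), each tracked independently.
samples = {"host-a": 93.0, "host-b": 41.5, "host-c": None, "host-d": float("nan")}
outcomes = {host: classify(value, threshold=90.0) for host, value in samples.items()}
```

Here `host-a` is Breaching, `host-b` is Normal, and `host-c` and `host-d` are both No Data. None of the timing knobs have been applied yet.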

| State | Meaning |
| --- | --- |
| Normal | The condition is within threshold. Nothing to do. |
| Pending | The condition is currently breaching but hasn’t held long enough to fire yet. |
| Firing | The condition has breached for at least the Pending duration. This is what notifies. |
| Recovering | The condition has cleared, but the instance is still firing while it waits out the Recovery period. It has not resolved yet. |
| No Data | The query returned no data, and the rule is configured to surface that as its own state. |
| Error | The evaluation errored, and the rule is configured to surface that as its own state. |

Pending and Recovering are derived states — KloudMate computes them from the raw outcome plus the timing knobs. The evaluator itself never reports them.

The core of the lifecycle is the firing episode — the path an instance takes from quiet, to firing, and back to quiet:

Normal ──breach──▶ Pending ──held ≥ pending duration──▶ Firing ──clears──▶ Recovering ──clear held ≥ recovery period──▶ Normal
  • Pending duration gates the way in. The condition must breach continuously for this long before the instance fires. A single Normal evaluation in the middle resets the instance to Normal: the breach has to be sustained, not intermittent. Set it to 0 (or leave it empty) to fire on the first breaching evaluation.
  • Recovery period gates the way out. After an instance clears, it stays Firing (shown as Recovering) until the condition has stayed clear for this long. This stops a flapping metric from resolving and immediately re-firing. Because a resolved alert is terminal, each flap would otherwise open a brand-new alert group. Set it to 0 (or leave it empty) to resolve the instant the condition clears.

Exact transitions:

| From | Evaluation | Goes to | When |
| --- | --- | --- | --- |
| Normal | breaching | Pending | Pending duration > 0 |
| Normal | breaching | Firing | Pending duration = 0 (fire immediately) |
| Pending | breaching | Firing | breach has been sustained ≥ Pending duration |
| Pending | breaching | Pending | still inside the Pending duration |
| Pending | clears | Normal | resets: the breach was never confirmed |
| Firing | clears | Recovering | Recovery period > 0 |
| Firing | clears | Normal (resolved) | Recovery period = 0 (resolve immediately) |
| Recovering | clears | Normal (resolved) | clear has been sustained ≥ Recovery period |
| Recovering | clears | Recovering | still inside the Recovery period |
| Recovering | breaching | Firing | a re-breach during recovery: snaps straight back to Firing |
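
The transition table above can be read as a small state machine. The following is a minimal sketch of that reading, not KloudMate's implementation: `State`, `Instance`, and `step` are hypothetical names, and it assumes each window is measured from the first evaluation of the current breach or clear streak.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    NORMAL = "Normal"
    PENDING = "Pending"
    FIRING = "Firing"
    RECOVERING = "Recovering"


@dataclass
class Instance:
    state: State = State.NORMAL
    # Start of the current breach streak (Pending) or clear streak (Recovering).
    since: Optional[float] = None


def step(inst: Instance, breaching: bool, now: float,
         pending_s: float = 0.0, recovery_s: float = 0.0) -> State:
    """Apply one evaluation (breaching or clear) at time `now`, in seconds."""
    if breaching:
        if inst.state is State.NORMAL:
            if pending_s > 0:
                inst.state, inst.since = State.PENDING, now
            else:  # Pending duration = 0: fire immediately
                inst.state, inst.since = State.FIRING, None
        elif inst.state is State.PENDING:
            if now - inst.since >= pending_s:
                inst.state, inst.since = State.FIRING, None
        elif inst.state is State.RECOVERING:
            # Re-breach during recovery snaps straight back to Firing.
            inst.state, inst.since = State.FIRING, None
    else:
        if inst.state is State.PENDING:
            # The breach was never confirmed, so reset.
            inst.state, inst.since = State.NORMAL, None
        elif inst.state is State.FIRING:
            if recovery_s > 0:
                inst.state, inst.since = State.RECOVERING, now
            else:  # Recovery period = 0: resolve immediately
                inst.state, inst.since = State.NORMAL, None
        elif inst.state is State.RECOVERING:
            if now - inst.since >= recovery_s:
                inst.state, inst.since = State.NORMAL, None
    return inst.state


# Evaluate every 60s, Pending duration 120s, Recovery period 60s.
inst = Instance()
evaluations = [(0, True), (60, True), (120, True), (180, False), (240, False)]
trace = [step(inst, breaching, t, pending_s=120, recovery_s=60).value
         for t, breaching in evaluations]
# trace == ["Pending", "Pending", "Firing", "Recovering", "Normal"]
```

With both windows left at 0, the same function fires on the first breaching evaluation and resolves on the first clear one, matching the "fire immediately" and "resolve immediately" rows.
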

| Knob (form label) | What it controls | Default | Inherits from folder? |
| --- | --- | --- | --- |
| Evaluate every | How often the rule runs. | 60s | Yes |
| Pending duration | How long a breach must hold before Firing. | Fire immediately | No (per rule) |
| Recovery period | How long a clear must hold before resolving. | Resolve immediately | No (per rule) |
| Alert state if No data | Which state a no-data result becomes. | No Data | Yes |
| Alert state if Error | Which state an error result becomes. | Error | Yes |

How No Data and Error interact with the Pending duration and Recovery period is the part that most often surprises people, and it’s the answer to a common question:

I configured the same query, pending duration, and recovery period. Only the Firing state goes through the pending and recovery windows — No Data and Error don’t. Is that expected?

Yes — that’s expected. The Pending duration and Recovery period are properties of the firing episode. They gate the ramp into and out of Firing. No Data and Error are separate evaluation outcomes — by default they are not part of a firing episode, so there is nothing for those windows to debounce. An instance switches into (and out of) No Data or Error immediately.

What each setting does on a no-data or error result:

Alert state if No data

| Target | Behavior when the query returns no data |
| --- | --- |
| No Data (default) | Instance switches to No Data immediately. No pending wait, no recovery wait. |
| Normal | Instance switches to Normal immediately (treated as a clear). |
| Error | Instance switches to Error immediately. |
| Firing | Treated exactly like a threshold breach: it ramps through the Pending duration on the way in, and once firing it honors the Recovery period when data returns. An already-firing instance rides through the gap without resetting. |

Alert state if Error

| Target | Behavior when the evaluation errors |
| --- | --- |
| Error (default) | Instance switches to Error immediately. No pending wait, no recovery wait. |
| Normal | Instance switches to Normal immediately, treated as a clear (so a firing instance enters the Recovery period). |
| Firing | Treated exactly like a threshold breach: it ramps through the Pending duration, and honors the Recovery period when the query recovers. |

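
One way to picture the two tables above is as a routing step that runs before the firing episode. The sketch below is an assumption-laden illustration, not KloudMate's code: `route`, `uses_timing_windows`, and the "breach" / "clear" labels are hypothetical names. Results routed to "breach" or "clear" go through the Pending duration and Recovery period, while "No Data" and "Error" are entered immediately.

```python
from enum import Enum


class Outcome(Enum):
    BREACHING = "Breaching"
    NORMAL = "Normal"
    NO_DATA = "No Data"
    ERROR = "Error"


# Keys are the targets offered by each setting. Values say how the result is handled:
# "breach" and "clear" feed the firing episode; "No Data" and "Error" apply at once.
NO_DATA_TARGETS = {"No Data": "No Data", "Normal": "clear", "Error": "Error", "Firing": "breach"}
ERROR_TARGETS = {"Error": "Error", "Normal": "clear", "Firing": "breach"}


def route(raw: Outcome, no_data_target: str = "No Data", error_target: str = "Error") -> str:
    """Translate a raw outcome using the rule's No data / Error settings."""
    if raw is Outcome.BREACHING:
        return "breach"
    if raw is Outcome.NORMAL:
        return "clear"
    if raw is Outcome.NO_DATA:
        return NO_DATA_TARGETS[no_data_target]
    return ERROR_TARGETS[error_target]


def uses_timing_windows(routed: str) -> bool:
    """Only results that enter the firing episode are debounced."""
    return routed in ("breach", "clear")


default_gap = route(Outcome.NO_DATA)                           # "No Data"
mapped_gap = route(Outcome.NO_DATA, no_data_target="Firing")   # "breach"
```

With the defaults, `uses_timing_windows(default_gap)` is false, which is why a no-data result skips the pending and recovery windows. Mapping the setting to Firing turns the same gap into an ordinary breach.
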
  • Transient blips on a firing alert. A firing instance only rides through a momentary No Data or Error gap when that state is mapped to Firing. If No Data / Error are left on their defaults, a firing instance switches to No Data / Error on the next bad evaluation — it does not stay “Firing” through the gap.
  • A series that disappears entirely. When a whole series stops appearing in the query results (a host is decommissioned, a metric is dropped), KloudMate waits a short grace window — about two evaluation intervals — and then treats the vanished instance as resolved and closes it. This is distinct from No Data, which is the query running and returning nothing.
  • Multi-instance rules. A rule’s overall state reflects its instances: if any instance is firing (or recovering), the rule reads as Firing; otherwise it surfaces the most significant instance state. Each instance still runs its own lifecycle independently.
  • Silences and maintenance windows don’t change state. They gate the outbound notification, not the lifecycle. A silenced instance keeps evaluating and transitioning, and its state changes still land in history — only the notification is withheld. See Silences and Maintenance Windows.
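
The multi-instance rollup described above could be sketched as follows. The function name `rule_state` is hypothetical, and the `SIGNIFICANCE` ordering is purely an assumption: this page says the rule surfaces "the most significant instance state" without spelling out the ranking.

```python
from typing import Iterable

# ASSUMED ranking for non-firing states, most significant first.
SIGNIFICANCE = ["Error", "No Data", "Pending", "Normal"]


def rule_state(instance_states: Iterable[str]) -> str:
    """Roll independent instance states up into one rule-level state."""
    states = set(instance_states)
    # Any firing (or recovering) instance makes the whole rule read as Firing.
    if states & {"Firing", "Recovering"}:
        return "Firing"
    for candidate in SIGNIFICANCE:
        if candidate in states:
            return candidate
    return "Normal"  # no instances at all


overall = rule_state(["Normal", "Recovering", "Pending"])  # "Firing"
```

Each instance still runs its own lifecycle. The rollup only summarizes the instances and does not feed back into their states.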