Every on-call veteran has the same intuition: a meaningful portion of what pages us isn’t actionable. What most of us don’t have is a way to prove it — let alone to classify why a given alert is noisy. Without classification, the only lever is blunt: raise the threshold, disable the alert, ignore the channel. None of these distinguishes an alert that’s noisy because the platform self-heals from one that’s noisy because it flaps.

I wanted a framework that diagnoses why an alert is noisy, so the fix matches the cause. PagerDuty is the concrete backdrop here — the fields created_at, resolved_at, and title are what make the measurements possible — but nothing in the framework is specific to it.

Alerts must be group-able before they can be measured

This is the step most discussions skip, but nothing downstream works without it.

Alert titles are written for humans. They embed mutable tokens — instance IDs, percentages, count ratios. Two alerts from the same underlying rule almost never have identical titles:

[backend-47] CPU at 91%
[backend-12] CPU at 88%

If you group by raw title, frequency is always 1, and everything looks unique. Signal-to-noise is undefined.

Before anything else, alerts need a normalization pass that strips variable tokens and leaves a stable template. A reasonable starting rule set:

  • Numbers inside brackets […] → placeholder
  • Percentages like 91% → placeholder
  • Ratios like 448/500 → placeholder
  • Extend as needed: UUIDs, hostnames, timestamps, IPs

The two lines above collapse to:

[backend-X] CPU at X%

Now we have a groupable unit. Everything that follows assumes this step has been taken.
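A minimal sketch of that normalization pass, as ordered regex substitutions. The rule set and the `X` placeholder mirror the list above; the exact patterns are illustrative and should be extended for UUIDs, hostnames, timestamps, and IPs as the bullet suggests:

```python
import re

# Ordered rules: most specific first, so a ratio like 448/500 collapses
# before the bare-number rule could split it into two placeholders.
RULES = [
    (re.compile(r"\d+/\d+"), "X/X"),        # count ratios: 448/500
    (re.compile(r"\d+(\.\d+)?%"), "X%"),    # percentages: 91%
    (re.compile(r"(?<=[\[\w-])\d+"), "X"),  # numbers in brackets / hostnames: backend-47
]

def normalize(title: str) -> str:
    """Strip mutable tokens so alerts collapse into a stable template."""
    for pattern, placeholder in RULES:
        title = pattern.sub(placeholder, title)
    return title
```

Applied to the two example titles, both yield `[backend-X] CPU at X%`, so they count as one template.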

This deserves to be stated loudly, because it reframes the conversation. Alert noise isn’t measurable on raw alerts — it’s measurable on alert templates. Anyone debating SNR without a normalization pass is debating noise, not signal.

Three axes that matter

With templates in hand, three independent measurements per template become possible.

Frequency — how often the template fires. It tells you which templates dominate volume. It misses that high frequency isn’t inherently bad: a health check correctly firing 500 times during a real outage is doing its job.

Inter-event time — the gap between consecutive firings of the same template. Sort by created_at, take the diff. It tells you whether a template fires in bursts or steadily. A short inter-event time (seconds to a few minutes) usually means flapping — the same root condition trips the detector repeatedly. It misses that a template can have a short inter-event time during one incident and a long one across the quarter; look at the distribution, not the mean.

Resolution time — resolved_at - created_at. It tells you whether the system is doing work to fix the alert, or whether the condition flickers and clears on its own. It misses that short resolution time doesn’t mean unimportant — a well-designed autoscaler can resolve legitimate load alerts in seconds.

Each axis is useful. None is sufficient alone.
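The three measurements per template can be sketched in a few lines. The input shape — (template, created_at, resolved_at) tuples — is an assumption for illustration, and medians are used as one robust way to summarize the distributions the text warns about (rather than means):

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

def template_metrics(incidents):
    """incidents: iterable of (template, created_at, resolved_at) tuples.
    Returns {template: (frequency, median_inter_event, median_resolution)};
    median_inter_event is None when a template fired only once."""
    by_template = defaultdict(list)
    for template, created, resolved in incidents:
        by_template[template].append((created, resolved))
    metrics = {}
    for template, events in by_template.items():
        events.sort(key=lambda e: e[0])  # sort by created_at
        gaps = [b[0] - a[0] for a, b in zip(events, events[1:])]  # inter-event time
        resolutions = [resolved - created for created, resolved in events]
        metrics[template] = (
            len(events),                        # frequency
            median(gaps) if gaps else None,     # inter-event time
            median(resolutions),                # resolution time
        )
    return metrics
```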

The diagnostic: two axes, four quadrants

The insight that made this worth writing down is that inter-event time × resolution time, conditioned on high frequency, produces a clean 2×2:

                      Short resolution time     Long resolution time
  Short inter-event   Flapping, self-healing    Flapping, human-intervention-required
  Long inter-event    Rare transients           Genuine incidents
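One way to turn the 2×2 into code, given a template’s median inter-event and resolution times. The five-minute thresholds are placeholder assumptions — “short” should be calibrated to something like human reaction time in your rotation:

```python
from datetime import timedelta

# Illustrative thresholds; tune per environment.
SHORT_INTER_EVENT = timedelta(minutes=5)
SHORT_RESOLUTION = timedelta(minutes=5)

def quadrant(median_inter_event, median_resolution):
    """Classify a template into one of the four quadrants.
    median_inter_event may be None (single firing), which counts as long."""
    short_gap = median_inter_event is not None and median_inter_event < SHORT_INTER_EVENT
    short_res = median_resolution < SHORT_RESOLUTION
    if short_gap and short_res:
        return "flapping, self-healing"
    if short_gap:
        return "flapping, human-intervention-required"
    if short_res:
        return "rare transient"
    return "genuine incident"
```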

Each quadrant implies a different fix:

  • Flapping, self-healing (top-left): the underlying system is absorbing the condition before a human could even respond. The alert is observing a transient it cannot act on. Fix: add hysteresis or a debounce window at the detector. Don’t page on a condition that resolves faster than human reaction time.

  • Flapping, human-required (top-right): a real problem that keeps re-firing because each firing is a separate ticket. Fix: dedup at the alerting layer. One incident per burst, not one per firing. Adjust thresholds to reflect the sustained state, not the instantaneous one.

  • Rare transients (bottom-left): fire once in a while, clear quickly. Almost always noise — nothing to act on. Fix: raise the threshold, or remove the alert. If it’s never actionable, it shouldn’t page.

  • Genuine incidents (bottom-right): infrequent, take real time to resolve. This is what on-call should be seeing. Everything else is a distraction from this quadrant.
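The top-right fix — one incident per burst, not one per firing — can be sketched as a simple gap-based dedup. The ten-minute window is an assumption; in practice it should come from the flapping template’s observed inter-event distribution:

```python
from datetime import datetime, timedelta

def dedup_bursts(firings, window=timedelta(minutes=10)):
    """Collapse firing timestamps into bursts: consecutive firings closer
    together than `window` are grouped into one incident."""
    incidents = []
    for t in sorted(firings):
        if incidents and t - incidents[-1][-1] < window:
            incidents[-1].append(t)  # continues the current burst
        else:
            incidents.append([t])    # gap exceeded: new incident
    return incidents
```

Three firings a minute apart followed by one half an hour later become two incidents instead of four pages.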

Most alert channels I’ve looked at spend the majority of their volume in the top-left and bottom-left — cheap-to-generate, cheap-to-clear — while the signal that matters is drowning in the bottom-right.

Why these three, not others

The three axes here share a property: they’re all temporal. That’s deliberate. Noise is, fundamentally, a property of how a signal distributes over time. You can’t detect it from labels; you detect it from timing.

Operationalizing

A one-off report is interesting. A recurring pass is useful.

The usable form is a weekly or monthly job that:

  1. Pulls the window of incidents from PagerDuty’s list incidents endpoint
  2. Normalizes titles into templates
  3. Computes the three metrics per template
  4. Flags the top N templates in each problematic quadrant
  5. Tracks the total volume in each quadrant over time
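Steps 1 and 2 of that job might look like the sketch below — a sketch, assuming PagerDuty REST API v2 conventions (Token auth header, since/until window, limit/offset/more pagination) and the created_at / resolved_at / title fields named earlier. The downstream steps feed the resulting tuples into the normalization, metrics, and quadrant passes described above:

```python
import json
import urllib.request
from datetime import datetime

PAGERDUTY_URL = "https://api.pagerduty.com/incidents"

def fetch_incidents(api_key, since, until, limit=100):
    """Page through PagerDuty's list-incidents endpoint for the window."""
    incidents, offset = [], 0
    while True:
        query = f"?since={since}&until={until}&limit={limit}&offset={offset}"
        req = urllib.request.Request(
            PAGERDUTY_URL + query,
            headers={
                "Authorization": f"Token token={api_key}",
                "Accept": "application/vnd.pagerduty+json;version=2",
            },
        )
        page = json.load(urllib.request.urlopen(req))
        incidents += page["incidents"]
        if not page.get("more"):
            return incidents
        offset += limit

def to_events(raw_incidents):
    """Reduce raw API incidents to (title, created_at, resolved_at) tuples;
    unresolved incidents are skipped since resolution time is undefined."""
    events = []
    for inc in raw_incidents:
        if not inc.get("resolved_at"):
            continue
        events.append((
            inc["title"],
            datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00")),
            datetime.fromisoformat(inc["resolved_at"].replace("Z", "+00:00")),
        ))
    return events
```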

The target metric isn’t incident count. Incident count drops for bad reasons too — people disabling alerts, outages going undetected. The target is the proportion of volume in the bottom-right quadrant — the fraction of alerts that represent genuine, actionable incidents. When that fraction rises, on-call is seeing more signal. When it falls, something needs attention regardless of whether the total volume is up or down.

Closing

Alert noise is usually discussed as a cultural problem — “the team is tired” — or a threshold-tuning problem — “bump it from 80% to 90%.” Both miss that noise has structure. The structure becomes visible only when you stop treating alerts as discrete events and start treating them as templates with temporal behavior.

Normalize first. Measure on three axes. Diagnose with the quadrant. The fix follows from the diagnosis.