It’s 2:13 AM. Your phone lights up.
“CPU HIGH on node ip-10-…”
You squint. You open the dashboard. CPU is 92%. Then 65%. Then 88%.
You wait. Nothing breaks. You go back to sleep.
Ten minutes later: another page.
Then another.
By morning, you’ve “handled” 14 alerts and fixed exactly zero real problems.
That’s alert fatigue: noise masquerading as safety.
This post gives you a practical, step-by-step way to fix it using four levers:
- Actionable alerts (every page leads to a clear action)
- Routing (the right people get the right signal)
- Dedup (one incident = one page, not 37)
- Suppression (planned work and known dependencies don’t melt your on-call)
If you apply what’s below, you’ll cut noise fast without missing real outages.

The core idea: pages are expensive
Treat paging like a production write:
- it must be intentional
- it must be owned
- it must be actionable
- it must be rate-limited
If your system pages for “interesting information” instead of “urgent action,” it will burn your team out.
So here’s the rule that changes everything:
If nobody can take a meaningful action within 15 minutes, it should not page.
It can still alert (Slack/email/dashboard). But it should not wake humans.
Step 0: Diagnose your alert fatigue in 30 minutes
Before changing anything, answer these four questions (even roughly):
- How many pages per on-call shift?
- Top 10 noisiest alerts (by count)?
- % pages that didn’t require action? (false positives / “FYI pages”)
- % pages that were routed to the wrong team?
Now you have targets.
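If your paging tool can export history, these four numbers take minutes to compute. A minimal Python sketch, assuming a hypothetical record shape with `name`, `actionable`, and `correctly_routed` fields (adapt to whatever your export actually contains):

```python
from collections import Counter

def alert_fatigue_stats(pages, shifts):
    """Summarize paging noise from a list of page records.

    Each record is a dict with hypothetical fields:
    name, actionable (bool), correctly_routed (bool).
    """
    total = len(pages)
    return {
        "pages_per_shift": total / shifts,
        "top_noisy": Counter(p["name"] for p in pages).most_common(10),
        "pct_no_action": 100 * sum(not p["actionable"] for p in pages) / total,
        "pct_misrouted": 100 * sum(not p["correctly_routed"] for p in pages) / total,
    }

pages = [
    {"name": "CPUHigh", "actionable": False, "correctly_routed": True},
    {"name": "CPUHigh", "actionable": False, "correctly_routed": False},
    {"name": "Checkout5xx", "actionable": True, "correctly_routed": True},
    {"name": "CPUHigh", "actionable": False, "correctly_routed": True},
]
stats = alert_fatigue_stats(pages, shifts=2)
# "CPUHigh" dominates the top-10 list; most pages needed no action.
```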
Quick win: most teams fix 60–80% of pain by cleaning the top 10 noisy alerts first.
Lever 1: Actionable alerts (the “should I wake up?” test)
An actionable alert is not “something changed.”
It’s “something is broken (or about to break) and here’s what to do.”
The Actionable Alert Checklist
An alert is actionable if it answers these five questions in the alert message:
- What is impacted? (service/user impact, not hostnames)
- How bad is it? (severity + what’s failing)
- What changed? (deploy, dependency, traffic spike if known)
- What should I do first? (one concrete first step)
- Where do I look next? (dashboard/runbook reference — no links needed, just names)
If your alert can’t answer these, it’s usually noise.
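You can even lint this checklist mechanically before an alert is allowed to page. A sketch with illustrative field names (this is not any real tool's schema):

```python
# Hypothetical field names mapping to the five checklist questions.
REQUIRED_FIELDS = ("impact", "severity", "change_context", "first_action", "next_check")

def page_worthy(alert: dict) -> list:
    """Return the checklist fields this alert still fails to answer."""
    missing = [f for f in REQUIRED_FIELDS if not alert.get(f)]
    # "Investigate" is a process, not an action: reject it as a first step.
    if alert.get("first_action", "").strip().lower() == "investigate":
        missing.append("first_action (too vague)")
    return missing

alert = {
    "impact": "12% of checkouts failing",
    "severity": "critical",
    "change_context": "deploy at 01:58",
    "first_action": "investigate",
    "next_check": "Checkout Overview dashboard",
}
# page_worthy(alert) flags the vague first action.
```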
Symptom vs cause: stop paging on “temperature,” page on “fire”
Many alerts page on symptoms that fluctuate naturally:
- CPU > 80%
- memory > 75%
- pod restarts
- disk usage high
These are signals, but not always incidents.
Upgrade “CPU high” into something actionable
Instead of paging on CPU:
Page when CPU saturation causes user impact, like:
- latency is high
- error rate is high
- queue lag is rising and will breach an SLA
- pods are throttled and requests are failing
Example transformation
Bad page:
- “CPU > 85% on node X”
Good page:
- “Checkout API p95 latency > 1.2s for 10m AND CPU throttling > 20% on pods (likely saturation). First action: scale replicas or raise CPU requests.”
Notice how the “good page” points to impact + likely cause + first action.
Use “multi-window” to kill flappy alerts
A classic noise source is a metric that briefly crosses a threshold.
Fix it with two windows:
- Fast window catches sudden drops (2–5 minutes)
- Slow window confirms it’s real (10–30 minutes)
Example logic (plain English):
Alert if error rate is high now and still high over the last 15 minutes.
This stops “one bad minute” from paging someone.
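In code, the two-window check is a few lines. A sketch assuming one error-rate sample per minute, most recent last:

```python
def should_page(error_rates, fast_min=5, slow_min=15, threshold=0.05):
    """Page only if the error rate is high now (fast window)
    AND has stayed high on average over the slow window."""
    if len(error_rates) < slow_min:
        return False  # not enough history to confirm it's real
    fast = sum(error_rates[-fast_min:]) / fast_min
    slow = sum(error_rates[-slow_min:]) / slow_min
    return fast > threshold and slow > threshold

# One bad minute in an otherwise healthy series does not page:
spike = [0.01] * 14 + [0.30]
# should_page(spike) → False (the slow-window average stays low)
sustained = [0.08] * 15
# should_page(sustained) → True
```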
Use “burn-rate” thinking (beginner-friendly version)
Instead of asking “is latency above X?”, ask:
“Are we burning reliability budget too fast?”
A simple version:
- If you’re slightly degraded, notify (ticket/Slack)
- If you’re rapidly degrading, page (humans now)
Example (API availability):
- Warning: error rate > 1% for 15m (investigate in business hours)
- Critical: error rate > 5% for 5m (page now)
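The same two-tier logic as a sketch, with thresholds taken from the example above:

```python
def alert_level(error_rate: float, minutes_sustained: int) -> str:
    """Two-tier burn-rate classification (illustrative thresholds)."""
    if error_rate > 0.05 and minutes_sustained >= 5:
        return "page"    # burning budget fast: wake a human now
    if error_rate > 0.01 and minutes_sustained >= 15:
        return "notify"  # slow burn: ticket/Slack, business hours
    return "ok"
```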
Make every page carry a “first action”
Here are real “first actions” that prevent panic:
- “Check last deploy in service X; rollback if started within 30 minutes.”
- “If queue lag rising and workers at max, scale worker deployment to N replicas.”
- “If DB connections > 90% and app errors increase, enable connection pool limit and scale DB read replica.”
If the first action is “investigate,” your alert is incomplete. Investigation is a process, not an action.
The two-tier model: notify vs page
Most systems should have at least these two levels:
- Notify: something unusual; no immediate human action required
- Page: user impact now, or will be within the next 15–30 minutes
If you only have one level (paging), everything becomes a fire drill.
Lever 2: Routing (right alert → right team → right channel)
Routing is where good alerts go to die.
Even a perfect alert becomes painful if:
- it goes to everyone
- it goes to the wrong team
- it goes to the wrong channel (page vs Slack)
- it has no owner
The routing rule
Every alert must have an owner label (team/service), and the routing must follow that label.
If you can’t assign ownership, you can’t operate the system.
Create a tiny routing taxonomy (keep it boring)
Use labels/tags like these:
- service: checkout-api
- team: payments
- env: prod
- severity: critical / warning / info
- component: database / cache / queue / ingress
- runbook: name-only reference
This makes routing deterministic.
Build a simple “channel strategy”
A practical default:
- Critical (page): user impact or imminent outage
- Warning (Slack/email): requires attention in hours, not minutes
- Info (dashboard): trends, capacity hints, FYIs
If your warning alerts page, you’ve basically deleted “warning.”
Add escalation only where it truly helps
Escalation is powerful—but dangerous if misused.
A good escalation chain looks like:
- On-call for owning service team
- If not acknowledged in 10 minutes → backup on-call
- If still not acknowledged → incident commander / platform on-call
Bad escalation:
- “Page everyone until someone responds.”
Real routing example: microservices + platform
Service-level alerts (e.g., error rate, latency, queue lag) route to the service team.
Platform-level alerts (e.g., cluster capacity exhausted, shared ingress failing) route to platform/SRE.
The magic is the label:
team=platform vs team=payments
Then routing is automatic.
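A sketch of what label-driven routing boils down to (the channel names are made up; the point is that labels decide everything, and unowned alerts never page):

```python
# (severity, env) → destination. Labels drive routing, not alert names.
ROUTES = {
    ("critical", "prod"): "page:oncall",
    ("warning", "prod"): "slack:team-channel",
}

def route(alert: dict) -> str:
    """Deterministic routing: owner label required, non-prod never pages."""
    team = alert.get("team")
    if not team:
        return "queue:triage"  # unowned alerts go to triage, not a pager
    dest = ROUTES.get((alert.get("severity"), alert.get("env")), "dashboard:info")
    return f"{dest}:{team}" if dest.startswith(("page", "slack")) else dest
```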
Lever 3: Dedup (one incident, one page)
Nothing destroys on-call faster than a cascade:
- 1 database slows down
- 12 services fail health checks
- 12 services page
- 48 pods restart
- 48 “pod restart” alerts
- Someone gets 60 pages for one root cause
Dedup is how you turn chaos into one manageable incident.
The “incident key” concept
Dedup works when alerts that belong to the same incident share a stable key, such as:
- service + env + symptom
- dependency + env + region
- cluster + env + capacity
Example:
- All alerts related to “checkout-api errors in prod” share:
dedup_key=checkout-api|prod|errors
Then your paging tool treats them as updates to the same incident, not new incidents.
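Building the key is trivial; the discipline is in what you leave out. A sketch:

```python
def dedup_key(alert: dict) -> str:
    """Stable incident key: service|env|symptom.

    Deliberately excludes host/pod IDs: per-instance fields
    explode cardinality and defeat dedup.
    """
    return "|".join([alert["service"], alert["env"], alert["symptom"]])

a = {"service": "checkout-api", "env": "prod", "symptom": "errors", "pod": "checkout-7f9c"}
b = {"service": "checkout-api", "env": "prod", "symptom": "errors", "pod": "checkout-2b1d"}
# Both pods map to the same incident:
# dedup_key(a) == dedup_key(b) == "checkout-api|prod|errors"
```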
Grouping rules (so 50 alerts become 1 notification)
Common grouping choices:
- Group by service, env, severity
- Optionally by region if multi-region
Avoid grouping by host/pod/container IDs. Those explode cardinality and destroy dedup.
Time-based dedup (the flapping fixer)
Even with grouping, you need a “dedup window”:
- If the same alert fires repeatedly within, say, 15 minutes, don’t page again—append as an update.
This eliminates “alert storms” from transient instability.
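A dedup window is just "remember when you last paged this key." A minimal sketch, tracking time in minutes:

```python
class Deduper:
    """Suppress re-pages for the same incident key within a window."""

    def __init__(self, window_minutes=15):
        self.window = window_minutes
        self.last_paged = {}  # incident key → minute of last page

    def handle(self, key: str, now_minute: int) -> str:
        last = self.last_paged.get(key)
        if last is not None and now_minute - last < self.window:
            return "update"  # append to the open incident, don't page
        self.last_paged[key] = now_minute
        return "page"

d = Deduper(window_minutes=15)
# A flapping alert at minutes 0, 3, 9, 20 produces one page,
# two updates, then a fresh page once the window has passed.
```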
Prefer “root cause alerting” where possible
Some systems can identify dependency failures (DB down, DNS failing, ingress broken).
Even if you can’t do true correlation, you can still implement a simple rule:
If dependency alert is firing, suppress downstream pages and send them as informational updates.
That’s the bridge between dedup and suppression.
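That rule fits in a dozen lines. A sketch with an assumed mapping from downstream symptoms to their root-cause dependency alerts:

```python
# Downstream symptom → the dependency alert that explains it (assumed mapping).
DEPENDS_ON = {
    "Checkout5xxHigh": "DatabaseDown",
    "SessionErrors": "RedisDown",
}

def delivery(alert_name: str, firing: set) -> str:
    """If the root-cause dependency alert is already firing,
    downgrade the downstream alert to an informational update."""
    root = DEPENDS_ON.get(alert_name)
    if root and root in firing:
        return f"notify (secondary to {root})"
    return "page"
```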
Lever 4: Suppression (mute the noise without hiding reality)
Suppression is not “ignore problems.”
It’s “don’t wake humans for things we already understand.”
Two safe categories:
- Planned maintenance
- Known noisy conditions with a better primary signal
Suppression type 1: Maintenance windows (the obvious one)
If you deploy, migrate, or run load tests, you should:
- suppress only the expected alerts
- for a limited time window
- scoped to env/service
Example:
- During a DB migration for payments-db in staging, suppress:
  - replica lag warnings
  - connection churn warnings
But keep critical “data corruption” or “writes failing” alerts active.
The goal isn’t silence—it’s precision.
Suppression type 2: Dependency-based suppression (the life saver)
When a core dependency fails, downstream alerts become redundant.
Example dependency chain:
- Redis is down → sessions fail → 10 services error → 10 pages
Better:
- Redis down pages platform/owning team
- Downstream services send non-paging alerts tagged “secondary to redis”
This preserves awareness without turning it into a storm.
Suppression type 3: Auto-suppression for known transient events
Some events are noisy but expected:
- node rotation
- cluster upgrades
- autoscaler scale-in/scale-out
- rolling deploys
A safe approach:
- suppress low-severity alerts during the event
- keep high-severity “impact alerts” active
Key rule: never suppress user-impact alerts just because you’re deploying.
Putting it all together: a step-by-step implementation plan
Week 1: Stop the bleeding (fast wins)
- Pick top 10 noisiest alerts
- For each, do one of:
- downgrade to notify (no page)
- add duration (must persist 10m)
- add multi-window check
- add better signal (latency/errors instead of CPU)
- Add required fields to page messages:
- impact, severity, owner, first action
Result: immediate reduction in pages.
Week 2: Fix routing and ownership
- Add labels/tags: service, team, env, severity
- Create routing rules:
  - critical → on-call paging for the owning team
  - warning → team channel
- Add an “unowned alert” rule:
  - if no team label, route to a “triage” queue (not a page)
Result: fewer misroutes, less frustration, faster response.
Week 3: Dedup and incident keys
- Decide grouping: by service + env + severity
- Add a dedup window (e.g., 15m)
- Establish incident keys for common incidents:
  - db|prod|connections
  - ingress|prod|5xx
  - checkout-api|prod|errors
Result: alert storms collapse into single incidents.
Week 4: Suppression rules (the safety layer)
- Add maintenance windows (scoped, time-limited)
- Add dependency suppression:
- “If DB down is firing, don’t page all services for DB-related errors”
- Implement “planned event” silences for rotations/upgrades (low severity only)
Result: fewer pages during planned work, less chaos during incidents.
Real examples (copy these patterns)
Example 1: Turn “CPU high” into “saturation impacting users”
Notify (warning):
- “CPU > 85% for 15m on checkout-api pods (watch saturation)”
Page (critical):
- “Checkout API p95 latency > 1.2s for 10m AND CPU throttling > 20% — likely CPU saturation. First action: scale replicas by +50% or raise CPU requests.”
Example 2: Queue lag (predict outage before customers feel it)
Notify:
- “Queue lag rising, ETA to breach SLA: 2 hours”
Page:
- “Queue lag will breach SLA in 20 minutes at current rate. First action: scale workers to N, verify downstream dependency healthy.”
This is actionable because it ties lag to time-to-impact.
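The ETA math is simple linear extrapolation. A sketch (units are illustrative, e.g. messages of lag and messages per minute; the 30m/180m cutoffs are assumptions you'd tune):

```python
def minutes_to_breach(current_lag: float, growth_per_min: float, sla_limit: float):
    """Estimate minutes until queue lag breaches the SLA limit at the
    current growth rate. Returns None if lag is not growing."""
    if growth_per_min <= 0:
        return None
    return (sla_limit - current_lag) / growth_per_min

def lag_alert(current_lag, growth_per_min, sla_limit, page_under=30, notify_under=180):
    """Page on imminent breach, notify on a distant one, else stay quiet."""
    eta = minutes_to_breach(current_lag, growth_per_min, sla_limit)
    if eta is None or eta > notify_under:
        return "ok"
    return "page" if eta <= page_under else "notify"
```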
Example 3: Database connection exhaustion without storms
Page (DB owner team):
- “DB connections > 90% for 5m + app errors increasing. First action: enable connection pool cap and scale read replica / increase max connections safely.”
Suppress:
- downstream “service error spikes” pages tagged as secondary, route as notify.
Alert message template that keeps on-call sane
Use this format for every paging alert:
Title: [SEV-1] Checkout API: elevated 5xx in prod
Impact: “Users cannot complete purchases (estimated 12% of requests failing)”
Scope: “prod, region us-west, service checkout-api”
Signal: “5xx 6.2% for 5m; p95 latency 1.8s for 10m”
Likely causes: “DB connection saturation or upstream timeout”
First action: “Check DB connections; if >90% cap pool and scale read replica”
Next checks: “Look at ‘Checkout Overview’ dashboard; recent deploy at 01:58”
Even beginners can follow this.
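You can enforce the template in whatever generates your pages and refuse to send one with missing fields. A Python sketch using these field names (illustrative, not any tool's schema):

```python
TEMPLATE = """\
Title: [{sev}] {service}: {title}
Impact: {impact}
Scope: {scope}
Signal: {signal}
Likely causes: {causes}
First action: {first_action}
Next checks: {next_checks}"""

FIELDS = ("sev", "service", "title", "impact", "scope",
          "signal", "causes", "first_action", "next_checks")

def render_page(**fields) -> str:
    """Render a paging message; refuse to render an incomplete one."""
    missing = [k for k in FIELDS if not fields.get(k)]
    if missing:
        raise ValueError(f"alert not ready to page; missing: {missing}")
    return TEMPLATE.format(**fields)
```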
The “do not do this” list (common traps)
- Don’t page on single-host metrics unless that host is truly critical and irreplaceable.
- Don’t alert on averages when tail latency matters (p95/p99 are usually better).
- Don’t send pages without an owner. That’s how alerts become background noise.
- Don’t hide problems with suppression—suppress duplicates, not reality.
- Don’t create alerts with no clear action (“CPU is interesting!” is not an action).
How to keep alert fatigue from coming back
Alert fatigue returns when alerts don’t have a lifecycle.
Create a simple lifecycle:
- Proposed (new alert, notify-only)
- Validated (proved it catches real issues)
- Paging (only after validation + runbook)
- Reviewed (monthly: keep/tune/remove)
- Retired (if it never finds real incidents)
Monthly ritual (30 minutes):
- top 10 pages by count
- top 10 pages by “no action taken”
- top 5 misroutes
- retire or tune at least 3 alerts
This one habit prevents relapse.
Final takeaway
Alert fatigue isn’t solved by “fewer alerts.”
It’s solved by better alerts and better flow:
- Make alerts actionable
- Route them to the right owner
- Dedup them into one incident
- Suppress what’s expected or secondary
And when you do it right, something amazing happens:
Your phone rings less…
and when it rings, you trust it.
Alert fatigue fix (Prometheus + Alertmanager, Datadog, New Relic)
Actionable alerts • smart routing • dedup • suppression — with copy-ready examples
You don’t “get used to” alert fatigue. You adapt around it—and that’s when outages sneak through.
The fix isn’t “fewer alerts.” The fix is a better signal pipeline:
- Actionable alerts (every page has a clear next move)
- Routing (right team, right channel, right severity)
- Dedup (one incident = one page, not 40)
- Suppression (planned work + known dependency failures don’t spam humans)
Below is a practical, tool-by-tool guide for Prometheus + Alertmanager, Datadog, and New Relic.
The one rule that instantly reduces paging noise
If nobody can take a meaningful action within 15 minutes, it should not page.
It can still notify (Slack/email/dashboard). But it should not wake someone up.
Part A — Build alerts that humans can actually use (works for all tools)
1) The “Actionable Alert” checklist (use this on every page)
A paging alert is actionable only if it answers:
- Impact: what users are seeing (errors, latency, downtime)
- Scope: service + env + region (not hostnames)
- Trigger: what threshold and for how long
- Likely cause: one or two best guesses (not 12 possibilities)
- First action: one concrete thing to do now (scale, rollback, failover, drain, restart, etc.)
If “first action” is “investigate” → it’s not ready to page.
2) Stop paging on symptoms; page on saturation + impact
Symptoms like CPU, memory, restarts are “interesting,” but often not urgent.
Upgrade them into “page-worthy” only when they tie to impact.
Bad page: CPU > 85% on node
Good page: Checkout API p95 latency high AND CPU throttling high (saturation causing impact)
3) Use multi-window to kill flapping
- “Fast window” catches sudden spikes (2–5m)
- “Slow window” confirms it’s real (10–30m)
Example logic: page when it’s bad now and still bad over 15 minutes.
Part B — Prometheus + Alertmanager (the “engineering control panel”)
Prometheus decides when something is wrong.
Alertmanager decides who gets notified, how often, and what gets suppressed.
1) Actionable Prometheus alerts (copy-ready patterns)
Pattern 1: Latency impact (page)
- alert: CheckoutHighLatency
  expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))) > 1.2
  for: 10m
  labels:
    severity: critical
    team: payments
    service: checkout
    env: prod
  annotations:
    summary: "Checkout latency high (p95 > 1.2s for 10m)"
    impact: "Users may see slow checkout or timeouts"
    first_action: "Check recent deploy; if within 30m, consider rollback. If CPU throttling high, scale replicas +50%."
Pattern 2: “CPU high” turned into actionable “CPU throttling” (notify or page depending on impact)
Notify if throttling is high but errors are not.
Page only if throttling correlates with latency/errors.
2) Routing in Alertmanager (right team, right channel)
Use labels like team, service, env, severity. Then route on them.
Minimal route tree idea:
- severity=critical AND env=prod → page on-call for the owning team
- severity=warning → team chat channel
- env!=prod → never page (notify only)
Example skeleton:
route:
  receiver: default-notify
  group_by: ['alertname', 'service', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    - matchers:
        - env="prod"
        - severity="critical"
      receiver: pager-payments
    - matchers:
        - team="payments"
      receiver: notify-payments
    - matchers:
        - env!="prod"
      receiver: default-notify
3) Dedup in Alertmanager (grouping that prevents “40 pages per incident”)
The most important settings:
- group_by: controls “what becomes one incident”
- group_wait: wait a bit to collect related alerts
- group_interval: how often to send updates
- repeat_interval: how often to re-notify if still firing
Practical defaults
- group_by: [alertname, service, env]
- group_wait: 30s
- group_interval: 5m
- repeat_interval: 2h (don’t repage every 5 minutes)
If you group by pod or instance, you’ll recreate alert storms.
4) Suppression (silences + inhibitions)
Silences (planned maintenance)
Use silences for:
- deployments
- DB migrations
- load tests
- cluster upgrades
Scope them tightly:
env=staging, service=checkout for 1 hour
Not “mute everything everywhere.”
Inhibit rules (dependency-down stops downstream spam)
This is the big one.
Example concept:
If DatabaseDown is firing in prod, then suppress downstream alerts like Checkout5xxHigh that are clearly caused by DB down.
inhibit_rules:
  - source_matchers:
      - alertname="DatabaseDown"
      - env="prod"
    target_matchers:
      - env="prod"
      - alertname=~"Checkout5xxHigh|CheckoutLatencyHigh"
    equal: ['env']
Result: one clean “DB down” page + downstream alerts become non-paging updates (or get inhibited entirely).
Part C — Datadog (fast wins with tagging + monitor hygiene)
Datadog’s power is speed: you can reduce noise quickly if you standardize tags and monitor behavior.
1) Make Datadog alerts actionable (use message templates)
Put these in every monitor message:
- Impact (what users see)
- Scope (service/env)
- First action
- “If this started after deploy, rollback” hint
Example message format (paste into monitors):
- Impact: Checkout errors elevated for users
- Scope: service:checkout env:prod region:us-west
- First action: Check last deploy; if within 30m rollback. If no deploy, check DB connections and upstream timeouts.
- Owner: @team-payments-oncall
2) Routing in Datadog (tags decide who gets notified)
Key move: standardize tags:
- service:checkout
- team:payments
- env:prod
Then use:
- Notification handles based on tag filters (team channels / on-call)
- Separate policies for env:prod vs non-prod
Practical routing strategy
- env:prod AND severity:critical → page on-call
- env:prod AND severity:warning → team channel (no page)
- env:staging|dev → notify only
3) Dedup in Datadog (stop “multi-alert explosions”)
Datadog monitors can behave like:
- multi-alert (one per host/pod/container)
- single alert (aggregate)
Rule:
- Page on service-level aggregate
- Keep per-host/per-pod alerts as notify-only (or only for platform team)
Example
- Page: sum:trace.http.request.errors{service:checkout,env:prod}
- Notify: avg:system.cpu.user{host:*} grouped per host
Also consider:
- “Renotify” settings: don’t repage too often
- Evaluation delay: avoids transient spikes right after deploys/autoscaling
4) Suppression in Datadog (maintenance without blindness)
Use muting/downtime/scheduling for:
- deploy windows (service-scoped)
- load tests
- planned failovers
Important guardrail:
Never mute user-impact monitors globally during deploy.
Mute only the monitors that are expected to flap (like “pod restarts”, “replica count changed”), not “checkout errors high”.
5) Correlation pattern in Datadog (dependency-down reduces cascades)
Create a primary dependency monitor (DB down, Redis down).
Then adjust downstream monitors to:
- notify only when dependency monitor is alerting, or
- use a composite / condition logic so pages happen only when the primary isn’t already firing
Result: one clear page instead of 12.
Part D — New Relic (issues-based alerting + clean workflows)
New Relic is strongest when you embrace its “issues/incidents” flow: group many signals into fewer incidents.
1) Actionable alerts in New Relic (NRQL + clear thresholds)
Example: error-rate paging condition
- Warning: errors elevated for 15 minutes
- Critical: errors high for 5 minutes (page)
NRQL idea (conceptual):
- Measure error % for a service and alert on sustained thresholds.
Message should include:
- Impact (what fails)
- Scope (entity/service/env)
- First action (rollback/scale/check dependency)
2) Routing in New Relic (policies + workflows)
A clean model:
- Policies per team or domain (Payments, Platform, Data)
- Workflows route by attributes like:
  - team=payments
  - service=checkout
  - environment=production
  - severity
Best practice: make routing deterministic via consistent attributes/tags.
3) Dedup in New Relic (incident preference + issue grouping)
New Relic can reduce storms if you configure:
- incident/violation grouping so multiple signals become a single issue
- conditions aligned to service/entity rather than per-host noise
Practical approach
- Page on “service golden signals”:
- latency, error rate, saturation, traffic anomalies
- Keep per-host metrics as warning/notify unless platform needs them
4) Suppression in New Relic (muting rules + maintenance windows)
Use muting for:
- planned deploy windows
- scheduled load tests
- known noisy maintenance events
Guardrail:
Mute the expected noisy conditions (restarts, expected transient health checks), not real user-impact signals.
5) Dependency suppression pattern in New Relic
Create a primary condition for dependency health (DB/Redis/Ingress).
Then:
- route downstream symptoms to a non-paging channel when dependency is down, or
- reduce downstream paging thresholds during known dependency incidents
The goal: one high-quality incident representing the root cause.
The “copy-this” operating model (works across all three tools)
Severity model (simple and effective)
- SEV-1 (Page): users impacted now or within 15–30 minutes
- SEV-2 (Notify): needs attention soon, not immediate waking
- SEV-3 (Info): trends / capacity / FYI (dashboards)
Golden signals that deserve paging (service-level)
- Error rate
- Latency (p95/p99)
- Saturation (throttling, queue lag, pool exhaustion)
- Availability (health endpoints only if they reflect real user failure)
Things that usually should not page by default
- CPU high (without impact)
- memory high (without impact)
- pod restarts (unless correlated with errors/latency)
- node disk usage (notify; page only when it threatens outage soon)
A 14-day step-by-step rollout plan (do this and you’ll feel the difference)
Days 1–3: Stop the worst noise
- Identify top 10 paging alerts by volume
- For each: downgrade to notify, add a for: duration, or tie it to impact
- Add a “first action” to every page
Days 4–7: Fix routing and ownership
- Enforce labels/tags: team/service/env/severity
- Route prod critical pages to on-call; everything else to notify channels
- Create a triage path for unowned alerts (no paging)
Days 8–10: Implement dedup
- Group by service+env (not pod/host)
- Add repeat/renotify controls
- Ensure “one incident = one page” behavior
Days 11–14: Add suppression safely
- Maintenance windows (tight scope, short duration)
- Dependency-based inhibition/suppression (root cause pages, downstream updates)
- Review after one on-call cycle and adjust
The litmus test: “Would I be glad to receive this at 3 AM?”
Take any paging alert and ask:
- Does it clearly state impact?
- Is the owner obvious?
- Is the first action obvious?
- Will it page again 20 times if it keeps firing?
If the answer isn’t “yes,” tune it until it is.