It’s 2:13 AM. Your phone lights up.
“CPU HIGH on node ip-10-…”
You squint. You open the dashboard. CPU is 92%. Then 65%. Then 88%.
You wait. Nothing breaks. You go back to sleep.
Ten minutes later: another page.
Then another.
By morning, you’ve “handled” 14 alerts and fixed exactly zero real problems.
That’s alert fatigue: noise masquerading as safety.
This post gives you a practical, step-by-step way to fix it using four levers:
- Actionable alerts (every page leads to a clear action)
- Routing (the right people get the right signal)
- Dedup (one incident = one page, not 37)
- Suppression (planned work and known dependencies don’t melt your on-call)
If you apply what’s below, you’ll cut noise fast without missing real outages.

The core idea: pages are expensive
Treat paging like a production write:
- it must be intentional
- it must be owned
- it must be actionable
- it must be rate-limited
If your system pages for “interesting information” instead of “urgent action,” it will burn your team out.
So here’s the rule that changes everything:
If nobody can take a meaningful action within 15 minutes, it should not page.
It can still alert (Slack/email/dashboard). But it should not wake humans.
Step 0: Diagnose your alert fatigue in 30 minutes
Before changing anything, answer these four questions (even roughly):
- How many pages per on-call shift?
- Top 10 noisiest alerts (by count)?
- % pages that didn’t require action? (false positives / “FYI pages”)
- % pages that were routed to the wrong team?
Now you have targets.
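If your paging tool can export history, these four numbers take minutes to compute. A minimal Python sketch, assuming a hypothetical record shape with `name`, `actionable`, and `correctly_routed` fields (adapt to whatever your export actually contains):

```python
from collections import Counter

def alert_fatigue_stats(pages, shifts):
    """Summarize paging noise from a list of page records.

    Each record is a dict with hypothetical fields:
    name, actionable (bool), correctly_routed (bool).
    """
    total = len(pages)
    return {
        "pages_per_shift": total / shifts,
        "top_noisy": Counter(p["name"] for p in pages).most_common(10),
        "pct_no_action": 100 * sum(not p["actionable"] for p in pages) / total,
        "pct_misrouted": 100 * sum(not p["correctly_routed"] for p in pages) / total,
    }

pages = [
    {"name": "CPUHigh", "actionable": False, "correctly_routed": True},
    {"name": "CPUHigh", "actionable": False, "correctly_routed": False},
    {"name": "Checkout5xx", "actionable": True, "correctly_routed": True},
    {"name": "CPUHigh", "actionable": False, "correctly_routed": True},
]
stats = alert_fatigue_stats(pages, shifts=2)
# "CPUHigh" dominates the top-10 list; most pages needed no action.
```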
Quick win: most teams fix 60–80% of pain by cleaning the top 10 noisy alerts first.
Lever 1: Actionable alerts (the “should I wake up?” test)
An actionable alert is not “something changed.”
It’s “something is broken (or about to break) and here’s what to do.”
The Actionable Alert Checklist
An alert is actionable if it answers these five questions in the alert message:
- What is impacted? (service/user impact, not hostnames)
- How bad is it? (severity + what’s failing)
- What changed? (deploy, dependency, traffic spike if known)
- What should I do first? (one concrete first step)
- Where do I look next? (dashboard/runbook reference — no links needed, just names)
If your alert can’t answer these, it’s usually noise.
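You can even lint this checklist mechanically before an alert is allowed to page. A sketch with illustrative field names (this is not any real tool's schema):

```python
# Hypothetical field names mapping to the five checklist questions.
REQUIRED_FIELDS = ("impact", "severity", "change_context", "first_action", "next_check")

def page_worthy(alert: dict) -> list:
    """Return the checklist fields this alert still fails to answer."""
    missing = [f for f in REQUIRED_FIELDS if not alert.get(f)]
    # "Investigate" is a process, not an action: reject it as a first step.
    if alert.get("first_action", "").strip().lower() == "investigate":
        missing.append("first_action (too vague)")
    return missing

alert = {
    "impact": "12% of checkouts failing",
    "severity": "critical",
    "change_context": "deploy at 01:58",
    "first_action": "investigate",
    "next_check": "Checkout Overview dashboard",
}
# page_worthy(alert) flags the vague first action.
```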
Symptom vs cause: stop paging on “temperature,” page on “fire”
Many alerts page on symptoms that fluctuate naturally:
- CPU > 80%
- memory > 75%
- pod restarts
- disk usage high
These are signals, but not always incidents.
Upgrade “CPU high” into something actionable
Instead of paging on CPU:
Page when CPU saturation causes user impact, like:
- latency is high
- error rate is high
- queue lag is rising and will breach an SLA
- pods are throttled and requests are failing
Example transformation
Bad page:
- “CPU > 85% on node X”
Good page:
- “Checkout API p95 latency > 1.2s for 10m AND CPU throttling > 20% on pods (likely saturation). First action: scale replicas or raise CPU requests.”
Notice how the “good page” points to impact + likely cause + first action.
Use “multi-window” to kill flappy alerts
A classic noise source is a metric that briefly crosses a threshold.
Fix it with two windows:
- Fast window catches sudden drops (2–5 minutes)
- Slow window confirms it’s real (10–30 minutes)
Example logic (plain English):
Alert if error rate is high now and still high over the last 15 minutes.
This stops “one bad minute” from paging someone.
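In code, the two-window check is a few lines. A sketch assuming one error-rate sample per minute, most recent last:

```python
def should_page(error_rates, fast_min=5, slow_min=15, threshold=0.05):
    """Page only if the error rate is high now (fast window)
    AND has stayed high on average over the slow window."""
    if len(error_rates) < slow_min:
        return False  # not enough history to confirm it's real
    fast = sum(error_rates[-fast_min:]) / fast_min
    slow = sum(error_rates[-slow_min:]) / slow_min
    return fast > threshold and slow > threshold

# One bad minute in an otherwise healthy series does not page:
spike = [0.01] * 14 + [0.30]
# should_page(spike) → False (the slow-window average stays low)
sustained = [0.08] * 15
# should_page(sustained) → True
```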
Use “burn-rate” thinking (beginner-friendly version)
Instead of asking “is latency above X?”, ask:
“Are we burning reliability budget too fast?”
A simple version:
- If you’re slightly degraded, notify (ticket/Slack)
- If you’re rapidly degrading, page (humans now)
Example (API availability):
- Warning: error rate > 1% for 15m (investigate in business hours)
- Critical: error rate > 5% for 5m (page now)
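The same two-tier logic as a sketch, with thresholds taken from the example above:

```python
def alert_level(error_rate: float, minutes_sustained: int) -> str:
    """Two-tier burn-rate classification (illustrative thresholds)."""
    if error_rate > 0.05 and minutes_sustained >= 5:
        return "page"    # burning budget fast: wake a human now
    if error_rate > 0.01 and minutes_sustained >= 15:
        return "notify"  # slow burn: ticket/Slack, business hours
    return "ok"
```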
Make every page carry a “first action”
Here are real “first actions” that prevent panic:
- “Check last deploy in service X; rollback if started within 30 minutes.”
- “If queue lag rising and workers at max, scale worker deployment to N replicas.”
- “If DB connections > 90% and app errors increase, enable connection pool limit and scale DB read replica.”
If the first action is “investigate,” your alert is incomplete. Investigation is a process, not an action.
The two-tier model: notify vs page
Most systems should have at least these two levels:
- Notify: something unusual; no immediate human action required
- Page: user impact now, or will be within the next 15–30 minutes
If you only have one level (paging), everything becomes a fire drill.
Lever 2: Routing (right alert → right team → right channel)
Routing is where good alerts go to die.
Even a perfect alert becomes painful if:
- it goes to everyone
- it goes to the wrong team
- it goes to the wrong channel (page vs Slack)
- it has no owner
The routing rule
Every alert must have an owner label (team/service), and the routing must follow that label.
If you can’t assign ownership, you can’t operate the system.
Create a tiny routing taxonomy (keep it boring)
Use labels/tags like these:
- service: checkout-api
- team: payments
- env: prod
- severity: critical / warning / info
- component: database / cache / queue / ingress
- runbook: name-only reference
This makes routing deterministic.
Build a simple “channel strategy”
A practical default:
- Critical (page): user impact or imminent outage
- Warning (Slack/email): requires attention in hours, not minutes
- Info (dashboard): trends, capacity hints, FYIs
If your warning alerts page, you’ve basically deleted “warning.”
Add escalation only where it truly helps
Escalation is powerful—but dangerous if misused.
A good escalation chain looks like:
- On-call for owning service team
- If not acknowledged in 10 minutes → backup on-call
- If still not acknowledged → incident commander / platform on-call
Bad escalation:
- “Page everyone until someone responds.”
Real routing example: microservices + platform
Service-level alerts (e.g., error rate, latency, queue lag) route to the service team.
Platform-level alerts (e.g., cluster capacity exhausted, shared ingress failing) route to platform/SRE.
The magic is the label:
team=platform vs team=payments
Then routing is automatic.
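A sketch of what label-driven routing boils down to (the channel names are made up; the point is that labels decide everything, and unowned alerts never page):

```python
# (severity, env) → destination. Labels drive routing, not alert names.
ROUTES = {
    ("critical", "prod"): "page:oncall",
    ("warning", "prod"): "slack:team-channel",
}

def route(alert: dict) -> str:
    """Deterministic routing: owner label required, non-prod never pages."""
    team = alert.get("team")
    if not team:
        return "queue:triage"  # unowned alerts go to triage, not a pager
    dest = ROUTES.get((alert.get("severity"), alert.get("env")), "dashboard:info")
    return f"{dest}:{team}" if dest.startswith(("page", "slack")) else dest
```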
Lever 3: Dedup (one incident, one page)
Nothing destroys on-call faster than a cascade:
- 1 database slows down
- 12 services fail health checks
- 12 services page
- 48 pods restart
- 48 “pod restart” alerts
- Someone gets 60 pages for one root cause
Dedup is how you turn chaos into one manageable incident.
The “incident key” concept
Dedup works when alerts that belong to the same incident share a stable key, such as:
- service + env + symptom
- dependency + env + region
- cluster + env + capacity
Example:
- All alerts related to “checkout-api errors in prod” share:
dedup_key=checkout-api|prod|errors
Then your paging tool treats them as updates to the same incident, not new incidents.
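Building the key is trivial; the discipline is in what you leave out. A sketch:

```python
def dedup_key(alert: dict) -> str:
    """Stable incident key: service|env|symptom.

    Deliberately excludes host/pod IDs: per-instance fields
    explode cardinality and defeat dedup.
    """
    return "|".join([alert["service"], alert["env"], alert["symptom"]])

a = {"service": "checkout-api", "env": "prod", "symptom": "errors", "pod": "checkout-7f9c"}
b = {"service": "checkout-api", "env": "prod", "symptom": "errors", "pod": "checkout-2b1d"}
# Both pods map to the same incident:
# dedup_key(a) == dedup_key(b) == "checkout-api|prod|errors"
```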
Grouping rules (so 50 alerts become 1 notification)
Common grouping choices:
- Group by service, env, severity
- Optionally by region if multi-region
Avoid grouping by host/pod/container IDs. Those explode cardinality and destroy dedup.
Time-based dedup (the flapping fixer)
Even with grouping, you need a “dedup window”:
- If the same alert fires repeatedly within, say, 15 minutes, don’t page again—append as an update.
This eliminates “alert storms” from transient instability.
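A dedup window is just "remember when you last paged this key." A minimal sketch, tracking time in minutes:

```python
class Deduper:
    """Suppress re-pages for the same incident key within a window."""

    def __init__(self, window_minutes=15):
        self.window = window_minutes
        self.last_paged = {}  # incident key → minute of last page

    def handle(self, key: str, now_minute: int) -> str:
        last = self.last_paged.get(key)
        if last is not None and now_minute - last < self.window:
            return "update"  # append to the open incident, don't page
        self.last_paged[key] = now_minute
        return "page"

d = Deduper(window_minutes=15)
# A flapping alert at minutes 0, 3, 9, 20 produces one page,
# two updates, then a fresh page once the window has passed.
```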
Prefer “root cause alerting” where possible
Some systems can identify dependency failures (DB down, DNS failing, ingress broken).
Even if you can’t do true correlation, you can still implement a simple rule:
If dependency alert is firing, suppress downstream pages and send them as informational updates.
That’s the bridge between dedup and suppression.
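That rule fits in a dozen lines. A sketch with an assumed mapping from downstream symptoms to their root-cause dependency alerts:

```python
# Downstream symptom → the dependency alert that explains it (assumed mapping).
DEPENDS_ON = {
    "Checkout5xxHigh": "DatabaseDown",
    "SessionErrors": "RedisDown",
}

def delivery(alert_name: str, firing: set) -> str:
    """If the root-cause dependency alert is already firing,
    downgrade the downstream alert to an informational update."""
    root = DEPENDS_ON.get(alert_name)
    if root and root in firing:
        return f"notify (secondary to {root})"
    return "page"
```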
Lever 4: Suppression (mute the noise without hiding reality)
Suppression is not “ignore problems.”
It’s “don’t wake humans for things we already understand.”
Two safe categories:
- Planned maintenance
- Known noisy conditions with a better primary signal
Suppression type 1: Maintenance windows (the obvious one)
If you deploy, migrate, or run load tests, you should:
- suppress only the expected alerts
- for a limited time window
- scoped to env/service
Example:
- During a DB migration for payments-db in staging, suppress:
  - replica lag warnings
  - connection churn warnings
But keep critical “data corruption” or “writes failing” alerts active.
The goal isn’t silence—it’s precision.
Suppression type 2: Dependency-based suppression (the life saver)
When a core dependency fails, downstream alerts become redundant.
Example dependency chain:
- Redis is down → sessions fail → 10 services error → 10 pages
Better:
- Redis down pages platform/owning team
- Downstream services send non-paging alerts tagged “secondary to redis”
This preserves awareness without turning it into a storm.
Suppression type 3: Auto-suppression for known transient events
Some events are noisy but expected:
- node rotation
- cluster upgrades
- autoscaler scale-in/scale-out
- rolling deploys
A safe approach:
- suppress low-severity alerts during the event
- keep high-severity “impact alerts” active
Key rule: never suppress user-impact alerts just because you’re deploying.
Putting it all together: a step-by-step implementation plan
Week 1: Stop the bleeding (fast wins)
- Pick top 10 noisiest alerts
- For each, do one of:
- downgrade to notify (no page)
- add duration (must persist 10m)
- add multi-window check
- add better signal (latency/errors instead of CPU)
- Add required fields to page messages:
- impact, severity, owner, first action
Result: immediate reduction in pages.
Week 2: Fix routing and ownership
- Add labels/tags: service, team, env, severity
- Create routing rules:
  - critical → on-call paging for the owning team
  - warning → team channel
- Add an “unowned alert” rule:
  - if no team label, route to a “triage” queue (not a page)
Result: fewer misroutes, less frustration, faster response.
Week 3: Dedup and incident keys
- Decide grouping: by service + env + severity
- Add a dedup window (e.g., 15m)
- Establish incident keys for common incidents:
  - db|prod|connections
  - ingress|prod|5xx
  - checkout-api|prod|errors
Result: alert storms collapse into single incidents.
Week 4: Suppression rules (the safety layer)
- Add maintenance windows (scoped, time-limited)
- Add dependency suppression:
- “If DB down is firing, don’t page all services for DB-related errors”
- Implement “planned event” silences for rotations/upgrades (low severity only)
Result: fewer pages during planned work, less chaos during incidents.
Real examples (copy these patterns)
Example 1: Turn “CPU high” into “saturation impacting users”
Notify (warning):
- “CPU > 85% for 15m on checkout-api pods (watch saturation)”
Page (critical):
- “Checkout API p95 latency > 1.2s for 10m AND CPU throttling > 20% — likely CPU saturation. First action: scale replicas by +50% or raise CPU requests.”
Example 2: Queue lag (predict outage before customers feel it)
Notify:
- “Queue lag rising, ETA to breach SLA: 2 hours”
Page:
- “Queue lag will breach SLA in 20 minutes at current rate. First action: scale workers to N, verify downstream dependency healthy.”
This is actionable because it ties lag to time-to-impact.
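The ETA math is simple linear extrapolation. A sketch (units are illustrative, e.g. messages of lag and messages per minute; the 30m/180m cutoffs are assumptions you'd tune):

```python
def minutes_to_breach(current_lag: float, growth_per_min: float, sla_limit: float):
    """Estimate minutes until queue lag breaches the SLA limit at the
    current growth rate. Returns None if lag is not growing."""
    if growth_per_min <= 0:
        return None
    return (sla_limit - current_lag) / growth_per_min

def lag_alert(current_lag, growth_per_min, sla_limit, page_under=30, notify_under=180):
    """Page on imminent breach, notify on a distant one, else stay quiet."""
    eta = minutes_to_breach(current_lag, growth_per_min, sla_limit)
    if eta is None or eta > notify_under:
        return "ok"
    return "page" if eta <= page_under else "notify"
```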
Example 3: Database connection exhaustion without storms
Page (DB owner team):
- “DB connections > 90% for 5m + app errors increasing. First action: enable connection pool cap and scale read replica / increase max connections safely.”
Suppress:
- downstream “service error spikes” pages tagged as secondary, route as notify.
Alert message template that keeps on-call sane
Use this format for every paging alert:
Title: [SEV-1] Checkout API: elevated 5xx in prod
Impact: “Users cannot complete purchases (estimated 12% of requests failing)”
Scope: “prod, region us-west, service checkout-api”
Signal: “5xx 6.2% for 5m; p95 latency 1.8s for 10m”
Likely causes: “DB connection saturation or upstream timeout”
First action: “Check DB connections; if >90% cap pool and scale read replica”
Next checks: “Look at ‘Checkout Overview’ dashboard; recent deploy at 01:58”
Even beginners can follow this.
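You can enforce the template in whatever generates your pages and refuse to send one with missing fields. A Python sketch using these field names (illustrative, not any tool's schema):

```python
TEMPLATE = """\
Title: [{sev}] {service}: {title}
Impact: {impact}
Scope: {scope}
Signal: {signal}
Likely causes: {causes}
First action: {first_action}
Next checks: {next_checks}"""

FIELDS = ("sev", "service", "title", "impact", "scope",
          "signal", "causes", "first_action", "next_checks")

def render_page(**fields) -> str:
    """Render a paging message; refuse to render an incomplete one."""
    missing = [k for k in FIELDS if not fields.get(k)]
    if missing:
        raise ValueError(f"alert not ready to page; missing: {missing}")
    return TEMPLATE.format(**fields)
```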
The “do not do this” list (common traps)
- Don’t page on single-host metrics unless that host is truly critical and irreplaceable.
- Don’t alert on averages when tail latency matters (p95/p99 are usually better).
- Don’t send pages without an owner. That’s how alerts become background noise.
- Don’t hide problems with suppression—suppress duplicates, not reality.
- Don’t create alerts with no clear action (“CPU is interesting!” is not an action).
How to keep alert fatigue from coming back
Alert fatigue returns when alerts don’t have a lifecycle.
Create a simple lifecycle:
- Proposed (new alert, notify-only)
- Validated (proved it catches real issues)
- Paging (only after validation + runbook)
- Reviewed (monthly: keep/tune/remove)
- Retired (if it never finds real incidents)
Monthly ritual (30 minutes):
- top 10 pages by count
- top 10 pages by “no action taken”
- top 5 misroutes
- retire or tune at least 3 alerts
This one habit prevents relapse.
Final takeaway
Alert fatigue isn’t solved by “fewer alerts.”
It’s solved by better alerts and better flow:
- Make alerts actionable
- Route them to the right owner
- Dedup them into one incident
- Suppress what’s expected or secondary
And when you do it right, something amazing happens:
Your phone rings less…
and when it rings, you trust it.
Alert fatigue fix (Prometheus + Alertmanager, Datadog, New Relic)
Actionable alerts • smart routing • dedup • suppression — with copy-ready examples
You don’t “get used to” alert fatigue. You adapt around it—and that’s when outages sneak through.
The fix isn’t “fewer alerts.” The fix is a better signal pipeline:
- Actionable alerts (every page has a clear next move)
- Routing (right team, right channel, right severity)
- Dedup (one incident = one page, not 40)
- Suppression (planned work + known dependency failures don’t spam humans)
Below is a practical, tool-by-tool guide for Prometheus + Alertmanager, Datadog, and New Relic.
The one rule that instantly reduces paging noise
If nobody can take a meaningful action within 15 minutes, it should not page.
It can still notify (Slack/email/dashboard). But it should not wake someone up.
Part A — Build alerts that humans can actually use (works for all tools)
1) The “Actionable Alert” checklist (use this on every page)
A paging alert is actionable only if it answers:
- Impact: what users are seeing (errors, latency, downtime)
- Scope: service + env + region (not hostnames)
- Trigger: what threshold and for how long
- Likely cause: one or two best guesses (not 12 possibilities)
- First action: one concrete thing to do now (scale, rollback, failover, drain, restart, etc.)
If “first action” is “investigate” → it’s not ready to page.
2) Stop paging on symptoms; page on saturation + impact
Symptoms like CPU, memory, restarts are “interesting,” but often not urgent.
Upgrade them into “page-worthy” only when they tie to impact.
Bad page: CPU > 85% on node
Good page: Checkout API p95 latency high AND CPU throttling high (saturation causing impact)
3) Use multi-window to kill flapping
- “Fast window” catches sudden spikes (2–5m)
- “Slow window” confirms it’s real (10–30m)
Example logic: page when it’s bad now and still bad over 15 minutes.
Part B — Prometheus + Alertmanager (the “engineering control panel”)
Prometheus decides when something is wrong.
Alertmanager decides who gets notified, how often, and what gets suppressed.
1) Actionable Prometheus alerts (copy-ready patterns)
Pattern 1: Latency impact (page)
- alert: CheckoutHighLatency
  expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))) > 1.2
  for: 10m
  labels:
    severity: critical
    team: payments
    service: checkout
    env: prod
  annotations:
    summary: "Checkout latency high (p95 > 1.2s for 10m)"
    impact: "Users may see slow checkout or timeouts"
    first_action: "Check recent deploy; if within 30m, consider rollback. If CPU throttling high, scale replicas +50%."
Pattern 2: “CPU high” turned into actionable “CPU throttling” (notify or page depending on impact)
Notify if throttling is high but errors are not.
Page only if throttling correlates with latency/errors.
2) Routing in Alertmanager (right team, right channel)
Use labels like team, service, env, severity. Then route on them.
Minimal route tree idea:
- severity=critical AND env=prod → page on-call for the owning team
- severity=warning → team chat channel
- env!=prod → never page (notify only)
Example skeleton:
route:
  receiver: default-notify
  group_by: ['alertname', 'service', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    - matchers:
        - env="prod"
        - severity="critical"
      receiver: pager-payments
    - matchers:
        - team="payments"
      receiver: notify-payments
    - matchers:
        - env!="prod"
      receiver: default-notify
3) Dedup in Alertmanager (grouping that prevents “40 pages per incident”)
The most important settings:
- group_by: controls “what becomes one incident”
- group_wait: wait a bit to collect related alerts
- group_interval: how often to send updates
- repeat_interval: how often to re-notify if still firing
Practical defaults
- group_by: [alertname, service, env]
- group_wait: 30s
- group_interval: 5m
- repeat_interval: 2h (don’t repage every 5 minutes)
If you group by pod or instance, you’ll recreate alert storms.
4) Suppression (silences + inhibitions)
Silences (planned maintenance)
Use silences for:
- deployments
- DB migrations
- load tests
- cluster upgrades
Scope them tightly:
env=staging, service=checkout for 1 hour
Not “mute everything everywhere.”
Inhibit rules (dependency-down stops downstream spam)
This is the big one.
Example concept:
If DatabaseDown is firing in prod, then suppress downstream alerts like Checkout5xxHigh that are clearly caused by DB down.
inhibit_rules:
  - source_matchers:
      - alertname="DatabaseDown"
      - env="prod"
    target_matchers:
      - env="prod"
      - alertname=~"Checkout5xxHigh|CheckoutLatencyHigh"
    equal: ['env']
Result: one clean “DB down” page + downstream alerts become non-paging updates (or get inhibited entirely).
Part C — Datadog (fast wins with tagging + monitor hygiene)
Datadog’s power is speed: you can reduce noise quickly if you standardize tags and monitor behavior.
1) Make Datadog alerts actionable (use message templates)
Put these in every monitor message:
- Impact (what users see)
- Scope (service/env)
- First action
- “If this started after deploy, rollback” hint
Example message format (paste into monitors):
- Impact: Checkout errors elevated for users
- Scope: service:checkout env:prod region:us-west
- First action: Check last deploy; if within 30m rollback. If no deploy, check DB connections and upstream timeouts.
- Owner: @team-payments-oncall
2) Routing in Datadog (tags decide who gets notified)
Key move: standardize tags:
- service:checkout
- team:payments
- env:prod
Then use:
- Notification handles based on tag filters (team channels / on-call)
- Separate policies for env:prod vs non-prod
Practical routing strategy
- env:prod AND severity:critical → page on-call
- env:prod AND severity:warning → team channel (no page)
- env:staging|dev → notify only
3) Dedup in Datadog (stop “multi-alert explosions”)
Datadog monitors can behave like:
- multi-alert (one per host/pod/container)
- single alert (aggregate)
Rule:
- Page on service-level aggregate
- Keep per-host/per-pod alerts as notify-only (or only for platform team)
Example
- Page: sum:trace.http.request.errors{service:checkout,env:prod}
- Notify: avg:system.cpu.user{host:*} grouped per host
Also consider:
- “Renotify” settings: don’t repage too often
- Evaluation delay: avoids transient spikes right after deploys/autoscaling
4) Suppression in Datadog (maintenance without blindness)
Use muting/downtime/scheduling for:
- deploy windows (service-scoped)
- load tests
- planned failovers
Important guardrail:
Never mute user-impact monitors globally during deploy.
Mute only the monitors that are expected to flap (like “pod restarts”, “replica count changed”), not “checkout errors high”.
5) Correlation pattern in Datadog (dependency-down reduces cascades)
Create a primary dependency monitor (DB down, Redis down).
Then adjust downstream monitors to:
- notify only when dependency monitor is alerting, or
- use a composite / condition logic so pages happen only when the primary isn’t already firing
Result: one clear page instead of 12.
Part D — New Relic (issues-based alerting + clean workflows)
New Relic is strongest when you embrace its “issues/incidents” flow: group many signals into fewer incidents.
1) Actionable alerts in New Relic (NRQL + clear thresholds)
Example: error-rate paging condition
- Warning: errors elevated for 15 minutes
- Critical: errors high for 5 minutes (page)
NRQL idea (conceptual):
- Measure error % for a service and alert on sustained thresholds.
Message should include:
- Impact (what fails)
- Scope (entity/service/env)
- First action (rollback/scale/check dependency)
2) Routing in New Relic (policies + workflows)
A clean model:
- Policies per team or domain (Payments, Platform, Data)
- Workflows route by attributes like:
  - team=payments
  - service=checkout
  - environment=production
  - severity
Best practice: make routing deterministic via consistent attributes/tags.
3) Dedup in New Relic (incident preference + issue grouping)
New Relic can reduce storms if you configure:
- incident/violation grouping so multiple signals become a single issue
- conditions aligned to service/entity rather than per-host noise
Practical approach
- Page on “service golden signals”:
- latency, error rate, saturation, traffic anomalies
- Keep per-host metrics as warning/notify unless platform needs them
4) Suppression in New Relic (muting rules + maintenance windows)
Use muting for:
- planned deploy windows
- scheduled load tests
- known noisy maintenance events
Guardrail:
Mute the expected noisy conditions (restarts, expected transient health checks), not real user-impact signals.
5) Dependency suppression pattern in New Relic
Create a primary condition for dependency health (DB/Redis/Ingress).
Then:
- route downstream symptoms to a non-paging channel when dependency is down, or
- reduce downstream paging thresholds during known dependency incidents
The goal: one high-quality incident representing the root cause.
The “copy-this” operating model (works across all three tools)
Severity model (simple and effective)
- SEV-1 (Page): users impacted now or within 15–30 minutes
- SEV-2 (Notify): needs attention soon, not immediate waking
- SEV-3 (Info): trends / capacity / FYI (dashboards)
Golden signals that deserve paging (service-level)
- Error rate
- Latency (p95/p99)
- Saturation (throttling, queue lag, pool exhaustion)
- Availability (health endpoints only if they reflect real user failure)
Things that usually should not page by default
- CPU high (without impact)
- memory high (without impact)
- pod restarts (unless correlated with errors/latency)
- node disk usage (notify; page only when it threatens outage soon)
A 14-day step-by-step rollout plan (do this and you’ll feel the difference)
Days 1–3: Stop the worst noise
- Identify top 10 paging alerts by volume
- For each: downgrade to notify, add a for: duration, or tie it to impact
- Add a “first action” to every page
Days 4–7: Fix routing and ownership
- Enforce labels/tags: team/service/env/severity
- Route prod critical pages to on-call; everything else to notify channels
- Create a triage path for unowned alerts (no paging)
Days 8–10: Implement dedup
- Group by service+env (not pod/host)
- Add repeat/renotify controls
- Ensure “one incident = one page” behavior
Days 11–14: Add suppression safely
- Maintenance windows (tight scope, short duration)
- Dependency-based inhibition/suppression (root cause pages, downstream updates)
- Review after one on-call cycle and adjust
The litmus test: “Would I be glad to receive this at 3 AM?”
Take any paging alert and ask:
- Does it clearly state impact?
- Is the owner obvious?
- Is the first action obvious?
- Will it page again 20 times if it keeps firing?
If the answer isn’t “yes,” tune it until it is.