Incidents are not a sign you’re doing engineering wrong. They’re a sign you’re doing engineering at scale.
What separates strong teams from stressed teams isn’t “never having incidents.” It’s this:
- When something breaks, everyone knows what to do.
- Customers get clear updates.
- Engineers don’t burn out.
- The system gets better after every incident.
This post is a complete, beginner-friendly, step-by-step incident management playbook that you can copy and run.
We’ll cover:
- Building a sane on-call system
- Defining severity levels that actually work
- Writing communications templates (internal + customer)
- Running postmortems that prevent repeats (without blame)
Let’s turn chaos into a system.

The Incident Mindset (2 rules that change everything)
Rule 1: The goal is stabilize first, then investigate
In the middle of an incident, your job is not to be Sherlock Holmes.
Your job is to:
- Stop the bleeding (reduce impact)
- Restore service (even if temporarily)
- Then find root cause (after people can breathe)
Rule 2: Incidents are a team sport
One person cannot investigate, fix, communicate, and coordinate without making mistakes.
Great incident management is role-based, not hero-based.
Part 1 — On-Call That Doesn’t Destroy People
1) What on-call is actually for
On-call exists to:
- Detect real problems quickly
- Restore service fast
- Reduce customer impact
On-call is NOT for:
- Debugging every warning
- Being awake all night because dashboards are noisy
- Acting as “support” for non-urgent questions
If your on-call feels like a punishment, your system is signaling: you have too many alerts or unclear ownership.
2) The minimum on-call structure (simple and effective)
Role 1: Primary On-Call (Responder)
- Acknowledges alerts
- Runs initial triage
- Starts incident process when needed
Role 2: Secondary On-Call (Backup)
- Helps when primary is stuck or overloaded
- Handles parallel tasks
- Can become “fix owner” while primary coordinates
Role 3: Incident Commander (IC) (for major incidents)
- Coordinates the incident
- Keeps the timeline
- Makes decisions and assigns owners
- Protects responders from distraction
You don’t need a full army. For most teams, Primary + Secondary is enough, and you add an IC for Sev-1/Sev-2.
3) The on-call rotation that reduces burnout
A sane rotation has:
- Predictable schedule (people can plan life)
- Reasonable load
- Clear escalation
- Protected time after a rough night
Practical defaults
- Rotation length: 1 week
- Escalation: primary → secondary → team lead → platform lead
- After-hours policy:
- If you get paged at night and work > 1 hour, you get the next morning off (or equivalent comp)
Burnout isn’t a personal weakness. It’s usually a systems design issue.
4) The alerting rules that make on-call survivable
If you implement only one thing from this blog, implement this:
Only page for alerts that are:
- Actionable (someone can do something now)
- Urgent (waiting causes real harm)
- Owned (a team knows it’s theirs)
- Clear (includes where to look + what to do)
Good page:
- “Payments API 5xx > 3% for 5 minutes (prod). Runbook: restart canary, check DB connections.”
Bad page:
- “CPU is high.”
High CPU might be normal. A high customer error rate is not.
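As a rough illustration, the four paging criteria can be encoded as a gate in front of whatever alerting pipeline you use. This is a minimal sketch, and the `Alert` field names are hypothetical, not from any real tool:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Minimal alert shape; field names are illustrative."""
    actionable: bool            # someone can do something right now
    urgent: bool                # waiting causes real harm
    owner: Optional[str]        # the team that owns this signal
    runbook: Optional[str]      # where to look + what to do


def should_page(alert: Alert) -> bool:
    """Page a human only if all four criteria hold; otherwise file a ticket."""
    return (
        alert.actionable
        and alert.urgent
        and alert.owner is not None
        and alert.runbook is not None
    )


# The "good page" above passes; "CPU is high" fails on ownership and action.
good = Alert(True, True, "payments", "restart canary, check DB connections")
bad = Alert(False, False, None, None)
```

Anything that fails the gate becomes a ticket, not a page, which is most of what keeps on-call survivable.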
5) Build a “first 15 minutes” runbook (the most valuable runbook)
Every critical service should have a short runbook with:
- Where are the dashboards?
- Where are the logs?
- How to check dependencies?
- How to roll back?
- How to failover?
- Who to call?
If your responder has to guess, you lose minutes.
Part 2 — Severity Levels That Don’t Cause Confusion
Severity is not “how scary it feels.” It’s impact.
A good severity model answers:
- Who must join?
- How often do we update?
- What is the target response time?
- What level of leadership visibility is required?
Here’s a clean model you can adopt.
Severity model (Sev-1 to Sev-4)
Sev-1 (Critical)
Definition: Widespread outage or major revenue/safety impact
Examples:
- Checkout/payment down for all users
- Production data corruption
- Security breach in progress
Response:
- Incident Commander required
- Updates every 15–30 minutes
- All hands as needed
Sev-2 (High)
Definition: Significant degradation or partial outage with customer impact
Examples:
- 30% of users can’t log in
- API latency doubled causing timeouts
- One region down but failover works for most
Response:
- IC strongly recommended
- Updates every 30–60 minutes
- Cross-team support if needed
Sev-3 (Medium)
Definition: Limited impact; workaround exists; non-critical component degraded
Examples:
- Reporting dashboard down (core app works)
- Background jobs delayed but no data loss
Response:
- Standard on-call handles it
- Updates every 2–4 hours (or as agreed)
Sev-4 (Low)
Definition: No immediate customer impact; informational
Examples:
- Single node restart
- Minor alert resolved automatically
Response:
- Ticket + fix during business hours
Severity decision cheat sheet (use during chaos)
Ask these 3 questions:
- Are customers blocked from core actions?
- Is money/data/security at risk?
- Is the impact growing?
If “yes” to any → likely Sev-1/Sev-2.
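The cheat sheet can be written down as a tiny triage helper. One possible mapping is below; the "two or more yes answers means Sev-1" threshold is my assumption, not part of the playbook, so tune it to your own risk tolerance:

```python
def suggest_severity(customers_blocked: bool,
                     money_data_security_at_risk: bool,
                     impact_growing: bool) -> str:
    """Map the three cheat-sheet questions to a starting severity.

    This is a triage starting point, not a final call; responders
    can upgrade or downgrade once the scope is clearer.
    """
    yes_count = sum([customers_blocked, money_data_security_at_risk, impact_growing])
    if yes_count >= 2:          # assumption: multiple "yes" answers => critical
        return "Sev-1"
    if yes_count == 1:          # any single "yes" => high
        return "Sev-2"
    return "Sev-3/Sev-4"        # no "yes" => standard on-call territory
```

The point is not the exact thresholds; it's that severity becomes a checklist instead of a gut feeling at 3 a.m.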
Part 3 — The Incident Lifecycle (Step-by-Step)
Here is the incident flow that top teams follow—adapted for real-world engineering.
Step 0: Detection
Detection happens via:
- Monitoring alerts
- Customer reports
- Support tickets
- Social media (yes, it happens)
Goal: confirm quickly if it’s real.
Step 1: Acknowledge + Start a timer
The moment you suspect a serious incident:
- Acknowledge the alert
- Start a timestamped incident channel/bridge
- Write the first line:
“Investigating elevated errors on Payments API (prod).”
This single line reduces panic because it tells everyone: someone is on it.
Step 2: Triage (5 minutes)
Triage is NOT deep debugging. It’s classification.
Checklist:
- What is broken? (symptom)
- Who is impacted? (scope)
- Since when? (start time)
- What changed recently? (deploys/config/infra)
- Is there a fast mitigation? (rollback/failover/feature flag)
Output: severity + initial hypothesis + mitigation options.
Step 3: Assign roles (for Sev-1/Sev-2)
This is where incidents stop being chaotic.
Minimum roles
- Incident Commander (IC): “drives the process”
- Tech Lead (TL): “drives the fix”
- Comms Lead: “drives updates”
In small teams, one person may do two roles, but never all three for a big incident.
Step 4: Mitigate (stop the bleeding)
Mitigation examples (use what fits):
- Roll back latest deploy
- Disable a feature flag
- Reduce traffic (rate limiting)
- Fail over region
- Restart a degraded component
- Scale a dependency temporarily
- Shed non-critical load
Important: A rollback is not failure. It’s speed.
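Several of these mitigations reduce to flipping a switch quickly. As a sketch, a feature-flag kill switch can be as simple as the toy below; in production the store would be your flag service or a config table, and all names here are hypothetical:

```python
class FlagStore:
    """Toy in-memory flag store standing in for a real flag service."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Default to OFF so an unknown flag fails closed.
        return self._flags.get(name, False)


flags = FlagStore()
flags.set("new-checkout-path", True)

# Mitigation during an incident: disable the risky path immediately
# and fall back to the known-good one. Reversible in one line.
flags.set("new-checkout-path", False)


def checkout(cart):
    if flags.is_enabled("new-checkout-path"):
        return "new path"
    return "stable fallback path"
```

A kill switch like this is a reversible action, which is exactly why it should sit near the top of your mitigation list.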
Step 5: Investigate (while service stabilizes)
Now you do the deeper work:
- narrow down the cause
- compare timelines: deploys, config changes, incidents
- examine logs/metrics/traces
- validate hypotheses
Step 6: Resolve + Verify
Resolution is not “alert stopped.” It’s:
- customer actions succeed
- error rates normal
- latency normal
- queues drain
- no hidden failures
Then:
- announce resolution
- keep monitoring for 30–60 minutes (“watch period”)
Step 7: Document the timeline (while memory is fresh)
The best time to capture facts is during the incident.
Write quick timestamps like:
- 10:02 – Alert fired (5xx > 5%)
- 10:05 – IC assigned
- 10:12 – Rollback started
- 10:20 – Errors down to baseline
- 10:35 – Full recovery confirmed
This makes the postmortem 10x easier.
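Capturing the timeline can be a two-line helper that anyone in the incident channel runs; a minimal sketch (timestamps in UTC by assumption):

```python
from datetime import datetime, timezone

timeline = []


def log_event(note: str) -> str:
    """Append a timestamped entry to the incident timeline."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M")
    entry = f"{stamp} – {note}"
    timeline.append(entry)
    return entry


log_event("Alert fired (5xx > 5%)")
log_event("IC assigned")
log_event("Rollback started")
```

Even a shared doc with manually typed timestamps works; the habit matters more than the tooling.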
Part 4 — Comms Templates That Prevent Confusion (Copy/Paste)
Communication is not optional. It’s part of fixing the incident.
People panic when they lack information. Your job is to replace panic with clarity.
Below are no-link, ready-to-use templates.
A) Internal comms templates (Slack/Teams)
1) Incident declared (initial message)
Subject: 🔥 Incident declared — [Service] — Sev-[X]
Message:
- What: Seeing [symptom] in [service]
- Impact: [who/what affected]
- When: Since [time]
- Severity: Sev-[X]
- Current action: Investigating / mitigating via [rollback/failover/etc.]
- Next update: in [15/30/60] minutes
- Roles: IC: [name], TL: [name], Comms: [name]
2) Status update (every 15–60 mins depending on severity)
Update #[n] — [time]
- Current status: (Investigating / Mitigating / Monitoring / Resolved)
- What we know:
- What we’re doing now:
- Customer impact: (improving / stable / worsening)
- ETA: (if unknown, say “unknown; next update at X”)
- Next update: [time]
3) Need help / escalation request
Need support — [area]
- What we need: [DB expert / networking / platform / vendor]
- Why: [symptom and hypothesis]
- Urgency: Sev-[X], join bridge/channel now
- Owner: [TL or IC name]
4) Resolution message
✅ Resolved — [service] — Sev-[X]
- Root cause (initial): [short and factual]
- Fix applied: [what was done]
- Impact window: [start] to [end]
- Customer status: restored
- Follow-ups: postmortem scheduled + action items incoming
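Filling these templates under pressure is error-prone, so some teams script them. A minimal sketch using Python's standard `string.Template` (the service and people names are invented for illustration):

```python
from string import Template

DECLARED = Template(
    "🔥 Incident declared — $service — Sev-$sev\n"
    "- What: Seeing $symptom in $service\n"
    "- Impact: $impact\n"
    "- When: Since $since\n"
    "- Next update: in $next_update minutes\n"
    "- Roles: IC: $ic, TL: $tl, Comms: $comms"
)

# substitute() raises KeyError if any field is missing,
# which is useful: an incomplete declaration never goes out.
msg = DECLARED.substitute(
    service="Payments API", sev="1",
    symptom="elevated 5xx", impact="~20% of payment attempts failing",
    since="10:02 UTC", next_update="15",
    ic="Dana", tl="Ravi", comms="Lee",
)
```

Wiring this into a chat-bot slash command is a common next step, but even a snippet library in your wiki gets most of the benefit.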
B) Customer-facing update templates (status page / email)
1) Customer initial notice
We’re investigating an issue
We are currently investigating reports of [symptom] affecting [product area].
Some users may experience [impact].
Next update in [time window].
2) Customer progress update
We’ve identified the cause and are working on mitigation
We have identified the cause of the issue and are applying mitigation steps to restore normal service.
Users may continue to see [impact] while recovery is in progress.
Next update in [time window].
3) Customer resolution note
Issue resolved
The issue affecting [product area] has been resolved.
During the incident, users may have experienced [impact] between [start] and [end].
We are conducting a detailed review and will implement additional safeguards.
The golden rule of comms (keeps trust high)
Never guess. Never overpromise.
If you don’t know, say:
“We don’t have an ETA yet. Next update at [time].”
That single line is professional and calming.
Part 5 — Postmortems That Actually Prevent Repeat Incidents
A postmortem is not a blame document.
It’s a system improvement document.
The output should be:
- a clear story (what happened)
- contributing factors (why it happened)
- action items (what changes)
- proof it won’t repeat (or risk is reduced)
Postmortem structure (simple and powerful)
1) Summary (5 lines)
- What happened (one sentence)
- Customer impact (who/how)
- Duration (start-end)
- Severity
- Current status (resolved/monitoring)
2) Customer impact (concrete)
- % users impacted
- error rates / latency
- affected features
- revenue/business impact if known (optional)
3) Timeline (facts only)
Use timestamped entries. No opinions.
4) Root cause (one clear statement)
Example:
“Database connection pool exhaustion caused request failures after a config change reduced max connections.”
5) Contributing factors (the real learning)
This is where most teams improve:
- Missing alert
- Poor runbook
- Risky deployment
- No circuit breaker
- No load test
- Unclear ownership
- Inadequate rollback
6) What went well
- fast detection
- good coordination
- quick mitigation
- clear comms
7) What didn’t go well
- alert noise delayed response
- unclear severity
- missing dashboard
- slow escalation
8) Action items (the only part that truly matters)
Each action item must have:
- Owner
- Due date
- Priority
- How we verify
- Expected impact
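The quality bar for action items can even be enforced mechanically. A sketch of a validator that refuses to close a postmortem with vague action items (the field set mirrors the list above; the example items are invented):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None
    verification: Optional[str] = None  # how we prove it worked


def is_trackable(item: ActionItem) -> bool:
    """An action item without an owner and a due date is a wish, not a plan."""
    return all([item.title, item.owner, item.due, item.priority, item.verification])


vague = ActionItem("Improve alerting")
concrete = ActionItem(
    "Add alert: DB pool > 80% for 5 min",
    owner="Platform", due=date(2025, 2, 10),
    priority="P1", verification="Alert fires during staging chaos test",
)
```

Run something like this in CI over your postmortem repo and "improve alerting" never survives review again.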
Real postmortem example (short but realistic)
Incident
Sev-1: Payments API failures in prod
Summary:
From 10:02–10:35, users experienced payment failures due to DB connection exhaustion after a configuration change. Rolling back the change restored service.
Impact:
- 22% of payment attempts failed
- Peak 5xx: 12%
- Duration: 33 minutes
Root cause:
A config change lowered DB max connections; traffic spike caused connection pool exhaustion.
Contributing factors:
- No alert on connection pool utilization
- No canary check for DB connection thresholds
- Runbook lacked “rollback config” step
Action items:
- Add alert: DB connection pool > 80% for 5 min (Owner: Platform, Due: Feb 10)
- Add canary check blocking deploy if DB connections exceed threshold (Owner: SRE, Due: Feb 15)
- Update runbook with rollback command + verification steps (Owner: Payments TL, Due: Feb 5)
- Load test payments under traffic spikes monthly (Owner: QA/Perf, Due: Mar 1)
This is what a useful postmortem looks like: clear, factual, and improvement-driven.
Part 6 — The “Maturity Ladder” (how to level up over time)
Level 1: Reactive
- You page on-call
- You fix
- You move on
Level 2: Structured response
- Severity defined
- Roles assigned
- Regular updates
Level 3: Repeatable systems
- Runbooks
- Good paging signals
- Postmortems with tracked actions
Level 4: Prevention-focused
- SLOs and error budgets
- Automated rollback
- Safe deploy patterns (canary, feature flags)
- Chaos testing / game days
Your goal isn’t perfection. Your goal is progress.
Part 7 — Your “Ready to Run” Incident Kit (copy and implement)
If you want to operationalize everything in this blog, implement this kit:
1) A single “Declare Incident” message format
Use the internal incident declared template.
2) Severity definition (Sev-1..Sev-4)
Keep it impact-based.
3) Role assignment rule
For Sev-1/Sev-2: IC + TL + Comms.
4) Update frequency rule
Sev-1: every 15–30 mins
Sev-2: every 30–60 mins
Sev-3: every 2–4 hours
5) Postmortem within 3–5 business days
And action items tracked like production bugs.
Final takeaway (print this in your head)
Incidents don’t become manageable because systems stop failing.
They become manageable because your response becomes predictable.
- On-call is humane because alerting is disciplined.
- Severity is clear because impact is defined.
- Comms are calm because templates exist.
- Postmortems work because action items have owners and deadlines.
As a bonus, here is a one-page Incident Management SOP (checklists only) that you can paste into your internal wiki and start using today.
Incident Management SOP (One Page)
0) When to declare an incident
Declare an incident if any of the following is true:
- Customer-facing errors or latency are elevated and sustained
- Core user journey is blocked (login/checkout/payments/api)
- Data loss/corruption risk
- Security incident suspected/in progress
- Impact is growing or unclear but serious
1) Severity (impact-based)
Sev-1 (Critical)
- Widespread outage OR major revenue/data/security impact
- Updates: every 15–30 min
- Roles: IC required
Sev-2 (High)
- Significant degradation OR partial outage with clear customer impact
- Updates: every 30–60 min
- Roles: IC recommended
Sev-3 (Medium)
- Limited impact; workaround exists; non-core feature affected
- Updates: every 2–4 hours
Sev-4 (Low)
- No immediate customer impact; informational
- Handle as ticket during business hours
2) Roles (assign for Sev-1/Sev-2)
Incident Commander (IC)
- Owns process, timeline, decisions, task assignment
- Protects responders from distractions
- Ensures updates go out on time
Tech Lead (TL)
- Owns investigation and fix plan
- Assigns technical tasks to engineers
- Verifies mitigation and recovery
Comms Lead
- Sends internal + external updates
- Maintains consistent message and cadence
- Captures customer impact summary
(In small teams, one person may do 2 roles — never all 3 in Sev-1.)
3) Golden priorities (always in this order)
- Stabilize (stop the bleeding)
- Restore service (even if temporary)
- Investigate root cause (after stability)
- Prevent recurrence (postmortem actions)
4) First 5 minutes checklist (Primary On-Call)
- Acknowledge alert
- Confirm impact (errors/latency/customer reports)
- Declare severity (Sev-1..Sev-4)
- Create incident channel/bridge
- Post “Incident declared” message (template below)
- Assign roles (IC/TL/Comms) if Sev-1/Sev-2
- Start timeline notes (timestamps)
5) Triage checklist (first 15 minutes)
- What is failing? (symptom)
- Who is impacted? (% users, regions, tier)
- Since when? (start time)
- What changed recently? (deploy/config/infra)
- Dependencies healthy? (DB/cache/queue/3rd party)
- Fast mitigation available? (rollback/feature flag/failover)
- Decide mitigation path + assign owner
6) Mitigation playbook (choose what fits)
- Roll back latest deploy
- Disable feature flag / bypass risky path
- Rate limit / shed non-critical load
- Restart stuck components safely
- Scale up temporarily (app/DB/queue)
- Fail over region / switch to degraded mode
- Stop noisy batch jobs or background consumers
Rule: Prefer reversible actions first.
7) Communications cadence
- Sev-1: update every 15–30 min
- Sev-2: update every 30–60 min
- Sev-3: update every 2–4 hours
- Sev-4: ticket only
If no ETA: say “No ETA yet. Next update at [time].”
8) Internal comms templates (copy/paste)
A) Incident declared
🔥 Incident declared — [Service] — Sev-[X]
- What: [symptom]
- Impact: [who/what affected]
- When: since [time]
- Severity: Sev-[X]
- Action now: [investigating / rollback / failover]
- Next update: [time]
- Roles: IC [name] | TL [name] | Comms [name]
B) Status update
Update #[n] — [time]
- Status: Investigating / Mitigating / Monitoring / Resolved
- What we know: …
- What we’re doing: …
- Impact: improving / stable / worsening
- Next update: [time]
C) Resolution
✅ Resolved — [Service] — Sev-[X]
- Fix: [what changed]
- Impact window: [start–end]
- Customer status: restored
- Next: postmortem + action items
9) Customer-facing templates (no promises, no blame)
A) Investigating
We are investigating an issue affecting [feature]. Some users may experience [impact]. Next update in [time].
B) Identified / Mitigating
We have identified the cause and are applying mitigation to restore service. Users may continue to see [impact]. Next update in [time].
C) Resolved
The issue has been resolved. Users may have experienced [impact] between [start] and [end]. We are completing a review and implementing safeguards.
10) Resolution criteria (don’t close early)
- Core user actions succeed (sample checks)
- Errors back to baseline
- Latency back to baseline
- Queues draining normally (if applicable)
- No ongoing dependency degradation
- Watch period completed (30–60 min)
11) Postmortem (within 3–5 business days)
Must include:
- Summary (what/impact/duration/severity)
- Customer impact (metrics)
- Timeline (facts with timestamps)
- Root cause (clear statement)
- Contributing factors (why it happened)
- What went well / didn’t go well
- Action items with Owner + Due date + Verification
Action item quality bar
- Prevent or reduce recurrence
- Detect faster next time
- Reduce blast radius
- Improve rollback/failover
- Improve runbook/alerts