Mohammad Gufran Jahangir, January 29, 2026

Most teams “have SLOs” the way most teams “have monitoring”:

  • a dashboard nobody trusts
  • alerts that fire for the wrong reasons
  • targets picked because they sound nice (“let’s do 99.99%”)
  • and zero impact on what engineers actually build next

This blog is the opposite.

By the end, you’ll be able to create SLOs that engineers follow, product teams understand, and leadership respects—because they’re tied to real user experience, have a clear budget for failure, and drive day-to-day decisions.

Let’s build it like an engineer.


The simple definitions (so you never get confused again)

SLI (Service Level Indicator)

A measurement of something users care about.

Examples:

  • “Percent of checkout requests that succeed”
  • “95th percentile latency of search results”
  • “Percent of messages delivered within 2 seconds”

SLI is data. If you can’t measure it reliably, you can’t build an SLO on it.


SLO (Service Level Objective)

A target for an SLI over a time window.

Example:

  • “99.9% of checkout requests succeed over 28 days”
  • “95% of search requests finish under 300 ms over 7 days”

SLO is a goal. It’s your team saying: “This is the reliability we promise.”


Error Budget

The allowed amount of failure within that SLO window.

If your SLO is 99.9% success over 28 days, your error budget is:

  • 0.1% failures allowed in that window

Error budgets are powerful because they turn “reliability” into a resource you can spend.

Spend it carefully = ship faster.
Spend it recklessly = freeze changes and fix reliability.
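To make "a resource you can spend" concrete, here is a minimal sketch that converts an SLO target into a time-based budget. The numbers are illustrative, not from any real service:

```python
# A minimal sketch: turning an SLO target into a time-based error budget.
def downtime_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of total failure you can absorb in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)

# 99.9% over 28 days leaves roughly 40 minutes of budget
print(round(downtime_budget_minutes(0.999, 28), 1))  # 40.3
```

Note how unforgiving the math gets: at 99.99% over 28 days the same formula leaves you about 4 minutes, which is why targets should be chosen deliberately rather than aspirationally.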


The magic idea: Reliability is a trade-off, not a religion

Teams fail with SLOs when they treat them as “a score” or “a punishment.”

SLOs work when they become a decision system:

  • If we are within budget → we can ship features faster
  • If we are burning budget too fast → we slow down and improve reliability
  • If we are out of budget → we stop risky releases until we stabilize

That’s it. Clear. Fair. Repeatable.


Step-by-step: How to create SLOs that actually work

Step 1: Start with a user journey, not a metric

Don’t begin with CPU or pod restarts. Begin with what users actually do.

Pick one critical journey:

  • “User logs in”
  • “User searches”
  • “User checks out”
  • “Device sends telemetry”
  • “Hospital submits a bid”
  • “File upload completes”

Ask one question:

If this journey is broken, do we lose trust or revenue?

If yes, it deserves an SLO.


Step 2: Choose the right kind of SLI (most teams pick the wrong one)

There are 4 “golden” SLI types that map to user experience:

  1. Availability / Success rate
    • “% of requests that succeed”
  2. Latency
    • “% under threshold” or “p95 latency”
  3. Throughput
    • “Requests per second” (usually supporting, not the main SLO)
  4. Correctness / Quality
    • “% of correct responses” (harder, but valuable)

For most product systems, the fastest win is:

Request success rate SLI for the critical endpoint(s)


Step 3: Define “good” vs “bad” clearly (this is where SLOs become real)

Here’s the biggest mistake:

Counting every 200 response as “success.”

A successful request must reflect user success, not server optimism.

Example: Checkout (good SLI)

Good events:

  • HTTP 2xx or 3xx
  • AND business outcome = “order created”
  • AND latency under X (optional but often important)

Bad events:

  • HTTP 5xx
  • timeouts
  • validation failures caused by bugs in your service
  • dependency failures (payments provider down) — decide how you want to count these

Pro rule (very important):

Only include events where the user actually tried the journey.

Example: Don’t include background health checks in the SLI.
They inflate numbers and confuse everyone.


Step 4: Pick the SLO window (this choice changes behavior)

Common windows:

  • 7 days → fast feedback, good for fast-moving teams
  • 28 or 30 days → standard for “monthly reliability”
  • 90 days → great for long-term commitments, slower feedback

Practical advice:

  • Start with 28 days for “main reliability story”
  • Add a 7-day view for “release safety signals”

Step 5: Set an SLO target that’s believable (not wishful)

Don’t start with 99.99%. Start with what your service can realistically do.

A simple method that works:

The “baseline then improve” method

  1. Measure current performance for 2–4 weeks
  2. Choose a target that is:
    • slightly better than baseline, or
    • matches business need

Why this works:
People trust a target that reflects reality. They fight a target that feels fake.

Example:

Your checkout success rate is currently 99.6% over 28 days.

Start SLO at:

  • 99.7% (a reachable improvement)

Then increase once you have real reliability work done.
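One way to turn "slightly better than baseline" into a repeatable rule is to close a fixed fraction of the gap to 100%. The 25% step here is an assumption, not a standard; tune it to your team's appetite:

```python
# Pick a believable first SLO target from a measured baseline by
# closing a fraction of the remaining gap to 100%.
# The default step of 25% is an arbitrary, tunable assumption.

def first_slo_target(baseline: float, step: float = 0.25) -> float:
    return baseline + (1.0 - baseline) * step

# Checkout currently succeeds 99.6% of the time over 28 days
print(round(first_slo_target(0.996), 4))  # 0.997 -> start at 99.7%
```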


Step 6: Calculate the error budget (and make it human-friendly)

Example 1: Success rate SLO

SLO: 99.9% success over 28 days

Error budget = 0.1% failures.

Let’s convert it into “what it means”:

If you have 50,000 checkout attempts/day
Over 28 days → 1,400,000 attempts

Allowed failures = 0.1% of 1,400,000
= 1,400 failures in 28 days

Now it’s concrete:
You can “spend” 1,400 failed checkouts per month before you must slow down releases.

Example 2: Latency SLO (better expressed as a percentage under threshold)

SLO: 95% of search requests under 300 ms over 7 days

If total requests/week = 20,000,000
Allowed slow requests = 5%
= 1,000,000 requests can exceed 300 ms

That’s your budget for slowness.
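Both worked examples above are simple arithmetic you can adapt to your own traffic:

```python
# The two worked examples above, as arithmetic you can adapt.

# Example 1: 99.9% success over 28 days, ~50,000 checkouts/day
attempts = 50_000 * 28                    # 1,400,000 attempts in the window
allowed_failures = attempts * (1 - 0.999)
print(int(round(allowed_failures)))       # 1400 failed checkouts allowed

# Example 2: 95% of searches under 300 ms, 20,000,000 requests/week
weekly_requests = 20_000_000
allowed_slow = weekly_requests * (1 - 0.95)
print(int(round(allowed_slow)))           # 1000000 slow requests allowed
```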


Step 7: Decide what counts against the budget (fairness matters)

Teams abandon SLOs when they feel punished for things they can’t control.

So be explicit:

Count these (usually)

  • your service’s 5xx errors
  • timeouts in your code path
  • internal dependency failures you are responsible for
  • deploy-related issues

Consider excluding or separating

  • planned maintenance (tracked separately)
  • upstream client misuse (bad requests)
  • major third-party outages (sometimes tracked separately)

Pro pattern:
Keep two views:

  • User SLO (end-to-end, includes dependencies)
  • Service SLO (what your team controls directly)

This prevents arguments and keeps reliability honest.
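The two-view pattern is just the same denominator with two different numerators. The failure categories and counts below are invented for illustration:

```python
# Sketch of the "two views" pattern. Categories and counts are
# hypothetical numbers for illustration.
total = 1_000_000
failures = {
    "our_5xx": 600,
    "our_timeouts": 200,
    "third_party_outage": 900,  # e.g. payments provider down
}

# User SLO: end-to-end, everything the user experienced counts.
user_failures = sum(failures.values())
# Service SLO: only what this team controls directly.
service_failures = failures["our_5xx"] + failures["our_timeouts"]

user_slo = 1 - user_failures / total
service_slo = 1 - service_failures / total
print(f"user SLO: {user_slo:.4%}, service SLO: {service_slo:.4%}")
```

When the two numbers diverge sharply, that gap itself is useful: it points at a dependency problem rather than a code problem.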


The best “starter” SLO templates (copy and adapt)

Template A: API success rate SLO (most useful)

  • Journey: “User can complete checkout”
  • SLI: Successful checkout requests / total checkout requests
  • Good event: Order created + no timeout
  • SLO: 99.9% over 28 days
  • Error budget: 0.1% failed checkouts allowed

Template B: Latency SLO (useful when speed is the product)

  • Journey: “User sees search results”
  • SLI: % of search responses under 300 ms
  • SLO: 95% under 300 ms over 7 days
  • Error budget: 5% slow responses allowed

Template C: Streaming / messaging SLO (for delivery systems)

  • Journey: “Message delivered to consumer quickly”
  • SLI: % of messages delivered within 2 seconds
  • SLO: 99% within 2 seconds over 30 days

Now the part that makes SLOs “actually work”: Burn rate + release rules

Most teams stop at “we have a number.”

Working teams add rules.

Error budget burn rate (simple explanation)

Burn rate answers:

“Are we losing our budget too fast?”

If you spend a month’s budget in a day, you’re in trouble.

Practical rule-of-thumb (easy and effective)

  • If you burn 25% of monthly budget in 1 day → investigate immediately
  • If you burn 50% of monthly budget in 3 days → pause risky releases
  • If you burn 100% → release freeze (except reliability fixes)

These rules create predictable behavior and stop surprise outages.
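The rules of thumb above map directly onto a standard burn-rate formula: budget spent as a fraction of the window's budget, divided by the fraction of the window elapsed. A burn rate of 1.0 means you are on pace to spend exactly your budget by the end of the window:

```python
# Burn rate = fraction of budget spent / fraction of window elapsed.
# 1.0 = on pace to land exactly on budget at the end of the window.

def burn_rate(budget_spent_frac: float, days_elapsed: float,
              window_days: float = 28) -> float:
    return budget_spent_frac / (days_elapsed / window_days)

# The rules of thumb above, expressed as burn rates (28-day window):
print(round(burn_rate(0.25, 1), 1))   # 7.0 -> investigate immediately
print(round(burn_rate(0.50, 3), 1))   # 4.7 -> pause risky releases
```

This is why "25% in one day" is alarming: it means you are burning budget seven times faster than sustainable.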


Real examples that show SLO thinking in action

Example 1: Checkout service (99.9% SLO)

You release a new promo code feature. Suddenly:

  • 5xx errors rise
  • success rate drops
  • error budget burn rate spikes

Decision:
Pause further feature rollouts, roll back if needed, fix the promo code path.

Why it works:
The system isn’t blaming anyone. It’s simply enforcing the contract.


Example 2: Kubernetes cluster reliability (avoid this common mistake)

A lot of teams try:

  • “Cluster uptime SLO = 99.99%”

But users don’t care about “cluster uptime.”
They care about whether their requests succeed and are fast.

Better:

  • Service-level SLOs for critical paths
  • Platform SLOs as supporting indicators (like “pod scheduling latency”)

Example 3: Data pipeline SLO (batch)

Journey: “Daily report available by 7 AM”

SLI:

  • % of days report is available by 7 AM

SLO:

  • 99% of days on-time over 90 days

Error budget:

  • 1% late days allowed → about 1 day late per quarter

This is perfect for batch systems where request-level SLIs don’t match reality.
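The batch budget is the same arithmetic with days instead of requests:

```python
# Budget arithmetic for the batch example: 99% on-time over 90 days.
window_days = 90
slo = 0.99
late_days_allowed = window_days * (1 - slo)
print(round(late_days_allowed, 1))  # 0.9 -> roughly one late day per quarter
```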


How many SLOs should you create? (the answer surprises people)

Start with 2–5 SLOs per service, max.

Too many SLOs:

  • confuse teams
  • create alert fatigue
  • reduce trust

The best starting set

  1. Availability / success rate for the main user journey
  2. Latency for the same journey (optional but strong)
  3. A key dependency SLO (optional)

Then expand slowly.


Common SLO mistakes (and how to fix them)

Mistake 1: “We picked 99.99% because it sounds good”

Fix: start with baseline performance, then improve.

Mistake 2: Measuring the wrong thing (health checks, internal endpoints)

Fix: measure user journey events only.

Mistake 3: Latency SLO using average latency

Fix: use percentile or % under threshold.

Mistake 4: No consequences

Fix: define release rules tied to error budget burn.

Mistake 5: Blame games over dependencies

Fix: separate user SLO vs service-controlled SLO.


The easiest 30-day rollout plan (works for most teams)

Week 1: Choose + define

  • pick 1–2 key journeys
  • define good/bad events
  • validate data quality

Week 2: Measure baseline

  • run SLI measurement
  • confirm it matches user reality

Week 3: Set targets + error budget

  • pick achievable targets
  • calculate budget
  • agree on burn rules

Week 4: Operate

  • add alerting on burn rate
  • define release rules
  • publish one “SLO report” internally

By day 30, your SLO isn’t a document—it’s a system.


The final secret: SLOs are a tool for curiosity

The best teams don’t use SLOs to prove they’re good.

They use SLOs to ask better questions:

  • Why did we burn budget this week?
  • Which release changed behavior?
  • Which dependency is hurting us most?
  • What reliability work gives the biggest improvement per engineer-week?

When SLOs create curiosity, they keep people reading, improving, and trusting the system.


Quick “SLO that works” checklist (save this)

  • Based on a real user journey
  • SLI definition matches user success
  • Has a clear time window
  • Target is realistic (baseline-informed)
  • Error budget is calculated and understood
  • Burn rules exist and influence releases
  • Ownership is clear
  • Reviewed regularly and improved over time
