Mohammad Gufran Jahangir, January 29, 2026

Most teams “have SLOs” the way most teams “have monitoring”:

  • a dashboard nobody trusts
  • alerts that fire for the wrong reasons
  • targets picked because they sound nice (“let’s do 99.99%”)
  • and zero impact on what engineers actually build next

This blog is the opposite.

By the end, you’ll be able to create SLOs that engineers follow, product teams understand, and leadership respects—because they’re tied to real user experience, have a clear budget for failure, and drive day-to-day decisions.

Let’s build it like an engineer.


The simple definitions (so you never get confused again)

SLI (Service Level Indicator)

A measurement of something users care about.

Examples:

  • “Percent of checkout requests that succeed”
  • “95th percentile latency of search results”
  • “Percent of messages delivered within 2 seconds”

SLI is data. If you can’t measure it reliably, you can’t build an SLO on it.


SLO (Service Level Objective)

A target for an SLI over a time window.

Example:

  • “99.9% of checkout requests succeed over 28 days”
  • “95% of search requests finish under 300 ms over 7 days”

SLO is a goal. It’s your team saying: “This is the reliability we promise.”


Error Budget

The allowed amount of failure within that SLO window.

If your SLO is 99.9% success over 28 days, your error budget is:

  • 0.1% failures allowed in that window

Error budgets are powerful because they turn “reliability” into a resource you can spend.

Spend it carefully = ship faster.
Spend it recklessly = freeze changes and fix reliability.
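To make "a resource you can spend" concrete, here is a minimal sketch that converts an SLO target into a time-based budget. The numbers are illustrative, not from any real service:

```python
# A minimal sketch: turning an SLO target into a time-based error budget.
def downtime_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of total failure you can absorb in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)

# 99.9% over 28 days leaves roughly 40 minutes of budget
print(round(downtime_budget_minutes(0.999, 28), 1))  # 40.3
```

Note how unforgiving the math gets: at 99.99% over 28 days the same formula leaves you about 4 minutes, which is why targets should be chosen deliberately rather than aspirationally.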


The magic idea: Reliability is a trade-off, not a religion

Teams fail with SLOs when they treat them as “a score” or “a punishment.”

SLOs work when they become a decision system:

  • If we are within budget → we can ship features faster
  • If we are burning budget too fast → we slow down and improve reliability
  • If we are out of budget → we stop risky releases until we stabilize

That’s it. Clear. Fair. Repeatable.


Step-by-step: How to create SLOs that actually work

Step 1: Start with a user journey, not a metric

Don’t begin with CPU or pod restarts. Begin with what users actually do.

Pick one critical journey:

  • “User logs in”
  • “User searches”
  • “User checks out”
  • “Device sends telemetry”
  • “Hospital submits a bid”
  • “File upload completes”

Ask one question:

If this journey is broken, do we lose trust or revenue?

If yes, it deserves an SLO.


Step 2: Choose the right kind of SLI (most teams pick the wrong one)

There are 4 “golden” SLI types that map to user experience:

  1. Availability / Success rate
    • “% of requests that succeed”
  2. Latency
    • “% under threshold” or “p95 latency”
  3. Throughput
    • “Requests per second” (usually supporting, not the main SLO)
  4. Correctness / Quality
    • “% of correct responses” (harder, but valuable)

For most product systems, the fastest win is:

Request success rate SLI for the critical endpoint(s)


Step 3: Define “good” vs “bad” clearly (this is where SLOs become real)

Here’s the biggest mistake:

Counting every 200 response as “success.”

A successful request must reflect user success, not server optimism.

Example: Checkout (good SLI)

Good events:

  • HTTP 2xx or 3xx
  • AND business outcome = “order created”
  • AND latency under X (optional but often important)

Bad events:

  • HTTP 5xx
  • timeouts
  • validation failures caused by bugs in your service
  • dependency failures (payments provider down) — decide how you want to count these

Pro rule (very important):

Only include events where the user actually tried the journey.

Example: Don’t include background health checks in the SLI.
They inflate numbers and confuse everyone.


Step 4: Pick the SLO window (this choice changes behavior)

Common windows:

  • 7 days → fast feedback, good for fast-moving teams
  • 28 or 30 days → standard for “monthly reliability”
  • 90 days → great for long-term commitments, slower feedback

Practical advice:

  • Start with 28 days for “main reliability story”
  • Add a 7-day view for “release safety signals”

Step 5: Set an SLO target that’s believable (not wishful)

Don’t start with 99.99%. Start with what your service can realistically do.

A simple method that works:

The “baseline then improve” method

  1. Measure current performance for 2–4 weeks
  2. Choose a target that is:
    • slightly better than baseline, or
    • matches business need

Why this works:
People trust a target that reflects reality. They fight a target that feels fake.

Example:

Your checkout success rate is currently 99.6% over 28 days.

Start SLO at:

  • 99.7% (a reachable improvement)

Then increase once you have real reliability work done.
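One way to turn "slightly better than baseline" into a repeatable rule is to close a fixed fraction of the gap to 100%. The 25% step here is an assumption, not a standard; tune it to your team's appetite:

```python
# Pick a believable first SLO target from a measured baseline by
# closing a fraction of the remaining gap to 100%.
# The default step of 25% is an arbitrary, tunable assumption.

def first_slo_target(baseline: float, step: float = 0.25) -> float:
    return baseline + (1.0 - baseline) * step

# Checkout currently succeeds 99.6% of the time over 28 days
print(round(first_slo_target(0.996), 4))  # 0.997 -> start at 99.7%
```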


Step 6: Calculate the error budget (and make it human-friendly)

Example 1: Success rate SLO

SLO: 99.9% success over 28 days

Error budget = 0.1% failures.

Let’s convert it into “what it means”:

If you have 50,000 checkout attempts/day
Over 28 days → 1,400,000 attempts

Allowed failures = 0.1% of 1,400,000
= 1,400 failures in 28 days

Now it’s concrete:
You can “spend” 1,400 failed checkouts per month before you must slow down releases.

Example 2: Latency SLO (better expressed as a percentage under threshold)

SLO: 95% of search requests under 300 ms over 7 days

If total requests/week = 20,000,000
Allowed slow requests = 5%
= 1,000,000 requests can exceed 300 ms

That’s your budget for slowness.
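Both worked examples above are simple arithmetic you can adapt to your own traffic:

```python
# The two worked examples above, as arithmetic you can adapt.

# Example 1: 99.9% success over 28 days, ~50,000 checkouts/day
attempts = 50_000 * 28                    # 1,400,000 attempts in the window
allowed_failures = attempts * (1 - 0.999)
print(int(round(allowed_failures)))       # 1400 failed checkouts allowed

# Example 2: 95% of searches under 300 ms, 20,000,000 requests/week
weekly_requests = 20_000_000
allowed_slow = weekly_requests * (1 - 0.95)
print(int(round(allowed_slow)))           # 1000000 slow requests allowed
```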


Step 7: Decide what counts against the budget (fairness matters)

Teams abandon SLOs when they feel punished for things they can’t control.

So be explicit:

Count these (usually)

  • your service’s 5xx errors
  • timeouts in your code path
  • internal dependency failures you are responsible for
  • deploy-related issues

Consider excluding or separating

  • planned maintenance (tracked separately)
  • upstream client misuse (bad requests)
  • major third-party outages (sometimes tracked separately)

Pro pattern:
Keep two views:

  • User SLO (end-to-end, includes dependencies)
  • Service SLO (what your team controls directly)

This prevents arguments and keeps reliability honest.
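The two-view pattern is just the same denominator with two different numerators. The failure categories and counts below are invented for illustration:

```python
# Sketch of the "two views" pattern. Categories and counts are
# hypothetical numbers for illustration.
total = 1_000_000
failures = {
    "our_5xx": 600,
    "our_timeouts": 200,
    "third_party_outage": 900,  # e.g. payments provider down
}

# User SLO: end-to-end, everything the user experienced counts.
user_failures = sum(failures.values())
# Service SLO: only what this team controls directly.
service_failures = failures["our_5xx"] + failures["our_timeouts"]

user_slo = 1 - user_failures / total
service_slo = 1 - service_failures / total
print(f"user SLO: {user_slo:.4%}, service SLO: {service_slo:.4%}")
```

When the two numbers diverge sharply, that gap itself is useful: it points at a dependency problem rather than a code problem.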


The best “starter” SLO templates (copy and adapt)

Template A: API success rate SLO (most useful)

  • Journey: “User can complete checkout”
  • SLI: Successful checkout requests / total checkout requests
  • Good event: Order created + no timeout
  • SLO: 99.9% over 28 days
  • Error budget: 0.1% failed checkouts allowed

Template B: Latency SLO (useful when speed is the product)

  • Journey: “User sees search results”
  • SLI: % of search responses under 300 ms
  • SLO: 95% under 300 ms over 7 days
  • Error budget: 5% slow responses allowed

Template C: Streaming / messaging SLO (for delivery systems)

  • Journey: “Message delivered to consumer quickly”
  • SLI: % of messages delivered within 2 seconds
  • SLO: 99% within 2 seconds over 30 days

Now the part that makes SLOs “actually work”: Burn rate + release rules

Most teams stop at “we have a number.”

Working teams add rules.

Error budget burn rate (simple explanation)

Burn rate answers:

“Are we losing our budget too fast?”

If you spend a month’s budget in a day, you’re in trouble.

Practical rule-of-thumb (easy and effective)

  • If you burn 25% of monthly budget in 1 day → investigate immediately
  • If you burn 50% of monthly budget in 3 days → pause risky releases
  • If you burn 100% → release freeze (except reliability fixes)

These rules create predictable behavior and stop surprise outages.
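The rules of thumb above map directly onto a standard burn-rate formula: budget spent as a fraction of the window's budget, divided by the fraction of the window elapsed. A burn rate of 1.0 means you are on pace to spend exactly your budget by the end of the window:

```python
# Burn rate = fraction of budget spent / fraction of window elapsed.
# 1.0 = on pace to land exactly on budget at the end of the window.

def burn_rate(budget_spent_frac: float, days_elapsed: float,
              window_days: float = 28) -> float:
    return budget_spent_frac / (days_elapsed / window_days)

# The rules of thumb above, expressed as burn rates (28-day window):
print(round(burn_rate(0.25, 1), 1))   # 7.0 -> investigate immediately
print(round(burn_rate(0.50, 3), 1))   # 4.7 -> pause risky releases
```

This is why "25% in one day" is alarming: it means you are burning budget seven times faster than sustainable.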


Real examples that show SLO thinking in action

Example 1: Checkout service (99.9% SLO)

You release a new promo code feature. Suddenly:

  • 5xx errors rise
  • success rate drops
  • error budget burn rate spikes

Decision:
Pause further feature rollouts, roll back if needed, fix the promo code path.

Why it works:
The system isn’t blaming anyone. It’s simply enforcing the contract.


Example 2: Kubernetes cluster reliability (avoid this common mistake)

A lot of teams try:

  • “Cluster uptime SLO = 99.99%”

But users don’t care about “cluster uptime.”
They care about whether their requests succeed and are fast.

Better:

  • Service-level SLOs for critical paths
  • Platform SLOs as supporting indicators (like “pod scheduling latency”)

Example 3: Data pipeline SLO (batch)

Journey: “Daily report available by 7 AM”

SLI:

  • % of days report is available by 7 AM

SLO:

  • 99% of days on-time over 90 days

Error budget:

  • 1% late days allowed → about 1 day late per quarter

This is perfect for batch systems where request-level SLIs don’t match reality.
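The batch budget is the same arithmetic with days instead of requests:

```python
# Budget arithmetic for the batch example: 99% on-time over 90 days.
window_days = 90
slo = 0.99
late_days_allowed = window_days * (1 - slo)
print(round(late_days_allowed, 1))  # 0.9 -> roughly one late day per quarter
```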


How many SLOs should you create? (the answer surprises people)

Start with 2–5 SLOs per service, max.

Too many SLOs:

  • confuse teams
  • create alert fatigue
  • reduce trust

The best starting set

  1. Availability / success rate for the main user journey
  2. Latency for the same journey (optional but strong)
  3. A key dependency SLO (optional)

Then expand slowly.


Common SLO mistakes (and how to fix them)

Mistake 1: “We picked 99.99% because it sounds good”

Fix: start with baseline performance, then improve.

Mistake 2: Measuring the wrong thing (health checks, internal endpoints)

Fix: measure user journey events only.

Mistake 3: Latency SLO using average latency

Fix: use percentile or % under threshold.

Mistake 4: No consequences

Fix: define release rules tied to error budget burn.

Mistake 5: Blame games over dependencies

Fix: separate user SLO vs service-controlled SLO.


The easiest 30-day rollout plan (works for most teams)

Week 1: Choose + define

  • pick 1–2 key journeys
  • define good/bad events
  • validate data quality

Week 2: Measure baseline

  • run SLI measurement
  • confirm it matches user reality

Week 3: Set targets + error budget

  • pick achievable targets
  • calculate budget
  • agree on burn rules

Week 4: Operate

  • add alerting on burn rate
  • define release rules
  • publish one “SLO report” internally

By day 30, your SLO isn’t a document—it’s a system.


The final secret: SLOs are a tool for curiosity

The best teams don’t use SLOs to prove they’re good.

They use SLOs to ask better questions:

  • Why did we burn budget this week?
  • Which release changed behavior?
  • Which dependency is hurting us most?
  • What reliability work gives the biggest improvement per engineer-week?

When SLOs create curiosity, they keep people reading, improving, and trusting the system.


Quick “SLO that works” checklist (save this)

  • Based on a real user journey
  • SLI definition matches user success
  • Has a clear time window
  • Target is realistic (baseline-informed)
  • Error budget is calculated and understood
  • Burn rules exist and influence releases
  • Ownership is clear
  • Reviewed regularly and improved over time
