Mohammad Gufran Jahangir, February 7, 2026

At 10:03 AM your CEO posts a campaign on LinkedIn.

At 10:07 AM traffic triples.

At 10:10 AM your API is “up”… but every request takes 9 seconds, carts fail, and support tickets explode.

Your dashboards show CPU at 35%. “So why is it slow?”

That’s what capacity planning is really about.

Not “how many servers do we have?”
But “how much user demand can we handle at the latency we promised—and how quickly can we adapt when reality changes?”

This guide is designed so you can plan capacity even if you’re a beginner, and still feel confident when traffic becomes unpredictable.


What capacity planning actually means (simple definition)

Capacity planning = ensuring your system can handle expected demand (QPS/throughput) while meeting performance goals (latency), with safe headroom, at reasonable cost.

In the cloud, you do this with three levers:

  1. CPU & memory (how much power each instance/pod has)
  2. Scaling (how many instances/pods you run)
  3. Architecture (caching, queues, databases, limits, retries)

Most outages happen when people only adjust lever #2.


The 4 numbers you must understand first

1) QPS (Queries Per Second)

How many requests hit your service each second.

  • Average QPS is almost never the problem.
  • Peak QPS is what breaks systems.

2) Latency (p50 vs p95 vs p99)

Latency isn’t one number. Use percentiles:

  • p50: typical user experience
  • p95: “bad but common” experience
  • p99: “rare but painful” tail

If you plan with averages, you’ll get surprised. Always capacity-plan with p95 or p99.

3) CPU & memory utilization

  • CPU tells you if you’re compute-bound.
  • Memory tells you if you’re risk-bound (OOM kills, GC pauses, cache blow-ups).

CPU can look fine while your service is dying if your bottleneck is:

  • database connections
  • locks/contention
  • queue backlog
  • IO waits
  • slow downstream dependencies

4) Scaling behavior

Scaling is not instant.

  • Cold starts, image pulls, JIT warmup, cache warmup—these are real.
  • If your traffic spike grows faster than you scale, you still melt.

The capacity planning loop (what you’ll do step-by-step)

You’ll do this like an engineer:

  1. Set the performance target (what latency you must meet)
  2. Translate business demand into peak QPS
  3. Measure how much QPS one instance/pod can handle at the target latency
  4. Calculate how many instances/pods you need (with headroom)
  5. Choose scaling signals (CPU? RPS? latency? queue depth?)
  6. Validate with load tests
  7. Operate continuously (because every release changes capacity)

Now let’s do it with real numbers.


Step 1 — Define your SLO (the “promise” you plan around)

Pick one sentence:

  • “p95 latency ≤ 200ms for /checkout at peak traffic”
  • “p99 latency ≤ 500ms for all API requests”
  • “Workers must keep queue delay < 60 seconds”

If you don’t define this, you’ll plan for the wrong thing and optimize blindly.

Beginner tip: start with p95. Move to p99 later.


Step 2 — Convert users/traffic into peak QPS (the part everyone skips)

You usually know one of these:

  • daily active users
  • requests per day
  • transactions per minute
  • marketing event size

Example: you have “3 million requests/day”

Average QPS is:

  • 3,000,000 requests/day ÷ 86,400 seconds/day
  • 34.7 QPS average

But your system doesn’t run on “average.”
Let’s assume:

  • traffic is 5× higher during business hours
  • and you get 3× spikes during campaigns

Peak QPS estimate:

  • 34.7 × 5 × 3 ≈ 520 QPS peak

That number (520) is where capacity planning begins.

Rule of thumb for beginners:
If you don’t know your peak, assume 10× average until you measure real traffic patterns.
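The arithmetic above is easy to wrap in a small helper you can reuse per service. A minimal sketch, using the numbers from this example; the 5× and 3× multipliers are assumptions you should replace with measured traffic patterns:

```python
# Rough peak-QPS estimate from daily request volume.
SECONDS_PER_DAY = 86_400

def peak_qps(requests_per_day, business_hour_factor=5.0, spike_factor=3.0):
    """Average QPS scaled by assumed business-hour and campaign multipliers."""
    average = requests_per_day / SECONDS_PER_DAY
    return average * business_hour_factor * spike_factor

print(peak_qps(3_000_000))  # ≈ 520 QPS peak, as in the example above
```

If you only know average QPS and nothing else, `peak_qps` with a single 10× factor is the beginner rule of thumb in code form.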


Step 3 — Find your bottleneck (because scaling the wrong thing is expensive)

Before math, answer this:

What will hit its limit first when QPS rises?

Common bottlenecks:

  • API CPU (serialization, encryption, heavy computation)
  • Memory (caches, large payloads, leaks, GC)
  • Database CPU/IOPS
  • Database connections (connection pool exhaustion)
  • External dependency rate limits
  • Message broker partitions / consumer lag
  • Lock contention inside your app
  • Network egress or NAT throughput/cost

Quick method: draw the request path:

Client → CDN → LB → API → Cache → DB → External service

The slowest or most limited hop defines your real capacity.


Step 4 — Measure single-instance capacity at your target latency

Here’s the capacity planning secret:

You don’t need perfect forecasting.
You need to know: “How much QPS can one instance handle while staying within SLO?”

Do a simple load test (staging or perf env)

Measure:

  • QPS
  • p95 latency
  • CPU
  • memory
  • error rate
  • downstream saturation (DB connections, cache hit rate, etc.)

You’re looking for the “knee” in the curve:

  • latency stays flat… flat… then suddenly rises.
    That’s saturation.

Real example result

You test one API instance (2 vCPU, 4GB RAM) and observe:

  • At 100 QPS → p95 latency 120ms, CPU 45%
  • At 150 QPS → p95 latency 160ms, CPU 62%
  • At 180 QPS → p95 latency 240ms (SLO violated), CPU 72%

If your SLO is p95 ≤ 200ms, then 150 QPS per instance is a safe working capacity.
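Picking the “knee” from load-test results can be sketched in a few lines: given (QPS, p95 latency) samples, take the highest tested QPS that still meets the SLO. The sample numbers are the ones from the test above:

```python
# Find the highest tested QPS whose p95 latency still meets the SLO.
def capacity_at_slo(samples, slo_p95_ms):
    """samples: list of (qps, p95_latency_ms) from a load test, any order."""
    ok = [qps for qps, p95 in samples if p95 <= slo_p95_ms]
    return max(ok) if ok else 0

samples = [(100, 120), (150, 160), (180, 240)]
print(capacity_at_slo(samples, slo_p95_ms=200))  # → 150
```

In practice you would ramp in smaller QPS steps so the knee is located more precisely, but the decision rule is the same.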


Step 5 — Calculate required instances/pods (with headroom)

The simple formula engineers love

Required instances = Peak QPS ÷ Per-instance QPS (at SLO) × Headroom

Continuing the example:

  • Peak QPS = 520
  • Per-instance QPS at SLO = 150
  • Headroom = 1.3 (30% buffer)

Instances needed:

  • 520 ÷ 150 = 3.47
  • × 1.3 = 4.51
    Round up → 5 instances

That’s your baseline capacity for that endpoint/service.
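The formula translates directly into code; `math.ceil` does the round-up for you. A small sketch using the numbers above:

```python
import math

def required_instances(peak_qps, per_instance_qps, headroom=1.3):
    """Peak QPS ÷ per-instance QPS at SLO, padded with headroom, rounded up."""
    return math.ceil(peak_qps / per_instance_qps * headroom)

print(required_instances(520, 150))  # → 5 instances
```

The same function covers the later Kubernetes examples: `required_instances(600, 170)` and `required_instances(525, 150)` both give 5, matching the worked results in this guide.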

Why headroom matters

Because real life includes:

  • noisy neighbors
  • uneven load balancing
  • deployments rolling pods
  • cache misses after restarts
  • downstream hiccups
  • spikes that exceed your “peak estimate”

Step 6 — Translate QPS into concurrency (this is where latency becomes real)

A very practical relationship:

Concurrency ≈ QPS × Latency(seconds)

If:

  • Peak QPS = 520
  • p95 latency target = 200ms = 0.2s

Concurrency needed:

  • 520 × 0.2 = 104 concurrent in-flight requests

Now you can sanity check:

  • do you have enough worker threads?
  • do you have enough DB connections?
  • is your connection pool sized correctly?
  • can your LB and runtime handle this concurrency?

This is why a service can have low CPU and still be slow:
it’s choking on concurrency limits (threads, sockets, DB connections).
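Little’s Law as a quick calculator, with the running example’s numbers (5 instances is the baseline computed in Step 5):

```python
def inflight(qps, latency_seconds):
    """Little's Law: concurrent in-flight requests ≈ arrival rate × latency."""
    return qps * latency_seconds

total = inflight(520, 0.2)   # 104 concurrent requests across the fleet
per_pod = total / 5          # ≈ 21 concurrent per instance with 5 replicas
print(total, per_pod)
```

Compare `per_pod` against your worker-thread count and DB-connection pool size per instance; if either is smaller, requests queue internally no matter what CPU shows.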


Step 7 — Plan CPU & memory (the safe, practical way)

CPU planning: pick a utilization target

For steady services, many teams aim for:

  • 50–70% CPU at peak (not 95%)

Why not 95%?

  • CPU spikes create queueing delays → tail latency blows up.
  • 95% CPU “looks efficient” but feels slow.

Memory planning: plan for the worst 5 minutes, not the average day

Memory failure is ugly: OOM kills, restarts, cascading overload.

Plan memory with:

  • baseline usage
  • peak usage during bursts
  • cache growth
  • GC overhead / runtime overhead
  • per-request memory allocations

Beginner-friendly rule:
If you don’t understand memory behavior yet, keep 30–50% memory headroom.

Example: memory gotcha

Your pod uses:

  • 800MB steady
  • 1.2GB under peak load
  • 1.6GB during a cache warmup

If you set limit = 1.3GB, you’ll crash during warmups and deployments—right when traffic is high.
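A tiny sizing helper that encodes this rule: set the memory limit from the worst observed burst plus headroom, never from steady state. The 1.3× headroom is an assumed default at the low end of the 30–50% range above:

```python
def memory_limit_mb(observed_peaks_mb, headroom=1.3):
    """Size the memory limit from the worst observed burst, not the average."""
    return round(max(observed_peaks_mb) * headroom)

# steady, peak load, cache warmup (MB) — from the example above
print(memory_limit_mb([800, 1200, 1600]))  # → 2080
```

Sizing from the 1600MB warmup peak yields roughly a 2GB limit; sizing from the 800MB steady state is how you end up with the 1.3GB limit that crashes during deploys.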


Step 8 — Choose scaling signals (CPU isn’t always the right trigger)

Here are the most useful scaling signals, and when to use them.

1) CPU-based scaling

Best when: CPU is your real bottleneck (compute-heavy services)

Bad when: you’re DB-bound or IO-bound—CPU stays low but latency spikes.

2) RPS/QPS-based scaling

Best when: traffic drives work linearly and each request costs similar CPU

Very predictable for APIs.

3) Latency-based scaling

Best when: you care about user experience and latency rises early

But careful: latency can rise due to downstream problems scaling won’t fix.

4) Queue depth / consumer lag (workers)

Best when: async workloads, pipelines, background jobs

This is often the cleanest scaling metric for workers.

Real worker example

You have a queue where each job takes 200ms.

One worker can do:

  • 1 / 0.2 = 5 jobs/sec

Peak incoming jobs = 200 jobs/sec
Workers needed (no headroom):

  • 200 / 5 = 40 workers
    Add headroom 1.25 → 50 workers

Then scale using queue lag:

  • “If queue delay > 60s, add workers”
  • “If delay < 10s for 10 minutes, scale down”
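The worker math above, as a reusable sketch (the 1.25 headroom is the value assumed in the example):

```python
import math

def workers_needed(jobs_per_sec, job_time_sec, headroom=1.25):
    """Workers required to keep up with incoming jobs, with headroom."""
    per_worker = 1 / job_time_sec  # 5 jobs/sec when each job takes 200ms
    return math.ceil(jobs_per_sec / per_worker * headroom)

print(workers_needed(200, 0.2))  # → 50 workers
```

The queue-lag rules (“add workers above 60s delay, shrink below 10s”) then act as the feedback loop around this baseline.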

Step 9 — Plan scaling speed (the part that prevents “we scaled but still died”)

Even perfect scaling rules fail if scaling is slow.

Ask:

  • How long does a new instance/pod take to become ready?
    • image pull
    • app startup
    • cache warmup
    • readiness checks

Example

Traffic spike doubles in 2 minutes.
Your pods take 3 minutes to become ready.

Result: you fall behind, latency climbs, retries amplify load, meltdown.

Fix options:

  • keep higher baseline capacity
  • pre-scale on schedules (known peaks)
  • use predictive scaling
  • reduce startup time (smaller images, faster boot, warm caches)
  • keep minimum replicas above 1–2

Step 10 — Validate with two load tests (so you trust your numbers)

Do two tests:

Test A: “SLO test”

  • ramp to peak QPS
  • confirm p95 and p99 latency meet target
  • confirm error rate stays low

Test B: “Failure test”

  • degrade a dependency (DB slower, cache disabled, one AZ down)
  • confirm system degrades gracefully:
    • timeouts work
    • retries don’t explode
    • circuit breakers protect you
    • queue builds but doesn’t crash

Capacity planning without failure thinking is how you build fragile systems.


The most common capacity planning traps (and what to do instead)

Trap 1: Planning on average QPS

Do this instead: plan on peak + bursts + headroom.

Trap 2: Using p50 latency

Do this instead: plan on p95/p99.

Trap 3: Scaling only the API, not the database

Do this instead: capacity-plan the entire request path.

Trap 4: CPU looks fine so you assume capacity is fine

Do this instead: also track saturation signals:

  • DB connection pool usage
  • queue wait time
  • thread pool saturation
  • cache hit rate drops
  • downstream timeouts

Trap 5: “We’ll autoscale, so we don’t need planning”

Reality: autoscaling is a tool, not a strategy.
You still need:

  • baseline replicas
  • correct metrics
  • safe limits
  • fast startup
  • dependency capacity

A complete real-world example (API service end-to-end)

Scenario

You run a search API.

SLO: p95 ≤ 250ms at peak.

Known data

  • average QPS: 60
  • expected marketing peak: 600 QPS
  • load test shows 1 pod handles 170 QPS at p95 ≤ 250ms
  • startup time: 90 seconds
  • you want 30% headroom

Capacity

  • 600 / 170 = 3.53
  • × 1.3 = 4.59
    → baseline 5 pods

Concurrency sanity check

  • 600 QPS × 0.25s = 150 concurrent requests total
  • per pod: 150 / 5 = 30 concurrent
    If your app can handle 30 concurrent requests per pod, you’re good. If not, tune threads/pools.

Scaling

  • min replicas = 5 (baseline)
  • scale up if RPS per pod > 140 for 2 minutes
  • scale down slowly (avoid thrash)
  • scheduled pre-scale for known events (because startup = 90s)

Dependencies

  • cache hit rate should stay > 85%
  • DB connection pool per pod capped
  • rate limits for external dependency
  • circuit breaker and timeouts enabled

Now you’ve planned capacity like a system, not a guess.


The “cheat sheet” checklist (copy this into your runbook)

Capacity planning checklist

  • Define SLO (p95/p99 latency target + error target)
  • Estimate peak QPS (not average)
  • Map request path (API → cache → DB → external)
  • Load test to find per-instance QPS at SLO
  • Compute instances needed with headroom
  • Validate concurrency (QPS × latency)
  • Choose scaling metric (CPU/RPS/latency/queue)
  • Account for scaling speed (startup/warmup)
  • Validate with SLO + failure tests
  • Set alerts for saturation signals (not just CPU)

Capacity Planning in Kubernetes: CPU/Memory, QPS, Latency, Scaling (the practical playbook)

Kubernetes makes scaling feel easy: “just add replicas.”

But the first time you hit real load, you discover a painful truth:

Kubernetes can scale your pods… while your latency still explodes.

Why? Because in Kubernetes, capacity planning is not just “how many pods.” It’s a full chain:

Pods → Nodes → Network → Dependencies (DB/cache/queues) → User latency

This guide shows you how to capacity-plan in Kubernetes step-by-step, with real examples you can apply immediately.


The 5 metrics that decide your real capacity in Kubernetes

1) QPS (requests per second)

Traffic volume to your service.

2) Latency percentiles (p95/p99)

This is your user experience. Plan with p95 or p99.

3) Pod CPU throttling (not just CPU usage)

Your service may show “CPU usage is fine” but still get slower if it’s being throttled because requests/limits are wrong.

4) Memory headroom + OOM kills

OOM restarts cause cascading failures during spikes.

5) Saturation signals (usually the real bottleneck)

Examples:

  • thread pool saturation
  • DB connection pool exhaustion
  • queue lag
  • cache hit rate drop
  • timeouts to dependencies

CPU is often the last thing to go red.


Step 1 — Define the SLO you’ll plan around (one sentence)

Pick one:

  • “p95 latency ≤ 200ms for /api/search”
  • “p99 latency ≤ 500ms for all endpoints”
  • “Queue delay ≤ 60s for background jobs”

No SLO = no capacity plan. You’ll just guess.


Step 2 — Find the peak QPS you must survive

Don’t use average. Use peak.

Example

You see 3M requests/day.

Average QPS:

  • 3,000,000 ÷ 86,400 ≈ 35 QPS

But peaks exist. Assume:

  • business hour concentration 5×
  • spikes 3×

Peak estimate:

  • 35 × 5 × 3 ≈ 525 QPS

That’s the number you capacity plan around.

Beginner shortcut: if unsure, start with 10× average and refine later.


Step 3 — Measure “pod capacity” (the single most important Kubernetes concept)

You need to know:

How much QPS can one pod handle while meeting the SLO?

This becomes your planning unit.

How to get it (simple and reliable)

Run a load test against a single replica (or isolate one pod) and increase QPS until:

  • p95 latency breaks your SLO, or
  • errors rise, or
  • a dependency saturates.

Real example result

You test one pod of search-api:

  • 120 QPS → p95 140ms (good)
  • 160 QPS → p95 190ms (still good)
  • 190 QPS → p95 260ms (SLO broken)

If SLO is p95 ≤ 200ms, pod capacity is 160 QPS per pod.


Step 4 — Calculate required replicas (with headroom)

Formula:

Replicas = Peak QPS ÷ Pod QPS-at-SLO × Headroom

Example:

  • Peak QPS = 525
  • Pod capacity = 160 QPS
  • Headroom = 1.3 (30%)

Replicas:

  • 525/160 = 3.28
  • × 1.3 = 4.26
    Round up → 5 replicas

That’s your steady baseline for that service at peak.


Step 5 — Convert QPS + latency into concurrency (this explains “low CPU but slow”)

Important relationship:

Concurrency ≈ QPS × Latency(seconds)

Example:

  • 525 QPS
  • p95 target 200ms = 0.2s

Concurrency:

  • 525 × 0.2 = 105 concurrent requests in flight

Per pod (5 pods):

  • 105 / 5 = 21 concurrent per pod

Now check:

  • does your app runtime allow that concurrency?
  • are thread pools or async loops sized correctly?
  • do you have enough DB connections per pod?

If your DB pool is 10 connections per pod but you need 21 concurrent, you’ll queue internally → latency rises → retries → meltdown.

That’s how CPU stays “fine” while users suffer.


Step 6 — Set CPU & memory requests/limits correctly (or HPA lies to you)

In Kubernetes, requests drive scheduling and limits can cause throttling.

CPU: plan for performance, not “max efficiency”

A common safe target:

  • Aim for peak CPU usage around 50–70% of the pod’s effective CPU capacity.

If you set CPU limit too low:

  • Kubernetes throttles CPU → latency increases → QPS per pod drops → you need more pods than expected.

Memory: plan for the worst 5 minutes (deployments & warmups)

Memory is different from CPU:

  • you don’t want to “ride the edge”
  • OOM kills are catastrophic during spikes

Beginner rule:

  • keep 30–50% memory headroom, especially if you use caches or JVM/GC languages.

Example (realistic)

Pod steady memory: 800MB
Under peak: 1.2GB
During cache warmup: 1.6GB

If you set limit = 1.3GB, your pods will restart during deploy/scale events—exactly when traffic is high.


Step 7 — HPA: pick the right scaling signal (CPU is not always the best)

Option A: HPA on CPU

Good when: service is CPU-bound
Bad when: service is DB-bound or IO-bound (CPU stays low but latency spikes)

Option B: HPA on RPS/QPS (recommended for many APIs)

Good when: requests are similar cost and scale linearly
More predictable than CPU.

Option C: HPA on latency (advanced)

Good when: you scale to protect user experience
Risk: latency might be caused by downstream issues scaling won’t solve.

Option D: KEDA on queue depth/lag (best for workers)

For async workloads, scale based on backlog delay or message lag. This is often the cleanest capacity model.
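As a simplified sketch of what queue-depth autoscalers such as KEDA compute: one replica per target chunk of backlog, clamped between minimum and maximum replicas. The backlog and target values below are illustrative, not defaults from any tool:

```python
import math

def desired_replicas(backlog, target_backlog_per_pod, min_r=1, max_r=100):
    """Queue-depth scaling: one replica per `target_backlog_per_pod` pending jobs."""
    want = math.ceil(backlog / target_backlog_per_pod)
    return max(min_r, min(max_r, want))

print(desired_replicas(2400, 50))  # → 48 replicas for a 2400-message backlog
```

Real scalers add cooldowns and stabilization windows on top of this to prevent thrash; the core arithmetic is this simple.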

Step 8 — Don’t forget the other half: Cluster capacity (pods need nodes)

This is where many teams fail:

They compute “need 20 pods,” but forget:

  • do we have enough nodes to schedule them?
  • will Cluster Autoscaler / Karpenter add nodes fast enough?
  • do we have IP capacity?
  • do we have AZ balance?
  • do we have headroom for rollouts?

The two layers of scaling

  1. Pod autoscaling (HPA/KEDA)
  2. Node autoscaling (Cluster Autoscaler / Karpenter)

If pods scale faster than nodes:

  • pods go Pending
  • traffic keeps coming
  • latency explodes

Practical rule

If your traffic can spike in 2 minutes but new nodes take 4–6 minutes to come online:

You must keep more baseline node headroom or pre-scale.
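A back-of-the-envelope check for this failure mode: how many requests arrive above your current capacity while new capacity is still provisioning. The traffic and timing numbers below are illustrative:

```python
def backlog_during_scaleup(spike_qps, capacity_qps, scaleup_seconds):
    """Requests arriving above capacity while new nodes/pods come online."""
    excess = max(0, spike_qps - capacity_qps)
    return excess * scaleup_seconds

# traffic jumps to 1000 QPS, current fleet handles 600, nodes take 5 minutes
print(backlog_during_scaleup(1000, 600, 300))  # → 120000 requests that queue, retry, or fail
```

If that number is large relative to what your queues and timeouts can absorb, you need baseline headroom or pre-scaling, not a faster HPA.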


Step 9 — Plan rollout capacity (deployments are hidden traffic spikes)

During rollouts, you temporarily run extra pods (or lose some).

If your cluster is “just enough,” a rollout causes:

  • Pending pods
  • uneven load
  • cache cold starts
  • latency spikes

Plan capacity with rollout in mind:

  • Ensure spare capacity for at least one extra replica per service (or more for large services)
  • Avoid scaling down aggressively right before deploys

Step 10 — Validate with two tests (the ones that prevent surprises)

Test 1: SLO load test

Ramp to peak QPS and confirm:

  • p95/p99 latency meets target
  • error rate acceptable
  • no dependency saturation

Test 2: Dependency wobble test

Simulate one of:

  • DB slower
  • cache down
  • one AZ unavailable
  • external API throttling

Confirm:

  • timeouts are set
  • retries don’t explode
  • circuit breakers protect
  • service degrades gracefully

Capacity planning that ignores failures builds fragile systems.


The Kubernetes capacity planning “cheat sheet” (copy to your runbook)

For API services

  1. Define SLO (p95/p99)
  2. Estimate peak QPS (not average)
  3. Load test to find QPS per pod at SLO
  4. Replicas = Peak QPS / Pod QPS × 1.3 headroom
  5. Concurrency = QPS × latency → check thread pools & DB pools
  6. Set CPU/memory requests/limits to avoid throttling/OOM
  7. HPA on RPS (often best), CPU if compute-bound
  8. Ensure node autoscaling can keep up (or keep baseline headroom)
  9. Validate with SLO test + failure test

For worker services (queues)

  1. Jobs/sec incoming at peak
  2. One pod throughput (jobs/sec) at safe latency
  3. Pods needed = incoming / per-pod × headroom
  4. Scale with KEDA on lag/backlog delay
  5. Ensure node autoscaler keeps up

A full real example (Kubernetes API)

You run checkout-api.

SLO: p95 ≤ 250ms
Peak traffic estimate: 800 QPS
Load test: 1 pod can handle 180 QPS while keeping p95 ≤ 250ms
Headroom: 1.3

Replicas:

  • 800/180 = 4.44
  • × 1.3 = 5.77 → 6 pods baseline

Concurrency check:

  • 800 QPS × 0.25s = 200 concurrent requests total
  • 200/6 = 33 concurrent per pod

Now check:

  • DB connections per pod (must be ≥ peak concurrency or you need async/queueing strategy)
  • timeouts/retries configured to prevent retry storms
  • node capacity available for 6 pods + rollout headroom

Scaling:

  • HPA on RPS per pod (e.g., target 150 RPS/pod)
  • minReplicas = 6
  • scale-up fast, scale-down slow
  • node autoscaler configured (or keep headroom)

That’s a capacity plan you can defend.


The biggest “aha” to keep forever

Kubernetes capacity planning is about “pod capacity at SLO,” not about CPU graphs.

Once you know QPS per pod at target latency, you can confidently plan replicas, nodes, budgets, and scaling rules.

Capacity Planning in Kubernetes (EKS + AKS + GKE): CPU/Memory, QPS, Latency, Scaling — the practical guide

Here’s the most common Kubernetes surprise:

You scale from 5 pods to 50 pods… and latency still gets worse.

Why? Because in Kubernetes, capacity isn’t “pods.” Capacity is a chain:

Pods (CPU/mem) → Scheduling → Nodes → Autoscaler speed → Network/IPs → Load balancer → Dependencies (DB/cache/queues) → Tail latency (p95/p99)

This blog gives you a step-by-step capacity planning method that works across EKS, AKS, and GKE, with real examples and platform-specific gotchas.


The one concept that makes capacity planning simple

Pod Capacity at SLO

Instead of guessing, you find:

How much QPS can one pod handle while meeting your SLO (p95/p99 latency)?

Once you know QPS-per-pod at SLO, the rest becomes math + safety margins.

If you remember only one thing, remember this:
CPU graphs don’t tell you capacity. “QPS-per-pod at SLO” tells you capacity.


The 6 numbers you should track (for every service)

  1. Peak QPS (not average)
  2. Latency SLO (p95 or p99)
  3. QPS-per-pod at SLO (measured via load test)
  4. Concurrency (QPS × latency)
  5. Requests/limits (CPU + memory) and throttling / OOMs
  6. Scale-up time (pods ready time + nodes ready time)

These 6 numbers alone can prevent most “we scaled but still melted” incidents.


Step 1 — Pick a clear SLO (your planning target)

Choose one sentence per service:

  • “/search p95 ≤ 200ms at peak”
  • “All API calls p99 ≤ 500ms”
  • “Queue delay ≤ 60 seconds for background jobs”

Why this matters:

  • You can handle more QPS if you accept higher latency.
  • You can meet low latency if you cap QPS or add capacity.

Capacity planning is trade-offs, and SLO is the contract.

Step 2 — Convert business demand into Peak QPS

Most teams accidentally plan on average traffic.

Example

You have 3,000,000 requests/day.

Average QPS:

  • 3,000,000 ÷ 86,400 ≈ 35 QPS

But real traffic has peaks. If:

  • business-hour concentration ≈ 5×
  • campaign spikes ≈ 3×

Peak QPS estimate:

  • 35 × 5 × 3 ≈ 525 QPS

Beginner shortcut: if you don’t know peaks yet, assume 10× average until you get real data.


Step 3 — Measure QPS-per-pod at SLO (the “truth number”)

Run a load test and ramp QPS until:

  • p95/p99 crosses your SLO, or
  • errors rise, or
  • a dependency saturates (DB pool, cache miss, queueing)

Real example result (API pod)

SLO: p95 ≤ 200ms

Load test a single pod:

  • 120 QPS → p95 130ms ✅
  • 150 QPS → p95 180ms ✅
  • 180 QPS → p95 240ms ❌

Pod capacity at SLO = 150 QPS per pod

Now you have a real unit of capacity.


Step 4 — Calculate replicas (with headroom)

Use this formula:

Replicas = Peak QPS ÷ (QPS-per-pod at SLO) × Headroom

Example:

  • Peak QPS = 525
  • Pod capacity = 150
  • Headroom = 1.3 (30%)

Replicas:

  • 525/150 = 3.5
  • × 1.3 = 4.55 → round up → 5 replicas baseline

Why headroom is non-negotiable

Because you’ll face:

  • uneven load balancing
  • pod restarts
  • deployments
  • cold caches
  • noisy nodes
  • dependency jitter

If you plan “perfectly,” production will still be imperfect.


Step 5 — Convert QPS + latency into concurrency (explains “CPU is fine but it’s slow”)

This relationship is pure gold:

Concurrency ≈ QPS × Latency(seconds)

Example:

  • Peak QPS = 525
  • Latency target p95 = 200ms = 0.2s

Concurrency:

  • 525 × 0.2 = 105 concurrent requests in flight

With 5 pods:

  • 105/5 = 21 concurrent per pod

Now check your hidden choke points:

  • app worker threads / event loop concurrency
  • DB connection pool per pod
  • HTTP client pool per pod
  • downstream rate limits

This is the #1 reason services slow down at low CPU:
they’re queueing behind limited concurrency (often DB connections).


Step 6 — Set CPU & memory requests/limits so Kubernetes doesn’t sabotage you

CPU: avoid throttling that silently increases latency

If CPU limits are too low, Linux throttles CPU, and your service becomes “slow but not obviously broken.”

Practical guidance:

  • size so that peak CPU stays around ~50–70% of effective capacity
  • keep enough buffer for bursts and GC pauses

Memory: plan for the worst 5 minutes (deploy + warmups)

Memory failures are brutal (OOM kills → restarts → cold caches → cascading failures).

Beginner rule:

  • keep 30–50% memory headroom over observed peak during stress tests.

Reality check: many “random latency spikes” are actually GC pressure or memory churn under load.


Step 7 — Choose the right autoscaling signals (HPA is not one-size-fits-all)

For APIs

Best signals (in order):

  1. RPS/QPS per pod (stable and predictable)
  2. Latency (protects user experience, but can be fooled by downstream issues)
  3. CPU (only if service is truly compute-bound)

For workers (queues)

Use queue-based scaling:

  • queue depth
  • consumer lag
  • queue delay

This is where KEDA shines, because it scales based on external/backlog metrics.


Step 8 — Plan the other half: node capacity (pods need somewhere to land)

This is where EKS/AKS/GKE differences start to matter.

Two layers must keep up:

  1. Pod scaling (HPA/KEDA)
  2. Node scaling (Cluster Autoscaler / Karpenter / VMSS scaling / Node Auto-Provisioning)

If pods scale faster than nodes:

  • pods go Pending
  • traffic keeps coming
  • latency climbs
  • retries amplify load
  • incident begins

The node math engineers actually use

At minimum, estimate:

  • Sum of pod CPU requests + overhead
  • Sum of pod memory requests + overhead
  • Add room for DaemonSets (logging, monitoring, CNI, security agents)
  • Add rollout headroom (deployments temporarily increase pods)

A safe beginner approach:

  • assume 10–20% node overhead plus DaemonSets
  • and keep one node worth of slack per critical node pool
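That node math can be sketched as a rough bin-packing estimate. This assumes 15% per-node overhead for DaemonSets and the system, and an illustrative node shape; it ignores per-node pod/IP limits, which you should also check:

```python
import math

def nodes_needed(pod_cpu, pod_mem_gb, replicas,
                 node_cpu, node_mem_gb, overhead=0.15):
    """Nodes required to fit pod requests, reserving overhead for DaemonSets/system."""
    usable_cpu = node_cpu * (1 - overhead)
    usable_mem = node_mem_gb * (1 - overhead)
    by_cpu = math.ceil(replicas * pod_cpu / usable_cpu)
    by_mem = math.ceil(replicas * pod_mem_gb / usable_mem)
    return max(by_cpu, by_mem)  # the tighter resource decides

# e.g. 6 pods × (1 vCPU, 2GB requests) on 4 vCPU / 16GB nodes
print(nodes_needed(1, 2, 6, 4, 16))  # → 2 nodes
```

Then add rollout headroom and one node of slack per critical pool on top of the result, per the rule above.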

Step 9 — Account for scale-up time (this is where spikes beat autoscaling)

Ask two questions:

  1. How long until a new pod is Ready?
    (image pull, start time, warm caches)
  2. How long until a new node is Ready?
    (provision VM, attach networking, join cluster)

If your traffic doubles in 2 minutes and node scale takes 5 minutes:

  • you must keep a higher baseline or pre-scale.

This is not “overprovisioning.” It’s paying for reaction time.


Step 10 — Validate with the two tests that prevent surprises

Test A: SLO test

Ramp to peak QPS and confirm:

  • p95/p99 within target
  • errors stable
  • no dependency saturation

Test B: wobble test (realistic failure)

Make one dependency worse:

  • DB slower
  • cache disabled
  • one zone impaired
  • external API rate-limits

Confirm:

  • timeouts work
  • retries don’t explode
  • circuit breakers protect
  • system degrades gracefully

Capacity planning without wobble testing builds fragile systems.


Platform-specific capacity planning: EKS vs AKS vs GKE (what changes)

The fundamentals above are identical. The differences are mostly in:

  • node scaling methods
  • networking/IP behavior
  • load balancer/ingress characteristics
  • operational defaults

1) EKS (AWS)

What to watch

  • Node scaling tool choice: Cluster Autoscaler vs Karpenter (Karpenter can scale faster and choose instance types more flexibly, but you must design constraints carefully)
  • Pod IP capacity: AWS VPC CNI means pod IPs come from VPC; you can hit IP exhaustion or per-node pod limits
  • Zone balancing: uneven AZ distribution can create “one AZ full” problems during spikes
  • EBS/volume attach rate: stateful workloads can scale slower due to volume attach/initialize time

Practical EKS habit

  • Treat “available pod IPs” as a first-class capacity metric (right alongside CPU/memory).

2) AKS (Azure)

What to watch

  • Node pools + VMSS scaling: scale speed depends on VMSS provisioning and image/boot time
  • Networking mode matters: Azure CNI vs kubenet affects pod IP planning and scaling characteristics
  • SNAT / outbound constraints: at high connection rates, outbound behavior can become a bottleneck (especially for chatty services)
  • Load balancing behavior: ensure your ingress/LB setup matches your traffic patterns and scale expectations

Practical AKS habit

  • Capacity plan outbound connections and not just CPU—especially for microservices that call many downstream services.

3) GKE (Google Cloud)

What to watch

  • Autopilot vs Standard: Autopilot changes how resources are provisioned and billed; capacity planning focuses more on requests/limits discipline and workload behavior
  • Node Auto-Provisioning: can create new node pools automatically to satisfy scheduling, which helps scale—but you must understand constraints so it doesn’t surprise you
  • Upgrade/maintenance behavior: plan rollout capacity because cluster maintenance can influence available resources

Practical GKE habit

  • Be strict about requests/limits and use them as your “contract,” because GKE automation works best when workloads declare needs correctly.

A single “universal” capacity planning template you can reuse (every service)

Fill this out for each service:

A) SLO

  • Endpoint/workload:
  • Latency target: p95/p99 =
  • Error target:

B) Demand

  • Average QPS:
  • Peak QPS:
  • Peak duration (minutes/hours):

C) Pod capacity (measured)

  • QPS-per-pod at SLO:
  • CPU usage at that point:
  • Memory usage at that point:
  • Any throttling? any GC spikes?
  • Main bottleneck observed:

D) Replica plan

  • Baseline replicas:
  • Max replicas:
  • Headroom %:

E) Scaling plan

  • Primary metric (RPS/CPU/latency/queue lag):
  • Scale-up aggressiveness:
  • Scale-down stability:

F) Node plan

  • Node pool type(s):
  • Time to add a node:
  • Slack strategy (baseline spare nodes / buffer %):
  • IP / outbound / LB constraints noted:

G) Failure mode notes

  • What happens if DB is 2× slower?
  • What happens if cache is down?
  • What happens if one zone is impaired?

This template alone will make your capacity planning “real” instead of “hopeful.”


Two complete real examples (API + worker) that work on EKS/AKS/GKE

Example 1: API service

SLO: p95 ≤ 250ms
Peak QPS: 800
Measured pod capacity: 180 QPS at SLO
Headroom: 30%

Replicas:

  • 800/180 = 4.44
  • × 1.3 = 5.77 → 6 baseline pods

Concurrency:

  • 800 × 0.25 = 200 in-flight
  • 200/6 ≈ 33 per pod
    Check DB pool / thread pools to ensure 33 doesn’t queue.

Scaling:

  • HPA on RPS per pod (target maybe 150–170)
  • minReplicas = 6
  • scale-up fast; scale-down slow

Node plan:

  • ensure cluster has room for 6 pods + rollout headroom + daemonsets
  • ensure node autoscaler can add capacity before spike overtakes you

Example 2: Queue workers

Peak incoming jobs: 240 jobs/sec
One worker pod throughput: 6 jobs/sec (measured safely)
Headroom: 25%

Pods:

  • 240/6 = 40
  • × 1.25 = 50 worker pods

Scaling:

  • KEDA on queue lag (e.g., keep delay < 60s)
  • prevent thrash with cooldown settings

Node plan:

  • ensure node autoscaler can add nodes for 50 pods quickly
  • keep baseline nodes if spikes are sudden

The “never get surprised again” checklist (EKS/AKS/GKE)

  • Plan on peak, not average
  • Use p95/p99, not p50
  • Measure QPS-per-pod at SLO (the truth number)
  • Convert to concurrency and check pools/limits
  • Avoid CPU throttling and memory OOMs
  • Choose scaling metrics that match your bottleneck
  • Validate node autoscaling speed and keep baseline slack if needed
  • Watch platform-specific constraints (IP, outbound, LB behavior)
  • Test “dependency wobble” before production traffic does

Capacity Planning in Kubernetes for EKS + AKS: CPU/Memory, QPS, Latency, Scaling (the practical, real-world playbook)

If you’ve ever said any of these…

  • “CPU is only 40%… why is latency 8 seconds?”
  • “HPA scaled pods but half are Pending.”
  • “We added nodes but the service still died.”

…then you already know the secret:

In Kubernetes, capacity planning is not “how many pods.”
It’s how much traffic you can handle at your latency target, considering:

Pods → Nodes → Networking/IPs → Load balancer → Dependencies → Tail latency (p95/p99)

This guide is built specifically for EKS and AKS, with step-by-step actions and examples you can actually use.


The 6 numbers that define capacity (write these on your whiteboard)

For every service, you need:

  1. Peak QPS (not average)
  2. Latency SLO (p95 or p99)
  3. Pod capacity at SLO = QPS-per-pod while meeting SLO
  4. Concurrency = QPS × latency(seconds)
  5. CPU throttling / Memory OOM risk (requests/limits reality)
  6. Scale-up time (pod-ready time + node-ready time)

Once you know these, capacity planning becomes predictable.


Step 1 — Define the SLO (the promise you’re planning around)

Pick a sentence that your users would notice:

  • “/search p95 ≤ 200ms at peak”
  • “/checkout p95 ≤ 250ms, error rate < 0.5%”
  • “Worker queue delay ≤ 60 seconds”

Why this matters:
You can always “handle more QPS” if you allow slower responses.
Capacity planning is the art of staying fast under load.


Step 2 — Convert “traffic” into Peak QPS (don’t plan on averages)

Example: you have 3,000,000 requests/day

Average QPS:

  • 3,000,000 ÷ 86,400 ≈ 35 QPS

But real systems spike. Assume:

  • business-hour concentration = 5×
  • campaign spike = 3×

Peak QPS:

  • 35 × 5 × 3 = 525 QPS peak

Beginner rule: if you don’t know peaks yet, start with 10× average and refine later.
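
A minimal sketch of that estimate, matching the article’s rounding (the average is rounded to 35 QPS before the multipliers are applied):

```python
requests_per_day = 3_000_000
avg_qps = requests_per_day / 86_400   # ~34.7 requests/sec

business_hours_factor = 5   # traffic concentrated in working hours
campaign_factor = 3         # marketing spike stacked on top

peak_qps = round(avg_qps) * business_hours_factor * campaign_factor
print(round(avg_qps), peak_qps)   # 35 525
```

The multipliers here are assumptions you should replace with your own observed peak-to-average ratios as soon as you have real traffic data.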


Step 3 — Measure Pod Capacity at SLO (the “truth number”)

This is the most important step.

You load-test a pod and find:

“One pod can handle X QPS while keeping latency under SLO.”

Example result

SLO: p95 ≤ 200ms

Load test one pod:

  • 120 QPS → p95 130ms ✅
  • 150 QPS → p95 180ms ✅
  • 180 QPS → p95 240ms ❌

So your pod capacity at SLO = 150 QPS per pod

Now you’re no longer guessing.
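
One way to pull the “truth number” out of load-test results is a small helper like this (the function name and data shape are hypothetical, not part of any tool):

```python
def pod_capacity_at_slo(results, slo_p95_ms):
    """Return the highest tested QPS whose measured p95 still meets the SLO."""
    passing = [qps for qps, p95_ms in results if p95_ms <= slo_p95_ms]
    return max(passing) if passing else 0

# (QPS, measured p95 in ms) points from the example above
results = [(120, 130), (150, 180), (180, 240)]
print(pod_capacity_at_slo(results, slo_p95_ms=200))  # 150
```

Note this picks the best *tested* point; in practice you would ramp in finer steps around the breaking point to tighten the number.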


Step 4 — Calculate required replicas (with headroom)

Formula:

Replicas = Peak QPS ÷ Pod QPS-at-SLO × Headroom

Example:

  • Peak QPS = 525
  • Pod capacity = 150
  • Headroom = 1.3 (30%)

Replicas:

  • 525/150 = 3.5
  • × 1.3 = 4.55 → round up → 5 baseline pods

Why headroom is non-negotiable

Because real life includes:

  • uneven load balancing
  • rolling deployments
  • cold caches
  • noisy nodes
  • dependency wobble
  • sudden micro-spikes inside “peak”
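
The formula is small enough to keep as a helper next to your capacity worksheet (a sketch; `required_replicas` is a hypothetical name):

```python
import math

def required_replicas(peak_qps, pod_qps_at_slo, headroom=1.3):
    """Replicas = Peak QPS / Pod QPS-at-SLO x Headroom, rounded up."""
    return math.ceil(peak_qps / pod_qps_at_slo * headroom)

print(required_replicas(525, 150))   # 5  (matches the worked example)
print(required_replicas(800, 180))   # 6
```

Rounding up matters: 4.55 pods is not a thing, and rounding down silently deletes your headroom.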

Step 5 — Convert QPS + latency into concurrency (this explains “CPU is fine but slow”)

Use this always:

Concurrency ≈ QPS × Latency(seconds)

Example:

  • Peak QPS = 525
  • SLO p95 latency = 200ms = 0.2s

Concurrency:

  • 525 × 0.2 = 105 concurrent in-flight requests

With 5 pods:

  • 105/5 = 21 concurrent per pod

Now check the hidden choke points:

  • thread pool size / async concurrency
  • DB connection pool per pod
  • HTTP client connection pool per pod
  • rate limits to downstream services

Most “mystery latency” is just queueing behind one of these limits.
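
Little’s law makes this check mechanical. A sketch with this step’s numbers and a hypothetical per-pod DB pool size:

```python
def concurrency(qps, latency_s):
    """Little's law: in-flight requests ~ arrival rate x time in system."""
    return qps * latency_s

total = concurrency(525, 0.2)   # 105.0 in-flight at peak
per_pod = total / 5             # 21.0 per pod with 5 replicas
print(total, per_pod)

# If any per-pod pool is smaller than per-pod concurrency, requests
# queue behind it and p95 rises while CPU stays low.
db_pool_per_pod = 20            # hypothetical pool setting
print(per_pod > db_pool_per_pod)  # True -> this pool would queue
```

Run the same comparison against thread pools, HTTP client pools, and downstream rate limits, not just the database.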


Step 6 — Set CPU & memory requests/limits so Kubernetes doesn’t betray you

CPU: watch for throttling

If CPU limits are too tight, your service gets throttled and p95 latency creeps up even when CPU dashboards look “not terrible.”

Practical guidance:

  • design for peak CPU around 50–70% of effective capacity
  • avoid tiny CPU limits for latency-sensitive services

Memory: plan for the worst 5 minutes

OOM kills trigger restarts → cold caches → more load → cascading failure.

Beginner-safe guidance:

  • keep 30–50% memory headroom
  • test memory during spikes + rollouts + warmups
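
One way to turn measured usage into requests is to work backwards from a target utilization. The numbers below are hypothetical illustrations of the 50–70% CPU and 30–50% memory guidance, not recommendations:

```python
# Measured at SLO-level load in a load test (hypothetical values)
measured_cpu_cores = 0.7
measured_mem_mib = 400

target_cpu_util = 0.6   # run at ~60% of requested CPU at peak
mem_headroom = 1.4      # 40% memory headroom for spikes/rollouts

cpu_request = measured_cpu_cores / target_cpu_util   # ~1.17 cores
mem_request = measured_mem_mib * mem_headroom        # 560 MiB
print(round(cpu_request, 2), int(mem_request))       # 1.17 560
```

For latency-sensitive services, the resulting CPU *limit* (if you set one at all) should sit well above the request so bursts absorb into slack instead of throttling.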

Step 7 — Choose scaling signals (HPA/KEDA) that match reality

For APIs (most common)

Best order for scaling signals:

  1. RPS/QPS per pod (predictable)
  2. Latency (protects user experience but can be fooled by downstream slowness)
  3. CPU (only if truly compute-bound)

For workers (queue-driven)

Scale on:

  • queue lag / delay
  • backlog depth
  • consumer lag

Queue-based scaling is usually cleaner than CPU scaling for workers.
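
Queue-based autoscalers (KEDA-style) effectively derive desired replicas from backlog divided by a per-pod target. A simplified sketch of that reasoning — not KEDA’s actual code:

```python
import math

def desired_workers(backlog_msgs, msgs_per_pod_target):
    """Scale the worker fleet with the backlog, independent of CPU."""
    return math.ceil(backlog_msgs / msgs_per_pod_target)

print(desired_workers(1200, 100))  # 12 workers for a 1200-message backlog
print(desired_workers(50, 100))    # 1 -- small backlog, minimal fleet
```

This is why CPU is a poor signal for workers: a worker blocked on a slow downstream burns little CPU while the backlog (the thing users feel) keeps growing.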


Step 8 — The big one: node capacity & scale speed (pods need nodes)

You have two scaling layers:

  1. Pod scaling (HPA/KEDA)
  2. Node scaling (Cluster Autoscaler / Karpenter / VMSS scaling)

If pods scale faster than nodes:

  • pods go Pending
  • traffic keeps arriving
  • latency spikes
  • retries amplify load
  • your incident starts

Capacity planning is incomplete until you’ve planned node scale speed.


Now the platform reality: EKS vs AKS (what changes, what to watch)

Everything above applies to both.
What differs is where things usually break first.


EKS capacity planning: the “3 classic pain points”

1) Pod IP capacity (AWS VPC CNI)

In EKS, pod IPs are often allocated from the VPC network space. You can hit:

  • IP exhaustion
  • per-node pod limits (depending on instance networking)

Symptom: HPA scales, nodes scale, but pods still can’t come up (or scheduling gets weird).
Capacity action: treat “available pod IPs” as a first-class capacity metric (right beside CPU/memory).

2) Node autoscaling choice: Cluster Autoscaler vs Karpenter

  • Cluster Autoscaler: solid, but can be slower and more conservative.
  • Karpenter: often faster and more flexible, but you must design constraints carefully.

Symptom: pods are Pending too long during spikes.
Capacity action: measure “time to add ready node” and keep baseline slack if spikes are faster than node scale.

3) Stateful scale time (EBS volume attach / initialization)

Stateful pods often scale slower due to:

  • volume provisioning time
  • attach/detach time
  • filesystem initialization

Symptom: stateless scales fine; stateful services lag and become the bottleneck.
Capacity action: plan stateful scaling separately; keep baseline replicas higher for stateful services.


AKS capacity planning: the “3 classic pain points”

1) Networking mode affects IP planning (Azure CNI vs kubenet)

AKS networking choice changes:

  • how pod IPs are allocated
  • how large your subnets need to be
  • how fast scheduling can grow

Symptom: sudden scheduling failures or constraints during growth.
Capacity action: include subnet/IP planning in your capacity workbook.

2) Outbound SNAT / connection pressure

High QPS microservices often create huge outbound connection counts (to DBs, caches, other services, external APIs).
Outbound constraints can become a bottleneck before CPU does.

Symptom: timeouts to external services, weird intermittent failures under high concurrency.
Capacity action: capacity-plan outbound connections, not just CPU.

3) VMSS scale speed + node image / boot time

AKS nodes scale through VMSS behavior. Startup time matters a lot.

Symptom: HPA reacts quickly, but nodes take too long → Pending pods → latency spike.
Capacity action: measure node readiness time and match it against how fast your traffic spikes.


Two complete, real examples (EKS/AKS compatible)

Example A — API service (RPS-driven)

Service: search-api
SLO: p95 ≤ 200ms
Peak QPS forecast: 900 QPS
Load test result: 1 pod holds 180 QPS at SLO
Headroom: 30%

Baseline replicas

  • 900/180 = 5
  • × 1.3 = 6.5 → 7 pods baseline

Concurrency check

  • 900 × 0.2 = 180 in-flight
  • 180/7 ≈ 26 per pod
    Now validate:
  • thread pool supports ~26 concurrent per pod
  • DB pool per pod won’t bottleneck

Scaling

  • HPA based on RPS per pod (target maybe 150–170 RPS/pod)
  • scale up fast; scale down slow
  • keep baseline replicas ≥ 7 for predictable p95

Platform watch-outs

  • EKS: confirm pod IP capacity for +20 pods during spikes
  • AKS: confirm outbound connection pressure won’t throttle dependencies

Example B — Worker service (queue-driven)

Workload: email-worker
Peak jobs: 240 jobs/sec
One worker pod safely processes: 6 jobs/sec
Headroom: 25%

Pods:

  • 240/6 = 40
  • × 1.25 = 50 worker pods

Scaling:

  • KEDA based on queue delay/lag
  • keep cooldown so it doesn’t thrash

Node planning:

  • ensure node autoscaler can add capacity for 50 pods quickly
  • if nodes take 6 minutes and spikes are immediate, keep baseline slack
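
The worker math, plus a quick drain-time sanity check against the queue-delay target, as a Python sketch (the drain check assumes intake pauses while the backlog clears):

```python
import math

peak_jobs_per_sec = 240
pod_jobs_per_sec = 6     # measured safely, without error spikes
headroom = 1.25

worker_pods = math.ceil(peak_jobs_per_sec / pod_jobs_per_sec * headroom)
print(worker_pods)       # 50

# Sanity check: a 60-second burst queues 14,400 jobs. At full fleet
# throughput (300 jobs/sec) that backlog clears in 48s -- inside a <60s
# delay target, assuming no new jobs arrive during the drain.
burst_backlog = peak_jobs_per_sec * 60
drain_rate = worker_pods * pod_jobs_per_sec
print(burst_backlog / drain_rate)   # 48.0
```

If intake does *not* pause, only the 60 jobs/sec of surplus capacity drains the backlog, so sustained overload recovers far more slowly — another argument for headroom.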

The universal “Capacity Planning Worksheet” (use for every service)

Fill this once per service and you’ll be ahead of most teams.

A) SLO

  • Endpoint/workload:
  • Latency target (p95/p99):
  • Error target:

B) Demand

  • Avg QPS:
  • Peak QPS:
  • Peak duration:

C) Pod capacity (measured)

  • QPS-per-pod at SLO:
  • CPU usage at that QPS:
  • Memory usage at that QPS:
  • Any throttling? Any OOM risk?
  • Bottleneck observed (DB pool / CPU / cache miss / outbound):

D) Replica plan

  • Baseline replicas:
  • Max replicas:
  • Headroom %:

E) Scaling plan

  • Primary metric (RPS/latency/CPU/queue lag):
  • Scale-up behavior (fast):
  • Scale-down behavior (slow/stable):

F) Node plan

  • Node pool(s):
  • Time to add ready node:
  • Slack strategy (buffer nodes / buffer %):
  • Platform constraints (EKS IPs / AKS outbound):

The “don’t get surprised” checklist for EKS + AKS

  • Plan on peak, not average
  • Use p95/p99, not p50
  • Measure QPS-per-pod at SLO (truth number)
  • Convert to concurrency and validate pools/limits
  • Prevent CPU throttling + memory OOMs
  • Scale on the right signal (often RPS for APIs, lag for workers)
  • Ensure node scale speed can keep up (or keep baseline slack)
  • EKS: track pod IP capacity like CPU
  • AKS: track outbound connection pressure like CPU
  • Test a “dependency wobble” scenario before production does

Quick final takeaway

Capacity planning for EKS/AKS becomes easy when you stop asking:

“How many pods do we need?”

…and start asking:

“How much QPS can one pod handle at our p95/p99 target, and how fast can the cluster add nodes when reality spikes?”

Below is a single end-to-end “Capacity Planning Flow” that covers everything we discussed for Kubernetes on EKS + AKS (CPU/memory, QPS, latency, scaling, node capacity, and platform-specific checks). Use it as a runbook.


1) Master Flow (One-page, end-to-end)

START
  ↓
[Define SLO]
  - pick p95/p99 latency target + error target
  ↓
[Estimate Peak Demand]
  - peak QPS (not average) + burst factor + peak duration
  ↓
[Map Critical Path]
  Client → Ingress/LB → Service → Cache → DB → External deps
  ↓
[Load Test to Find “Pod Capacity at SLO”]
  - ramp QPS on 1 pod until SLO breaks or dependency saturates
  - record: QPS-per-pod @ SLO, CPU, memory, bottleneck
  ↓
[Compute Required Replicas]
  replicas = peakQPS / podQPS@SLO × headroom(1.2–1.5)
  ↓
[Concurrency Sanity Check]
  concurrency = peakQPS × latencySeconds
  perPodConcurrency = concurrency / replicas
  - validate thread pools / DB pool / HTTP pools / downstream limits
  ↓
[Set Pod Resources]
  - tune requests/limits to avoid CPU throttling + OOM
  - confirm no throttling at peak and memory headroom exists
  ↓
[Choose Scaling Signal]
  API: prefer RPS/QPS-per-pod → (CPU only if compute-bound)
  Workers: prefer queue lag/backlog (KEDA-style)
  ↓
[Node Capacity Plan]
  - can current nodes schedule baseline + rollout headroom + daemonsets?
  - can node autoscaler add nodes before spike overtakes you?
  ↓
[Platform Branch Checks]
  EKS: pod IP capacity + AZ balance + EBS attach time
  AKS: subnet/IP mode + outbound/SNAT + VMSS scale speed
  ↓
[Validate]
  Test A: peak SLO load test
  Test B: dependency wobble test (DB slower/cache down/zone issue)
  ↓
[Operate]
  dashboards + alerts + weekly review + re-baseline after releases
END

2) Detailed Flow for API Services (EKS + AKS)

API FLOW
  ↓
1) Set API SLO (p95/p99)
  ↓
2) Peak QPS forecast (include spikes)
  ↓
3) Load test → find QPS-per-pod @ SLO
     - also capture: CPU throttling? memory spikes? main bottleneck?
  ↓
4) Replicas baseline:
     baselinePods = ceil(peakQPS / podQPS@SLO × 1.3)
  ↓
5) Concurrency check:
     totalConcurrency = peakQPS × latencySeconds
     perPod = totalConcurrency / baselinePods
     - verify DB connections >= realistic concurrency OR add caching/queues
  ↓
6) Resource tuning:
     - set requests so scheduler places pods predictably
     - avoid tight CPU limits (throttling → tail latency rises)
     - keep memory headroom (OOM → restart → cold cache → cascading)
  ↓
7) HPA signal:
     - prefer RPS/QPS-per-pod target
     - CPU target only if truly CPU-bound
     - latency scaling is optional (advanced)
  ↓
8) Node readiness:
     - do nodes scale fast enough?
     - keep baseline node slack if spikes are sudden
  ↓
9) Platform checks:
     - EKS: pod IPs, AZ balance, LB behavior, EBS attach delays
     - AKS: subnet/IP sizing, outbound connections/SNAT, VMSS speed
  ↓
10) Validate:
     - peak load test + wobble test
END

3) Detailed Flow for Worker / Queue Services (EKS + AKS)

WORKER FLOW
  ↓
1) Define worker SLO:
     - queue delay target (e.g., <60s) or lag target
  ↓
2) Peak incoming rate:
     - jobs/sec at peak + burst factor
  ↓
3) Measure per-pod throughput safely:
     - jobs/sec per pod without error spikes or dependency saturation
  ↓
4) Pod count:
     pods = ceil(peakJobsPerSec / podJobsPerSec × 1.25)
  ↓
5) Scale signal:
     - queue lag/delay/backlog (best)
     - CPU is usually misleading for workers
  ↓
6) Node planning:
     - can nodes scale quickly enough to host the max pods?
     - keep baseline slack if jobs spike sharply
  ↓
7) Platform checks:
     - EKS: pod IP capacity for burst pods
     - AKS: outbound connection pressure if workers call many services
  ↓
8) Validate:
     - burst test + dependency wobble test
END

4) Platform Branch: what to check before you trust scaling

EKS Branch Checks

EKS CHECKS
  ↓
A) Pod IP capacity available?
   - enough subnet/IPs and per-node pod capacity
  ↓
B) Node autoscaling method?
   - can it add nodes fast enough for your spike speed?
  ↓
C) AZ balancing OK?
   - avoid “one AZ full” during bursts
  ↓
D) Stateful scaling delays?
   - EBS volume provisioning/attach time considered?
END

AKS Branch Checks

AKS CHECKS
  ↓
A) Networking mode/subnet sizing OK?
   - enough IP space for burst pods/nodes
  ↓
B) Outbound connection pressure OK?
   - high QPS can mean huge outbound connections → timeouts
  ↓
C) VMSS node scale speed OK?
   - node readiness time vs spike growth time
END

5) “What do I do when latency is high but CPU is low?” (Debug flow)

This is the most common real incident pattern.

LATENCY↑ CPU↓
  ↓
1) Check saturation signals:
   - DB pool usage? thread pool? HTTP client pool? queueing?
  ↓
2) Check dependency latency:
   - DB/cache/external API slowed? throttling/rate-limits?
  ↓
3) Check retries/timeouts:
   - retry storms can multiply traffic under latency
  ↓
4) Check pod throttling:
   - CPU limits too tight → throttling → tail latency rises
  ↓
5) Check node/pod pending:
   - pods scaling but Pending due to lack of nodes/IPs
  ↓
6) Platform-specific:
   - EKS: IP shortage/AZ imbalance
   - AKS: outbound/SNAT/connection pressure
END

6) The “Ready-to-Use” checklist flow (copy into your runbook)

  • ✅ SLO defined (p95/p99 + error target)
  • ✅ Peak QPS/jobs/sec estimated with burst factor
  • ✅ QPS-per-pod (or jobs-per-pod) measured at SLO
  • ✅ Baseline pods calculated with headroom
  • ✅ Concurrency computed and pools validated (DB/thread/HTTP)
  • ✅ CPU limits not causing throttling; memory has headroom
  • ✅ Scaling signal chosen (RPS/lag preferred; CPU only when right)
  • ✅ Node scale speed measured; baseline slack set if needed
  • ✅ EKS checks: pod IP capacity + AZ balance + stateful delays
  • ✅ AKS checks: subnet/IP + outbound connection pressure + VMSS speed
  • ✅ Peak test + wobble test passed
  • ✅ Operating cadence: re-baseline after major releases
