Shipping code is easy. Shipping code without breaking production is the skill.
Most teams eventually end up asking the same question:
“Which deployment strategy should we standardize on—Blue/Green, Canary, or Rolling?”
The honest answer: all three are useful—but in different situations, with different risks, and different operational costs.
By the end of this guide, you’ll be able to:
- instantly recognize which strategy fits your release
- run it step-by-step
- avoid the classic “it worked in staging” trap
- design rollbacks that actually work under pressure
Let’s make this practical.
The simplest way to think about deployments (one mental model)
All three strategies are just different ways to answer two questions:
- How many versions run at the same time? (one or two)
- How do we shift traffic? (all-at-once, gradual, or replace-in-place)
Once you see that, the choice becomes obvious.
Quick definitions (beginner-friendly)
Rolling deployment
You replace servers/pods gradually: old → new, a few at a time.
Users hit a mix of versions during the rollout.
Traffic shifting: happens automatically as instances are replaced.
Blue/Green deployment
You run two complete environments:
- Blue = current production
- Green = new version (ready, warmed up)
Then you do a switch (usually at load balancer / routing layer).
Traffic shifting: mostly “flip” from blue to green.
Canary deployment
You release the new version to a small percentage of users first (the “canary”).
If metrics look good, you increase traffic gradually: 1% → 5% → 25% → 50% → 100%.
Traffic shifting: progressive and controlled.
The fastest cheat sheet (when to use what)
Choose Rolling when…
- you want the simplest standard approach
- your app is stateless (or mostly)
- you’re okay with a short period where users hit mixed versions
- rollback needs to be quick but not “instant flip”
Great for: most internal services, APIs, frequent small releases.
Choose Canary when…
- failures are expensive (checkout, auth, payments)
- you need early proof with real traffic
- you want controlled rollout with automated “stop if bad”
- you’re optimizing for safety over speed
Great for: high-impact customer flows, ML/feature behavior changes, performance-sensitive services.
Choose Blue/Green when…
- you need near-instant cutover/rollback
- you’re doing a big change (framework upgrade, infra change, config overhaul)
- you must test the “new world” in production-like conditions before switching
- you can afford running two environments briefly
Great for: major releases, migrations, risky changes, strict SLAs.
Decision matrix (practical, not theoretical)
Ask these 7 questions and you’ll know the answer:
- Can you run two environments at once?
- Yes → Blue/Green or Canary become easier
- No → Rolling is your default
- Do users tolerate mixed versions?
- Yes → Rolling is fine
- No → Prefer Canary or Blue/Green
- How fast must rollback be?
- Seconds/minutes → Blue/Green
- Minutes → Canary
- Minutes (with some disruption) → Rolling
- Is this change risky or user-facing?
- High risk → Canary or Blue/Green
- Low/medium → Rolling
- Do you have strong monitoring + alerts?
- Strong metrics → Canary works beautifully
- Weak metrics → Rolling or Blue/Green, but you’re flying blind (fix observability first)
- Do you have DB schema changes?
- Most DB changes require special handling (we’ll cover this)
- Blue/Green is not “magic” if the DB breaks backward compatibility
- Do you need traffic shaping by user segment?
- Yes → Canary (by % or by cohort) is best
Strategy 1: Rolling deployments (step-by-step)
What rolling looks like in real life
Imagine you have 10 pods of orders-api running v1.
A rolling update might replace:
- 2 pods → v2
- then 2 more
- then 2 more
- until all are v2
For a short time, users hit both versions.
Rolling deployment step-by-step (safe version)
- Confirm backward compatibility
- v2 should work even if some calls still come from v1 components.
- Set rollout limits
- Replace only a small number at a time (avoid taking too much capacity).
- Deploy
- Let new instances come up healthy before killing old ones.
- Watch 4 signals
- Error rate, latency, saturation (CPU/mem), and logs for new exceptions.
- Pause or rollback if signals worsen
- Stop the rollout immediately if errors spike.
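To make steps 2–4 concrete, here’s a minimal Kubernetes sketch, assuming a Deployment named orders-api (the image and probe endpoint are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: orders-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # bring up at most 2 new pods above the desired count
      maxUnavailable: 1   # never lose more than 1 pod of serving capacity
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:v2  # placeholder image
          readinessProbe:         # gate traffic until the pod is actually ready
            httpGet:
              path: /healthz      # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```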
Real example: Rolling is perfect here
You run an internal “profile service” with frequent releases:
- small code changes
- good test coverage
- stateless API
- low blast radius
Rolling is the best default: simple, fast, cheap.
Rolling pros
- simplest operationally
- no need to run double capacity
- works well with frequent shipping
Rolling cons (the gotchas)
- users can hit mixed versions (hard when contracts change)
- rollback can be slower than a traffic flip
- if the new version is bad, it has already spread to part of the fleet by the time you notice
Rolling “failure mode” you must avoid
Long startup + high traffic
If new instances take time to warm up (JIT, cache, DB pools), replacing too fast can cause a temporary outage.
Fix: slow rollouts + readiness checks + warmup endpoints.
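If Kubernetes is your platform, a startupProbe is the cleanest way to express “this instance needs time to warm up” (the endpoint and timings below are illustrative):

```yaml
# Add to the container spec: tolerates a slow warmup without
# restarting the pod or sending it traffic early.
startupProbe:
  httpGet:
    path: /healthz        # hypothetical warmup/health endpoint
    port: 8080
  failureThreshold: 30    # up to 30 checks * 10s = 5 minutes to warm up
  periodSeconds: 10
```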
Strategy 2: Blue/Green deployments (step-by-step)
What Blue/Green looks like in real life
You run:
- Blue: 10 instances, v1 (serving 100% traffic)
- Green: 10 instances, v2 (serving 0% traffic)
You validate green. Then you switch traffic: 100% → green.
Blue/Green deployment step-by-step (the reliable way)
- Provision Green
- same capacity, same config, same routing rules, production-like.
- Deploy v2 to Green
- ensure health checks pass.
- Warm Green
- load caches, establish DB connections, compile templates, etc.
- Run “production smoke tests” against Green
- login, core endpoints, critical flows, synthetic tests.
- Cutover
- switch routing from Blue to Green (LB, DNS, gateway, service mesh).
- Watch metrics intensely for a short window
- if stable, keep Green as production.
- Keep Blue for fast rollback
- don’t destroy it immediately; keep it as the escape hatch.
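If your routing layer supports weighted backends, the cutover in step 5 is literally a one-line change. Here’s a sketch using the Kubernetes Gateway API as one example (gateway and backend names are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments-route
spec:
  parentRefs:
    - name: public-gateway     # hypothetical Gateway
  rules:
    - backendRefs:
        - name: payments-blue  # current production
          port: 8080
          weight: 100          # set to 0 at cutover
        - name: payments-green # new version, validated and warm
          port: 8080
          weight: 0            # set to 100 at cutover (flip back to roll back)
```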
Real example: Blue/Green shines here
You are upgrading payments-service:
- major framework upgrade
- changes in TLS settings
- new dependencies
- stricter latency SLO
You want:
- full validation before users see it
- instant rollback if anything feels off
Blue/Green is the calm, controlled approach.
Blue/Green pros
- near-instant rollback (flip back)
- clean separation between versions
- great for big changes and production-like validation
Blue/Green cons (the real costs)
- expensive (double capacity, even if briefly)
- requires reliable control over traffic switching
- DB changes can ruin it (more on that next)
Strategy 3: Canary deployments (step-by-step)
What Canary looks like in real life
You start by sending:
- 99% traffic → v1
- 1% traffic → v2
Then gradually increase:
1% → 5% → 25% → 50% → 100%
Canary deployment step-by-step (the safe, modern way)
- Define canary success metrics (before deploying)
- Example thresholds:
- error rate not worse than baseline by X%
- latency p95 not worse by Y ms
- CPU not pegged
- Deploy canary (small slice)
- 1% traffic or a small cohort (internal users, beta accounts).
- Observe
- watch for real user behavior + performance changes.
- Bake time
- don’t rush. Some bugs appear after caches fill or traffic patterns shift.
- Progressive rollout
- increase traffic gradually if stable.
- Automatic rollback
- if thresholds fail, return traffic to stable version.
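Progressive-delivery controllers automate exactly this loop. A sketch using Argo Rollouts as one example (name, image, weights, and pauses are all placeholders; exact traffic percentages need a mesh or ingress integration, otherwise weights are approximated by pod counts):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: recommendations
spec:
  replicas: 10
  selector:
    matchLabels:
      app: recommendations
  template:
    metadata:
      labels:
        app: recommendations
    spec:
      containers:
        - name: recommendations
          image: registry.example.com/recommendations:v2  # placeholder
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 15m}  # bake time: let caches fill, watch metrics
        - setWeight: 25
        - pause: {duration: 30m}
        - setWeight: 50
        - pause: {duration: 30m}  # then promote to 100% if thresholds hold
```

Automated rollback (step 6) plugs in here via metric analysis between steps; if a check fails, the controller returns traffic to stable.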
Real example: Canary is the best here
You’re changing recommendation logic in an e-commerce app.
- Not a crash bug, but could reduce conversion rates.
- It might impact only certain segments.
- You want controlled exposure and fast stop.
Canary lets you test with real traffic while keeping risk contained.
Canary pros
- lowest risk for high-impact systems
- catches issues that staging never finds
- supports cohort-based rollouts (powerful for product changes)
Canary cons (what teams underestimate)
- you need excellent observability and alerting
- you must pick good metrics (not just CPU)
- it’s slower than rolling if you do it properly
- you need good traffic routing control (LB/gateway/mesh)
The “DB problem” (why deployments fail even with perfect strategy)
No deployment strategy can save you if your database migration is unsafe.
Here are the two rules that prevent the most common deployment disasters:
Rule 1: Make DB changes backward compatible first
If v1 and v2 run simultaneously (Rolling/Canary), then:
- DB schema must support both versions during rollout.
Pattern: Expand → Migrate → Contract
- Expand: add new columns/tables without breaking old code
- Migrate: backfill data, dual-write if needed
- Contract: remove old columns only after all services use new schema
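Here’s the pattern as a sketch, written as Liquibase YAML changesets purely for illustration (table and column names are hypothetical; each phase ships as its own release):

```yaml
databaseChangeLog:
  # Expand: additive only; v1 code keeps working untouched
  - changeSet:
      id: expand-add-email-normalized
      author: platform-team
      changes:
        - addColumn:
            tableName: users
            columns:
              - column:
                  name: email_normalized
                  type: varchar(255)
  # Migrate: backfill while v1 and v2 run side by side (app dual-writes meanwhile)
  - changeSet:
      id: migrate-backfill-email-normalized
      author: platform-team
      changes:
        - sql:
            sql: UPDATE users SET email_normalized = LOWER(email) WHERE email_normalized IS NULL
  # Contract: only after every service reads the new column
  - changeSet:
      id: contract-drop-legacy-email
      author: platform-team
      changes:
        - dropColumn:
            tableName: users
            columnName: email
```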
Rule 2: Avoid “destructive” changes during rollout
Examples:
- dropping a column immediately
- changing column meaning
- renaming fields without compatibility layer
If you must do risky schema changes:
- Canary + strong compatibility patterns
- or Blue/Green with separate DB strategy (but that’s advanced)
Real-world examples (which strategy should you pick?)
Example A: Checkout API (money is involved)
Pick: Canary
Why: a small bug has a huge cost. Canary gives safe exposure and controlled rollback.
Example B: Internal admin dashboard
Pick: Rolling
Why: low risk, fast iteration, minimal operational complexity.
Example C: Massive version upgrade + config overhaul
Pick: Blue/Green
Why: you want full validation and instant rollback.
Example D: High traffic service with long-lived connections (websockets)
Pick: Canary or Blue/Green (with drain/connection handling)
Avoid: aggressive rolling without proper draining.
Example E: Batch workers / async jobs
Pick: Rolling or Canary
Tip: make sure old jobs, new workers, and message formats stay compatible during the rollout.
The “hidden” factor: what’s your rollback plan?
Most teams say “rollback is easy” until an incident proves otherwise.
Here’s what “good rollback” looks like for each:
Rolling rollback
- pause rollout immediately
- roll back to previous version
- accept that some users might have seen partial impact
Canary rollback
- shift traffic back to stable instantly
- keep canary running for debugging (optional)
- prevent a repeat by blocking further promotion until the issue is fixed
Blue/Green rollback
- flip traffic back to Blue
- Green stays for investigation
- safest “panic button” if switching is reliable
Common mistakes (and how to avoid them)
Mistake 1: Choosing Canary without good metrics
Fix: define success metrics before deploying:
- error rate + latency + saturation + business KPIs (if relevant)
Mistake 2: Rolling too fast
Fix: slow it down, limit concurrency, and bake longer for critical services.
Mistake 3: Blue/Green without warmup
Fix: pre-warm caches, DB pools, and run smoke tests before cutover.
Mistake 4: Forgetting dependency compatibility
Fix: assume other services will call you during rollout. Keep contracts stable.
Mistake 5: Thinking “deployment strategy” replaces testing
Fix: it’s a safety net, not a substitute. You still need unit/integration/e2e tests.
Practical “choose your default” recommendation (for most teams)
If you’re building standards for a platform team:
- Default: Rolling (simple + fast for most services)
- For critical services: Canary (gated promotions + automated rollback)
- For major risky releases: Blue/Green (clean cutover + instant rollback)
This “3-lane highway” works extremely well in real orgs.
Final one-page summary (save this)
- Rolling = replace gradually, simple, cheap, mixed versions
- Canary = small % first, safest for critical systems, needs strong metrics
- Blue/Green = two environments, instant cutover/rollback, costs more, DB needs care
The best teams don’t argue which one is “best.”
They build the ability to use the right one at the right time—and make it repeatable.
So how does this map to the platform you actually run on? Here’s the same decision, runtime by runtime.
1) Kubernetes (most flexible: Rolling + Canary + Blue/Green all common)
Best default
✅ Rolling (default for most services)
Use when:
- stateless services
- frequent releases
- you can tolerate mixed versions briefly
How it’s typically done:
- Kubernetes Deployment with readiness/liveness probes
- Gradual pod replacement via rolling update settings
Safe rolling settings mindset
- replace few pods at a time
- ensure readiness is strict (don’t send traffic early)
- use PodDisruptionBudgets to keep capacity
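The PodDisruptionBudget piece, as a minimal sketch (label and threshold are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
spec:
  minAvailable: "80%"   # evictions/node drains can't take you below 80% capacity
  selector:
    matchLabels:
      app: orders-api
```

Note the split of responsibilities: the Deployment’s rollingUpdate settings pace the rollout itself, while the PDB protects capacity from node drains and evictions happening at the same time.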
When Kubernetes should use Canary
✅ Canary (best for critical APIs: auth, payments, checkout)
Use when:
- risk is high
- you want early detection using real traffic
- you have good metrics (errors, latency, saturation)
How it’s typically done:
- weighted traffic split via:
- service mesh (Istio/Linkerd)
- gateway/ingress controller with traffic weights
- rollout controllers (progressive delivery)
Common traffic steps
- 1% → 5% → 25% → 50% → 100%
Auto rollback triggers
- error rate, latency p95, or app-specific KPIs
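With a mesh, the split is declarative. A sketch using Istio as one example (assumes a DestinationRule already defines the stable and canary subsets; names are placeholders):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout              # in-mesh service host
  http:
    - route:
        - destination:
            host: checkout
            subset: stable  # defined in a DestinationRule (not shown)
          weight: 95
        - destination:
            host: checkout
            subset: canary
          weight: 5         # promote by editing weights: 5 → 25 → 50 → 100
```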
When Kubernetes should use Blue/Green
✅ Blue/Green (best for big risky releases)
Use when:
- you want near-instant rollback
- big upgrades, config overhaul, runtime change
- strict SLA, low tolerance for partial rollout
How it’s typically done:
- Two versions deployed side-by-side (blue + green)
- Flip the Service selector / routing to green
- Keep blue for fast rollback
Key requirement
- your warm-up and smoke tests must run against green before cutover
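The flip itself is one field on the Service, assuming both Deployments carry a version label (a minimal sketch; names are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payments-service
spec:
  selector:
    app: payments-service
    version: blue   # cutover: change to "green"; rollback: change it back
  ports:
    - port: 80
      targetPort: 8080
```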
Kubernetes quick rule
- Default: Rolling
- Critical: Canary
- Big risky: Blue/Green
2) VMs (works best with Immutable deployments + Load Balancers)
With VMs, “rolling” usually means instance replacement, not in-place patching.
Best default
✅ Rolling (using instance replacement)
Use when:
- you’re using Auto Scaling / instance groups
- you can replace VMs gradually without downtime
How it’s typically done:
- Create new VM images (immutable build)
- Replace VMs gradually behind a load balancer
- Drain connections before terminating old instances
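On AWS, for example, this maps to a rolling update policy on an Auto Scaling group. A CloudFormation sketch (the launch template is assumed to exist elsewhere in the stack and to point at the newly baked image):

```yaml
Resources:
  AppServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones: !GetAZs ""
      MinSize: "10"
      MaxSize: "12"
      DesiredCapacity: "10"
      LaunchTemplate:
        LaunchTemplateId: !Ref AppLaunchTemplate  # hypothetical, holds the new image
        Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: 2            # replace 2 instances at a time
        MinInstancesInService: 8   # never drop below 8 serving instances
        PauseTime: PT5M            # wait 5 minutes between batches
```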
Strong tip
- Avoid in-place upgrades on long-lived VMs for production
- Prefer “bake image → replace instances”
When VMs should use Canary
✅ Canary (great when you can weight traffic)
Use when:
- you can route a small % of traffic to a new pool
- you want proof before broad rollout
How it’s typically done:
- Create a small “canary” group of VMs
- Route 1–5% traffic to that group via load balancer weights
- Promote gradually based on metrics
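On an Application Load Balancer, for instance, the split is a weighted forward rule. A CloudFormation sketch (listener and target groups are assumed to exist elsewhere in the stack):

```yaml
Resources:
  CanarySplitRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn: !Ref AppListener   # hypothetical listener
      Priority: 10
      Conditions:
        - Field: path-pattern
          Values: ["/*"]
      Actions:
        - Type: forward
          ForwardConfig:
            TargetGroups:
              - TargetGroupArn: !Ref StableTargetGroup
                Weight: 95
              - TargetGroupArn: !Ref CanaryTargetGroup
                Weight: 5             # promote by shifting weight over
```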
When VMs should use Blue/Green
✅ Blue/Green (very common + very effective on VMs)
Use when:
- you can afford two stacks temporarily
- you want instant rollback
- you’re doing a major change
How it’s typically done:
- Blue ASG (old) + Green ASG (new)
- Flip load balancer target group from blue → green
- Roll back by flipping back
VM quick rule
- If you have a load balancer + autoscaling: Blue/Green is easiest
- If you have weighted routing: Canary is safest
- If you’re just replacing instances gradually: Rolling is fine
3) Serverless (Lambda / Functions): Canary is king
Serverless doesn’t “roll” instances like pods/VMs. It’s mainly about versions + traffic shifting.
Best default
✅ Canary (best overall)
Use when:
- you want safer deploys without needing two environments
- you rely on monitoring + automatic rollback
How it’s typically done:
- Publish a new function version
- Shift traffic gradually using an alias/router:
- 1% → 5% → 25% → 50% → 100%
- Rollback = shift alias back to previous version
This is the cleanest serverless model.
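Frameworks can automate the shift-and-watch loop for you. A sketch using AWS SAM as one example (function and alarm are placeholders); CodeDeploy shifts traffic on the alias and rolls back automatically if the alarm fires:

```yaml
Transform: AWS::Serverless-2016-10-31
Resources:
  CheckoutFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.12
      CodeUri: src/
      AutoPublishAlias: live             # traffic shifting happens on this alias
      DeploymentPreference:
        Type: Canary10Percent10Minutes   # 10% first, then 100% after 10 minutes
        Alarms:
          - !Ref FunctionErrorAlarm      # hypothetical alarm; failure triggers rollback
```

The built-in presets are coarser than the 1% → 5% → … ladder above; linear presets get closer, and you can always shift the alias weights yourself for full control.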
Serverless Blue/Green (also common)
✅ Blue/Green (instant cutover)
Use when:
- you’re doing a bigger change and want a hard switch
- you still want instant rollback
How it’s typically done:
- Old version = blue
- New version = green
- Alias switch from blue → green in one step
What “Rolling” means in serverless
Rolling is not the same concept here. Serverless rolling is basically:
- “deploy new version + shift traffic”
So in practice: rolling = canary-style traffic shifting.
The DB reality (applies to Kubernetes + VMs + Serverless)
If old and new versions can run at the same time (Rolling/Canary), you must do:
Expand → Migrate → Contract
- Expand: add new fields/tables safely
- Migrate: backfill + dual-write if needed
- Contract: remove old fields only after full cutover
This prevents the most common “deployment strategy didn’t save us” failures.
Practical “standard policy” you can adopt today
Kubernetes standard
- Rolling for normal services
- Canary for tier-1 critical services
- Blue/Green for risky major upgrades
VM standard
- Blue/Green for most production releases (fast rollback)
- Canary for high-risk changes (if weighted routing exists)
- Rolling for low-risk replacements (instance refresh)
Serverless standard
- Canary by default
- Blue/Green when you need instant flip
- “Rolling” = traffic shifting anyway