You’ve seen it happen.
Everything looks fine… until users start complaining:
- “The app is slow.”
- “Checkout failed.”
- “I can’t log in.”
- “It works for me.”
And then the worst part: you don’t know where to look first.
That’s what observability solves.
Not “more dashboards.”
Not “more logs.”
Not “more tools.”
Observability is the ability to answer new, unexpected questions about your system by looking at the data it produces.
This guide will make it simple and practical:
- What logs, metrics, and traces really are
- When each one is best
- What to instrument first (so you get real value fast)
- Real examples you can copy into your system today

The simplest way to understand observability
Imagine your system is a busy airport.
- Metrics tell you how the airport is doing overall (number of flights, average delays).
- Logs tell you what happened in specific moments (flight AA123 gate change, baggage error).
- Traces tell you the journey of one passenger (security → gate → boarding → takeoff), including every stop and delay.
All three matter. But they answer different questions.
1) Metrics: the “vital signs” of your system
What metrics are
Metrics are numbers over time.
Examples:
- Requests per second (RPS)
- Error rate (%)
- Latency (p50, p95, p99)
- CPU / Memory
- Queue length
- DB connections
- Cache hit rate
What metrics are best for
Metrics are best for:
- Detecting that something is wrong fast
- Seeing trends
- Setting alerts
- Measuring SLOs (availability, latency targets)
A real example (you’ll recognize this)
Users complain “it’s slow.”
Your metrics show:
- p95 latency jumped from 200ms → 2,500ms
- Error rate stayed at 0.2%
- RPS stayed the same
That tells you:
- It’s not a traffic spike
- It’s not a crash
- It’s a performance problem
Metrics answered “what’s happening” quickly.
But not “why”.
2) Logs: the “black box recorder” of events
What logs are
Logs are detailed event records.
Examples:
- “User 8321 login failed: invalid password”
- “Payment API timeout after 3 retries”
- “DB query returned 0 rows”
- “OutOfMemoryError while parsing file”
Logs often include context:
- timestamps
- user id
- request id
- error message
- stack trace
- payload metadata (careful with sensitive data)
What logs are best for
Logs are best for:
- Understanding what exactly happened
- Debugging error details
- Investigating “weird” edge cases
- Auditing important events
A real example
Your metrics alert: error rate spikes from 0.2% → 7%.
Logs show:
- payment_gateway: 401 unauthorized
- right after a deployment
- with message: “token missing”
Now you know:
- The bug is in auth header handling
- It started after the new release
Logs answered “what happened and where”.
But logs still might not show:
- Which upstream call caused the failure
- Which microservice introduced latency
- The full path of one user request
That’s where traces win.
3) Traces: the “story of one request across services”
What traces are
A trace follows one request end-to-end, across services.
A trace is made of spans.
Each span is a timed operation like:
- API Gateway → Service A
- Service A → Service B
- Service B → Database query
- Service B → Cache
- Service A → Payment provider
You can see:
- total request time
- each hop time
- where time is being spent
- where errors happened
- dependencies involved
What traces are best for
Traces are best for:
- Microservices debugging (“where is latency coming from?”)
- Dependency issues (DB, cache, external APIs)
- Understanding request flow
- Finding the true bottleneck quickly
A real example
Users complain checkout is slow.
Metrics show:
- p95 checkout latency = 3 seconds
- CPU/memory normal
- DB normal
Traces show:
- checkout-service: total 3s
- inventory-service: 40ms
- pricing-service: 60ms
- payment-provider: 2.7s ← culprit
- db-write: 80ms
Now you know:
- Slow external dependency is dominating latency
- Fix is not in Kubernetes scaling
- Fix is timeouts, retries, fallbacks, payment provider performance, or caching
Traces answered “why”.
Logs vs Metrics vs Traces: when to use what (quick cheat sheet)
When you should use metrics
- “Is something wrong right now?”
- “Is performance getting worse?”
- “Should we alert?”
- “Are we meeting SLOs?”
When you should use logs
- “What was the exact error?”
- “Which user / input triggered it?”
- “What happened during that time?”
- “What is the stack trace?”
When you should use traces
- “Where is the latency coming from?”
- “Which dependency caused the failure?”
- “How does this request travel across services?”
- “Why are only some users affected?”
The big truth: most teams instrument in the wrong order
They start like this:
✅ “Let’s log everything!”
Then: logs become huge, expensive, noisy, and still hard to use.
Or they start like this:
✅ “Let’s make dashboards!”
Then: dashboards look nice, but don’t answer “why”.
The best approach is:
Instrument in this order:
- Golden Metrics (fast detection)
- Structured Logs (fast diagnosis)
- Traces (fast root cause, especially for microservices)
Let’s do it step-by-step.
What to instrument first (a practical roadmap)
Step 1: Instrument the 4 Golden Signals (Day 1 win)
These 4 metrics make you effective immediately:
- Latency (p50/p95/p99)
- Traffic (RPS / requests)
- Errors (rate, count, 4xx/5xx)
- Saturation (CPU/memory/queue/DB connections)
Example for a web API
- http_requests_total{service="api",route="/checkout"}
- http_request_duration_ms_p95{service="api",route="/checkout"}
- http_errors_total{service="api",route="/checkout",code="5xx"}
- cpu_utilization{service="api"}
- memory_utilization{service="api"}
- db_connections_in_use{service="api"}
If you only do this, you already get:
- meaningful alerting
- health visibility
- trend analysis
This is the fastest “time-to-value” in observability.
Step 2: Add just enough logs (but make them structured)
The mistake
Most logs look like this:
ERROR something broke
That is useless under pressure.
What you want instead: structured logs
Structured logs are logs with consistent fields.
Example (conceptually):
- level: ERROR
- service: checkout-service
- route: /checkout
- user_id: 8321 (only if safe)
- error_code: PAY_401
- message: “payment token missing”
- request_id: abc-123
- trace_id: t-987 (more on this soon)
- duration_ms: 240
Now you can:
- filter by route
- group by error_code
- find top causes
- connect logs to traces
“Log only what you’ll search”
The best logs answer these:
- what failed?
- for which route/function?
- for which dependency?
- what error code?
- what request id / trace id?
- how long did it take?
Keep logs high-signal, not high-volume.
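A structured log like the one above can be produced with the standard library alone. The sketch below attaches a JSON formatter to Python’s `logging` module; the field names (`service`, `error_code`, `request_id`, and so on) are examples matching this article, not a standard schema.

```python
import json
import logging

# Minimal JSON formatter: pulls well-known fields off the record if present.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", None),
            "route": getattr(record, "route", None),
            "error_code": getattr(record, "error_code", None),
            "request_id": getattr(record, "request_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields become searchable keys instead of free text.
logger.error(
    "payment token missing",
    extra={"service": "checkout-service", "route": "/checkout",
           "error_code": "PAY_401", "request_id": "abc-123",
           "duration_ms": 240},
)
```

Every line now has the same shape, so filtering by `route` or grouping by `error_code` becomes trivial in any log backend.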
Step 3: Instrument traces where they matter most (not everywhere first)
Tracing everything at 100% sampling is expensive and noisy.
Instead, instrument strategically.
Start tracing in 3 places
- Ingress / API gateway (start of the request)
- Service-to-service calls (HTTP/gRPC)
- Database + external calls (where latency usually hides)
The “aha” moment
Once you add tracing, you can answer:
- “Is the slowdown internal or external?”
- “Which dependency dominates p95?”
- “Which service is introducing retries?”
That’s why tracing becomes your fastest root cause tool in microservices.
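To see why spans make bottlenecks obvious, here is a toy tracer (illustrative only; real systems use OpenTelemetry or a vendor SDK). Each span records how long one step of a request took; the service names and sleep durations are invented to simulate a slow dependency.

```python
import time
from contextlib import contextmanager

class Tracer:
    def __init__(self):
        self.spans = []  # (name, duration_ms), appended as spans finish

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, (time.perf_counter() - start) * 1000))

tracer = Tracer()
with tracer.span("checkout"):                 # the whole request
    with tracer.span("inventory-service"):
        time.sleep(0.01)
    with tracer.span("payment-provider"):
        time.sleep(0.05)                      # simulated slow dependency

# The slowest child span points straight at the bottleneck.
children = [s for s in tracer.spans if s[0] != "checkout"]
slowest = max(children, key=lambda s: s[1])
print(slowest[0])  # payment-provider
```

That single `max()` over spans is the whole value proposition of tracing: no guessing, the timing data names the culprit.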
Real-life incident walkthrough (how you use all three)
Scenario: “Checkout is slow for some users”
Step A — Metrics (detect)
- p95 latency for /checkout went from 300ms → 3,000ms
- error rate unchanged
- traffic unchanged
Conclusion:
- performance regression, not outage
Step B — Traces (locate)
Traces show:
- payment-provider span is 2.6s for slow requests
- internal services remain ~100ms
Conclusion:
- external dependency is slow
Step C — Logs (confirm details)
Logs show:
- retry count increased
- timeouts happening
- fallback disabled due to config change
Conclusion:
- config change increased retries + no fallback
Fix:
- reduce retry attempts
- add timeout
- enable fallback/circuit breaker
- potentially cache idempotent calls
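The mitigation pattern from that fix list (bounded retries plus a graceful fallback) can be sketched in a few lines. This is an assumption-laden illustration: `flaky_provider` stands in for the real payment call, and per-attempt timeouts are presumed to be set on the call itself.

```python
# Bounded retries with a fallback instead of an unbounded stall.
def with_resilience(call, retries=1, fallback=None):
    for _ in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            continue                    # retry a bounded number of times
    if fallback is not None:
        return fallback()               # degrade gracefully
    raise TimeoutError("dependency unavailable")

def flaky_provider():
    raise TimeoutError  # simulate the provider timing out on every attempt

result = with_resilience(flaky_provider,
                         fallback=lambda: {"status": "queued"})
print(result)  # {'status': 'queued'}
```

The point is that the user gets a fast, defined answer (“payment queued”) instead of waiting out a 3-second dependency stall multiplied by retries.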
Metrics detect → traces locate → logs confirm.
That’s the ideal flow.
What “good instrumentation” looks like (the minimum standard)
1) Every request must have an ID
Call it:
- request_id
- correlation_id
This lets you find all logs for one user’s request.
2) Every request should have a trace_id (if tracing is on)
Then logs + traces connect.
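A minimal sketch of that ID plumbing: accept an incoming correlation ID at the edge, or mint a fresh one, then attach it to logs and pass it downstream. The header name `X-Request-ID` is a common convention, not a requirement.

```python
import uuid

def handle_request(headers):
    # Reuse the caller's ID if one arrived; otherwise mint a new one.
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())
    log_context = {"request_id": request_id, "route": "/checkout"}
    downstream_headers = {"X-Request-ID": request_id}  # propagate it
    return log_context, downstream_headers

ctx, out = handle_request({"X-Request-ID": "req-8f1c"})
print(ctx["request_id"])  # req-8f1c — the caller's ID is preserved
```

Because the same ID appears in every log line and every downstream hop, one grep (or one query) reconstructs a single user’s entire request.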
3) Every service should expose a health and performance view
At minimum:
- latency by route
- error rate by route
- traffic
- saturation
4) Every dependency call should be measured
Because dependencies cause most pain:
- DB query time
- cache hit rate
- external API latency + error rate
Common beginner mistakes (and how to avoid them)
Mistake 1: Logging too much
Symptom: you can’t find anything, costs explode
Fix: structured logs + sampling + only log what you search
Mistake 2: Only using averages
Symptom: average latency looks fine while users suffer
Fix: always track p95/p99, not just avg
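A quick numeric demonstration of why averages lie: 95 fast requests and 5 very slow ones produce an average that looks tolerable while the tail (what those 5 users actually feel) is terrible. The latency values are invented for illustration.

```python
from statistics import mean, quantiles

latencies = [100] * 95 + [5000] * 5      # mostly fast, a painful tail
avg = mean(latencies)
p95 = quantiles(sorted(latencies), n=20)[18]  # 19th cut point = p95
print(avg)  # 345.0 — "looks fine" on a dashboard
print(p95)  # thousands of ms — the real user pain
```

Same data, two very different stories; only the percentile tells the true one.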
Mistake 3: Dashboards with no decisions
Symptom: dashboards exist but nobody acts
Fix: each dashboard should answer “what changed?” and “what do we do?”
Mistake 4: Tracing without context
Symptom: traces exist but don’t show business value
Fix: add key attributes: route, tenant, region, feature flag (carefully)
Mistake 5: No ownership
Symptom: alerts fire, but nobody knows who should fix
Fix: alerts should map to services and on-call ownership
Observability for different architectures (quick guidance)
If you have a monolith
Start with:
- golden metrics
- slow query logs
- request logs with duration + status
Then add traces later if needed.
If you have microservices
Start with:
- golden metrics per service
- distributed tracing early (even sampled)
- structured logs with trace_id
If you run batch jobs / pipelines
Start with:
- job duration
- success/failure count
- queue/backlog
- resource usage
Then add logs for failures and traces for pipeline stages.
The fastest “first week” implementation plan
Day 1–2: Metrics you can alert on
- latency p95
- error rate
- traffic
- saturation
Day 3–4: Structured logging
- consistent fields
- request_id
- clear error codes
- dependency error logging
Day 5–7: Tracing (targeted)
- gateway → services
- DB + external calls
- sampling strategy (start small)
You’ll be more effective in one week than many teams are in six months.
Final takeaway (the one you should remember)
If observability feels overwhelming, remember this order:
Metrics tell you something is wrong.
Traces tell you where it’s wrong.
Logs tell you what exactly happened.
Start with the golden signals.
Then add structured logs.
Then add traces where you have distributed complexity.
That combination keeps you calm when production isn’t.
Observability 101 for AWS, Azure, and GCP
Logs vs Metrics vs Traces (and what to instrument first)
You know that feeling when something breaks in production and everyone asks:
- “Is it down or just slow?”
- “Which service?”
- “Only some users or everyone?”
- “Did the last deployment do this?”
And you open your monitoring tool… and it’s a maze.
Observability is how you turn that chaos into a calm, repeatable process.
In this guide, you’ll learn (in plain English):
- What logs, metrics, and traces actually are
- Which one answers which question (fast)
- What to instrument first to get real value quickly
- Exactly how this maps to AWS, Azure, and GCP
- Real examples (with copy-paste friendly structures)
No fluff. No theory-only talk. Just the stuff that makes you effective.
The 10-second definition (engineer-friendly)
Observability = being able to explain what your system is doing from the outside, using the data it produces.
Think of it like this:
- Metrics = “How bad is it?” (numbers over time)
- Logs = “What exactly happened?” (events and details)
- Traces = “Where did time go?” (a request’s journey across services)
If you remember only one line, remember this:
Metrics detect. Traces locate. Logs explain.
1) Metrics: your system’s vital signs
What metrics are
Metrics are numbers measured over time, usually aggregated.
Examples:
- Requests per second
- Error rate (%)
- Latency (p50 / p95 / p99)
- CPU / memory
- Queue depth
- DB connections
- Cache hit rate
When metrics are the best tool
Metrics win when you need to answer:
- “Is something wrong right now?”
- “Is it getting worse?”
- “Should I alert?”
- “Are we meeting our reliability targets?”
The classic real-world moment
Users say “Checkout is slow.”
Metrics show:
- p95 latency jumped from 250ms → 2,800ms
- traffic stayed normal
- error rate stayed normal
That tells you:
✅ It’s real
✅ It’s performance
✅ It’s not a traffic spike
✅ It’s not (yet) a full outage
But metrics don’t tell you why.
That’s the next pillar.
2) Logs: the detailed “what happened” story
What logs are
Logs are event records with context.
Examples:
- “Payment request timed out”
- “DB query failed”
- “Token missing”
- “Retry attempt 2”
- “User creation succeeded”
Logs are powerful only if they’re searchable
If your logs are:
- inconsistent
- unstructured
- missing identifiers
…then they become a noisy diary nobody can use under stress.
What logs are best for
Logs win when you need to answer:
- “What is the exact error message?”
- “Which input caused it?”
- “Which dependency failed?”
- “Was this a known exception or a new one?”
The most useful logging upgrade you can make (today)
Switch from “random text logs” to structured logs.
Instead of:
ERROR payment failed
Prefer:
{
"level": "ERROR",
"service": "checkout-service",
"route": "/checkout",
"error_code": "PAY_TIMEOUT",
"message": "Payment provider timeout",
"duration_ms": 2400,
"request_id": "req-8f1c",
"trace_id": "tr-91ab",
"user_tier": "premium"
}
Now you can filter, group, and correlate.
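For example, with structured JSON logs, “find the top error causes” becomes a one-liner instead of a grep expedition. The log lines below are illustrative, reusing the `error_code` field from the example above.

```python
import json
from collections import Counter

lines = [
    '{"level": "ERROR", "error_code": "PAY_TIMEOUT"}',
    '{"level": "ERROR", "error_code": "PAY_TIMEOUT"}',
    '{"level": "ERROR", "error_code": "DB_CONN_REFUSED"}',
]
# Count errors by their stable code, not by free-text message.
top = Counter(json.loads(line)["error_code"] for line in lines).most_common(1)
print(top)  # [('PAY_TIMEOUT', 2)]
```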
But logs still don’t show you the end-to-end path of a request.
That’s what traces do.
3) Traces: the “where time went” microscope
What traces are
A trace follows one request across services.
A trace is made of spans (timed steps), like:
- API Gateway → checkout-service
- checkout-service → inventory-service
- checkout-service → payment-provider
- checkout-service → database write
What traces are best for
Traces win when you need to answer:
- “Which service is slow?”
- “Which dependency is the bottleneck?”
- “Why is p95 bad but p50 is fine?”
- “Where did the errors start?”
Real example (microservices reality)
Metrics say checkout p95 is 3s.
Traces reveal:
- inventory call: 40ms
- pricing call: 70ms
- payment provider: 2.7s ← culprit
- DB write: 90ms
Now you know scaling pods won’t fix it.
You need timeouts, retries, fallbacks, caching, or vendor escalation.
Logs vs Metrics vs Traces: the practical cheat sheet
Use metrics when…
- you need fast detection
- you need alerting
- you need trends and SLO tracking
Use logs when…
- you need error details
- you need debugging context
- you need audit trails and event evidence
Use traces when…
- you need root cause in distributed systems
- you need dependency breakdown
- you need latency bottleneck identification
What to instrument first (the best order for beginners)
Most teams do this wrong by starting with “log everything.”
The best order (especially for AWS/Azure/GCP teams) is:
Step 1: Metrics (Golden Signals)
Step 2: Structured Logs (with correlation IDs)
Step 3: Traces (targeted, then expanded)
Why this order works:
- Metrics give you instant visibility
- Logs give you instant diagnosis
- Traces give you instant root cause when systems are distributed
Now let’s make it concrete.
The Golden Signals (instrument these first)
These 4 signals are the fastest path to useful observability:
- Latency (p50/p95/p99)
- Traffic (requests/sec, throughput)
- Errors (rate, count, 4xx/5xx)
- Saturation (CPU/memory/queue/DB connections)
Minimal starter metrics for an HTTP API
- requests_total by route/status
- request_duration_ms p95/p99 by route
- errors_total by route/status
- CPU/memory utilization
- dependency latency/error (DB, cache, external API)
If you do only this, you’ll already:
- catch incidents faster
- stop guessing
- create useful alerts
Correlation: the glue that makes everything “click”
If you want your observability to feel magical, do this:
Every request should carry:
- request_id (correlation ID)
- trace_id (if tracing enabled)
Then:
- Metrics tell you which route/service is failing
- You click a slow request trace
- You jump to the exact logs for that trace/request
This is the difference between “tools” and “superpowers.”
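Once every log line carries the trace_id, the “click from trace to logs” step is just a filter. The entries below are illustrative, reusing the `trace_id` field described above.

```python
logs = [
    {"trace_id": "tr-91ab", "level": "ERROR", "message": "payment timeout"},
    {"trace_id": "tr-91ab", "level": "INFO",  "message": "retry attempt 2"},
    {"trace_id": "tr-7777", "level": "INFO",  "message": "user created"},
]

def logs_for_trace(entries, trace_id):
    # Every log line for one end-to-end request, in one query.
    return [e for e in entries if e["trace_id"] == trace_id]

for entry in logs_for_trace(logs, "tr-91ab"):
    print(entry["level"], entry["message"])
```

Observability backends do exactly this under the hood; the correlation ID is what makes the jump possible at all.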
Cloud mapping: AWS vs Azure vs GCP (simple and practical)
Below is the common, beginner-friendly mapping of the three pillars in each cloud.
AWS (common choices)
- Metrics: CloudWatch Metrics
- Logs: CloudWatch Logs
- Traces: AWS X-Ray (and/or OpenTelemetry-based tracing pipelines)
- Audit logs: CloudTrail (important for “who changed what”)
Typical environments:
- EKS: metrics + logs + traces from pods/nodes
- ECS: similar concept, different deployment model
- Lambda: metrics and logs are very natural; traces are extremely valuable for cold starts and dependency time
Azure (common choices)
- Metrics: Azure Monitor Metrics
- Logs: Log Analytics (Azure Monitor Logs)
- Traces/APM: Application Insights (application-level telemetry)
- Audit logs: Activity logs and resource logs (for change tracking)
Typical environments:
- AKS: cluster + app telemetry, plus dependency tracking
- App Service / Functions: APM shines here
GCP (common choices)
- Metrics: Cloud Monitoring
- Logs: Cloud Logging
- Traces: Cloud Trace
- Audit logs: Cloud Audit Logs
Typical environments:
- GKE: strong “platform + app telemetry” story if you set it up cleanly
- Cloud Run / Functions: quick wins with request metrics + traces
Important note: The names differ, but the goal is the same: metrics detect, traces locate, logs explain.
The “what to instrument first” plan — per cloud (fast wins)
1) Start with service-level golden metrics (all clouds)
Pick your top 5 user-facing services and instrument:
- latency (p95/p99)
- error rate
- traffic
- saturation
Example: Checkout service
- /checkout latency p95
- /checkout error rate
- request count (per minute)
- CPU/memory (or container resource usage)
- payment dependency latency/error
This creates immediate, meaningful dashboards.
2) Add structured logs (your future self will thank you)
Do these four things:
- Log in JSON (or consistent key/value format)
- Always include service, env, version, route
- Always include request_id
- Include trace_id once tracing exists
“Best-practice log fields” (copy this list)
- timestamp
- level
- service
- env
- version (commit SHA or build number)
- route or operation
- status_code
- duration_ms
- request_id
- trace_id
- error_code (stable codes, not just messages)
- dependency (db/cache/provider)
- region/zone (optional but helpful)
And one rule:
Don’t log secrets. Don’t log raw sensitive payloads.
3) Add tracing where it matters most
Start tracing in three places:
- Entry point (API gateway / ingress / function handler)
- Service-to-service calls (HTTP/gRPC)
- Dependencies (DB, cache, external APIs)
Sampling tip (beginner-safe)
- Trace 100% of errors
- Trace a smaller % of successful requests (start with 1–10%)
- Increase sampling temporarily during incidents
This keeps cost and noise under control.
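The sampling rule above can be sketched as a head-based decision function: keep every error trace, keep roughly 10% of successes. Hashing the trace_id (rather than rolling a random number) is a common trick that makes the verdict deterministic, so every span of the same trace agrees; the function shape here is an illustration, not any specific SDK’s API.

```python
import hashlib

def should_sample(trace_id, is_error, success_rate=0.10):
    if is_error:
        return True                        # never drop error traces
    digest = hashlib.sha256(trace_id.encode()).digest()
    return digest[0] / 256 < success_rate  # first byte mapped to [0, 1)

print(should_sample("tr-91ab", is_error=True))  # True
```

During an incident you can raise `success_rate` temporarily to get more detail, then drop it back down once the fire is out.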
A full incident walkthrough (AWS/Azure/GCP-friendly)
Problem: “Checkout is slow for some users”
Step A — Metrics detect
- /checkout p95 latency rises sharply
- traffic steady
- error rate low
Conclusion: performance regression, not overload.
Step B — Traces locate
Traces show slow span:
- payment-provider span is 2.6s on slow requests
- internal spans normal
Conclusion: dependency latency, not compute shortage.
Step C — Logs explain
Logs show:
- retries increased
- timeouts too high
- fallback disabled after last config change
Fix:
- tighten timeouts
- reduce retries
- add circuit breaker/fallback
- optionally cache idempotent operations
Metrics detect → traces locate → logs explain.
That is observability working exactly as intended.
What “good” looks like (minimum standard for teams)
✅ 1) You can answer these 5 questions in under 5 minutes
- Is it an outage or slowness?
- Which service/route is impacted?
- Is it internal or a dependency?
- When did it start and what changed?
- What’s the fastest safe mitigation?
If you can’t do this yet, your next instrumentation step is obvious.
✅ 2) Every alert has an owner and a playbook
An alert without ownership becomes background noise.
Even a tiny playbook helps:
- what it means
- what to check first
- common causes
- safe mitigations
✅ 3) You track tail latency (p95/p99), not averages
Averages lie.
Tail latency tells you what users feel.
Beginner traps (and how to avoid them)
Trap 1: “Log everything”
Result: huge bills, noisy data, slower debugging
Fix: structured logs + purposeful fields + sampling where needed
Trap 2: “All dashboards, no decisions”
Result: pretty graphs, no action
Fix: dashboards must answer “what changed?” and “what do we do next?”
Trap 3: High-cardinality metrics explosion
Result: metrics become expensive and unusable
Fix: be careful with labels like user_id, request_id, raw URLs
Use route templates like /users/:id instead of /users/8321
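A small normalization step before emitting metrics is enough to avoid this trap. The regex patterns below are a sketch that handles numeric IDs and UUID-shaped segments; adapt them to your own URL scheme.

```python
import re

def route_template(path):
    # Collapse high-cardinality segments into stable templates.
    path = re.sub(r"/[0-9a-f]{8}-[0-9a-f-]{27}", "/:uuid", path)  # UUIDs first
    path = re.sub(r"/\d+", "/:id", path)                          # numeric IDs
    return path

print(route_template("/users/8321/orders/77"))  # /users/:id/orders/:id
```

Now all users share one `route` label value, so the metric backend stores a handful of time series instead of one per user.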
Trap 4: Traces without consistent propagation
Result: broken traces, missing hops
Fix: ensure trace context flows across services (especially async/queues)
The best 7-day implementation plan (works for AWS, Azure, GCP)
Days 1–2: Golden metrics + basic alerts
- latency p95/p99 per key route
- error rate per key route
- traffic
- saturation
Days 3–4: Structured logging + correlation IDs
- JSON logs
- request_id everywhere
- error codes
- dependency failure logging
Days 5–7: Tracing (targeted) + log/trace correlation
- trace entry points
- trace dependency calls
- sample successes, keep all errors
- add trace_id into logs
After 7 days, you’ll stop guessing and start knowing.
Quick FAQs people always wonder (and your answers)
“Do I need all three pillars?”
Eventually yes — but start with metrics, then logs, then traces.
“If I only pick one to start, which one?”
Metrics. They give the fastest, most reliable signal that something is wrong.
“Why not start with logs?”
Because logs tell stories, but they don’t show you the system’s health at a glance. During incidents, you need fast detection first.
“Are traces only for microservices?”
Traces help monoliths too, but they become critical in distributed systems.
“What’s the biggest win I can get quickly?”
Add:
- latency p95/p99
- error rate
- request_id everywhere
Then connect logs + traces using that correlation.
Final takeaway (bookmark this in your brain)
Metrics tell you something is wrong.
Traces tell you where it’s wrong.
Logs tell you what exactly happened.
For AWS, Azure, and GCP, the tools may have different names, but the winning approach stays the same.