You’ve seen it happen.
Everything looks fine… until users start complaining:
- “The app is slow.”
- “Checkout failed.”
- “I can’t log in.”
- “It works for me.”
And then the worst part: you don’t know where to look first.
That’s what observability solves.
Not “more dashboards.”
Not “more logs.”
Not “more tools.”
Observability is the ability to answer new, unexpected questions about your system by looking at the data it produces.
This guide will make it simple and practical:
- What logs, metrics, and traces really are
- When each one is best
- What to instrument first (so you get real value fast)
- Real examples you can copy into your system today

The simplest way to understand observability
Imagine your system is a busy airport.
- Metrics tell you how the airport is doing overall (number of flights, average delays).
- Logs tell you what happened in specific moments (flight AA123 gate change, baggage error).
- Traces tell you the journey of one passenger (security → gate → boarding → takeoff), including every stop and delay.
All three matter. But they answer different questions.
1) Metrics: the “vital signs” of your system
What metrics are
Metrics are numbers over time.
Examples:
- Requests per second (RPS)
- Error rate (%)
- Latency (p50, p95, p99)
- CPU / Memory
- Queue length
- DB connections
- Cache hit rate
What metrics are best for
Metrics are best for:
- Detecting that something is wrong fast
- Seeing trends
- Setting alerts
- Measuring SLOs (availability, latency targets)
A real example (you’ll recognize this)
Users complain “it’s slow.”
Your metrics show:
- p95 latency jumped from 200ms → 2,500ms
- Error rate stayed at 0.2%
- RPS stayed the same
That tells you:
- It’s not a traffic spike
- It’s not a crash
- It’s a performance problem
Metrics answered “what’s happening” quickly.
But not “why”.
2) Logs: the “black box recorder” of events
What logs are
Logs are detailed event records.
Examples:
- “User 8321 login failed: invalid password”
- “Payment API timeout after 3 retries”
- “DB query returned 0 rows”
- “OutOfMemoryError while parsing file”
Logs often include context:
- timestamps
- user id
- request id
- error message
- stack trace
- payload metadata (careful with sensitive data)
What logs are best for
Logs are best for:
- Understanding what exactly happened
- Debugging error details
- Investigating “weird” edge cases
- Auditing important events
A real example
Your metrics alert: error rate spikes from 0.2% → 7%.
Logs show:
- payment_gateway: 401 unauthorized
- right after a deployment
- with message: “token missing”
Now you know:
- The bug is in auth header handling
- It started after the new release
Logs answered “what happened and where”.
But logs still might not show:
- Which upstream call caused the failure
- Which microservice introduced latency
- The full path of one user request
That’s where traces win.
3) Traces: the “story of one request across services”
What traces are
A trace follows one request end-to-end, across services.
A trace is made of spans.
Each span is a timed operation like:
- API Gateway → Service A
- Service A → Service B
- Service B → Database query
- Service B → Cache
- Service A → Payment provider
You can see:
- total request time
- each hop time
- where time is being spent
- where errors happened
- dependencies involved
What traces are best for
Traces are best for:
- Microservices debugging (“where is latency coming from?”)
- Dependency issues (DB, cache, external APIs)
- Understanding request flow
- Finding the true bottleneck quickly
A real example
Users complain checkout is slow.
Metrics show:
- p95 checkout latency = 3 seconds
- CPU/memory normal
- DB normal
Traces show:
- checkout-service: total 3s
- inventory-service: 40ms
- pricing-service: 60ms
- payment-provider: 2.7s ← culprit
- db-write: 80ms
Now you know:
- Slow external dependency is dominating latency
- Fix is not in Kubernetes scaling
- Fix is timeouts, retries, fallbacks, payment provider performance, or caching
Traces answered “why”.
Logs vs Metrics vs Traces: when to use what (quick cheat sheet)
When you should use metrics
- “Is something wrong right now?”
- “Is performance getting worse?”
- “Should we alert?”
- “Are we meeting SLOs?”
When you should use logs
- “What was the exact error?”
- “Which user / input triggered it?”
- “What happened during that time?”
- “What is the stack trace?”
When you should use traces
- “Where is the latency coming from?”
- “Which dependency caused the failure?”
- “How does this request travel across services?”
- “Why are only some users affected?”
The big truth: most teams instrument in the wrong order
They start like this:
✅ “Let’s log everything!”
Then: logs become huge, expensive, noisy, and still hard to use.
Or they start like this:
✅ “Let’s make dashboards!”
Then: dashboards look nice, but don’t answer “why”.
The best approach is:
Instrument in this order:
- Golden Metrics (fast detection)
- Structured Logs (fast diagnosis)
- Traces (fast root cause, especially for microservices)
Let’s do it step-by-step.
What to instrument first (a practical roadmap)
Step 1: Instrument the 4 Golden Signals (Day 1 win)
These 4 metrics make you effective immediately:
- Latency (p50/p95/p99)
- Traffic (RPS / requests)
- Errors (rate, count, 4xx/5xx)
- Saturation (CPU/memory/queue/DB connections)
Example for a web API
- http_requests_total{service="api",route="/checkout"}
- http_request_duration_ms_p95{service="api",route="/checkout"}
- http_errors_total{service="api",route="/checkout",code="5xx"}
- cpu_utilization{service="api"}
- memory_utilization{service="api"}
- db_connections_in_use{service="api"}
If you only do this, you already get:
- meaningful alerting
- health visibility
- trend analysis
This is the fastest “time-to-value” in observability.
Step 2: Add just enough logs (but make them structured)
The mistake
Most logs look like this:
ERROR something broke
That is useless under pressure.
What you want instead: structured logs
Structured logs are logs with consistent fields.
Example (conceptually):
- level: ERROR
- service: checkout-service
- route: /checkout
- user_id: 8321 (only if safe)
- error_code: PAY_401
- message: “payment token missing”
- request_id: abc-123
- trace_id: t-987 (more on this soon)
- duration_ms: 240
Now you can:
- filter by route
- group by error_code
- find top causes
- connect logs to traces
“Log only what you’ll search”
The best logs answer these:
- what failed?
- for which route/function?
- for which dependency?
- what error code?
- what request id / trace id?
- how long did it take?
Keep logs high-signal, not high-volume.
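A structured log like the one above can be produced with the standard library alone. The sketch below attaches a JSON formatter to Python’s `logging` module; the field names (`service`, `error_code`, `request_id`, and so on) are examples matching this article, not a standard schema.

```python
import json
import logging

# Minimal JSON formatter: pulls well-known fields off the record if present.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", None),
            "route": getattr(record, "route", None),
            "error_code": getattr(record, "error_code", None),
            "request_id": getattr(record, "request_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields become searchable keys instead of free text.
logger.error(
    "payment token missing",
    extra={"service": "checkout-service", "route": "/checkout",
           "error_code": "PAY_401", "request_id": "abc-123",
           "duration_ms": 240},
)
```

Every line now has the same shape, so filtering by `route` or grouping by `error_code` becomes trivial in any log backend.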
Step 3: Instrument traces where they matter most (not everywhere first)
Tracing everything at 100% sampling is expensive and noisy.
Instead, instrument strategically.
Start tracing in 3 places
- Ingress / API gateway (start of the request)
- Service-to-service calls (HTTP/gRPC)
- Database + external calls (where latency usually hides)
The “aha” moment
Once you add tracing, you can answer:
- “Is the slowdown internal or external?”
- “Which dependency dominates p95?”
- “Which service is introducing retries?”
That’s why tracing becomes your fastest root cause tool in microservices.
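To see why spans make bottlenecks obvious, here is a toy tracer (illustrative only; real systems use OpenTelemetry or a vendor SDK). Each span records how long one step of a request took; the service names and sleep durations are invented to simulate a slow dependency.

```python
import time
from contextlib import contextmanager

class Tracer:
    def __init__(self):
        self.spans = []  # (name, duration_ms), appended as spans finish

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, (time.perf_counter() - start) * 1000))

tracer = Tracer()
with tracer.span("checkout"):                 # the whole request
    with tracer.span("inventory-service"):
        time.sleep(0.01)
    with tracer.span("payment-provider"):
        time.sleep(0.05)                      # simulated slow dependency

# The slowest child span points straight at the bottleneck.
children = [s for s in tracer.spans if s[0] != "checkout"]
slowest = max(children, key=lambda s: s[1])
print(slowest[0])  # payment-provider
```

That single `max()` over spans is the whole value proposition of tracing: no guessing, the timing data names the culprit.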
Real-life incident walkthrough (how you use all three)
Scenario: “Checkout is slow for some users”
Step A — Metrics (detect)
- p95 latency for /checkout went from 300ms → 3,000ms
- error rate unchanged
- traffic unchanged
Conclusion:
- performance regression, not outage
Step B — Traces (locate)
Traces show:
- payment-provider span is 2.6s for slow requests
- internal services remain ~100ms
Conclusion:
- external dependency is slow
Step C — Logs (confirm details)
Logs show:
- retry count increased
- timeouts happening
- fallback disabled due to config change
Conclusion:
- config change increased retries + no fallback
Fix:
- reduce retry attempts
- add timeout
- enable fallback/circuit breaker
- potentially cache idempotent calls
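The mitigation pattern from that fix list (bounded retries plus a graceful fallback) can be sketched in a few lines. This is an assumption-laden illustration: `flaky_provider` stands in for the real payment call, and per-attempt timeouts are presumed to be set on the call itself.

```python
# Bounded retries with a fallback instead of an unbounded stall.
def with_resilience(call, retries=1, fallback=None):
    for _ in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            continue                    # retry a bounded number of times
    if fallback is not None:
        return fallback()               # degrade gracefully
    raise TimeoutError("dependency unavailable")

def flaky_provider():
    raise TimeoutError  # simulate the provider timing out on every attempt

result = with_resilience(flaky_provider,
                         fallback=lambda: {"status": "queued"})
print(result)  # {'status': 'queued'}
```

The point is that the user gets a fast, defined answer (“payment queued”) instead of waiting out a 3-second dependency stall multiplied by retries.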
Metrics detect → traces locate → logs confirm.
That’s the ideal flow.
What “good instrumentation” looks like (the minimum standard)
1) Every request must have an ID
Call it:
- request_id
- correlation_id
This lets you find all logs for one user’s request.
2) Every request should have a trace_id (if tracing is on)
Then logs + traces connect.
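A minimal sketch of that ID plumbing: accept an incoming correlation ID at the edge, or mint a fresh one, then attach it to logs and pass it downstream. The header name `X-Request-ID` is a common convention, not a requirement.

```python
import uuid

def handle_request(headers):
    # Reuse the caller's ID if one arrived; otherwise mint a new one.
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())
    log_context = {"request_id": request_id, "route": "/checkout"}
    downstream_headers = {"X-Request-ID": request_id}  # propagate it
    return log_context, downstream_headers

ctx, out = handle_request({"X-Request-ID": "req-8f1c"})
print(ctx["request_id"])  # req-8f1c — the caller's ID is preserved
```

Because the same ID appears in every log line and every downstream hop, one grep (or one query) reconstructs a single user’s entire request.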
3) Every service should expose a health and performance view
At minimum:
- latency by route
- error rate by route
- traffic
- saturation
4) Every dependency call should be measured
Because dependencies cause most pain:
- DB query time
- cache hit rate
- external API latency + error rate
Common beginner mistakes (and how to avoid them)
Mistake 1: Logging too much
Symptom: you can’t find anything, costs explode
Fix: structured logs + sampling + only log what you search
Mistake 2: Only using averages
Symptom: average latency looks fine while users suffer
Fix: always track p95/p99, not just avg
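A quick numeric demonstration of why averages lie: 95 fast requests and 5 very slow ones produce an average that looks tolerable while the tail (what those 5 users actually feel) is terrible. The latency values are invented for illustration.

```python
from statistics import mean, quantiles

latencies = [100] * 95 + [5000] * 5      # mostly fast, a painful tail
avg = mean(latencies)
p95 = quantiles(sorted(latencies), n=20)[18]  # 19th cut point = p95
print(avg)  # 345.0 — "looks fine" on a dashboard
print(p95)  # thousands of ms — the real user pain
```

Same data, two very different stories; only the percentile tells the true one.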
Mistake 3: Dashboards with no decisions
Symptom: dashboards exist but nobody acts
Fix: each dashboard should answer “what changed?” and “what do we do?”
Mistake 4: Tracing without context
Symptom: traces exist but don’t show business value
Fix: add key attributes: route, tenant, region, feature flag (carefully)
Mistake 5: No ownership
Symptom: alerts fire, but nobody knows who should fix
Fix: alerts should map to services and on-call ownership
Observability for different architectures (quick guidance)
If you have a monolith
Start with:
- golden metrics
- slow query logs
- request logs with duration + status
Then add traces later if needed.
If you have microservices
Start with:
- golden metrics per service
- distributed tracing early (even sampled)
- structured logs with trace_id
If you run batch jobs / pipelines
Start with:
- job duration
- success/failure count
- queue/backlog
- resource usage
Then add logs for failures and traces for pipeline stages.
The fastest “first week” implementation plan
Day 1–2: Metrics you can alert on
- latency p95
- error rate
- traffic
- saturation
Day 3–4: Structured logging
- consistent fields
- request_id
- clear error codes
- dependency error logging
Day 5–7: Tracing (targeted)
- gateway → services
- DB + external calls
- sampling strategy (start small)
You’ll be more effective in one week than many teams are in six months.
Final takeaway (the one you should remember)
If observability feels overwhelming, remember this order:
Metrics tell you something is wrong.
Traces tell you where it’s wrong.
Logs tell you what exactly happened.
Start with the golden signals.
Then add structured logs.
Then add traces where you have distributed complexity.
That combination keeps you calm when production isn’t.
Observability 101 for AWS, Azure, and GCP
Logs vs Metrics vs Traces (and what to instrument first)
You know that feeling when something breaks in production and everyone asks:
- “Is it down or just slow?”
- “Which service?”
- “Only some users or everyone?”
- “Did the last deployment do this?”
And you open your monitoring tool… and it’s a maze.
Observability is how you turn that chaos into a calm, repeatable process.
In this guide, you’ll learn (in plain English):
- What logs, metrics, and traces actually are
- Which one answers which question (fast)
- What to instrument first to get real value quickly
- Exactly how this maps to AWS, Azure, and GCP
- Real examples (with copy-paste friendly structures)
No fluff. No theory-only talk. Just the stuff that makes you effective.
The 10-second definition (engineer-friendly)
Observability = being able to explain what your system is doing from the outside, using the data it produces.
Think of it like this:
- Metrics = “How bad is it?” (numbers over time)
- Logs = “What exactly happened?” (events and details)
- Traces = “Where did time go?” (a request’s journey across services)
If you remember only one line, remember this:
Metrics detect. Traces locate. Logs explain.
1) Metrics: your system’s vital signs
What metrics are
Metrics are numbers measured over time, usually aggregated.
Examples:
- Requests per second
- Error rate (%)
- Latency (p50 / p95 / p99)
- CPU / memory
- Queue depth
- DB connections
- Cache hit rate
When metrics are the best tool
Metrics win when you need to answer:
- “Is something wrong right now?”
- “Is it getting worse?”
- “Should I alert?”
- “Are we meeting our reliability targets?”
The classic real-world moment
Users say “Checkout is slow.”
Metrics show:
- p95 latency jumped from 250ms → 2,800ms
- traffic stayed normal
- error rate stayed normal
That tells you:
✅ It’s real
✅ It’s performance
✅ It’s not a traffic spike
✅ It’s not (yet) a full outage
But metrics don’t tell you why.
That’s the next pillar.
2) Logs: the detailed “what happened” story
What logs are
Logs are event records with context.
Examples:
- “Payment request timed out”
- “DB query failed”
- “Token missing”
- “Retry attempt 2”
- “User creation succeeded”
Logs are powerful only if they’re searchable
If your logs are:
- inconsistent
- unstructured
- missing identifiers
…then they become a noisy diary nobody can use under stress.
What logs are best for
Logs win when you need to answer:
- “What is the exact error message?”
- “Which input caused it?”
- “Which dependency failed?”
- “Was this a known exception or a new one?”
The most useful logging upgrade you can make (today)
Switch from “random text logs” to structured logs.
Instead of:
ERROR payment failed
Prefer:
{
"level": "ERROR",
"service": "checkout-service",
"route": "/checkout",
"error_code": "PAY_TIMEOUT",
"message": "Payment provider timeout",
"duration_ms": 2400,
"request_id": "req-8f1c",
"trace_id": "tr-91ab",
"user_tier": "premium"
}
Now you can filter, group, and correlate.
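For example, with structured JSON logs, “find the top error causes” becomes a one-liner instead of a grep expedition. The log lines below are illustrative, reusing the `error_code` field from the example above.

```python
import json
from collections import Counter

lines = [
    '{"level": "ERROR", "error_code": "PAY_TIMEOUT"}',
    '{"level": "ERROR", "error_code": "PAY_TIMEOUT"}',
    '{"level": "ERROR", "error_code": "DB_CONN_REFUSED"}',
]
# Count errors by their stable code, not by free-text message.
top = Counter(json.loads(line)["error_code"] for line in lines).most_common(1)
print(top)  # [('PAY_TIMEOUT', 2)]
```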
But logs still don’t show you the end-to-end path of a request.
That’s what traces do.
3) Traces: the “where time went” microscope
What traces are
A trace follows one request across services.
A trace is made of spans (timed steps), like:
- API Gateway → checkout-service
- checkout-service → inventory-service
- checkout-service → payment-provider
- checkout-service → database write
What traces are best for
Traces win when you need to answer:
- “Which service is slow?”
- “Which dependency is the bottleneck?”
- “Why is p95 bad but p50 is fine?”
- “Where did the errors start?”
Real example (microservices reality)
Metrics say checkout p95 is 3s.
Traces reveal:
- inventory call: 40ms
- pricing call: 70ms
- payment provider: 2.7s ← culprit
- DB write: 90ms
Now you know scaling pods won’t fix it.
You need timeouts, retries, fallbacks, caching, or vendor escalation.
Logs vs Metrics vs Traces: the practical cheat sheet
Use metrics when…
- you need fast detection
- you need alerting
- you need trends and SLO tracking
Use logs when…
- you need error details
- you need debugging context
- you need audit trails and event evidence
Use traces when…
- you need root cause in distributed systems
- you need dependency breakdown
- you need latency bottleneck identification
What to instrument first (the best order for beginners)
Most teams do this wrong by starting with “log everything.”
The best order (especially for AWS/Azure/GCP teams) is:
Step 1: Metrics (Golden Signals)
Step 2: Structured Logs (with correlation IDs)
Step 3: Traces (targeted, then expanded)
Why this order works:
- Metrics give you instant visibility
- Logs give you instant diagnosis
- Traces give you instant root cause when systems are distributed
Now let’s make it concrete.
The Golden Signals (instrument these first)
These 4 signals are the fastest path to useful observability:
- Latency (p50/p95/p99)
- Traffic (requests/sec, throughput)
- Errors (rate, count, 4xx/5xx)
- Saturation (CPU/memory/queue/DB connections)
Minimal starter metrics for an HTTP API
- requests_total by route/status
- request_duration_ms p95/p99 by route
- errors_total by route/status
- CPU/memory utilization
- dependency latency/error (DB, cache, external API)
If you do only this, you’ll already:
- catch incidents faster
- stop guessing
- create useful alerts
Correlation: the glue that makes everything “click”
If you want your observability to feel magical, do this:
Every request should carry:
- request_id (correlation ID)
- trace_id (if tracing enabled)
Then:
- Metrics tell you which route/service is failing
- You click a slow request trace
- You jump to the exact logs for that trace/request
This is the difference between “tools” and “superpowers.”
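Once every log line carries the trace_id, the “click from trace to logs” step is just a filter. The entries below are illustrative, reusing the `trace_id` field described above.

```python
logs = [
    {"trace_id": "tr-91ab", "level": "ERROR", "message": "payment timeout"},
    {"trace_id": "tr-91ab", "level": "INFO",  "message": "retry attempt 2"},
    {"trace_id": "tr-7777", "level": "INFO",  "message": "user created"},
]

def logs_for_trace(entries, trace_id):
    # Every log line for one end-to-end request, in one query.
    return [e for e in entries if e["trace_id"] == trace_id]

for entry in logs_for_trace(logs, "tr-91ab"):
    print(entry["level"], entry["message"])
```

Observability backends do exactly this under the hood; the correlation ID is what makes the jump possible at all.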
Cloud mapping: AWS vs Azure vs GCP (simple and practical)
Below is the common, beginner-friendly mapping of the three pillars in each cloud.
AWS (common choices)
- Metrics: CloudWatch Metrics
- Logs: CloudWatch Logs
- Traces: AWS X-Ray (and/or OpenTelemetry-based tracing pipelines)
- Audit logs: CloudTrail (important for “who changed what”)
Typical environments:
- EKS: metrics + logs + traces from pods/nodes
- ECS: similar concept, different deployment model
- Lambda: metrics and logs are very natural; traces are extremely valuable for cold starts and dependency time
Azure (common choices)
- Metrics: Azure Monitor Metrics
- Logs: Log Analytics (Azure Monitor Logs)
- Traces/APM: Application Insights (application-level telemetry)
- Audit logs: Activity logs and resource logs (for change tracking)
Typical environments:
- AKS: cluster + app telemetry, plus dependency tracking
- App Service / Functions: APM shines here
GCP (common choices)
- Metrics: Cloud Monitoring
- Logs: Cloud Logging
- Traces: Cloud Trace
- Audit logs: Cloud Audit Logs
Typical environments:
- GKE: strong “platform + app telemetry” story if you set it up cleanly
- Cloud Run / Functions: quick wins with request metrics + traces
Important note: The names differ, but the goal is the same: metrics detect, traces locate, logs explain.
The “what to instrument first” plan — per cloud (fast wins)
1) Start with service-level golden metrics (all clouds)
Pick your top 5 user-facing services and instrument:
- latency (p95/p99)
- error rate
- traffic
- saturation
Example: Checkout service
- /checkout latency p95
- /checkout error rate
- request count (per minute)
- CPU/memory (or container resource usage)
- payment dependency latency/error
This creates immediate, meaningful dashboards.
2) Add structured logs (your future self will thank you)
Do these four things:
- Log in JSON (or consistent key/value format)
- Always include service, env, version, route
- Always include request_id
- Include trace_id once tracing exists
“Best-practice log fields” (copy this list)
- timestamp
- level
- service
- env
- version (commit SHA or build number)
- route or operation
- status_code
- duration_ms
- request_id
- trace_id
- error_code (stable codes, not just messages)
- dependency (db/cache/provider)
- region/zone (optional but helpful)
And one rule:
Don’t log secrets. Don’t log raw sensitive payloads.
3) Add tracing where it matters most
Start tracing in three places:
- Entry point (API gateway / ingress / function handler)
- Service-to-service calls (HTTP/gRPC)
- Dependencies (DB, cache, external APIs)
Sampling tip (beginner-safe)
- Trace 100% of errors
- Trace a smaller % of successful requests (start with 1–10%)
- Increase sampling temporarily during incidents
This keeps cost and noise under control.
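The sampling rule above can be sketched as a head-based decision function: keep every error trace, keep roughly 10% of successes. Hashing the trace_id (rather than rolling a random number) is a common trick that makes the verdict deterministic, so every span of the same trace agrees; the function shape here is an illustration, not any specific SDK’s API.

```python
import hashlib

def should_sample(trace_id, is_error, success_rate=0.10):
    if is_error:
        return True                        # never drop error traces
    digest = hashlib.sha256(trace_id.encode()).digest()
    return digest[0] / 256 < success_rate  # first byte mapped to [0, 1)

print(should_sample("tr-91ab", is_error=True))  # True
```

During an incident you can raise `success_rate` temporarily to get more detail, then drop it back down once the fire is out.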
A full incident walkthrough (AWS/Azure/GCP-friendly)
Problem: “Checkout is slow for some users”
Step A — Metrics detect
- /checkout p95 latency rises sharply
- traffic steady
- error rate low
Conclusion: performance regression, not overload.
Step B — Traces locate
Traces show slow span:
- payment-provider span is 2.6s on slow requests
- internal spans normal
Conclusion: dependency latency, not compute shortage.
Step C — Logs explain
Logs show:
- retries increased
- timeouts too high
- fallback disabled after last config change
Fix:
- tighten timeouts
- reduce retries
- add circuit breaker/fallback
- optionally cache idempotent operations
Metrics detect → traces locate → logs explain.
That is observability working exactly as intended.
What “good” looks like (minimum standard for teams)
✅ 1) You can answer these 5 questions in under 5 minutes
- Is it an outage or slowness?
- Which service/route is impacted?
- Is it internal or a dependency?
- When did it start and what changed?
- What’s the fastest safe mitigation?
If you can’t do this yet, your next instrumentation step is obvious.
✅ 2) Every alert has an owner and a playbook
An alert without ownership becomes background noise.
Even a tiny playbook helps:
- what it means
- what to check first
- common causes
- safe mitigations
✅ 3) You track tail latency (p95/p99), not averages
Averages lie.
Tail latency tells you what users feel.
Beginner traps (and how to avoid them)
Trap 1: “Log everything”
Result: huge bills, noisy data, slower debugging
Fix: structured logs + purposeful fields + sampling where needed
Trap 2: “All dashboards, no decisions”
Result: pretty graphs, no action
Fix: dashboards must answer “what changed?” and “what do we do next?”
Trap 3: High-cardinality metrics explosion
Result: metrics become expensive and unusable
Fix: be careful with labels like user_id, request_id, raw URLs
Use route templates like /users/:id instead of /users/8321
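A small normalization step before emitting metrics is enough to avoid this trap. The regex patterns below are a sketch that handles numeric IDs and UUID-shaped segments; adapt them to your own URL scheme.

```python
import re

def route_template(path):
    # Collapse high-cardinality segments into stable templates.
    path = re.sub(r"/[0-9a-f]{8}-[0-9a-f-]{27}", "/:uuid", path)  # UUIDs first
    path = re.sub(r"/\d+", "/:id", path)                          # numeric IDs
    return path

print(route_template("/users/8321/orders/77"))  # /users/:id/orders/:id
```

Now all users share one `route` label value, so the metric backend stores a handful of time series instead of one per user.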
Trap 4: Traces without consistent propagation
Result: broken traces, missing hops
Fix: ensure trace context flows across services (especially async/queues)
The best 7-day implementation plan (works for AWS, Azure, GCP)
Days 1–2: Golden metrics + basic alerts
- latency p95/p99 per key route
- error rate per key route
- traffic
- saturation
Days 3–4: Structured logging + correlation IDs
- JSON logs
- request_id everywhere
- error codes
- dependency failure logging
Days 5–7: Tracing (targeted) + log/trace correlation
- trace entry points
- trace dependency calls
- sample successes, keep all errors
- add trace_id into logs
After 7 days, you’ll stop guessing and start knowing.
Quick FAQs people always wonder (and your answers)
“Do I need all three pillars?”
Eventually yes — but start with metrics, then logs, then traces.
“If I only pick one to start, which one?”
Metrics. They give the fastest, most reliable signal that something is wrong.
“Why not start with logs?”
Because logs tell stories, but they don’t show you the system’s health at a glance. During incidents, you need fast detection first.
“Are traces only for microservices?”
Traces help monoliths too, but they become critical in distributed systems.
“What’s the biggest win I can get quickly?”
Add:
- latency p95/p99
- error rate
- request_id everywhere
Then connect logs + traces using that correlation.
Final takeaway (bookmark this in your brain)
Metrics tell you something is wrong.
Traces tell you where it’s wrong.
Logs tell you what exactly happened.
For AWS, Azure, and GCP, the tools may have different names, but the winning approach stays the same.