Most dashboards fail for one simple reason: they look impressive but don’t help you answer a real question under pressure.
Engineers don’t open Grafana to admire graphs. They open it when:
- a customer says “it’s slow”
- an alert fires at 2:17 AM
- a deploy just happened
- a KPI suddenly dips
So let’s build dashboards that do what engineers actually need: tell you what’s broken, where it is, why it’s happening, and what to do next.
This guide is beginner-friendly, but it goes deep enough that you can build a dashboard your team will keep pinned for years.

1) The mental model (so everything clicks fast)
Prometheus = time-series database + query engine
Prometheus stores metrics like:
http_requests_total{service="api", status="200"}  1289301
container_cpu_usage_seconds_total{pod="payments-7f8c"}  9812.2
Each metric is a time series: a value changing over time, labeled with dimensions (labels).
Grafana = visualization + exploration
Grafana asks Prometheus questions (queries), then turns results into:
- charts
- tables
- heatmaps
- alerts (optional)
Prometheus is your source of truth. Grafana is your microscope.
2) What “good dashboards” do (the 5 engineer questions)
Every dashboard panel should help answer one of these questions:
- Is the service healthy right now?
- Is it getting better or worse?
- Is the problem traffic, errors, latency, or saturation?
- Which instances/pods/regions are the worst offenders?
- What changed recently? (deploys, config, infra events)
If a panel doesn’t support one of these, it’s probably noise.
3) Metrics you must understand (Counter, Gauge, Histogram)
Counter (only goes up)
Example: http_requests_total
Use it for rates: requests/sec, errors/sec
✅ You almost always query counters with rate().
Gauge (goes up/down)
Example: container_memory_working_set_bytes
Use it for current state: memory usage, queue depth, CPU throttling
Histogram (the secret to useful latency)
If you want p95/p99 latency, you need histograms like:
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.2"}
Then you compute percentiles with histogram_quantile().
If your app only exposes “average latency”, you’ll get misleading dashboards.
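To demystify histogram_quantile(), here is a tiny stdlib-only sketch of the idea behind it: Prometheus assumes observations are spread linearly inside each bucket and interpolates to the requested quantile. This is toy code illustrating the principle, not the real implementation.

```python
# Toy re-implementation of histogram_quantile()'s core idea:
# observations are assumed to be spread linearly inside each bucket.
def bucket_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_le, cumulative_count)."""
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total                # the observation we are looking for
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le      # cannot interpolate into +Inf
            # linear interpolation inside the bucket (prev_le, le]
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Cumulative counts: 90 requests under 0.1s, 98 under 0.2s, 100 total.
buckets = [(0.1, 90), (0.2, 98), (float("inf"), 100)]
print(bucket_quantile(0.95, buckets))  # → 0.1625 (p95 lands in the 0.1–0.2 bucket)
```

This is also why bucket boundaries matter: the p95 above can only ever be as precise as the 0.1–0.2 bucket allows.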
4) PromQL fundamentals (the minimum you need)
PromQL feels scary until you realize it’s mostly three moves:
Move A — pick a metric and filter labels
http_requests_total{service="api", env="prod"}
Move B — turn counters into rates
rate(http_requests_total{service="api"}[5m])
Move C — aggregate by a label
sum by (status) (rate(http_requests_total{service="api"}[5m]))
That’s enough to build 70% of useful dashboards.
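The three moves can be mimicked on plain Python data to make the pipeline concrete. The sample numbers and label sets below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical counter increases over a 5-minute (300 s) window,
# one entry per unique label set of http_requests_total.
WINDOW = 300
series = [
    ({"service": "api", "env": "prod", "status": "200"}, 6000),
    ({"service": "api", "env": "prod", "status": "500"}, 300),
    ({"service": "web", "env": "prod", "status": "200"}, 9000),
]

# Move A: pick the metric and filter labels.
api = [s for s in series if s[0]["service"] == "api" and s[0]["env"] == "prod"]

# Move B: turn counter increases into per-second rates (what rate(...[5m]) does).
rates = [(labels, inc / WINDOW) for labels, inc in api]

# Move C: aggregate by a label (sum by (status)).
by_status = defaultdict(float)
for labels, r in rates:
    by_status[labels["status"]] += r

print(dict(by_status))  # → {'200': 20.0, '500': 1.0}
```

Every Golden Signals query in the next section is just these three moves composed in different orders.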
5) The “Golden Signals” dashboard pattern engineers trust
When on-call engineers say “show me the dashboard,” they usually want four things:
1) Traffic
“How much load is the system handling?”
sum(rate(http_requests_total{service="$service"}[5m]))
2) Errors
“Are users failing?”
sum(rate(http_requests_total{service="$service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
(That’s an error ratio; multiply by 100 in Grafana for %.)
3) Latency (p50/p95/p99)
Assuming histogram buckets http_request_duration_seconds_bucket:
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m]))
)
4) Saturation
“Are we running out of something?” (CPU, memory, threads, connections, queue)
Examples later—this is where real outages hide.
This pattern works for microservices, APIs, queues, and databases.
6) Step-by-step: build a “Service Overview” dashboard (the one engineers actually use)
Step 1 — Decide the dashboard scope
Start with one service. Not the whole universe.
Dashboard name: Service / $service / Overview
It should cover:
- traffic, errors, latency, saturation
- top offenders (pods/instances)
- key dependencies (DB, cache, queue)
- recent changes (deploy annotations)
Step 2 — Add Grafana variables (this is what makes dashboards reusable)
Add dropdown variables so the dashboard works across environments:
- $env (prod/stage/dev)
- $service
- $instance (or $pod)
- $namespace (if Kubernetes)
- $cluster (if multi-cluster)
In Grafana, these are typically populated from the Prometheus datasource with a variable query such as label_values(http_requests_total, service).
Even beginners feel “wow” when a dashboard becomes interactive.
Practical tip: Keep variables in a consistent order across dashboards.
Step 3 — Add panels in the order engineers read during incidents
Engineers scan left-to-right and top-to-bottom.
Row 1: “Am I on fire?”
Panel 1 — Request rate (RPS)
sum(rate(http_requests_total{service="$service", env="$env"}[5m]))
Panel 2 — Error rate %
100 *
sum(rate(http_requests_total{service="$service", env="$env", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service", env="$env"}[5m]))
Panel 3 — Latency p95
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{service="$service", env="$env"}[5m]))
)
✅ These three panels alone solve a huge chunk of “is it broken?” questions.
Row 2: “Where is it broken?”
Panel 4 — Top 10 instances/pods by error rate
topk(
10,
sum by (instance) (rate(http_requests_total{service="$service", env="$env", status=~"5.."}[5m]))
)
Panel 5 — Top 10 by latency (p95 per instance)
If your histogram includes an instance label:
topk(
10,
histogram_quantile(
0.95,
sum by (le, instance) (rate(http_request_duration_seconds_bucket{service="$service", env="$env"}[5m]))
)
)
Engineers love TopK tables because they immediately point to suspects.
Row 3: “Is it resource saturation?”
Pick resource panels that match how your service runs.
If running on Kubernetes
CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod"}[5m]))
Memory working set per pod
sum by (pod) (container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod"})
CPU throttling (very useful)
sum by (pod) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$pod"}[5m]))
If throttling rises while latency rises, that’s an “aha” moment.
Row 4: “Did something change?”
Engineers always ask: “Was there a deploy?”
Add annotations:
- deployment events
- config changes
- node restarts
- autoscaling events
Even a simple “deploy marker” makes your dashboard feel alive.
Step 4 — Make it readable (the design rules people skip)
Rule 1: Use consistent units
- latency: milliseconds
- memory: bytes → show as GiB
- CPU: cores or %
- error: %
Rule 2: Avoid 50-line spaghetti graphs
Use:
- topk(10, ...)
- or aggregate by one dimension
Rule 3: Put “current value” first
Most panels should show:
- last value
- trend
- max (optional)
Rule 4: Add one sentence under each row
Example:
- “If p95 latency rises with throttling, check CPU limits or node pressure.”
These micro-hints turn dashboards into teaching tools.
7) Dashboards engineers love (templates you can copy)
A) “RED” dashboard for APIs (Requests, Errors, Duration)
If you run web services, build this first.
- RPS (total + by route if possible)
- 4xx/5xx split
- p50/p95/p99 latency
- Top endpoints by latency
- Top endpoints by error rate
If your metrics include route:
topk(10, sum by (route) (rate(http_requests_total{service="$service", status=~"5.."}[5m])))
B) “USE” dashboard for infrastructure (Utilization, Saturation, Errors)
For nodes, databases, queues.
- CPU utilization
- memory utilization
- disk I/O
- network throughput
- saturation indicators (queue length, connection pool usage)
C) “Kubernetes Cluster Overview”
Engineers use this when “everything is slow”:
- Node CPU/memory pressure
- Pod restarts (topk)
- Pending pods
- Image pull failures
- API server saturation (if available)
- Cluster autoscaler behavior
Example: Pod restarts topk:
topk(20, increase(kube_pod_container_status_restarts_total[1h]))
8) The two performance tricks that separate “okay” from “great”
Trick 1: Recording rules (make dashboards fast)
If a query is heavy and used everywhere (like error ratio), precompute it in Prometheus as a new time series.
You’ll get:
- faster dashboards
- less load on Prometheus
- fewer timeouts during incidents
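A recording rule is just a named query that Prometheus evaluates on a schedule and stores as a new series. A sketch of what that file might look like for the error ratio from earlier (the file, group, and rule names here are made up; the record: names follow the common level:metric:operation convention):

```yaml
# rules.yml — loaded via rule_files: in prometheus.yml
groups:
  - name: service-overview
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service, env) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service, env) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service, env) (rate(http_requests_total[5m]))
```

In Grafana you then query service:http_errors:ratio5m directly instead of recomputing the ratio in every panel.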
Trick 2: Label hygiene (avoid cardinality explosions)
High-cardinality labels can destroy Prometheus performance.
Be careful with labels like:
- user_id
- request_id
- full URL paths with IDs
- random strings
Instead, label by stable dimensions:
service, route, method, status, namespace, pod
Dashboards should be reliable during outages, not the first thing to crash.
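A quick way to sanity-check a label set is to multiply cardinalities: in the worst case, every combination of label values becomes its own time series. The numbers below are invented, but the arithmetic is the point:

```python
# Back-of-the-envelope series-count math: every unique label combination
# can become its own time series in Prometheus (worst case).
stable = {"service": 20, "route": 50, "method": 4, "status": 8}
series = 1
for card in stable.values():
    series *= card
print(series)           # 32,000 series — a lot, but manageable

# Add one high-cardinality label (say 50,000 distinct user_id values)
# and the worst case multiplies:
print(series * 50_000)  # 1,600,000,000 potential series
```

In practice not every combination exists, but labels multiply rather than add, which is why a single user_id label can sink an otherwise healthy setup.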
9) Common mistakes (and the simple fixes)
Mistake: Graphing counters directly
You’ll see a line that only goes up. Useless.
✅ Fix: use rate(counter[5m])
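What rate() actually does is worth internalizing: it converts counter increases into per-second values and corrects for counter resets (a process restart drops the counter back toward zero). A toy stdlib sketch of that behavior:

```python
# Why you graph rate() and not the raw counter: a toy version of rate()
# that, like Prometheus, treats any decrease as a counter reset (restart).
def simple_rate(samples, step):
    """samples: counter values sampled every `step` seconds."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted near 0, so the whole new
        # value is counted as increase (Prometheus applies the same fix).
        increase += cur - prev if cur >= prev else cur
    return increase / (step * (len(samples) - 1))

# Counter climbs, the process restarts at the 4th sample, then climbs again:
print(simple_rate([0, 240, 480, 120, 360], 60))  # → 3.5 requests/sec
```

The raw counter here would graph as a sawtooth that tells you nothing; the rate is a steady 3.5 req/s across the restart.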
Mistake: Only showing average latency
Averages hide pain.
✅ Fix: use p95/p99 from histograms
Mistake: Dashboards that don’t help you find “which instance”
You see a problem but not the culprit.
✅ Fix: add TopK panels by pod/instance
Mistake: Alerting on “CPU > 80%”
High CPU isn’t always bad; low CPU isn’t always good.
✅ Fix: alert on symptoms (errors, latency, saturation) and use CPU as a clue.
10) A practical “dashboard checklist” (use this before publishing)
A dashboard that engineers use typically has:
- A clear title and scope (service/cluster/env)
- Variables for env/service/instance
- Traffic + errors + latency + saturation
- At least 2 TopK panels for “who is worst”
- Annotations for deploys/changes
- Panels have correct units and sane time ranges
- Queries don’t time out under load
- A short “how to read this dashboard” note
If you hit these, your dashboard will get pinned.
Final takeaway
Prometheus gives you raw truth. Grafana makes it usable.
But the difference between “random graphs” and “dashboards engineers use” is simple:
Great dashboards reduce time-to-answer.
They don’t just show data—they show decisions.
Prometheus + Grafana for Kubernetes: dashboards engineers actually use
If you run Kubernetes in production, you already know the feeling:
- Pods restart for “no reason”
- A deploy looks fine… then latency spikes
- Nodes are “healthy” but requests time out
- CPU looks okay, yet the service crawls
This is exactly why Prometheus + Grafana are popular in Kubernetes: they help you answer the real on-call questions:
What’s broken? Where is it broken? Why now? What do I do next?
This guide gives you practical dashboard building blocks (with PromQL examples) that you can implement immediately.
1) First, define what “good Kubernetes dashboards” mean
A Kubernetes dashboard should help you:
- Detect: something is wrong (symptom)
- Localize: which namespace/pod/node (where)
- Explain: why it’s happening (cause)
- Act: what to do next (next action)
If a dashboard doesn’t help you do at least one of these, it’s just decoration.
2) The 4 dashboards every Kubernetes team needs
Engineers don’t need 40 dashboards.
They need 4 great ones:
- Cluster Overview (is the cluster unhealthy?)
- Namespace Overview (which team/app is hurting?)
- Workload / Service Overview (why is this deployment slow?)
- Node & Capacity (are we running out of resources?)
Let’s build each one like a checklist.
Dashboard 1: Cluster Overview (the “Are we on fire?” page)
This is what you open first when alerts are noisy or “everything is slow”.
A) Workload health signals
1) Pods Pending
sum(kube_pod_status_phase{phase="Pending"})
2) Pods Failed
sum(kube_pod_status_phase{phase="Failed"})
3) Top pods restarting (last 1h)
topk(20, increase(kube_pod_container_status_restarts_total[1h]))
Why engineers love this: it instantly points to the biggest pain.
B) Node pressure signals (hidden cause of “random” issues)
4) Nodes not ready
sum(kube_node_status_condition{condition="Ready", status=~"false|unknown"})
(This returns 0 when every node is Ready; the count(... == 0) variant goes blank instead, which makes panels confusing.)
5) Memory pressure nodes
sum(kube_node_status_condition{condition="MemoryPressure", status="true"})
6) Disk pressure nodes
sum(kube_node_status_condition{condition="DiskPressure", status="true"})
C) Scheduling capacity signals
7) Unschedulable nodes
sum(kube_node_spec_unschedulable)
8) Pending pods by namespace (table)
topk(20, sum by (namespace) (kube_pod_status_phase{phase="Pending"}))
How engineers read this dashboard
- If Pending is rising → capacity / scheduling / quota issue
- If restarts spike → crash loops, OOMKills, bad config, failing dependencies
- If node pressure appears → saturation causing downstream failures
Dashboard 2: Namespace Overview (the “Which team is impacted?” page)
This dashboard is for platform + teams.
It shows which namespace is consuming resources and suffering instability.
A) Resource usage by namespace
1) CPU usage by namespace
sum by (namespace) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)
2) Memory usage by namespace
sum by (namespace) (
container_memory_working_set_bytes{container!="", image!=""}
)
B) Top offenders (tables)
3) Top pods by CPU
topk(20, sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
))
4) Top pods by memory
topk(20, sum by (namespace, pod) (
container_memory_working_set_bytes{container!="", image!=""}
))
C) Instability signals
5) Restarting pods by namespace (last 1h)
topk(20, sum by (namespace) (increase(kube_pod_container_status_restarts_total[1h])))
Why this dashboard is powerful
It shifts conversations from:
“Kubernetes is slow”
to:
“Namespace X has 3 pods restarting and is consuming 40% memory.”
That’s actionable.
Dashboard 3: Workload / Service Overview (the “Why is my app slow?” page)
This dashboard is per deployment/service (engineers live here).
Must-have variables (so it’s reusable)
- $namespace
- $workload or $deployment
- $pod
A) Health & scaling
1) Desired vs available replicas
kube_deployment_spec_replicas{namespace="$namespace", deployment="$deployment"}
kube_deployment_status_replicas_available{namespace="$namespace", deployment="$deployment"}
2) HPA current vs desired
kube_horizontalpodautoscaler_status_current_replicas{namespace="$namespace", horizontalpodautoscaler="$hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{namespace="$namespace", horizontalpodautoscaler="$hpa"}
B) CPU, Memory, Throttling (the core triad)
3) CPU usage per pod
sum by (pod) (
rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod", container!="", image!=""}[5m])
)
4) Memory working set per pod
sum by (pod) (
container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod", container!="", image!=""}
)
5) CPU throttling per pod (super important)
sum by (pod) (
rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$pod", container!="", image!=""}[5m])
)
How to interpret throttling (real example)
If:
- p95 latency rises
- error rate rises
- throttling rises
Then the “CPU looks fine” lie appears — because throttling is the real bottleneck.
C) OOMKills and crash loops (the fastest root-cause wins)
6) OOMKilled containers
sum by (namespace, pod) (
kube_pod_container_status_last_terminated_reason{reason="OOMKilled", namespace="$namespace"} == 1
)
7) Restart reasons (table)
topk(20, increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h]))
Dashboard 4: Node & Capacity (the “Can the cluster handle this?” page)
This is what you use for:
- scaling decisions
- bin-packing issues
- “why are pods pending?”
- cost + capacity tradeoffs
A) Node CPU/Memory saturation
1) CPU utilization per node
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
sum by (instance) (rate(node_cpu_seconds_total[5m]))
(node-exporter identifies nodes by the instance label unless you relabel it to node.)
2) Memory utilization per node
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
B) Node filesystem pressure
3) Disk usage
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
C) Allocatable vs requested (bin-packing visibility)
If you have request metrics available, show:
- Requested CPU vs allocatable CPU
- Requested memory vs allocatable memory
This is the difference between:
- “We need more nodes”
- and "We just over-requested resources; requests are blocking scheduling."
3) The design rules that make Kubernetes dashboards “sticky”
Rule 1: Always include TopK tables
Engineers don’t want averages — they want the worst offenders.
Rule 2: Add “symptom + cause” pairs
Example pairs:
- Latency ↑ + CPU throttling ↑
- Pending pods ↑ + node allocatable exhausted
- Restarts ↑ + OOMKilled ↑
- Error rate ↑ + dependency latency ↑
Rule 3: Keep labels under control
Avoid exploding labels like:
- full URL paths (with IDs)
- request IDs
- user IDs
High cardinality can slow Prometheus and make dashboards unreliable during incidents.
4) A practical “first week” plan (so you actually ship this)
Day 1–2: Cluster Overview
- Pending/Failed
- Node pressure
- Top restarts
Day 3–4: Namespace Overview
- CPU/memory by namespace
- Top pods by CPU/memory
- Restarts by namespace
Day 5–7: Workload Overview (for your top 3 services)
- CPU/memory/throttling
- replicas + HPA
- OOMKilled / restarts
By the end of week 1, your team will already trust Grafana more.
Final takeaway
For Kubernetes, dashboards engineers use have one job:
Reduce time-to-root-cause.
If your dashboard helps an on-call engineer go from:
“Something is wrong”
to:
“This pod is OOMKilled because memory request/limit is too low after the last deploy”
…then your dashboard is a success.
Top 20 Kubernetes PromQL queries every SRE should bookmark
If you’re on-call for Kubernetes, you don’t need “more graphs.”
You need fast answers.
This list is built for real incident flow:
Symptom → where → why → who is worst → what to do next
All queries assume you have the common kube metrics (kube-state-metrics + node-exporter + cAdvisor via Prometheus). If some metrics don’t exist in your setup, treat them as “optional” and use the closest equivalent.
1) Cluster health: “Are we on fire?”
1) Pods Pending (cluster-wide)
sum(kube_pod_status_phase{phase="Pending"})
2) Pods Failed (cluster-wide)
sum(kube_pod_status_phase{phase="Failed"})
3) Top restarting pods in the last 1 hour
topk(20, increase(kube_pod_container_status_restarts_total[1h]))
4) Nodes NotReady
sum(kube_node_status_condition{condition="Ready", status=~"false|unknown"})
2) Scheduling & capacity: “Why won’t pods schedule?”
5) Pending pods by namespace (find who’s blocked)
topk(20, sum by (namespace) (kube_pod_status_phase{phase="Pending"}))
6) Unschedulable nodes
sum(kube_node_spec_unschedulable)
7) Nodes under MemoryPressure
sum(kube_node_status_condition{condition="MemoryPressure", status="true"})
8) Nodes under DiskPressure
sum(kube_node_status_condition{condition="DiskPressure", status="true"})
3) Restarts & OOM: “Why are pods crashing?”
9) Top pods restarting in the last 15 minutes (faster signal)
topk(20, increase(kube_pod_container_status_restarts_total[15m]))
10) Top restarting containers (OOMKill suspects, last 1h)
topk(
20,
sum by (namespace, pod, container) (
increase(kube_pod_container_status_restarts_total[1h])
)
)
Use this with #11 to confirm OOMKilled specifically.
11) Containers whose last termination reason was OOMKilled
sum by (namespace, pod, container) (
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
)
12) Pods stuck in CrashLoopBackOff (direct symptom)
sum by (namespace, pod) (
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
)
4) CPU & memory: “Who is hot? Who is starving?”
13) Top pods by CPU usage (cores)
topk(
20,
sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)
)
14) Top pods by memory working set (bytes)
topk(
20,
sum by (namespace, pod) (
container_memory_working_set_bytes{container!="", image!=""}
)
)
15) CPU throttling by pod (the hidden performance killer)
topk(
20,
sum by (namespace, pod) (
rate(container_cpu_cfs_throttled_seconds_total{container!="", image!=""}[5m])
)
)
How to use it:
If p95 latency rises but CPU usage looks “fine,” check throttling. Throttling rising means CPU limits are actively constraining runtime.
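cAdvisor also exposes container_cpu_cfs_periods_total alongside the throttled-periods counter container_cpu_cfs_throttled_periods_total; dividing their rates gives a throttled fraction that is often easier to read than raw throttled seconds. A sketch of the arithmetic with invented numbers (in PromQL this is rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])):

```python
# Throttled fraction from the cAdvisor period counters (illustrative only:
# the increases below are invented sample data over one 5-minute window).
def throttled_fraction(throttled_inc, periods_inc):
    """Counter increases over the same window, e.g. 5 minutes."""
    return throttled_inc / periods_inc if periods_inc else 0.0

# Hypothetical 5-minute increases: 900 of 3000 CFS periods were throttled.
print(throttled_fraction(900, 3000))  # → 0.3, i.e. throttled 30% of the time
```

A fraction above roughly 0.25 while latency climbs is a strong hint that CPU limits, not load, are the bottleneck.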
5) Node saturation: “Is the cluster out of capacity?”
16) Node CPU utilization (fraction 0–1)
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
sum by (instance) (rate(node_cpu_seconds_total[5m]))
17) Node memory utilization (fraction 0–1)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
18) Root filesystem usage (fraction 0–1)
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
6) Kubernetes “quality of life” queries: the ones that save time
19) Pods not Ready (find unreliable workloads)
sum by (namespace, pod) (kube_pod_status_ready{condition="false"})
(Don't sum the condition="true" series filtered with == 0: you'd be summing zeros and every pod would show 0.)
20) Top namespaces by total CPU usage (blame-free, just facts)
topk(
20,
sum by (namespace) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)
)
Bonus: how to use these like an on-call playbook (quick flow)
When “service is slow”
- Check Node CPU/mem (#16, #17)
- Check CPU throttling (#15)
- Check restarts + crashloops (#3, #12)
- Check Pending pods (#1, #5)
When “pods won’t schedule”
- Pending by namespace (#5)
- Node pressure (#7, #8)
- Disk usage (#18)
- Unschedulable nodes (#6)
When “pods keep restarting”
- Top restarts (#3 / #9)
- CrashLoopBackOff (#12)
- OOMKilled (#11)
- Memory usage (#14)
Make these queries “bookmarkable” in Grafana (small tips)
- Put them in a table panel (TopK queries shine in tables)
- Always include namespace and pod in grouping for triage
- Use time ranges intentionally:
  - [15m] for active incidents
  - [1h] for stability investigation
  - [24h] for trend + regression checks