Most dashboards fail for one simple reason: they look impressive but don’t help you answer a real question under pressure.
Engineers don’t open Grafana to admire graphs. They open it when:
- a customer says “it’s slow”
- an alert fires at 2:17 AM
- a deploy just happened
- a KPI suddenly dips
So let’s build dashboards that do what engineers actually need: tell you what’s broken, where it is, why it’s happening, and what to do next.
This guide is beginner-friendly, but it goes deep enough that you can build a dashboard your team will keep pinned for years.

1) The mental model (so everything clicks fast)
Prometheus = time-series database + query engine
Prometheus stores metrics like:
http_requests_total{service="api", status="200"}  1289301
container_cpu_usage_seconds_total{pod="payments-7f8c"}  9812.2
Each metric is a time series: a value changing over time, labeled with dimensions (labels).
Grafana = visualization + exploration
Grafana asks Prometheus questions (queries), then turns results into:
- charts
- tables
- heatmaps
- alerts (optional)
Prometheus is your source of truth. Grafana is your microscope.
2) What “good dashboards” do (the 5 engineer questions)
Every dashboard panel should help answer one of these questions:
- Is the service healthy right now?
- Is it getting better or worse?
- Is the problem traffic, errors, latency, or saturation?
- Which instances/pods/regions are the worst offenders?
- What changed recently? (deploys, config, infra events)
If a panel doesn’t support one of these, it’s probably noise.
3) Metrics you must understand (Counter, Gauge, Histogram)
Counter (only goes up)
Example: http_requests_total
Use it for rates: requests/sec, errors/sec
✅ You almost always query counters with rate().
Gauge (goes up/down)
Example: container_memory_working_set_bytes
Use it for current state: memory usage, queue depth, CPU throttling
Histogram (the secret to useful latency)
If you want p95/p99 latency, you need histograms like:
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.2"}
Then you compute percentiles with histogram_quantile().
If your app only exposes “average latency”, you’ll get misleading dashboards.
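To demystify histogram_quantile(), here is a tiny stdlib-only sketch of the idea behind it: Prometheus assumes observations are spread linearly inside each bucket and interpolates to the requested quantile. This is toy code illustrating the principle, not the real implementation.

```python
# Toy re-implementation of histogram_quantile()'s core idea:
# observations are assumed to be spread linearly inside each bucket.
def bucket_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_le, cumulative_count)."""
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total                # the observation we are looking for
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le      # cannot interpolate into +Inf
            # linear interpolation inside the bucket (prev_le, le]
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Cumulative counts: 90 requests under 0.1s, 98 under 0.2s, 100 total.
buckets = [(0.1, 90), (0.2, 98), (float("inf"), 100)]
print(bucket_quantile(0.95, buckets))  # → 0.1625 (p95 lands in the 0.1–0.2 bucket)
```

This is also why bucket boundaries matter: the p95 above can only ever be as precise as the 0.1–0.2 bucket allows.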
4) PromQL fundamentals (the minimum you need)
PromQL feels scary until you realize it’s mostly three moves:
Move A — pick a metric and filter labels
http_requests_total{service="api", env="prod"}
Move B — turn counters into rates
rate(http_requests_total{service="api"}[5m])
Move C — aggregate by a label
sum by (status) (rate(http_requests_total{service="api"}[5m]))
That’s enough to build 70% of useful dashboards.
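The three moves can be mimicked on plain Python data to make the pipeline concrete. The sample numbers and label sets below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical counter increases over a 5-minute (300 s) window,
# one entry per unique label set of http_requests_total.
WINDOW = 300
series = [
    ({"service": "api", "env": "prod", "status": "200"}, 6000),
    ({"service": "api", "env": "prod", "status": "500"}, 300),
    ({"service": "web", "env": "prod", "status": "200"}, 9000),
]

# Move A: pick the metric and filter labels.
api = [s for s in series if s[0]["service"] == "api" and s[0]["env"] == "prod"]

# Move B: turn counter increases into per-second rates (what rate(...[5m]) does).
rates = [(labels, inc / WINDOW) for labels, inc in api]

# Move C: aggregate by a label (sum by (status)).
by_status = defaultdict(float)
for labels, r in rates:
    by_status[labels["status"]] += r

print(dict(by_status))  # → {'200': 20.0, '500': 1.0}
```

Every Golden Signals query in the next section is just these three moves composed in different orders.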
5) The “Golden Signals” dashboard pattern engineers trust
When on-call engineers say “show me the dashboard,” they usually want four things:
1) Traffic
“How much load is the system handling?”
sum(rate(http_requests_total{service="$service"}[5m]))
2) Errors
“Are users failing?”
sum(rate(http_requests_total{service="$service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
(That’s an error ratio; multiply by 100 in Grafana for %.)
3) Latency (p50/p95/p99)
Assuming histogram buckets http_request_duration_seconds_bucket:
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m]))
)
4) Saturation
“Are we running out of something?” (CPU, memory, threads, connections, queue)
Examples later—this is where real outages hide.
This pattern works for microservices, APIs, queues, and databases.
6) Step-by-step: build a “Service Overview” dashboard (the one engineers actually use)
Step 1 — Decide the dashboard scope
Start with one service. Not the whole universe.
Dashboard name: Service / $service / Overview
It should cover:
- traffic, errors, latency, saturation
- top offenders (pods/instances)
- key dependencies (DB, cache, queue)
- recent changes (deploy annotations)
Step 2 — Add Grafana variables (this is what makes dashboards reusable)
Add dropdown variables so the dashboard works across environments:
- $env (prod/stage/dev)
- $service
- $instance (or $pod)
- $namespace (if Kubernetes)
- $cluster (if multi-cluster)
In Grafana, these are typically populated from the Prometheus datasource with a variable query such as label_values(http_requests_total, service).
Even beginners feel “wow” when a dashboard becomes interactive.
Practical tip: Keep variables in a consistent order across dashboards.
Step 3 — Add panels in the order engineers read during incidents
Engineers scan left-to-right and top-to-bottom.
Row 1: “Am I on fire?”
Panel 1 — Request rate (RPS)
sum(rate(http_requests_total{service="$service", env="$env"}[5m]))
Panel 2 — Error rate %
100 *
sum(rate(http_requests_total{service="$service", env="$env", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service", env="$env"}[5m]))
Panel 3 — Latency p95
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{service="$service", env="$env"}[5m]))
)
✅ These three panels alone solve a huge chunk of “is it broken?” questions.
Row 2: “Where is it broken?”
Panel 4 — Top 10 instances/pods by error rate
topk(
10,
sum by (instance) (rate(http_requests_total{service="$service", env="$env", status=~"5.."}[5m]))
)
Panel 5 — Top 10 by latency (p95 per instance)
If your histogram includes an instance label:
topk(
10,
histogram_quantile(
0.95,
sum by (le, instance) (rate(http_request_duration_seconds_bucket{service="$service", env="$env"}[5m]))
)
)
Engineers love TopK tables because they immediately point to suspects.
Row 3: “Is it resource saturation?”
Pick resource panels that match how your service runs.
If running on Kubernetes
CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod"}[5m]))
Memory working set per pod
sum by (pod) (container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod"})
CPU throttling (very useful)
sum by (pod) (rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$pod"}[5m]))
If throttling rises while latency rises, that’s an “aha” moment.
Row 4: “Did something change?”
Engineers always ask: “Was there a deploy?”
Add annotations:
- deployment events
- config changes
- node restarts
- autoscaling events
Even a simple “deploy marker” makes your dashboard feel alive.
Step 4 — Make it readable (the design rules people skip)
Rule 1: Use consistent units
- latency: milliseconds
- memory: bytes → show as GiB
- CPU: cores or %
- error: %
Rule 2: Avoid 50-line spaghetti graphs
Use:
- topk(10, ...)
- or aggregate by one dimension
Rule 3: Put “current value” first
Most panels should show:
- last value
- trend
- max (optional)
Rule 4: Add one sentence under each row
Example:
- “If p95 latency rises with throttling, check CPU limits or node pressure.”
These micro-hints turn dashboards into teaching tools.
7) Dashboards engineers love (templates you can copy)
A) “RED” dashboard for APIs (Requests, Errors, Duration)
If you run web services, build this first.
- RPS (total + by route if possible)
- 4xx/5xx split
- p50/p95/p99 latency
- Top endpoints by latency
- Top endpoints by error rate
If your metrics include route:
topk(10, sum by (route) (rate(http_requests_total{service="$service", status=~"5.."}[5m])))
B) “USE” dashboard for infrastructure (Utilization, Saturation, Errors)
For nodes, databases, queues.
- CPU utilization
- memory utilization
- disk I/O
- network throughput
- saturation indicators (queue length, connection pool usage)
C) “Kubernetes Cluster Overview”
Engineers use this when “everything is slow”:
- Node CPU/memory pressure
- Pod restarts (topk)
- Pending pods
- Image pull failures
- API server saturation (if available)
- Cluster autoscaler behavior
Example: Pod restarts topk:
topk(20, increase(kube_pod_container_status_restarts_total[1h]))
8) The two performance tricks that separate “okay” from “great”
Trick 1: Recording rules (make dashboards fast)
If a query is heavy and used everywhere (like error ratio), precompute it in Prometheus as a new time series.
You’ll get:
- faster dashboards
- less load on Prometheus
- fewer timeouts during incidents
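A recording rule is just a named query that Prometheus evaluates on a schedule and stores as a new series. A sketch of what that file might look like for the error ratio from earlier (the file, group, and rule names here are made up; the record: names follow the common level:metric:operation convention):

```yaml
# rules.yml — loaded via rule_files: in prometheus.yml
groups:
  - name: service-overview
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service, env) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service, env) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service, env) (rate(http_requests_total[5m]))
```

In Grafana you then query service:http_errors:ratio5m directly instead of recomputing the ratio in every panel.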
Trick 2: Label hygiene (avoid cardinality explosions)
High-cardinality labels can destroy Prometheus performance.
Be careful with labels like:
- user_id
- request_id
- full URL paths with IDs
- random strings
Instead, label by stable dimensions:
service, route, method, status, namespace, pod
Dashboards should be reliable during outages, not the first thing to crash.
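A quick way to sanity-check a label set is to multiply cardinalities: in the worst case, every combination of label values becomes its own time series. The numbers below are invented, but the arithmetic is the point:

```python
# Back-of-the-envelope series-count math: every unique label combination
# can become its own time series in Prometheus (worst case).
stable = {"service": 20, "route": 50, "method": 4, "status": 8}
series = 1
for card in stable.values():
    series *= card
print(series)           # 32,000 series — a lot, but manageable

# Add one high-cardinality label (say 50,000 distinct user_id values)
# and the worst case multiplies:
print(series * 50_000)  # 1,600,000,000 potential series
```

In practice not every combination exists, but labels multiply rather than add, which is why a single user_id label can sink an otherwise healthy setup.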
9) Common mistakes (and the simple fixes)
Mistake: Graphing counters directly
You’ll see a line that only goes up. Useless.
✅ Fix: use rate(counter[5m])
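What rate() actually does is worth internalizing: it converts counter increases into per-second values and corrects for counter resets (a process restart drops the counter back toward zero). A toy stdlib sketch of that behavior:

```python
# Why you graph rate() and not the raw counter: a toy version of rate()
# that, like Prometheus, treats any decrease as a counter reset (restart).
def simple_rate(samples, step):
    """samples: counter values sampled every `step` seconds."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted near 0, so the whole new
        # value is counted as increase (Prometheus applies the same fix).
        increase += cur - prev if cur >= prev else cur
    return increase / (step * (len(samples) - 1))

# Counter climbs, the process restarts at the 4th sample, then climbs again:
print(simple_rate([0, 240, 480, 120, 360], 60))  # → 3.5 requests/sec
```

The raw counter here would graph as a sawtooth that tells you nothing; the rate is a steady 3.5 req/s across the restart.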
Mistake: Only showing average latency
Averages hide pain.
✅ Fix: use p95/p99 from histograms
Mistake: Dashboards that don’t help you find “which instance”
You see a problem but not the culprit.
✅ Fix: add TopK panels by pod/instance
Mistake: Alerting on “CPU > 80%”
High CPU isn’t always bad; low CPU isn’t always good.
✅ Fix: alert on symptoms (errors, latency, saturation) and use CPU as a clue.
10) A practical “dashboard checklist” (use this before publishing)
A dashboard that engineers use typically has:
- A clear title and scope (service/cluster/env)
- Variables for env/service/instance
- Traffic + errors + latency + saturation
- At least 2 TopK panels for “who is worst”
- Annotations for deploys/changes
- Panels have correct units and sane time ranges
- Queries don’t time out under load
- A short “how to read this dashboard” note
If you hit these, your dashboard will get pinned.
Final takeaway
Prometheus gives you raw truth. Grafana makes it usable.
But the difference between “random graphs” and “dashboards engineers use” is simple:
Great dashboards reduce time-to-answer.
They don’t just show data—they show decisions.
Prometheus + Grafana for Kubernetes: dashboards engineers actually use
If you run Kubernetes in production, you already know the feeling:
- Pods restart for “no reason”
- A deploy looks fine… then latency spikes
- Nodes are “healthy” but requests time out
- CPU looks okay, yet the service crawls
This is exactly why Prometheus + Grafana are popular in Kubernetes: they help you answer the real on-call questions:
What’s broken? Where is it broken? Why now? What do I do next?
This guide gives you practical dashboard building blocks (with PromQL examples) that you can implement immediately.
1) First, define what “good Kubernetes dashboards” mean
A Kubernetes dashboard should help you:
- Detect: something is wrong (symptom)
- Localize: which namespace/pod/node (where)
- Explain: why it’s happening (cause)
- Act: what to do next (next action)
If a dashboard doesn’t help you do at least one of these, it’s just decoration.
2) The 4 dashboards every Kubernetes team needs
Engineers don’t need 40 dashboards.
They need 4 great ones:
- Cluster Overview (is the cluster unhealthy?)
- Namespace Overview (which team/app is hurting?)
- Workload / Service Overview (why is this deployment slow?)
- Node & Capacity (are we running out of resources?)
Let’s build each one like a checklist.
Dashboard 1: Cluster Overview (the “Are we on fire?” page)
This is what you open first when alerts are noisy or “everything is slow”.
A) Workload health signals
1) Pods Pending
sum(kube_pod_status_phase{phase="Pending"})
2) Pods Failed
sum(kube_pod_status_phase{phase="Failed"})
3) Top pods restarting (last 1h)
topk(20, increase(kube_pod_container_status_restarts_total[1h]))
Why engineers love this: it instantly points to the biggest pain.
B) Node pressure signals (hidden cause of “random” issues)
4) Nodes not ready
sum(kube_node_status_condition{condition="Ready", status=~"false|unknown"})
(This returns 0 when every node is Ready; the count(... == 0) variant goes blank instead, which makes panels confusing.)
5) Memory pressure nodes
sum(kube_node_status_condition{condition="MemoryPressure", status="true"})
6) Disk pressure nodes
sum(kube_node_status_condition{condition="DiskPressure", status="true"})
C) Scheduling capacity signals
7) Unschedulable nodes
sum(kube_node_spec_unschedulable)
8) Pending pods by namespace (table)
topk(20, sum by (namespace) (kube_pod_status_phase{phase="Pending"}))
How engineers read this dashboard
- If Pending is rising → capacity / scheduling / quota issue
- If restarts spike → crash loops, OOMKills, bad config, failing dependencies
- If node pressure appears → saturation causing downstream failures
Dashboard 2: Namespace Overview (the “Which team is impacted?” page)
This dashboard is for platform + teams.
It shows which namespace is consuming resources and suffering instability.
A) Resource usage by namespace
1) CPU usage by namespace
sum by (namespace) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)
2) Memory usage by namespace
sum by (namespace) (
container_memory_working_set_bytes{container!="", image!=""}
)
B) Top offenders (tables)
3) Top pods by CPU
topk(20, sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
))
4) Top pods by memory
topk(20, sum by (namespace, pod) (
container_memory_working_set_bytes{container!="", image!=""}
))
C) Instability signals
5) Restarting pods by namespace (last 1h)
topk(20, sum by (namespace) (increase(kube_pod_container_status_restarts_total[1h])))
Why this dashboard is powerful
It shifts conversations from:
“Kubernetes is slow”
to:
“Namespace X has 3 pods restarting and is consuming 40% memory.”
That’s actionable.
Dashboard 3: Workload / Service Overview (the “Why is my app slow?” page)
This dashboard is per deployment/service (engineers live here).
Must-have variables (so it’s reusable)
- $namespace
- $workload or $deployment
- $pod
A) Health & scaling
1) Desired vs available replicas
kube_deployment_spec_replicas{namespace="$namespace", deployment="$deployment"}
kube_deployment_status_replicas_available{namespace="$namespace", deployment="$deployment"}
2) HPA current vs desired
kube_horizontalpodautoscaler_status_current_replicas{namespace="$namespace", horizontalpodautoscaler="$hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{namespace="$namespace", horizontalpodautoscaler="$hpa"}
B) CPU, Memory, Throttling (the core triad)
3) CPU usage per pod
sum by (pod) (
rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod", container!="", image!=""}[5m])
)
4) Memory working set per pod
sum by (pod) (
container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod", container!="", image!=""}
)
5) CPU throttling per pod (super important)
sum by (pod) (
rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod=~"$pod", container!="", image!=""}[5m])
)
How to interpret throttling (real example)
If:
- p95 latency rises
- error rate rises
- throttling rises
Then the “CPU looks fine” lie appears — because throttling is the real bottleneck.
C) OOMKills and crash loops (the fastest root-cause wins)
6) OOMKilled containers
sum by (namespace, pod) (
kube_pod_container_status_last_terminated_reason{reason="OOMKilled", namespace="$namespace"} == 1
)
7) Restart reasons (table)
topk(20, increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h]))
Dashboard 4: Node & Capacity (the “Can the cluster handle this?” page)
This is what you use for:
- scaling decisions
- bin-packing issues
- “why are pods pending?”
- cost + capacity tradeoffs
A) Node CPU/Memory saturation
1) CPU utilization per node
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
sum by (instance) (rate(node_cpu_seconds_total[5m]))
(node-exporter identifies nodes by the instance label unless you relabel it to node.)
2) Memory utilization per node
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
B) Node filesystem pressure
3) Disk usage
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
C) Allocatable vs requested (bin-packing visibility)
If you have request metrics available, show:
- Requested CPU vs allocatable CPU
- Requested memory vs allocatable memory
This is the difference between:
- “We need more nodes”
- and "We just over-requested resources; requests are blocking scheduling."
3) The design rules that make Kubernetes dashboards “sticky”
Rule 1: Always include TopK tables
Engineers don’t want averages — they want the worst offenders.
Rule 2: Add “symptom + cause” pairs
Example pairs:
- Latency ↑ + CPU throttling ↑
- Pending pods ↑ + node allocatable exhausted
- Restarts ↑ + OOMKilled ↑
- Error rate ↑ + dependency latency ↑
Rule 3: Keep labels under control
Avoid exploding labels like:
- full URL paths (with IDs)
- request IDs
- user IDs
High cardinality can slow Prometheus and make dashboards unreliable during incidents.
4) A practical “first week” plan (so you actually ship this)
Day 1–2: Cluster Overview
- Pending/Failed
- Node pressure
- Top restarts
Day 3–4: Namespace Overview
- CPU/memory by namespace
- Top pods by CPU/memory
- Restarts by namespace
Day 5–7: Workload Overview (for your top 3 services)
- CPU/memory/throttling
- replicas + HPA
- OOMKilled / restarts
By the end of week 1, your team will already trust Grafana more.
Final takeaway
For Kubernetes, dashboards engineers use have one job:
Reduce time-to-root-cause.
If your dashboard helps an on-call engineer go from:
“Something is wrong”
to:
“This pod is OOMKilled because memory request/limit is too low after the last deploy”
…then your dashboard is a success.
Top 20 Kubernetes PromQL queries every SRE should bookmark
If you’re on-call for Kubernetes, you don’t need “more graphs.”
You need fast answers.
This list is built for real incident flow:
Symptom → where → why → who is worst → what to do next
All queries assume you have the common kube metrics (kube-state-metrics + node-exporter + cAdvisor via Prometheus). If some metrics don’t exist in your setup, treat them as “optional” and use the closest equivalent.
1) Cluster health: “Are we on fire?”
1) Pods Pending (cluster-wide)
sum(kube_pod_status_phase{phase="Pending"})
2) Pods Failed (cluster-wide)
sum(kube_pod_status_phase{phase="Failed"})
3) Top restarting pods in the last 1 hour
topk(20, increase(kube_pod_container_status_restarts_total[1h]))
4) Nodes NotReady
sum(kube_node_status_condition{condition="Ready", status=~"false|unknown"})
2) Scheduling & capacity: “Why won’t pods schedule?”
5) Pending pods by namespace (find who’s blocked)
topk(20, sum by (namespace) (kube_pod_status_phase{phase="Pending"}))
6) Unschedulable nodes
sum(kube_node_spec_unschedulable)
7) Nodes under MemoryPressure
sum(kube_node_status_condition{condition="MemoryPressure", status="true"})
8) Nodes under DiskPressure
sum(kube_node_status_condition{condition="DiskPressure", status="true"})
3) Restarts & OOM: “Why are pods crashing?”
9) Top pods restarting in the last 15 minutes (faster signal)
topk(20, increase(kube_pod_container_status_restarts_total[15m]))
10) Top restarting containers (OOMKill suspects, last 1h)
topk(
20,
sum by (namespace, pod, container) (
increase(kube_pod_container_status_restarts_total[1h])
)
)
Use this with #11 to confirm OOMKilled specifically.
11) Containers whose last termination reason was OOMKilled
sum by (namespace, pod, container) (
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
)
12) Pods stuck in CrashLoopBackOff (direct symptom)
sum by (namespace, pod) (
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
)
4) CPU & memory: “Who is hot? Who is starving?”
13) Top pods by CPU usage (cores)
topk(
20,
sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)
)
14) Top pods by memory working set (bytes)
topk(
20,
sum by (namespace, pod) (
container_memory_working_set_bytes{container!="", image!=""}
)
)
15) CPU throttling by pod (the hidden performance killer)
topk(
20,
sum by (namespace, pod) (
rate(container_cpu_cfs_throttled_seconds_total{container!="", image!=""}[5m])
)
)
How to use it:
If p95 latency rises but CPU usage looks “fine,” check throttling. Throttling rising means CPU limits are actively constraining runtime.
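cAdvisor also exposes container_cpu_cfs_periods_total alongside the throttled-periods counter container_cpu_cfs_throttled_periods_total; dividing their rates gives a throttled fraction that is often easier to read than raw throttled seconds. A sketch of the arithmetic with invented numbers (in PromQL this is rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])):

```python
# Throttled fraction from the cAdvisor period counters (illustrative only:
# the increases below are invented sample data over one 5-minute window).
def throttled_fraction(throttled_inc, periods_inc):
    """Counter increases over the same window, e.g. 5 minutes."""
    return throttled_inc / periods_inc if periods_inc else 0.0

# Hypothetical 5-minute increases: 900 of 3000 CFS periods were throttled.
print(throttled_fraction(900, 3000))  # → 0.3, i.e. throttled 30% of the time
```

A fraction above roughly 0.25 while latency climbs is a strong hint that CPU limits, not load, are the bottleneck.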
5) Node saturation: “Is the cluster out of capacity?”
16) Node CPU utilization (fraction 0–1)
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
sum by (instance) (rate(node_cpu_seconds_total[5m]))
17) Node memory utilization (fraction 0–1)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
18) Root filesystem usage (fraction 0–1)
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
6) Kubernetes “quality of life” queries: the ones that save time
19) Pods not Ready (find unreliable workloads)
sum by (namespace, pod) (kube_pod_status_ready{condition="false"})
(Don't sum the condition="true" series filtered with == 0: you'd be summing zeros and every pod would show 0.)
20) Top namespaces by total CPU usage (blame-free, just facts)
topk(
20,
sum by (namespace) (
rate(container_cpu_usage_seconds_total{container!="", image!=""}[5m])
)
)
Bonus: how to use these like an on-call playbook (quick flow)
When “service is slow”
- Check Node CPU/mem (#16, #17)
- Check CPU throttling (#15)
- Check restarts + crashloops (#3, #12)
- Check Pending pods (#1, #5)
When “pods won’t schedule”
- Pending by namespace (#5)
- Node pressure (#7, #8)
- Disk usage (#18)
- Unschedulable nodes (#6)
When “pods keep restarting”
- Top restarts (#3 / #9)
- CrashLoopBackOff (#12)
- OOMKilled (#11)
- Memory usage (#14)
Make these queries “bookmarkable” in Grafana (small tips)
- Put them in a table panel (TopK queries shine in tables)
- Always include namespace and pod in grouping for triage
- Use time ranges intentionally:
  - [15m] for active incidents
  - [1h] for stability investigation
  - [24h] for trend + regression checks