Quick Definition
Four golden signals are four core metrics—latency, traffic, errors, and saturation—used to assess system health. Analogy: like a car dashboard with speed, fuel, engine light, and temperature. Formal: a minimal SRE observability model mapping SLIs to system reliability and operational thresholds.
What is Four golden signals?
The “Four golden signals” are a focused observability model that originated in SRE practice to help teams detect, prioritize, diagnose, and resolve incidents faster. They are a minimal set of telemetry categories covering both the user-facing behavior and the resource pressure of a service.
What it is NOT
- Not a complete observability solution.
- Not a replacement for business metrics, security logs, or deep application tracing.
- Not a one-size-fits-all SLO; it’s a starting point.
Key properties and constraints
- Focused: only four signal categories.
- User-centric: emphasizes latency and errors experienced by users.
- System pressure-aware: includes saturation as a resource-level indicator.
- Requires context: needs SLIs, SLOs, and service topology for actionability.
- Scalable: suitable across monoliths, microservices, and serverless, but implementations differ.
Where it fits in modern cloud/SRE workflows
- Incident detection: primary alerting SLIs map to these signals.
- Triage and diagnosis: quickly narrows down where to look.
- SLO/SLA design: forms the basis for SLIs.
- Automation: triggers automation playbooks and auto-remediation.
- Capacity planning and cost optimization.
Diagram description (text-only)
- User requests flow into edge/load balancer; track traffic and latency at ingress.
- Requests routed to services; track latency and errors per service.
- Services consume resources; monitor saturation on CPU, memory, DB connections.
- Alerts derived from SLOs on latency and errors; autoscaling reacts to saturated metrics.
- Tracing spans link high-latency requests to downstream services and resource metrics.
Four golden signals in one sentence
Latency, traffic, errors, and saturation together provide a compact, user-focused view to detect and diagnose reliability issues across distributed systems.
Four golden signals vs related terms
| ID | Term | How it differs from Four golden signals | Common confusion |
|---|---|---|---|
| T1 | SLIs | SLIs are specific measurements; golden signals are categories | People conflate category with concrete SLI |
| T2 | SLOs | SLOs are targets built from SLIs, not signals themselves | Assuming signals equal objectives |
| T3 | Metrics | Generic telemetry; golden signals are a prioritized subset | Treating every metric as a golden signal |
| T4 | Tracing | Tracing shows request paths; golden signals are high-level KPIs | Using traces instead of signals to alert |
| T5 | Logs | Logs are detailed events; signals are aggregated indicators | Thinking logs replace metrics |
| T6 | Business KPIs | Business KPIs map to user impact; golden signals are technical | Mistaking business metrics for signals |
| T7 | Health checks | Liveness checks are boolean; golden signals are continuous | Relying solely on health checks |
| T8 | Uptime SLAs | SLA is contractual; golden signals inform SLIs/SLOs | Confusing SLA compliance with observability |
Why does Four golden signals matter?
Business impact
- Revenue: slow or errored flows directly reduce conversions and transactions.
- Trust: repeated performance incidents erode customer confidence.
- Risk: undetected saturation can cause cascading failures and downtime.
Engineering impact
- Incident frequency: focusing on core signals reduces undetected failures.
- Velocity: predictable SLOs enable safe deployments and feature velocity.
- Toil reduction: standardized signals enable automation for common fixes.
SRE framing
- SLIs: measure user experience; golden signals are common SLI categories.
- SLOs: set targets on SLIs using latency and error budgets.
- Error budgets: drive release decisions and prioritize reliability work (see the worked example below).
- Toil/on-call: use signals to reduce manual diagnosis and unnecessary paging.
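To make the error-budget arithmetic concrete, here is a small worked example in Python; the SLO value, window, and request volume are illustrative assumptions rather than recommendations.

```python
# Worked example: how much failure a 99.9% availability SLO tolerates
# over a 30-day window. All numbers are illustrative assumptions.
slo = 0.999                        # 99.9% of requests must succeed
window_days = 30
requests_per_month = 100_000_000   # assumed traffic for the service

error_budget_fraction = 1 - slo                  # 0.001, i.e. 0.1%
allowed_failed_requests = requests_per_month * error_budget_fraction
downtime_minutes = window_days * 24 * 60 * error_budget_fraction

print(f"Allowed failed requests: {allowed_failed_requests:,.0f}")    # 100,000
print(f"Equivalent full-outage budget: {downtime_minutes:.1f} min")  # 43.2 min
```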
What breaks in production (realistic examples)
- A new release increases median latency due to a slow DB query plan change.
- Connection pool exhaustion in a microservice causes cascading 503 errors.
- CPU saturation in an autoscaling group leads to increased request queuing and timeouts.
- Edge load balancer misconfiguration drops traffic spikes causing blackouts.
- Cost-driven scaling rules scale down too aggressively and induce cold-start latency.
Where is Four golden signals used?
| ID | Layer/Area | How Four golden signals appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge – CDN | Traffic and latency at ingress | request count, latency | LB logs, observability platform |
| L2 | Network | Latency and errors on network paths | packet loss, latency | Network telemetry |
| L3 | Service | Primary service SLIs mapped to signals | request latency, errors | APM metrics, traces |
| L4 | App | User-experienced latency and error rates | endpoint latency, error rate | Application metrics |
| L5 | Data layer | Saturation and latency of DBs | query latency, connection usage | DB exporter |
| L6 | Infra | Host CPU and memory saturation | CPU, memory, disk IO | Infrastructure metrics |
| L7 | Kubernetes | Pod latency, errors, and node saturation | pod latency, restarts | Kubernetes metrics |
| L8 | Serverless | Invocation latency, errors, and concurrency | cold start latency, errors | Managed platform metrics |
| L9 | CI/CD | Deployment traffic and error spikes | deployment traffic, failures | CI telemetry |
| L10 | Security | Errors related to auth and rate limits | auth failures, latency | Security logs |
When should you use Four golden signals?
When it’s necessary
- Start with the four signals on any user-facing service.
- Required when you need SLO-driven operations and automated paging.
- Essential during incident triage to narrow search scope quickly.
When it’s optional
- Internal batch jobs where user experience is not immediate.
- Highly experimental services in early dev where rapid iteration matters more than resilience.
When NOT to use / overuse it
- Not sufficient as the sole observability strategy for security, billing, or compliance.
- Do not reduce all telemetry to only the four signals; detailed metrics and traces remain vital.
Decision checklist
- If service has user requests and latency matters -> implement all four.
- If service is background batch with no user-facing latency -> focus on saturation and errors.
- If you need cost optimization but also reliability -> combine signals + cost metrics.
Maturity ladder
- Beginner: collect request latency and error rate per service.
- Intermediate: add traffic and basic saturation (CPU/memory/DB connections).
- Advanced: SLI/SLO lifecycle with automated alerting, burn-rate, tracing-linked SLIs, and AI-assisted anomaly detection.
How does Four golden signals work?
Components and workflow
- Instrumentation: application exposes request metrics and errors; infra exports resource metrics.
- Aggregation: metrics aggregated at service, endpoint, and host levels.
- SLIs: compute SLIs from aggregated metrics (e.g., p95 latency, success rate).
- SLOs and alerts: define SLOs and alert rules with thresholds and burn-rate.
- Triage: dashboards and traces used to diagnose alerts.
- Automation: runbooks or remediation scripts triggered by alerts.
Data flow and lifecycle
- Event -> Metric emission -> Collector -> Aggregation/rollups -> Storage -> Alerts/Dashboards -> Incident lifecycle -> SLI/SLO review.
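As a minimal sketch of the instrumentation and emission steps above, the following Python example uses the prometheus_client library to expose all four signals for a single, hypothetical /checkout endpoint; the handler logic is a placeholder.

```python
# Minimal four-signals instrumentation sketch using prometheus_client.
# The endpoint name and handler logic are hypothetical placeholders.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Traffic: total requests", ["endpoint"])
ERRORS = Counter("http_request_errors_total", "Errors: failed requests", ["endpoint"])
LATENCY = Histogram("http_request_duration_seconds", "Latency per request", ["endpoint"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Saturation proxy: concurrent requests")

def handle_checkout():
    REQUESTS.labels(endpoint="/checkout").inc()
    IN_FLIGHT.inc()
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
        if random.random() < 0.01:              # simulate an occasional failure
            raise RuntimeError("payment backend unavailable")
    except RuntimeError:
        ERRORS.labels(endpoint="/checkout").inc()
    finally:
        LATENCY.labels(endpoint="/checkout").observe(time.perf_counter() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for the Prometheus scraper
    while True:
        handle_checkout()
```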
Edge cases and failure modes
- Missing instrumentation: gaps cause blindspots.
- High cardinality: causes storage and query blowup.
- Metric delays: delayed metrics cause false negatives.
- Aggregation masking: rollups hide tail latency.
Typical architecture patterns for Four golden signals
- Sidecar metrics exporter pattern: use sidecar to export app metrics when language SDKs unavailable.
- Pushgateway for short-lived jobs: push batch job metrics to a collector for aggregation.
- Service mesh observability: collect metrics from mesh sidecars for consistent telemetry.
- Serverless managed metrics: rely on managed platform metrics augmented with custom traces.
- Agent-based infrastructure monitoring: node agent collects CPU/memory/disk metrics centrally.
- Hybrid cloud observability: federate metrics across clouds and centralize SLO evaluation.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Missing metrics | Blank dashboard panels | No instrumentation | Add exporters; instrument code | No-data alerts |
| F2 | High cardinality | Slow queries, high cost | Unbounded tag values | Reduce labels; drop values via relabeling | Increased query latency |
| F3 | Metric delays | Late alerts | Scrape lag or network delay | Increase scrape frequency | Increased latency spread |
| F4 | Aggregation loss | Hidden tail latency | Overaggressive rollups | Keep high-res data for p99 | Rollup delta anomalies |
| F5 | Alert storm | Multiple simultaneous pages | Poor alert dedupe | Implement grouping and filters | High alert rate |
| F6 | Saturation blindspot | CPU spikes unnoticed | No saturation metrics | Add resource exporters | Node CPU utilization |
| F7 | Cost blowup | Unexpected billing spike | Too many metrics retained | Adjust retention policy | Storage growth metric |
Key Concepts, Keywords & Terminology for Four golden signals
- SLI — A Service Level Indicator measuring a specific behavior — Drives SLOs — Pitfall: vague definitions.
- SLO — Service Level Objective target for an SLI — Guides reliability work — Pitfall: unrealistic targets.
- Error budget — Allowable error percentage under SLO — Enables releases — Pitfall: ignored budgets.
- Latency — Time for a request to complete — Key user experience measure — Pitfall: relying on mean instead of percentiles.
- Traffic — Request rate or throughput — Shows load patterns — Pitfall: not normalized per unit.
- Errors — Failure rate or error codes — Indicates user impact — Pitfall: noisy non-user-facing errors.
- Saturation — Resource utilization and pressure — Predicts capacity issues — Pitfall: single-metric assumption.
- P95/P99 — Percentile metrics for latency — Captures tail behavior — Pitfall: only monitoring median.
- Availability — Fraction of successful requests — Business-facing reliability — Pitfall: equating uptime with performance.
- Observability — Ability to infer system internal state — Enables debugging — Pitfall: collecting data without context.
- Instrumentation — Adding telemetry to code — Enables SLIs — Pitfall: inconsistent naming.
- Aggregation — Summarizing raw metrics — Enables scaling — Pitfall: losing granularity.
- Tagging — Labels on metrics — Enables slicing — Pitfall: high cardinality.
- Cardinality — Number of unique tag combinations — Affects storage & queries — Pitfall: unbounded tags.
- Scrape interval — How often metrics are collected — Affects freshness — Pitfall: too long intervals.
- Rollup — Summarized time-series data — Lowers cost — Pitfall: hides tails.
- Sampling — Partial tracing or metrics collection — Reduces overhead — Pitfall: misses rare events.
- Tracing — Distributed request traces — Helps root cause — Pitfall: heavy overhead if always on.
- Logging — Event records — Supports forensic analysis — Pitfall: unstructured noisy logs.
- Alerting — Notification based on rules — Drives incident response — Pitfall: alert fatigue.
- Burn-rate — Rate at which error budget is consumed — Triggers mitigations — Pitfall: complex tuning.
- Canary — Incremental rollout pattern — Limits blast radius — Pitfall: insufficient coverage.
- Rollback — Reverting to previous version — Fast mitigation — Pitfall: discards fixes.
- Autoscaling — Automatic capacity adjustment — Responds to traffic/saturation — Pitfall: reactive oscillation.
- Throttling — Limiting request rate — Protects downstream systems — Pitfall: poor UX.
- Backpressure — Flow control between services — Prevents overload — Pitfall: adds latency.
- Health check — Liveness/readiness probe — Quick gating — Pitfall: too permissive checks.
- Synthetic monitoring — Proactive user journey checks — Detects regressions — Pitfall: synthetic != real user.
- Real-user monitoring — Collects client-side metrics — Measures actual experience — Pitfall: privacy concerns.
- APM — Application Performance Monitoring — Deep app metrics and traces — Pitfall: high cost.
- Service mesh — Network layer for microservices — Adds observability hooks — Pitfall: complexity overhead.
- Exporter — Adapter to expose metrics — Standardizes telemetry — Pitfall: misconfigured metrics.
- Collector — Aggregates and forwards metrics — Centralizes data — Pitfall: single point of failure.
- Metric retention — How long data is stored — Balances cost vs analysis — Pitfall: losing historical trends.
- Anomaly detection — Automated pattern detection — Spots unseen issues — Pitfall: false positives.
- Correlation — Linking events across signals — Speeds diagnosis — Pitfall: correlation != causation.
- Runbook — Operational recipe to resolve incidents — Reduces toil — Pitfall: outdated playbooks.
- Postmortem — Incident retrospective — Drives improvements — Pitfall: blame-focused analysis.
- Service level — Logical unit for SLOs and ownership — Clarity for teams — Pitfall: ambiguous boundaries.
How to Measure Four golden signals (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Latency p50 | Typical user latency | 50th percentile of request time | Service dependent; see details below: M1 | Use percentiles not mean |
| M2 | Latency p95 | Tail latency impact | 95th percentile request time | 300ms for web APIs | P95 may hide p99 issues |
| M3 | Latency p99 | Worst tail latency | 99th percentile request time | 1s for critical paths | High noise and sparse data |
| M4 | Request rate | Load on service | Count requests per second | Baseline from traffic | Bursty traffic spikes |
| M5 | Error rate | Fraction of failed requests | Failed requests / total | 0.1% initial | Include non-user errors separately |
| M6 | Availability | Successful request ratio | Successful / total | 99.9% typical | Depends on SLA needs |
| M7 | CPU util | Host/machine load | CPU usage percent | Keep below 70% | Short spikes acceptable |
| M8 | Memory util | Memory pressure | Memory used percent | Keep below 75% | Leaks cause gradual growth |
| M9 | Connection usage | DB connection saturation | Open connections / max | <70% of pool | Pool exhaustion causes errors |
| M10 | Queue depth | Backlog in queues | Items in queue | See details below: M10 | Queue growth signals downstream issues |
| M11 | Throttle rate | Requests dropped by throttling | Dropped / attempted | Minimal | Can mask real errors |
| M12 | GC pause p95 | Impact from GC pauses | 95th of pause durations | <50ms | GC tuning required |
| M13 | Cold start latency | Serverless start delay | Time from invoke to ready | <200ms desired | Varies by runtime |
| M14 | Container restarts | Stability of pods | Restart count per hour | 0 expected | CrashLoopBackOff indicates bug |
| M15 | Disk IO latency | Storage delays | IO wait times | Low ms | Affects DB latency |
Row Details
- M1: Choose percentiles per endpoint; compute from latest 1m/5m windows.
- M10: For background jobs track both backlog and rate draining to avoid silent failures.
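The sketch below shows one way the latency-percentile and error-rate SLIs from the table can be computed from raw request records; in practice these values come from your metrics backend, and the record format here is an assumption for illustration.

```python
# Sketch: compute p50/p95/p99 latency and error rate from raw request records.
# The record format (duration_ms, success) is an assumption for illustration.
from dataclasses import dataclass
import math

@dataclass
class RequestRecord:
    duration_ms: float
    success: bool

def percentile(values, pct):
    """Nearest-rank percentile; avoids averaging that can hide tails."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def compute_slis(records):
    durations = [r.duration_ms for r in records]
    failures = sum(1 for r in records if not r.success)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "latency_p99_ms": percentile(durations, 99),
        "error_rate": failures / len(records),
        "request_count": len(records),
    }

window = [RequestRecord(120.0, True), RequestRecord(340.0, True),
          RequestRecord(95.0, True), RequestRecord(1200.0, False)]
print(compute_slis(window))
```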
Best tools to measure Four golden signals
Tool — Prometheus
- What it measures for Four golden signals: metrics collection for latency traffic errors saturation.
- Best-fit environment: Kubernetes, VMs, self-managed infra.
- Setup outline:
- Instrument apps with client libraries.
- Deploy node and app exporters.
- Configure scrape jobs and retention.
- Use PromQL to compute SLIs (example queries in the sketch below).
- Integrate with Alertmanager.
- Strengths:
- Flexible query language.
- Wide ecosystem and exporters.
- Limitations:
- Scaling and long-term storage need external solutions.
- High cardinality costs.
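To make the PromQL step concrete, here is a hedged sketch that evaluates one SLI per golden signal through Prometheus's HTTP query API; the server address and metric names (matching the earlier instrumentation sketch) are assumptions to adapt.

```python
# Sketch: evaluate golden-signal SLIs via Prometheus's HTTP query API.
# The server URL and metric names are assumptions; adjust to your setup.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # hypothetical address

QUERIES = {
    "traffic_rps": 'sum(rate(http_requests_total[5m]))',
    "error_ratio": 'sum(rate(http_request_errors_total[5m]))'
                   ' / sum(rate(http_requests_total[5m]))',
    "latency_p95_s": 'histogram_quantile(0.95,'
                     ' sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    "saturation_cpu": 'avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}

def evaluate(query: str) -> float:
    resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

for name, query in QUERIES.items():
    print(f"{name}: {evaluate(query):.4f}")
```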
Tool — OpenTelemetry
- What it measures for Four golden signals: traces, metrics, and logs for comprehensive signals.
- Best-fit environment: Cloud-native and microservices.
- Setup outline:
- Add OTEL SDKs to services (see the Python sketch below).
- Configure collectors and exporters.
- Export to chosen backend.
- Define metric views for SLIs.
- Strengths:
- Vendor-neutral standard.
- Unified telemetry.
- Limitations:
- Complex setup and evolving specs.
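A minimal sketch of the SDK setup step using the OpenTelemetry Python metrics API; the console exporter and the metric names are assumptions for illustration, and a real deployment would export to an OTLP collector instead.

```python
# Sketch: OpenTelemetry Python metrics setup for golden-signal telemetry.
# Console export and metric names are assumptions for illustration.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=15000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
request_counter = meter.create_counter("http.server.requests", description="Traffic")
error_counter = meter.create_counter("http.server.errors", description="Errors")
latency_hist = meter.create_histogram("http.server.duration", unit="ms", description="Latency")

# Record one request's worth of signals.
attrs = {"http.route": "/checkout"}
request_counter.add(1, attrs)
latency_hist.record(42.0, attrs)
error_counter.add(1, {**attrs, "http.status_code": 503})
```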
Tool — Grafana
- What it measures for Four golden signals: visualization dashboards and alerting front-end.
- Best-fit environment: Any metrics backend.
- Setup outline:
- Connect data sources.
- Build dashboards with panels for four signals.
- Configure alerts and notification channels.
- Strengths:
- Flexible dashboards.
- Rich panel plugins.
- Limitations:
- Alerting complexity at scale.
Tool — Datadog
- What it measures for Four golden signals: integrated metrics traces logs APM and infra.
- Best-fit environment: Cloud and hybrid with managed features.
- Setup outline:
- Install agents and APM libraries.
- Configure monitors for SLIs.
- Use dashboards and notebooks.
- Strengths:
- Unified managed platform.
- Easy onboarding.
- Limitations:
- Cost at high scale.
- Vendor lock-in concerns.
Tool — AWS CloudWatch
- What it measures for Four golden signals: managed metrics and logs for AWS services and custom metrics.
- Best-fit environment: AWS-hosted workloads and serverless.
- Setup outline:
- Emit custom metrics (see the boto3 sketch below).
- Use CloudWatch metrics and logs insights.
- Create dashboards and alarms.
- Strengths:
- Native integration with AWS services.
- Managed scaling.
- Limitations:
- Metric granularity and cross-account complexity.
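A hedged sketch of the “emit custom metrics” step using boto3; the namespace, metric names, and dimensions are assumptions.

```python
# Sketch: publish custom golden-signal metrics to CloudWatch with boto3.
# Namespace, metric names, and dimension values are assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_request_metrics(latency_ms: float, is_error: bool) -> None:
    dimensions = [{"Name": "Service", "Value": "checkout"}]
    cloudwatch.put_metric_data(
        Namespace="MyApp/GoldenSignals",
        MetricData=[
            {"MetricName": "RequestLatency", "Value": latency_ms,
             "Unit": "Milliseconds", "Dimensions": dimensions},
            {"MetricName": "Requests", "Value": 1.0,
             "Unit": "Count", "Dimensions": dimensions},
            {"MetricName": "Errors", "Value": 1.0 if is_error else 0.0,
             "Unit": "Count", "Dimensions": dimensions},
        ],
    )

publish_request_metrics(latency_ms=182.0, is_error=False)
```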
Recommended dashboards & alerts for Four golden signals
Executive dashboard
- Panels: global availability %, SLO burn-rate, top-5 services by error impact, traffic trend, cost trend.
- Why: gives leadership a quick health snapshot tied to business impact.
On-call dashboard
- Panels: p95/p99 latency per service, current error rate with top error types, saturation metrics per node/pod, recent deployment marker, active alerts and runbook links.
- Why: focused on fast triage for paged engineers.
Debug dashboard
- Panels: traces for slow endpoints, request waterfall, per-endpoint histogram, resource utilizations, DB query latency, recent logs correlated by trace ID.
- Why: deeper diagnostic data for resolving root cause.
Alerting guidance
- Page vs ticket: page on SLO burn-rate crossing emergency threshold or user-facing outage; create tickets for SLO degradation with no immediate user impact.
- Burn-rate guidance: page when the burn-rate indicates the error budget will exhaust within 1 hour for critical services; ticket at 24-hour burn-rate thresholds (see the sketch below).
- Noise reduction tactics: dedupe alerts by group labels, suppress alerts during known maintenance, use grouping rules to reduce duplicate pages.
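A sketch of the burn-rate guidance above, assuming a 30-day SLO window; the thresholds mirror the one-hour page / 24-hour ticket rule and should be tuned per service.

```python
# Sketch: burn-rate check implementing the page/ticket guidance above.
# The SLO, window, and thresholds are assumptions to tune per service.
SLO = 0.999                       # 99.9% success target
ERROR_BUDGET = 1 - SLO            # fraction of requests allowed to fail
WINDOW_HOURS = 30 * 24            # 30-day SLO window

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return observed_error_ratio / ERROR_BUDGET

def classify(short_window_error_ratio: float, long_window_error_ratio: float) -> str:
    # Budget exhausts within ~1 hour when burn rate >= WINDOW_HOURS / 1,
    # and within ~24 hours when burn rate >= WINDOW_HOURS / 24.
    page_threshold = WINDOW_HOURS / 1
    ticket_threshold = WINDOW_HOURS / 24
    if burn_rate(short_window_error_ratio) >= page_threshold:
        return "PAGE"
    if burn_rate(long_window_error_ratio) >= ticket_threshold:
        return "TICKET"
    return "OK"

print(classify(short_window_error_ratio=0.9, long_window_error_ratio=0.02))  # PAGE
```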
Implementation Guide (Step-by-step)
1) Prerequisites
   - Team ownership defined per service.
   - Baseline monitoring platform and storage.
   - Instrumentation libraries chosen and available.
2) Instrumentation plan
   - Identify user-facing endpoints and background jobs.
   - Define metric names and a label scheme.
   - Implement timing and success/error counters.
3) Data collection
   - Deploy collectors/exporters.
   - Set scrape/push intervals appropriate to the SLA.
   - Ensure retention meets analysis needs.
4) SLO design
   - Choose SLIs for latency and errors for each customer-facing flow.
   - Set SLO targets derived from business impact.
   - Define error budget policies and escalation paths.
5) Dashboards
   - Build executive, on-call, and debug dashboards.
   - Add annotations for deployments and maintenance windows.
6) Alerts & routing
   - Translate SLO breaches to alerts with burn-rate logic.
   - Configure dedupe, grouping, and responders.
   - Integrate with on-call and incident management.
7) Runbooks & automation
   - Create immediate remediation runbooks for typical signal patterns.
   - Implement safe automated mitigations for common saturations.
8) Validation (load/chaos/game days)
   - Run load tests to validate SLOs.
   - Execute chaos tests to verify automation and paging behavior.
   - Conduct game days simulating incidents.
9) Continuous improvement
   - Weekly review of alert trends.
   - Postmortems after incidents; adjust SLOs and runbooks.
Checklists
Pre-production checklist
- Instrumented core endpoints for latency, errors, traffic.
- Exporters deployed to staging.
- Baseline dashboards populated.
- SLO draft created.
Production readiness checklist
- SLIs active and validated.
- Alerts configured and tested.
- Runbooks available and reachable.
- On-call ownership assigned.
Incident checklist specific to Four golden signals
- Verify which of the four signals alerted.
- Check SLO burn-rate and recent deploys.
- Correlate with traces and resource metrics.
- Execute runbook or escalate.
Use Cases of Four golden signals
1) Public API health monitoring – Context: High-traffic external API. – Problem: Latency spikes causing customer complaints. – Why it helps: P95 and error rate quickly reveal regressions. – What to measure: p95/p99 latency, error rate, CPU, DB connections. – Typical tools: APM, metrics platform, tracing.
2) Mobile app backend – Context: Mobile users sensitive to tail latency. – Problem: Intermittent slow responses for a subset of users. – Why it helps: Tail latencies reveal cold-start or edge issues. – What to measure: p99 latency, region traffic, cold starts. – Typical tools: Real-user monitoring, tracing.
3) Kubernetes microservices – Context: Many small services interacting. – Problem: Cascading failures due to connection exhaustion. – Why it helps: Saturation and errors pinpoint resource limits. – What to measure: pod restarts, connection pool usage, latency. – Typical tools: Prometheus, Grafana, mesh telemetry.
4) Serverless function performance – Context: Managed functions with cold starts. – Problem: Unexpected increase in cold start latency. – Why it helps: Tracks cold-start latency and error spikes. – What to measure: cold start p95, concurrency, errors. – Typical tools: Cloud provider metrics, traces.
5) Database scaling – Context: Heavy analytical queries during batch windows. – Problem: Increased query latency affecting OLTP. – Why it helps: Saturation and query latency reveal contention. – What to measure: query latency p95, connection usage, CPU. – Typical tools: DB exporter, APM.
6) CDN edge failures – Context: Edge cache misconfigurations causing origin hits. – Problem: Latency and traffic surge at origin. – Why it helps: Traffic and latency across edge layers highlight origin pressure. – What to measure: cache hit ratio, request latency at edge and origin. – Typical tools: CDN telemetry, synthetic checks.
7) CI/CD deployment safety – Context: Frequent deployments. – Problem: Deployments introduce regressions. – Why it helps: Immediate spike in errors or latency triggers rollback. – What to measure: error rate per deploy, latency change, traffic distribution. – Typical tools: Deployment automation + monitoring.
8) Cost-performance trade-offs – Context: Pressure to reduce infra spend. – Problem: Over-aggressive downsizing increases tail latency. – Why it helps: Saturation + latency shows where cost cuts hurt UX. – What to measure: CPU utilization, p95 latency, scaling events. – Typical tools: Cost monitoring integrated with metrics.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: Traffic surge causes 503s
Context: E-commerce service running on Kubernetes sees holiday traffic spike.
Goal: Detect and resolve 503 errors quickly and scale safely.
Why Four golden signals matters here: Traffic and saturation reveal load; latency and errors show user impact.
Architecture / workflow: Ingress -> Service A (frontend) -> Service B (payments) -> DB. Metrics from pods, nodes, and DB collected.
Step-by-step implementation:
- Instrument request latency and error counters at each service.
- Export pod CPU, memory, and container restarts.
- Set SLOs for payment latency and availability.
- Configure autoscaler based on CPU and custom request metrics.
- Create on-call dashboard and runbook for scaling and rollback.
What to measure: request rate, p95 latency, error rate, pod CPU, DB connections.
Tools to use and why: Prometheus for metrics, Grafana dashboards, HPA for autoscaling, APM for traces.
Common pitfalls: Relying solely on CPU for autoscale; missing DB connection pool limits.
Validation: Load test to expected peak; run chaos injection turning off nodes to validate autoscaling.
Outcome: Autoscaling combined with connection pool tuning prevents sustained 503s.
Scenario #2 — Serverless: Cold start impacting latency
Context: Notification service using managed functions experiences sporadic long delays.
Goal: Reduce cold-start tail latency and maintain SLO for notifications.
Why Four golden signals matters here: Cold start latency is the saturation/latency indicator for serverless.
Architecture / workflow: Event producer -> Function (managed) -> External API. Metrics from cloud provider + custom traces.
Step-by-step implementation:
- Instrument function duration and a cold-start flag (see the sketch at the end of this scenario).
- Measure invocation rate and concurrency.
- Add provisioned concurrency or keepwarm strategy for critical functions.
- Set p95 latency SLO and alert on cold-start frequency.
What to measure: cold start p95, function errors, concurrency.
Tools to use and why: CloudWatch or provider metrics; OpenTelemetry traces.
Common pitfalls: Overprovisioning raising cost; inadequate sampling hiding cold starts.
Validation: Simulate low-traffic bursts and measure p99 latency.
Outcome: Provisioning reduces p95 by eliminating cold starts within cost targets.
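A minimal sketch of the cold-start instrumentation step from this scenario; the handler shape and the log-based metric emission are assumptions, and the fields would be mapped to provider metrics or an exporter in practice.

```python
# Sketch: record a cold-start flag and duration from a function handler.
# The handler signature and log-based emission are assumptions.
import json
import time

COLD_START = True   # module scope survives across warm invocations

def handler(event, context):
    global COLD_START
    was_cold = COLD_START
    COLD_START = False

    start = time.perf_counter()
    # ... actual notification logic would go here ...
    duration_ms = (time.perf_counter() - start) * 1000

    # Structured log line that the telemetry pipeline turns into metrics.
    print(json.dumps({
        "metric": "function_invocation",
        "cold_start": was_cold,
        "duration_ms": round(duration_ms, 2),
    }))
    return {"status": "ok"}

handler({}, None)
```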
Scenario #3 — Incident response / postmortem: Rollout introduces regression
Context: New release increases p95 latency and errors across services.
Goal: Rapid triage, rollback if necessary, and actionable postmortem.
Why Four golden signals matters here: Latency and errors are immediate indicators of regression.
Architecture / workflow: CI/CD deploy -> services updated -> monitoring detects increased error rate.
Step-by-step implementation:
- Alert triggers on error rate and burn-rate.
- On-call examines dashboards and traces to identify offending service.
- Rollback deployment if error budget exhaustion imminent.
- Postmortem collects timeline, golden signals graphs, root cause, and corrective actions.
What to measure: deployment timestamps, p95/p99 latency, error types.
Tools to use and why: CI system, Grafana, tracing system, incident management tool.
Common pitfalls: Lack of deployment markers in metrics (see the annotation sketch after this scenario); incomplete runbooks.
Validation: Replay deployment in staging with synthetic load.
Outcome: Fast rollback reduces customer impact and postmortem drives fix.
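One pitfall above is missing deployment markers; the sketch below pushes a deployment annotation to Grafana's annotations HTTP API at deploy time. The Grafana URL, token handling, and tags are assumptions to adapt.

```python
# Sketch: push a deployment marker to Grafana's annotations API at deploy time.
# The Grafana URL, API token, and tag names are assumptions to adapt.
import time
import requests

GRAFANA_URL = "http://grafana:3000/api/annotations"   # hypothetical address
API_TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"

def annotate_deployment(service: str, version: str) -> None:
    payload = {
        "time": int(time.time() * 1000),               # epoch milliseconds
        "tags": ["deployment", service],
        "text": f"Deployed {service} version {version}",
    }
    resp = requests.post(
        GRAFANA_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()

annotate_deployment("payments", "2026.01.3")
```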
Scenario #4 — Cost/performance trade-off: Rightsizing infra
Context: Platform needs to cut costs while preserving UX.
Goal: Identify safe areas to reduce capacity without SLO breaches.
Why Four golden signals matters here: Saturation and latency show where cuts would affect users.
Architecture / workflow: Multi-service platform with cloud VMs and managed DB.
Step-by-step implementation:
- Baseline current p95/p99 latency and CPU/memory saturation.
- Run controlled scaling experiments reducing instance counts.
- Monitor error rate and burn-rate closely.
- If SLOs remain acceptable, apply gradual rightsizing and monitor.
What to measure: p95, p99, CPU, queue depth, errors.
Tools to use and why: Metrics platform, cost analytics, APM.
Common pitfalls: Ignoring tail latency; not considering regional failover impact.
Validation: Canary rightsizing in non-critical region and monitor SLOs.
Outcome: Achieve cost savings while preserving user-facing SLOs.
Common Mistakes, Anti-patterns, and Troubleshooting
1) Symptom: Alerts but no traces available -> Root cause: Tracing not instrumented for that endpoint -> Fix: Add OpenTelemetry tracing and propagate context.
2) Symptom: High p95 but normal p50 -> Root cause: Tail latency from occasional blocking calls -> Fix: Identify the slow path via traces and optimize or add timeouts.
3) Symptom: Sudden alert storm -> Root cause: Aggregated alerting on a noisy metric -> Fix: Implement grouping and reduce sensitivity; add dedupe.
4) Symptom: No data on dashboard -> Root cause: Exporter misconfigured or collector down -> Fix: Validate collector health and scrape configs.
5) Symptom: Cost spike after metrics rollout -> Root cause: High-cardinality labels increased storage -> Fix: Remove unneeded labels and reduce retention.
6) Symptom: SLO breaches after deploys -> Root cause: Regression in a code path -> Fix: Canary and automated rollback.
7) Symptom: Autoscaler doesn’t react -> Root cause: Wrong metric targeted or low update frequency -> Fix: Use request-based custom metrics and tune the scaler.
8) Symptom: DB errors during high load -> Root cause: Connection pool exhausted -> Fix: Increase pool size or add retry/backoff.
9) Symptom: Alerts during maintenance -> Root cause: Alerts not silenced -> Fix: Implement maintenance window suppression.
10) Symptom: False positives in anomaly detection -> Root cause: Poor baseline or seasonal patterns -> Fix: Use dynamic baselines and exclude expected patterns.
11) Symptom: Slow queries on metrics backend -> Root cause: Large queries with high cardinality -> Fix: Pre-aggregate and limit query ranges.
12) Symptom: Missing SLI definition -> Root cause: No clear customer-facing flow identified -> Fix: Map user journeys and define SLIs per journey.
13) Symptom: Overuse of health checks -> Root cause: Health check too permissive -> Fix: Add readiness checks that validate critical dependencies.
14) Symptom: Silent failures in background jobs -> Root cause: No traffic metric for batch jobs -> Fix: Add job success and backlog metrics.
15) Symptom: Noise from debug logs -> Root cause: High verbosity in production -> Fix: Adjust log levels and sampling.
16) Symptom: Partial outages without alerts -> Root cause: Aggregated metrics mask regional issues -> Fix: Add per-region slicing.
17) Symptom: Slow incident resolution -> Root cause: Missing runbooks -> Fix: Create and maintain runbooks for common signal patterns.
18) Symptom: Pager fatigue -> Root cause: Low signal-to-noise alerts -> Fix: Re-evaluate thresholds and escalation routing.
19) Symptom: Metrics drift after refactor -> Root cause: Metric name changes without migration -> Fix: Use compatibility labels and rename strategies.
20) Symptom: Inconsistent cardinality across services -> Root cause: Different tagging conventions -> Fix: Standardize the metrics taxonomy.
21) Symptom: Security alerts from telemetry -> Root cause: Sensitive data in logs/labels -> Fix: Redact or avoid PII in telemetry.
22) Symptom: Slow historical analysis -> Root cause: Short retention or rollups -> Fix: Keep high-res data for key SLIs and longer rollups for trends.
23) Symptom: Alert flooding from dependency -> Root cause: Dependency outage causes many downstream alerts -> Fix: Implement dependency-aware alert suppression.
Observability-specific pitfalls (subset included above)
- Missing context links between logs/traces/metrics.
- High-cardinality labels causing unusable dashboards.
- Over-aggregation hiding root cause latency.
- Lack of SLO-derived alerts leading to business-blind paging.
- Not annotating deploys, which makes correlating regressions with releases difficult.
Best Practices & Operating Model
Ownership and on-call
- Assign service ownership and SLO responsibility.
- On-call rotation focused on fewer services per engineer.
- Escalation paths tied to SLO severity.
Runbooks vs playbooks
- Runbook: step-by-step remediation for common incidents.
- Playbook: higher-level decision process for complex incidents.
- Keep both versioned with owner and review cadence.
Safe deployments
- Canary and progressive rollouts with automated rollback on SLO breach.
- Feature flags to reduce blast radius.
- Automated integration tests for performance and failure injection.
Toil reduction and automation
- Automate common remediation for known saturations.
- Use runbooks that trigger automation for safe fixes.
- Measure toil and target automation for repetitive tasks.
Security basics
- Avoid PII in traces or metric labels.
- Secure metric pipelines; ensure collectors are authenticated.
- Audit and monitor telemetry storage access.
Weekly/monthly routines
- Weekly: review alerts, top 5 noisy alerts, and recent runbook use.
- Monthly: SLO review and tuning, cardinality audit, cost review.
Postmortem review checklist
- Include which golden signal alerted and timeline.
- Confirm if SLIs/SLOs were appropriate.
- Action items to improve instrumentation, runbooks, or automation.
Tooling & Integration Map for Four golden signals
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Metrics store | Time-series storage and query | Scrapers, APM, collectors | Choose based on scale |
| I2 | Tracing | Distributed tracing and spans | OTEL, APM | Correlates latency to code |
| I3 | Dashboards | Visualization and alerts | Metrics stores, traces | Central view for teams |
| I4 | Alerting | Notification and routing | Pager systems, ChatOps | Must support grouping |
| I5 | Exporters | Translate telemetry formats | Various services | Standardize metric names |
| I6 | Collectors | Aggregate and forward data | Multiple backends | Centralize configuration |
| I7 | APM | Deep performance analysis | Logs, traces, metrics | Useful for app-level latency |
| I8 | Service mesh | Network observability | Sidecar proxies, tracing | Adds consistent metrics |
| I9 | CI/CD | Deployment automation | Metrics and annotations | Annotate deployments in metrics |
| I10 | Incident Mgmt | Runbooks, paging, postmortems | Alerts, dashboards | Integrate SLO context |
Frequently Asked Questions (FAQs)
What exactly are the four golden signals?
They are latency, traffic, errors, and saturation—the core categories used to gauge system health.
Are the golden signals enough for all observability?
No. They are a minimal, prioritized set; additional logs, traces, and business metrics are required.
How do I pick percentiles for latency?
Start with p95 for tail behavior and p99 for critical services; p50 helps track median but is less diagnostic.
How often should metrics be scraped?
Typical scrape intervals are 15s to 60s; use shorter intervals for critical SLIs, balancing freshness against cost.
Can serverless use the four signals?
Yes. Map cold-starts and concurrency to saturation and latency metrics.
How do I avoid high-cardinality problems?
Limit labels, avoid user IDs as tags, and use aggregation keys that make sense for slicing.
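A small sketch of that label-bounding advice; the route patterns and status-class bucketing are illustrative assumptions.

```python
# Sketch: keep metric labels bounded by normalizing values before tagging.
# Route patterns and status-class bucketing are illustrative assumptions.
import re

ROUTE_PATTERNS = [
    (re.compile(r"^/users/\d+$"), "/users/{id}"),
    (re.compile(r"^/orders/[0-9a-f-]+$"), "/orders/{order_id}"),
]

def normalized_route(path: str) -> str:
    """Collapse unbounded path values (IDs) into a small set of templates."""
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    return "/other"             # fallback bucket keeps cardinality bounded

def status_class(code: int) -> str:
    return f"{code // 100}xx"   # 200 -> "2xx", 503 -> "5xx"

print(normalized_route("/users/48213"), status_class(503))  # /users/{id} 5xx
```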
Should SLIs be customer-facing metrics?
Prefer SLIs that reflect user experience, but combine with internal metrics for diagnosis.
How to set initial SLO targets?
Use historical baselines plus business tolerance; iterate as you learn.
What triggers a page vs a ticket?
Page on imminent error budget exhaustion or clear user-impacting outages; ticket for long-term degradations.
How do I correlate logs, traces, and metrics?
Propagate trace IDs via request headers, include them in log entries, and link metrics to traces (for example with exemplars) so all three can be joined during diagnosis.
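A sketch of that correlation pattern using the OpenTelemetry Python tracing API; the tracer setup is simplified and the service name is an assumption.

```python
# Sketch: include the active trace ID in log records so logs, traces, and
# metric exemplars can be joined later. Tracer setup is simplified.
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout-service")
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger(__name__)

def handle_request():
    with tracer.start_as_current_span("handle_request"):
        ctx = trace.get_current_span().get_span_context()
        trace_id = format(ctx.trace_id, "032x")   # same hex form the backend shows
        log.info("processing request trace_id=%s", trace_id)

handle_request()
```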
Can AI assist in four golden signals monitoring?
Yes. AI can detect anomalies, suggest root causes, and surface correlated signals, but review suggestions before automated actions.
How many dashboards do I need?
At least three: executive, on-call, and debug. Add team-specific dashboards as needed.
Should I monitor synthetic checks or rely on real-user metrics?
Use both: synthetics detect predictable regressions; RUM measures actual user impact.
How to handle noisy alerts during deployments?
Use deployment annotations and temporary suppressions; prefer canaries to limit noise.
How long should metric retention be?
Depends on analysis needs; keep high-res recent data (weeks) and lower-res for long-term trends.
What is the role of tracing with golden signals?
Traces link latency and errors to code paths and downstream services for root cause analysis.
How to measure saturation for managed services?
Use provider metrics like concurrency, throttles, and queue depths where host metrics aren’t available.
Can golden signals measure security incidents?
They can surface anomalies like sudden traffic spikes but are insufficient for detailed security investigations.
Conclusion
The Four golden signals remain a compact, practical observability foundation in 2026 cloud-native architectures. They provide rapid detection and meaningful triage paths when combined with SLIs, SLOs, and modern telemetry (traces, logs). Mature implementations include automation, canary deployments, and AI-assisted anomaly detection while preserving security and cost controls.
Next 7 days plan (practical)
- Day 1: Inventory services and identify owners and critical user flows.
- Day 2: Instrument one service with latency, traffic, error, and saturation metrics.
- Day 3: Build on-call and debug dashboards for that service.
- Day 4: Define SLIs and an initial SLO for latency and error rate.
- Day 5–7: Run a smoke load test, validate alerts, and create a basic runbook.
Appendix — Four golden signals Keyword Cluster (SEO)
Primary keywords
- four golden signals
- golden signals SRE
- four golden signals latency traffic errors saturation
- four golden metrics
- SRE golden signals guide
Secondary keywords
- SLIs and SLOs four golden signals
- observability four golden signals
- how to measure four golden signals
- four golden signals in Kubernetes
- four golden signals serverless
Long-tail questions
- what are the four golden signals and why are they important
- how to implement four golden signals in kubernetes
- how do four golden signals relate to slos
- best tools to measure four golden signals in 2026
- how to avoid high cardinality with four golden signals
- four golden signals monitoring checklist
- example dashboards for four golden signals
- alerting strategy for four golden signals burn rate
- four golden signals for serverless cold start monitoring
- how to use tracing with four golden signals
- can AI help detect anomalies in four golden signals
- four golden signals and incident response runbooks
- metrics to track saturation in databases
- four golden signals versus full observability stack
- how to design slos from four golden signals
- four golden signals for edge and cdn monitoring
- four golden signals for microservices
- cost optimization with four golden signals
- common mistakes implementing four golden signals
- four golden signals best practices 2026
Related terminology
- SLI
- SLO
- error budget
- percentile latency
- p95 p99
- traffic throughput
- request rate
- saturation metrics
- CPU utilization
- memory utilization
- connection pool
- queue depth
- cold start
- service mesh telemetry
- OpenTelemetry
- Prometheus exporter
- Grafana dashboards
- APM tracing
- synthetic monitoring
- real user monitoring
- canary deployment
- autoscaling
- burn-rate alerting
- trace correlation
- high cardinality
- observability pipeline
- metric retention
- alert dedupe
- deployment annotations
- runbook automation
- chaos engineering
- game day exercises
- postmortem analysis
- monitoring cost control
- telemetry security
- metric naming conventions
- service ownership
- incident management
- metrics aggregation
- rollup retention