Four Golden Signals of Monitoring
What are the four golden signals of monitoring and why are they important?
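The Four Golden Signals, described in Google's Site Reliability Engineering (SRE) book, are: Latency (how long it takes to serve a request), Traffic (demand on the system, such as requests per second), Errors (the rate of failed requests), and Saturation (how full the service is, usually the utilization of its most constrained resource). Together they give a quick read on service health and a starting point for diagnosing issues. They are the baseline metrics every user-facing service should track.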
Each signal answers a different question: latency reflects user experience, traffic reflects demand, errors reflect reliability, and saturation reflects how close the service is to its capacity limit. Start with these four before adding more specific metrics.
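As a rough illustration, the four signals can be derived from a window of request records. This is a minimal sketch, not a production collector: the `Request` type, the `golden_signals` function, the choice to count HTTP 5xx responses as failures, and the use of in-flight requests divided by a concurrency limit as the saturation measure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window as the four golden signals.

    Assumptions: 5xx responses count as failures, and saturation is
    approximated as in-flight requests over a concurrency limit.
    """
    total = len(requests)
    failed = sum(1 for r in requests if r.status >= 500)
    # Latency of failed requests is excluded so that fast failures
    # do not make the service look healthier than it is.
    ok_durations = [r.duration_s for r in requests if r.status < 500]
    return {
        "latency_avg_s": sum(ok_durations) / len(ok_durations) if ok_durations else None,
        "traffic_rps": total / window_s,
        "error_rate": failed / total if total else 0.0,
        "saturation": in_flight / capacity,
    }


window = [
    Request(0.25, 200),
    Request(0.5, 200),
    Request(0.75, 200),
    Request(0.01, 500),  # a fast failure
]
signals = golden_signals(window, window_s=2, in_flight=30, capacity=40)
print(signals)
```

In practice these values would usually come from a metrics system rather than raw request lists, but the arithmetic behind each signal is the same.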
Prometheus alerting rules
- Monitoring only one signal (e.g., just CPU usage)
- Setting alert thresholds too sensitive (causing alert fatigue) or too loose (missing real issues)
- Not measuring latency at different percentiles (p50, p95, p99)
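The percentile pitfall above is easy to see with a small worked example. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the `percentile` helper and the sample data are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 95 fast requests (20 ms) and 5 very slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50} p95={p95} p99={p99}")
# prints: mean=119.0 p50=20 p95=20 p99=2000
```

Here the mean (119 ms) describes no real request, and p50 and p95 both look healthy; only p99 reveals that one request in twenty takes two seconds.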
- How do you distinguish between latency for successful vs failed requests?
- What's the difference between metrics, logs, and traces?
- How do you set appropriate alert thresholds?