mid
intermediate
Observability

Monitoring and Alerting Strategy

Question

How do you design a monitoring and alerting strategy? What metrics would you track and how do you avoid alert fatigue?

Answer

Start with the four golden signals: latency, traffic, errors, and saturation. Use the RED method for services (Rate, Errors, Duration) and the USE method for resources (Utilization, Saturation, Errors). To avoid alert fatigue:

  • Alert on symptoms, not causes.
  • Set thresholds from historical baseline data.
  • Use severity levels (page vs. ticket).
  • Group and deduplicate related alerts.
  • Require a runbook for every alert.
  • Review and tune alerts regularly.

Only page for actionable issues that require immediate human intervention.
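
The grouping, deduplication, and page-vs-ticket split described above could be sketched as an Alertmanager routing configuration. This is a minimal sketch, not a production setup: the receiver names (`ticket-queue`, `oncall-pager`), the `severity` label values, and all timing values are assumptions chosen for illustration, and the notification integrations are left out.

```yaml
# Hypothetical Alertmanager config: receiver names, label values, and
# timings are illustrative assumptions, not recommendations.
route:
  receiver: ticket-queue            # default: non-urgent alerts become tickets
  group_by: ['alertname', 'job']    # alerts sharing these labels form one notification
  group_wait: 30s                   # wait briefly so related alerts arrive together
  group_interval: 5m                # minimum gap between updates for an existing group
  repeat_interval: 4h               # re-notify for still-firing alerts at most this often
  routes:
    - matchers:
        - severity="page"           # only alerts labelled as pages reach the pager
      receiver: oncall-pager

receivers:
  # Integrations (email, webhook, pager service) are omitted from this sketch.
  - name: ticket-queue
  - name: oncall-pager

inhibit_rules:
  # While a page is firing for a job, suppress ticket-level alerts for the same job.
  - source_matchers:
      - severity="page"
    target_matchers:
      - severity="ticket"
    equal: ['job']
```

Routing on a single `severity` label keeps the paging decision in the alert rule itself, so reviewing who gets woken up means reviewing one label per rule.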

Why This Matters

Effective monitoring lets you detect issues before users report them and resolve incidents faster. Poor alerting leads to alert fatigue, where important signals get ignored among the noise. Designing alerts well is therefore a critical skill for maintaining reliable systems.

Code Examples

Prometheus alerting rules example

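A minimal sketch of symptom-based rules in this spirit, assuming a hypothetical `checkout` job that exposes the conventional `http_requests_total` counter (with a `status` label) and an `http_request_duration_seconds` histogram. The metric names, thresholds, `severity` values, and runbook URLs are all illustrative assumptions; real thresholds should come from historical baseline data.

```yaml
# Hypothetical Prometheus rule file: job name, metric names, thresholds,
# and runbook URLs are assumptions for illustration only.
groups:
  - name: checkout-service-alerts
    rules:
      # Errors (RED): a user-visible symptom, so it pages.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="checkout", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="checkout"}[5m]))
            > 0.05
        for: 10m                    # must hold for 10 minutes, filtering brief spikes
        labels:
          severity: page
        annotations:
          summary: "Checkout 5xx error ratio has been above 5% for 10 minutes"
          runbook_url: "https://runbooks.example.com/checkout/high-error-rate"

      # Duration (RED): degraded but not urgent, so it opens a ticket.
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))
          ) > 1
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Checkout p99 latency has been above 1s for 15 minutes"
          runbook_url: "https://runbooks.example.com/checkout/high-latency"
```

Each rule carries a `for` clause, a `severity` label, and a `runbook_url` annotation, which maps onto the practices in the answer: avoid firing on momentary threshold crossings, separate pages from tickets, and give the responder a runbook.
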
Common Mistakes
  • Alerting on every metric threshold crossing
  • Not including runbooks with alerts
  • Setting thresholds without historical baseline data
Follow-up Questions
Interviewers often ask these as follow-up questions
  • What is the difference between metrics, logs, and traces?
  • How do you decide between warning and critical alert severity?
  • What is an error budget and how does it relate to alerting?
Tags
monitoring
alerting
observability
sre
metrics