Reducing On-Call Alert Fatigue

senior
advanced
Incident Management
Question

Your on-call engineers are burning out. They're getting 40 to 50 pages a shift and they tell you most of it is noise they just ack and ignore. How do you fix this?

Answer

First I'd measure before touching anything. I'd pull pages grouped by alert name, by service, and by time of day, and split them into actionable versus auto-resolved. Almost always a handful of alerts generate most of the volume, and a big chunk auto-resolve before anyone does anything. That tells me where to aim.

Then I work through it in order of impact:

1. Delete or downgrade non-actionable alerts. The test is simple: if a page doesn't require a human to do something right now, it is not a page. Make it a ticket or a dashboard panel. This alone usually cuts the worst noise.
2. Alert on symptoms, not causes. Page on what users feel: elevated error rate, latency past the SLO, a queue that's backing up. Don't page on CPU at 90 percent or memory pressure, because those are often fine and they fire constantly. If high CPU isn't hurting users, it isn't an incident.
3. Add duration guards. A `for:` clause stops flapping alerts that fire and clear every two minutes. If it can't stay broken for five minutes, it probably isn't worth a page.
4. Group and inhibit. When one outage trips twenty alerts, Alertmanager grouping collapses them into one notification, and inhibition rules suppress the downstream noise. If the cluster is down, don't also page for every pod on it.
5. Move to multi-window burn-rate alerts for SLOs. A static threshold either pages too early or too late. A fast-burn plus slow-burn pair catches a real budget burn quickly while staying quiet on blips.
6. Route by severity. Only genuine urgency pages. Everything else goes to Slack or a ticket queue that gets triaged during the day.
7. Set a budget and defend it. Pick a target like fewer than two pages per off-hours shift, and treat going over it as a bug with an owner, reviewed at every on-call handoff. Without a target, noise creeps back.

The edge cases that bite people: don't just raise thresholds, because that hides real problems instead of removing noise. Watch the auto-resolving alerts, since a thing that flaps and clears all night is a real bug masquerading as harmless. And every remaining page must have a runbook. If a page has no runbook, either it's not worth paging on, or you owe it one.
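
The grouping, inhibition, and severity-routing steps can be sketched as one Alertmanager configuration. This is a minimal illustration, not a production config: the receiver names, the `cluster` and `severity` labels, the `ClusterDown` alert name, and all timing values are assumptions, and the actual Slack and pager integrations are left out.

```yaml
# Alertmanager config sketch. Receiver names, label names, and the
# ClusterDown alert are hypothetical; timings are starting points only.
route:
  receiver: slack-triage                # default: non-urgent alerts go to chat
  group_by: ['alertname', 'cluster']    # one notification per alert name per cluster
  group_wait: 30s                       # wait briefly so related alerts batch together
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity="page"'             # only alerts explicitly labelled as pages
      receiver: pager

inhibit_rules:
  # While ClusterDown is firing for a cluster, suppress the other paging
  # alerts that carry the same cluster label.
  - source_matchers:
      - 'alertname="ClusterDown"'
    target_matchers:
      - 'severity="page"'
      - 'alertname!="ClusterDown"'
    equal: ['cluster']

receivers:
  - name: slack-triage                  # slack_configs omitted in this sketch
  - name: pager                         # pagerduty_configs (or similar) omitted
```

The default receiver is deliberately the quiet one, so a new alert that nobody has labelled lands in chat rather than on the pager.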

Why This Matters

This is the senior-level question that reveals whether someone has actually owned alerting in production. The tell is sequencing: a strong answer starts with measuring, then cuts and downgrades alerts, and treats symptom-based alerting and severity routing as the structural fixes. Weaker answers jump straight to "raise the thresholds" or "add a dashboard," both of which hide problems rather than fix them. Mentioning a pages-per-shift budget, multi-window burn-rate alerts, or making alert tuning part of the on-call retro are clear senior signals.

Code Examples

A noisy cause-based alert rewritten as a quiet symptom-based one

yaml
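
A minimal sketch of that rewrite as a Prometheus rules file. The `http_requests_total` metric with `job` and `code` labels, the 5 percent threshold, the group name, and the runbook URL are all assumptions for illustration; `node_cpu_seconds_total` is the CPU metric exposed by node_exporter.

```yaml
groups:
  - name: api-symptoms                  # hypothetical group name
    rules:
      # Before (cause-based, noisy): pages whenever CPU is busy, whether or
      # not users notice. Kept here as a comment for comparison only.
      # - alert: HighCPU
      #   expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
      #   labels:
      #     severity: page

      # After (symptom-based): pages only when users are seeing errors,
      # and only if the condition holds for five minutes.
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="api"}[5m]))
            > 0.05
        for: 5m                         # duration guard against flapping
        labels:
          severity: page
        annotations:
          summary: "More than 5% of API requests are failing"
          runbook_url: "https://runbooks.example.com/api-high-error-rate"  # placeholder URL
```

The CPU signal does not have to disappear. It can stay on a dashboard or become a ticket-level alert with a lower severity label.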

Multi-window burn-rate SLO alert: fast and slow, fewer false pages

yaml
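
A sketch of a fast-burn and slow-burn pair for a 99.9 percent availability SLO over 30 days, so the error budget is 0.001. The burn-rate multipliers (14.4 over 1h with a 5m check, 6 over 6h with a 30m check) are the values commonly cited from the Google SRE Workbook; the metric and label names are assumptions, and both the SLO target and the multipliers should be tuned to your own service.

```yaml
groups:
  - name: api-slo-burn                  # hypothetical group name
    rules:
      # Fast burn: roughly 2% of the 30-day budget consumed in one hour.
      # The short 5m window makes the alert clear soon after recovery.
      - alert: ApiErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page

      # Slow burn: roughly 5% of the budget consumed over six hours.
      - alert: ApiErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[6h]))
              / sum(rate(http_requests_total{job="api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[30m]))
              / sum(rate(http_requests_total{job="api"}[30m]))
          ) > (6 * 0.001)
        labels:
          severity: page
```

Requiring both windows to exceed the threshold is what keeps short blips quiet: a brief spike can trip the short window but not the long one.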

Find your noisiest alerts so you fix the right ones first

promql
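
One way to rank alerts by noise is to query the synthetic `ALERTS` series that Prometheus writes while an alert is pending or firing. This sketch assumes paging alerts carry a `severity="page"` label and that a 7-day lookback is within your retention. It counts firing samples, so it approximates time spent firing rather than the exact number of pages sent.

```promql
# Top 10 paging alerts by number of firing samples over the last 7 days.
# Each sample is one rule evaluation during which the alert was firing.
topk(10,
  sum by (alertname) (
    count_over_time(ALERTS{alertstate="firing", severity="page"}[7d])
  )
)
```

For the true page count per alert, the notification history in your paging tool is the better source; this query is a quick first cut from data Prometheus already has.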
Common Mistakes
  • Just raising the thresholds. That hides real degradation instead of removing noise, and the next real incident slips under the new bar.
  • Silencing alerts without fixing the underlying cause. A muted alert that's still flapping is a bug you've agreed to stop looking at.
  • Adding more dashboards instead of cutting alerts. Dashboards don't reduce pages; deleting non-actionable alerts does.
  • Alerting on causes like CPU and memory instead of user-facing symptoms. Resource metrics belong on dashboards, not on the pager.
Follow-up Questions
Interviewers often ask these as follow-up questions
  • How do you decide whether a given alert should page someone versus go to a ticket?
  • After you make these changes, how would you measure whether on-call fatigue actually went down?
  • What's a multi-window burn-rate alert, and why is it better than a single static threshold?
Tags
incident-management
alert-fatigue
on-call
observability
prometheus