Reducing On-Call Alert Fatigue
Your on-call engineers are burning out. They're getting 40 to 50 pages a shift and they tell you most of it is noise they just ack and ignore. How do you fix this?
First I'd measure before touching anything. I'd pull pages grouped by alert name, by service, and by time of day, and split them into actionable versus auto-resolved. Almost always a handful of alerts generate most of the volume, and a big chunk auto-resolve before anyone does anything. That tells me where to aim.

Then I work through it in order of impact:

1. Delete or downgrade non-actionable alerts. The test is simple: if a page doesn't require a human to do something right now, it is not a page. Make it a ticket or a dashboard panel. This alone usually cuts the worst noise.
2. Alert on symptoms, not causes. Page on what users feel: elevated error rate, latency past the SLO, a queue that's backing up. Don't page on CPU at 90 percent or memory pressure, because those are often fine and they fire constantly. If high CPU isn't hurting users, it isn't an incident.
3. Add duration guards. A `for:` clause stops flapping alerts that fire and clear every two minutes. If it can't stay broken for five minutes, it probably isn't worth a page.
4. Group and inhibit. When one outage trips twenty alerts, Alertmanager grouping collapses them into one notification, and inhibition rules suppress the downstream noise. If the cluster is down, don't also page for every pod on it.
5. Move to multi-window burn-rate alerts for SLOs. A static threshold either pages too early or too late. A fast-burn plus slow-burn pair catches a real budget burn quickly while staying quiet on blips.
6. Route by severity. Only genuine urgency pages. Everything else goes to Slack or a ticket queue that gets triaged during the day.
7. Set a budget and defend it. Pick a target like fewer than two pages per off-hours shift, and treat going over it as a bug with an owner, reviewed at every on-call handoff. Without a target, noise creeps back.

The edge cases that bite people: don't just raise thresholds, because that hides real problems instead of removing noise. Watch the auto-resolving alerts, since a thing that flaps and clears all night is a real bug masquerading as harmless. And every remaining page must have a runbook. If a page has no runbook, either it's not worth paging on, or you owe it one.
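To make the group-and-inhibit step concrete, here is a rough sketch of the idea in plain Python. It is not Alertmanager configuration and does not reproduce its matching semantics; the alert names, the `cluster` label, and the `notifications` helper are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def notifications(alerts):
    """Collapse firing alerts into one notification per cluster.

    alerts: list of dicts with 'name' and 'cluster' keys (hypothetical shape).
    Inhibition: if a cluster has a ClusterDown alert, every other alert on
    that cluster is dropped as downstream noise.
    Grouping: whatever survives is bundled per cluster.
    """
    down = {a["cluster"] for a in alerts if a["name"] == "ClusterDown"}
    grouped = defaultdict(list)
    for a in alerts:
        inhibited = a["cluster"] in down and a["name"] != "ClusterDown"
        if not inhibited:
            grouped[a["cluster"]].append(a["name"])
    return dict(grouped)

firing = [
    {"name": "ClusterDown", "cluster": "prod-1"},
    {"name": "PodCrashLooping", "cluster": "prod-1"},
    {"name": "PodCrashLooping", "cluster": "prod-1"},
    {"name": "PodCrashLooping", "cluster": "prod-1"},
    {"name": "HighErrorRate", "cluster": "prod-2"},
    {"name": "HighLatency", "cluster": "prod-2"},
]

result = notifications(firing)
print(f"{len(firing)} alerts -> {len(result)} notifications")
for cluster, names in result.items():
    print(cluster, names)
```

Six firing alerts become two notifications: one for the down cluster and one bundling the two symptoms on the healthy cluster.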
This is the senior-level question that reveals whether someone has actually owned alerting in production. The tell is sequencing: a strong answer starts with measuring, then cuts and downgrades alerts, and treats symptom-based alerting and severity routing as the structural fixes. Weaker answers jump straight to "raise the thresholds" or "add a dashboard," both of which hide problems rather than fix them. Mentioning a pages-per-shift budget, multi-window burn-rate alerts, or making alert tuning part of the on-call retro are clear senior signals.
A noisy cause-based alert rewritten as a quiet symptom-based one
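One way to see the difference is to replay the same period of per-minute samples through both rules and count the pages. The sketch below does that in plain Python under stated assumptions: the sample data is synthetic, the 5 percent error-rate threshold is an assumed value, and `count_pages` is a hypothetical helper that only approximates the behavior of a duration guard.

```python
def count_pages(breaching, hold_minutes=1):
    """Count pages from a per-minute series of booleans.

    A page fires once the condition has held for hold_minutes consecutive
    samples, and cannot fire again until the condition clears.
    """
    pages, streak, firing = 0, 0, False
    for bad in breaching:
        if bad:
            streak += 1
            if streak >= hold_minutes and not firing:
                pages += 1
                firing = True
        else:
            streak, firing = 0, False
    return pages

# Synthetic per-minute samples: CPU spikes often, users are hurt only once.
cpu_percent = [95, 60, 93, 55, 96, 97, 50, 92, 94, 94, 95, 96, 97, 98, 60]
error_rate = [0.001] * 7 + [0.08] * 7 + [0.001]

# Cause-based rule: page the moment CPU crosses 90 percent.
noisy = count_pages([c > 90 for c in cpu_percent], hold_minutes=1)

# Symptom-based rule: page when the error rate stays above 5 percent
# (assumed threshold) for five consecutive minutes.
quiet = count_pages([e > 0.05 for e in error_rate], hold_minutes=5)

print(f"cause-based pages: {noisy}, symptom-based pages: {quiet}")
```

On this data the CPU rule pages four times and the error-rate rule pages once, for the one stretch where users were actually affected.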
Multi-window burn-rate SLO alert: fast and slow, fewer false pages
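The logic behind a fast-burn plus slow-burn pair can be sketched in a few lines of Python. This is an illustration of the arithmetic, not a production rule: the 99.9 percent SLO is an assumed target, the window pairs and thresholds (14.4x over 1h and 5m, 6x over 6h and 30m) are the commonly cited values for a 30-day budget, and the `error_ratios` input shape is hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo)

def should_page(error_ratios: dict, slo: float = 0.999) -> bool:
    """Decide whether to page from error ratios observed over several windows.

    error_ratios maps a window name ('5m', '30m', '1h', '6h') to the fraction
    of failed requests in that window. Each check needs BOTH its long and its
    short window over the threshold, so a brief spike alone does not page and
    an alert stops firing soon after recovery.
    """
    def burning(long_window, short_window, threshold):
        return (burn_rate(error_ratios[long_window], slo) >= threshold
                and burn_rate(error_ratios[short_window], slo) >= threshold)

    fast = burning("1h", "5m", 14.4)   # severe burn, caught quickly
    slow = burning("6h", "30m", 6.0)   # moderate burn, sustained for hours
    return fast or slow

blip = {"5m": 0.05, "30m": 0.004, "1h": 0.002, "6h": 0.001}
outage = {"5m": 0.03, "30m": 0.02, "1h": 0.02, "6h": 0.004}
slow_leak = {"5m": 0.008, "30m": 0.008, "1h": 0.008, "6h": 0.007}

for name, ratios in [("blip", blip), ("outage", outage), ("slow_leak", slow_leak)]:
    print(name, should_page(ratios))
```

The blip spikes in the 5-minute window but never moves the 1-hour window, so it stays quiet. The outage trips the fast pair, and the slow leak trips only the slow pair, which is the case a single static threshold tends to miss.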
Find your noisiest alerts so you fix the right ones first
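A minimal sketch of the measurement step, assuming you can export page history as rows of alert name, service, and whether the page auto-resolved. The `pages` list here is made-up sample data, and `noisiest_alerts` is a hypothetical helper; a real export from your paging tool will have a different shape.

```python
from collections import Counter

# Hypothetical page log: (alert_name, service, auto_resolved) per page.
pages = [
    ("HighCPU", "api", True),
    ("HighCPU", "api", True),
    ("HighCPU", "worker", True),
    ("DiskWarning", "db", True),
    ("HighErrorRate", "api", False),
    ("HighCPU", "api", True),
    ("QueueBacklog", "worker", False),
    ("DiskWarning", "db", True),
]

def noisiest_alerts(pages):
    """Return (alert, total_pages, auto_resolved_share), highest volume first."""
    totals = Counter(name for name, _, _ in pages)
    auto = Counter(name for name, _, resolved in pages if resolved)
    return [(name, count, auto[name] / count)
            for name, count in totals.most_common()]

report = noisiest_alerts(pages)
for name, count, share in report:
    print(f"{name:15} pages={count:2d} auto_resolved={share:.0%}")
```

Alerts at the top of the list with a high auto-resolved share are the first candidates to delete, downgrade, or put behind a duration guard. The same grouping can be repeated by service or by hour of day.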
- Just raising the thresholds. That hides real degradation instead of removing noise, and the next real incident slips under the new bar.
- Silencing alerts without fixing the underlying cause. A muted alert that's still flapping is a bug you've agreed to stop looking at.
- Adding more dashboards instead of cutting alerts. Dashboards don't reduce pages; deleting non-actionable alerts does.
- Alerting on causes like CPU and memory instead of user-facing symptoms. Resource metrics belong on dashboards, not on the pager.
- How do you decide whether a given alert should page someone versus go to a ticket?
- After you make these changes, how would you measure whether on-call fatigue actually went down?
- What's a multi-window burn-rate alert, and why is it better than a single static threshold?