Question

Your on-call engineers are burning out. They're getting 40 to 50 pages a shift, and they tell you most of it is noise they just ack and ignore. How do you fix this?

Accepted Answer

First, I'd measure before touching anything. I'd pull the pages, group them by alert name, by service, and by time of day, and split them into actionable versus auto-resolved. Almost always, a handful of alerts generate most of the volume, and a big chunk auto-resolve before anyone does anything. That tells me where to aim.
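A rough sketch of that measurement step, assuming the page history has already been exported into simple records. The `Page` fields and the helper names here are hypothetical, not any paging tool's real export format:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List


@dataclass(frozen=True)
class Page:
    # Hypothetical record shape; adapt to whatever your paging tool exports.
    alert_name: str
    service: str
    hour_of_day: int         # 0-23
    resolved_by_human: bool  # False means the alert cleared on its own


def count_by(pages: List[Page], key: Callable[[Page], Hashable]) -> Dict[Hashable, int]:
    """Count pages per group, e.g. per alert name, service, or hour."""
    return dict(Counter(key(p) for p in pages))


def auto_resolve_ratio(pages: List[Page]) -> Dict[str, float]:
    """Fraction of each alert's pages that cleared without human action."""
    totals = Counter(p.alert_name for p in pages)
    auto = Counter(p.alert_name for p in pages if not p.resolved_by_human)
    return {name: auto[name] / total for name, total in totals.items()}


def top_offenders(pages: List[Page], share: float = 0.8) -> List[str]:
    """Smallest set of alerts, noisiest first, covering at least `share` of all pages."""
    totals = Counter(p.alert_name for p in pages)
    target = share * len(pages)
    covered, names = 0, []
    for name, count in totals.most_common():
        if covered >= target:
            break
        names.append(name)
        covered += count
    return names
```

Calling `top_offenders(pages)` gives the short list to attack first, and a high value in `auto_resolve_ratio(pages)` marks an alert as a candidate for deletion, downgrade, or a duration guard.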

Then I work through it in order of impact:

1. Delete or downgrade non-actionable alerts. The test is simple: if a page doesn't require a human to do something right now, it is not a page. Make it a ticket or a dashboard panel. This alone usually cuts the worst noise.

2. Alert on symptoms, not causes. Page on what users feel: elevated error rate, latency past the SLO, a queue that's backing up. Don't page on CPU at 90 percent or memory pressure, because those are often fine and they fire constantly. If high CPU isn't hurting users, it isn't an incident.

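3. Add duration guards. In a Prometheus alerting rule, a `for:` clause keeps the alert pending until the condition has held continuously for that duration, which stops flapping alerts that fire and clear every two minutes. If it can't stay broken for five minutes, it probably isn't worth a page.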

4. Group and inhibit. When one outage trips twenty alerts, Alertmanager grouping collapses them into one notification, and inhibition rules suppress the downstream noise. If the cluster is down, don't also page for every pod on it.

5. Move to multi-window burn-rate alerts for SLOs. A static threshold either pages too early or too late. A fast-burn plus slow-burn pair catches a real budget burn quickly while staying quiet on blips.

6. Route by severity. Only genuine urgency pages. Everything else goes to Slack or a ticket queue that gets triaged during the day.

7. Set a budget and defend it. Pick a target like fewer than two pages per off-hours shift, and treat going over it as a bug with an owner, reviewed at every on-call handoff. Without a target, noise creeps back.
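The multi-window burn-rate idea in step 5 can be sketched as plain arithmetic. This is a minimal illustration, not a production rule: the class and function names are hypothetical, and the thresholds in the usage note (14.4 for a 1h/5m pair, 6 for a 6h/30m pair) are the commonly cited example values for a 30-day SLO window, so treat them as assumptions to tune.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BurnWindow:
    name: str
    long_error_ratio: float   # error ratio over the long window, e.g. 1h
    short_error_ratio: float  # error ratio over the short window, e.g. 5m
    threshold: float          # burn-rate multiple that should page


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return error_ratio / (1.0 - slo_target)


def firing_windows(windows: List[BurnWindow], slo_target: float) -> List[str]:
    """A pair fires only when BOTH its long and short windows exceed the threshold.

    The long window shows that the burn is sustained; the short window shows
    that it is still happening, so the alert clears soon after recovery.
    """
    return [
        w.name
        for w in windows
        if burn_rate(w.long_error_ratio, slo_target) >= w.threshold
        and burn_rate(w.short_error_ratio, slo_target) >= w.threshold
    ]
```

For a 99.9 percent SLO, a fast pair such as `BurnWindow("fast", 0.02, 0.03, 14.4)` fires, while a blip with a hot short window but a quiet long window, such as `BurnWindow("fast", 0.001, 0.05, 14.4)`, stays silent.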

The edge cases that bite people: don't just raise thresholds, because that hides real problems instead of removing noise. Watch the auto-resolving alerts, since a thing that flaps and clears all night is a real bug masquerading as harmless. And every remaining page must have a runbook. If a page has no runbook, either it's not worth paging on, or you owe it one.
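The runbook rule, together with the duration guard from step 3, is easy to enforce mechanically. A minimal sketch, assuming the alerting rules have already been loaded into dictionaries shaped like Prometheus rule entries; the `severity: page` label value and the `runbook_url` annotation key are conventions assumed here, not requirements of any tool:

```python
from typing import Dict, List


def audit_paging_rules(rules: List[dict], paging_severity: str = "page") -> Dict[str, List[str]]:
    """Return {alert name: [problems]} for paging rules that break the house rules."""
    problems: Dict[str, List[str]] = {}
    for rule in rules:
        # Only rules that actually page are held to this standard.
        if rule.get("labels", {}).get("severity") != paging_severity:
            continue
        issues = []
        if not rule.get("annotations", {}).get("runbook_url"):
            issues.append("missing runbook")
        if not rule.get("for"):
            issues.append("no duration guard")
        if issues:
            problems[rule["alert"]] = issues
    return problems
```

Running a check like this in CI on the rules repository keeps a page without a runbook from being merged in the first place.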

Reducing On-Call Alert Fatigue
