Question

An alert fires at 3am and pages the primary on-call. Walk me through what your escalation policy should do from that moment, step by step, and tell me what failure modes you're designing around.

Accepted Answer

Step one: page the primary through a channel that will actually wake them, meaning a push notification followed by an escalating phone call, not just a Slack message. The policy then starts an acknowledgement (ack) timer, say 5 to 8 minutes.

Step two: if the primary acks, escalation stops and they own it. If the timer expires with no ack, the policy automatically pages the secondary, who gets the same ack timer.

Step three: if the secondary also doesn't ack, the policy escalates to the on-call manager or engineering manager. For a real major incident, this is also where you'd auto-create an incident channel and pull in an incident commander.
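A minimal sketch of those three steps, to make the timer logic concrete. It is a simulation, not any paging tool's API: `Step`, `run_escalation`, `POLICY`, and the 5-minute timeouts are names and values I've made up for illustration, and it assumes an ack only counts if it arrives before that step's timer expires.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    target: str           # who gets paged at this step
    ack_timeout_min: int  # minutes to wait for an ack before escalating


def run_escalation(
    steps: List[Step], ack_delay_min: Dict[str, int]
) -> Tuple[Optional[str], int]:
    """Walk the chain once.

    ack_delay_min maps a target to the minutes they take to ack after
    being paged; a missing target never acks. Returns (owner, minutes
    elapsed since the alert fired). Owner is None if nobody acked.
    """
    elapsed = 0
    for step in steps:
        delay = ack_delay_min.get(step.target)
        if delay is not None and delay <= step.ack_timeout_min:
            # Acked in time: escalation stops and this person owns it.
            return step.target, elapsed + delay
        # Timer expired with no ack: move on to the next step.
        elapsed += step.ack_timeout_min
    return None, elapsed


# Hypothetical three-step policy matching the steps above.
POLICY = [
    Step("primary", 5),
    Step("secondary", 5),
    Step("engineering-manager", 5),
]
```

For example, `run_escalation(POLICY, {"secondary": 2})` models a primary who never acks and a secondary who acks two minutes after being paged. A `None` owner is exactly the dead-end case the policy has to handle separately.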

That's the happy path. The failure modes I'm actually designing around are the interesting part:

- The first responder is asleep or has no signal. That's why there's a secondary, and why the notification escalates from push to phone call rather than relying on one quiet channel.
- The alert dead-ends. Every policy needs a final catch-all so an unacked alert eventually reaches a manager or opens an incident. It should never just stop.
- The wrong team gets paged. Routing should send the alert to the team that owns the failing service, based on a service or team label, not to a single central pager.
- Non-urgent things page humans. Severity-based routing matters here. A SEV1 pages immediately and notifies the manager and incident channel. A SEV3 should become a ticket or a Slack message, not a 3am phone call. If it doesn't need a human right now, it isn't a page.
- Alert storms. When one outage trips fifty alerts, grouping and deduplication keep it to one notification instead of fifty phone calls.
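The routing, severity, and grouping points above can be sketched together. This is a sketch under stated assumptions, not a real alerting tool's configuration: the `Alert` fields, the `incident_key` grouping key, and the function names are hypothetical, and the SEV2 behaviour (page the team without notifying the manager) is my assumption, since only SEV1 and SEV3 are described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Lower rank means more severe.
SEVERITY_RANK = {"SEV1": 0, "SEV2": 1, "SEV3": 2}


@dataclass(frozen=True)
class Alert:
    name: str
    team: str          # label of the team that owns the failing service
    severity: str      # "SEV1", "SEV2" or "SEV3"
    incident_key: str  # hypothetical grouping key, e.g. the failing service


def route(alert: Alert) -> List[Tuple[str, str]]:
    """Return (action, destination) pairs for a single alert."""
    if alert.severity == "SEV1":
        return [
            ("page", alert.team),
            ("notify", "manager"),
            ("notify", "incident-channel"),
        ]
    if alert.severity == "SEV2":
        # Assumption: SEV2 still pages the owning team, nothing more.
        return [("page", alert.team)]
    # SEV3 does not need a human right now, so it becomes a ticket.
    return [("ticket", alert.team)]


def group_and_route(
    alerts: List[Alert],
) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Collapse alerts that share (team, incident_key) into one
    notification, routed at the most severe level seen in the group."""
    groups: Dict[Tuple[str, str], Alert] = {}
    for alert in alerts:
        key = (alert.team, alert.incident_key)
        current = groups.get(key)
        if (
            current is None
            or SEVERITY_RANK[alert.severity] < SEVERITY_RANK[current.severity]
        ):
            groups[key] = alert
    return {key: route(alert) for key, alert in groups.items()}
```

Fifty alerts that share a team and an `incident_key` come out as one group with one page. Routing each group at its worst severity means a late SEV1 is not hidden behind an earlier SEV3 from the same outage.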

The edge case I'd call out: don't let escalation loop forever. Loop the chain once, maybe twice, then hand off to a guaranteed catch-all like a major-incident process. And every page should carry a link to its runbook, so the person you woke up at 3am has somewhere to start instead of a blank screen.
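A small sketch of that bound, assuming "loop" means a full pass through the chain and that the catch-all is a single final target. The function names, the `major-incident-process` target, and the example runbook URL are all hypothetical.

```python
from typing import Dict, List


def page_sequence(chain: List[str], max_loops: int, catch_all: str) -> List[str]:
    """Targets to page, in order, if nobody ever acks.

    The chain is walked at most max_loops times, then the alert goes to
    the catch-all, so the sequence is always finite and never dead-ends.
    """
    if max_loops < 1:
        raise ValueError("max_loops must be at least 1")
    return chain * max_loops + [catch_all]


def build_page(alert_name: str, runbook_url: str) -> Dict[str, str]:
    """Refuse to build a page that has no runbook link."""
    if not runbook_url:
        raise ValueError(f"alert {alert_name!r} has no runbook link")
    return {"summary": alert_name, "runbook": runbook_url}


# Hypothetical usage: two passes through the chain, then the catch-all.
sequence = page_sequence(
    ["primary", "secondary", "manager"],
    max_loops=2,
    catch_all="major-incident-process",
)
page = build_page("db-outage", "https://runbooks.example.com/db-outage")
```

Rejecting a page without a runbook at build time is one way to enforce the rule; checking it when the alert is defined, rather than at 3am, would be the friendlier place to fail.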
