Designing an Escalation Policy
An alert fires at 3am and pages the primary on-call. Walk me through what your escalation policy should do from that moment, step by step, and tell me what failure modes you're designing around.
Step one: the primary gets paged through a channel that will actually wake them: push plus an escalating phone call, not just a Slack message. The policy starts an acknowledgement timer, say 5 to 8 minutes.

Step two: if the primary acks, escalation stops and they own it. If the timer expires with no ack, the policy automatically pages the secondary. Same ack timer.

Step three: if the secondary also doesn't ack, it escalates to the on-call manager or engineering manager. For a real major incident, this is also where you'd auto-create an incident channel and pull in an incident commander.

That's the happy path. The failure modes I'm actually designing around are the interesting part:

- The first responder is asleep or has no signal. That's why there's a secondary, and why the notification escalates from push to phone call rather than relying on one quiet channel.
- The alert dead-ends. Every policy needs a final catch-all so an unacked alert eventually reaches a manager or opens an incident. It should never just stop.
- The wrong team gets paged. Routing should send the alert to the team that owns the failing service, based on a service or team label, not to a single central pager.
- Non-urgent things page humans. Severity-based routing matters here. A SEV1 pages immediately and notifies the manager and incident channel. A SEV3 should become a ticket or a Slack message, not a 3am phone call. If it doesn't need a human right now, it isn't a page.
- Alert storms. When one outage trips fifty alerts, grouping and deduplication keep it to one notification instead of fifty phone calls.

The edge case I'd call out: don't let escalation loop forever. Loop the chain once, maybe twice, then hand off to a guaranteed catch-all like a major-incident process. And every page should carry a link to its runbook, so the person you woke up at 3am has somewhere to start instead of a blank screen.
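To make the chain concrete, here is a minimal Python sketch of the walk-through above: page each level, wait for an ack, escalate on timeout, and hand off to a catch-all after a capped number of loops. It is a toy simulation, not any paging tool's API. The names `Level`, `EscalationPolicy`, `escalate`, and the specific timeouts are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Level:
    name: str             # who gets paged at this level (hypothetical names)
    ack_timeout_min: int  # minutes to wait for an ack before escalating


@dataclass
class EscalationPolicy:
    levels: list[Level]
    max_loops: int = 2                       # cap so the chain never loops forever
    catch_all: str = "major-incident-process"  # guaranteed final hand-off


def escalate(policy: EscalationPolicy, ack_delay_min: dict[str, int]) -> list[str]:
    """Return the event timeline for one alert.

    ack_delay_min maps a responder to the minutes they take to ack after
    being paged. Responders missing from the map never ack.
    """
    events = []
    clock = 0
    for loop in range(1, policy.max_loops + 1):
        for level in policy.levels:
            events.append(f"t+{clock}m page {level.name} (loop {loop})")
            delay = ack_delay_min.get(level.name)
            if delay is not None and delay <= level.ack_timeout_min:
                clock += delay
                events.append(f"t+{clock}m ack by {level.name}")
                return events  # an ack stops escalation
            clock += level.ack_timeout_min  # timer expired, move to next level
    # Nobody acked within the loop cap: never dead-end, always hand off.
    events.append(f"t+{clock}m hand off to {policy.catch_all}")
    return events


# Assumed timeouts: 5 minutes for responders, 10 for the manager.
policy = EscalationPolicy(
    levels=[Level("primary", 5), Level("secondary", 5), Level("manager", 10)]
)

# Primary sleeps through the page, secondary acks after 3 minutes.
for line in escalate(policy, {"secondary": 3}):
    print(line)
# t+0m page primary (loop 1)
# t+5m page secondary (loop 1)
# t+8m ack by secondary
```

Passing an empty dict to `escalate` shows the dead-end defence: the chain runs twice and the last event is always the catch-all hand-off rather than silence.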
This is the core mid-level question on the topic. A weak candidate describes only the happy path (primary, then secondary, then manager). A strong one talks about what they're defending against: dead-ended alerts, single points of failure, wrong-team routing, and non-actionable pages. The phrase to listen for is severity-based routing, because that's where most teams' escalation pain actually comes from. Auto-creating an incident channel, attaching runbooks, and capping the escalation loop are senior-leaning signals.
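Severity-based routing is easier to evaluate with something concrete in front of you. The Python sketch below is a tool-neutral illustration of the decision a candidate should be describing: route by team label, page only for SEV1, turn SEV3 into a ticket, and group a storm into one notification. The `ROUTES` table, the label names `team`, `severity`, and `name`, and the target naming scheme are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical severity table: SEV1 pages and notifies, SEV3 becomes a ticket.
ROUTES = {
    "sev1": {"page": True, "also_notify": ["manager", "incident-channel"]},
    "sev3": {"page": False, "also_notify": []},
}


def group_alerts(alerts: list[dict]) -> list[dict]:
    """Collapse alerts sharing (team, severity) into one notification each."""
    groups = defaultdict(dict)
    for alert in alerts:
        key = (alert["team"], alert["severity"])
        groups[key][alert["name"]] = None  # dict keys deduplicate, keep order

    notifications = []
    for (team, severity), names in groups.items():
        rule = ROUTES[severity]
        notifications.append({
            # Send to the owning team, not a single central pager.
            "target": f"{team}-oncall" if rule["page"] else f"{team}-tickets",
            "page": rule["page"],
            "also_notify": rule["also_notify"],
            "alerts": list(names),
        })
    return notifications


# One outage trips fifty SEV1 alerts for one team, plus an unrelated SEV3.
storm = [
    {"team": "payments", "severity": "sev1", "name": f"HighErrorRate-{i}"}
    for i in range(50)
]
storm.append({"team": "search", "severity": "sev3", "name": "DiskWarning"})

for n in group_alerts(storm):
    print(n["target"], "page" if n["page"] else "ticket", len(n["alerts"]), "alerts")
# payments-oncall page 50 alerts
# search-tickets ticket 1 alerts
```

Fifty-one alerts become one page and one ticket, which is the behaviour the strong answer is reaching for.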
Escalation policy with ack timeouts and a manager catch-all (Terraform + PagerDuty)
Severity-based routing so only real urgency pages a human (Alertmanager)
Common mistakes
- No acknowledgement timeout, or one so long the alert sits unanswered for 20 minutes before escalating.
- No final catch-all. If the last person in the chain misses the page, the alert silently dies and nobody finds out until customers complain.
- Paging on non-actionable or low-severity alerts. Every false page at 3am makes the next real one easier to ignore.
- Relying on a single notification channel like Slack. If their phone is on silent, a Slack ping does nothing.
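The four problems above are mechanical enough to check automatically. As a rough sketch, a small Python lint over a policy definition might look like the following. The policy shape (a dict with `levels`, `catch_all`, and `pages_on`), the 10-minute ceiling, and the choice of which channels count as quiet are assumptions for illustration, not any vendor's schema.

```python
MAX_ACK_TIMEOUT_MIN = 10            # assumed ceiling; 20 minutes is clearly too long
QUIET_CHANNELS = {"slack", "email"}  # assumed: channels that won't wake anyone
NON_PAGING_SEVERITIES = {"sev3"}     # low severity should be a ticket, not a page


def lint_policy(policy: dict) -> list[str]:
    """Return a list of human-readable problems found in a policy definition."""
    problems = []

    for level in policy.get("levels", []):
        name = level["name"]
        timeout = level.get("ack_timeout_min")
        if timeout is None:
            problems.append(f"{name}: no acknowledgement timeout")
        elif timeout > MAX_ACK_TIMEOUT_MIN:
            problems.append(f"{name}: ack timeout of {timeout} min is too long")

        # True when the level has no channels, or only quiet ones.
        if set(level.get("channels", [])) <= QUIET_CHANNELS:
            problems.append(f"{name}: no channel that will wake the responder")

    if not policy.get("catch_all"):
        problems.append("no final catch-all: an unacked alert would silently die")

    low = sorted(set(policy.get("pages_on", [])) & NON_PAGING_SEVERITIES)
    if low:
        problems.append(f"pages on low-severity alerts: {', '.join(low)}")

    return problems


bad_policy = {
    "levels": [{"name": "primary", "channels": ["slack"]}],
    "catch_all": None,
    "pages_on": ["sev1", "sev3"],
}

good_policy = {
    "levels": [
        {"name": "primary", "ack_timeout_min": 5, "channels": ["push", "phone"]},
        {"name": "secondary", "ack_timeout_min": 5, "channels": ["push", "phone"]},
    ],
    "catch_all": "engineering-manager",
    "pages_on": ["sev1"],
}

for problem in lint_policy(bad_policy):
    print(problem)
```

Running a check like this in CI against the policy definitions is one way to keep these mistakes from creeping back in, though the thresholds would need tuning per team.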
Follow-up questions
- How does your escalation policy change for a SEV1 versus a SEV3?
- What would you do differently when you've got 100 services instead of one?
- How do you stop a non-urgent alert from waking up an entire escalation chain?