Designing an On-Call Schedule
You've got six engineers split across two time zones and you need 24/7 coverage. How would you actually design the rotation? Walk me through the trade-offs you'd weigh.
With two time zones, the goal I'd push for is follow-the-sun, so nobody gets paged at 3am on a regular basis. Each region covers their own daytime, and the pager hands off when their day ends.
Concretely: split the six into two groups of three by region. Each region runs a weekly rotation during their working hours. When region A's day ends, coverage passes to region B who is just starting theirs. Overnight pages for region A land on region B, who is awake. That alone removes most of the pain of on-call.
I'd also run a primary plus secondary, not primary only. The secondary is the safety net for when the primary misses a page. With three people per region you can do that without crushing anyone: roughly one week as primary and a different week as secondary out of every three.
Rotation length: weekly is the usual sweet spot. Daily means constant context switching and bad handoffs. Monthly means one bad month ruins someone and they lose touch with the system between turns. A week is long enough to keep context and short enough to recover.
The trade-offs I care about:
- Fairness. Load should be even, and the schedule has to bend for PTO and holidays through overrides, not by silently dumping shifts on whoever's available.
- Handoffs. Every shift change needs a written handoff: open incidents, risky changes shipping, anything being watched. A follow-the-sun handoff that loses context is worse than one tired person who remembers everything.
- Coverage gaps. With only three per region, watch for the case where one person is on primary and there's nobody fresh for secondary. Never schedule the same person as both primary and secondary at once.
- Compensation. On-call is work. Whether it's pay, time off, or reduced project load that week, it has to be acknowledged or people quietly resent it.
The anti-pattern I'd avoid: a single 24/7 rotation in one region where someone eats overnight pages every shift. With two time zones you've been handed the fix for free, so use it.
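To make the rotation concrete, here is a minimal Python sketch of the design described above. It is an illustration under stated assumptions, not a real scheduling tool: the engineer names, the `ROTATION_EPOCH` date, and the idea that the two regions are far enough apart to split the day into two 12-hour UTC windows are all hypothetical. The secondary is simply next week's primary, which gives each person one primary week, one secondary week, and one week off out of every three.

```python
from datetime import datetime, timezone

# Hypothetical roster: three engineers per region (placeholder names).
# Assumption: the regions are ~12 hours apart, so each owns a 12-hour UTC window.
REGIONS = {
    "A": {"engineers": ["ana", "ben", "cho"], "start_hour_utc": 0},
    "B": {"engineers": ["dev", "eli", "fay"], "start_hour_utc": 12},
}
SHIFT_HOURS = 12
# Hypothetical start of week 0 (a Monday, 00:00 UTC).
ROTATION_EPOCH = datetime(2024, 1, 1, tzinfo=timezone.utc)


def active_region(now: datetime) -> str:
    """Return the region whose daily window contains `now`."""
    hour = now.astimezone(timezone.utc).hour
    for name, cfg in REGIONS.items():
        if (hour - cfg["start_hour_utc"]) % 24 < SHIFT_HOURS:
            return name
    raise ValueError("coverage gap: no region owns this hour")


def on_call(now: datetime) -> tuple[str, str]:
    """Return (primary, secondary) for the timezone-aware moment `now`."""
    engineers = REGIONS[active_region(now)]["engineers"]
    week = (now - ROTATION_EPOCH).days // 7
    primary = engineers[week % len(engineers)]
    # The secondary is next week's primary, so it is never the same person.
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary
```

Calling `on_call(datetime(2024, 1, 1, 15, tzinfo=timezone.utc))` falls in region B's window during week 0, so it returns region B's first engineer as primary and the second as secondary. Making the secondary next week's primary is one possible choice; it means whoever takes over the pager has already spent a week watching the system.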
This tests whether a mid-level engineer can reason about real scheduling constraints rather than reciting "we use PagerDuty." The detail that separates strong answers is recognizing that two time zones is an opportunity for follow-the-sun, not just a complication. Listen for primary plus secondary, sane rotation length with a justification, and the human factors: fairness, PTO overrides, handoffs, and compensation. Bonus points if they flag that a six-person team means rotations come around often and that's a retention risk.
Follow-the-sun schedule with two regional layers (Terraform + PagerDuty)
Adding a PTO override so the schedule bends instead of breaking
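As a tool-agnostic illustration of the override idea, the sketch below spreads one engineer's PTO days across the rest of the regional pool instead of handing the whole week to one person. It is a rough sketch, not PagerDuty's override API: the function names `cover_pto` and `who_is_primary`, the engineer names, and the "fewest cover days so far, ties broken alphabetically" rule are all assumptions made for the example.

```python
from datetime import date, timedelta


def cover_pto(absent, pto_days, pool):
    """Return {day: engineer} overrides for the days `absent` is on PTO.

    Each day goes to the teammate with the fewest cover days so far
    (ties broken alphabetically), so the load is shared.
    """
    load = {person: 0 for person in pool if person != absent}
    if not load:
        raise ValueError("nobody available to cover")
    overrides = {}
    for day in sorted(pto_days):
        cover = min(load, key=lambda person: (load[person], person))
        overrides[day] = cover
        load[cover] += 1
    return overrides


def who_is_primary(day, scheduled, overrides):
    """An override wins for its day; otherwise the base schedule stands."""
    return overrides.get(day, scheduled)


# Hypothetical example: "ben" is scheduled as primary but is out for four days.
pto = [date(2024, 1, 8) + timedelta(days=i) for i in range(4)]
overrides = cover_pto("ben", pto, ["ana", "ben", "cho"])
```

In this example the four days alternate between the two remaining engineers, and any day without an override still resolves to the scheduled primary. Splitting by day trades some handoff overhead for fairness; covering in larger blocks is an equally reasonable choice if the written handoff is the bigger concern.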
- Running a single 24/7 rotation where the same person absorbs overnight pages. With two time zones, follow-the-sun removes most night pages for free.
- Skipping the secondary. Primary-only means one missed page (phone on silent, dead zone) and the alert goes unanswered.
- Picking a rotation length without thinking. Daily rotations wreck handoffs and context; monthly rotations burn people out and let them lose touch between turns.
- Treating fairness and PTO as afterthoughts. No override mechanism means the schedule quietly becomes unfair and people stop trusting it.
- One engineer goes on PTO during their scheduled week. How do you cover it without dumping it on one person?
- What would you change about this design if the team grew to 30 engineers?
- How do you actually compensate people for being on-call, and what happens if you don't?