Designing an On-Call Schedule

mid
intermediate
Incident Management
Question

You've got six engineers split across two time zones and you need 24/7 coverage. How would you actually design the rotation? Walk me through the trade-offs you'd weigh.

Answer

With two time zones, the goal I'd push for is follow-the-sun, so nobody gets paged at 3am on a regular basis. Each region covers their own daytime, and the pager hands off when their day ends.

Concretely: split the six into two groups of three by region. Each region runs a weekly rotation during their working hours. When region A's day ends, coverage passes to region B who is just starting theirs. Overnight pages for region A land on region B, who is awake. That alone removes most of the pain of on-call.

I'd also run a primary plus secondary, not primary only. The secondary is the safety net for when the primary misses a page. With three people per region you can do that without crushing anyone: roughly one week as primary and a different week as secondary out of every three.

Rotation length: weekly is the usual sweet spot. Daily means constant context switching and bad handoffs. Monthly means one bad month ruins someone and they lose touch with the system between turns. A week is long enough to keep context and short enough to recover.

The trade-offs I care about:
  • Fairness. Load should be even, and the schedule has to bend for PTO and holidays through overrides, not by silently dumping shifts on whoever's available.
  • Handoffs. Every shift change needs a written handoff: open incidents, risky changes shipping, anything being watched. A follow-the-sun handoff that loses context is worse than one tired person who remembers everything.
  • Coverage gaps. With only three per region, watch for the case where one person is on primary and there's nobody fresh for secondary. Never schedule the same person as both primary and secondary at once.
  • Compensation. On-call is work. Whether it's pay, time off, or reduced project load that week, it has to be acknowledged or people quietly resent it.

The anti-pattern I'd avoid: a single 24/7 rotation in one region where someone eats overnight pages every shift. With two time zones you've been handed the fix for free, so use it.
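To make the rotation concrete, here is a minimal Python sketch of who holds the pager at a given instant. Everything specific in it is an assumption for illustration: the engineer names, the region labels, the idea that the two regions sit roughly twelve hours apart so each covers a 12-hour UTC window, and the rotation epoch. The secondary is simply the next person in the ring, which is one reasonable choice among several.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical roster: names and region labels are placeholders.
REGIONS = {
    "A": ["alice", "bob", "carol"],
    "B": ["dan", "erin", "frank"],
}

# Assumed rotation epoch: Monday 2024-01-01 00:00 UTC. Weeks are counted from here.
EPOCH = datetime(2024, 1, 1, tzinfo=timezone.utc)


def active_region(now: datetime) -> str:
    """Return the region on shift.

    Assumption: region A covers 00:00-12:00 UTC and region B covers
    12:00-24:00 UTC. `now` must be timezone-aware.
    """
    hour = now.astimezone(timezone.utc).hour
    return "A" if hour < 12 else "B"


def on_call(now: datetime) -> tuple[str, str, str]:
    """Return (region, primary, secondary) for a timezone-aware instant."""
    region = active_region(now)
    team = REGIONS[region]
    # Whole weeks elapsed since the epoch selects this week's primary.
    week = (now - EPOCH) // timedelta(weeks=1)
    primary = team[week % len(team)]
    # The next person in the ring is secondary, so it can never equal the
    # primary as long as the team has at least two people.
    secondary = team[(week + 1) % len(team)]
    return region, primary, secondary
```

With three people per region this gives each engineer one primary week, one secondary week, and one week off in every three. Calling `on_call(datetime(2024, 1, 1, 15, tzinfo=timezone.utc))` falls in region B's assumed window in the first week of the rotation.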

Why This Matters

This tests whether a mid-level engineer can reason about real scheduling constraints rather than reciting "we use PagerDuty." The detail that separates strong answers is recognizing that two time zones is an opportunity for follow-the-sun, not just a complication. Listen for primary plus secondary, sane rotation length with a justification, and the human factors: fairness, PTO overrides, handoffs, and compensation. Bonus points if they flag that a six-person team means rotations come around often and that's a retention risk.

Code Examples

Follow-the-sun schedule with two regional layers (Terraform + PagerDuty)

hcl

Adding a PTO override so the schedule bends instead of breaking

bash
Common Mistakes
  • Running a single 24/7 rotation where the same person absorbs overnight pages. With two time zones, follow-the-sun removes most night pages for free.
  • Skipping the secondary. Primary-only means one missed page (phone on silent, dead zone) and the alert goes unanswered.
  • Picking a rotation length without thinking. Daily rotations wreck handoffs and context; monthly rotations burn people out and let them lose touch between turns.
  • Treating fairness and PTO as afterthoughts. No override mechanism means the schedule quietly becomes unfair and people stop trusting it.
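Two of these mistakes can be caught mechanically. The sketch below is a tool-agnostic illustration in Python, not any paging product's API: `Shift`, `apply_override`, and `lint` are hypothetical names, the roster is made up, and the `max_spread` threshold of one shift is an arbitrary assumption. It applies a PTO override to one region's weekly schedule and then checks for a person holding both roles and for uneven primary load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    week: int
    primary: str
    secondary: str


def apply_override(shifts, week, absent, cover):
    """Return a new schedule where `cover` takes `absent`'s roles in `week`.

    The input list is left untouched so the base rotation stays auditable.
    """
    out = []
    for s in shifts:
        if s.week == week:
            s = Shift(
                s.week,
                cover if s.primary == absent else s.primary,
                cover if s.secondary == absent else s.secondary,
            )
        out.append(s)
    return out


def lint(shifts, roster, max_spread=1):
    """Return a list of human-readable problems found in the schedule."""
    problems = []
    for s in shifts:
        if s.primary == s.secondary:
            problems.append(
                f"week {s.week}: {s.primary} is both primary and secondary"
            )
    # Count primary weeks for everyone on the roster, including people
    # who ended up with zero.
    load = {person: 0 for person in roster}
    for s in shifts:
        load[s.primary] = load.get(s.primary, 0) + 1
    if load and max(load.values()) - min(load.values()) > max_spread:
        problems.append(f"uneven primary load: {load}")
    return problems
```

Running `lint` after every override is the point: if the cover is already that week's secondary, the first check fires, and if the same person keeps absorbing covers without a swap back, the second one does.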
Follow-up Questions
Interviewers often ask these as follow-up questions
  • One engineer goes on PTO during their scheduled week. How do you cover it without dumping it on one person?
  • What would you change about this design if the team grew to 30 engineers?
  • How do you actually compensate people for being on-call, and what happens if you don't?
Tags
incident-management
on-call
scheduling
pagerduty