Scaling an On-Call Program Across Many Teams
You've been asked to design the on-call program for an org that grew from one team to fifteen in a year. Right now it's a free-for-all. What does a healthy on-call program look like at that scale, and how would you measure whether it's working?
The core shift is from central on-call to service ownership: the team that builds a service runs its on-call. You build it, you run it. A single central ops team can't own fifteen teams' services, because the people getting paged have no power to fix the root cause, so the noise never goes away. So I'd route alerts to the owning team based on a service or team label, and make sure every service has a named owner. No orphan services, no central pager catching everything.

Then I'd standardize the things that should be consistent and let teams own the rest. Standardize: severity definitions, the incident process (who declares an incident, the incident commander role, comms), the paging tooling and a golden-path alerting setup, runbook expectations, and the compensation policy. Let each team own their own schedule and their own alerts within that frame. Consistency where it matters, autonomy where it doesn't.

The part most people skip is governance through metrics instead of vibes. I'd track on-call health across teams:
- Pages per shift, and specifically off-hours pages per shift. This is the burnout signal.
- Percent of pages that were actionable versus auto-resolved.
- Mean time to acknowledge (MTTA), plus MTTR.
- Alerts with no runbook attached.
- On-call satisfaction from a short recurring survey.

Then set thresholds that trigger a review. For example, any team over a set number of off-hours pages per week gets a dedicated session to fix their alerting, treated as real work with time allocated, not a nag. On-call health becomes something with an owner and a dashboard, like a product.

The edge cases at this scale:
- Teams too small to staff 24/7 alone. Pool a few related teams into a shared rotation, or use follow-the-sun across regions. Don't pretend a four-person team can sustainably cover nights forever.
- Shared infrastructure. The platform team owns the platform, but you need crisp boundaries so they don't become the dumping ground for every alert that's vaguely infra-shaped. Alerts route to the service owner first.
- Culture. None of the metrics work without blameless postmortems. If reporting a noisy alert or a bad night gets you blamed, people hide it and the data goes dark.

The failure mode I'd watch hardest: a central team quietly staying responsible for everyone's reliability. That breaks the feedback loop where the team that can fix the noise is the team that feels it.
This is a staff or senior-plus question about organizational design, not tooling. The single idea that has to be present is service ownership: you build it, you run it, with alerts routed to the owning team. From there, strong candidates separate what to standardize (severity, incident process, tooling, comp) from what to leave to teams (schedules, their own alerts), and they govern with on-call health metrics rather than just MTTR. Listen for off-hours page rate as the burnout signal, the small-team pooling problem, and the platform-team-as-dumping-ground trap. Tying it back to blameless culture shows they understand why the metrics survive contact with reality.
Route alerts to the owning team, not a central pager (Alertmanager)
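Alertmanager expresses this as a routing tree that matches on alert labels. The idea itself is tool-independent, so here is a minimal Python sketch of it rather than a real Alertmanager config. The `Alert` type, the `SERVICE_OWNERS` and `TEAM_RECEIVERS` tables, and the receiver names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    labels: dict[str, str] = field(default_factory=dict)


# Hypothetical ownership registry: every service maps to exactly one team.
SERVICE_OWNERS = {"checkout": "payments", "search-api": "discovery"}

# Hypothetical pager targets, one per owning team.
TEAM_RECEIVERS = {"payments": "pager-payments", "discovery": "pager-discovery"}


class UnownedAlertError(Exception):
    """Raised when an alert cannot be traced to an owning team."""


def route(alert: Alert) -> str:
    """Return the receiver for the team that owns this alert.

    An explicit `team` label wins; otherwise the owner is looked up
    from the `service` label. There is deliberately no central
    catch-all receiver.
    """
    team = alert.labels.get("team")
    if team is None:
        team = SERVICE_OWNERS.get(alert.labels.get("service", ""))
    if team is None or team not in TEAM_RECEIVERS:
        raise UnownedAlertError(f"no owner for alert {alert.name!r}: {alert.labels}")
    return TEAM_RECEIVERS[team]
```

The design choice worth noticing is the missing fallback. In this sketch an alert with no resolvable owner fails loudly, which could be run as a check in CI over the alert rules, instead of landing on a shared pager where the orphan service stays hidden.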
On-call health SLIs you can actually put on a dashboard
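As a rough sketch of how these indicators could be computed from exported page records, assuming a hypothetical `Page` record, a 09:00 to 18:00 weekday definition of business hours, timestamps already in the on-call engineer's local time, and a made-up review threshold:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Page:
    fired_at: datetime            # assumed to be in the responder's local time
    acked_at: Optional[datetime]  # None if the page was never acknowledged
    actionable: bool              # False if it auto-resolved or needed no action
    has_runbook: bool


def is_off_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Weekends and anything outside start_hour..end_hour count as off-hours."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def oncall_health(pages: list[Page], shifts: int) -> dict:
    """Summarize one team's pages over a period covering `shifts` shifts."""
    if not pages or shifts <= 0:
        raise ValueError("need at least one page and one shift")
    ack_secs = [
        (p.acked_at - p.fired_at).total_seconds()
        for p in pages
        if p.acked_at is not None
    ]
    return {
        "pages_per_shift": len(pages) / shifts,
        "off_hours_pages_per_shift": sum(is_off_hours(p.fired_at) for p in pages) / shifts,
        "actionable_pct": 100 * sum(p.actionable for p in pages) / len(pages),
        "pages_without_runbook": sum(not p.has_runbook for p in pages),
        "mtta_seconds": sum(ack_secs) / len(ack_secs) if ack_secs else None,
    }


def needs_review(health: dict, threshold: float = 2.0) -> bool:
    """Flag a team for an alerting review; the default threshold is an assumption."""
    return health["off_hours_pages_per_shift"] > threshold
```

Running `oncall_health` per team over the same period gives comparable rows for a dashboard, and `needs_review` turns the burnout signal into a trigger for allocated cleanup time. The survey score is left out because it comes from a separate source.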
- Keeping a central ops team responsible for everyone's services. The people getting paged can't fix the root cause, so the noise is permanent.
- Measuring only MTTR and ignoring on-call load. A team can have great MTTR while quietly burning out from page volume.
- Mandating tooling without a migration path or a golden-path setup, so each team reinvents alerting badly.
- Treating on-call quality as each team's private business with no org-wide visibility. Without shared metrics, the worst rotations stay invisible until people leave.
- How do you stop the central platform team from becoming the dumping ground for every alert?
- Which metrics would warn you that on-call is unhealthy before people start quitting?
- How do you handle a team that's too small to staff 24/7 coverage on its own?