Question

You've been asked to design the on-call program for an org that grew from one team to fifteen in a year. Right now it's a free-for-all. What does a healthy on-call program look like at that scale, and how would you measure whether it's working?

Accepted Answer

The core shift is from central on-call to service ownership. The team that builds a service runs its on-call. You build it, you run it. A single central ops team can't own fifteen teams' services, because the people getting paged have no power to fix the root cause, so the noise never goes away.

So I'd route alerts to the owning team based on a service or team label, and make sure every service has a named owner. No orphan services, no central pager catching everything.
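As a rough sketch of that routing rule, assuming alerts carry `team` and `service` labels and that ownership lives in a simple lookup table: the registry contents, the `Alert` class, and the function names below are all hypothetical, not tied to any particular paging tool.

```python
from dataclasses import dataclass, field

# Hypothetical ownership registry: service name -> owning team.
SERVICE_OWNERS = {
    "checkout-api": "payments",
    "search-indexer": "search",
}


class OrphanServiceError(LookupError):
    """Raised when an alert cannot be mapped to an owning team."""


@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)


def route(alert, owners=SERVICE_OWNERS):
    """Return the team that should be paged for this alert.

    Assumption: an explicit `team` label wins; otherwise the `service`
    label is looked up in the registry. There is deliberately no
    catch-all fallback, so an unowned service fails loudly.
    """
    team = alert.labels.get("team")
    if team:
        return team
    service = alert.labels.get("service")
    if service in owners:
        return owners[service]
    raise OrphanServiceError(
        f"no owner for alert {alert.name!r} (service={service!r})"
    )


def find_orphans(services, owners=SERVICE_OWNERS):
    """List services that have no named owner, for a periodic audit."""
    return sorted(s for s in services if s not in owners)
```

Raising on an unowned service instead of falling back to a default pager is the point: the orphan shows up as a gap to fix, not as one more page for a central team.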

Then I'd standardize the things that should be consistent and let teams own the rest. Standardize: severity definitions, the incident process (who declares an incident, the incident commander role, comms), the paging tooling and a golden-path alerting setup, runbook expectations, and the compensation policy. Let each team own their own schedule and their own alerts within that frame. Consistency where it matters, autonomy where it doesn't.

The part most people skip is governance through metrics instead of vibes. I'd track on-call health across teams:

- Pages per shift, and specifically off-hours pages per shift. This is the burnout signal.
- Percent of pages that were actionable versus auto-resolved.
- Mean time to acknowledge (MTTA) and mean time to resolve (MTTR).
- Alerts with no runbook attached.
- On-call satisfaction from a short recurring survey.
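
Most of these can be computed from exported page records. A minimal sketch, assuming each page has fired, acknowledged, and resolved timestamps plus two flags; the `Page` fields, the 09:00 to 18:00 weekday definition of business hours, and measuring MTTR from fire time to resolve time are my assumptions, and the survey score is left out because it comes from a different source.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Page:
    team: str
    fired_at: datetime
    acked_at: Optional[datetime]      # None if nobody acknowledged
    resolved_at: Optional[datetime]   # None if still open
    actionable: bool                  # False for auto-resolved or no-op pages
    has_runbook: bool


def is_off_hours(ts, start_hour=9, end_hour=18):
    """Assumed definition: weekends, or outside 09:00-18:00 on weekdays."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def _minutes(start, end):
    return (end - start).total_seconds() / 60


def summarize(pages, shifts):
    """Compute on-call health numbers for one team over `shifts` shifts."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    ack_times = [_minutes(p.fired_at, p.acked_at) for p in pages if p.acked_at]
    fix_times = [_minutes(p.fired_at, p.resolved_at) for p in pages if p.resolved_at]
    off_hours = sum(1 for p in pages if is_off_hours(p.fired_at))
    actionable = sum(1 for p in pages if p.actionable)
    return {
        "pages_per_shift": len(pages) / shifts,
        "off_hours_pages_per_shift": off_hours / shifts,
        "actionable_pct": 100 * actionable / len(pages) if pages else None,
        "mtta_minutes": mean(ack_times) if ack_times else None,
        "mttr_minutes": mean(fix_times) if fix_times else None,
        "pages_without_runbook": sum(1 for p in pages if not p.has_runbook),
    }
```

Running `summarize` per team over the same window gives rows for a cross-team dashboard, which is what makes the numbers comparable.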

Then set thresholds that trigger a review. For example, any team over a set number of off-hours pages per week gets a dedicated session to fix their alerting, treated as real work with time allocated, not a nag. On-call health becomes something with an owner and a dashboard, like a product.
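
The threshold check itself can be tiny. A sketch, assuming a weekly count of off-hours pages per team is already available; the limit of 5 is a placeholder I made up, not a recommended number, and the function name is hypothetical.

```python
# Placeholder limit: the right number is an org-level policy decision.
OFF_HOURS_PAGES_PER_WEEK_LIMIT = 5


def teams_needing_review(off_hours_pages_by_team,
                         limit=OFF_HOURS_PAGES_PER_WEEK_LIMIT):
    """Return teams whose weekly off-hours page count exceeds the limit.

    `off_hours_pages_by_team` maps team name -> off-hours pages this week.
    A team exactly at the limit is not flagged.
    """
    return sorted(
        team
        for team, count in off_hours_pages_by_team.items()
        if count > limit
    )


if __name__ == "__main__":
    week = {"payments": 7, "search": 5, "platform": 0}
    for team in teams_needing_review(week):
        print(f"schedule alert-review session for {team}")
```

The output of this is a work item with time allocated, not a warning email, which is what keeps it from turning into a nag.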

The edge cases at this scale:

- Teams too small to staff 24/7 alone. Pool a few related teams into a shared rotation, or use follow-the-sun across regions. Don't pretend a four-person team can sustainably cover nights forever.
- Shared infrastructure. The platform team owns the platform, but you need crisp boundaries so they don't become the dumping ground for every alert that's vaguely infra-shaped. Alerts route to the service owner first.
- Culture. None of the metrics work without blameless postmortems. If reporting a noisy alert or a bad night gets you blamed, people hide it and the data goes dark.

The failure mode I'd watch hardest: a central team quietly staying responsible for everyone's reliability. That breaks the feedback loop where the team that can fix the noise is the team that feels it.

Scaling an On-Call Program Across Many Teams
