Error Budget Management
Your service has a 99.9% availability SLO over a 30-day window. How much downtime does that give you, and what do you actually do with that error budget day-to-day?
A 99.9% SLO over 30 days gives you an error budget of 0.1%, which is about 43.2 minutes of total downtime (or roughly 4,320 failed requests out of every 4.32 million). The error budget is not just a number to watch -- it's a decision-making tool. Here's how I'd use it:

When the budget is healthy (say, 70%+ remaining), the team moves fast. Ship features, run experiments, do risky deployments. You have room to absorb failures. This is the whole point -- error budgets turn the reliability vs. velocity argument into a data-driven decision.

When the budget is getting low (under 30% remaining with time left in the window), slow down. Require extra review on deployments, hold off on risky changes, and focus on reliability work like adding retry logic or improving rollback speed.

When the budget is burned (0% remaining), stop feature deployments entirely. The team shifts to reliability work only: fixing the issues that burned the budget, improving monitoring, adding automated rollbacks. This is the error budget policy, and it needs to be agreed on before you need it, not during an incident.

The key insight is that error budgets align incentives. Product teams want to ship fast; SRE teams want reliability. The error budget gives both sides a shared framework: "We can ship this risky feature because we have budget" or "We need to pause and fix things because we're out of budget."
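The arithmetic and the policy tiers above can be sketched in a few lines. This is an illustrative calculator, not any standard library's API -- the function names and tier labels are my own, and the 70%/30%/0% thresholds come straight from the answer:

```python
# Error-budget math for a time-based availability SLO.
# All names here are illustrative, not from a real SRE tool.

def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, over the window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the budget still unspent (1.0 = untouched, 0.0 = burned)."""
    total = error_budget_minutes(slo, window_days)
    return max(0.0, 1 - downtime_minutes / total)

def policy_action(remaining: float) -> str:
    """Map remaining budget to the pre-agreed policy tiers."""
    if remaining <= 0.0:
        return "freeze: feature deployments stop, reliability work only"
    if remaining < 0.3:
        return "slow down: extra review, defer risky changes"
    if remaining >= 0.7:
        return "healthy: ship freely"
    return "normal caution"

budget = round(error_budget_minutes(0.999, 30), 1)  # 43.2 minutes
remaining = budget_remaining(0.999, 30, downtime_minutes=35.0)
action = policy_action(remaining)  # ~19% left -> "slow down: ..."
```

Note that the policy lives in code as a pure function of the budget level -- that is the "pre-agreed" part: the decision is mechanical, not negotiated per incident.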
This question separates candidates who understand error budgets as a cultural and operational tool from those who just know the math. The calculation is table stakes -- what matters is whether they can explain the error budget policy and how it changes team behavior. Strong candidates will talk about the policy being pre-agreed and about using budgets to balance velocity and reliability.
Error budget calculation and burn rate
Error budget policy document (team agreement)
- Treating the error budget as just a monitoring metric instead of a decision-making tool with real consequences (like deployment freezes).
- Not having a pre-agreed error budget policy. If you decide what to do after the budget is burned, the conversation becomes political instead of data-driven.
- Confusing burn rate with budget remaining. A high burn rate with 90% budget left is more urgent than a low burn rate with 20% left -- you need to react to the rate of consumption, not just the current level.
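The burn-rate point in the last bullet is easy to make concrete: burn rate is the observed error rate divided by the budgeted rate, so a burn rate of 1.0 spends the budget exactly over the full window, and anything higher exhausts it early. A minimal sketch, with my own helper names (not a standard API):

```python
# Burn rate: how fast the budget is being consumed relative to the
# rate that would spend it exactly over the full window.

def burn_rate(error_rate: float, slo: float) -> float:
    """e.g. a 3% error rate against a 99.9% SLO is a 30x burn rate."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(remaining_fraction: float, window_days: int,
                        rate: float) -> float:
    """Runway until the budget hits zero at the current burn rate."""
    return remaining_fraction * window_days * 24 / rate

# High burn rate with 90% of the budget left: gone in under a day.
fast = hours_to_exhaustion(0.9, 30, rate=30.0)   # 21.6 hours
# Low burn rate with only 20% left: almost two weeks of runway.
slow = hours_to_exhaustion(0.2, 30, rate=0.5)    # 288 hours
```

This is why the bullet says to react to the rate of consumption: the 90%-remaining service is hours from a burned budget, while the 20%-remaining one is not.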
- What's the difference between a time-based and request-based error budget, and when would you pick one over the other?
- How do you handle a situation where a single incident burns 80% of the error budget? Is that different from many small incidents adding up to 80%?
- What do you do when the product team pushes back on a feature freeze because of a burned error budget?
- How do rolling windows vs. calendar windows affect error budget behavior?
More SRE interview questions
Also worth your time on this topic
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
A step-by-step checklist for defining service level objectives, picking the right service level indicators, and using error budgets to make better decisions about reliability vs. feature velocity.
45-90 minutes
SLO vs SLI vs SLA Differences
Your team just launched a new API service. Your manager asks you to set up SLOs for it. Can you walk me through what SLOs, SLIs, and SLAs are, and how they relate to each other?
junior
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.