Error Budget Burn Investigation
It's Monday morning. You check the dashboard and see that your service burned 80% of its monthly error budget over the weekend. Walk me through how you'd investigate this and what you'd do next.
First, I need to understand the shape of the burn. Was it one big incident or a slow leak? I'd start by pulling up the error budget burn-down chart for the weekend.

If I see a cliff -- a steep drop at a specific time -- that's an incident. I'd look at the deployment history: did someone ship something Friday afternoon? Check the change log, git history, and any config changes. Most budget burns trace back to a change.

If the chart shows a steady downward slope, that's a slow degradation. This is harder to track down. I'd look at the error breakdown by endpoint, by error code, and by dependency. Often it's a single endpoint or a downstream service that started flaking. Maybe a database connection pool is saturating under weekend batch job load, or a third-party API started rate-limiting you.

Once I know the cause, here's what I'd do:

1. Fix the immediate problem. Roll back the bad deploy, scale up the bottleneck, or disable the flaky feature behind a feature flag.
2. Trigger the error budget policy. With only 20% budget left, we're in the critical zone. That means feature freeze -- no new deployments except reliability fixes. I'd communicate this to the product team immediately, not as a request but as the policy kicking in.
3. Run a blameless postmortem. Focus on why the burn wasn't caught sooner. Were the burn rate alerts misconfigured? Was nobody watching the on-call channel? Did the slow-burn alert fire but get ignored?
4. Build in prevention. If a Friday deploy caused this, maybe we need a policy of no deploys after Thursday, or better canary analysis. If a dependency caused it, we need circuit breakers or fallbacks.

The bigger question is why 80% of the budget burned before anyone noticed. That's a monitoring and alerting gap, and it's usually more important to fix than the original incident.
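To make the scale of the burn concrete, here is a back-of-the-envelope calculation. The scenario doesn't state the SLO, so this sketch assumes the 99.9% availability target over a 30-day window used in the related question below, and treats the weekend as roughly two days.

```python
# Back-of-the-envelope burn-rate math. Assumes a 99.9% SLO over a 30-day
# window and a ~2-day weekend -- plug in your own numbers.

SLO = 0.999
WINDOW_DAYS = 30

# Total error budget, as a fraction of requests and as equivalent downtime.
budget_fraction = 1 - SLO                                   # 0.1% of requests may fail
budget_minutes = WINDOW_DAYS * 24 * 60 * budget_fraction    # ~43.2 minutes of full downtime

# 80% of the budget gone in a single weekend (~2 days):
burned_fraction = 0.80
burn_days = 2
burn_rate = (burned_fraction * WINDOW_DAYS) / burn_days     # 12x the sustainable rate

print(f"Monthly error budget: {budget_minutes:.1f} min of full downtime "
      f"(or a {budget_fraction:.1%} error rate sustained all month)")
print(f"Weekend burn rate: {burn_rate:.0f}x sustainable -- at that pace the "
      f"whole budget is gone in {WINDOW_DAYS / burn_rate:.1f} days")
```

A burn that fast should trip a fast-burn alert within hours, which is why the detection gap matters as much as the root cause.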
This is a scenario-based senior question that tests incident response thinking, not just SLO knowledge. You want to see structured investigation (not random guessing), awareness of the error budget policy, and a focus on systemic improvements. The best candidates will immediately ask about the burn shape (cliff vs. slope) because the investigation path is completely different. Watch for whether they mention the process gap (why wasn't this caught?) -- that's what separates senior from mid-level thinking.
Investigation queries to diagnose budget burn
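A minimal sketch of the "error breakdown by endpoint" step, using the Prometheus HTTP range-query API. The Prometheus address, the `http_requests_total` metric name, and its `status`/`path` labels are assumptions -- substitute whatever your service actually exports.

```python
import datetime as dt
import requests

PROM = "http://prometheus:9090"   # assumed address; point at your Prometheus

# Error ratio per endpoint over the last 3 days. Metric name and labels
# (http_requests_total, status, path) are assumptions -- adjust to your metrics.
QUERY = (
    'sum by (path) (rate(http_requests_total{status=~"5.."}[1h]))'
    ' / sum by (path) (rate(http_requests_total[1h]))'
)

end = dt.datetime.now(dt.timezone.utc)
start = end - dt.timedelta(days=3)            # covers Friday evening onward

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={"query": QUERY, "start": start.timestamp(),
            "end": end.timestamp(), "step": "15m"},
    timeout=30,
)
resp.raise_for_status()
series_list = resp.json()["data"]["result"]

# Worst endpoints first, so a single bad endpoint (the slow-leak case) stands out.
def peak(series):
    return max((float(v) for _, v in series["values"] if v != "NaN"), default=0.0)

for series in sorted(series_list, key=peak, reverse=True)[:10]:
    endpoint = series["metric"].get("path", "<aggregate>")
    print(f"{endpoint}: peak 1h error ratio {peak(series):.2%}")
```

Plotting the same query over time shows the cliff-vs-slope shape directly; this console version is just the fastest way to rank suspects.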
Quick weekend incident timeline from deploy history
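A quick way to line the burn up against changes is to pull the weekend's merge history from git. This sketch assumes deploys correspond to merge commits on `origin/main`; if you have a CD system or deploy log, query that instead.

```python
import subprocess

# Merges to main over the last few days, oldest first, to line up against the
# burn-down chart. Assumes deploys map to merge commits on origin/main.
log = subprocess.run(
    [
        "git", "log", "origin/main",
        "--merges",
        "--since=3 days ago",
        "--reverse",
        "--date=iso-local",
        "--pretty=format:%h  %ad  %an  %s",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(log.stdout or "No merges in the window -- check config changes and dependencies instead.")
```

If the first elevated-error timestamp lands just after one of these entries, you have your prime suspect; if nothing lines up, widen the search to config pushes, infra changes, and dependency status pages.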
Postmortem template for budget burn events
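There is no single canonical format; this is a minimal skeleton slanted toward budget-burn events, with the detection-gap questions from the answer above built in. Every bracketed field is a placeholder to fill in.

```markdown
# Postmortem: <service> error budget burn, <dates>

## Impact
- Budget consumed: <X>% of the 30-day budget (<Y> minutes of unavailability)
- Users / endpoints affected:

## Timeline (all times UTC)
- <time> -- first elevated errors
- <time> -- burn began accelerating
- <time> -- detected (alert, dashboard, or user report?)
- <time> -- mitigated

## Root cause
<What changed or degraded, and why it burned budget>

## Detection gap
- Did a burn-rate alert fire? If not, why not?
- If it fired, who received it and what happened next?
- Time from first impact to first human awareness:

## Action items
- [ ] Immediate fix / rollback verified
- [ ] Alerting change so a similar burn pages within <N> minutes
- [ ] Prevention (canary analysis, deploy policy, circuit breaker, ...)
```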
- Jumping straight to 'fix the bug' without first understanding the full scope. Is it one endpoint or the whole service? Is it still happening or did it stop? You need the full picture before acting.
- Skipping the process failure analysis. The budget burned over a weekend and nobody noticed until Monday -- that's a bigger problem than whatever caused the errors. Always ask why the alerting and response process failed.
- Treating the error budget policy as optional. If the policy says feature freeze at 20% budget remaining, enforce it. Making exceptions undermines the entire system and trains teams to ignore SLOs.
- How would your response differ if the burn was caused by a third-party dependency you don't control?
- The product team has a major launch next week and is pushing back on the feature freeze. How do you handle that conversation?
- Should you adjust the SLO target after repeated budget burns, or is the problem always in the system?
More SRE interview questions
Also worth your time on this topic
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
A step-by-step checklist for defining service level objectives, picking the right service level indicators, and using error budgets to make better decisions about reliability vs. feature velocity.
45-90 minutes
Error Budget Management
Your service has a 99.9% availability SLO over a 30-day window. How much downtime does that give you, and what do you actually do with that error budget day-to-day?
Difficulty: mid
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.