SLO-Based Alerting and Burn Rates
Traditional alerting fires when error rate crosses a static threshold, like 'alert if errors > 1%'. What's wrong with that approach, and how would you set up SLO-based alerting instead?
Static threshold alerts have two problems. First, they fire too often on things that don't matter: a brief 2% error spike lasting 30 seconds won't meaningfully affect your monthly SLO, but a static alert pages someone at 3 a.m. for it. Second, they miss slow burns: an error rate sitting at 0.5% for days looks fine against a 1% threshold, but against a 99.9% SLO it's a 5x burn rate that quietly exhausts your error budget in about six days.

SLO-based alerting fixes both problems by alerting on burn rate -- how fast you're consuming your error budget relative to your SLO window. A burn rate of 1x means you'd spend exactly your budget over the window; a burn rate of N exhausts a 30-day budget in 30/N days. The approach uses multi-window, multi-burn-rate alerts. You set up two kinds:

1. A fast-burn alert for acute incidents: "At this rate, you'll exhaust your entire 30-day error budget in under a day." This is a 36x burn rate (burning 36 times faster than sustainable, which empties a 30-day budget in about 20 hours). You check it over a short window (5 minutes) and a longer confirmation window (1 hour), and it pages someone immediately.

2. A slow-burn alert for gradual degradation: "At this rate, you'll exhaust your budget in 3 days." This is a 10x burn rate, checked over a 30-minute short window and a 6-hour long window. It creates a ticket, not a page.

The two-window check prevents false positives: the short window catches the current condition, while the long window confirms it's not just a blip. Both must be true before the alert fires. This maps directly to the approach in Google's SRE Workbook, and it works well in practice because you only get paged for things that actually threaten your SLO.
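For concreteness, the burn rate multiplier maps to budget exhaustion time as window length divided by burn rate. Assuming a 30-day SLO window:

| Burn rate | Time to exhaust a 30-day budget |
| --- | --- |
| 1x | 30 days (exactly sustainable) |
| 6x | 5 days |
| 10x | 3 days (the slow-burn alert above) |
| 36x | ~20 hours (the fast-burn alert above) |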
This is a strong mid-to-senior question because it tests whether the candidate has actually operated SLO-based systems or just read about them. The multi-window burn rate concept trips up a lot of people. Listen for whether they understand why two windows are needed (short window alone gives false positives, long window alone alerts too late). Candidates who have done this in practice will mention specific burn rate numbers and the tradeoff between alert sensitivity and noise.
Prometheus alerting rules for multi-window burn rate
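A minimal sketch of what these rules can look like, assuming a 99.9% availability SLO over 30 days and a service instrumented with an `http_requests_total` counter carrying a `code` label. The metric, `job="checkout"` selector, and rule names are illustrative; swap in your own instrumentation.

```yaml
groups:
  - name: checkout-slo-burn-rate
    rules:
      # Error ratio over the four windows used by the two alerts.
      # Metric and label names are illustrative; adjust to your instrumentation.
      - record: job:error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total{job="checkout"}[5m]))
      - record: job:error_ratio:rate30m
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[30m]))
            /
          sum by (job) (rate(http_requests_total{job="checkout"}[30m]))
      - record: job:error_ratio:rate1h
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
            /
          sum by (job) (rate(http_requests_total{job="checkout"}[1h]))
      - record: job:error_ratio:rate6h
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[6h]))
            /
          sum by (job) (rate(http_requests_total{job="checkout"}[6h]))

      # Fast burn: 36x the sustainable rate. For a 99.9% SLO the error budget
      # is 0.001, so the threshold is 36 * 0.001. Both windows must agree.
      - alert: ErrorBudgetFastBurn
        expr: |
          job:error_ratio:rate5m{job="checkout"} > (36 * 0.001)
          and
          job:error_ratio:rate1h{job="checkout"} > (36 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Fast error budget burn (36x) on checkout"

      # Slow burn: 10x the sustainable rate, confirmed over 30m and 6h. Ticket only.
      - alert: ErrorBudgetSlowBurn
        expr: |
          job:error_ratio:rate30m{job="checkout"} > (10 * 0.001)
          and
          job:error_ratio:rate6h{job="checkout"} > (10 * 0.001)
        labels:
          severity: ticket
        annotations:
          summary: "Slow error budget burn (10x) on checkout"
```

The `and` between the short- and long-window series is what implements the two-window confirmation: the alert only fires while both windows exceed the threshold.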
Calculate current burn rate from Prometheus
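To see the current burn rate on a dashboard or in an ad-hoc query, divide the observed error ratio by the error budget. A sketch using the same assumed metric and a 99.9% target:

```promql
# Burn rate over the last hour = error ratio / error budget (1 - 0.999 = 0.001).
# A value of 1 means you are burning at exactly the sustainable rate;
# 36 means the 30-day budget would be gone in roughly 20 hours.
(
  sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
    /
  sum(rate(http_requests_total{job="checkout"}[1h]))
) / 0.001
```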
- Setting up SLO alerts but still keeping the old static threshold alerts running alongside them. This leads to alert fatigue because you get paged twice for the same incident from different systems.
- Using only a single window for burn rate alerts. A short window alone gives false positives on brief spikes. A long window alone means you don't catch fast-moving incidents until significant budget is already gone.
- Forgetting that burn rate alerts need enough request volume to be meaningful. For low-traffic services, a single failed request can show a 100x burn rate over a 5-minute window.
- Why do you need both a short window and a long window? What goes wrong if you only use one?
- How would you tune these burn rates for a service that has natural traffic spikes, like an e-commerce site during sales events?
- What's the relationship between burn rate multiplier and the time to budget exhaustion? How do you pick the right multipliers?
- How do you handle SLO alerting for services with very low traffic where statistical significance is a problem?
More SRE interview questions
Also worth your time on this topic
Choosing the Right SLIs
You're joining a team that runs a checkout service for an e-commerce platform. There are no SLOs yet. How would you decide which SLIs to track?
mid
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
A step-by-step checklist for defining service level objectives, picking the right service level indicators, and using error budgets to make better decisions about reliability vs. feature velocity.
45-90 minutes