Choosing the Right SLIs
You're joining a team that runs a checkout service for an e-commerce platform. There are no SLOs yet. How would you decide which SLIs to track?
Start with what your users care about. For a checkout service, users care about three things: can I complete my purchase (availability), how long does it take (latency), and does it charge me the right amount (correctness).

I'd pick SLIs from the four golden signals -- latency, error rate, throughput, and saturation -- but not all of them become SLOs. For a checkout service, I'd focus on:

1. Availability -- the ratio of successful checkout requests (non-5xx) to total requests. This is your most important SLI because a broken checkout directly costs money.

2. Latency -- the 95th percentile response time for checkout completion. Not the average, because averages hide tail latency. A p95 of 2 seconds means 5% of your users are waiting longer than that, and those are often paying customers mid-purchase.

3. Correctness -- the percentage of checkout transactions where the charged amount matches the cart total. This is easy to overlook, but a wrong charge is worse than a slow page.

I'd skip throughput as an SLO target because it varies naturally with traffic patterns; it's good to monitor but bad as an objective.

Finally, I'd measure the SLIs from the user's perspective, not the server's. Measure at the load balancer or edge, not inside your application, because that's closer to what the user actually experiences.
This question tests practical thinking. You want to see if the candidate can connect SLIs to business impact rather than just listing textbook metrics. Strong candidates will reason from user experience backward to metrics. They'll mention why certain metrics matter more for this specific service type. Watch for candidates who just list generic metrics without tying them to the checkout use case.
Prometheus queries for checkout SLIs
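A minimal sketch of the availability and latency SLIs above, assuming a request counter named `http_requests_total` (with `route` and `code` labels) and a latency histogram named `http_request_duration_seconds` -- both hypothetical metric names; substitute whatever your service actually exports:

```promql
# Availability SLI: fraction of non-5xx checkout requests over 30 days
sum(rate(http_requests_total{route="/checkout", code!~"5.."}[30d]))
/
sum(rate(http_requests_total{route="/checkout"}[30d]))

# Latency SLI: p95 checkout response time over a 5-minute window
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m]))
)
```

The `sum by (le)` aggregation matters: `histogram_quantile` needs the bucket label (`le`) preserved while collapsing instance and pod labels, otherwise you get a per-instance quantile instead of a service-wide one.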
Grafana dashboard panel for SLI tracking
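One way to surface the availability SLI on a dashboard is a Grafana stat panel. A sketch of the panel JSON, reusing the hypothetical `http_requests_total` metric from above and an illustrative 99.9% threshold:

```json
{
  "title": "Checkout Availability SLI (30d)",
  "type": "stat",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{route=\"/checkout\", code!~\"5..\"}[30d])) / sum(rate(http_requests_total{route=\"/checkout\"}[30d]))",
      "legendFormat": "availability"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "min": 0.99,
      "max": 1,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "red", "value": null },
          { "color": "green", "value": 0.999 }
        ]
      }
    }
  }
}
```

The `percentunit` unit renders the 0-1 ratio as a percentage, and the threshold steps flip the stat from red to green once availability crosses the SLO target.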
- Using averages instead of percentiles for latency. An average of 200ms can hide a p99 of 10 seconds, which means some users are having a terrible experience.
- Picking infrastructure metrics (CPU, memory, disk) as SLIs instead of user-facing metrics. High CPU doesn't necessarily mean users are affected.
- Setting the same SLIs for every service regardless of what the service does. A batch job and a real-time API need very different indicators.
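The first mistake -- averages hiding the tail -- is easy to demonstrate side by side. A sketch, again assuming a hypothetical `http_request_duration_seconds` histogram:

```promql
# Average latency -- can look healthy even when the tail is bad
sum(rate(http_request_duration_seconds_sum{route="/checkout"}[5m]))
/
sum(rate(http_request_duration_seconds_count{route="/checkout"}[5m]))

# p99 latency -- exposes the slow tail the average hides
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m]))
)
```

Plotting both on the same graph makes the gap visible: a steady 200ms average next to a p99 spiking to 10 seconds is exactly the scenario described above.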
- Would you pick different SLIs for an internal batch processing service vs. this user-facing checkout service?
- How do you handle SLIs when the service depends on third-party payment providers?
- Where would you instrument the measurement -- at the load balancer, in the application code, or using synthetic probes?
- How many SLOs should a single service have? What's the risk of having too many?