Choosing the Right SLIs
You're joining a team that runs a checkout service for an e-commerce platform. There are no SLOs yet. How would you decide which SLIs to track?
Start with what your users care about. For a checkout service, users care about three things: can I complete my purchase (availability), how long does it take (latency), and does it charge me the right amount (correctness).

I'd pick SLIs from the four golden signals -- latency, error rate, throughput, and saturation -- but not all of them become SLOs. For a checkout service, I'd focus on:

1. Availability -- the ratio of successful checkout requests (non-5xx) to total requests. This is your most important SLI because a broken checkout directly costs money.

2. Latency -- the 95th percentile response time for checkout completion. Not the average, because averages hide tail latency. A p95 of 2 seconds means 5% of your users are waiting longer than that, and those are often paying customers mid-purchase.

3. Correctness -- the percentage of checkout transactions where the charged amount matches the cart total. This is easy to overlook, but a wrong charge is worse than a slow page.

I'd skip throughput as an SLO target because it varies naturally with traffic patterns; it's good to monitor but bad as an objective.

Finally, I'd measure the SLIs from the user's perspective, not the server's. Measure at the load balancer or edge, not inside your application, because that's closer to what the user actually experiences.
This question tests practical thinking. You want to see if the candidate can connect SLIs to business impact rather than just listing textbook metrics. Strong candidates will reason from user experience backward to metrics. They'll mention why certain metrics matter more for this specific service type. Watch for candidates who just list generic metrics without tying them to the checkout use case.
Prometheus queries for checkout SLIs
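A minimal sketch of the availability and latency SLIs above, assuming a request counter named `http_requests_total` (with `route` and `code` labels) and a latency histogram named `http_request_duration_seconds` -- both hypothetical metric names; substitute whatever your service actually exports:

```promql
# Availability SLI: fraction of non-5xx checkout requests over 30 days
sum(rate(http_requests_total{route="/checkout", code!~"5.."}[30d]))
/
sum(rate(http_requests_total{route="/checkout"}[30d]))

# Latency SLI: p95 checkout response time over a 5-minute window
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m]))
)
```

The `sum by (le)` aggregation matters: `histogram_quantile` needs the bucket label (`le`) preserved while collapsing instance and pod labels, otherwise you get a per-instance quantile instead of a service-wide one.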
Grafana dashboard panel for SLI tracking
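One way to surface the availability SLI on a dashboard is a Grafana stat panel. A sketch of the panel JSON, reusing the hypothetical `http_requests_total` metric from above and an illustrative 99.9% threshold:

```json
{
  "title": "Checkout Availability SLI (30d)",
  "type": "stat",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{route=\"/checkout\", code!~\"5..\"}[30d])) / sum(rate(http_requests_total{route=\"/checkout\"}[30d]))",
      "legendFormat": "availability"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "min": 0.99,
      "max": 1,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "red", "value": null },
          { "color": "green", "value": 0.999 }
        ]
      }
    }
  }
}
```

The `percentunit` unit renders the 0-1 ratio as a percentage, and the threshold steps flip the stat from red to green once availability crosses the SLO target.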
- Using averages instead of percentiles for latency. An average of 200ms can hide a p99 of 10 seconds, which means some users are having a terrible experience.
- Picking infrastructure metrics (CPU, memory, disk) as SLIs instead of user-facing metrics. High CPU doesn't necessarily mean users are affected.
- Setting the same SLIs for every service regardless of what the service does. A batch job and a real-time API need very different indicators.
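The first mistake -- averages hiding the tail -- is easy to demonstrate side by side. A sketch, again assuming a hypothetical `http_request_duration_seconds` histogram:

```promql
# Average latency -- can look healthy even when the tail is bad
sum(rate(http_request_duration_seconds_sum{route="/checkout"}[5m]))
/
sum(rate(http_request_duration_seconds_count{route="/checkout"}[5m]))

# p99 latency -- exposes the slow tail the average hides
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m]))
)
```

Plotting both on the same graph makes the gap visible: a steady 200ms average next to a p99 spiking to 10 seconds is exactly the scenario described above.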
- Would you pick different SLIs for an internal batch processing service vs. this user-facing checkout service?
- How do you handle SLIs when the service depends on third-party payment providers?
- Where would you instrument the measurement -- at the load balancer, in the application code, or using synthetic probes?
- How many SLOs should a single service have? What's the risk of having too many?