Choosing the Right SLIs

mid
intermediate
SRE
Question

You're joining a team that runs a checkout service for an e-commerce platform. There are no SLOs yet. How would you decide which SLIs to track?

Answer

Start with what your users care about. For a checkout service, that comes down to three things: can I complete my purchase (availability), how long does it take (latency), and was I charged the right amount (correctness).

I'd draw candidate SLIs from the four golden signals: latency, traffic (throughput), errors, and saturation. But not all of them become SLOs. For a checkout service, I'd focus on:

1. Availability -- the ratio of successful checkout requests (non-5xx) to total requests. This is your most important SLI because a broken checkout directly costs money.
2. Latency -- the 95th percentile response time for checkout completion. Not the average, because averages hide tail latency. A p95 of 2 seconds means 5% of requests take longer than that, and those are often paying customers mid-purchase.
3. Correctness -- the percentage of checkout transactions where the charged amount matches the cart total. This is easy to overlook, but a wrong charge is worse than a slow page.

I'd skip throughput as an SLO target because it varies naturally with traffic patterns. It's good to monitor but bad as an objective.

I'd also define the SLIs from the user's perspective, not the server's. Measure at the load balancer or edge rather than inside your application, because that's closer to what the user actually experiences.
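To make the three SLIs concrete, here is a minimal Python sketch that computes them from a batch of request records. The `CheckoutRequest` record shape and its field names are assumptions made for illustration, not something a particular load balancer or payment system emits, and the percentile uses the simple nearest-rank method rather than whatever a monitoring backend would apply.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional, Sequence


@dataclass(frozen=True)
class CheckoutRequest:
    """One checkout request as observed at the edge (hypothetical record shape)."""
    status_code: int
    latency_ms: float
    cart_total_cents: Optional[int] = None  # None when no charge was attempted
    charged_cents: Optional[int] = None     # None when no charge was attempted


def availability(requests: Sequence[CheckoutRequest]) -> float:
    """Ratio of non-5xx requests to all requests."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_percentile(requests: Sequence[CheckoutRequest], pct: int) -> float:
    """Nearest-rank percentile of request latency in milliseconds."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def correctness(requests: Sequence[CheckoutRequest]) -> float:
    """Ratio of charged transactions where the charge equals the cart total."""
    charged = [r for r in requests if r.charged_cents is not None]
    if not charged:
        raise ValueError("no charged transactions in the measurement window")
    correct = sum(1 for r in charged if r.charged_cents == r.cart_total_cents)
    return correct / len(charged)


if __name__ == "__main__":
    # Made-up sample window: 18 clean requests, one 503, one wrong charge.
    window = [CheckoutRequest(200, 100.0 * i, 1000, 1000) for i in range(1, 19)]
    window.append(CheckoutRequest(503, 1900.0))
    window.append(CheckoutRequest(200, 2000.0, 1000, 1200))

    print(f"availability: {availability(window):.3f}")
    print(f"p95 latency:  {latency_percentile(window, 95):.0f} ms")
    print(f"correctness:  {correctness(window):.3f}")
```

Note that the wrong charge in the sample still counts as "available" because it returned a 200. That is the reason correctness is tracked as its own SLI rather than folded into the error rate.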

Why This Matters

This question tests practical thinking. You want to see if the candidate can connect SLIs to business impact rather than just listing textbook metrics. Strong candidates will reason from user experience backward to metrics. They'll mention why certain metrics matter more for this specific service type. Watch for candidates who just list generic metrics without tying them to the checkout use case.

Code Examples

Prometheus queries for checkout SLIs

promql

Grafana dashboard panel for SLI tracking

yaml
Common Mistakes
  • Using averages instead of percentiles for latency. An average of 200ms can hide a p99 of 10 seconds, which means some users are having a terrible experience.
  • Picking infrastructure metrics (CPU, memory, disk) as SLIs instead of user-facing metrics. High CPU doesn't necessarily mean users are affected.
  • Setting the same SLIs for every service regardless of what the service does. A batch job and a real-time API need very different indicators.
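The first mistake is easy to demonstrate with numbers. The sketch below uses a made-up latency distribution (985 fast requests and 15 very slow ones, chosen purely for illustration) and compares the mean with a nearest-rank p99.

```python
from math import ceil
from statistics import mean, median
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], pct: int) -> float:
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Synthetic data: 98.5% of requests take 50 ms, 1.5% take 10 seconds.
latencies_ms = [50.0] * 985 + [10_000.0] * 15

mean_latency = mean(latencies_ms)
median_latency = median(latencies_ms)
p99_latency = nearest_rank_percentile(latencies_ms, 99)

print(f"mean:   {mean_latency:.2f} ms")
print(f"median: {median_latency:.2f} ms")
print(f"p99:    {p99_latency:.2f} ms")
```

With this distribution the mean lands just under 200 ms while the p99 is the full 10 seconds, so a dashboard showing only the average would look healthy even though roughly 1 in 67 requests is effectively broken.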
Follow-up Questions
Interviewers often ask these as follow-up questions
  • Would you pick different SLIs for an internal batch processing service vs. this user-facing checkout service?
  • How do you handle SLIs when the service depends on third-party payment providers?
  • Where would you instrument the measurement -- at the load balancer, in the application code, or using synthetic probes?
  • How many SLOs should a single service have? What's the risk of having too many?
Tags
sre
slis
slos
monitoring
observability