Canary Releases in Progressive Delivery
You're deploying a new version of a critical payment service. Walk me through how you'd set up a canary release for it.
First, I'd deploy the new version alongside the current one but route only a small slice of traffic to it, say 2-5%. The canary gets real production traffic while the stable version handles the rest. I'd set up automated analysis on key metrics: error rate, p99 latency, and business metrics like payment success rate, each compared between the canary and the stable version. If the canary's error rate is more than 0.5 percentage points higher or its p99 latency is more than 50ms worse, the rollout automatically stops and traffic shifts back to the stable version. If metrics look healthy after 15-30 minutes, I'd step up to 10%, then 25%, then 50%, and finally 100%, with the same automated checks at each step. For a payment service specifically, I'd also watch for downstream effects like increased charge disputes or webhook delivery failures. The whole process might take 2-4 hours for a low-risk change or a full day for something significant. Tools like Flagger or Argo Rollouts can automate the traffic shifting and metric analysis so you're not manually watching dashboards.
This scenario-based question forces the candidate to think through a real deployment instead of reciting definitions. Strong answers include specific metrics, thresholds, and time windows. Listen for whether they mention automated rollback, business-level metrics beyond just error rates, and how they'd size the initial canary percentage based on the service's criticality.
Argo Rollouts canary strategy with automated analysis
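The stepped rollout described in the answer could be expressed roughly as the Rollout manifest below. This is a sketch under assumptions, not a production config: the service name, image, replica count, and the `payment-canary-analysis` template name are hypothetical, and the 30-minute pauses are one choice from the 15-30 minute window mentioned above. No `trafficRouting` block is included, so Argo Rollouts would approximate each weight through the ratio of canary to stable pods; exact percentages need a supported ingress or service mesh.

```yaml
# Sketch only: names, image, replica count, and pause durations are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service              # hypothetical service name
spec:
  replicas: 20                       # with no traffic router, 5% is roughly 1 of 20 pods
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:2.0.0   # placeholder image
  strategy:
    canary:
      # Background analysis runs alongside the steps below. If it fails,
      # the rollout aborts and traffic returns to the stable version.
      analysis:
        templates:
          - templateName: payment-canary-analysis   # hypothetical AnalysisTemplate
        startingStep: 1              # start at the first pause, once the 5% weight is set
        args:
          - name: service-name
            value: payment-service
      steps:
        - setWeight: 5
        - pause: {duration: 30m}
        - setWeight: 10
        - pause: {duration: 30m}
        - setWeight: 25
        - pause: {duration: 30m}
        - setWeight: 50
        - pause: {duration: 30m}
        # After the final step the new version is promoted to 100%.
```

Four 30-minute pauses give a rollout of about two hours, at the low end of the 2-4 hour range; a riskier change would lengthen the pauses or add intermediate weights.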
Prometheus-based canary analysis template
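A template along the following lines could encode the two rollback thresholds from the answer. It is a sketch that assumes the error-rate threshold is an absolute difference of 0.5 percentage points (0.005) and that latency is recorded in seconds (50ms = 0.05). The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `app`, `track`, and `status` labels, and the Prometheus address are all assumptions about how the service is instrumented.

```yaml
# Sketch only: metric names, labels, and the Prometheus address are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-canary-analysis      # hypothetical template name
spec:
  args:
    - name: service-name
  metrics:
    # Canary 5xx ratio minus stable 5xx ratio must stay within 0.005.
    - name: error-rate-delta
      interval: 1m
      failureLimit: 1                # tolerate a single failed measurement
      successCondition: result[0] <= 0.005
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          # "or vector(0)" keeps the numerator defined when no 5xx series exist.
          query: |
            (
              (sum(rate(http_requests_total{app="{{args.service-name}}",track="canary",status=~"5.."}[5m])) or vector(0))
              /
              sum(rate(http_requests_total{app="{{args.service-name}}",track="canary"}[5m]))
            )
            -
            (
              (sum(rate(http_requests_total{app="{{args.service-name}}",track="stable",status=~"5.."}[5m])) or vector(0))
              /
              sum(rate(http_requests_total{app="{{args.service-name}}",track="stable"}[5m]))
            )
    # Canary p99 latency minus stable p99 latency must stay within 50ms.
    - name: p99-latency-delta
      interval: 1m
      failureLimit: 1
      successCondition: result[0] <= 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{app="{{args.service-name}}",track="canary"}[5m])))
            -
            histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{app="{{args.service-name}}",track="stable"}[5m])))
```

A business metric such as payment success rate would follow the same canary-minus-stable pattern, with a threshold chosen from the service's own baseline.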
- Setting the initial canary percentage too high for a critical service, defeating the purpose of a gradual rollout
- Only checking error rates without looking at latency, business metrics, or downstream service health
- Not waiting long enough at each step to collect enough data for a statistically meaningful comparison
- How would you handle a canary release for a service that processes async jobs instead of HTTP requests?
- What's the minimum amount of traffic you need through the canary before the metrics are statistically meaningful?
- How do you do canary releases when the change involves a database schema migration?