Istio Retries and Retry Amplification
How do you configure retries in Istio, and what's the danger of being too aggressive with them?
Retries are set on the VirtualService route with `retries.attempts`, `retries.perTryTimeout`, and `retries.retryOn`. The `retryOn` field is the important one: it controls which failures get retried. Common values are `5xx`, `reset`, `connect-failure`, and `gateway-error`. Never put `retriable-4xx` on POST endpoints unless you know they're idempotent.

The big danger is retry amplification in a deep call graph. If the gateway makes up to 3 attempts, the next hop makes up to 3, and the hop after that makes up to 3, a single user request can produce 3 × 3 × 3 = 27 backend calls. Because `attempts` counts retries on top of the original try, setting `attempts: 3` at all three hops is worse still: up to 4 × 4 × 4 = 64 calls. When the downstream is already struggling, those retries are exactly what tips it from slow to dead.

The fix is to retry only at the edge, or at most once at internal hops, and to set a tight `perTryTimeout` so the total time is bounded. Pair retries with circuit breakers in the DestinationRule so that once a host is failing, you stop hitting it instead of retrying into the fire. And make sure the total number of tries (the original plus `attempts` retries) multiplied by `perTryTimeout` fits within the route's overall `timeout`. If it doesn't, the timeout fires before the retries finish and the later tries never get to run in full.
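The circuit-breaker pairing mentioned in the answer could look roughly like the DestinationRule below. This is a minimal sketch, not a recommended production setting: the `catalog` host name and every threshold are assumptions chosen for illustration. `outlierDetection` ejects hosts that keep returning 5xx, and `connectionPool.http.maxRetries` caps how many retries may be in flight to the destination at once.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: catalog                  # hypothetical service name
spec:
  host: catalog                  # hypothetical host
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 50   # illustrative: queue limit before requests are rejected
        maxRetries: 3                 # illustrative: cap on concurrent outstanding retries
    outlierDetection:
      consecutive5xxErrors: 5    # eject a host after 5 consecutive 5xx responses
      interval: 10s              # how often hosts are evaluated
      baseEjectionTime: 30s      # minimum time an ejected host stays out
      maxEjectionPercent: 50     # never eject more than half the pool
```

With something like this in place, retries from the VirtualService land on the hosts that are still healthy rather than piling onto the one that is failing.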
Mid-level candidates often know how to write a retry block but don't think about second-order effects. The interviewer is listening for retry amplification, idempotency awareness, and the link between retries and circuit breakers. A weak answer is just citing the YAML fields. A strong answer ties retries to the bigger reliability story.
Safe retry policy for an idempotent GET
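A minimal sketch of such a policy, assuming a hypothetical `catalog` service. The timeout values are illustrative only. They are chosen so that the original try plus 2 retries at 800ms each (2.4s) fits inside the 3s route timeout.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog-reads            # hypothetical name
spec:
  hosts:
    - catalog                    # hypothetical host
  http:
    - match:
        - method:
            exact: GET           # only idempotent reads use this route
      route:
        - destination:
            host: catalog
      timeout: 3s                # overall budget for the request
      retries:
        attempts: 2              # up to 2 retries after the original try
        perTryTimeout: 800ms     # (1 + 2) x 800ms = 2.4s, under the 3s timeout
        retryOn: connect-failure,reset,gateway-error
```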
No retries on a non-idempotent POST
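A sketch under the same assumptions (hypothetical `catalog` service, illustrative timeout). Setting `attempts: 0` turns retries off explicitly for the route, which matters because Istio applies a default retry policy to HTTP routes that do not configure one.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog-writes           # hypothetical name
spec:
  hosts:
    - catalog                    # hypothetical host
  http:
    - match:
        - method:
            exact: POST          # non-idempotent writes
      route:
        - destination:
            host: catalog
      timeout: 5s                # illustrative overall budget
      retries:
        attempts: 0              # disable retries so a failed write is never replayed
```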
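Check the effective retry policy on a sidecar, for example by running `istioctl proxy-config routes <pod-name> -n <namespace> -o json` and looking for the `retryPolicy` block on the route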
- Setting `attempts: 5` on every route without considering call depth
- Retrying non-idempotent verbs and creating duplicate writes
- Setting `perTryTimeout` so high that retries never actually fire before the outer timeout
- What's retry amplification and how would you cap it in a five-hop call graph?
- Why is `retryOn: 5xx` risky on a POST that creates a resource?
- How do retries interact with circuit breakers in the DestinationRule?