Istio Retries and Retry Amplification

mid
intermediate
Service Mesh
Question

How do you configure retries in Istio, and what's the danger of being too aggressive with them?

Answer

Retries are set on the VirtualService HTTP route with `retries.attempts`, `retries.perTryTimeout`, and `retries.retryOn`. The `retryOn` field is the important one: it controls which failures get retried. Common values are `5xx`, `reset`, `connect-failure`, and `gateway-error`. Never put `retriable-4xx` on POST endpoints unless you know they're idempotent, and treat `5xx` on a POST the same way: the first try may have committed the write before the error came back.

The big danger is retry amplification in a deep call graph. If the gateway makes up to 3 tries, and the next hop makes up to 3 tries, and the hop after that makes up to 3 tries, a single user request can produce 3 × 3 × 3 = 27 backend calls. Note that `attempts` counts retries on top of the original try, so `attempts: 3` at all three hops is really 4 × 4 × 4 = 64 calls in the worst case. When the downstream is already struggling, those retries are exactly what tips it from slow to dead.

The fix is to retry only at the edge or only once at internal hops, and to set a tight `perTryTimeout` so the total time is bounded. Pair retries with circuit breakers in the DestinationRule so once a host is failing, you stop hitting it instead of retrying into the fire. And make sure the number of tries times `perTryTimeout`, that is (1 + `attempts`) × `perTryTimeout`, doesn't exceed the route's overall `timeout`. If it does, the timeout fires before the retries finish and the later attempts you configured never get used.
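The circuit-breaker side of that pairing could look roughly like the DestinationRule below. This is a minimal sketch, assuming a hypothetical in-mesh service named `catalog`; the thresholds are illustrative placeholders, not tuned recommendations.

```yaml
# Sketch only: "catalog" and every threshold here are assumed values.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: catalog-circuit-breaker
spec:
  host: catalog
  trafficPolicy:
    connectionPool:
      http:
        # Cap how many retries may be outstanding to this service at once,
        # which limits amplification even if callers retry aggressively.
        maxRetries: 3
        http1MaxPendingRequests: 100
    outlierDetection:
      # Eject a host after 5 consecutive 5xx responses...
      consecutive5xxErrors: 5
      interval: 10s
      # ...keep it out for at least 30s...
      baseEjectionTime: 30s
      # ...but never eject more than half of the hosts.
      maxEjectionPercent: 50
```

With something like this in place, retries land on the remaining healthy hosts instead of piling onto one that is already failing.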

Why This Matters

Mid-level candidates often know how to write a retry block but don't think about second-order effects. The interviewer is listening for retry amplification, idempotency awareness, and the link between retries and circuit breakers. A weak answer is just citing the YAML fields. A strong answer ties retries to the bigger reliability story.

Code Examples

Safe retry policy for an idempotent GET

yaml
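A minimal sketch of such a policy, assuming a hypothetical service named `catalog`. The host, resource name, and timing values are illustrative. The `retryOn` list deliberately leaves out plain `5xx` so that only connection-level failures and 502/503/504 responses are retried.

```yaml
# Sketch only: "catalog" and the timing values are assumed.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog
spec:
  hosts:
  - catalog
  http:
  - match:
    - method:
        exact: GET
    route:
    - destination:
        host: catalog
    # Overall budget: (1 + attempts) x perTryTimeout = 3s, plus room
    # for the short backoff between tries.
    timeout: 4s
    retries:
      attempts: 2          # retries on top of the original try
      perTryTimeout: 1s
      retryOn: gateway-error,connect-failure,refused-stream,reset
```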

No retries on a non-idempotent POST

yaml
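A minimal sketch, assuming a hypothetical `orders` service with a `/orders` create endpoint. Istio applies a default retry policy when a route does not specify one, so the sketch sets `attempts: 0` explicitly rather than simply leaving the `retries` block out.

```yaml
# Sketch only: "orders", the /orders prefix, and the timeout are assumed.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
  - orders
  http:
  - match:
    - method:
        exact: POST
      uri:
        prefix: /orders
    route:
    - destination:
        host: orders
    timeout: 5s
    retries:
      # Disable retries so a failed try cannot create a duplicate order.
      attempts: 0
```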

Check the effective retry policy on a sidecar

bash
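One way to do this is to dump the sidecar's route configuration with `istioctl proxy-config routes` and filter for retry policies. The sketch below assumes `istioctl` and `jq` are installed; the helper name, pod name, and namespace are hypothetical, and the JSON paths follow Envoy's route configuration layout, which may differ between versions.

```bash
# Hypothetical helper: print every route on a pod's sidecar that carries
# a retry policy, along with what the route matches on.
show_retry_policies() {
  pod="$1"
  ns="${2:-default}"
  istioctl proxy-config routes "$pod" -n "$ns" -o json \
    | jq '.[] | .virtualHosts[]? | .routes[]?
          | select(.route.retryPolicy != null)
          | {match: .match, retryPolicy: .route.retryPolicy}'
}

# Example (placeholder pod name and namespace):
#   show_retry_policies catalog-7d9f8c5b6-abcde default
```

In the output, `numRetries`, `perTryTimeout`, and `retryOn` should reflect what the VirtualService configured, or the mesh default if the route sets nothing.
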
Common Mistakes
  • Setting `attempts: 5` on every route without considering call depth
  • Retrying non-idempotent verbs and creating duplicate writes
  • Setting `perTryTimeout` so high that retries never actually fire before the outer timeout
Follow-up Questions
Interviewers often ask these as follow-up questions
  • What's retry amplification and how would you cap it in a five-hop call graph?
  • Why is `retryOn: 5xx` risky on a POST that creates a resource?
  • How do retries interact with circuit breakers in the DestinationRule?
Tags
istio
service-mesh
traffic-management
retries
reliability