
Progressive Delivery Rollback Strategy

Question

Your team just enabled a new feature for 20% of users via feature flags, and your monitoring shows a 3x increase in p99 latency for those users. Walk me through exactly what you'd do in the next 10 minutes.

Answer

First, kill the flag immediately. Don't investigate, don't wait for more data: flip the flag to 0% rollout. This should take under 30 seconds if your flag system supports instant updates.

With the flag off, confirm that p99 latency drops back to baseline. If it does, the bleeding has stopped and you can investigate calmly. If latency stays high even after disabling the flag, you have a bigger problem - the feature may have caused a state change (corrupted a cache, filled a queue, triggered a slow database migration). In that case, check downstream systems: is a queue backing up, did the feature write bad data, is a connection pool exhausted?

Next, send a quick incident notification to the team saying what happened and that the flag is off. Then investigate the root cause: pull up traces from the canary cohort and compare slow requests against the baseline. Often the problem is an N+1 query, a missing index for a new query pattern, or an external API call that works fine at 1% but chokes at 20% of traffic.

Before re-enabling, fix the issue and test it. Don't just bump back to 20%: restart at 2% and step up more slowly this time. Finally, add the specific metric that caught this problem to your canary analysis automation so it triggers an automatic rollback next time.

Why This Matters

This incident-response question tests composure and prioritization under pressure. The strongest signal is whether the candidate's first instinct is to stop the bleeding (disable the flag) before investigating. Weak candidates will say 'I'd look at the logs first' or try to debug while users are still affected. Also listen for awareness that disabling a flag doesn't always fix the damage if the feature caused side effects.

Code Examples

Emergency flag kill and verification

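A minimal sketch of the kill-and-verify sequence. The flag-service endpoint, flag key, token variable, and PATCH payload are hypothetical placeholders - adjust them to your provider's API. The Prometheus HTTP query API and `histogram_quantile` are standard.

```bash
#!/usr/bin/env bash
# Sketch: disable the flag, then confirm p99 recovery.
# FLAG_API, FLAG_KEY, and FLAG_API_TOKEN are hypothetical placeholders.
FLAG_API="${FLAG_API:-https://flags.internal.example.com}"
PROM="${PROM:-http://prometheus.monitoring.svc:9090}"
FLAG_KEY="${FLAG_KEY:-new-checkout-flow}"

# Step 1: flip the flag to 0% rollout (the kill switch).
kill_flag() {
  curl -fsS -X PATCH "${FLAG_API}/api/flags/${FLAG_KEY}" \
    -H "Authorization: Bearer ${FLAG_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"rollout_percentage": 0}'
}

# Step 2: poll p99 latency via the Prometheus query API until it returns
# to baseline (or you time out and escalate).
verify_p99() {
  local query='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))'
  local i p99
  for i in 1 2 3 4 5 6 7 8 9 10; do
    p99=$(curl -fsS --get "${PROM}/api/v1/query" \
            --data-urlencode "query=${query}" \
          | jq -r '.data.result[0].value[1]')
    echo "p99 check ${i}: ${p99}s"
    sleep 30
  done
}

# Usage (commented out so sourcing this file has no side effects):
# kill_flag && verify_p99
```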

Automated rollback rule in Argo Rollouts

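A sketch of an Argo Rollouts `AnalysisTemplate` plus canary strategy that aborts automatically when p99 breaches a threshold. Resource names, the `app` label, the Prometheus address, the image, and the 500ms threshold are all illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    failureLimit: 1                     # tolerate one bad sample; the next aborts
    failureCondition: result[0] > 0.5   # p99 above 500ms fails the analysis
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{app="checkout"}[2m])) by (le))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: example/checkout:v2    # hypothetical image
  strategy:
    canary:
      steps:
      - setWeight: 2                  # restart small after an incident, not at 20%
      - pause: {duration: 10m}
      - setWeight: 10
      - pause: {duration: 10m}
      - setWeight: 20
      analysis:                       # runs alongside the steps; failure aborts
        templates:
        - templateName: p99-latency-check
```

A failed analysis run aborts the rollout and shifts traffic back to the stable ReplicaSet, which is the Argo equivalent of flipping the flag off.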

Post-incident: automated flag rollback on metric breach

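A minimal sketch of the post-incident automation: a watchdog that polls p99 and kills the flag after consecutive breaches. The `get_p99` and `set_rollout` callables are injected so the policy is testable; in production they would wrap your metrics API and flag service (both hypothetical here).

```python
import time
from typing import Callable


def watch_flag(
    get_p99: Callable[[], float],
    set_rollout: Callable[[int], None],
    baseline_p99: float,
    breach_ratio: float = 2.0,      # roll back when p99 > 2x baseline
    consecutive_needed: int = 2,    # require 2 bad samples to avoid flapping
    max_checks: int = 60,
    interval_s: float = 0.0,
) -> bool:
    """Return True if the flag was rolled back, False if it stayed healthy."""
    bad = 0
    for _ in range(max_checks):
        if get_p99() > baseline_p99 * breach_ratio:
            bad += 1
            if bad >= consecutive_needed:
                set_rollout(0)      # kill switch: 0% rollout
                return True
        else:
            bad = 0                 # reset the streak on a healthy sample
        time.sleep(interval_s)
    return False
```

Requiring consecutive breaches trades a slightly slower rollback for immunity to single-sample noise; tune `consecutive_needed` against your scrape interval.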
Common Mistakes
  • Trying to debug the issue before disabling the flag, leaving users impacted while investigating
  • Assuming that disabling the flag fully reverts the system, ignoring state changes like database writes or queue messages
  • Re-enabling the flag at the same percentage after fixing the bug instead of starting the rollout over from a small percentage
Follow-up Questions
  • What if the feature flag service itself is down and you can't disable the flag remotely?
  • How would you handle a rollback if the feature wrote data in a new format that the old code can't read?
  • At what rollout percentage would you trust automated rollback versus requiring a human decision?
Tags
progressive-delivery
feature-flags
incident-response
rollback
ci-cd