Progressive Delivery Rollback Strategy
Your team just enabled a new feature for 20% of users via feature flags, and your monitoring shows a 3x increase in p99 latency for those users. Walk me through exactly what you'd do in the next 10 minutes.
First, kill the flag immediately. Don't investigate first, don't wait for more data: flip the flag to 0% rollout. This should take under 30 seconds if your flag system supports instant updates.

With the flag off, confirm that p99 latency drops back to baseline. If it does, the bleeding has stopped and you can investigate calmly. If latency stays high even after disabling the flag, you have a bigger problem: the feature may have caused a state change (corrupted cache, filled a queue, triggered a slow database migration). In that case, check downstream systems: is a queue backing up, did the feature write bad data, is a connection pool exhausted?

Next, send a quick incident notification to the team with what happened and that the flag is off.

Then investigate the root cause. Pull up traces from the canary cohort and compare slow requests against the baseline. Often the problem is an N+1 query, a missing index for a new query pattern, or an external API call that works fine at 1% but chokes at 20% of traffic.

Before re-enabling, fix the issue and test it. Don't just bump back to 20%; restart at 2% and step up more slowly this time. Finally, add the specific metric that caught this problem to your canary analysis automation so it triggers an automatic rollback next time.
This incident-response question tests composure and prioritization under pressure. The strongest signal is whether the candidate's first instinct is to stop the bleeding (disable the flag) before investigating. Weak candidates will say 'I'd look at the logs first' or try to debug while users are still affected. Also listen for awareness that disabling a flag doesn't always fix the damage if the feature caused side effects.
Emergency flag kill and verification
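A minimal sketch of the kill-and-verify step. The flag-service payload shape, the flag key, and the 20% recovery tolerance are assumptions for illustration, not a real vendor API; the point is that the kill is a single 0%-rollout write, and verification compares post-kill p99 samples against the pre-incident baseline.

```python
import statistics

def build_kill_payload(flag_key: str) -> dict:
    """Payload for a hypothetical flag-service API: set rollout to 0%
    with an audit reason. Field names are placeholders."""
    return {
        "flag": flag_key,
        "rollout_percent": 0,
        "reason": "incident: p99 latency regression in canary cohort",
    }

def latency_recovered(baseline_p99_ms: float,
                      post_kill_p99_ms: list,
                      tolerance: float = 1.2) -> bool:
    """Check that p99 samples taken after the kill are back within
    20% of baseline; the median guards against one noisy sample."""
    if not post_kill_p99_ms:
        return False
    return statistics.median(post_kill_p99_ms) <= baseline_p99_ms * tolerance

payload = build_kill_payload("new-checkout-flow")
recovered = latency_recovered(120.0, [130.0, 125.0, 128.0])  # back near baseline
stuck = latency_recovered(120.0, [360.0, 390.0, 350.0])      # still 3x: state damage
```

If `latency_recovered` stays false after the kill, that is the signal to pivot to downstream systems rather than the flag itself.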
Automated rollback rule in Argo Rollouts
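A sketch of the automated version in Argo Rollouts: an AnalysisTemplate that queries Prometheus for p99 latency and a canary strategy that runs it between traffic steps. The service label, Prometheus address, 500ms threshold, and step weights are assumptions to tune per service, and the Rollout is abbreviated (pod template and selector omitted).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency
spec:
  metrics:
  - name: p99-latency
    interval: 30s
    failureLimit: 1            # a single breach aborts the rollout
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="checkout"}[2m])) by (le))
    failureCondition: result[0] > 0.5   # p99 above 500ms fails the analysis
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 2           # restart small after the incident
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: p99-latency
      - setWeight: 10
      - pause: {duration: 10m}
      - setWeight: 50
```

When the analysis fails, Argo Rollouts aborts the canary and shifts traffic back to the stable version without a human in the loop.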
Post-incident: automated flag rollback on metric breach
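A minimal sketch of the post-incident guard: a watchdog that disables the flag once the canary cohort's p99 stays above a multiple of baseline for several consecutive checks. The 2x threshold, the three-check window, and the kill callback are placeholders to tune per service.

```python
class FlagRollbackGuard:
    """Auto-disables a feature flag when the canary cohort's p99
    breaches a multiple of baseline for N consecutive checks."""

    def __init__(self, kill_fn, threshold_ratio: float = 2.0,
                 consecutive_breaches: int = 3):
        self.kill_fn = kill_fn                 # callback that sets the flag to 0%
        self.threshold_ratio = threshold_ratio
        self.consecutive_breaches = consecutive_breaches
        self._breaches = 0
        self.triggered = False

    def observe(self, baseline_p99_ms: float, cohort_p99_ms: float) -> bool:
        """Feed one metric sample; returns True once rollback has fired."""
        if self.triggered:
            return True
        if cohort_p99_ms > baseline_p99_ms * self.threshold_ratio:
            self._breaches += 1
        else:
            self._breaches = 0                 # require consecutive, not cumulative, breaches
        if self._breaches >= self.consecutive_breaches:
            self.triggered = True
            self.kill_fn()
        return self.triggered

calls = []
guard = FlagRollbackGuard(kill_fn=lambda: calls.append("flag->0%"))
guard.observe(100, 150)          # within 2x: counter stays at 0
guard.observe(100, 320)          # breach 1
guard.observe(100, 310)          # breach 2
fired = guard.observe(100, 305)  # breach 3: rollback fires exactly once
```

Requiring consecutive breaches keeps a single noisy scrape from flapping the flag, while still reacting within a few check intervals.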
- Trying to debug the issue before disabling the flag, leaving users impacted while investigating
- Assuming that disabling the flag fully reverts the system, ignoring state changes like database writes or queue messages
- Re-enabling the flag at the same percentage after fixing the bug instead of starting the rollout over from a small percentage
- What if the feature flag service itself is down and you can't disable the flag remotely?
- How would you handle a rollback if the feature wrote data in a new format that the old code can't read?
- At what rollout percentage would you trust automated rollback versus requiring a human decision?