Progressive Delivery Rollback Strategy
Your team just enabled a new feature for 20% of users via feature flags, and your monitoring shows a 3x increase in p99 latency for those users. Walk me through exactly what you'd do in the next 10 minutes.
First, kill the flag immediately. Don't investigate first, don't wait for more data: flip the flag to 0% rollout. This should take under 30 seconds if your flag system supports instant updates.

With the flag off, confirm that p99 latency drops back to baseline. If it does, the bleeding has stopped and you can investigate calmly. If latency stays high even after disabling the flag, you have a bigger problem: the feature may have caused a state change (corrupted cache, filled a queue, triggered a slow database migration). In that case, check downstream systems: is a queue backing up, did the feature write bad data, is a connection pool exhausted?

Next, send a quick incident notification to the team with what happened and that the flag is off.

Then investigate the root cause. Pull up traces from the canary cohort and compare slow requests against the baseline. Often the problem is an N+1 query, a missing index for a new query pattern, or an external API call that works fine at 1% but chokes at 20% of traffic.

Before re-enabling, fix the issue and test it. Don't just bump back to 20%; restart at 2% and step up more slowly this time. Finally, add the specific metric that caught this problem to your canary analysis automation so it triggers an automatic rollback next time.
This incident-response question tests composure and prioritization under pressure. The strongest signal is whether the candidate's first instinct is to stop the bleeding (disable the flag) before investigating. Weak candidates will say 'I'd look at the logs first' or try to debug while users are still affected. Also listen for awareness that disabling a flag doesn't always fix the damage if the feature caused side effects.
Emergency flag kill and verification
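A minimal sketch of the kill-and-verify step. The flag-service payload shape, the flag key, and the 20% recovery tolerance are assumptions for illustration, not a real vendor API; the point is that the kill is a single 0%-rollout write, and verification compares post-kill p99 samples against the pre-incident baseline.

```python
import statistics

def build_kill_payload(flag_key: str) -> dict:
    """Payload for a hypothetical flag-service API: set rollout to 0%
    with an audit reason. Field names are placeholders."""
    return {
        "flag": flag_key,
        "rollout_percent": 0,
        "reason": "incident: p99 latency regression in canary cohort",
    }

def latency_recovered(baseline_p99_ms: float,
                      post_kill_p99_ms: list,
                      tolerance: float = 1.2) -> bool:
    """Check that p99 samples taken after the kill are back within
    20% of baseline; the median guards against one noisy sample."""
    if not post_kill_p99_ms:
        return False
    return statistics.median(post_kill_p99_ms) <= baseline_p99_ms * tolerance

payload = build_kill_payload("new-checkout-flow")
recovered = latency_recovered(120.0, [130.0, 125.0, 128.0])  # back near baseline
stuck = latency_recovered(120.0, [360.0, 390.0, 350.0])      # still 3x: state damage
```

If `latency_recovered` stays false after the kill, that is the signal to pivot to downstream systems rather than the flag itself.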
Automated rollback rule in Argo Rollouts
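A sketch of the automated version in Argo Rollouts: an AnalysisTemplate that queries Prometheus for p99 latency and a canary strategy that runs it between traffic steps. The service label, Prometheus address, 500ms threshold, and step weights are assumptions to tune per service, and the Rollout is abbreviated (pod template and selector omitted).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency
spec:
  metrics:
  - name: p99-latency
    interval: 30s
    failureLimit: 1            # a single breach aborts the rollout
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="checkout"}[2m])) by (le))
    failureCondition: result[0] > 0.5   # p99 above 500ms fails the analysis
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 2           # restart small after the incident
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: p99-latency
      - setWeight: 10
      - pause: {duration: 10m}
      - setWeight: 50
```

When the analysis fails, Argo Rollouts aborts the canary and shifts traffic back to the stable version without a human in the loop.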
Post-incident: automated flag rollback on metric breach
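A minimal sketch of the post-incident guard: a watchdog that disables the flag once the canary cohort's p99 stays above a multiple of baseline for several consecutive checks. The 2x threshold, the three-check window, and the kill callback are placeholders to tune per service.

```python
class FlagRollbackGuard:
    """Auto-disables a feature flag when the canary cohort's p99
    breaches a multiple of baseline for N consecutive checks."""

    def __init__(self, kill_fn, threshold_ratio: float = 2.0,
                 consecutive_breaches: int = 3):
        self.kill_fn = kill_fn                 # callback that sets the flag to 0%
        self.threshold_ratio = threshold_ratio
        self.consecutive_breaches = consecutive_breaches
        self._breaches = 0
        self.triggered = False

    def observe(self, baseline_p99_ms: float, cohort_p99_ms: float) -> bool:
        """Feed one metric sample; returns True once rollback has fired."""
        if self.triggered:
            return True
        if cohort_p99_ms > baseline_p99_ms * self.threshold_ratio:
            self._breaches += 1
        else:
            self._breaches = 0                 # require consecutive, not cumulative, breaches
        if self._breaches >= self.consecutive_breaches:
            self.triggered = True
            self.kill_fn()
        return self.triggered

calls = []
guard = FlagRollbackGuard(kill_fn=lambda: calls.append("flag->0%"))
guard.observe(100, 150)          # within 2x: counter stays at 0
guard.observe(100, 320)          # breach 1
guard.observe(100, 310)          # breach 2
fired = guard.observe(100, 305)  # breach 3: rollback fires exactly once
```

Requiring consecutive breaches keeps a single noisy scrape from flapping the flag, while still reacting within a few check intervals.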
- Trying to debug the issue before disabling the flag, leaving users impacted while investigating
- Assuming that disabling the flag fully reverts the system, ignoring state changes like database writes or queue messages
- Re-enabling the flag at the same percentage after fixing the bug instead of starting the rollout over from a small percentage
- What if the feature flag service itself is down and you can't disable the flag remotely?
- How would you handle a rollback if the feature wrote data in a new format that the old code can't read?
- At what rollout percentage would you trust automated rollback versus requiring a human decision?