Feature Flag Architecture at Scale
Your team has 200+ microservices and wants to adopt feature flags across all of them. How would you design the feature flag infrastructure?
I'd use a centralized feature flag service rather than having each team roll their own. The architecture has three layers.

First, the management plane: a central service where teams define flags, set targeting rules, and configure rollout percentages. This is backed by a database and has a UI so non-engineers can manage flags too.

Second, the evaluation layer: lightweight SDKs embedded in each service that evaluate flags locally. The SDKs pull flag definitions from the central service on startup and receive updates via streaming (SSE or WebSockets), not by polling. Evaluations happen in-memory with no network call per flag check, so the latency impact is near zero.

Third, the data plane: all flag evaluations emit events to a pipeline for analytics, audit trails, and debugging. You need to know which users saw which flag values and when.

For resilience, the SDKs cache the last known flag state locally. If the flag service goes down, services keep running with the cached values. You also need a bootstrapping strategy for cold starts when the flag service is unreachable; I'd ship a default config baked into each deployment as a fallback.

On the operational side, flag changes should go through the same review process as code changes. A single flag flip can affect all 200 services, so you need change tracking, approval workflows, and the ability to audit who changed what and when.
This question tests system design thinking and whether the candidate has dealt with feature flags beyond a single application. Look for awareness of the latency implications of remote flag evaluation, resilience when the flag service is unavailable, and governance concerns at scale. Strong candidates will talk about caching, local evaluation, and the blast radius of flag changes.
Feature flag service deployment with caching and streaming
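A minimal sketch of the service side, assuming a versioned-snapshot model: SDKs fetch the full flag state on startup, then receive incremental updates over SSE. The class and function names here (`FlagStore`, `sse_event`) are illustrative, not a real library's API.

```python
import json
import threading

class FlagStore:
    """In-memory store behind the management plane's API (sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._flags = {}          # flag key -> definition dict
        self._subscribers = []    # one queue per connected SSE client

    def snapshot(self):
        """Full state for SDK bootstrap (e.g. GET /flags)."""
        with self._lock:
            return {"version": self._version, "flags": dict(self._flags)}

    def update(self, key, definition):
        """Apply a flag change and fan it out to streaming clients."""
        with self._lock:
            self._version += 1
            self._flags[key] = definition
            event = sse_event("flag-updated", {
                "version": self._version,
                "key": key,
                "definition": definition,
            })
        for q in list(self._subscribers):
            q.put(event)  # each SSE handler drains its own queue
        return event

def sse_event(event_type, payload):
    """Format one Server-Sent Events frame as the SDK would receive it."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

store = FlagStore()
frame = store.update("new-checkout", {"enabled": True, "rollout": 25})
# frame is an SSE frame starting with "event: flag-updated"
```

The version number lets an SDK that reconnects after a network blip detect missed updates and re-fetch the snapshot instead of silently serving stale rules.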
SDK with local evaluation and resilient caching
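A sketch of the in-process SDK, assuming flags carry an `enabled` kill switch and an optional percentage `rollout`. The `FlagClient` name and flag schema are assumptions for illustration; the key ideas are local evaluation, a baked-in fallback config, and stable user bucketing.

```python
import hashlib

class FlagClient:
    """Evaluates flags in-memory; no network call per check (sketch)."""

    def __init__(self, baked_in_defaults):
        # Defaults shipped with the deployment: used on a cold start
        # when the flag service is unreachable.
        self._defaults = dict(baked_in_defaults)
        self._flags = dict(baked_in_defaults)  # last known good state

    def sync(self, flags):
        """Called after the startup fetch and on each streamed update."""
        self._flags = dict(flags)

    def on_connection_lost(self):
        """Keep serving cached state; don't fail open or closed blindly."""
        pass  # self._flags already holds the last known values

    def is_enabled(self, key, user_id):
        # Fall back to the baked-in default if the key is unknown.
        flag = self._flags.get(key) or self._defaults.get(key)
        if flag is None or not flag.get("enabled", False):
            return False
        rollout = flag.get("rollout", 100)
        # Stable bucketing: the same user always lands in the same
        # bucket, so a 25% rollout gives a consistent experience.
        digest = hashlib.sha256(f"{key}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100
        return bucket < rollout

client = FlagClient({"new-checkout": {"enabled": True, "rollout": 100}})
client.is_enabled("new-checkout", "user-1")  # evaluated entirely in-memory
```

Hashing on `key:user_id` rather than `user_id` alone keeps rollout buckets independent across flags, so a user in the 10% cohort for one flag isn't automatically in the 10% cohort for every flag.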
Audit trail query for flag change investigation
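A sketch of the audit-trail side, using a hypothetical `flag_changes` table (a real system would populate it from the management plane's change pipeline). The question it answers during an incident: what changed on this flag in the suspect window, and who changed it?

```python
import sqlite3

# Hypothetical audit schema; columns and sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE flag_changes (
        flag_key   TEXT,
        actor      TEXT,
        old_value  TEXT,
        new_value  TEXT,
        changed_at TEXT   -- ISO 8601 UTC timestamps sort lexically
    )
""")
conn.executemany(
    "INSERT INTO flag_changes VALUES (?, ?, ?, ?, ?)",
    [
        ("new-checkout", "alice", "rollout=10", "rollout=50", "2024-05-01T09:00:00Z"),
        ("new-checkout", "bob",   "rollout=50", "rollout=0",  "2024-05-01T09:45:00Z"),
        ("dark-mode",    "carol", "off",        "on",         "2024-05-01T10:00:00Z"),
    ],
)

# All changes to one flag inside the incident window, oldest first.
rows = conn.execute(
    """
    SELECT actor, old_value, new_value, changed_at
    FROM flag_changes
    WHERE flag_key = ?
      AND changed_at BETWEEN ? AND ?
    ORDER BY changed_at
    """,
    ("new-checkout", "2024-05-01T09:30:00Z", "2024-05-01T11:00:00Z"),
).fetchall()
# rows -> [("bob", "rollout=50", "rollout=0", "2024-05-01T09:45:00Z")]
```

Storing the old value alongside the new one is what makes the trail actionable: it tells the on-call engineer exactly what to roll back to.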
- Making a remote API call for every flag evaluation, adding latency to every request in the hot path
- No fallback strategy when the flag service is unavailable, causing all services to fail or default to wrong values
- Treating flag changes as less risky than code changes and skipping review and audit processes
- What happens during a deployment if the feature flag service is completely down?
- How would you prevent a single bad flag change from causing a cascading failure across all 200 services?
- How do you handle feature flags that need to be consistent across multiple services in a single user request?
- Would you build this in-house or use a managed service like LaunchDarkly? What's the tradeoff?