Feature Flag Architecture at Scale
Your team has 200+ microservices and wants to adopt feature flags across all of them. How would you design the feature flag infrastructure?
I'd use a centralized feature flag service rather than having each team roll their own. The architecture has three layers.

First, the management plane: a central service where teams define flags, set targeting rules, and configure rollout percentages. This is backed by a database and has a UI so non-engineers can manage flags too.

Second, the evaluation layer: lightweight SDKs embedded in each service that evaluate flags locally. The SDKs pull flag definitions from the central service on startup and receive updates via streaming (SSE or WebSockets), not by polling. Evaluations happen in-memory with no network call per flag check, so the latency impact is near zero.

Third, the data plane: all flag evaluations emit events to a pipeline for analytics, audit trails, and debugging. You need to know which users saw which flag values and when.

For resilience, the SDKs cache the last known flag state locally. If the flag service goes down, services keep running with the cached values. You also need a bootstrapping strategy for cold starts when the flag service is unreachable; I'd ship a default config baked into each deployment as a fallback.

On the operational side, flag changes should go through the same review process as code changes. A single flag flip can affect all 200 services, so you need change tracking, approval workflows, and the ability to audit who changed what and when.
This question tests system design thinking and whether the candidate has dealt with feature flags beyond a single application. Look for awareness of the latency implications of remote flag evaluation, resilience when the flag service is unavailable, and governance concerns at scale. Strong candidates will talk about caching, local evaluation, and the blast radius of flag changes.
Feature flag service deployment with caching and streaming
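A minimal sketch of the service side, assuming a versioned-snapshot model: SDKs fetch the full flag state on startup, then receive incremental updates over SSE. The class and function names here (`FlagStore`, `sse_event`) are illustrative, not a real library's API.

```python
import json
import threading

class FlagStore:
    """In-memory store behind the management plane's API (sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._flags = {}          # flag key -> definition dict
        self._subscribers = []    # one queue per connected SSE client

    def snapshot(self):
        """Full state for SDK bootstrap (e.g. GET /flags)."""
        with self._lock:
            return {"version": self._version, "flags": dict(self._flags)}

    def update(self, key, definition):
        """Apply a flag change and fan it out to streaming clients."""
        with self._lock:
            self._version += 1
            self._flags[key] = definition
            event = sse_event("flag-updated", {
                "version": self._version,
                "key": key,
                "definition": definition,
            })
        for q in list(self._subscribers):
            q.put(event)  # each SSE handler drains its own queue
        return event

def sse_event(event_type, payload):
    """Format one Server-Sent Events frame as the SDK would receive it."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

store = FlagStore()
frame = store.update("new-checkout", {"enabled": True, "rollout": 25})
# frame is an SSE frame starting with "event: flag-updated"
```

The version number lets an SDK that reconnects after a network blip detect missed updates and re-fetch the snapshot instead of silently serving stale rules.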
SDK with local evaluation and resilient caching
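A sketch of the in-process SDK, assuming flags carry an `enabled` kill switch and an optional percentage `rollout`. The `FlagClient` name and flag schema are assumptions for illustration; the key ideas are local evaluation, a baked-in fallback config, and stable user bucketing.

```python
import hashlib

class FlagClient:
    """Evaluates flags in-memory; no network call per check (sketch)."""

    def __init__(self, baked_in_defaults):
        # Defaults shipped with the deployment: used on a cold start
        # when the flag service is unreachable.
        self._defaults = dict(baked_in_defaults)
        self._flags = dict(baked_in_defaults)  # last known good state

    def sync(self, flags):
        """Called after the startup fetch and on each streamed update."""
        self._flags = dict(flags)

    def on_connection_lost(self):
        """Keep serving cached state; don't fail open or closed blindly."""
        pass  # self._flags already holds the last known values

    def is_enabled(self, key, user_id):
        # Fall back to the baked-in default if the key is unknown.
        flag = self._flags.get(key) or self._defaults.get(key)
        if flag is None or not flag.get("enabled", False):
            return False
        rollout = flag.get("rollout", 100)
        # Stable bucketing: the same user always lands in the same
        # bucket, so a 25% rollout gives a consistent experience.
        digest = hashlib.sha256(f"{key}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100
        return bucket < rollout

client = FlagClient({"new-checkout": {"enabled": True, "rollout": 100}})
client.is_enabled("new-checkout", "user-1")  # evaluated entirely in-memory
```

Hashing on `key:user_id` rather than `user_id` alone keeps rollout buckets independent across flags, so a user in the 10% cohort for one flag isn't automatically in the 10% cohort for every flag.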
Audit trail query for flag change investigation
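A sketch of the audit-trail side, using a hypothetical `flag_changes` table (a real system would populate it from the management plane's change pipeline). The question it answers during an incident: what changed on this flag in the suspect window, and who changed it?

```python
import sqlite3

# Hypothetical audit schema; columns and sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE flag_changes (
        flag_key   TEXT,
        actor      TEXT,
        old_value  TEXT,
        new_value  TEXT,
        changed_at TEXT   -- ISO 8601 UTC timestamps sort lexically
    )
""")
conn.executemany(
    "INSERT INTO flag_changes VALUES (?, ?, ?, ?, ?)",
    [
        ("new-checkout", "alice", "rollout=10", "rollout=50", "2024-05-01T09:00:00Z"),
        ("new-checkout", "bob",   "rollout=50", "rollout=0",  "2024-05-01T09:45:00Z"),
        ("dark-mode",    "carol", "off",        "on",         "2024-05-01T10:00:00Z"),
    ],
)

# All changes to one flag inside the incident window, oldest first.
rows = conn.execute(
    """
    SELECT actor, old_value, new_value, changed_at
    FROM flag_changes
    WHERE flag_key = ?
      AND changed_at BETWEEN ? AND ?
    ORDER BY changed_at
    """,
    ("new-checkout", "2024-05-01T09:30:00Z", "2024-05-01T11:00:00Z"),
).fetchall()
# rows -> [("bob", "rollout=50", "rollout=0", "2024-05-01T09:45:00Z")]
```

Storing the old value alongside the new one is what makes the trail actionable: it tells the on-call engineer exactly what to roll back to.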
- Making a remote API call for every flag evaluation, adding latency to every request in the hot path
- No fallback strategy when the flag service is unavailable, causing all services to fail or default to wrong values
- Treating flag changes as less risky than code changes and skipping review and audit processes
- What happens during a deployment if the feature flag service is completely down?
- How would you prevent a single bad flag change from causing a cascading failure across all 200 services?
- How do you handle feature flags that need to be consistent across multiple services in a single user request?
- Would you build this in-house or use a managed service like LaunchDarkly? What's the tradeoff?