From One Experiment to Continuous Chaos at Scale
You have proven a single pod-delete works. Leadership now wants chaos as an ongoing practice across dozens of services and several clusters, not a one-off demo. What changes, and what would you do differently at scale?
A single hand-applied ChaosEngine does not become a program by running it more often. Four things change.
First, you chain faults into scenarios. Litmus drives Argo-workflow-based chaos experiments that sequence multiple faults plus probes into one run, for example kill a pod, then add network latency, while a probe asserts the SLO holds the whole time. ChaosCenter becomes the control plane, and in Litmus 3.x you install chaos infrastructure (the delegate agents) into each target cluster and drive runs centrally.
Second, you put the definitions under GitOps. Store the ChaosEngine, workflow, and probe YAML in a repo and deploy it with Argo CD, so chaos is reviewed, versioned, and reproducible instead of kubectl-applied from someone's laptop.
Third, you schedule and gate. Run experiments on a cron with a ChaosSchedule for steady drift detection, and wire them into your delivery pipeline as a resilience gate where a dropped probeSuccessPercentage blocks promotion. Reserve structured GameDays for the big multi-team scenarios.
Fourth, you observe. Export chaos metrics to Prometheus, overlay chaos events on your golden-signal Grafana dashboards, and make a promProbe the auto-halt so a burning SLO ends the run and reverts.
What I would do differently at scale is mostly organizational: enforce blast-radius and RBAC policy through admission control so nobody can run with litmus-admin, give each team a namespaced scope, track a resilience score per service over time so you can show leadership the curve, and treat every chaos finding like an incident with an owner and a fix rather than a dashboard nobody reads. The goal is regressions caught automatically, not heroics.
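The GitOps step in the answer could look roughly like the Argo CD Application below. This is a minimal sketch, not a recommended layout: the repository URL, path, application name, and target namespace are all hypothetical placeholders, and it assumes Argo CD is installed in the `argocd` namespace of the same cluster that hosts the chaos resources.

```yaml
# Sketch only: every name, URL, and path here is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-definitions            # hypothetical application name
  namespace: argocd                  # assumes Argo CD runs in this namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/chaos-definitions.git  # hypothetical repo
    targetRevision: main
    path: services/checkout          # hypothetical directory of chaos YAML
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: checkout              # hypothetical team namespace
  syncPolicy:
    automated:
      prune: true                    # remove chaos resources that are deleted from Git
```

With a setup like this, a change to a fault or probe goes through a pull request before it reaches a cluster, which is the review and versioning property the answer is after.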
This is a senior systems-thinking question. The jump you are listening for is from a custom resource to a practice: workflows that chain faults, GitOps for versioning and review, scheduling, CD gating, observability with auto-halt, and the org mechanics that keep it alive. The 'what would you do differently at scale' part should surface policy enforcement and treating findings as owned work. A weak answer just says 'run it more and put it on a schedule' with no governance, no observability, and no story for who acts on the results.
Recurring chaos with a ChaosSchedule, fenced to work hours and weekdays
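A minimal sketch of such a schedule follows. The resource name, namespace, labels, and service account are hypothetical, and the field layout follows the ChaosSchedule examples in the Litmus documentation as I understand them; the exact shape of `minChaosInterval` has changed between releases (older ones accept a string such as "2h"), so check it against the CRD version installed in your cluster.

```yaml
# Sketch only: names, namespace, and labels are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: checkout-pod-delete-schedule   # hypothetical
  namespace: checkout                  # hypothetical team namespace
spec:
  schedule:
    repeat:
      properties:
        # Minimum gap between two runs (assumed object form; verify for your version).
        minChaosInterval:
          hour:
            everyNthHour: 2
      workHours:
        # Only fire inside this hour range; confirm which timezone the operator uses.
        includedHours: 9-16
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri"
  # Template for the ChaosEngine that each scheduled run creates.
  engineTemplateSpec:
    engineState: active
    appinfo:
      appns: checkout
      applabel: app=checkout
      appkind: deployment
    chaosServiceAccount: pod-delete-sa   # hypothetical namespaced service account
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: "30"
              - name: PODS_AFFECTED_PERC
                value: "25"
```

The work-hours fence only limits when chaos can start. It does not know about incidents or change freezes, so pausing the schedule during those (for example by setting its `scheduleState` to `halt`, if your version supports that field) still has to be wired up separately.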
A Prometheus probe used as an auto-halt SLO gate
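One way this could be expressed is the fragment below, which would sit under `spec.experiments[].spec` of a ChaosEngine. Treat it as a sketch under assumptions: the probe name, Prometheus endpoint, metric names, and the 1% threshold are hypothetical, and the duration fields in `runProperties` are written as strings with units, whereas some older Litmus releases expect plain integers.

```yaml
# Fragment of a ChaosEngine experiment spec, not a complete manifest.
probe:
  - name: checkout-error-rate-slo      # hypothetical probe name
    type: promProbe
    mode: Continuous                   # evaluate repeatedly while the fault is active
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090   # hypothetical endpoint
      # Ratio of 5xx responses over the last minute; "or vector(0)" keeps the
      # query from returning no data when there are no errors at all.
      query: >-
        (sum(rate(http_requests_total{job="checkout",code=~"5.."}[1m])) or vector(0))
        / sum(rate(http_requests_total{job="checkout"}[1m]))
      comparator:
        criteria: "<="
        value: "0.01"                  # assumed SLO: at most 1% errors
    runProperties:
      probeTimeout: 5s
      interval: 10s
      retry: 1
      stopOnFailure: true              # end the experiment as soon as the probe fails
```

The `stopOnFailure` flag is what turns the probe from a pass/fail record into a halt condition. A failed probe also lowers the run's probeSuccessPercentage, which is the value a delivery-pipeline gate would read.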
- Treating chaos as a one-off demo instead of a versioned, scheduled, and owned practice
- Scheduling chaos with no freeze or incident awareness, so it can fire in the middle of a real outage
- Collecting chaos results that nobody acts on, with no resilience score and no owner for findings
- How do you keep scheduled chaos from firing during a real incident or a change freeze?
- In a shared cluster, how do you prevent one team's chaos from spilling into another team's services?
- How do you actually measure whether the chaos program is improving reliability, rather than just generating activity?
- Build versus adopt: how would you choose between Litmus, Chaos Mesh, and a managed offering for a program this size?