Question

You have proven a single pod-delete works. Leadership now wants chaos as an ongoing practice across dozens of services and several clusters, not a one-off demo. What changes, and what would you do differently at scale?

Accepted Answer

Running a single hand-applied ChaosEngine more often does not make it a program. Four things change.

First, you chain faults into scenarios. Litmus runs chaos experiments as Argo workflows that sequence multiple faults and probes into one run: for example, kill a pod, then add network latency, while a probe asserts that the SLO holds the whole time. ChaosCenter becomes the control plane; in Litmus 3.x you install chaos infrastructure (the delegate agents) into each target cluster and drive runs centrally.

Second, you put the definitions under GitOps. Store the ChaosEngine, workflow, and probe YAML in a repo and deploy it with Argo CD, so chaos is reviewed, versioned, and reproducible instead of kubectl-applied from someone's laptop.

Third, you schedule and gate. Run experiments on a cron with a ChaosSchedule to detect drift continuously, and wire them into your delivery pipeline as a resilience gate, where a drop in probeSuccessPercentage blocks promotion. Reserve structured GameDays for the big multi-team scenarios.

Fourth, you observe. Export chaos metrics to Prometheus, overlay chaos events on your golden-signal Grafana dashboards, and use a promProbe as the auto-halt, so that a burning SLO ends the run and reverts the fault.

What I would do differently at scale is mostly organizational: enforce blast-radius and RBAC policy through admission control so nobody can run with the litmus-admin service account, give each team a namespaced scope, track a resilience score per service over time so you can show leadership the trend, and treat every chaos finding like an incident with an owner and a fix rather than a dashboard nobody reads. The goal is regressions caught automatically, not heroics.
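To make the resilience gate and the per-service resilience score concrete, here is a minimal Python sketch. It assumes a ChaosResult has already been fetched as a dict and that the verdict and probe success percentage sit under status.experimentStatus; check that layout against your Litmus version. The function names, the 100 percent threshold, and the weights are hypothetical choices for illustration, not part of Litmus.

```python
def gate_decision(chaos_result, min_probe_success=100.0):
    """Return True if promotion may proceed, False if the gate should block.

    Assumes the ChaosResult exposes status.experimentStatus.verdict and
    status.experimentStatus.probeSuccessPercentage (an assumption to verify).
    """
    status = chaos_result.get("status", {}).get("experimentStatus", {})
    verdict = status.get("verdict", "")
    # The percentage may be serialized as a string; a missing or
    # unparsable value is treated as 0 so the gate fails closed.
    raw = status.get("probeSuccessPercentage", 0)
    try:
        probe_success = float(raw)
    except (TypeError, ValueError):
        probe_success = 0.0
    return verdict == "Pass" and probe_success >= min_probe_success


def resilience_score(results):
    """Weighted mean of probe success percentages.

    `results` is a list of (percentage, weight) pairs, one per experiment.
    The weights are a hypothetical per-experiment importance rating.
    """
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        return 0.0
    return sum(pct * weight for pct, weight in results) / total_weight


# Example: a run whose probes only passed half the time blocks promotion.
sample = {
    "status": {
        "experimentStatus": {"verdict": "Pass", "probeSuccessPercentage": "50"}
    }
}
blocked = not gate_decision(sample)
```

In a pipeline, the dict would typically come from the cluster (for example, kubectl get chaosresult with JSON output), and the step would exit non-zero when gate_decision returns False. Failing closed on missing data is a deliberate choice: a run that produced no result should not count as a pass.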

From One Experiment to Continuous Chaos at Scale

Also worth your time on this topic

Running Your First Chaos Engineering Experiment with Litmus

Litmus Building Blocks: ChaosEngine vs ChaosExperiment
