Scoping Litmus Safely: RBAC and Blast Radius
Your security team sees that Litmus can delete pods and inject network faults cluster-wide, and they want it gone. How do you scope Litmus so you can still run chaos in production without handing it the keys to the cluster?
There are two levers, RBAC and blast radius, and you pull both.
On RBAC: Litmus ships litmus-admin, a service account with cluster-wide permissions that is fine for a sandbox and wrong for production. Each ChaosExperiment already declares the verbs and resources its fault needs, so build a least-privilege ServiceAccount per experiment or per team namespace, using a Role and RoleBinding instead of a ClusterRole wherever the fault allows it. Run Litmus in namespaced scope so a blast cannot cross tenant boundaries.
On blast radius: in the ChaosEngine, appinfo.appns and appinfo.applabel narrow the target set; PODS_AFFECTED_PERC (and KILL_COUNT, where the experiment version supports it) caps how many pods go at once; and for node or infra faults, NODE_LABEL fences which nodes can be touched. Sequence multiple faults serially rather than in parallel so you do not compound failures you cannot reason about.
Then layer operational safety on top: run chaos in its own namespace, set resource quotas, attach probes in Continuous mode that fail fast and abort the run when an SLO breaks, and rehearse the whole thing in staging first. And know the kill switch cold: set engineState to stop, or delete the ChaosEngine. Either one reverts the chaos on a best-effort basis and tears down the runner.
The pitch back to security is simple. A least-privilege service account, namespaced scope, a bounded blast radius, and an auto-halt probe add up to a smaller, known standing risk than the unknown failure modes you are already shipping to production blind.
This is a senior security and operations conversation, not a feature recall. Listen for least-privilege per experiment, namespaced versus cluster scope, the specific blast-radius env vars, serial sequencing, an auto-halt probe, and the kill switch. The framing for security matters too: a good candidate sells bounded risk, not 'trust me'. The red flag is anyone who shrugs and says give it cluster-admin or just use litmus-admin.
A least-privilege service account scoped to one namespace for pod-delete
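A minimal sketch of what such a manifest could look like, assuming a tenant namespace called team-a and a service account called pod-delete-sa (both names are hypothetical). The rules are trimmed from the shape of the upstream pod-delete RBAC sample, so treat them as a starting point and check them against the permissions declared by the ChaosExperiment version you actually have installed.

```yaml
# Sketch only: the names and namespace (pod-delete-sa, team-a) are hypothetical.
# The rule set is an assumption modeled on the upstream pod-delete sample;
# verify it against the ChaosExperiment installed in your cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: team-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role            # namespaced Role, not a ClusterRole
metadata:
  name: pod-delete-sa
  namespace: team-a
rules:
  # Create the experiment pod and delete the target pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
  # Record chaos events
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "get", "list", "patch", "update"]
  # Read logs from the runner and experiment pods
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  # Resolve the owner of each target pod
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
    verbs: ["get", "list"]
  # Let the chaos-runner manage the experiment job
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  # Read and update the Litmus custom resources
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: team-a
```

Because everything here is a Role and RoleBinding in one namespace, the service account has no permissions outside team-a, which is the property the security team is asking for.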
Bounding the blast radius in the ChaosEngine
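A sketch of a ChaosEngine that applies these limits to pod-delete, assuming the same hypothetical team-a namespace, a Deployment labeled app=checkout, and the pod-delete-sa service account. The values are illustrative rather than recommendations, and env var support varies by experiment version, so confirm each one against your installed ChaosExperiment.

```yaml
# Sketch only: engine name, namespace, label, and values are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: team-a
spec:
  # Kill switch: change this to "stop" to abort the run
  engineState: "active"
  # Least-privilege service account instead of litmus-admin
  chaosServiceAccount: pod-delete-sa
  appinfo:
    appns: "team-a"             # only this namespace
    applabel: "app=checkout"    # only pods carrying this label
    appkind: "deployment"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total length of the chaos window, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            # Seconds between successive pod deletions
            - name: CHAOS_INTERVAL
              value: "10"
            # Cap the share of matching pods that can be hit
            - name: PODS_AFFECTED_PERC
              value: "25"
            # Hit the selected pods one at a time, not all at once
            - name: SEQUENCE
              value: "serial"
```

Note that SEQUENCE controls the order in which the target pods of this one fault are hit. Running separate faults one after another is a matter of how the engines or workflow steps are scheduled. A Continuous-mode probe configured to stop on failure would sit under the same experiment spec, but its exact fields depend on the Litmus version in use.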
Common mistakes
- Reaching for litmus-admin or cluster-admin in production because it is the path of least resistance
- Not reviewing blast-radius settings before a run, so a broad applabel or a copied PODS_AFFECTED_PERC value takes out far more than intended
- Assuming chaos always reverts cleanly and never testing the abort path or the runner-dies-mid-run case
Follow-up questions
- How would you actually enforce that nobody runs chaos with litmus-admin, for example with an admission policy or OPA Gatekeeper?
- How do you isolate two teams running chaos in the same shared cluster so one cannot affect the other?
- If the chaos-runner pod dies mid-experiment, how do you make sure the injected fault still reverts?
- How do you stop a blast from cascading into dependencies you did not explicitly target?