Scoping Litmus Safely: RBAC and Blast Radius

senior
advanced
Chaos Engineering
Question

Your security team sees that Litmus can delete pods and inject network faults cluster-wide, and they want it gone. How do you scope Litmus so you can still run chaos in production without handing it the keys to the cluster?

Answer

There are two levers, RBAC and blast radius, and you pull both.

On RBAC: Litmus ships litmus-admin, a cluster-wide service account that is fine for a sandbox and wrong for production. Each ChaosExperiment already declares the verbs and resources its fault needs, so build a least-privilege ServiceAccount per experiment or per team namespace, using a Role and RoleBinding instead of a ClusterRole wherever the fault allows it. Run Litmus in namespaced scope so a blast cannot cross tenant boundaries.

On blast radius: the ChaosEngine's appns and applabel narrow the target set, PODS_AFFECTED_PERC and KILL_COUNT cap how many pods go at once, and for node or infra faults NODE_LABEL fences which nodes can be touched. Sequence multiple faults serially rather than in parallel so you do not compound failures you cannot reason about.

Then layer operational safety on top: run chaos in its own namespace, set resource quotas, attach probes in Continuous mode that fail fast and abort the run when an SLO breaks, and rehearse the whole thing in staging first. And know the kill switch cold: set engineState to stop or delete the ChaosEngine, which reverts the chaos best-effort and tears down the runner.

The pitch back to security is simple: a least-privilege service account, namespaced scope, a bounded blast radius, and an auto-halt probe add up to a smaller, known standing risk than the unknown failure modes you are already shipping to production blind.
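
The auto-halt probe and the kill switch from the answer can be sketched as a ChaosEngine fragment. This is a minimal sketch, not a verified manifest: the engine name, namespace, service account, and health URL are hypothetical, and the probe fields follow the httpProbe schema as commonly documented for recent Litmus releases (older releases take plain integers for the timing fields), so check them against the CRDs installed in your cluster.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete          # hypothetical engine name
  namespace: payments                # hypothetical team namespace
spec:
  # Kill switch: flip "active" to "stop" to abort a running experiment.
  engineState: "active"
  chaosServiceAccount: pod-delete-sa # hypothetical least-privilege account
  appinfo:
    appns: "payments"
    applabel: "app=checkout"         # hypothetical target label
    appkind: "deployment"
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: checkout-health    # hypothetical probe name
            type: httpProbe
            mode: Continuous         # evaluated throughout the chaos window
            httpProbe/inputs:
              url: http://checkout.payments.svc:8080/healthz  # hypothetical SLO endpoint
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 2s       # assumption: duration strings; older versions use integers
              interval: 1s
              retry: 1
              stopOnFailure: true    # abort the experiment as soon as the probe fails
```

To stop a run in flight without editing the manifest, the usual form is `kubectl patch chaosengine checkout-pod-delete -n payments --type merge --patch '{"spec":{"engineState":"stop"}}'`, using the hypothetical names above.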

Why This Matters

This is a senior security and operations conversation, not a feature recall. Listen for least-privilege per experiment, namespaced versus cluster scope, the specific blast-radius env vars, serial sequencing, an auto-halt probe, and the kill switch. The framing for security matters too: a good candidate sells bounded risk, not 'trust me'. The red flag is anyone who shrugs and says give it cluster-admin or just use litmus-admin.

Code Examples

A least-privilege service account scoped to one namespace for pod-delete

yaml
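
A sketch of what this could look like, assuming a hypothetical `payments` namespace and a hypothetical account name `pod-delete-sa`. The rules are a trimmed approximation of the permissions the pod-delete experiment typically needs; treat the permissions declared in the ChaosExperiment for your installed version as the source of truth and adjust accordingly.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa                # hypothetical name
  namespace: payments                # hypothetical team namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                           # namespaced Role, not a ClusterRole
metadata:
  name: pod-delete-sa
  namespace: payments
rules:
  # Create, watch, and delete the experiment and target pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
  # Read logs from the runner and experiment pods
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  # Record events against the ChaosEngine and ChaosResult
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "get", "list", "patch", "update"]
  # Resolve the owner of the target pods
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
    verbs: ["get", "list"]
  # Run and clean up the experiment job
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  # Manage the Litmus custom resources for this run
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: payments
```

Because everything here is a Role and RoleBinding, the account has no permissions outside `payments`, which is the property the security team is asking for.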

Bounding the blast radius in the ChaosEngine

yaml
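
A minimal sketch of a bounded pod-delete run, assuming the same hypothetical `payments` namespace, `app=checkout` label, and `pod-delete-sa` account. The env var names are the ones pod-delete commonly exposes; the values are illustrative, not recommendations.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete          # hypothetical engine name
  namespace: payments                # hypothetical team namespace
spec:
  engineState: "active"
  chaosServiceAccount: pod-delete-sa # hypothetical least-privilege account
  appinfo:
    appns: "payments"                # only this namespace is targeted
    applabel: "app=checkout"         # only pods carrying this label
    appkind: "deployment"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"            # seconds of chaos in total
            - name: CHAOS_INTERVAL
              value: "10"            # seconds between successive deletions
            - name: FORCE
              value: "false"         # graceful delete rather than force delete
            - name: PODS_AFFECTED_PERC
              value: "25"            # cap on the share of matching pods hit
            - name: SEQUENCE
              value: "serial"        # hit targets one at a time, not in parallel
```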
Common Mistakes
  • Reaching for litmus-admin or cluster-admin in production because it is the path of least resistance
  • Never setting blast-radius env vars deliberately, so a copied or loosely chosen PODS_AFFECTED_PERC takes out far more than intended
  • Assuming chaos always reverts cleanly and never testing the abort path or the runner-dies-mid-run case
Follow-up Questions
Interviewers often ask these as follow-up questions
  • How would you actually enforce that nobody runs chaos with litmus-admin, for example with an admission policy or OPA Gatekeeper?
  • How do you isolate two teams running chaos in the same shared cluster so one cannot affect the other?
  • If the chaos-runner pod dies mid-experiment, how do you make sure the injected fault still reverts?
  • How do you stop a blast from cascading into dependencies you did not explicitly target?
Tags
chaos-engineering
litmus
kubernetes
rbac