Running Your First Pod-Delete Experiment Safely
I hand you a fresh cluster with a demo nginx deployment. Take me from nothing to a controlled pod-delete experiment. What are the steps, and how do you keep it from turning into an outage?
Six steps:
1. Install Litmus, which gives you the chaos-operator and the CRDs. The helm chart or the operator manifest is fine.
2. Install the pod-delete ChaosExperiment from the ChaosHub into the target namespace.
3. Set up RBAC: a dedicated ServiceAccount with a Role and RoleBinding scoped to that namespace, granting only what pod-delete needs. Do not reach for the bundled litmus-admin here.
4. Create a ChaosEngine pointing at the target with appns=default, applabel=app=nginx, appkind=deployment, and the service account from step three.
5. Keep the blast radius small in that ChaosEngine: set PODS_AFFECTED_PERC to 50 or target a single pod, use a short TOTAL_CHAOS_DURATION such as 30 seconds, and set FORCE=false so you mimic a graceful eviction instead of a hard kill.
6. Watch it: tail the runner and experiment pod logs, watch the ChaosResult verdict, and keep an eye on the app's availability the whole time.
The safety rules that matter: start in staging; target a Deployment so the ReplicaSet actually reschedules the pod (pod-delete against a bare pod just leaves you down); kill one replica at a time; have a probe, or at least a live dashboard, so you observe steady state rather than assume it; and know the abort before you start: kubectl delete chaosengine, or set engineState to stop.
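The RBAC step is the one people skip, so here is a minimal sketch of what a namespace-scoped setup could look like. The name pod-delete-sa and the file name are illustrative assumptions, and the rule list is a trimmed approximation of what pod-delete tends to need; the exact set varies by Litmus version, so compare it against the RBAC manifest published with the experiment you installed.

```shell
# Writes a namespaced ServiceAccount + Role + RoleBinding for the pod-delete run.
# "pod-delete-sa" is a hypothetical name; nothing here is cluster-scoped.
cat > pod-delete-rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  # Create and delete the experiment pods and the target pods.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
  # Read runner and experiment logs.
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  # Emit events against the engine and result.
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "get", "list", "patch", "update"]
  # Resolve the controller that owns the target pod.
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  # The runner launches the experiment as a Job.
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  # Read the engine and experiment, write the result.
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: default
EOF
echo "wrote pod-delete-rbac.yaml; apply with: kubectl apply -f pod-delete-rbac.yaml"
```

Using a Role rather than a ClusterRole is the point: if the runner still reports 'forbidden', add the specific missing rule instead of widening the binding.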
This is the practical 'have you done this with your own hands' question. Listen for ordered steps and especially the RBAC step, which beginners skip and then spend an hour confused about why the runner gets 'forbidden' listing pods. Good candidates mention blast-radius tunables, the FORCE semantics, targeting a controller-backed pod, and the abort path. Someone who says 'apply the ChaosEngine and watch it' without the service account or blast radius has not run this on anything they cared about.
Install Litmus and the pod-delete experiment, then watch the run
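A hedged sketch of the manifest route, wrapped in small shell functions so each step can be run on its own. OPERATOR_MANIFEST and EXPERIMENT_MANIFEST are placeholders you must set yourself (take the operator manifest from the Litmus install docs and the pod-delete experiment from ChaosHub for your version); nginx-chaos is a hypothetical ChaosEngine name. The helm route is not shown.

```shell
NS="${NS:-default}"
ENGINE="${ENGINE:-nginx-chaos}"   # hypothetical ChaosEngine name

install_litmus() {
  # Fail early if the caller has not supplied the two manifests.
  : "${OPERATOR_MANIFEST:?set to the operator manifest URL or path}"
  : "${EXPERIMENT_MANIFEST:?set to the pod-delete experiment URL or path}"
  # Operator and CRDs first, then the experiment into the target namespace.
  kubectl apply -f "$OPERATOR_MANIFEST"
  kubectl apply -f "$EXPERIMENT_MANIFEST" -n "$NS"
  # Confirm the ChaosExperiment landed where the engine will look for it.
  kubectl get chaosexperiments -n "$NS"
}

watch_run() {
  # The runner pod is conventionally named <engine>-runner.
  # Use "kubectl get pods -n $NS -w" in another terminal to spot the experiment pod.
  kubectl logs -f "${ENGINE}-runner" -n "$NS"
}

read_verdict() {
  # The ChaosResult is named <engine>-<experiment>.
  kubectl get chaosresult "${ENGINE}-pod-delete" -n "$NS" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}
```

Typical use: export the two manifest variables, call install_litmus once, then watch_run and read_verdict after the ChaosEngine has been applied.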
ChaosEngine with a small, graceful blast radius
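One way the engine could look, using the values from the answer: default namespace, app=nginx, a Deployment target, 30 seconds of chaos, half the pods, graceful deletion. The engine name nginx-chaos and the service account name pod-delete-sa are assumptions for illustration, and the accepted fields differ between Litmus releases, so check the ChaosEngine schema for the version you installed before applying.

```shell
# Writes a ChaosEngine manifest with a deliberately small blast radius.
# "nginx-chaos" and "pod-delete-sa" are hypothetical names.
cat > nginx-chaos-engine.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # "active" starts the run; "stop" aborts it.
  engineState: "active"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  # The dedicated, namespace-scoped service account, not litmus-admin.
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total length of the chaos window, in seconds.
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            # Share of matching pods to delete.
            - name: PODS_AFFECTED_PERC
              value: "50"
            # false: respect the grace period instead of force-deleting.
            - name: FORCE
              value: "false"
EOF
echo "wrote nginx-chaos-engine.yaml; apply with: kubectl apply -f nginx-chaos-engine.yaml"
```

With two nginx replicas, 50 percent works out to one pod per deletion, which matches the one-replica-at-a-time rule.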
The abort switch, before you start
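Both abort paths from the answer, sketched as two functions you can keep ready in a terminal before the run starts. The engine name nginx-chaos is a hypothetical placeholder; substitute your own.

```shell
NS="${NS:-default}"
ENGINE="${ENGINE:-nginx-chaos}"   # hypothetical ChaosEngine name

abort_by_stop() {
  # Flip engineState to stop; the ChaosEngine object itself remains.
  kubectl patch chaosengine "$ENGINE" -n "$NS" --type merge \
    -p '{"spec":{"engineState":"stop"}}'
}

abort_by_delete() {
  # Remove the engine outright.
  kubectl delete chaosengine "$ENGINE" -n "$NS"
}
```

Stopping keeps the engine around so you can inspect it afterwards; deleting is the blunter option when you just want the run gone.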
Common mistakes
- Skipping the RBAC service account, hitting 'forbidden' errors from the runner, then papering over it with cluster-admin
- Leaving FORCE at a value that hard-kills pods, so the test never exercises graceful shutdown the way a real eviction would
- Running against a bare pod with no controller, so nothing reschedules and the 'experiment' is just an outage
Follow-up questions
- What does FORCE=true actually change, and when does that make your experiment unrealistic or dangerous?
- What happens if you run pod-delete against a standalone pod with no Deployment or ReplicaSet behind it?
- How would you target one specific pod by name instead of a percentage of the deployment?
- Where exactly do you read whether the run passed, and what does Awaited mean?