Running Your First Pod-Delete Experiment Safely

Level: mid / intermediate
Category: Chaos Engineering
Question

I hand you a fresh cluster with a demo nginx deployment. Take me from nothing to a controlled pod-delete experiment. What are the steps, and how do you keep it from turning into an outage?

Answer

Six steps. First, install Litmus, which gives you the chaos-operator and the CRDs; either the Helm chart or the operator manifest is fine. Second, install the pod-delete ChaosExperiment from the ChaosHub into the target namespace. Third, set up RBAC: a dedicated ServiceAccount with a Role and RoleBinding scoped to that namespace, granting only what pod-delete needs. Do not reach for the bundled litmus-admin here.

Fourth, create a ChaosEngine pointing at the target with appns=default, applabel=app=nginx, appkind=deployment, and the service account from step three. Fifth, keep the blast radius small in that ChaosEngine: set PODS_AFFECTED_PERC to 50 or target a single pod, use a short TOTAL_CHAOS_DURATION such as 30 seconds, and set FORCE=false so you mimic a graceful eviction instead of a hard kill. Sixth, watch it: tail the runner and experiment pod logs, watch the ChaosResult verdict, and keep an eye on the app's availability the whole time.

The safety rules that matter: start in staging; target a Deployment so the ReplicaSet actually creates a replacement pod (pod-delete against a bare pod just leaves you down); kill one replica at a time; have a probe or at least a live dashboard so you observe steady state rather than assume it; and know the abort before you start: kubectl delete chaosengine, or set engineState to stop.
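The RBAC in step three might look like the sketch below. The name pod-delete-sa is hypothetical, and the rule set is a trimmed approximation of what Litmus publishes for pod-delete. The exact permissions vary by release, so compare it against the rbac.yaml shipped with the experiment for your version before relying on it.

```yaml
# Sketch only: pod-delete-sa is a hypothetical name and the rules are an
# approximation. Compare with the rbac.yaml published for your Litmus release.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role            # namespaced Role, not a ClusterRole
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  # Find and delete target pods; create the experiment pod
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update"]
  # Read runner and experiment pod logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  # Record chaos events
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "get", "list", "patch", "update"]
  # Resolve the controller that owns the target pods
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  # The runner launches the experiment as a Job
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "deletecollection", "get", "list"]
  # Read and update the chaos custom resources
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: default
```

A quick way to confirm the binding took effect before running anything is kubectl auth can-i list pods -n default --as=system:serviceaccount:default:pod-delete-sa, which should answer yes.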

Why This Matters

This is the practical 'have you done this with your own hands' question. Listen for ordered steps and especially the RBAC step, which beginners skip and then spend an hour confused about why the runner gets 'forbidden' listing pods. Good candidates mention blast-radius tunables, the FORCE semantics, targeting a controller-backed pod, and the abort path. Someone who says 'apply the ChaosEngine and watch it' without the service account or blast radius has not run this on anything they cared about.

Code Examples

Install Litmus and the pod-delete experiment, then watch the run

bash
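A minimal sketch of the install-and-watch flow. It assumes kubectl already points at the staging cluster, uses a hypothetical engine name (nginx-chaos), and deliberately leaves the two manifest locations unset because they are version-specific: take them from the Litmus install docs and the pod-delete page on ChaosHub for your release. The block only defines helper functions, so nothing touches the cluster until you call one.

```bash
#!/usr/bin/env bash
# Sketch only. Defines helpers; nothing runs against the cluster until called.
set -eu

APP_NS="default"            # namespace of the nginx demo
ENGINE_NAME="nginx-chaos"   # hypothetical ChaosEngine name
# Version-specific manifests, intentionally left for you to supply.
OPERATOR_MANIFEST="${OPERATOR_MANIFEST:-}"
EXPERIMENT_MANIFEST="${EXPERIMENT_MANIFEST:-}"

install_litmus() {
  : "${OPERATOR_MANIFEST:?set OPERATOR_MANIFEST first}"
  kubectl apply -f "$OPERATOR_MANIFEST"
  # Wait until the ChaosEngine CRD is registered before creating engines.
  kubectl wait --for=condition=Established \
    crd/chaosengines.litmuschaos.io --timeout=60s
}

install_experiment() {
  : "${EXPERIMENT_MANIFEST:?set EXPERIMENT_MANIFEST first}"
  kubectl apply -n "$APP_NS" -f "$EXPERIMENT_MANIFEST"
  kubectl get chaosexperiments -n "$APP_NS"
}

# ChaosResult objects are named <engine>-<experiment>.
result_name() { printf '%s-pod-delete\n' "$ENGINE_NAME"; }

# The runner pod is named <engine>-runner.
runner_logs() { kubectl logs -f "${ENGINE_NAME}-runner" -n "$APP_NS"; }

# Print the current verdict from the ChaosResult.
verdict() {
  kubectl get chaosresult "$(result_name)" -n "$APP_NS" \
    -o jsonpath='{.status.experimentStatus.verdict}{"\n"}'
}

# Watch the target pods get deleted and replaced.
watch_app() { kubectl get pods -n "$APP_NS" -l app=nginx -w; }
```

In practice you would call install_litmus and install_experiment once, apply the RBAC and the ChaosEngine, then keep runner_logs and watch_app open in separate terminals and poll verdict until it moves off Awaited.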

ChaosEngine with a small, graceful blast radius

yaml
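A sketch of a ChaosEngine matching steps four and five. The engine name (nginx-chaos), the service account name (pod-delete-sa), and the CHAOS_INTERVAL value are assumptions for illustration; the appinfo fields and the other tunables come straight from the answer above.

```yaml
# Sketch only: nginx-chaos and pod-delete-sa are hypothetical names.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: 'active'          # set to 'stop' to abort a running experiment
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa
  # Some older Litmus releases also require the target to be annotated for chaos
  # unless annotationCheck is disabled; check the ChaosEngine docs for your version.
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Short run: 30 seconds of chaos in total
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Gap between successive deletions (assumed value)
            - name: CHAOS_INTERVAL
              value: '10'
            # false = honor the termination grace period instead of a hard kill
            - name: FORCE
              value: 'false'
            # Affect at most half of the matching pods
            - name: PODS_AFFECTED_PERC
              value: '50'
```

With two nginx replicas, 50 percent means one pod is deleted per interval while the other keeps serving, which is the "one replica at a time" rule expressed as configuration.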

The abort switch, before you start

bash
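A sketch of the two abort paths named in the answer, plus a recovery check. The engine name (nginx-chaos) and the deployment name (nginx) are assumptions. As above, the block only defines functions, so it is safe to load into a shell ahead of time and call when needed.

```bash
#!/usr/bin/env bash
# Sketch only. Load these before starting the experiment, not after it goes wrong.
set -eu

APP_NS="default"
ENGINE_NAME="nginx-chaos"   # hypothetical ChaosEngine name

# Merge patch that flips the engine to the stopped state.
stop_patch() { printf '%s\n' '{"spec":{"engineState":"stop"}}'; }

# Abort path 1: stop the engine but keep the object around for inspection.
abort_chaos() {
  kubectl patch chaosengine "$ENGINE_NAME" -n "$APP_NS" \
    --type merge --patch "$(stop_patch)"
}

# Abort path 2: delete the engine outright.
delete_engine() {
  kubectl delete chaosengine "$ENGINE_NAME" -n "$APP_NS"
}

# After aborting, confirm the Deployment is back to full strength.
# Assumes the Deployment is named nginx.
confirm_recovered() {
  kubectl rollout status deployment/nginx -n "$APP_NS" --timeout=120s
}
```

Rehearsing abort_chaos once against an idle engine in staging is a cheap way to make sure the command and your permissions work before you depend on them.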
Common Mistakes
  • Skipping the RBAC service account, hitting 'forbidden' errors from the runner, then papering over it with cluster-admin
  • Leaving FORCE at a value that hard-kills pods, so the test never exercises graceful shutdown the way a real eviction would
  • Running against a bare pod with no controller, so nothing reschedules and the 'experiment' is just an outage
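One way to catch the bare-pod mistake before the run is a small pre-flight check on the target's owner references. This is a sketch, not part of Litmus: owner_kinds and has_owner are hypothetical helper names, and the label selector is whatever you put in applabel.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical pre-flight helpers, not a Litmus feature.
set -eu

# Print each matching pod with the kind(s) of its owner.
# An empty second column means a bare pod with no controller.
# usage: owner_kinds <namespace> <label-selector>
owner_kinds() {
  kubectl get pods -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[*].kind}{"\n"}{end}'
}

# Check one "name<TAB>kinds" line: succeeds only if an owner kind is present.
has_owner() {
  kinds=$(printf '%s' "$1" | cut -s -f2)
  [ -n "$kinds" ]
}
```

For a Deployment-managed pod, owner_kinds default app=nginx should list ReplicaSet in the second column; if any line fails has_owner, stop and fix the target before creating the ChaosEngine.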
Follow-up Questions
Interviewers often ask these as follow-up questions
  • What does FORCE=true actually change, and when does that make your experiment unrealistic or dangerous?
  • What happens if you run pod-delete against a standalone pod with no Deployment or ReplicaSet behind it?
  • How would you target one specific pod by name instead of a percentage of the deployment?
  • Where exactly do you read whether the run passed, and what does Awaited mean?
Tags
chaos-engineering
litmus
kubernetes
resilience