Question

Your pod-delete experiment shows Pass, but during the run users got 502s for about 20 seconds. How can the experiment pass while the service was actually down, and how do you fix that?

Accepted Answer

It passes because, with no probe attached, the verdict mostly checks that the fault was injected and that the target recovered to a healthy pod count. It never checks whether your service kept serving traffic. A bare ChaosEngine gives you a green result that means almost nothing.

The fix is probes, which are how you encode the steady-state hypothesis and feed probeSuccessPercentage into the verdict. Litmus has four probe types: httpProbe hits an endpoint and asserts on the status or response, cmdProbe runs a command and compares its output, k8sProbe asserts on the state of a Kubernetes resource, and promProbe runs a PromQL query and asserts on the result (for example, that the error rate stays under a threshold).

But the type matters less than the mode. SOT and EOT run the probe once, at the start or end of the test. Edge runs it at both ends. Continuous polls throughout the chaos window at probePollingInterval, and that is the mode that catches the 20 seconds of 502s, because it samples while the failure is happening.

So add an httpProbe in Continuous mode against the service, with a tight polling interval and a sane retry budget. If availability dips during chaos, the probe fails, probeSuccessPercentage drops, and the ChaosResult verdict becomes Fail. Now Pass actually means that the failure was injected and the service stayed up. The probe is the experiment's assertion. Without one you are just deleting pods and hoping.
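To make the point about modes concrete, here is a small Python simulation. It is a simplified model, not Litmus code: the 60-second run, the outage between seconds 20 and 40, the 2-second polling interval, and every function name are assumptions made for illustration, and retries are ignored. It only models when each mode samples and treats a probe as failed if any sample is not a 200.

```python
from typing import Callable, List


def sample_times(mode: str, start: int, end: int, polling_interval: int) -> List[int]:
    """Seconds at which a probe in the given mode samples (simplified model)."""
    if mode == "SOT":
        return [start]
    if mode == "EOT":
        return [end]
    if mode == "Edge":
        return [start, end]
    if mode == "Continuous":
        return list(range(start, end + 1, polling_interval))
    raise ValueError(f"unknown mode: {mode}")


def service_status(t: int) -> int:
    """Hypothetical timeline: 502s from t=20 up to t=40, otherwise 200."""
    return 502 if 20 <= t < 40 else 200


def probe_passes(
    mode: str,
    check: Callable[[int], int],
    start: int = 0,
    end: int = 60,
    polling_interval: int = 2,
) -> bool:
    """A probe passes only if every sample it takes returns 200."""
    times = sample_times(mode, start, end, polling_interval)
    return all(check(t) == 200 for t in times)


def verdict(probe_results: List[bool]) -> str:
    """Assumed rule: Pass only when 100% of the attached probes passed."""
    success_pct = 100 * sum(probe_results) // len(probe_results)
    return "Pass" if success_pct == 100 else "Fail"


if __name__ == "__main__":
    for mode in ("SOT", "EOT", "Edge", "Continuous"):
        print(mode, verdict([probe_passes(mode, service_status)]))
```

In this model SOT, EOT, and Edge all report Pass, because they only sample at seconds 0 and 60, when the service is healthy. Only Continuous samples inside the outage and reports Fail.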

How Litmus Decides Pass or Fail: Probes

Also worth your time on this topic

Running Your First Chaos Engineering Experiment with Litmus

Litmus Building Blocks: ChaosEngine vs ChaosExperiment