How Litmus Decides Pass or Fail: Probes
Your pod-delete experiment shows Pass, but during the run users got 502s for about 20 seconds. How can the experiment pass while the service was actually down, and how do you fix that?
It passes because, with no probe attached, the verdict mostly checks that the fault was injected and that the target recovered to a healthy pod count. Nothing checked whether your service kept serving traffic, so a bare ChaosEngine gives you a green verdict that means almost nothing.
The fix is probes. A probe encodes your steady-state hypothesis, and its result feeds probeSuccessPercentage and the verdict. Litmus has four probe types: httpProbe hits an endpoint and asserts on the status or response, cmdProbe runs a command and compares its output, k8sProbe asserts on the state of a Kubernetes resource, and promProbe runs a PromQL query and asserts on the result (for example, that the error rate stays under a threshold).
The type matters less than the mode. SOT and EOT run the probe once, at the start or at the end of the test. Edge runs it at both ends. Continuous polls throughout the chaos window at probePollingInterval, and that is the mode that catches the 20 seconds of 502s, because it samples while the failure is happening.
So add an httpProbe in Continuous mode against the service, with a tight polling interval and a sane retry budget. If availability dips during chaos, the probe fails, probeSuccessPercentage drops, and the ChaosResult verdict becomes Fail. Now Pass actually means that the failure was injected and the service stayed up. The probe is the experiment's assertion. Without one you are just deleting pods and hoping.
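The difference between the modes is easiest to see on a timeline. The following is a toy simulation, not Litmus code: it assumes a 60-second experiment with 502s between seconds 20 and 40, and models each mode purely by when it samples. The names `OUTAGE`, `sample_times`, and `probe_passes` are invented for illustration.

```python
# Toy model of probe modes. Assumed timeline: the experiment runs from
# t=0 to t=60 seconds and the service returns 502s for 20 <= t < 40.
OUTAGE = range(20, 40)
EXPERIMENT_END = 60


def service_ok(t):
    """True if the (simulated) service answers 200 at second t."""
    return t not in OUTAGE


def sample_times(mode, polling_interval=2):
    """Seconds at which a probe in the given mode checks the service."""
    if mode == "SOT":
        return [0]
    if mode == "EOT":
        return [EXPERIMENT_END]
    if mode == "Edge":
        return [0, EXPERIMENT_END]
    if mode == "Continuous":
        return list(range(0, EXPERIMENT_END + 1, polling_interval))
    raise ValueError(f"unknown mode: {mode}")


def probe_passes(mode, polling_interval=2):
    """A probe passes only if every sample it takes sees a healthy service."""
    return all(service_ok(t) for t in sample_times(mode, polling_interval))


for mode in ("SOT", "EOT", "Edge", "Continuous"):
    print(mode, "passes" if probe_passes(mode) else "fails")
```

Only the Continuous probe fails here, because it is the only one that samples inside the outage. The same model shows why the polling interval matters: with `polling_interval=45` the Continuous probe samples at seconds 0 and 45, misses the outage entirely, and passes.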
This separates people who ran the happy-path tutorial from people who validate a hypothesis. The scenario is real and common: a green verdict that hides a real availability gap. Listen for probes, and specifically Continuous mode versus the once-at-the-edges modes, plus the framing that the probe is the assertion that makes the verdict meaningful. A strong candidate reaches for promProbe to assert on an SLO like error rate or latency, not just a 200 from an endpoint.
ChaosEngine with a Continuous httpProbe plus a Prometheus SLO probe
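A hedged sketch of what such a ChaosEngine could look like, built as a Python dictionary and printed as JSON (which `kubectl apply` accepts as well as YAML). The field names follow the Litmus probe schema as commonly documented for 3.x, where durations are strings such as "2s"; older releases use plain integers and a `retry` field, so check the schema for your version. Everything specific here is an assumption for illustration: the engine name, namespace, labels, service account, URLs, the `http_requests_total` metric, and the 1% threshold.

```python
import json


def build_chaos_engine(service_url, prometheus_endpoint):
    """Return a pod-delete ChaosEngine manifest with two Continuous probes."""
    http_probe = {
        "name": "frontend-stays-available",  # hypothetical probe name
        "type": "httpProbe",
        "mode": "Continuous",
        "httpProbe/inputs": {
            "url": service_url,
            "method": {"get": {"criteria": "==", "responseCode": "200"}},
        },
        # Tight polling, small retry budget (values are assumptions).
        "runProperties": {
            "probeTimeout": "2s",
            "interval": "1s",
            "attempt": 2,
            "probePollingInterval": "1s",
        },
    }
    prom_probe = {
        "name": "error-rate-under-slo",  # hypothetical probe name
        "type": "promProbe",
        "mode": "Continuous",
        "promProbe/inputs": {
            "endpoint": prometheus_endpoint,
            # Hypothetical metric; substitute your own request counter.
            "query": 'sum(rate(http_requests_total{code=~"5.."}[1m]))'
                     " / sum(rate(http_requests_total[1m]))",
            "comparator": {"criteria": "<=", "value": "0.01"},
        },
        "runProperties": {
            "probeTimeout": "5s",
            "interval": "5s",
            "attempt": 2,
            "probePollingInterval": "5s",
        },
    }
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "frontend-chaos", "namespace": "default"},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": "default",
                "applabel": "app=frontend",
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [{
                "name": "pod-delete",
                "spec": {
                    "probe": [http_probe, prom_probe],
                    "components": {"env": [
                        {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                    ]},
                },
            }],
        },
    }


engine = build_chaos_engine(
    "http://frontend.default.svc.cluster.local:80",
    "http://prometheus.monitoring.svc.cluster.local:9090",
)
print(json.dumps(engine, indent=2))
```

Saved as a script, the output can be piped straight into `kubectl apply -f -`. The httpProbe answers "did requests succeed during chaos", while the promProbe asserts on an aggregate error-rate SLO rather than on a single endpoint.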
Read the probe result, not just the verdict
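One way to do this is to inspect the ChaosResult object rather than only its top-level verdict. The sketch below assumes the result has been fetched as JSON (for example with `kubectl get chaosresult <engine-name>-<experiment-name> -n <namespace> -o json`) and that it exposes `status.experimentStatus.verdict`, `status.experimentStatus.probeSuccessPercentage`, and a per-probe `status.probeStatuses[].status.verdict`. That per-probe layout matches recent Litmus releases as far as I know; older versions report probe status keyed by mode, so adjust accordingly. The function name and the sample data are hypothetical.

```python
def evaluate_chaos_result(chaos_result, min_probe_success=100.0):
    """Return (ok, verdict, probe_success_pct, failed_probe_names)."""
    status = chaos_result.get("status", {})
    experiment = status.get("experimentStatus", {})
    verdict = experiment.get("verdict", "Unknown")

    # The percentage is typically a string and may be a placeholder such
    # as "Awaited" while the experiment is still running.
    try:
        pct = float(experiment.get("probeSuccessPercentage"))
    except (TypeError, ValueError):
        pct = 0.0

    failed = [
        probe.get("name", "<unnamed>")
        for probe in status.get("probeStatuses", [])
        if probe.get("status", {}).get("verdict") != "Passed"
    ]
    ok = verdict == "Pass" and pct >= min_probe_success and not failed
    return ok, verdict, pct, failed


# Hypothetical ChaosResult fragment in which one of two probes failed.
sample = {
    "status": {
        "experimentStatus": {"verdict": "Fail", "probeSuccessPercentage": "50"},
        "probeStatuses": [
            {"name": "frontend-stays-available", "type": "httpProbe",
             "mode": "Continuous", "status": {"verdict": "Failed"}},
            {"name": "error-rate-under-slo", "type": "promProbe",
             "mode": "Continuous", "status": {"verdict": "Passed"}},
        ],
    }
}
ok, verdict, pct, failed = evaluate_chaos_result(sample)
print(ok, verdict, pct, failed)
```

In a pipeline, the same function can act as a gate: load the JSON from standard input, call `evaluate_chaos_result`, and exit non-zero when `ok` is false so the job fails along with the probe.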
- Running experiments with no probe at all and trusting the green Pass it produces
- Using SOT or EOT mode when the failure window is mid-experiment, so the availability dip is never sampled
- Setting probes so strict (zero retries, tiny timeout) that you get flaky failures and the team stops trusting the results
- What is the difference between Continuous and Edge mode, and when would Edge actually be the right choice?
- How do probeTimeout, attempt, and probePollingInterval interact, and how do you keep a probe from producing flaky verdicts?
- How would you assert on a real SLO like p99 latency staying under 300ms during the chaos window?
- How would you turn probeSuccessPercentage into a pass or fail gate in a CI pipeline?