How Litmus Decides Pass or Fail: Probes

mid
intermediate
Chaos Engineering
Question

Your pod-delete experiment shows Pass, but during the run users got 502s for about 20 seconds. How can the experiment pass while the service was actually down, and how do you fix that?

Answer

It passes because, with no probe attached, the verdict mostly checks that the fault was injected and the target recovered to a healthy pod count. It never checks whether your service kept serving traffic, so a bare ChaosEngine gives you a green verdict that means almost nothing.

The fix is probes, which are how you encode the steady-state hypothesis and feed probeSuccessPercentage into the verdict. Litmus has four probe types: httpProbe hits an endpoint and asserts on the status or response, cmdProbe runs a command and compares its output, k8sProbe asserts on the state of a Kubernetes resource, and promProbe runs a PromQL query and asserts on the result (for example, that the error rate stays under a threshold).

The type matters less than the mode. SOT and EOT run the probe once, at the start or the end of the test. Edge runs it at both ends. Continuous polls throughout the chaos window at probePollingInterval, and that is the mode that catches the 20 seconds of 502s, because it samples while the failure is happening.

So add an httpProbe in Continuous mode against the service, with a tight polling interval and a sane retry budget. If availability dips during chaos, the probe fails, probeSuccessPercentage drops, and the ChaosResult verdict becomes Fail. Now Pass actually means the fault was injected and the service stayed up. The probe is the experiment's assertion. Without one you are just deleting pods and hoping.

Why This Matters

This separates people who ran the happy-path tutorial from people who validate a hypothesis. The scenario is real and common: a green verdict that hides a real availability gap. Listen for probes, and specifically Continuous mode versus the once-at-the-edges modes, plus the framing that the probe is the assertion that makes the verdict meaningful. A strong candidate reaches for promProbe to assert on an SLO like error rate or latency, not just a 200 from an endpoint.

Code Examples

ChaosEngine with a Continuous httpProbe plus a Prometheus SLO probe

yaml
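
A minimal sketch of what such an engine could look like, not a drop-in manifest. The engine name, namespace, labels, service account, service URL, Prometheus endpoint, metric name and thresholds are all assumptions chosen for illustration. The runProperties values use unit suffixes as in recent Litmus releases; older releases expect plain integers, so check the field formats against the version you run.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-chaos            # hypothetical engine name
  namespace: default
spec:
  engineState: "active"
  appinfo:
    appns: "default"
    applabel: "app=checkout"      # hypothetical target label
    appkind: "deployment"
  chaosServiceAccount: pod-delete-sa   # hypothetical service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
        probe:
          # Availability assertion: sampled throughout the chaos window.
          - name: "checkout-stays-up"
            type: "httpProbe"
            mode: "Continuous"
            httpProbe/inputs:
              url: "http://checkout.default.svc:8080/healthz"   # assumed endpoint
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 2s          # plain integers on older releases
              interval: 1s
              attempt: 2                # small retry budget to avoid flaky failures
              probePollingInterval: 1s  # tight enough to see a 20 second outage
          # SLO assertion: 5xx ratio must stay at or below 1%.
          - name: "error-rate-under-slo"
            type: "promProbe"
            mode: "Continuous"
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring.svc:9090"   # assumed endpoint
              # Assumed metric and labels; substitute your own request counter.
              query: 'sum(rate(http_requests_total{service="checkout",code=~"5.."}[1m])) / sum(rate(http_requests_total{service="checkout"}[1m]))'
              comparator:
                criteria: "<="
                value: "0.01"
            runProperties:
              probeTimeout: 5s
              interval: 5s
              attempt: 2
              probePollingInterval: 5s
```

The httpProbe answers "did a request succeed right now", while the promProbe answers "did the error rate stay within budget", so the two can disagree in useful ways: a brief dip may fail the first and still pass the second.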

Read the probe result, not just the verdict

bash
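
A hedged sketch of how the result could be read from the command line and turned into a gate. It assumes the ChaosResult is named after the engine and experiment (here the hypothetical `checkout-chaos-pod-delete` in namespace `default`) and that the verdict and probeSuccessPercentage sit under `status.experimentStatus`. The function names `read_chaos_result` and `gate` are made up for this example.

```bash
#!/usr/bin/env bash

# Print "<verdict> <probeSuccessPercentage>" for a ChaosResult.
# Requires kubectl access to the cluster; the field paths are an assumption
# to verify against the ChaosResult your Litmus version produces.
read_chaos_result() {
  local name="$1" ns="$2"
  kubectl get chaosresult "$name" -n "$ns" \
    -o jsonpath='{.status.experimentStatus.verdict}{" "}{.status.experimentStatus.probeSuccessPercentage}{"\n"}'
}

# Return success only if the verdict is Pass AND the probe success
# percentage is at least the given minimum (default 100).
gate() {
  local verdict="$1" pct="$2" min="${3:-100}"
  [ "$verdict" = "Pass" ] || return 1
  # Numeric comparison; an empty or missing percentage counts as 0.
  awk -v p="$pct" -v m="$min" 'BEGIN { exit !(p + 0 >= m + 0) }'
}

# Example usage (not run here, needs a cluster):
#   read -r verdict pct < <(read_chaos_result checkout-chaos-pod-delete default)
#   gate "$verdict" "$pct" || { echo "chaos gate failed: $verdict $pct%"; exit 1; }
```

Checking the percentage alongside the verdict makes it obvious when a run passed with no probes attached or with only some probe checks succeeding. For the per-probe breakdown, `kubectl describe chaosresult checkout-chaos-pod-delete -n default` shows the full status.
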
Common Mistakes
  • Running experiments with no probe at all and trusting the green Pass it produces
  • Using SOT or EOT mode when the failure window is mid-experiment, so the availability dip is never sampled
  • Setting probes so strict (zero retries, tiny timeout) that you get flaky failures and the team stops trusting the results
Follow-up Questions
Interviewers often ask these as follow-up questions
  • What is the difference between Continuous and Edge mode, and when would Edge actually be the right choice?
  • How do probeTimeout, attempt, and probePollingInterval interact, and how do you keep a probe from producing flaky verdicts?
  • How would you assert on a real SLO like p99 latency staying under 300ms during the chaos window?
  • How would you turn probeSuccessPercentage into a pass or fail gate in a CI pipeline?
Tags
chaos-engineering
litmus
kubernetes
observability