DevOps Dictionary

Chaos Engineering

Chaos Engineering is the practice of intentionally introducing small, controlled failures into a production-like system to verify how it behaves under real stress and to improve resilience. It addresses the problem that modern distributed services can fail in surprising ways, due to partial outages, network delays, overloaded dependencies, or configuration drift, that rarely show up in unit tests or staging environments. At a high level, a team defines a steady state (the normal, measurable signals such as latency, error rate, and throughput) and forms a hypothesis about what should happen during a disruption. It then runs a tightly scoped experiment, such as injecting latency or terminating a service instance, observes the impact, and hardens the system with better timeouts, retries, fallbacks, alerting, and automation.
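The loop above (steady state, hypothesis, scoped experiment, observation) can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `flaky_dependency`, `with_latency`, and `run_experiment` are hypothetical names invented here, and the "system" is just a local function call wrapped to inject artificial delay.

```python
import random
import time

random.seed(7)  # deterministic experiment for the sake of the example

def flaky_dependency():
    """Stand-in for a downstream service call (hypothetical)."""
    return "ok"

def with_latency(func, seconds, probability):
    """Chaos wrapper: delays a fraction of calls by `seconds`."""
    def wrapped():
        if random.random() < probability:
            time.sleep(seconds)
        return func()
    return wrapped

def run_experiment(call, requests=50, timeout=0.05):
    """Steady-state signal: fraction of calls that succeed within a client timeout."""
    successes = 0
    for _ in range(requests):
        start = time.monotonic()
        result = call()
        elapsed = time.monotonic() - start
        if result == "ok" and elapsed <= timeout:
            successes += 1
    return successes / requests

# Hypothesis: delaying ~20% of calls by 100 ms will push some requests past
# the 50 ms client timeout, dropping the success rate below the baseline and
# showing that callers need timeouts and fallbacks.
baseline = run_experiment(flaky_dependency)
chaotic = run_experiment(with_latency(flaky_dependency, seconds=0.1, probability=0.2))
print(f"baseline={baseline:.2f} chaotic={chaotic:.2f}")
```

In a real system the steady state would be measured from production telemetry and the injection done by a fault-injection tool at the network or infrastructure layer, but the experiment shape is the same: compare the observed signal against the hypothesis, then harden whatever the gap reveals.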

With Chaos Engineering, failure modes are discovered early and recovery becomes faster and more predictable; without it, hidden coupling and brittle assumptions often surface only during real incidents, increasing downtime and customer impact. This gap exists because many outages come from interactions between components rather than a single obvious fault.
