In the world of distributed systems, failure is not an "if" but a "when." Servers crash, networks lag, and software has bugs. The key to building reliable systems is not to prevent all failures but to detect and respond to them quickly.
This is where observability comes in.
Observability is our ability to understand the internal state of a system from its external outputs. It is often described as having three pillars: metrics, logs, and traces.
Monitoring and alerting are the actionable layers built on top of these pillars, and they draw primarily on metrics. Monitoring gives us visibility, and alerting gives us the trigger to act.
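To make the split between monitoring and alerting concrete, here is a minimal sketch in Python. It assumes a single metric, the error rate over the most recent requests, and a fixed threshold. The class name `ErrorRateMonitor`, the window size, and the threshold value are all hypothetical choices made for illustration, not part of any particular monitoring tool.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical helper: tracks the error rate over the last `window` requests."""

    def __init__(self, window=100, threshold=0.05):
        # deque with maxlen drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def error_rate(self):
        # Monitoring: turn raw outcomes into a metric we can observe.
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Alerting: compare the metric against a threshold to decide whether to act.
        return self.error_rate() > self.threshold


# Usage: 7 successful requests followed by 3 failures (made-up data).
monitor = ErrorRateMonitor(window=10, threshold=0.2)
for failed in [False] * 7 + [True] * 3:
    monitor.record(failed)

print(monitor.error_rate())
print(monitor.should_alert())
```

In this sketch, `error_rate` plays the monitoring role (it exposes the state of the system), while `should_alert` plays the alerting role (it turns that state into a signal to act). A real setup would typically evaluate such a rule over a time window and route the alert to a person or an automated response.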