Last Updated: January 7, 2026
Observability is the ability to understand what is happening inside your system by examining its outputs.
Unlike traditional monitoring, which asks "is the system up?", observability asks "why is the system behaving this way?"
Observability gives you the tools to debug problems you have never seen before, in systems too complex to understand through code alone.
In this chapter, you will learn:
- Why traditional debugging approaches break down in distributed systems
- What the three pillars of observability are: logs, metrics, and traces
- How the pillars complement each other during an investigation
- How observability differs from traditional monitoring
- How to design systems for observability from the start
Understanding these fundamentals is essential before diving into the specific techniques covered in later chapters.
In a monolithic application, debugging is relatively straightforward. You have one process, one log file, and one stack trace when something goes wrong. You can step through code with a debugger and reproduce issues locally.
Distributed systems change everything. A single user request might touch dozens of services, each running on different machines, each with its own logs and failure modes. The problem might be in any of them, or in the network between them, or in the interaction between services that individually work fine.
When latency spikes in this system, where do you look first? Is it the database? The external API? Network congestion between services? A slow cache lookup?
Without observability, finding the answer requires guesswork and luck.
Traditional debugging approaches fail in distributed systems:
| Approach | Works For | Fails For |
|---|---|---|
| Debugger | Local development | Production systems, distributed calls |
| Print statements | Simple applications | High-volume production traffic |
| Log files | Single server | Dozens of servers with separate logs |
| Reproduce locally | Deterministic bugs | Race conditions, network issues, load-dependent bugs |
Observability tools fill this gap. They collect and correlate data from across your entire system, making it possible to answer questions like:
- Why is this particular user’s request slow?
- Which service is producing the errors behind a latency spike?
- What changed in the system’s behavior just before the incident started?
Observability rests on three pillars: logs, metrics, and traces. Each one answers a different kind of question, and together they give you a clear, end-to-end view of what’s happening inside your system.
Logs give you detailed, time-ordered records that help you debug specific issues.
Metrics give you aggregated numbers that help you track trends, set thresholds, and catch problems early.
Traces give you a request’s journey across services, showing the full flow and where time is spent.
A simple way to remember them is to think of a doctor’s toolkit:
- Metrics are the vital signs: quick numbers like heart rate and blood pressure that show overall health at a glance.
- Logs are the patient’s chart: detailed notes about each individual event and symptom.
- Traces follow the patient’s journey through the hospital, showing every department visited and how long each step took.
Logs are timestamped records of discrete events in your system. Whenever something meaningful happens, your code writes a log entry that describes it.
Each entry typically captures:
- A timestamp of when the event occurred
- A severity level (such as DEBUG, INFO, WARN, or ERROR)
- A human-readable message describing what happened
- Contextual fields such as the service name, request ID, or user ID
Modern systems prefer structured logs (usually JSON) over plain text because structured logs are easier to parse, index, and query.
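To make that concrete, here is a minimal sketch of a JSON log formatter using only Python’s standard library. The field names (`service`, `order_id`, and so on) are illustrative, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Merge in structured fields passed via logging's `extra` hook.
            **getattr(record, "context", {}),
        }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields travel as data instead of being baked into a free-text message.
logger.error("payment failed", extra={"context": {
    "service": "payment-service",
    "order_id": "ord_123",
    "amount_usd": 1250,
    "error": "card_declined",
}})
```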
With structured logs, queries become straightforward. For example:
- “Show me all errors from payment-service in the last hour”
- “Find all payments over $1000 that failed”

Common use cases for logs:

| Use Case | Example |
|---|---|
| Debugging errors | Stack traces, error messages, failed validations |
| Audit trails | Who did what and when |
| Security analysis | Login attempts, permission changes, suspicious activity |
| Understanding behavior | Why a specific decision was made |
Logs are essential, but they don’t scale gracefully on their own:
- Volume and cost: busy systems emit gigabytes of logs per hour, which gets expensive to store and slow to search
- No aggregate view: individual entries can’t tell you whether errors are trending up or down
- Fragmentation: each service writes its own logs, so following one request means searching in many places
This is why logs alone are not enough. Next, we’ll use metrics to zoom out and understand the system’s overall health and trends.
Metrics are numerical measurements collected over time. While logs capture individual events, metrics aggregate those events into time series so you can spot trends, compare behavior over time, and detect issues early.
Metrics generally fall into a few common types, summarized below:
| Type | Description | Example | Operations |
|---|---|---|---|
| Counter | Monotonically increasing value | Total requests, errors, bytes | Rate, increase |
| Gauge | Current value that can go up or down | Memory usage, queue size, temperature | Current, min, max, avg |
| Histogram | Distribution of observations | Request latency, response size | Percentiles, averages |
| Summary | Like histogram but with pre-calculated percentiles | Same use cases, lower storage | p50, p95, p99 |
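As a sketch of how these types look in code, here is how they might be declared with Python’s prometheus_client library (assumed here as the metrics client; the metric names and values are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever increases; query it with rate() for requests per second.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])

# Gauge: a point-in-time value that can go up or down.
QUEUE_SIZE = Gauge("job_queue_size", "Jobs currently waiting in the queue")

# Histogram: buckets observations so percentiles can be computed at query time.
LATENCY = Histogram("http_request_duration_seconds",
                    "HTTP request latency in seconds",
                    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0))


def handle_request() -> None:
    with LATENCY.time():  # times the block and records an observation
        REQUESTS.labels(method="GET", status="200").inc()
        QUEUE_SIZE.set(42)  # illustrative value


if __name__ == "__main__":
    start_http_server(8000)  # serves the /metrics endpoint for scraping
    handle_request()         # a real service would keep running here
```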
If you track nothing else, track these four golden signals. They cover the most important ways services fail or degrade:
- Latency: how long requests take to complete
- Traffic: how much demand the system is serving
- Errors: the rate of requests that fail
- Saturation: how close the system is to its capacity limits

Together, they quickly tell you whether your service is healthy and whether it is trending toward trouble.
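One way to wire these up is a small wrapper around each request handler. Here is a sketch with the same prometheus_client library; the names are illustrative, and saturation is approximated by in-flight requests.

```python
import time
from prometheus_client import Counter, Gauge, Histogram

LATENCY = Histogram("request_latency_seconds", "Request latency")    # latency
TRAFFIC = Counter("requests_total", "Total requests served")         # traffic
ERRORS = Counter("request_errors_total", "Requests that failed")     # errors
IN_FLIGHT = Gauge("requests_in_flight", "Concurrent requests")       # saturation


def instrumented(handler):
    """Wrap a request handler so every call emits all four signals."""
    def wrapper(*args, **kwargs):
        TRAFFIC.inc()
        IN_FLIGHT.inc()
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        except Exception:
            ERRORS.inc()
            raise
        finally:
            LATENCY.observe(time.perf_counter() - start)
            IN_FLIGHT.dec()
    return wrapper


@instrumented
def checkout(order_id: str) -> str:
    return f"charged {order_id}"  # stand-in for real work
```

Beyond the golden signals, metrics support a range of day-to-day use cases: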
| Use Case | Example |
|---|---|
| Alerting | Trigger alert when error rate exceeds 1% |
| Capacity planning | Track growth trends to predict when to scale |
| SLA tracking | Measure if you are meeting latency targets |
| Anomaly detection | Spot unusual patterns in traffic or errors |
| Dashboard visualization | Real-time graphs of system health |
Metrics are powerful, but they have blind spots:
- Aggregation hides detail: a healthy average can conceal a miserable p99
- No per-request context: metrics show that errors rose, not which request failed or why
- Cardinality limits: attaching unbounded labels such as user IDs will blow up storage
Metrics usually tell you that something is wrong. To find where it went wrong in a distributed system, you need traces.
Traces follow a single request as it moves through your distributed system, showing which services were called, in what order, and how long each step took.
A trace is made up of spans. Each span represents a unit of work, like a service call, a database query, or a cache lookup.
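Here is a minimal sketch of creating spans with the OpenTelemetry Python API (assumed here as the tracing library; the service and span names are illustrative):

```python
from opentelemetry import trace

# In a real service you would configure a TracerProvider with an exporter
# at startup; without one, this API safely produces no-op spans.
tracer = trace.get_tracer("checkout-service")


def load_cart(order_id: str) -> None: ...   # stand-ins for real work
def charge_card(order_id: str) -> None: ...


def checkout(order_id: str) -> None:
    # The root span covers the whole operation...
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # ...and each unit of work becomes a child span, so the finished
        # trace shows the call order and where the time went.
        with tracer.start_as_current_span("db.load_cart"):
            load_cart(order_id)
        with tracer.start_as_current_span("payment.charge"):
            charge_card(order_id)
```

In practice, traces support use cases like: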
| Use Case | Example |
|---|---|
| Finding bottlenecks | Which service is causing slow requests? |
| Understanding dependencies | How do services call each other? |
| Debugging specific requests | Why did this particular request fail? |
| Identifying cascading failures | One slow service causing timeouts elsewhere |
| Optimizing critical paths | Where should we focus performance work? |
Traces are powerful, but they come with practical constraints:
- Instrumentation effort: every service in the request path must propagate trace context, or the trace breaks
- Sampling: recording every request is expensive, so most systems keep only a fraction of traces
- Storage cost: spans with rich attributes add up quickly at high traffic volumes
The real power of observability comes from combining metrics, traces, and logs. Each pillar answers a different question, and together they take you from “something feels wrong” to a concrete root cause.
Imagine users report slow checkout:
1. Metrics show that p99 latency on the checkout endpoint spiked twenty minutes ago, while other endpoints look normal.
2. Traces of the slow requests reveal that nearly all of the extra time sits in the payment service’s database call.
3. Logs from that span show connection timeouts against the payments database.
Without all three, you end up guessing. With all three, you can move quickly and confidently from symptom to cause.
These terms are often used interchangeably, but they represent different philosophies.
Monitoring is about watching for known problems. You decide ahead of time what to measure, set thresholds, and alert when something crosses the line. This works well for predictable failure modes.
Observability is about investigating unknown problems. You collect enough high-quality signals (logs, metrics, traces, plus useful context) so you can ask new questions when something unexpected happens. This is what helps in complex distributed systems where failures do not follow a script.
| Monitoring | Observability |
|---|---|
| Answers known questions | Answers unknown questions |
| Predefined alerts | Ad-hoc queries |
| "Is the server up?" | "Why is this user's request slow?" |
| Dashboards | Exploration tools |
| Reactive | Proactive and reactive |
In practice, you need both. Monitoring catches the issues you can predict. Observability helps you debug the ones you cannot.
Observability does not happen by accident. You have to design for it from day one, just like scalability or reliability.
Every service should emit logs, metrics, and traces as a first-class part of the codebase.
Observability breaks down quickly when every team does things differently. Standardize the basics:
- A shared structured log format and common field names across services
- Metric naming conventions (e.g., service_operation_unit) and controlled labels
- A single trace context propagation format between services

The pillars are most useful when you can move between them quickly:
- Include the active trace ID in every log entry
- Link metrics to example traces so a spike can be followed to real requests
This lets you jump from a metric spike to the exact traces behind it, then to the logs for the failing span.
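As an illustration, here is one way such correlation might look in Python with OpenTelemetry. Note that `log_with_trace` is a hypothetical helper for this sketch, not a library function.

```python
import json
import logging
from opentelemetry import trace

logger = logging.getLogger("payment-service")


def log_with_trace(message: str, **fields) -> None:
    """Attach the active trace and span IDs to a structured log entry."""
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        "message": message,
        # Hex-encode the IDs the way tracing backends display them, so a
        # log line can be pasted straight into a trace search.
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        **fields,
    }))
```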
A good rule of thumb is simple: for every component, ask “If this breaks at 3 AM, what will I need to know to fix it?”
Then make sure the system captures that information by default, not as an afterthought.
Observability is about understanding what is happening inside your system by examining its outputs: logs, metrics, and traces. Each pillar answers a different kind of question:
- Logs: what exactly happened during this specific event?
- Metrics: how is the system trending over time, and is it healthy right now?
- Traces: where did this request go, and where was the time spent?
The pillars work best together. A typical debugging flow starts with metrics detecting an anomaly, traces locating the problem, and logs revealing the root cause.
With the fundamentals established, we will now dive deeper into each pillar. The next chapter focuses on logging, the most familiar pillar but one that is often implemented poorly. We will cover how to write useful logs, choose the right level, structure log data, and avoid common mistakes that make logs useless when you need them most.