
Three Pillars of Observability

Last Updated: January 7, 2026


Ashish Pratap Singh

Observability gives you the tools to debug problems you have never seen before, in systems too complex to understand through code alone.

In this chapter, you will learn:

  • What observability means and why it matters for distributed systems
  • The three pillars of observability: logs, metrics, and traces
  • How each pillar answers different questions about system behavior
  • When to use each pillar and how they complement each other
  • The difference between monitoring and observability

Understanding these fundamentals is essential before diving into the specific techniques covered in later chapters.

Why Observability Matters

In a monolithic application, debugging is relatively straightforward. You have one process, one log file, and one stack trace when something goes wrong. You can step through code with a debugger and reproduce issues locally.

Distributed systems change everything. A single user request might touch dozens of services, each running on different machines, each with its own logs and failure modes. The problem might be in any of them, or in the network between them, or in the interaction between services that individually work fine.

When latency spikes in a system like this, where do you look first? Is it the database? The external API? Network congestion between services? A slow cache lookup?

Without observability, finding the answer requires guesswork and luck.

The Debugging Challenge

Traditional debugging approaches fail in distributed systems:

| Approach | Works For | Fails For |
|---|---|---|
| Debugger | Local development | Production systems, distributed calls |
| Print statements | Simple applications | High-volume production traffic |
| Log files | Single server | Dozens of servers with separate logs |
| Reproduce locally | Deterministic bugs | Race conditions, network issues, load-dependent bugs |

Observability tools fill this gap. They collect and correlate data from across your entire system, making it possible to answer questions like:

  • Why did this specific request take 5 seconds?
  • Which service is causing the increase in error rates?
  • What changed between yesterday (when it worked) and today (when it broke)?

The Three Pillars

Observability rests on three pillars: logs, metrics, and traces. Each one answers a different kind of question, and together they give you a clear, end-to-end view of what’s happening inside your system.

Logs give you detailed, time-ordered records that help you debug specific issues.

Metrics give you aggregated numbers that help you track trends, set thresholds, and catch problems early.

Traces give you a request’s journey across services, showing the full flow and where time is spent.

A simple way to remember them is to think of a doctor’s toolkit:

  • Logs are like clinical notes: rich detail about what occurred.
  • Metrics are like vital signs: high-level signals of system health.
  • Traces are like scans: they reveal the path and connections across the system.

Pillar 1: Logs

Logs are timestamped records of discrete events in your system. Whenever something meaningful happens, your code writes a log entry that describes it.

What Logs Look Like

Each entry typically captures:

  • When: timestamp (often with millisecond precision)
  • Where: service, component, host, or instance
  • What: the event that occurred
  • Context: identifiers and details that make the event searchable (order_id, user_id, error, etc.)

Structured vs Unstructured Logs

Modern systems prefer structured logs (usually JSON) over plain text because structured logs are easier to parse, index, and query.

Unstructured (hard to parse reliably)
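
For instance, an event might be written as a free-form text line like this (the service name and values are made up for illustration):

```text
2026-01-07 14:32:10 ERROR payment-service: Payment failed for order 98412, user 5521, card declined
```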

Structured (machine-readable)
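
The same event written as a structured JSON object (the field names here are illustrative, not a fixed standard):

```json
{
  "timestamp": "2026-01-07T14:32:10.412Z",
  "level": "ERROR",
  "service": "payment-service",
  "event": "payment_failed",
  "order_id": 98412,
  "user_id": 5521,
  "amount": 1299.00,
  "error": "card_declined"
}
```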

With structured logs, queries become straightforward. For example:

  • “Show all ERROR logs from payment-service in the last hour”
  • “Find all orders over $1000 that failed”
  • “Count events by type per service”
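
To make the second query concrete, here is a rough sketch of it as a few lines of Python over newline-delimited JSON logs. The file name and field names ("event", "amount") are assumptions carried over from the example above, not a standard schema.

```python
import json

# Find all orders over $1000 that failed, reading one JSON object per line.
with open("payment-service.log") as f:
    for line in f:
        entry = json.loads(line)
        if entry.get("event") == "payment_failed" and entry.get("amount", 0) > 1000:
            print(entry["order_id"], entry["error"])
```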

What Logs Are Good For

| Use Case | Example |
|---|---|
| Debugging errors | Stack traces, error messages, failed validations |
| Audit trails | Who did what and when |
| Security analysis | Login attempts, permission changes, suspicious activity |
| Understanding behavior | Why a specific decision was made |

Limitations of Logs

Logs are essential, but they don’t scale gracefully on their own:

  • Volume: high-traffic systems can produce billions of lines per day
  • Cost: storing, shipping, and indexing everything gets expensive
  • Signal-to-noise: the useful log line is often buried in a huge stream
  • Poor aggregation: logs describe individual events, not trends
  • Context gaps: a log in one service usually cannot explain what happened across other services

This is why logs alone are not enough. Next, we’ll use metrics to zoom out and understand the system’s overall health and trends.

Pillar 2: Metrics

Metrics are numerical measurements collected over time. While logs capture individual events, metrics aggregate those events into time series so you can spot trends, compare behavior over time, and detect issues early.

What Metrics Look Like
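
A metric is just a name, a set of labels, and a numeric value sampled over time. For illustration, here is roughly what the common types look like in a Prometheus-style text exposition (the metric and label names are made up):

```text
# counter: total requests served, by status
http_requests_total{service="payment-service",status="500"} 1027

# gauge: current memory usage in bytes
memory_usage_bytes{service="payment-service"} 524288000

# histogram bucket: requests completing in <= 0.5s
request_duration_seconds_bucket{service="payment-service",le="0.5"} 9435
```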

Types of Metrics

Metrics generally fall into a few common types:

Counters

  • Track totals that only go up
  • Examples: requests served, errors occurred, bytes transferred

Gauges

  • Track a current value that can go up or down
  • Examples: active connections, queue depth, memory usage

Histograms

  • Track a distribution of values by buckets
  • Examples: request latency, request size, processing duration

Here’s a quick summary:

| Type | Description | Example | Operations |
|---|---|---|---|
| Counter | Monotonically increasing value | Total requests, errors, bytes | Rate, increase |
| Gauge | Current value that can go up or down | Memory usage, queue size, temperature | Current, min, max, avg |
| Histogram | Distribution of observations | Request latency, response size | Percentiles, averages |
| Summary | Like histogram but with pre-calculated percentiles | Same use cases, lower storage | p50, p95, p99 |
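
To make the types concrete, here is a minimal sketch using the Python prometheus_client library. The metric names, labels, and the handle_request function are invented for illustration; they are not part of any standard.

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever goes up; usually queried as a rate over time.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])

# Gauge: a current value that can move in both directions.
QUEUE_DEPTH = Gauge("work_queue_depth", "Items waiting in the work queue")

# Histogram: observations are bucketed so percentiles can be derived later.
LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                      # record duration as a histogram observation
        QUEUE_DEPTH.set(12)                   # e.g. sampled from the real queue
        REQUESTS.labels(status="200").inc()   # count one successful request

if __name__ == "__main__":
    start_http_server(8000)   # expose the metrics at :8000/metrics for scraping
    while True:
        handle_request()
        time.sleep(1)
```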

The Four Golden Signals

If you track nothing else, track these four. They cover the most important ways services fail or degrade.

  • Latency: how long requests take (p50, p95, p99)
  • Traffic: how much demand you’re serving (requests per second)
  • Errors: how many requests are failing (error rate)
  • Saturation: how “full” your resources are (CPU, memory, queue depth)

They quickly tell you whether your service is healthy and whether it is trending toward trouble.
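
Averages hide outliers, which is why latency is reported as percentiles. As a minimal sketch, this is the nearest-rank calculation behind p50/p95/p99; in practice your metrics backend derives these from histograms, and the sample values below are made up.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample covering p percent of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 95, 110, 480, 130, 105, 2500, 115, 125, 100]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```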

What Metrics Are Good For

| Use Case | Example |
|---|---|
| Alerting | Trigger alert when error rate exceeds 1% |
| Capacity planning | Track growth trends to predict when to scale |
| SLA tracking | Measure if you are meeting latency targets |
| Anomaly detection | Spot unusual patterns in traffic or errors |
| Dashboard visualization | Real-time graphs of system health |

Limitations of Metrics

Metrics are powerful, but they have blind spots:

  • Aggregation hides details: Knowing average latency is high does not tell you why
  • Cardinality explosion: Too many unique label values makes metrics unmanageable
  • No context: A spike in errors does not tell you which users are affected
  • Predefined: You can only query metrics you decided to collect in advance

Metrics usually tell you that something is wrong. To find where it went wrong in a distributed system, you need traces.

Pillar 3: Traces

Traces follow a single request as it moves through your distributed system. They show which services were called, in what order, and how long each step took. If metrics tell you something is wrong, traces help you find where it went wrong.

What Traces Look Like

A trace is made up of spans. Each span represents a unit of work, like a service call, a database query, or a cache lookup.

What a span typically includes

  • Trace ID: Unique identifier linking all spans in a request
  • Span ID: Unique identifier for this span
  • Parent Span ID: Which span created this one
  • Operation name: What work was done
  • Start and end time: Duration of the operation
  • Tags/attributes: Key-value pairs with context
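
As a rough sketch, a span can be modeled as a small record with exactly those fields. This is illustrative code only, not the API of any particular tracing library, and the operation names and attribute are made up.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                     # shared by every span in the same request
    span_id: str                      # unique to this unit of work
    parent_span_id: Optional[str]     # which span created this one (None for the root)
    operation: str                    # e.g. "POST /checkout" or "SELECT orders"
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def finish(self) -> None:
        self.end_time = time.time()

# One request produces a tree of spans that all share the same trace_id.
root = Span(uuid.uuid4().hex, uuid.uuid4().hex, None, "POST /checkout")
child = Span(root.trace_id, uuid.uuid4().hex, root.span_id, "charge_card",
             attributes={"payment.provider": "example-gateway"})
child.finish()
root.finish()
```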

What Traces Are Good For

| Use Case | Example |
|---|---|
| Finding bottlenecks | Which service is causing slow requests? |
| Understanding dependencies | How do services call each other? |
| Debugging specific requests | Why did this particular request fail? |
| Identifying cascading failures | One slow service causing timeouts elsewhere |
| Optimizing critical paths | Where should we focus performance work? |

Limitations of Traces

Traces are powerful, but they come with practical constraints:

  • Sampling: Storing traces for every request is expensive, so most systems sample
  • Instrumentation required: Every service must propagate trace context
  • Storage costs: Traces are larger than metrics, expensive to retain long-term
  • Overwhelming detail: Large traces with hundreds of spans are hard to analyze
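
One common response to the sampling constraint is head-based sampling: decide once per trace, at the entry point, and have every service honor that decision. Here is a sketch; the 1% rate and the hashing scheme are arbitrary choices for illustration.

```python
import hashlib

SAMPLE_RATE = 0.01  # keep roughly 1% of traces

def should_sample(trace_id: str) -> bool:
    # Hash the trace ID so every service makes the same keep/drop decision
    # for the same trace, without any coordination.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < SAMPLE_RATE * 10_000
```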

How the Pillars Work Together

The real power of observability comes from combining metrics, traces, and logs. Each pillar answers a different question, and together they take you from “something feels wrong” to a concrete root cause.

A practical debugging workflow

Imagine users report slow checkout:

  1. Metrics tell you there is a problem:
    • Dashboard shows p99 latency spiked from 500ms to 5s
    • Error rate increased from 0.1% to 2%
    • You know something is wrong, but not what
  2. Traces tell you where the problem is:
    • Look at slow traces during the spike
    • See that Payment Service is taking 4s instead of 100ms
    • Payment Service is calling an external API that is timing out
  3. Logs tell you why:
    • Check Payment Service logs during the timeframe
    • Find error: "Connection to payment gateway refused: max retries exceeded"
    • The payment gateway had an outage

Without all three, you end up guessing. With all three, you can move quickly and confidently from symptom to cause.

Monitoring vs Observability

These terms are often used interchangeably, but they represent different philosophies.

Monitoring is about watching for known problems. You decide ahead of time what to measure, set thresholds, and alert when something crosses the line. This works well for predictable failure modes.

Observability is about investigating unknown problems. You collect enough high-quality signals (logs, metrics, traces, plus useful context) so you can ask new questions when something unexpected happens. This is what helps in complex distributed systems where failures do not follow a script.

Quick comparison

| Monitoring | Observability |
|---|---|
| Answers known questions | Answers unknown questions |
| Predefined alerts | Ad-hoc queries |
| "Is the server up?" | "Why is this user's request slow?" |
| Dashboards | Exploration tools |
| Reactive | Proactive and reactive |

In practice, you need both. Monitoring catches the issues you can predict. Observability helps you debug the ones you cannot.

Building an Observable System

Observability does not happen by accident. You have to design for it from day one, just like scalability or reliability.

1. Instrument Everything

Every service should emit logs, metrics, and traces as a first-class part of the codebase.

2. Use Consistent Standards

Observability breaks down quickly when every team does things differently. Standardize the basics:

  • Logs: JSON with consistent field names (service, env, request_id, trace_id, user_id, error, etc.)
  • Metrics: clear naming conventions (for example service_operation_unit) and controlled labels
  • Traces: consistent context propagation using standards like W3C Trace Context
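
One way to enforce the logging convention is a small formatter shared across services. This is a sketch using Python's standard logging module; the field names follow the list above, and the service name is hard-coded only for brevity.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with standard field names."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",   # set per service in real deployments
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().warning("card declined", extra={"trace_id": "abc123"})
```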

3. Connect the Pillars

The pillars are most useful when you can move between them quickly:

  • Include trace IDs in log entries
  • Attach exemplars (trace IDs) to metric points when possible

This lets you jump from a metric spike to the exact traces behind it, then to the logs for the failing span.
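
For example, one lightweight way to get trace IDs into every log line is to carry the current trace ID in request-scoped context and stamp it onto each record. This sketch uses Python's contextvars and logging modules; the trace ID value and function name are made up.

```python
import contextvars
import logging

# The current request's trace ID, carried implicitly across function calls.
current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = current_trace_id.get()  # tag every record with the trace ID
        return True

logging.basicConfig(format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
logging.getLogger().addFilter(TraceIdFilter())

def handle_checkout(trace_id: str):
    current_trace_id.set(trace_id)
    logging.warning("payment gateway timed out")  # automatically carries the trace ID

handle_checkout("4bf92f3577b34da6a3ce929d0e0e4736")
```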

4. Design for Debuggability

A good rule of thumb is simple: If this breaks at 3 AM, what will I need to know to fix it?

Then make sure the system captures that information by default, not as an afterthought.

Summary

Observability is about understanding what is happening inside your system:

  • Logs record discrete events with full context. They tell you what happened but can be overwhelming at scale.
  • Metrics aggregate data into time series. They show trends and enable alerting but hide individual request details.
  • Traces follow requests through distributed systems. They show where time is spent but require instrumentation and storage.

Each pillar answers different questions:

  • Metrics: "Is something wrong?" (alerting, dashboards)
  • Traces: "Where is it wrong?" (bottlenecks, dependencies)
  • Logs: "Why is it wrong?" (debugging, root cause)

The pillars work best together. A typical debugging flow starts with metrics detecting an anomaly, traces locating the problem, and logs revealing the root cause.

With the fundamentals established, we will now dive deeper into each pillar. The next chapter focuses on logging, the most familiar pillar but one that is often implemented poorly. We will cover how to write useful logs, choose the right level, structure log data, and avoid common mistakes that make logs useless when you need them most.