Last Updated: January 7, 2026
Let's say your metrics dashboard shows that p99 latency spiked from 200ms to 2 seconds. You know something is wrong, but where? The request touches 8 services. Is it the database? The payment gateway? Network latency between services? A slow cache lookup?
Logs can tell you what happened in each service, but piecing together the timeline across 8 services is tedious. Metrics show aggregate latency but not which component is slow. Correlation IDs link logs together but do not show timing.
Distributed tracing solves this by recording the journey of each request through your system, including exactly how long each step took. It shows you a timeline of every service call, database query, and external API request. When latency spikes, you can look at slow traces and immediately see where the time went.
In this chapter, you will learn:

- What traces and spans are, and how they fit together
- How trace context propagates between services
- How to instrument services automatically and manually
- How sampling keeps tracing affordable at scale
- Which backends collect and visualize traces
- How to connect traces with logs and metrics
This chapter builds directly on correlation IDs. Distributed tracing is correlation IDs with structure, timing, and visualization.
A distributed trace is a record of a request's journey through a system. It captures every service, database call, and external API request, along with timing information.
From this trace, you can immediately see which services the request touched, how long each step took, and which step dominated the total time.
Without tracing, finding this information would require correlating logs across 5 services and manually calculating timing differences.
Distributed tracing is built on two simple ideas: a trace tells the full story, and spans are the chapters. Once you understand these two concepts, trace visualizations stop looking mysterious and start feeling like a timeline you can reason about.
A trace represents the complete journey of a single request as it moves through your distributed system.
A trace typically includes a unique trace ID, a set of spans (one per operation), and the timing, service, and metadata for each of those spans.
A trace answers questions like: Which services did this request touch? Where did the time go? Which call failed, and why?
A span is a single unit of work inside a trace. Every meaningful operation can create a span: handling an incoming HTTP request, running a database query, calling another service, looking up a cache entry, or hitting an external API.
Spans form a tree. The first span is the root (often the API gateway or edge service). Downstream work becomes child spans, and deeper calls become grandchildren.
Here is an example trace:
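(A hypothetical trace for a checkout request; the service names and timings are illustrative, not taken from a real system.)

```
api-gateway: POST /checkout ............................. 480ms
└── order-service: create_order ......................... 460ms
    ├── postgres: INSERT INTO orders ....................  35ms
    ├── payment-service: charge_card ..................... 390ms
    │   └── payment-provider: POST /v1/charges ........... 370ms
    └── redis: SET order:1234 ............................   5ms
```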
This one tree tells you a lot: which call dominates the total time, what runs in sequence versus in parallel, how deep the call chain goes, and where an error (if any) first appeared.
Every span carries enough metadata to answer: what happened, where, and how long it took.
| Field | Description | Example |
|---|---|---|
| Trace ID | Links span to its trace | abc-123 |
| Span ID | Unique identifier for this span | span-456 |
| Parent Span ID | The span that created this one | span-123 |
| Operation Name | What work was done | HTTP GET /users |
| Service Name | Which service ran this span | user-service |
| Start Time | When the span began | 2024-01-15T10:23:45.123Z |
| Duration | How long it took | 45ms |
| Tags/Attributes | Key-value metadata | http.status=200 |
| Logs/Events | Timestamped events within the span | Error messages |
| Status | Success or error | OK, ERROR |
A useful way to think about attributes is: they make spans filterable. Without attributes, you can only look at one trace at a time. With attributes, you can ask questions like “show me traces where db.statement is slow” or “only traces with http.status_code=500.”
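Here is a sketch of how attributes get attached using the OpenTelemetry Python API; the span name and attribute values are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("user-service")

with tracer.start_as_current_span("HTTP GET /users") as span:
    # These attributes are what let you later query
    # "traces where http.status_code=500" or "traces where db.statement is slow".
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.status_code", 200)
    span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
```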
Distributed tracing only works if trace context flows through every hop. It is the same idea as propagating a correlation ID, but more structured.
Instead of passing just one ID, you pass enough information to rebuild the full span tree across services.
The most common standard for HTTP-based propagation is W3C Trace Context. It defines two headers:
- traceparent: the core identifiers needed to connect spans
- tracestate: optional vendor-specific metadata

The traceparent header packs four fields into a single value, for example 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. What each piece means:

- Version (currently 00)
- Trace ID: the 128-bit identifier shared by every span in the trace
- Parent span ID: the span that made this call
- Trace flags (01 often means "sampled")

The tracestate header is where vendors can store extra data (for example, internal routing, tenant information, or sampling details). Your application usually treats it as opaque and just forwards it.
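If you want to see the pieces, a traceparent value splits cleanly on dashes. A minimal sketch, using the hypothetical header value from above:

```python
# Split a W3C traceparent header into its four fields.
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, parent_span_id, trace_flags = header.split("-")

print(trace_id)        # 32 hex chars, shared by every span in the trace
print(parent_span_id)  # 16 hex chars, the span that made this call
print(trace_flags)     # "01" means this trace was sampled
```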
Here is how a trace typically forms as a request moves through the system.
To keep traces connected, each service must follow the same loop:

1. Extract the incoming trace context from the request headers
2. Create its own spans as children of that context
3. Inject the context into the headers of every outgoing call
If any service forgets step 3, downstream services will start new traces and your end-to-end view will be broken.
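Here is a sketch of that loop with the OpenTelemetry Python API. The service name, span name, and downstream call are placeholders; extract and inject come from opentelemetry.propagate and handle the traceparent and tracestate headers for you.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-service")  # hypothetical service name

def handle_request(incoming_headers: dict) -> None:
    # Step 1: extract the caller's trace context from the incoming headers
    ctx = extract(incoming_headers)

    # Step 2: create this service's span as a child of that context
    with tracer.start_as_current_span("process_order", context=ctx):
        outgoing_headers: dict = {}

        # Step 3: inject the current context into the outgoing request headers
        inject(outgoing_headers)
        # ...call the downstream service with outgoing_headers attached...
```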
To get useful traces, you need instrumentation at the right places. There are two ways to add it: automatic instrumentation for common libraries, and manual instrumentation for your own business logic.
A good rule is to start with automatic instrumentation to get broad coverage, then add manual spans only where they add clarity.
Automatic instrumentation uses libraries, agents, or SDKs that hook into common frameworks and clients. You get spans without writing much tracing code yourself.
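For example, with OpenTelemetry's Python contrib packages you can instrument a Flask app and the requests client without touching any handler code. This is a sketch assuming opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests are installed; your framework will have its own instrumentor.

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

FlaskInstrumentor().instrument_app(app)  # a span for every incoming HTTP request
RequestsInstrumentor().instrument()      # a span for every outgoing HTTP call
```

Many teams skip even this step and launch their service with the opentelemetry-instrument command, which patches supported libraries at startup.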
Auto-instrumentation can tell you that a request was slow and which dependency was slow, but it may not explain which part of your business logic was responsible. That is where manual spans help.
Manual instrumentation is when you explicitly create spans around the parts of your code that matter to you. This is how you turn a trace from “a list of RPCs” into something that matches how your system actually works.
This pattern is worth copying:
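A minimal sketch of the pattern with the OpenTelemetry Python API; validate, reserve_inventory, and charge_payment are hypothetical placeholders for your own business logic.

```python
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def place_order(order) -> None:
    # One span per meaningful business step, nested under the current request span.
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.item_count", len(order.items))
        validate(order)  # placeholder business function

    with tracer.start_as_current_span("reserve_inventory"):
        reserve_inventory(order.items)  # placeholder business function

    with tracer.start_as_current_span("charge_payment") as span:
        span.set_attribute("payment.amount", order.total)
        charge_payment(order)  # placeholder business function
```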
If you try to instrument everything, you will create noisy traces and spend more time maintaining spans than using them. Prioritize the places that give the biggest debugging value.
| Category | Examples | Priority |
|---|---|---|
| Service boundaries | HTTP endpoints, gRPC methods | Critical |
| External calls | Databases, caches, APIs | Critical |
| Message processing | Queue consumers, event handlers | High |
| Business operations | Order creation, payment processing | Medium |
| Internal computations | Complex algorithms (if slow) | Low |
Start with service boundaries and external calls. These give you 80% of the value. Add business operations as needed.
Tracing every request sounds great until you see the bill. Traces are rich, detailed, and expensive to store. At high traffic, sampling is not optional. It is how you keep tracing useful without turning it into a cost and storage problem.
Imagine a service doing 10,000 requests per second.
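As a back-of-the-envelope estimate (assuming roughly 10 spans per request and about 1 KB per span): 10,000 requests/s × 10 spans × 1 KB ≈ 100 MB of span data per second, which is on the order of 8-9 TB per day before compression. The exact numbers depend on your span counts and attribute sizes, but the order of magnitude is the point.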
Sampling is about choosing the right balance: enough visibility to debug issues, but not so much data that the tracing system collapses under its own weight.
Head-based sampling decides whether to sample at the start of the trace (at the first span).
Pros

- Simple to implement and cheap at runtime
- Predictable cost, because the sample rate is fixed up front
- The decision can travel with the trace context, so every service agrees

Cons

- The decision is made before you know the outcome, so rare errors and slow outliers may not be captured
Head-based sampling is a solid default when you want predictable cost and you are okay with occasionally missing edge cases.
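A sketch of head-based sampling with the OpenTelemetry Python SDK: sample 10% of new traces at the root, and have downstream services follow the parent's decision. The 10% rate is an assumption; tune it for your traffic.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Root spans: keep 10% of traces. Child spans: respect the parent's decision,
# so a trace is never half-sampled across services.
sampler = ParentBased(root=TraceIdRatioBased(0.10))

trace.set_tracer_provider(TracerProvider(sampler=sampler))
```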
Tail-based sampling collects span data first, then decides what to keep at the end of the trace.
A typical policy keeps:

- any trace that contains an error
- any trace that is unusually slow (for example, above your latency SLO)
- a small random sample of normal traffic, as a baseline
Pros

- Keeps exactly the traces you care about most: errors and slow outliers
- Lets you sample normal traffic aggressively without losing rare failures

Cons

- Every span must be buffered until the trace completes, which requires extra infrastructure (usually a collector tier)
- Cost and data volume are less predictable
Tail-based sampling is ideal when debugging quality matters more than predictable cost, and you can support the extra infrastructure.
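Conceptually, the decision looks like the sketch below, run over a fully buffered trace. Real deployments implement this in a collector tier (for example, the OpenTelemetry Collector's tail sampling processor), not in application code; the span fields and thresholds here are assumptions.

```python
import random

def keep_trace(spans: list[dict]) -> bool:
    """Decide whether to keep a completed, fully buffered trace."""
    has_error = any(span["status"] == "ERROR" for span in spans)
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)

    if has_error or duration_ms > 1000:  # keep every error and every slow trace
        return True
    return random.random() < 0.01        # keep 1% of normal traces as a baseline
```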
Priority-based sampling is a practical middle ground. You sample different traffic at different rates.
| Request Type | Sample Rate |
|---|---|
| Errors | 100% |
| Slow requests (>1s) | 100% |
| Critical paths (checkout) | 50% |
| Normal requests | 1-10% |
| Health checks | 0.1% |
This approach is easy to explain, easy to tune, and usually gives good results in production.
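Here is the same table expressed as a sketch in code. The paths and rates are assumptions about your routes; note that errors and slow requests cannot be identified up front, so in practice those two rows are enforced with tail-based rules or by recording the span and deciding later.

```python
import random

SAMPLE_RATES = {
    "critical_path": 0.50,   # checkout and similar flows
    "normal": 0.05,          # everything else
    "health_check": 0.001,   # keep these to a trickle
}

def classify(path: str) -> str:
    if path == "/health":
        return "health_check"
    if path.startswith("/checkout"):
        return "critical_path"
    return "normal"

def should_sample(path: str) -> bool:
    return random.random() < SAMPLE_RATES[classify(path)]
```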
| Strategy | Storage Cost | Captures Errors | Complexity |
|---|---|---|---|
| No sampling | Very high | Yes | Low |
| Head-based | Predictable | Sometimes | Low |
| Tail-based | Variable | Yes | High |
| Priority-based | Medium | Yes | Medium |
A good default for most teams is: head-based sampling + always sample errors, and then evolve toward tail-based sampling as traffic grows and debugging needs increase.
Once you have sampled traces, you need a backend to collect, store, and visualize them.
Jaeger (originally from Uber) is the most widely used open-source tracing backend. Features:

- A web UI for searching, filtering, and visualizing traces
- Pluggable storage backends (Cassandra, Elasticsearch, and others)
- Service dependency graphs built from trace data
- Native support for OpenTelemetry's OTLP protocol
Zipkin has a similar architecture to Jaeger and was originally developed at Twitter. Features:

- A simple UI and a straightforward data model
- Wide language support through community instrumentation libraries
- Storage options including in-memory, MySQL, Cassandra, and Elasticsearch
| Service | Provider | Integration |
|---|---|---|
| AWS X-Ray | AWS | Deep AWS integration |
| Google Cloud Trace | GCP | GCP integration |
| Azure Monitor | Microsoft | Azure integration |
| Datadog APM | Datadog | Full observability platform |
| Honeycomb | Honeycomb | High-cardinality analysis |
Managed services reduce operational work but can become expensive at high volume, especially if you keep too many traces or retain them for too long.
OpenTelemetry is becoming the standard way to instrument traces (and increasingly metrics and logs) without locking into a single vendor.
A typical setup looks like: your services instrumented with the OpenTelemetry SDK, exporting spans to an OpenTelemetry Collector, which batches, samples, and forwards them to whatever backend you choose (Jaeger, Zipkin, or a managed vendor).

OpenTelemetry provides:

- APIs and SDKs for most major languages
- Automatic instrumentation for common frameworks, databases, and HTTP clients
- The Collector for receiving, processing, and exporting telemetry
- Semantic conventions so attribute names are consistent across services
- A vendor-neutral wire protocol (OTLP), so you can switch backends without re-instrumenting
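A minimal Python setup sketch, assuming the opentelemetry-sdk and OTLP gRPC exporter packages are installed and a Collector is listening on localhost:4317; the service name is a placeholder.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})  # hypothetical name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)  # use this tracer to create spans
```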
Traces are most valuable when you are debugging a real incident or chasing a performance regression. Metrics tell you something is wrong. Traces tell you where the time went and which dependency is responsible.
The first step is to filter for outliers. Most tracing tools let you query by duration, service, operation name, status, and tags.
A good habit is to narrow your search further:
operation = "POST /orders")status = ERROR)This makes it easier to find traces that represent the problem, not normal background noise.
Once you open a slow trace, look for the span that dominates the timeline. Most UIs show it visually, but you can reason about it as a breakdown of total time.
What this tells you:
This is why tracing is so effective. You stop guessing and start following the time.
A single slow trace is useful. Comparing slow traces with normal ones is where the insight usually appears.
The slow query is scanning 50,000 rows instead of looking up by primary key.
Tracing systems can build a service dependency map by looking at who calls whom.
This helps you see:
Each pillar is useful on its own, but the real power shows up when they are connected. You want to move smoothly from a high-level signal to the exact request and then to the root cause.
A good observability setup lets you follow one flow:
Metrics tell you something changed. Traces show where it changed. Logs explain why it changed.
The simplest and most effective connection is to include trace_id (and ideally span_id) in every log entry. Once you do that, every trace becomes a clickable thread that leads to the exact log lines generated by that request.
Now the workflow is straightforward:
1. Find a slow or failing trace in your tracing UI
2. Copy its trace_id
3. Search your logs for that trace_id to see every log line generated by that request

This is especially important in microservices, where a single user action can generate logs in five different services.
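One way to make this automatic in Python is a logging filter that stamps every record with the current trace and span IDs. A sketch; the field names and log format are assumptions, so adapt them to your log pipeline.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace_id and span_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
logging.getLogger().addHandler(handler)
```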
Metrics give you trends, but they do not tell you which request caused a spike. Exemplars bridge that gap by attaching a trace reference to a specific metric sample.
Think of an exemplar as “a representative trace for this data point.”
When you see a latency spike in metrics, the exemplar links to an example slow trace.
Here is what “connected observability” looks like during an incident:
1. A metrics alert fires: p99 latency for /orders jumped from 500ms to 5s.
2. An exemplar on that metric links to a representative slow trace, which shows where the time went.
3. Logs for that trace_id show repeated retries and a slow query caused by a missing index.

Without the connections, each step takes longer and involves guesswork. With the connections, you follow a trail of evidence.
Start where requests cross boundaries: incoming HTTP or RPC endpoints, and outgoing calls to databases, caches, other services, and third-party APIs. That is where tracing delivers the most value early.
This gets you most of the end-to-end picture with minimal effort.
Attributes are what make traces searchable and explainable. The goal is to add enough context to filter and compare without turning spans into data dumps.
If an attribute is high-cardinality, sensitive, or huge, it probably belongs in logs (with masking) rather than in traces.
Consistency is the difference between “searchable traces” and “random metadata.” OpenTelemetry provides standard attribute names so different teams and libraries speak the same language.
| Category | Attributes |
|---|---|
| HTTP | http.method, http.url, http.status_code |
| Database | db.system, db.statement, db.operation |
| RPC | rpc.system, rpc.method, rpc.service |
| Messaging | messaging.system, messaging.destination |
When everyone follows conventions, you can build shared dashboards and queries that work across all services.
Sampling should match the environment and the cost profile.
| Environment | Sampling Rate |
|---|---|
| Development | 100% |
| Staging | 50-100% |
| Production | 1-10% + 100% errors |
A practical production policy:

- Sample 100% of errors
- Sample 100% of requests slower than your latency threshold (for example, >1s)
- Sample 1-10% of everything else, weighted toward critical paths
Traces are most valuable close to the time of an incident. Keep high-value traces longer and normal traces shorter.
| Trace Type | Retention |
|---|---|
| Error traces | 30 days |
| Slow traces (>p99) | 14 days |
| Normal traces | 7 days |
Tune these based on how often you investigate historical incidents and how expensive storage is for your tracing backend.
Connecting the pillars turns observability from three separate tools into one coherent workflow:
- Include trace_id and span_id in structured logs so you can pull all request logs instantly
- Attach exemplars to latency metrics so a spike links to a representative slow trace
- Use consistent attribute names so queries and dashboards work across services

If you build these connections early, debugging becomes a repeatable process instead of an art form.