
Distributed Tracing

Last Updated: January 7, 2026

Ashish Pratap Singh

Let's say your metrics dashboard shows that p99 latency spiked from 200ms to 2 seconds. You know something is wrong, but where? The request touches 8 services. Is it the database? The payment gateway? Network latency between services? A slow cache lookup?

Logs can tell you what happened in each service, but piecing together the timeline across 8 services is tedious. Metrics show aggregate latency but not which component is slow. Correlation IDs link logs together but do not show timing.

Distributed tracing solves this by recording the journey of each request through your system, including exactly how long each step took. It shows you a timeline of every service call, database query, and external API request. When latency spikes, you can look at slow traces and immediately see where the time went.

In this chapter, you will learn:

  • How distributed tracing works
  • The concepts of traces, spans, and context propagation
  • How to instrument services for tracing
  • Sampling strategies for high-volume systems
  • Common tracing systems like Jaeger and Zipkin
  • When to use tracing vs logs vs metrics

This chapter builds directly on correlation IDs. Distributed tracing is correlation IDs with structure, timing, and visualization.

What Is Distributed Tracing?

A distributed trace is a record of a request's journey through a system. It captures every service, database call, and external API request, along with timing information.

From a single trace of, say, an order request, you can immediately see:

  • The request took 850ms total
  • The slowest component is the database write at 400ms
  • Auth, inventory, and notifications are relatively fast
  • The order service spent most of its time waiting for the database

Without tracing, finding this information would require correlating logs across 5 services and manually calculating timing differences.

Traces and Spans

Distributed tracing is built on two simple ideas: a trace tells the full story, and spans are the chapters. Once you understand these two concepts, trace visualizations stop looking mysterious and start feeling like a timeline you can reason about.

Traces

A trace represents the complete journey of a single request as it moves through your distributed system.

A trace typically includes:

  • Trace ID: Unique identifier for this request (like a correlation ID)
  • Start time: When the request began
  • Duration: Total time from start to finish
  • Spans: Collection of operations that make up the trace

A trace answers questions like:

  • “Which services were involved?”
  • “Where did the time go?”
  • “Which dependency caused the slowdown?”
  • “Where did the error originate?”

Spans

A span is a single unit of work inside a trace. Every meaningful operation can create a span:

  • an inbound HTTP request
  • a call to another service
  • a database query
  • a cache lookup
  • publishing a message
  • running a background job step

Spans form a tree. The first span is the root (often the API gateway or edge service). Downstream work becomes child spans, and deeper calls become grandchildren.

Here is a simplified example trace for an order request (the child timings are illustrative):

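    API Gateway  POST /orders ......................... 850ms
      └─ Order Service  create order .................. 780ms
           ├─ Auth Service  validate token ............. 40ms
           ├─ Inventory Service  check stock ........... 60ms
           ├─ Database  INSERT order .................. 400ms
           └─ Notification Service  enqueue email ...... 30ms
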
This one tree tells you a lot:

  • the request took 850ms end-to-end
  • most of the time is inside Order Service
  • the slowest child span is a database write (400ms)

Span Anatomy

Every span carries enough metadata to answer: what happened, where, and how long it took.

| Field | Description | Example |
|---|---|---|
| Trace ID | Links span to its trace | abc-123 |
| Span ID | Unique identifier for this span | span-456 |
| Parent Span ID | The span that created this one | span-123 |
| Operation Name | What work was done | HTTP GET /users |
| Service Name | Which service ran this span | user-service |
| Start Time | When the span began | 2024-01-15T10:23:45.123Z |
| Duration | How long it took | 45ms |
| Tags/Attributes | Key-value metadata | http.status=200 |
| Logs/Events | Timestamped events within the span | Error messages |
| Status | Success or error | OK, ERROR |

A useful way to think about attributes is: they make spans filterable. Without attributes, you can only look at one trace at a time. With attributes, you can ask questions like "show me slow traces for this db.statement" or "show me only traces where http.status_code is 500."

Context Propagation

Distributed tracing only works if trace context flows through every hop. It is the same idea as propagating a correlation ID, but more structured.

Instead of passing just one ID, you pass enough information to rebuild the full span tree across services.

W3C Trace Context

The most common standard for HTTP-based propagation is W3C Trace Context. It defines two headers:

  • traceparent: the core identifiers needed to connect spans
  • tracestate: optional vendor-specific metadata

traceparent format

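A traceparent header packs four dash-separated fields into a single value. For example (the IDs below are illustrative):

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
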
What each piece means:

  • Version: protocol version (usually 00)
  • Trace ID: identifies the trace (shared across all spans in the request)
  • Parent span ID: the span that created this outgoing request
  • Flags: sampling decision and other trace options (the last 01 often means “sampled”)

tracestate format

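For example (the vendor keys and values are illustrative):

    tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7
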
This is where vendors can store extra data (for example, internal routing, tenant information, or sampling details). Your application usually treats it as opaque and just forwards it.

Propagation Flow

Here is how a trace typically forms as a request moves through the system: the first service creates the root span, and each downstream service continues the same trace by extracting the incoming context and starting child spans of its own.

What every service must do

To keep traces connected, each service must follow the same loop:

  1. Extract trace context from the incoming request
  2. Start a new span as a child of the incoming parent span
  3. Inject updated context into every outgoing call (HTTP, gRPC, messaging, database clients)
  4. Close the span when the operation completes (success or error)

If any service forgets step 3, downstream services will start new traces and your end-to-end view will be broken.
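
A minimal sketch of this loop using the OpenTelemetry Python SDK (the request object, span name, and downstream URL are illustrative):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
import requests

tracer = trace.get_tracer("order-service")

def handle_request(request):
    # 1. Extract trace context from the incoming request headers
    parent_ctx = extract(request.headers)

    # 2. Start a new span as a child of the incoming parent span
    with tracer.start_as_current_span("POST /orders", context=parent_ctx):
        outgoing_headers = {}
        # 3. Inject the updated context into every outgoing call
        inject(outgoing_headers)
        requests.post("http://inventory-service/reserve", headers=outgoing_headers)
    # 4. The span is closed automatically when the "with" block exits
```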

Instrumenting for Tracing

To get useful traces, you need instrumentation at the right places. There are two ways to add it: automatic instrumentation for common libraries, and manual instrumentation for your own business logic.

A good rule is to start with automatic instrumentation to get broad coverage, then add manual spans only where they add clarity.

Automatic Instrumentation

Automatic instrumentation uses libraries, agents, or SDKs that hook into common frameworks and clients. You get spans without writing much tracing code yourself.

How it works

  • Your application runs normally
  • A tracing agent or instrumentation library wraps common operations
  • Spans are created automatically, with trace context propagated across calls

What usually gets auto-instrumented

  • Inbound HTTP server requests (your API entry points)
  • Outbound HTTP client calls (service-to-service calls)
  • Database queries (Postgres/MySQL via JDBC, MongoDB, etc.)
  • Cache calls (Redis, Memcached)
  • Message queues (produce and consume operations)
  • gRPC client and server calls

Why auto-instrumentation is useful

  • minimal code changes
  • consistent span naming and attributes across services
  • fast way to get baseline end-to-end traces

Where it falls short

Auto-instrumentation can tell you that a request was slow and which dependency was slow, but it may not explain which part of your business logic was responsible. That is where manual spans help.

Manual Instrumentation

Manual instrumentation is when you explicitly create spans around the parts of your code that matter to you. This is how you turn a trace from “a list of RPCs” into something that matches how your system actually works.

Example pseudocode

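Here is one way it might look with the OpenTelemetry Python SDK (charge_payment and reserve_inventory are placeholder functions):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service")

def process_order(order):
    # Put the span around the operation you want to understand
    with tracer.start_as_current_span("process_order") as span:
        # A few useful attributes for filtering later
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.item_count", len(order["items"]))
        try:
            reserve_inventory(order)   # placeholder business logic
            charge_payment(order)      # placeholder business logic
        except Exception as exc:
            # Record the exception and mark the span as failed
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
    # Leaving the "with" block ends the span, even on failure
```
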
This pattern is worth copying:

  • put the span around the operation you want to understand
  • add a few useful attributes
  • record exceptions and set status on errors
  • always end the span, even on failure

What manual spans are best for

  • marking key stages in a workflow (validate, reserve inventory, charge, confirm)
  • tracking slow internal operations that do not show up as external calls
  • adding business context that you want to filter by later (tenant, feature flag, order size bucket)

What to Instrument

If you try to instrument everything, you will create noisy traces and spend more time maintaining spans than using them. Prioritize the places that give the biggest debugging value.

| Category | Examples | Priority |
|---|---|---|
| Service boundaries | HTTP endpoints, gRPC methods | Critical |
| External calls | Databases, caches, APIs | Critical |
| Message processing | Queue consumers, event handlers | High |
| Business operations | Order creation, payment processing | Medium |
| Internal computations | Complex algorithms (if slow) | Low |

Start with service boundaries and external calls. These give you 80% of the value. Add business operations as needed.

Sampling Strategies

Tracing every request sounds great until you see the bill. Traces are rich, detailed, and expensive to store. At high traffic, sampling is not optional. It is how you keep tracing useful without turning it into a cost and storage problem.

Why Sample?

Imagine a service doing 10,000 requests per second. If each request produces around 10 spans and each span is roughly 500 bytes, that is about 50 MB of trace data every second, or on the order of 4 TB per day, before indexing and replication.

Sampling is about choosing the right balance: enough visibility to debug issues, but not so much data that the tracing system collapses under its own weight.

Sampling Strategies

1. Head-Based Sampling

Head-based sampling decides whether to sample at the start of the trace (at the first span).

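For example, a ratio-based head sampler in the OpenTelemetry Python SDK might be configured like this (the 10% rate is illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample roughly 10% of new traces at the root;
# child spans follow the parent's sampling decision.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```
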
Pros

  • simple to implement
  • predictable storage and cost
  • consistent decision for every span in the trace

Cons

  • can miss rare but important traces (errors and slow requests)
  • during incidents, you might sample the wrong things

Head-based sampling is a solid default when you want predictable cost and you are okay with occasionally missing edge cases.

2. Tail-Based Sampling

Tail-based sampling collects span data first, then decides what to keep at the end of the trace.

A typical policy keeps traces that are:

  • slow (duration above a threshold)
  • errors (status = ERROR)
  • part of a critical path (checkout, payments)
  • plus a random sample of the remaining “normal” traffic

Pros

  • keeps the traces you care about most
  • very effective during incidents because errors and slow traces are retained

Cons

  • requires buffering spans until the decision is made
  • more moving parts and more operational complexity
  • storage cost becomes variable, which can surprise teams

Tail-based sampling is ideal when debugging quality matters more than predictable cost, and you can support the extra infrastructure.

3. Priority-Based Sampling

Priority-based sampling is a practical middle ground. You sample different traffic at different rates.

| Request Type | Sample Rate |
|---|---|
| Errors | 100% |
| Slow requests (>1s) | 100% |
| Critical paths (checkout) | 50% |
| Normal requests | 1-10% |
| Health checks | 0.1% |

This approach is easy to explain, easy to tune, and usually gives good results in production.
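
A toy sketch of such a policy as a decision function (thresholds and rates are illustrative; real systems usually implement this in a custom sampler or in the collector, since errors and latency are only known once the trace finishes):

```python
import random

def keep_trace(route: str, status_code: int, duration_ms: float) -> bool:
    """Decide whether to keep a finished trace (toy priority policy)."""
    if status_code >= 500:               # errors: always keep
        return True
    if duration_ms > 1000:               # slow requests: always keep
        return True
    if route.startswith("/checkout"):    # critical path: keep half
        return random.random() < 0.50
    if route == "/health":               # health checks: keep almost none
        return random.random() < 0.001
    return random.random() < 0.05        # normal traffic: small baseline
```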

Sampling Trade-offs

| Strategy | Storage Cost | Captures Errors | Complexity |
|---|---|---|---|
| No sampling | Very high | Yes | Low |
| Head-based | Predictable | Sometimes | Low |
| Tail-based | Variable | Yes | High |
| Priority-based | Medium | Yes | Medium |

A good default for most teams is: head-based sampling + always sample errors, and then evolve toward tail-based sampling as traffic grows and debugging needs increase.

Tracing Systems

Once you have sampled traces, you need a backend to collect, store, and visualize them.

Jaeger

Features:

  • Open source (CNCF project)
  • Native OpenTelemetry support
  • Multiple storage backends (Cassandra, Elasticsearch, memory)
  • Service dependency graphs
  • Adaptive sampling

Zipkin

Zipkin has a similar architecture to Jaeger and was originally developed at Twitter.

Features:

  • Open source, mature project
  • Simple setup
  • Multiple storage backends
  • B3 propagation format (widely supported)

Managed Services

| Service | Provider | Integration |
|---|---|---|
| AWS X-Ray | AWS | Deep AWS integration |
| Google Cloud Trace | GCP | GCP integration |
| Azure Monitor | Microsoft | Azure integration |
| Datadog APM | Datadog | Full observability platform |
| Honeycomb | Honeycomb | High-cardinality analysis |

Managed services reduce operational work but can become expensive at high volume, especially if you keep too many traces or retain them for too long.

OpenTelemetry

OpenTelemetry is becoming the standard way to instrument traces (and increasingly metrics and logs) without locking into a single vendor.

A typical setup looks like this: your services use the OpenTelemetry SDK (or auto-instrumentation agents), export spans to an OpenTelemetry Collector, and the Collector forwards them to whichever backend you choose, such as Jaeger, Zipkin, or a managed service.

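A minimal sketch of configuring the Python SDK to export spans over OTLP (the service name and collector endpoint are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service and send spans to a local OpenTelemetry Collector
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
```
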
OpenTelemetry provides:

  • Vendor-neutral APIs and SDKs
  • Auto-instrumentation for many languages
  • Exporters for any backend
  • Unified approach for traces, metrics, and logs

Analyzing Traces

Traces are most valuable when you are debugging a real incident or chasing a performance regression. Metrics tell you something is wrong. Traces tell you where the time went and which dependency is responsible.

Finding Slow Traces

The first step is to filter for outliers. Most tracing tools let you query by duration, service, operation name, status, and tags.

Example

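For example, a search in Jaeger or a similar UI might filter on something like this (exact query syntax depends on the backend):

    service = "order-service" AND operation = "POST /orders" AND duration > 2s
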
A good habit is to narrow your search further:

  • focus on a specific operation (operation = "POST /orders")
  • filter to errors (status = ERROR)
  • filter to a time window during the incident

This makes it easier to find traces that represent the problem, not normal background noise.

Identifying Bottlenecks

Once you open a slow trace, look for the span that dominates the timeline. Most UIs show it visually, but you can reason about it as a breakdown of total time.

Example trace breakdown

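An illustrative breakdown of a slow request (the timings are made up for the example):

    API Gateway  POST /orders ..................... 2,100ms
      └─ Order Service ............................ 2,050ms
           ├─ Auth Service ........................... 40ms
           ├─ Database  SELECT orders .............. 1,900ms
           └─ Notification Service ................... 25ms
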
What this tells you:

  • the gateway is slow because a downstream service is slow
  • the order service is slow because a database call dominates
  • the database span is the real bottleneck, not auth or notification

This is why tracing is so effective. You stop guessing and start following the time.

Comparing fast vs slow traces

A single slow trace is useful. Comparing slow traces with normal ones is where the insight usually appears.

Example comparison

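For example, put the database span from a fast trace next to the same span from a slow one (values illustrative):

    Fast trace:  db.statement = "SELECT * FROM orders WHERE id = ?"                  12ms
    Slow trace:  db.statement = "SELECT * FROM orders WHERE customer_email = ?"   1,800ms (~50,000 rows examined)
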
The slow query is scanning 50,000 rows instead of looking up by primary key.

Service Dependency Maps

Tracing systems can build a service dependency map by looking at who calls whom.

This helps you see:

  • which services depend on which
  • the typical request flows through the system
  • where failures can cascade
  • potential choke points and single points of failure

Connecting Traces to Logs and Metrics

Each pillar is useful on its own, but the real power shows up when they are connected. You want to move smoothly from a high-level signal to the exact request and then to the root cause.

A good observability setup lets you follow one flow:

metrics → traces → logs

Metrics tell you something changed. Traces show where it changed. Logs explain why it changed.

Traces to Logs

The simplest and most effective connection is to include trace_id (and ideally span_id) in every log entry. Once you do that, every trace becomes a clickable thread that leads to the exact log lines generated by that request.

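A minimal sketch of attaching the current trace and span IDs to a structured log line with the OpenTelemetry Python SDK (the logger setup and field names are illustrative):

```python
import json
import logging

from opentelemetry import trace

logger = logging.getLogger("order-service")

def log_event(message: str, **fields):
    # Pull the IDs of the currently active span and format them as hex strings
    ctx = trace.get_current_span().get_span_context()
    fields.update(
        message=message,
        trace_id=format(ctx.trace_id, "032x"),
        span_id=format(ctx.span_id, "016x"),
    )
    logger.info(json.dumps(fields))

# Usage: log_event("payment charged", order_id="ord-789", amount=42.50)
```
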
Now the workflow is straightforward:

  • open a slow or failed trace
  • copy or click the trace_id
  • instantly filter logs across all services for that same request

This is especially important in microservices, where a single user action can generate logs in five different services.

Traces to Metrics

Metrics give you trends, but they do not tell you which request caused a spike. Exemplars bridge that gap by attaching a trace reference to a specific metric sample.

Think of an exemplar as “a representative trace for this data point.”

When you see a latency spike in metrics, the exemplar links to an example slow trace.

The debugging flow

Here is what “connected observability” looks like during an incident:

  1. Metrics alert you to a problem: p99 latency for /orders jumped from 500ms to 5s.
  2. Exemplars link you to example traces: You click an exemplar and open a slow trace from the spike window.
  3. Traces show where time is spent: The trace reveals that the database span dominates the request timeline.
  4. Logs explain why: Logs for the same trace_id show repeated retries and a slow query caused by a missing index.

Without the connections, each step takes longer and involves guesswork. With the connections, you follow a trail of evidence.

Best Practices

1. Instrument at Service Boundaries First

Start where requests cross boundaries. That is where tracing delivers the most value early.

  • Priority 1: inbound HTTP and gRPC endpoints
  • Priority 2: outbound HTTP and gRPC calls
  • Priority 3: database and cache operations
  • Priority 4: message queue operations
  • Priority 5: internal business logic spans

This gets you most of the end-to-end picture with minimal effort.

2. Add Meaningful Attributes

Attributes are what make traces searchable and explainable. The goal is to add enough context to filter and compare without turning spans into data dumps.

If an attribute is high-cardinality, sensitive, or huge, it probably belongs in logs (with masking) rather than in traces.
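
For example, attributes that work well on a checkout span versus values better left to logs (the names and helper function are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def record_checkout(order: dict, payment_method: str, customer_tier: str):
    with tracer.start_as_current_span("checkout") as span:
        # Good: small, low-cardinality values you will want to filter by
        span.set_attribute("order.item_count", len(order["items"]))
        span.set_attribute("payment.method", payment_method)
        span.set_attribute("customer.tier", customer_tier)
        # Avoid: full payloads, emails, tokens, or other huge/sensitive values;
        # those belong in logs (with masking), not in span attributes.
```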

3. Use Semantic Conventions

Consistency is the difference between “searchable traces” and “random metadata.” OpenTelemetry provides standard attribute names so different teams and libraries speak the same language.

| Category | Attributes |
|---|---|
| HTTP | http.method, http.url, http.status_code |
| Database | db.system, db.statement, db.operation |
| RPC | rpc.system, rpc.method, rpc.service |
| Messaging | messaging.system, messaging.destination |

When everyone follows conventions, you can build shared dashboards and queries that work across all services.

4. Sample Appropriately

Sampling should match the environment and the cost profile.

| Environment | Sampling Rate |
|---|---|
| Development | 100% |
| Staging | 50-100% |
| Production | 1-10% + 100% of errors |

A practical production policy:

  • keep 100% of error traces
  • keep 100% of very slow traces (for example above p99 or above a fixed threshold)
  • keep a small baseline of normal traffic for comparisons

5. Set Retention Based on Value

Traces are most valuable close to the time of an incident. Keep high-value traces longer and normal traces shorter.

| Trace Type | Retention |
|---|---|
| Error traces | 30 days |
| Slow traces (>p99) | 14 days |
| Normal traces | 7 days |

Tune these based on how often you investigate historical incidents and how expensive storage is for your tracing backend.

Summary

Connecting the pillars turns observability from three separate tools into one coherent workflow:

  • Traces → logs: include trace_id and span_id in structured logs so you can pull all request logs instantly
  • Metrics → traces: use exemplars so metric spikes link to real traces
  • Follow the evidence: metrics detect, traces locate, logs explain

If you build these connections early, debugging becomes a repeatable process instead of an art form.