Last Updated: January 7, 2026
Logs tell you what happened to specific requests. But what if you need to know how the system is performing overall? Are response times getting slower? Is the error rate increasing? How many requests are we handling per second?
These questions require metrics: numerical measurements collected over time.
Metrics answer questions like "how many?" and "how fast?" across thousands or millions of requests. They power dashboards, drive alerts, and enable capacity planning. Good instrumentation gives you early warning when things degrade, often before users notice.
In this chapter, you will learn:

- The four metric types (counters, gauges, histograms, summaries) and when to use each
- The four golden signals every service should track
- Naming conventions, labels, and how to keep cardinality under control
- What to instrument first: endpoints, dependencies, resource pools, and business outcomes
- How Prometheus collects metrics and how PromQL queries them
This builds on the observability foundation we established earlier. Metrics complement logs by providing the aggregate view that logs cannot offer.
To understand why metrics are indispensable, consider an e-commerce platform during a flash sale.
Logs show individual events. To understand the flash sale's impact, you would need to aggregate 50,000 log entries. Metrics give you instant visibility: request rate tripled, latency increased by 40%, error rate is still acceptable, CPU is climbing.
| Aspect | Metrics | Logs |
|---|---|---|
| Data type | Numeric time series | Text events |
| Question answered | How much? How many? | What happened? |
| Storage efficiency | Very efficient (numbers) | Less efficient (text) |
| Query style | Aggregate, graph | Search, filter |
| Retention | Months to years | Days to weeks |
| Alerting | Primary use case | Secondary use case |
| Debugging | Find the problem | Understand the problem |
Both are essential. Metrics alert you that something is wrong. Logs and traces help you understand why. Think of metrics as the vital signs monitor in a hospital: it tells doctors instantly when something needs attention, but they still need tests and exams to diagnose the cause.
There are four fundamental metric types, each designed for different kinds of measurements. Understanding when to use each one is essential for effective instrumentation.
Counters are monotonically increasing values. They only go up (or reset to zero on restart).
Use for:

- Events that only accumulate: requests served, errors, tasks completed, bytes sent

Common queries:

- `rate(http_requests_total[5m])` → requests per second
- `increase(http_requests_total[1h])` → requests in last hour

Examples: `http_requests_total`, `errors_total`, `bytes_sent_total`
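As a concrete sketch, here is what a counter looks like using the Python `prometheus_client` library (an assumed choice of client; the idea is the same in any language):

```python
from prometheus_client import Counter

# A counter only ever goes up; labels add dimensions such as method and status.
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "status"],
)

def record_request(method: str, status_code: int) -> None:
    HTTP_REQUESTS.labels(method=method, status=str(status_code)).inc()
```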
Gauges are values that can go up or down. They represent current state.
Use for:

- Current state that rises and falls: memory usage, active connections, queue depth, in-flight requests

Queries:

- `memory_usage_bytes` → current value
- `max_over_time(memory_usage_bytes[1h])` → peak over the last hour

Examples: `memory_usage_bytes`, `active_connections`, `queue_depth`
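A minimal gauge sketch with the same assumed `prometheus_client` library; tracking in-flight requests is an illustrative use:

```python
from prometheus_client import Gauge

# A gauge reflects current state and can move in both directions.
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being processed")

def handle_request(do_work) -> None:
    IN_FLIGHT.inc()        # one more request in flight
    try:
        do_work()
    finally:
        IN_FLIGHT.dec()    # done, whether it succeeded or failed
```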
Histograms measure the distribution of values by counting observations in buckets.
Use for:

- Distributions where percentiles matter: request latency, response sizes, queue wait times

Queries:

- `histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))` → p99 latency
- `rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])` → average duration

Examples: `request_duration_seconds`, `response_size_bytes`
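A histogram sketch, again assuming `prometheus_client`; the bucket boundaries below are illustrative and should match your expected latency range:

```python
import time
from prometheus_client import Histogram

# Observations are counted into buckets; percentiles are computed later
# at query time with histogram_quantile.
REQUEST_DURATION = Histogram(
    "request_duration_seconds",
    "Request latency in seconds",
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

def timed(do_work):
    start = time.monotonic()
    try:
        return do_work()
    finally:
        REQUEST_DURATION.observe(time.monotonic() - start)
```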
Summaries are similar to histograms but calculate percentiles on the client side.
| Aspect | Histogram | Summary |
|---|---|---|
| Percentile calculation | At query time (server) | At collection time (client) |
| Aggregation | Can aggregate across instances | Cannot aggregate |
| Storage | Fixed buckets | Configurable quantiles |
| Use case | Most use cases | When you need exact percentiles |
Use histograms in most cases. They are easier to operate, more flexible, and they aggregate cleanly across instances.
With so many possible metrics to collect, how do you decide what to monitor?
Google's SRE team distilled years of experience into four fundamental metrics that every service should track. If you monitor nothing else, monitor these:
How long requests take to process.
What to measure:

- How long requests take to serve, with successful and failed requests tracked separately
- Percentiles (p50, p95, p99), not just the average

Why percentiles matter: the average hides outliers. A p99 of two seconds means one request in a hundred waits at least that long, even if the mean looks healthy.

Metrics: `http_request_duration_seconds` (histogram), labeled by endpoint.
How much demand is being placed on the system.
What to measure:

- Requests per second for APIs, queries per second for databases, messages per second for queues

Metrics: `http_requests_total` (counter); `rate()` turns it into requests per second.
The rate of requests that fail.
What to measure:

- Failed requests, both as an absolute rate and as a fraction of total traffic
Important: Distinguish between client errors (4xx) and server errors (5xx). A spike in 4xx might be a client bug or attack, not a service problem.
Metrics: `http_requests_total{status="..."}` (counter); error rate is the ratio of 5xx responses to all responses.
How "full" the service is and how close to capacity.
What to measure:

- Utilization of whatever can run out: CPU, memory, disk, connection pools, thread pools, queue depth

Metrics: `cpu_usage_percent`, `memory_usage_bytes`, `connection_pool_active` (gauges).
Good metric names are critical for usability. When you have thousands of metrics across dozens of services, consistent naming is the difference between quickly finding what you need and frustrating searches through documentation.
Prometheus conventions are widely adopted and worth following.
`<namespace>_<name>_<unit>`

Examples: `http_request_duration_seconds`, `queue_messages_total`, `memory_usage_bytes`
| Component | Description | Examples |
|---|---|---|
| Namespace | Application or subsystem | http, database, queue |
| Name | What is measured | request_duration, connections, messages |
| Unit | Measurement unit | seconds, bytes, total, percent |
Use base units without prefixes:
| Measure | Use | Avoid |
|---|---|---|
| Time | seconds | milliseconds |
| Size | bytes | megabytes |
| Rate | _total suffix | per_second |
| Ratio | _ratio or _percent | ratio without suffix |
Labels add dimensions to metrics:

- `http_requests_total{method="GET", status="200", endpoint="/orders"}` lets you slice the same metric by method, status, or endpoint.

Label best practices:

- Use the same label names across services (`method`, `status`, `service`, `environment`, `region`)
- Keep the set of possible values small and bounded
- Never use unbounded values (user IDs, request IDs) as labels
Cardinality is the number of unique time series a metric produces. It is easy to create thousands of series by accident, and once you hit millions, your metrics system slows down, gets expensive, or falls over.
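To make the multiplication concrete, consider a hypothetical request counter labeled by endpoint, method, and status: 50 endpoints × 8 methods × 10 status codes is already 4,000 series. Add a `user_id` label with one million users and it becomes 4 billion.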
That is the kind of number that can crash a metrics backend.
Avoid labels with unbounded values:

- `user_id` (millions of users)
- `request_id` (unique per request)
- `email` (unique per user)
- `timestamp` (effectively infinite)

Prefer labels with small, bounded value sets:

- `method` (GET, POST, …)
- `status` (200, 500, …)
- `service` (order, payment, …)
- `environment` (prod, staging, …)
- `region` (us-east, eu-west, …)

| Label Type | Values | Safe? |
|---|---|---|
| HTTP method | ~10 | Yes |
| Status code | ~20 | Yes |
| Endpoint | 10-100 | Usually |
| Region | ~10 | Yes |
| Environment | 2-5 | Yes |
| User ID | Millions | No |
| Request ID | Billions | No |
| Email | Millions | No |
When you truly need per-user or per-request visibility, metrics are usually the wrong tool.
Instrumentation is the practice of adding metrics collection to your code so you can measure how the system behaves in production.
The goal is not to measure everything. It is to measure the parts that matter when you are debugging incidents, tuning performance, or validating business outcomes.
Instrument these areas first:

- API endpoints
- External dependencies (databases, caches, third-party APIs)
- Resource pools (connections, threads, queues)
- Business outcomes
Every API endpoint should expose the golden signals: latency, traffic, errors.
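To make this concrete, here is a minimal sketch using the Python `prometheus_client` library (the metric names, label names, and the `handler` interface are illustrative assumptions, not a prescribed API):

```python
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["method", "endpoint"],
)

def instrumented(method: str, endpoint: str, handler) -> int:
    """Run handler() and record traffic, errors (via status), and latency."""
    start = time.monotonic()
    status = 500                      # assume failure until we know otherwise
    try:
        status = handler()            # handler returns an HTTP status code
        return status
    finally:
        REQUESTS.labels(method=method, endpoint=endpoint, status=str(status)).inc()
        LATENCY.labels(method=method, endpoint=endpoint).observe(time.monotonic() - start)
```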
Every call you do not control should be instrumented for latency and errors.
Database

- `db_query_duration_seconds{operation="select", table="orders"}`
- `db_errors_total{operation="select", error="timeout"}`

Cache

- `cache_request_duration_seconds{operation="get"}`
- `cache_hits_total`, `cache_misses_total`

External APIs

- `external_api_duration_seconds{service="payment-gateway"}`
- `external_api_errors_total{service="payment-gateway", error="timeout"}`

Capture dependency metrics at the client boundary (the code making the call), because that is where you can measure timeouts, retries, and error handling behavior.
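As a sketch of what that looks like in code, here is a thin wrapper around database calls, assuming the Python `prometheus_client` library and reusing the metric names above:

```python
import time
from prometheus_client import Counter, Histogram

DB_QUERY_DURATION = Histogram(
    "db_query_duration_seconds", "Database query latency in seconds",
    ["operation", "table"],
)
DB_ERRORS = Counter(
    "db_errors_total", "Database errors",
    ["operation", "error"],
)

def timed_query(operation: str, table: str, run_query):
    """Wrap a database call so latency and failures are always recorded."""
    start = time.monotonic()
    try:
        return run_query()
    except TimeoutError:
        DB_ERRORS.labels(operation=operation, error="timeout").inc()
        raise
    finally:
        DB_QUERY_DURATION.labels(operation=operation, table=table).observe(
            time.monotonic() - start
        )
```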
Resource saturation is a top cause of cascading failures. Track utilization of anything that can get exhausted.
Connection pool
- `connection_pool_size{pool="database"}` (gauge)
- `connection_pool_active{pool="database"}` (gauge)
- `connection_pool_waiting{pool="database"}` (gauge)

Thread pool

- `thread_pool_size{pool="http-workers"}` (gauge)
- `thread_pool_active{pool="http-workers"}` (gauge)
- `thread_pool_queue_depth{pool="http-workers"}` (gauge)

Technical metrics tell you whether the system is running. Business metrics tell you whether the product is working.
E-commerce
- `orders_created_total`
- `orders_completed_total`
- `order_value_dollars_sum`
- `cart_abandonment_total`

Payments

- `payments_processed_total{method="credit", status="success"}`
- `payment_amount_dollars_sum`
- `refunds_total`

These are the metrics that answer "Are we making money?" and "Is the user journey healthy?", not just "Are servers alive?"
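A brief sketch of recording a business event, again assuming `prometheus_client` (the names here are illustrative and should match whatever your team standardizes on):

```python
from prometheus_client import Counter

ORDERS_CREATED = Counter("orders_created_total", "Orders created")
ORDER_VALUE = Counter("order_value_dollars_total", "Total order value in dollars")

def record_order(value_dollars: float) -> None:
    ORDERS_CREATED.inc()
    ORDER_VALUE.inc(value_dollars)   # counters can increase by any positive amount
```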
Prometheus is one of the most widely used open-source metrics systems. If you understand how it works, instrumentation decisions become much easier.
A typical Prometheus setup looks like this:

- Each service exposes a `/metrics` endpoint
- Prometheus scrapes every target on a schedule and stores the samples as time series
- Dashboards and alerts query Prometheus rather than the services themselves

Prometheus uses a pull model. It pulls metrics from services rather than services pushing metrics to a collector.
This changes how you think about discovery, failure detection, and load control.
| Aspect | Pull (Prometheus) | Push (StatsD, Datadog) |
|---|---|---|
| Discovery | Prometheus finds targets | Services find collector |
| Failure detection | "No metrics" = service down | Cannot distinguish down vs not pushing |
| Network config | Prometheus needs access to targets | Targets need access to collector |
| Load | Prometheus controls scrape rate | Services control push rate |
The main exception is short-lived jobs that may finish before they can ever be scraped. Prometheus handles this case with a separate component called a “push gateway,” but the default mental model is still pull.
Services expose a /metrics endpoint in Prometheus’ text format. Prometheus scrapes it on a schedule and ingests whatever it sees.
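With the Python `prometheus_client` library (one possible client among many), exposing that endpoint is a single call:

```python
from prometheus_client import start_http_server

# Starts a background HTTP server exposing the process's registered metrics
# so Prometheus can scrape them; port 8000 is illustrative.
start_http_server(8000)
```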
A typical endpoint includes HELP and TYPE metadata plus the actual samples:
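An illustrative sample (metric names and values are made up):

```
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
# HELP memory_usage_bytes Resident memory in bytes.
# TYPE memory_usage_bytes gauge
memory_usage_bytes 4.56e+08
```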
Prometheus Query Language (PromQL) is how you slice and aggregate time series.
Counters are totals, so for “per second” you usually want a rate, for example `rate(http_requests_total[5m])`.
Histograms store bucket counts, and PromQL computes percentiles at query time with `histogram_quantile`, for example `histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))` for p99 latency.
This is why histograms are so useful: you can compute p95, p99, and other percentiles across all instances and services without changing your instrumentation.
Good metrics feel boring in the best way. They are consistent, predictable, and easy to use across services. Most problems with metrics do not come from missing data. They come from messy naming, inconsistent labels, and high cardinality that makes the system expensive and hard to query.
All services should use the same label names:

- `service`, `environment`, `region` to identify where a metric comes from
- `method`, `status`, `endpoint` for HTTP metrics

If one service uses `env` and another uses `environment`, every cross-service query becomes harder than it needs to be.
Technical metrics answer: “Is the system healthy?”
Business metrics answer: “Is the product working?”
Keep them separate so each audience gets what they need without mixing concerns.
Latency is not a single number. The average can look fine while a small percentage of users have a terrible experience. Histograms let you compute percentiles at query time, which is exactly what you want for p95 and p99.
Errors often have very different latency profiles than successful requests.
If you mix them together, your latency charts become misleading.
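One way to keep them separate, sketched with the assumed `prometheus_client` library, is to label the latency histogram by outcome:

```python
import time
from prometheus_client import Histogram

# The label name "outcome" is illustrative; a "status" label works equally well.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["endpoint", "outcome"],
)

def observe_request(endpoint: str, start: float, failed: bool) -> None:
    outcome = "error" if failed else "success"
    REQUEST_DURATION.labels(endpoint=endpoint, outcome=outcome).observe(
        time.monotonic() - start
    )
```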
This makes it possible to answer questions like:

- Are failed requests failing fast, or are they slow timeouts?
- Did p99 latency for successful requests actually change, or is the spike driven entirely by errors?
Metrics give you the quantitative view of system behavior:

- Counters for things that only accumulate, gauges for current state, histograms for distributions
- Aggregate numbers that power dashboards, alerts, and capacity planning
- The early warning that something is wrong; logs and traces then explain why
The four golden signals are the minimum set every service should have:

- Latency: how long requests take, measured with percentiles, not averages
- Traffic: how much demand is placed on the system
- Errors: the rate of failed requests, with 4xx and 5xx distinguished
- Saturation: how close the service is to capacity
Key practices to remember:

- Follow consistent naming (`<namespace>_<name>_<unit>`) and base units (seconds, bytes)
- Use the same label names across services and keep label values bounded
- Watch cardinality: never label by user ID, request ID, or anything else unbounded
- Prefer histograms for latency so percentiles can be computed at query time
- Separate error latency from success latency, and business metrics from technical ones