Metrics & Instrumentation

Last Updated: January 7, 2026

Ashish Pratap Singh

Logs tell you what happened to specific requests. But what if you need to know how the system is performing overall? Are response times getting slower? Is the error rate increasing? How many requests are we handling per second?

These questions require metrics: numerical measurements collected over time.

Metrics answer questions like "how many?" and "how fast?" across thousands or millions of requests. They power dashboards, drive alerts, and enable capacity planning. Good instrumentation gives you early warning when things degrade, often before users notice.

In this chapter, you will learn:

  • Types of metrics and when to use each
  • The four golden signals every service should track
  • How to instrument your code effectively
  • Metric naming conventions and best practices
  • Cardinality and why it matters for scalability
  • Common metrics infrastructure like Prometheus

This builds on the observability foundation we established earlier. Metrics complement logs by providing the aggregate view that logs cannot offer.

Why Metrics Matter

To understand why metrics are indispensable, consider an e-commerce platform during a flash sale.

What Logs Show:

  • Order 123 completed in 250ms
  • Order 124 completed in 280ms
  • Order 125 completed in 310ms
  • ... 50,000 more orders ...

What Metrics Show:

  • Request rate: 2,500/sec ↑
  • p99 latency: 450ms ↑
  • Error rate: 0.3% ↑
  • CPU: 78% ↑

Logs show individual events. To understand the flash sale's impact, you would need to aggregate 50,000 log entries. Metrics give you instant visibility: request rate tripled, latency increased by 40%, error rate is still acceptable, CPU is climbing.

Metrics vs Logs Comparison

Aspect | Metrics | Logs
Data type | Numeric time series | Text events
Question answered | How much? How many? | What happened?
Storage efficiency | Very efficient (numbers) | Less efficient (text)
Query style | Aggregate, graph | Search, filter
Retention | Months to years | Days to weeks
Alerting | Primary use case | Secondary use case
Debugging | Find the problem | Understand the problem

Both are essential. Metrics alert you that something is wrong. Logs and traces help you understand why. Think of metrics as the vital signs monitor in a hospital: it tells doctors instantly when something needs attention, but they still need tests and exams to diagnose the cause.

Types of Metrics

There are four fundamental metric types, each designed for different kinds of measurements. Understanding when to use each one is essential for effective instrumentation.

Counters

Counters are monotonically increasing values. They only go up (or reset to zero on restart).

Use for:

  • Total requests handled
  • Errors occurred
  • Bytes transferred
  • Tasks completed

Common queries

  • Rate: rate(http_requests_total[5m]) → requests per second
  • Increase: increase(http_requests_total[1h]) → requests in last hour

Examples:
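
A minimal sketch with the Python prometheus_client library; the metric and label names here are illustrative, not from the original:

from prometheus_client import Counter

# Total HTTP requests handled, labeled by method and status
http_requests_total = Counter(
    "http_requests_total",
    "Total number of HTTP requests handled",
    ["method", "status"],
)

# Counters only go up: increment on every request
http_requests_total.labels(method="GET", status="200").inc()

# Increment by more than one, for example when counting bytes
bytes_sent_total = Counter("bytes_sent_total", "Total bytes sent to clients")
bytes_sent_total.inc(4096)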

Gauges

Gauges are values that can go up or down. They represent current state.

Use for:

  • Current memory usage
  • Active connections
  • Queue depth
  • Temperature
  • Thread pool size

Queries:

  • Current value: memory_usage_bytes
  • Max over time: max_over_time(memory_usage_bytes[1h])

Examples:
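
A sketch with prometheus_client (illustrative names):

from prometheus_client import Gauge

# Currently active database connections: this value moves up and down over time
db_connections_active = Gauge("db_connections_active", "Currently active database connections")

db_connections_active.inc()    # a connection was opened
db_connections_active.dec()    # a connection was closed
db_connections_active.set(42)  # or set the current value directly, e.g. from a pool snapshot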

Histograms

Histograms measure the distribution of values by counting observations in buckets.

Use for:

  • Request latency
  • Response sizes
  • Queue wait times
  • Any measurement where distribution matters

Queries:

  • Percentile: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) → p99 latency
  • Average: rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])

Examples:
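
A sketch with prometheus_client; the bucket boundaries are illustrative and should be tuned to your own latency targets:

import time
from prometheus_client import Histogram

# Request latency in seconds, counted into fixed buckets
http_request_duration_seconds = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)

# Observe a single measurement
start = time.time()
# ... handle the request ...
http_request_duration_seconds.observe(time.time() - start)

# Or time a block of code with the built-in helper
with http_request_duration_seconds.time():
    pass  # ... handle the request ...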

Summaries

Summaries are similar to histograms but calculate percentiles on the client side.

Aspect | Histogram | Summary
Percentile calculation | At query time (server) | At collection time (client)
Aggregation | Can aggregate across instances | Cannot aggregate
Storage | Fixed buckets | Configurable quantiles
Use case | Most use cases | When you need exact percentiles

The Four Golden Signals

With so many possible metrics to collect, how do you decide what to monitor?

Google's SRE team distilled years of experience into four fundamental metrics that every service should track. If you monitor nothing else, monitor these:

Latency

How long requests take to process.

What to measure:

  • Successful request latency (most important)
  • Failed request latency (often different)
  • Percentiles: p50, p90, p95, p99

Why percentiles matter: the average can look healthy while a small slice of requests take seconds. p50 shows the typical experience, p95 and p99 show what your slowest users see, and tail latency is usually where problems appear first.

Metrics:
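
For example (illustrative names, following the Prometheus conventions used later in this chapter):

  • http_request_duration_seconds (histogram, labeled by method, endpoint, and status)
  • db_query_duration_seconds (histogram, for downstream latency)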

Traffic

How much demand is being placed on the system.

What to measure:

  • Requests per second
  • Transactions per second
  • Bytes transferred
  • Concurrent users

Metrics:
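
For example (illustrative names):

  • http_requests_total (counter) → rate(http_requests_total[5m]) gives requests per second
  • http_response_size_bytes (histogram) → the _sum series tracks bytes transferred
  • active_sessions (gauge) for concurrent users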

Errors

The rate of requests that fail.

What to measure:

  • Error rate (errors / total requests)
  • Error count by type
  • HTTP 5xx responses
  • Failed transactions

Important: Distinguish between client errors (4xx) and server errors (5xx). A spike in 4xx might be a client bug or attack, not a service problem.

Metrics:
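
For example (illustrative names):

  • http_requests_total{status="500"} compared against all requests → error rate
  • http_errors_total{type="timeout"} (counter, labeled by error type)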

Saturation

How "full" the service is and how close to capacity.

What to measure:

  • CPU utilization
  • Memory usage
  • Disk I/O
  • Queue depth
  • Connection pool usage
  • Thread pool saturation

Metrics:
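
For example (illustrative names; the process_* metrics are exported by most Prometheus client libraries):

  • process_cpu_seconds_total and process_resident_memory_bytes
  • connection_pool_active / connection_pool_size (gauges, covered later in this chapter)
  • thread_pool_queue_depth (gauge)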

Metric Naming Conventions

Good metric names are critical for usability. When you have thousands of metrics across dozens of services, consistent naming is the difference between quickly finding what you need and frustrating searches through documentation.

Prometheus conventions are widely adopted and worth following.

Naming Rules

<namespace>_<name>_<unit>

Examples:
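
For instance (illustrative names that follow this pattern):

  • http_request_duration_seconds
  • http_response_size_bytes
  • queue_messages_processed_total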

Component | Description | Examples
Namespace | Application or subsystem | http, database, queue
Name | What is measured | request_duration, connections, messages
Unit | Measurement unit | seconds, bytes, total, percent

Unit Conventions

Use base units without prefixes:

Measure | Use | Avoid
Time | seconds | milliseconds
Size | bytes | megabytes
Rate | _total suffix | per_second
Ratio | _ratio or _percent | ratio without suffix

Labels (Dimensions)

Labels add dimensions to metrics:
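
For example, a single counter split by method, endpoint, and status (illustrative values):

  • http_requests_total{method="GET", endpoint="/orders", status="200"}
  • http_requests_total{method="POST", endpoint="/orders", status="500"}

Each unique combination of label values becomes its own time series, which is what lets you filter and aggregate by those dimensions in queries.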

Label best practices:

  • Use consistent label names across metrics
  • Keep label values low cardinality (see next section)
  • Include useful dimensions: method, endpoint, status, service
  • Avoid high-cardinality values: user_id, request_id

Cardinality: The Hidden Killer

Cardinality is the number of unique time series a metric produces. It is easy to create thousands of series by accident, and once you hit millions, your metrics system slows down, gets expensive, or falls over.

How Cardinality Grows
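
As an illustration (hypothetical numbers), suppose http_requests_total carries three labels:

  • method: 10 values
  • endpoint: 50 values
  • status: 20 values

That is 10 × 50 × 20 = 10,000 possible time series for a single metric on a single instance, and 1,000,000 across 100 instances. Add a user_id label with a million users and the ceiling jumps to roughly 10 billion series.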

That is the kind of number that can crash a metrics backend.

Cardinality Guidelines

Dangerous labels (avoid)

  • user_id (millions of users)
  • request_id (unique per request)
  • email (unique per user)
  • timestamp (effectively infinite)

Safe labels (usually fine)

  • method (GET, POST, …)
  • status (200, 500, …)
  • service (order, payment, …)
  • environment (prod, staging, …)
  • region (us-east, eu-west, …)

Label Type | Values | Safe?
HTTP method | ~10 | Yes
Status code | ~20 | Yes
Endpoint | 10-100 | Usually
Region | ~10 | Yes
Environment | 2-5 | Yes
User ID | Millions | No
Request ID | Billions | No
Email | Millions | No

Managing High Cardinality

When you truly need per-user or per-request visibility, metrics are usually the wrong tool.

Use logs, not metrics:
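
For example (illustrative): instead of emitting payment_amount_dollars_sum{user_id="8675309"}, which creates one series per user, log a structured event such as {"event": "payment", "user_id": "8675309", "amount": 49.99} and keep the metric labeled only by low-cardinality dimensions like method and status.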

Aggregate to lower cardinality:
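
For example (illustrative): record the route template instead of the raw URL, so /users/123 and /users/456 both count toward http_requests_total{endpoint="/users/{id}"}, or group users into a handful of tiers (free, pro, enterprise) rather than labeling by user_id.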

Instrumenting Your Code

Instrumentation is the practice of adding metrics collection to your code so you can measure how the system behaves in production.

The goal is not to measure everything. It is to measure the parts that matter when you are debugging incidents, tuning performance, or validating business outcomes.

What to Instrument

Instrument these areas first:

  • Entry points: API endpoints, cron jobs, queue consumers
  • Dependencies: database, cache, external APIs
  • Business logic: orders, payments, checkouts, state transitions
  • Resources: pools, queues, buffers, thread pools

Entry Points

Every API endpoint should expose the golden signals: latency, traffic, errors.

Pseudocode:
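
A sketch in Python with prometheus_client; handle_order_request, process_order, and the request/response objects are hypothetical stand-ins for your framework's handler and business logic:

import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds",
    ["method", "endpoint", "status"],
)

def handle_order_request(request):
    # Wrap the real handler so every request records traffic, errors, and latency
    start = time.time()
    status = "500"  # assume the worst; overwritten on success
    try:
        response = process_order(request)  # hypothetical business logic
        status = str(response.status_code)
        return response
    finally:
        duration = time.time() - start
        REQUESTS.labels(method=request.method, endpoint="/orders", status=status).inc()
        LATENCY.labels(method=request.method, endpoint="/orders", status=status).observe(duration)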

Dependencies

Every call you do not control should be instrumented for latency and errors.

Database

  • db_query_duration_seconds{operation="select", table="orders"}
  • db_errors_total{operation="select", error="timeout"}

Cache

  • cache_request_duration_seconds{operation="get"}
  • cache_hits_total, cache_misses_total

External APIs

  • external_api_duration_seconds{service="payment-gateway"}
  • external_api_errors_total{service="payment-gateway", error="timeout"}

Resource Pools

Resource saturation is a top cause of cascading failures. Track utilization of anything that can get exhausted.

Connection pool

  • connection_pool_size{pool="database"} (gauge)
  • connection_pool_active{pool="database"} (gauge)
  • connection_pool_waiting{pool="database"} (gauge)

Thread pool

  • thread_pool_size{pool="http-workers"} (gauge)
  • thread_pool_active{pool="http-workers"} (gauge)
  • thread_pool_queue_depth{pool="http-workers"} (gauge)

Business Metrics

Technical metrics tell you whether the system is running. Business metrics tell you whether the product is working.

E-commerce

  • orders_created_total
  • orders_completed_total
  • order_value_dollars_sum
  • cart_abandonment_total

Payments

  • payments_processed_total{method="credit", status="success"}
  • payment_amount_dollars_sum
  • refunds_total

These are the metrics that answer: “Are we making money?” and “Is the user journey healthy?” not just “Are servers alive?”

Prometheus Architecture

Prometheus is one of the most widely used open-source metrics systems. If you understand how it works, instrumentation decisions become much easier.

The big picture

A typical Prometheus setup looks like this:

  • Instrumented services expose metrics at a /metrics endpoint
  • The Prometheus server periodically scrapes those endpoints
  • Prometheus stores the results in its time series database (TSDB)
  • Grafana queries Prometheus to build dashboards
  • Alertmanager sends notifications when alert rules fire
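
A minimal scrape configuration for this setup might look like the following; the job name, target, and interval are illustrative:

# prometheus.yml (illustrative)
scrape_configs:
  - job_name: "order-service"
    scrape_interval: 15s
    static_configs:
      - targets: ["order-service:8080"]  # each target exposes /metrics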

Pull-Based Model

Prometheus uses a pull model. It pulls metrics from services rather than services pushing metrics to a collector.

This changes how you think about discovery, failure detection, and load control.

Aspect | Pull (Prometheus) | Push (StatsD, Datadog)
Discovery | Prometheus finds targets | Services find collector
Failure detection | "No metrics" = service down | Cannot distinguish down vs not pushing
Network config | Prometheus needs access to targets | Targets need access to collector
Load | Prometheus controls scrape rate | Services control push rate

Why pull is nice in practice

  • Prometheus can scrape more frequently for critical services and less frequently for others
  • You get a clear “we stopped seeing metrics” signal when targets disappear
  • It works well with dynamic environments like Kubernetes, where targets come and go often

When push can still make sense

  • very short-lived jobs (batch jobs that finish before Prometheus can scrape them)
  • environments where Prometheus cannot reach the targets due to network boundaries

Prometheus handles this case with a separate component called the Pushgateway, but the default mental model is still pull.

Metrics Endpoint

Services expose a /metrics endpoint in Prometheus’ text format. Prometheus scrapes it on a schedule and ingests whatever it sees.

A typical endpoint includes HELP and TYPE metadata plus the actual samples:
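
For example, a scrape might return something like this (names and values are illustrative):

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/orders",status="200"} 10423
http_requests_total{method="POST",endpoint="/orders",status="500"} 7
# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 9830
http_request_duration_seconds_bucket{le="0.5"} 10395
http_request_duration_seconds_bucket{le="+Inf"} 10430
http_request_duration_seconds_sum 842.5
http_request_duration_seconds_count 10430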

PromQL Basics

Prometheus Query Language (PromQL) is how you slice and aggregate time series.

Basic queries:
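
For example (assuming the illustrative http_requests_total counter from earlier):

  • http_requests_total → every time series for this metric
  • http_requests_total{status="500"} → only the series with a 500 status
  • http_requests_total{method="GET", endpoint="/orders"} → filter on several labels at once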

Rate and aggregation:

Counters are totals, so for “per second” you usually want a rate:
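
For example:

  • rate(http_requests_total[5m]) → per-second rate over the last 5 minutes, per series
  • sum(rate(http_requests_total[5m])) → total request rate across all instances
  • sum by (status) (rate(http_requests_total[5m])) → request rate broken down by status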

Percentiles:

Histograms store bucket counts. PromQL computes percentiles at query time:
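
For example, p95 and p99 from the illustrative duration histogram:

  • histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) → p95
  • histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) → p99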

This is why histograms are so useful: you can compute p95, p99, and other percentiles across all instances and services without changing your instrumentation.

Metrics Best Practices

Good metrics feel boring in the best way. They are consistent, predictable, and easy to use across services. Most problems with metrics do not come from missing data. They come from messy naming, inconsistent labels, and high cardinality that makes the system expensive and hard to query.

1. Use Consistent Labels Across Services

All services should use the same label names:
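
For example (illustrative), every service exposes request metrics with the same label set:

  • http_requests_total{service="orders", method="GET", endpoint="/orders/{id}", status="200"}
  • http_requests_total{service="payments", method="POST", endpoint="/charges", status="201"}

A dashboard or alert written once, such as sum by (service) (rate(http_requests_total{status=~"5.."}[5m])), then works for every service.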

2. Separate Technical and Business Metrics

Technical metrics answer: “Is the system healthy?”

Business metrics answer: “Is the product working?”

Keep them separate so each audience gets what they need without mixing concerns.

3. Use Histograms for Latency

Latency is not a single number. The average can look fine while a small percentage of users have a terrible experience. Histograms let you compute percentiles at query time, which is exactly what you want for p95 and p99.

4. Include Status Labels on Duration Metrics

Errors often have very different latency profiles than successful requests.

  • Some errors are fast (fail quickly due to validation or auth).
  • Some errors are slow (timeouts and retries).

If you mix them together, your latency charts become misleading.
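
One option (illustrative, reusing the naming from earlier sections) is to add a status label to the duration histogram and query success and failure separately:

  • http_request_duration_seconds{status="200"} → latency of successful requests
  • http_request_duration_seconds{status="500"} → latency of failures
  • histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{status="200"}[5m]))) → p99 for successes only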

This makes it possible to answer questions like:

  • “Is p99 for successful requests getting worse?”
  • “Are 5xx errors caused by timeouts, or are we failing fast?”

Summary

Metrics give you the quantitative view of system behavior:

  • Counters count things that only increase: requests, errors, bytes
  • Gauges measure current state: connections, queue depth, memory
  • Histograms capture distributions: latency, response sizes
  • Summaries resemble histograms but compute percentiles on the client, at the cost of aggregation

The four golden signals are the minimum set every service should have:

  • Latency: percentiles, not averages
  • Traffic: requests per second
  • Errors: error rate and error type breakdown
  • Saturation: utilization of constrained resources

Key practices to remember:

  • Use consistent metric and label names across services
  • Keep label cardinality low
  • Separate technical and business metrics
  • Use histograms for latency and include status labels
  • Instrument entry points, dependencies, and resource pools