Last Updated: January 7, 2026
Imagine you have implemented structured logging across your services. Each service writes beautiful JSON logs with consistent fields and appropriate context.
There is just one problem: your logs are scattered across 50 servers, each with its own log files. When something goes wrong, you need to SSH into each machine, grep through files, and somehow piece together what happened.
This does not scale. A single request might touch 10 services running on 30 instances. Finding all logs related to that request means searching through 30 different places. During an incident, you do not have time for this.
Log aggregation solves this by collecting logs from everywhere and storing them in a central location. Instead of searching 30 servers, you search one system. Instead of correlating timestamps manually, you filter by request ID and see everything.
In this chapter, you will learn how log aggregation systems are structured, how the ELK stack compares with alternatives like Loki and ClickHouse, and how to scale, manage, and pay for a logging pipeline without losing control of it.
This chapter builds on the logging best practices we covered previously. The structured logs you write are only useful if you can search and analyze them at scale.
In a single-server app, logs live in one place, so debugging is straightforward. In a distributed system, every service writes its own logs on its own machines. Without aggregation, even simple incidents turn into a scavenger hunt.
A single user request might touch:
- services running on `server-01`, `server-02`, and `server-03`
- a database on `db-01`

Now the engineer has to SSH into multiple machines, grep through files, and manually stitch the story together. That does not scale.
| Challenge | Impact |
|---|---|
| Scattered logs | Must search each server individually |
| Ephemeral infrastructure | Container logs disappear when containers die |
| Access control | Engineers need SSH access to production servers |
| Correlation | Manually matching logs across services |
| Retention | Each server manages its own rotation and deletion |
| Analysis | No ability to query or visualize patterns |
Modern infrastructure makes this worse. With containers, auto-scaling, and frequent redeploys, instances are short-lived. When a container crashes, its local logs may vanish. When a node scales down, its files go with it. If you are not centralizing logs, you are losing the evidence you need most during failures.
A log aggregation system centralizes logs so you can search, correlate, and retain them reliably. Most designs break into four layers: collection, processing, storage, and query.
The collection layer gets logs from sources to the central system:
- Agent-based collection: lightweight daemons run on each node, read logs from files, stdout, or container log APIs, and forward them.
- Direct shipping: applications send logs directly over the network to the collection endpoint (sketched below).
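To make the direct-shipping path concrete, here is a minimal Python sketch that sends one structured event to an HTTP intake endpoint. The URL, service name, and field names are assumptions; collectors such as Logstash, Fluentd, or Vector can expose an HTTP input like this, and an agent would normally add fields like host for you.

```python
import json
import socket
import time

import requests  # assumes the requests library is installed

# Hypothetical HTTP intake endpoint exposed by your collector.
INTAKE_URL = "https://logs.internal.example.com/ingest"

def ship_log(level: str, message: str, **fields) -> None:
    """Send one structured log event directly to the aggregation pipeline."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "message": message,
        "service": "checkout",          # assumed service name
        "host": socket.gethostname(),   # agents normally add this for you
        **fields,
    }
    # In production you would batch, retry, and buffer locally instead of
    # blocking the request path on a synchronous POST.
    requests.post(INTAKE_URL, data=json.dumps(event),
                  headers={"Content-Type": "application/json"}, timeout=2)

ship_log("error", "payment declined", request_id="req-7f3a", order_id="o-991")
```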
The processing layer transforms and enriches logs before they are stored: parsing, enrichment, filtering, and routing all happen here.
Common tools: Logstash, Fluentd pipelines, Vector transforms, custom stream processors.
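Whichever tool you use, the processing work tends to look the same. Here is a small Python sketch of the three jobs described above (enrich, sanitize, route); the field names, static metadata, and secret pattern are illustrative assumptions, not any specific tool's configuration.

```python
import re

SECRET_PATTERN = re.compile(r"(password|token|api_key)=\S+", re.IGNORECASE)

def process(event: dict) -> tuple[str, dict]:
    """Enrich, sanitize, and route one log event; returns (destination, event)."""
    # Enrichment: attach deployment context so every log can be filtered by it.
    event.setdefault("env", "production")      # assumed static metadata
    event.setdefault("region", "us-east-1")
    event.setdefault("version", "2026.01.07")

    # Sanitization: mask anything that looks like a credential before storage.
    if "message" in event:
        event["message"] = SECRET_PATTERN.sub(r"\1=[REDACTED]", event["message"])

    # Routing: keep audit logs separate from application logs.
    destination = "audit-logs" if event.get("type") == "audit" else "app-logs"
    return destination, event

dest, cleaned = process({"message": "login ok token=abc123", "type": "audit"})
print(dest, cleaned["message"])  # audit-logs  login ok token=[REDACTED]
```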
The storage layer is where logs are indexed for search and kept for retention.
The query layer is how people and systems interact with logs: ad-hoc search, dashboards, and programmatic queries.
The ELK stack is a popular setup for centralized logging. It gives you an end-to-end pipeline: collect logs, process them, store them in a searchable index, and explore them through dashboards.
Elasticsearch is a distributed search engine built on Apache Lucene. For logs, its strength is fast search over huge volumes of JSON documents.
Log data is typically organized into time-based indices such as `logs-2024.01.15`, where each log entry is a JSON document.

| Concept | Description |
|---|---|
| Index | Collection of documents with similar structure |
| Document | A single log entry (JSON object) |
| Shard | Horizontal partition of an index |
| Replica | Copy of a shard for redundancy |
| Mapping | Schema defining field types |
Tip: use keyword fields (like service.keyword) for exact matches and text fields for full-text search, depending on your mapping.
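The keyword-versus-text distinction shows up directly in the query DSL. Below is a hedged sketch that searches the daily log indices through Elasticsearch's `_search` REST API using Python's requests; the endpoint, index pattern, and field names (`service.keyword`, `request_id.keyword`, `@timestamp`) are assumptions based on the examples in this chapter.

```python
import requests

ES = "http://localhost:9200"          # assumed Elasticsearch endpoint
INDEX = "logs-2024.01.*"              # daily indices, as described above

# Exact matches hit keyword fields; the match clause runs an analyzed
# full-text search over the message body. Filtering by request ID pulls
# one request's logs together across every service.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"service.keyword": "checkout"}},     # exact match
                {"term": {"request_id.keyword": "req-7f3a"}},  # correlate one request
            ],
            "must": [
                {"match": {"message": "payment declined"}},    # full-text search
            ],
        }
    },
    "sort": [{"@timestamp": "asc"}],
    "size": 100,
}

resp = requests.post(f"{ES}/{INDEX}/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])
```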
Logstash processes logs through a three-stage pipeline: inputs receive events, filters transform them, and outputs ship them to a destination such as Elasticsearch.
Filtering is where teams often add the most value. Enrichment (env, region, version), sanitization (mask secrets), and routing (audit logs vs app logs) usually live here.
Kibana is the UI layer for exploration and visualization: interactive log search, dashboards, and saved visualizations you can share across the team.
ELK is a common choice, but it is not the only one. Different stacks make different trade-offs around cost, query flexibility, and operational complexity.
Loki takes a different approach from Elasticsearch. It indexes only metadata (labels), not the full log content.
The raw logs are stored cheaply (often in object storage), and Loki keeps a lightweight label index so you can quickly narrow down what you want to read.
| Aspect | Elasticsearch | Loki |
|---|---|---|
| Index strategy | Full-text indexing | Label-only indexing |
| Storage cost | Higher (indexes everything) | Lower (compresses logs) |
| Query speed | Faster for text search | Faster for label-based queries |
| Cardinality | Handles high cardinality | Struggles with high cardinality |
| Operations | Complex cluster management | Simpler, stateless |
Loki is a good choice when storage cost matters more than full-text search, your queries are mostly label-driven (service, environment, level), and you want something simpler to operate than an Elasticsearch cluster.
Loki works best when you keep labels low-cardinality. Put high-cardinality values (request_id, user_id) in the log body, not as labels.
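A sketch of what that looks like against Loki's push API (`/loki/api/v1/push`); the host, labels, and log content are assumptions. Note that service, env, and level become indexed labels, while request_id and user_id stay inside the log line.

```python
import json
import time

import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/push"  # assumed Loki host

# Low-cardinality labels only: these define the stream Loki indexes.
labels = {"service": "checkout", "env": "production", "level": "error"}

# High-cardinality values go in the log line itself, not the labels.
line = json.dumps({
    "message": "payment declined",
    "request_id": "req-7f3a",
    "user_id": "u-88421",
})

payload = {
    "streams": [{
        "stream": labels,
        "values": [[str(time.time_ns()), line]],  # (timestamp in ns, log line)
    }]
}

requests.post(LOKI_URL, json=payload, timeout=5)
```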
ClickHouse is a column-oriented database that is excellent for log analytics: aggregations, group-bys, and trend queries over very large volumes run fast because only the queried columns are read.
ClickHouse is a strong option when your logging use case is less “grep everything” and more “run analytics at scale” (top errors, percentiles, breakdowns by service, trends over time).
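To illustrate the analytics-first style, here is a hedged Python sketch that runs an aggregate query over a hypothetical logs table through ClickHouse's HTTP interface; the table name, columns, and time window are assumptions.

```python
import requests

CLICKHOUSE_URL = "http://localhost:8123/"   # ClickHouse HTTP interface

# Hypothetical table: logs(timestamp DateTime, service String,
#                          level String, message String, duration_ms Float64)
sql = """
SELECT
    service,
    count() AS errors,
    quantile(0.99)(duration_ms) AS p99_ms
FROM logs
WHERE level = 'error'
  AND timestamp >= now() - INTERVAL 1 DAY
GROUP BY service
ORDER BY errors DESC
LIMIT 10
FORMAT TSVWithNames
"""

resp = requests.post(CLICKHOUSE_URL, data=sql, timeout=30)
print(resp.text)  # top error-producing services with their p99 latency
```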
Cloud providers and observability vendors offer managed logging. These reduce the operational burden, but costs can rise quickly at scale.
| Service | Provider | Strengths |
|---|---|---|
| CloudWatch Logs | AWS | Deep AWS integration, serverless |
| Cloud Logging | Google Cloud | Integrated with GCP services |
| Azure Monitor | Microsoft | Azure ecosystem integration |
| Datadog Logs | Datadog | Unified observability platform |
| Splunk Cloud | Splunk | Powerful search, enterprise features |
Managed services are great when you want to move fast, avoid running clusters, or need enterprise features out of the box. The trade-off is usually cost and vendor lock-in.
As log volume grows, your logging pipeline has to scale just like any other production system. The bottlenecks usually show up in three places: ingestion, indexing/storage, and query load.
A rough progression looks like this: a single-node stack works at first, a message queue is added to buffer ingestion as volume grows, the storage cluster is scaled out and tuned, and eventually data is split across hot, warm, and cold tiers.
The exact breakpoints vary, but the trend is consistent: once you start indexing everything, costs and operational complexity rise quickly.
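A quick back-of-the-envelope calculation makes the trend concrete. All inputs below are assumed numbers; plug in your own.

```python
# Rough daily volume for an assumed workload.
events_per_second = 5_000          # across all services
avg_event_bytes = 800              # structured JSON log line

raw_gb_per_day = events_per_second * avg_event_bytes * 86_400 / 1e9
indexed_gb_per_day = raw_gb_per_day * 1.5   # assumed full-text indexing overhead
hot_tier_gb = indexed_gb_per_day * 7        # one week of "hot" retention

print(f"raw:      {raw_gb_per_day:6.0f} GB/day")
print(f"indexed:  {indexed_gb_per_day:6.0f} GB/day")
print(f"hot tier: {hot_tier_gb:6.0f} GB for 7 days")
```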
Put a durable message queue, such as Kafka, between collection and processing.

Benefits: the queue absorbs ingestion spikes, decouples collectors from processors, and keeps logs durable while downstream components are slow or unavailable.
Use partitioning to scale throughput, and set retention long enough to handle outages of your processing or storage layers.
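A minimal sketch of the producing side, assuming the kafka-python client, a broker at localhost:9092, and a topic named logs. Keying by service keeps one service's logs ordered within a partition while spreading load across partitions.

```python
import json

from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # assumed broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",        # wait for replication so buffered logs survive broker loss
    linger_ms=50,      # small batching window: trade a little latency for throughput
)

event = {"level": "error", "service": "checkout",
         "message": "payment declined", "request_id": "req-7f3a"}

# Key by service so one service's logs stay ordered within a partition,
# while different services spread across partitions for throughput.
producer.send("logs", key=event["service"], value=event)
producer.flush()
```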
Elasticsearch is designed to scale by adding nodes, but roles matter.
Common node roles include dedicated master nodes for cluster coordination, data nodes (often split across hot, warm, and cold tiers), ingest nodes for pipeline processing, and coordinating nodes that fan out queries.
Scaling is not just “add nodes.” You also need to tune shard counts, mappings, refresh intervals, and ingestion rates so the cluster stays stable.
Not all logs are equally valuable. Recent logs need fast search. Older logs are rarely queried and can live on cheaper storage.
A common tiering model:
| Tier | Storage Type | Retention | Query Speed | Cost |
|---|---|---|---|---|
| Hot | SSD | 1-7 days | Fastest | Highest |
| Warm | HDD | 7-30 days | Fast | Medium |
| Cold | Frozen | 30-90 days | Slow | Low |
| Archive | Object storage | 90+ days | Very slow | Lowest |
This keeps your “hot” cluster small and fast while still meeting retention and compliance needs.
The cheapest log is the one you never store. Reduce volume before it hits your index.
High-value logs (always keep): errors and warnings, security and audit events, and logs from failed or unusually slow requests.

Low-value logs (sample or drop): debug output, health-check and readiness-probe requests, and repetitive success messages whose counts belong in metrics.
A good rule: if you need counts and rates, use metrics. Use logs for context and investigation.
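Here is a small Python sketch of edge-side filtering along those lines; the levels, sample rates, and dropped paths are assumptions to tune for your own traffic.

```python
import random

SAMPLE_RATES = {"debug": 0.01, "info": 0.10}   # assumed rates; tune per service
DROP_PATHS = {"/healthz", "/readyz"}           # health checks add volume, not insight

def should_store(event: dict) -> bool:
    """Decide at the edge whether a log event is worth shipping and indexing."""
    # Always keep errors and warnings: they are what you search during incidents.
    if event.get("level") in ("error", "warn"):
        return True
    # Drop pure noise outright.
    if event.get("path") in DROP_PATHS:
        return False
    # Sample the rest: keep a representative slice without paying for all of it.
    rate = SAMPLE_RATES.get(event.get("level", "info"), 1.0)
    return random.random() < rate

print(should_store({"level": "error", "message": "payment declined"}))  # True
print(should_store({"level": "info", "path": "/healthz"}))              # False
```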
In Elasticsearch, index management has a direct impact on cost, stability, and query speed. The core idea is to keep indices easy to delete, easy to query, and sized so shards stay healthy.
A common pattern is to create a new index per day:
- `logs-2024.01.13`
- `logs-2024.01.14`
- `logs-2024.01.15` (today's hot index)

Benefits: old data can be removed by deleting whole indices, queries can target only the days they need, and shard sizes stay predictable.
ILM automates what you would otherwise do manually: keep recent data fast, move older data to cheaper tiers, and delete it when it expires.
Typical transitions: indices start hot (actively written and searched), move to warm after a few days, then to cold or frozen storage, and are finally deleted when retention expires.
Common ILM actions include rollover, force merge, shrink, moving indices to different node tiers, and delete.
Practical tip: even if you do daily indices, rollover-by-size is useful when volume is uneven (for example during incidents or traffic spikes).
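As a concrete reference, here is a sketch that creates such a policy through Elasticsearch's ILM API (`PUT _ilm/policy/...`) using Python's requests. The policy name, thresholds, and phase timings are assumptions, and the exact actions available depend on your Elasticsearch version.

```python
import requests

ES = "http://localhost:9200"   # assumed Elasticsearch endpoint

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over daily, or earlier if the index grows during a spike.
                    "rollover": {"max_age": "1d", "max_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},  # compact for cheaper storage
                    "shrink": {"number_of_shards": 1},      # fewer shards for old data
                },
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},   # retention expires, index is removed
            },
        }
    }
}

resp = requests.put(f"{ES}/_ilm/policy/logs-policy", json=policy, timeout=10)
print(resp.status_code, resp.json())
```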
Shard size matters because shards are the units Elasticsearch moves, recovers, and searches. A useful target for log workloads is typically:
10–50 GB per shard (many teams aim around 20–40 GB)
| Problem | Symptoms | Solution |
|---|---|---|
| Too many small shards | High cluster overhead, slow performance | Fewer shards, rollover by size |
| Too few large shards | Slow queries, long recovery times | More shards, smaller rollover threshold |
Rule of thumb: keep shard count under control. A commonly cited guideline is around 20–25 shards per GB of heap per data node, but treat it as a starting point, not a guarantee. The real goal is to avoid a shard explosion that eats memory and slows the cluster down.
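Worked through for one assumed node and volume profile (a quick sanity check, not a sizing formula):

```python
# Shard budget for one data node, using the ~20-25 shards per GB of heap guideline.
heap_gb = 30                      # assumed JVM heap on a data node
shard_budget = (20 * heap_gb, 25 * heap_gb)

# With daily indices at ~400 GB/day and ~40 GB target shards, each day needs
# about 10 primary shards (plus replicas); across 30 days of retention that is
# several hundred shards to spread over the cluster.
daily_index_gb = 400              # assumed indexed volume per day
target_shard_gb = 40
shards_per_day = daily_index_gb / target_shard_gb

print(f"budget per node: {shard_budget[0]}-{shard_budget[1]} shards")
print(f"primary shards per daily index: {shards_per_day:.0f}")
```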
Log aggregation costs can grow faster than you expect because you pay for volume at every stage: ingestion, processing, indexing, storage, and queries.
The simplest way to control cost is to treat logs as a product with a budget, not an unlimited dump.
Reducing volume is the highest leverage move because it lowers costs everywhere downstream.
Techniques that work well:
| Strategy | Savings | Trade-off |
|---|---|---|
| Compression | 60-80% | Slightly slower queries |
| Hot-warm-cold tiers | 50-70% | Slower old log queries |
| Shorter retention | Variable | Less historical data |
| Field trimming | 20-40% | Fewer queryable fields |
Keep hot data searchable and move older data to cheaper tiers or object storage. Most teams search the last few days far more than the last few months.
These are ballpark ranges and can swing widely based on retention, query patterns, and indexing choices.
| Solution | Monthly Cost | Notes |
|---|---|---|
| Self-managed ELK | $5,000-15,000 | Infrastructure + operations |
| Elastic Cloud | $10,000-25,000 | Managed, less operations |
| Loki + S3 | $2,000-5,000 | Lower cost, different trade-offs |
| CloudWatch Logs | $10,000-30,000 | Fully managed, pay per use |
| Datadog | $15,000-40,000 | Full platform, expensive at scale |
Costs vary significantly based on retention, query patterns, and feature requirements.
Your log pipeline is production infrastructure. If it breaks, you lose the evidence you need during incidents, so it must be monitored like any other critical system.
Monitor every stage: collection (are agents alive and shipping?), processing (is the queue keeping up?), and storage and query (is the cluster healthy and fast enough?).
| Component | Metric | Alert Threshold |
|---|---|---|
| Agents | Events/second | Sudden drop (source failure) |
| Agents | Failed deliveries | > 0.1% (delivery issues) |
| Kafka | Consumer lag | Growing lag (processing behind) |
| Logstash | Events per second | Below baseline |
| Elasticsearch | Indexing latency | > 5 seconds |
| Elasticsearch | Search latency | p99 > 10 seconds |
| Cluster | Disk usage | > 80% |
| Failure | Symptom | Mitigation |
|---|---|---|
| Agent failure | Missing logs from hosts | Heartbeat monitoring |
| Network partition | Log delivery delays | Buffer and retry |
| Kafka full | Producers backing up | Monitor lag, scale consumers |
| Elasticsearch slow | Query timeouts | Scale cluster, optimize queries |
| Disk full | Indexing stops | ILM policies, alerts |
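To tie the metrics and failure modes together, here is a hedged sketch of a storage-side health check that polls Elasticsearch's `_cluster/health` and `_cat/allocation` APIs; the endpoint and thresholds are assumptions that mirror the tables above.

```python
import requests

ES = "http://localhost:9200"    # assumed Elasticsearch endpoint
DISK_ALERT_PERCENT = 80         # matches the disk threshold in the table above

def check_elasticsearch() -> list[str]:
    """Return a list of alert messages for the storage layer."""
    alerts = []

    health = requests.get(f"{ES}/_cluster/health", timeout=5).json()
    if health["status"] != "green":
        alerts.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > 0:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")

    nodes = requests.get(
        f"{ES}/_cat/allocation?format=json&h=node,disk.percent", timeout=5
    ).json()
    for node in nodes:
        if node["disk.percent"] and int(node["disk.percent"]) > DISK_ALERT_PERCENT:
            alerts.append(f"{node['node']} disk at {node['disk.percent']}%")

    return alerts

for alert in check_elasticsearch():
    print("ALERT:", alert)
```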
Log aggregation turns scattered logs into a centralized, queryable system: collect from every host, process and enrich in a pipeline, store in a searchable backend, and query everything from one place.

Key decisions that drive cost and scale: what you index (full text versus labels), how long each tier retains data, how aggressively you sample low-value logs, and how you size and roll over indices and shards.