Log Aggregation

Last Updated: January 7, 2026

Ashish Pratap Singh

Imagine you have implemented structured logging across your services. Each service writes beautiful JSON logs with consistent fields and appropriate context.

There is just one problem: your logs are scattered across 50 servers, each with its own log files. When something goes wrong, you need to SSH into each machine, grep through files, and somehow piece together what happened.

This does not scale. A single request might touch 10 services running on 30 instances. Finding all logs related to that request means searching through 30 different places. During an incident, you do not have time for this.

Log aggregation solves this by collecting logs from everywhere and storing them in a central location. Instead of searching 30 servers, you search one system. Instead of correlating timestamps manually, you filter by request ID and see everything.

In this chapter, you will learn:

  • How log aggregation systems work
  • Common architectures and their trade-offs
  • The ELK stack and its alternatives
  • How to scale log aggregation for high-volume systems
  • Cost optimization strategies

This chapter builds on the logging best practices we covered previously. The structured logs you write are only useful if you can search and analyze them at scale.

The Problem with Distributed Logs

In a single-server app, logs live in one place, so debugging is straightforward. In a distributed system, every service writes its own logs on its own machines. Without aggregation, even simple incidents turn into a scavenger hunt.

A single user request might touch:

  • Service 1 on server-01
  • Service 2 on server-02
  • Service 3 on server-03
  • Database logs on db-01

Now the engineer has to SSH into multiple machines, grep through files, and manually stitch the story together. That does not scale.

The Challenges

| Challenge | Impact |
|---|---|
| Scattered logs | Must search each server individually |
| Ephemeral infrastructure | Container logs disappear when containers die |
| Access control | Engineers need SSH access to production servers |
| Correlation | Manually matching logs across services |
| Retention | Each server manages its own rotation and deletion |
| Analysis | No ability to query or visualize patterns |

Modern infrastructure makes this worse. With containers, auto-scaling, and frequent redeploys, instances are short-lived. When a container crashes, its local logs may vanish. When a node scales down, its files go with it. If you are not centralizing logs, you are losing the evidence you need most during failures.

Log Aggregation Architecture

A log aggregation system centralizes logs so you can search, correlate, and retain them reliably. Most designs break into four layers: collection, processing, storage, and query.

Collection Layer

The collection layer gets logs from sources to the central system:

1) Agents (most common)

Lightweight daemons run on each node, read logs, and forward them; a minimal agent configuration sketch follows the list below.

  • Examples: Filebeat, Fluentd, Fluent Bit, Vector
  • Read from files, stdout, or container log APIs
  • Buffer locally and retry on network failures
  • Decouple your app from the logging backend
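
For example, a minimal Filebeat configuration sketch; the log path and the Logstash endpoint are assumptions for illustration:

```yaml
# filebeat.yml (sketch): tail application log files and forward them to Logstash
filebeat.inputs:
  - type: filestream
    id: app-logs                # unique id for this input
    paths:
      - /var/log/myapp/*.log    # assumed log location

output.logstash:
  hosts: ["logstash.internal:5044"]   # assumed endpoint; Filebeat buffers and retries on failure
```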

2) Direct shipping

Applications send logs directly over the network.

  • Works well in some containerized setups
  • Avoids file I/O
  • Higher risk of log loss if the destination is slow or unavailable unless you build buffering into the app

Processing Layer

This layer transforms and enriches logs before storage:

  • Parsing: extract fields from raw logs (or validate JSON structure)
  • Enrichment: add metadata like hostname, environment, region, pod name, version
  • Filtering: drop noisy logs, apply sampling rules, redact sensitive fields
  • Routing: send different log types to different destinations (audit vs app logs)

Common tools: Logstash, Fluentd pipelines, Vector transforms, custom stream processors.

Storage Layer

This is where logs are indexed for search and kept for retention.

Search-oriented stores

  • Elasticsearch / OpenSearch
  • Fast full-text search across huge volumes
  • Rich queries and aggregations
  • Typically higher cost, especially at high cardinality

Time-series / columnar approaches

  • Loki, ClickHouse
  • Often cheaper at scale due to compression and columnar storage
  • Optimized for time-range queries
  • Trade some query flexibility for efficiency (especially compared to full-text search)

Query Layer

This is how people and systems interact with logs:

  • Dashboards: Kibana, Grafana, or internal UIs
  • APIs: programmatic access for automation and incident tooling
  • Alerts: trigger notifications based on patterns (for example error spikes, specific signatures)

The ELK Stack

The ELK stack is a popular setup for centralized logging. It gives you an end-to-end pipeline: collect logs, process them, store them in a searchable index, and explore them through dashboards.

Elasticsearch

Elasticsearch is a distributed search engine built on Apache Lucene. For logs, its strength is fast search over huge volumes of JSON documents.

How it stores logs:

  • Logs are stored as JSON documents
  • Documents are grouped into indices (like database tables)
  • Indices are often time-based, for example: logs-2024.01.15
  • Data is split across shards for scalability
  • Replicas provide redundancy and read scaling

Key concepts:

| Concept | Description |
|---|---|
| Index | Collection of documents with similar structure |
| Document | A single log entry (JSON object) |
| Shard | Horizontal partition of an index |
| Replica | Copy of a shard for redundancy |
| Mapping | Schema defining field types |

Example query (errors in payment-service in the last hour)
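
A minimal sketch against daily logs-* indices, assuming the mapping includes service, level, and @timestamp fields (the field names are assumptions about your schema):

```
GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "service.keyword": "payment-service" } },
        { "term":  { "level.keyword": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 100
}
```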

Tip: use keyword fields (like service.keyword) for exact matches and text fields for full-text search, depending on your mapping.

Logstash

Logstash processes logs through a pipeline:

Input → Filter → Output

  • Input: receive logs (Beats, files, Kafka, HTTP, etc.)
  • Filter: parse, transform, enrich (JSON parsing, timestamps, geoip, redaction)
  • Output: send to Elasticsearch or other destinations

Example pipeline configuration
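
A minimal sketch, assuming Filebeat ships JSON log lines in the message field and Elasticsearch is reachable at elasticsearch:9200 (both assumptions):

```
input {
  beats {
    port => 5044                        # agents such as Filebeat ship here
  }
}

filter {
  json {
    source => "message"                 # parse the JSON log line into fields
  }
  date {
    match => ["timestamp", "ISO8601"]   # use the app's timestamp as @timestamp
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"      # daily time-based indices
  }
}
```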

Kibana

Kibana is the UI layer for exploration and visualization:

  • Discover: Search and browse logs
  • Visualize: Create charts and graphs
  • Dashboard: Combine visualizations into dashboards
  • Alerting: Set up alerts based on log patterns

Alternative Architectures

ELK is a common choice, but it is not the only one. Different stacks make different trade-offs around cost, query flexibility, and operational complexity.

Loki (Grafana)

Loki takes a different approach from Elasticsearch. It indexes only metadata (labels), not the full log content.

The raw logs are stored cheaply (often in object storage), and Loki keeps a lightweight label index so you can quickly narrow down what you want to read.

Elasticsearch vs Loki

| Aspect | Elasticsearch | Loki |
|---|---|---|
| Index strategy | Full-text indexing | Label-only indexing |
| Storage cost | Higher (indexes everything) | Lower (compresses logs) |
| Query speed | Faster for text search | Faster for label-based queries |
| Cardinality | Handles high cardinality | Struggles with high cardinality |
| Operations | Complex cluster management | Simpler, stateless |

Loki is a good choice when:

  • Cost is a primary concern
  • You query by known labels (service, environment)
  • You use Grafana for visualization
  • Log volume is very high

Loki works best when you keep labels low-cardinality. Put high-cardinality values (request_id, user_id) in the log body, not as labels.
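
For example, a LogQL query narrows the stream by low-cardinality labels first, then filters and parses the log body; the label names and the request_id field are assumptions about your setup:

```logql
{service="payment-service", env="prod"} |= "error" | json | request_id="req-8f3a12"
```

The stream selector uses the label index; the line filter, JSON parser, and request_id comparison all run over the log bodies that match it.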

ClickHouse

ClickHouse is a column-oriented database that is excellent for log analytics:

  • extremely fast aggregations over large datasets
  • strong compression and low storage cost per TB
  • SQL interface, which many teams find easier than DSL queries
  • can power both logs and metrics-style analytics

ClickHouse is a strong option when your logging use case is less “grep everything” and more “run analytics at scale” (top errors, percentiles, breakdowns by service, trends over time).
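
As an illustration, a "top errors by service over the last hour" query; the logs table and its columns are assumptions for this sketch:

```sql
-- Top error-producing services in the last hour (illustrative schema)
SELECT
    service,
    count() AS error_count
FROM logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY error_count DESC
LIMIT 10
```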

Managed Services

Cloud providers and observability vendors offer managed logging. These reduce the operational burden, but costs can rise quickly at scale.

| Service | Provider | Strengths |
|---|---|---|
| CloudWatch Logs | AWS | Deep AWS integration, serverless |
| Cloud Logging | Google Cloud | Integrated with GCP services |
| Azure Monitor | Microsoft | Azure ecosystem integration |
| Datadog Logs | Datadog | Unified observability platform |
| Splunk Cloud | Splunk | Powerful search, enterprise features |

Managed services are great when you want to move fast, avoid running clusters, or need enterprise features out of the box. The trade-off is usually cost and vendor lock-in.

Scaling Log Aggregation

As log volume grows, your logging pipeline has to scale just like any other production system. The bottlenecks usually show up in three places: ingestion, indexing/storage, and query load.

Volume Challenges

A rough progression looks like this:

  • ~10 GB/day: single node or simple setup
  • ~100 GB/day: small cluster
  • ~1 TB/day: large cluster with careful tuning
  • 10+ TB/day: multi-cluster, tiered storage, and strict controls on volume and cardinality

The exact breakpoints vary, but the trend is consistent: once you start indexing everything, costs and operational complexity rise quickly.

Scaling Strategies

1. Add a Buffer (Kafka)

Put a durable message queue between collection and processing, so agents publish to Kafka and downstream consumers index at their own pace (a minimal agent-side config sketch follows the list of benefits).

Benefits:

  • absorbs traffic spikes without dropping logs
  • decouples collection from processing and storage
  • enables replay if a downstream stage fails
  • supports multiple consumers (for example, security pipeline plus analytics pipeline)
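
On the agent side, this can be as simple as pointing the shipper at Kafka instead of directly at Logstash or Elasticsearch. A Filebeat sketch, where the broker addresses and topic name are assumptions:

```yaml
# filebeat.yml (sketch): publish log events to a Kafka topic that acts as the buffer
output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"]   # assumed broker list
  topic: "app-logs"        # assumed topic; processing consumers read from here
  compression: gzip        # reduce network and broker storage
  required_acks: 1         # wait for the partition leader to acknowledge
```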

2. Scale Elasticsearch Horizontally

Elasticsearch is designed to scale by adding nodes, but roles matter.

Common node roles

  • Master nodes: cluster coordination (use 3 for high availability)
  • Data nodes: store and search data (scale these for storage and indexing throughput)
  • Coordinating nodes: route queries and reduce query load on data nodes (scale for heavy dashboards and search traffic)

Scaling is not just “add nodes.” You also need to tune shard counts, mappings, refresh intervals, and ingestion rates so the cluster stays stable.
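
For example, roles are assigned per node in elasticsearch.yml (a sketch; the hot-tier role assumes the tiered layout described in the next section):

```yaml
# Dedicated master-eligible node (run three for high availability)
node.roles: [ master ]

# Hot-tier data node (handles indexing and recent-data search)
node.roles: [ data_hot, data_content ]

# Coordinating-only node (empty role list: routes queries and merges results)
node.roles: [ ]
```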

3. Use Hot-Warm-Cold Architecture

Not all logs are equally valuable. Recent logs need fast search. Older logs are rarely queried and can live on cheaper storage.

A common tiering model:

| Tier | Storage Type | Retention | Query Speed | Cost |
|---|---|---|---|---|
| Hot | SSD | 1-7 days | Fastest | Highest |
| Warm | HDD | 7-30 days | Fast | Medium |
| Cold | Frozen | 30-90 days | Slow | Low |
| Archive | Object storage | 90+ days | Very slow | Lowest |

This keeps your “hot” cluster small and fast while still meeting retention and compliance needs.

4. Sampling and Filtering

The cheapest log is the one you never store. Reduce volume before it hits your index.

High-value logs (always keep)

  • errors and warnings
  • transaction completions and state changes
  • security events
  • important user actions

Low-value logs (sample or drop)

  • health checks (sample 1%)
  • debug logs (drop in production)
  • routine success logs (sample 10% or convert to metrics)

A good rule: if you need counts and rates, use metrics. Use logs for context and investigation.
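
One common place to apply these rules is the processing layer. A Logstash filter sketch, assuming fields named level and path (your field names may differ):

```
filter {
  # Drop DEBUG logs entirely in production
  if [level] == "DEBUG" {
    drop { }
  }

  # Keep roughly 1% of health-check logs
  if [path] == "/healthz" {
    drop { percentage => 99 }
  }
}
```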

Index Management

In Elasticsearch, index management has a direct impact on cost, stability, and query speed. The core idea is to keep indices easy to delete, easy to query, and sized so shards stay healthy.

Time-Based Indices

A common pattern is to create a new index per day:

  • logs-2024.01.13
  • logs-2024.01.14
  • logs-2024.01.15 (today’s hot index)

Benefits

  • Simple retention: delete old indices in one operation (see the sketch below)
  • Faster queries: most searches are time-bounded, so you hit fewer indices
  • Predictable sizing: you avoid one giant index that grows forever
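
A quick console sketch of what daily indices buy you, using the index names above:

```
# Retention is a single operation per expired day
DELETE /logs-2024.01.13

# A "last 15 minutes" search only needs to touch today's index
GET /logs-2024.01.15/_search
{
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } }
}
```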

Index Lifecycle Management (ILM)

ILM automates what you would otherwise do manually: keep recent data fast, move older data to cheaper tiers, and delete it when it expires.

Hot → Warm → Cold → Delete

Typical transitions:

  • Hot: actively written, frequent queries
  • Warm: read-heavy, less frequent queries
  • Cold: rarely accessed
  • Delete: remove past retention

Common ILM actions

  • Rollover: create a new index when size or age threshold is reached
  • Shrink: reduce shard count for older indices
  • Force merge: merge segments to improve query efficiency on read-mostly indices
  • Delete: drop indices past retention

Practical tip: even if you do daily indices, rollover-by-size is useful when volume is uneven (for example during incidents or traffic spikes).
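
A sketch of an ILM policy along these lines; the policy name and thresholds are illustrative, not recommendations:

```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": { "set_priority": { "priority": 0 } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Attach the policy to new indices through an index template (the index.lifecycle.name setting) so each rollover index picks it up automatically.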

Shard Sizing

Shard size matters because shards are the units Elasticsearch moves, recovers, and searches. A useful target for log workloads is typically:

10–50 GB per shard (many teams aim around 20–40 GB)

| Problem | Symptoms | Solution |
|---|---|---|
| Too many small shards | High cluster overhead, slow performance | Fewer shards, rollover by size |
| Too few large shards | Slow queries, long recovery times | More shards, smaller rollover threshold |

Rule of thumb: keep shard count under control. A commonly cited guideline is around 20–25 shards per GB of heap per data node, but treat it as a starting point, not a guarantee. The real goal is to avoid a shard explosion that eats memory and slows the cluster down.

Cost Optimization

Log aggregation costs can grow faster than you expect because you pay for volume at every stage: ingestion, processing, indexing, storage, and queries.

The simplest way to control cost is to treat logs as a product with a budget, not an unlimited dump.

Reduce Log Volume

Reducing volume is the highest leverage move because it lowers costs everywhere downstream.

Techniques that work well:

  • Drop DEBUG logs in production by default
  • Sample high-volume, low-value logs (health checks, cache hits, routine success paths)
  • Trim verbose fields that are rarely used (large payloads, full headers, long stacks in INFO)
  • Aggregate repetitive logs (10,000 health checks → one periodic summary)

Optimize Storage

| Strategy | Savings | Trade-off |
|---|---|---|
| Compression | 60-80% | Slightly slower queries |
| Hot-warm-cold tiers | 50-70% | Slower old log queries |
| Shorter retention | Variable | Less historical data |
| Field trimming | 20-40% | Fewer queryable fields |

Right-Size Infrastructure

  • Start small and scale based on actual ingestion and query load
  • Use spot or preemptible instances for warm and cold tiers when safe
  • Consider managed services for smaller deployments to avoid ops overhead
  • Monitor resource usage and set budgets (ingest rate, index size growth, query rate)

Rough cost comparison (1 TB/day)

These are ballpark ranges and can swing widely based on retention, query patterns, and indexing choices.

| Solution | Monthly Cost | Notes |
|---|---|---|
| Self-managed ELK | $5,000-15,000 | Infrastructure + operations |
| Elastic Cloud | $10,000-25,000 | Managed, less operational work |
| Loki + S3 | $2,000-5,000 | Lower cost, different trade-offs |
| CloudWatch Logs | $10,000-30,000 | Fully managed, pay per use |
| Datadog | $15,000-40,000 | Full platform, expensive at scale |

Costs vary significantly based on retention, query patterns, and feature requirements.

Observability for your Logging System

Your log pipeline is production infrastructure. If it breaks, you lose the evidence you need during incidents, so it must be monitored like any other critical system.

Key Metrics to Track

Collection

  • events per second
  • buffer utilization
  • failed deliveries

Processing

  • processing latency
  • parsing errors
  • queue depth

Storage and query

  • indexing rate and latency
  • query latency
  • disk usage and index growth

Example alerts

| Component | Metric | Alert Threshold |
|---|---|---|
| Agents | Events/second | Sudden drop (source failure) |
| Agents | Failed deliveries | > 0.1% (delivery issues) |
| Kafka | Consumer lag | Growing lag (processing behind) |
| Logstash | Events per second | Below baseline |
| Elasticsearch | Indexing latency | > 5 seconds |
| Elasticsearch | Search latency | p99 > 10 seconds |
| Cluster | Disk usage | > 80% |

Common Failure Modes

| Failure | Symptom | Mitigation |
|---|---|---|
| Agent failure | Missing logs from hosts | Heartbeat monitoring |
| Network partition | Log delivery delays | Buffer and retry |
| Kafka full | Producers backing up | Monitor lag, scale consumers |
| Elasticsearch slow | Query timeouts | Scale cluster, optimize queries |
| Disk full | Indexing stops | ILM policies, alerts |

Summary

Log aggregation turns scattered logs into a centralized, queryable system:

  • Collection uses agents to gather logs reliably
  • Processing parses, enriches, and routes logs through pipelines
  • Storage indexes logs (Elasticsearch) or stores them cheaply with label indexes (Loki)
  • Query tools like Kibana and Grafana enable search, dashboards, and alerts

Key decisions that drive cost and scale:

  • ELK gives flexible full-text search but costs more
  • Loki lowers cost with label-only indexing
  • Kafka buffers ingestion and enables replay
  • Hot–warm–cold tiers keep recent logs fast and older logs cheap
  • Sampling, filtering, and retention are your main cost levers