Last Updated: January 15, 2026
Every company today generates massive amounts of data: user clicks, transactions, sensor readings, logs, and other events.
The question is not whether you have data, but how you process it. Do you wait until you have collected a full day's worth and process it all at once? Or do you process each piece as it arrives?
This fundamental choice defines two paradigms of data processing: batch and stream. Batch processing collects data over time and processes it in bulk. Stream processing handles data as it flows, one event at a time.
Neither approach is universally better. Each has strengths that make it ideal for certain use cases and weaknesses that make it unsuitable for others. Understanding when to use which, and how to combine them, is essential for designing modern data systems.
In this chapter, you will learn:

- How batch processing works, its characteristics, and the tools built around it
- How stream processing works and the challenges it introduces
- The tradeoffs between the two in latency, throughput, cost, and complexity
- How hybrid approaches such as micro-batching bridge the gap
This chapter sets the foundation for understanding big data architectures. The patterns you learn here will recur throughout this section, from MapReduce to Lambda Architecture.
Batch processing collects data over a period of time and processes it all together in a single job. Think of it like doing laundry: you accumulate dirty clothes throughout the week and wash them all on Sunday.
| Characteristic | Description |
|---|---|
| Latency | High (hours to days). Results are not available until the job completes. |
| Throughput | Very high. Optimized for processing massive volumes efficiently. |
| Data completeness | Processes complete, bounded datasets. Knows all the data upfront. |
| Fault tolerance | Can restart failed jobs. Reprocessing is straightforward. |
| Complexity | Lower. Simpler programming model with clear start and end. |
A batch job has three phases:

1. Read: load the complete, bounded input dataset.
2. Process: transform, sort, and aggregate the data.
3. Write: persist the results to storage.
The key insight is that batch processing knows the entire input before starting. This allows for optimizations like sorting all data before aggregating, or making multiple passes over the dataset.
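A minimal sketch of that insight, using hypothetical data and plain Python in place of a batch framework: because the dataset is bounded, it can be sorted globally once, then aggregated in a single sequential pass.

```python
from itertools import groupby

# Hypothetical batch job: total revenue per user from a bounded dataset.
records = [
    ("bob", 30), ("alice", 10), ("bob", 5), ("alice", 25),
]

# The whole input is known upfront, so we can sort it globally first...
records.sort(key=lambda r: r[0])

# ...then aggregate each group in one sequential pass.
totals = {user: sum(amount for _, amount in group)
          for user, group in groupby(records, key=lambda r: r[0])}

print(totals)  # {'alice': 35, 'bob': 35}
```

A stream processor could not do this: sorting requires seeing the last record before emitting the first result.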
| Tool | Description | Best For |
|---|---|---|
| Apache Hadoop MapReduce | Original batch processing framework | Legacy systems, very large datasets |
| Apache Spark (Batch) | In-memory batch processing | Interactive analytics, ML, iterative algorithms |
| Apache Hive | SQL on Hadoop | Ad-hoc queries on data lake |
| Presto/Trino | Distributed SQL query engine | Fast interactive queries |
| dbt | Transformation tool for warehouses | Modern ELT pipelines |
Stream processing handles data as it arrives, one event at a time. Instead of waiting for a batch, it processes continuously. Think of it like a car factory assembly line: each car is built as it moves through the stations, not in batches of 100.
| Characteristic | Description |
|---|---|
| Latency | Low (milliseconds to seconds). Results appear almost immediately. |
| Throughput | Lower than batch for the same resources, due to per-event processing overhead. |
| Data completeness | Processes unbounded, infinite streams. Never sees "all" the data. |
| Fault tolerance | Complex. Must handle failures without losing or duplicating events. |
| Complexity | Higher. Must handle out-of-order data, late arrivals, state management. |
Stream processing handles events one at a time, maintaining state across events.
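One way to picture this (a toy sketch, not tied to any particular framework): a processor that updates in-memory state as each event arrives and emits a result immediately.

```python
# Toy stream processor: maintains a running average per sensor by
# updating state on every event, without ever seeing "all" the data.
state = {}  # sensor_id -> (count, running_sum)

def process_event(sensor_id, value):
    count, total = state.get(sensor_id, (0, 0.0))
    count, total = count + 1, total + value
    state[sensor_id] = (count, total)
    return total / count  # current running average

for sensor_id, value in [("s1", 10.0), ("s1", 20.0), ("s2", 5.0)]:
    print(sensor_id, process_event(sensor_id, value))
```

Each event produces an updated answer right away; the cost is that the processor must keep (and protect) that state indefinitely.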
The key challenge is that stream processing never sees the complete picture. New events keep arriving. Late events might belong to windows that have already closed. The processor must make decisions with incomplete information.
| Tool | Description | Best For |
|---|---|---|
| Apache Kafka Streams | Lightweight stream processing | Kafka-native applications |
| Apache Flink | True stream processing with state | Low-latency, exactly-once processing |
| Apache Spark Streaming | Micro-batch stream processing | Unified batch and stream |
| Amazon Kinesis | Managed streaming on AWS | AWS-native applications |
| Google Dataflow | Unified batch and stream (Beam) | GCP environments |
Stream processing introduces complexity that batch processing avoids.
Events do not always arrive in the order they occurred. Network delays, distributed systems, and retries cause reordering.
Stream processors must handle this using event time (when it happened) rather than processing time (when it arrived).
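The distinction can be sketched with a few hypothetical events: windows are assigned from the timestamp the event carries (event time), so a late arrival still lands in the window where it belongs.

```python
from collections import defaultdict

WINDOW = 60  # one-minute tumbling windows (in seconds)

# Each event carries its own event-time timestamp; arrival order differs.
events = [
    {"event_time": 125, "value": 1},  # belongs to window [120, 180)
    {"event_time": 61,  "value": 1},  # arrived late: window [60, 120)
    {"event_time": 130, "value": 1},
]

counts = defaultdict(int)
for e in events:
    # Bucket by event time, not by when the event was received.
    window_start = (e["event_time"] // WINDOW) * WINDOW
    counts[window_start] += e["value"]

print(dict(counts))  # {120: 2, 60: 1}
```

Bucketing by processing time instead would have credited the late event to whichever window happened to be open when it arrived.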
What happens when an event arrives after its window has closed? The processor must choose a policy: drop the event, route it to a side output for separate handling, or keep the window open for a grace period (often called allowed lateness) and emit an updated result.
Stream processing often needs state. Counting events, computing averages, detecting patterns all require remembering past events.
State must be checkpointed for fault tolerance. If a processor fails, it must recover its state to continue correctly.
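A minimal illustration of the idea, with a hypothetical file-based checkpoint (real systems use durable, transactional stores): the state is snapshotted together with the stream position it reflects, so recovery resumes from a consistent point.

```python
import json
import os

CHECKPOINT = "stream_state.json"  # hypothetical checkpoint location

def save_checkpoint(state, offset):
    # Snapshot the state alongside the stream offset it corresponds to,
    # so state and position can never disagree after recovery.
    with open(CHECKPOINT, "w") as f:
        json.dump({"state": state, "offset": offset}, f)

def load_checkpoint():
    # On restart, resume from the last snapshot (or start fresh).
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            snap = json.load(f)
        return snap["state"], snap["offset"]
    return {}, 0
```

After a crash, the processor reloads the snapshot and replays the stream from the saved offset, rebuilding exactly the state it had.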
Guaranteeing each event is processed exactly once is hard in distributed systems:
| Guarantee | Description | Complexity |
|---|---|---|
| At-most-once | Events may be lost | Low |
| At-least-once | Events may be duplicated | Medium |
| Exactly-once | Events processed exactly once | High |
Exactly-once requires coordination between the stream processor, state store, and output sink.
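One common building block, sketched here with an in-memory set (real systems persist it transactionally with the state): deduplicate by a unique event ID, so that at-least-once delivery plus idempotent handling behaves like exactly-once.

```python
seen_ids = set()  # in practice, persisted atomically with the state
counter = 0

def handle(event_id, amount):
    global counter
    if event_id in seen_ids:  # duplicate from a retry: skip it
        return counter
    seen_ids.add(event_id)
    counter += amount
    return counter

# The same event delivered twice (at-least-once) affects the total once.
handle("e1", 10)
handle("e1", 10)  # duplicate redelivery
handle("e2", 5)
print(counter)  # 15
```

The hard part in a distributed setting is making the dedup check, the state update, and the output write atomic together; that is the coordination the table's "High" complexity refers to.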
Batch processing optimizes for throughput by amortizing overhead across millions of records. Stream processing optimizes for latency by processing each record immediately.
| Metric | Batch | Stream |
|---|---|---|
| Time to first result | Hours | Milliseconds |
| Throughput (records/sec) | Millions | Thousands to hundreds of thousands |
| Cost per record | Lower (amortized) | Higher (per-event overhead) |
| Resource utilization | Bursty (high during job) | Steady (continuous) |
| Aspect | Batch | Stream |
|---|---|---|
| Dataset | Bounded (finite) | Unbounded (infinite) |
| Completeness | All data available | Data always incomplete |
| Ordering | Can be sorted globally | May arrive out of order |
| Late data | Not an issue | Must handle explicitly |
| Reprocessing | Easy (re-run job) | Complex (replay stream) |
Batch processing has a simpler programming model because it operates on finite datasets.
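As a sketch, with plain Python over a hypothetical in-memory dataset standing in for a batch framework: the whole input exists up front, so a global total is a well-defined quantity.

```python
# Batch: the dataset is finite and fully available before processing starts.
page_views = [
    {"user": "alice"}, {"user": "bob"}, {"user": "alice"},
]

total = len(page_views)  # a global total is meaningful for bounded data
per_user = {}
for view in page_views:
    per_user[view["user"]] = per_user.get(view["user"], 0) + 1

print(total, per_user)  # 3 {'alice': 2, 'bob': 1}
```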
Stream processing must handle continuous data and state.
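The same counting problem in streaming form, as a toy sketch: because the input never ends, counts only make sense per window, and the window state must be kept across events.

```python
from collections import defaultdict

WINDOW = 10  # seconds per tumbling window

window_counts = defaultdict(int)  # state: running count per window

def on_event(timestamp):
    # The stream is unbounded, so we can only maintain a count per
    # time window, never a final "total count".
    window = (timestamp // WINDOW) * WINDOW
    window_counts[window] += 1
    return window, window_counts[window]

# Events arriving one at a time (timestamps in seconds).
for ts in [1, 4, 12, 13, 25]:
    on_event(ts)

print(dict(window_counts))  # {0: 2, 10: 2, 20: 1}
```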
Notice how stream processing introduces the concept of windows. Since streams are infinite, you cannot compute "total count." You can only compute "count per time window."
Spark blurs the line by offering both batch and stream processing with the same API.
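Setting the specific framework aside, the core idea can be sketched in plain Python: write the transformation once, then feed it either a bounded collection (batch) or an unbounded generator (stream).

```python
def transform(records):
    # Business logic written once, agnostic of how records arrive:
    # keep positive amounts and double them.
    for r in records:
        if r["amount"] > 0:
            yield {"user": r["user"], "amount": r["amount"] * 2}

# Batch: drive the same logic from a bounded list.
batch_input = [{"user": "a", "amount": 1}, {"user": "b", "amount": -1}]
print(list(transform(batch_input)))

# Stream: drive it from a generator that could, in principle, never end.
def stream_input():
    yield {"user": "c", "amount": 3}

for result in transform(stream_input()):
    print(result)
```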
This unified approach means you can develop logic once and apply it to both batch and streaming contexts.
Micro-batch processing sits between batch and stream. It processes data in small batches, typically covering a few seconds each, rather than truly event by event.
Spark Structured Streaming uses micro-batch internally but provides a streaming API. For many use cases, seconds of latency is acceptable, and the simpler programming model is worth it.
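A toy illustration of the idea: buffer incoming events for a short interval, then process the buffer as a small batch. For simplicity this sketch flushes after a fixed number of events rather than on a timer.

```python
# Toy micro-batcher: accumulate events and flush every `batch_size` events
# (real systems flush on a timer, e.g. every few seconds).
class MicroBatcher:
    def __init__(self, batch_size, process):
        self.batch_size = batch_size
        self.process = process  # batch function applied to each mini-batch
        self.buffer = []

    def on_event(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.process(self.buffer)  # amortize overhead over the batch
            self.buffer = []

batches = []
mb = MicroBatcher(batch_size=3, process=batches.append)
for e in range(7):
    mb.on_event(e)
mb.flush()  # flush the remainder

print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Latency is bounded by the batch interval, while per-record overhead is amortized across each mini-batch, which is exactly the middle ground described above.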
Batch and stream processing represent two fundamental approaches to handling data:

- Batch: bounded datasets, very high throughput, high latency, and a simpler programming model.
- Stream: unbounded data, low latency, and added complexity around ordering, state, and late arrivals.
Understanding these paradigms is essential for designing data architectures. The next chapter dives deep into MapReduce, the foundational batch processing paradigm that revolutionized how we think about processing massive datasets.