AlgoMaster Logo

Streaming Engines

Last Updated: January 8, 2026

Ashish

Ashish Pratap Singh

Throughout this section, we have discussed stream processing as a concept. We have seen it as the speed layer in Lambda Architecture and as the sole processing paradigm in Kappa Architecture.

But how do streaming engines actually work? What makes them capable of processing millions of events per second while maintaining exactly-once semantics?

In this chapter, you will learn:

  • Core concepts that all streaming engines share
  • How popular engines (Flink, Kafka Streams, Spark Streaming) work
  • State management and checkpointing mechanisms
  • Processing guarantees and exactly-once semantics
  • How to choose the right streaming engine for your use case

Core Streaming Concepts

Before diving into specific engines, let us establish the foundational concepts they all share. The engine changes, but the mental model stays the same.

Event Time vs Processing Time

In streaming, time is surprisingly tricky because events rarely arrive exactly when they happen.

Time TypeDefinitionUse Case
Event timeWhen the event actually occurredAccurate analytics
Processing timeWhen the engine processes itSimple, low-latency
Ingestion timeWhen event enters the systemCompromise

Most production analytics pipelines prefer event time because correctness matters, even if it adds complexity.

Watermarks

If you use event time, you immediately hit a problem: events can arrive late. So when do you decide a window is “done”?

That is what watermarks solve. Watermarks track progress through event time, signaling that no events earlier than the watermark will arrive.

So if your watermark is 10:00, the engine is saying it does not expect any more events earlier than 10:00. If something earlier arrives, it is considered late.

What watermarks enable:

  • Knowing when to emit window results
  • Handling late events (after watermark)
  • Balancing completeness vs latency

The key trade-off

  • A more aggressive watermark gives lower latency but more late events
  • A more conservative watermark gives higher accuracy but slower results

In real systems, you usually set an allowed lateness window and accept that sometimes data arrives outside that window.

Windows

Streaming systems often compute aggregates over time: revenue per hour, clicks per minute, error rate in the last five minutes.

Windows define how events are grouped.

Window TypeDescriptionUse Case
TumblingFixed, non-overlappingHourly reports
SlidingFixed, overlappingMoving averages
SessionDynamic, activity-basedUser sessions

A good way to think about it: tumbling windows are calendars, sliding windows are rolling cameras, and session windows follow user behavior.

Processing Guarantees

Engines also differ in how they behave under failures.

GuaranteeDescriptionComplexity
At-most-onceEvents may be lostSimple
At-least-onceEvents may be duplicatedMedium
Exactly-onceEach event processed onceComplex

Why exactly-once is hard

Distributed systems fail in awkward ways: crashes mid-write, retries, partial commits, network partitions. The challenge is not processing an event. The challenge is producing results without double-counting or losing updates.

Most “exactly-once” implementations rely on two mechanisms:

  1. Checkpointing: The engine periodically saves state so it can restart from a known consistent point.
  2. Transactional output: The engine coordinates writing results so that either the whole checkpoint’s output becomes visible, or none of it does.

Apache Flink

Flink is one of the most widely used open-source engines for stream processing, especially when you need low latency, event-time correctness, and strong state management.

It is often described as “true streaming” because it processes events continuously rather than in micro-batches.

Architecture

Flink is a distributed system with a control plane and a data plane.

Core components

JobManager

  • Coordinates job execution across the cluster
  • Schedules tasks and manages failover
  • Coordinates checkpoints and recovery

TaskManagers

  • Execute the actual work (operators)
  • Manage local state for stateful operators
  • Run tasks in parallel based on configured parallelism

Client

  • Submits jobs to the cluster
  • Retrieves execution results and logs
  • Does not run the job itself, it is mainly a submission and control interface

Flink processes events one at a time (true streaming), unlike micro-batch approaches:

Flink builds your program into a dataflow graph.

  • Events flow through operators one at a time
  • Operators are pipelined, meaning downstream operators can process as soon as upstream emits
  • There is no artificial “batch boundary” between steps

This is why Flink can deliver very low latency while still supporting complex stateful computations.

Real streaming applications almost always need state:

  • counters and aggregates
  • joins
  • deduplication
  • session tracking
  • windowed computations

Flink handles state transparently through a state backend:

Common state backends

  • MemoryStateBackend: small state, typically dev or testing
  • FsStateBackend: state snapshots stored in a filesystem
  • RocksDBStateBackend: large state, production-grade, stored locally with RocksDB and checkpointed out

In practice, production Flink deployments with large state often rely on RocksDB because it scales beyond memory.

Checkpointing

Flink achieves exactly-once through distributed snapshots:

How it works conceptually

  1. The JobManager triggers a checkpoint and injects checkpoint barriers into the streams
  2. When an operator receives a barrier, it snapshots its local state
  3. The snapshot is written to durable storage (S3 or HDFS)
  4. Barriers flow downstream so every operator snapshots a consistent cut of the stream
  5. Once all operators acknowledge completion, the checkpoint is considered successful
  6. If there is a failure, Flink restores from the latest completed checkpoint and replays from there

The important point is that the snapshot represents a consistent global state of the entire pipeline, not just one operator.

Exactly-once also depends on the sink. Flink can maintain consistent internal state, but you still need sinks that commit outputs in a way that matches checkpoint boundaries.

FeatureDescription
Exactly-onceCheckpoint-based, with transactional sinks
Event timeFirst-class support with watermarks
SavepointsManual snapshots for upgrades/migration
State TTLAutomatic expiration of state entries
Queryable stateQuery state directly without sink
SQL/Table APIHigh-level APIs alongside DataStream

Apache Kafka Streams

Kafka Streams is not a cluster or a separate service. It is a library you embed into your application. You scale it the same way you scale any stateless service: run more instances.

Architecture

Each application instance runs the Kafka Streams library. Kafka is both the input and the backing system for state and recovery.

The key difference from Flink:

  • Flink requires a cluster to run jobs
  • Kafka Streams runs inside your application process

That makes Kafka Streams appealing when you want operational simplicity and you are already fully committed to Kafka.

Stream Processing Model

Core abstractions:

  • KStream: An unbounded stream of records, typically events.
  • KTable: A changelog representation of the latest value per key.
  • GlobalKTable: A table replicated to all instances, often used for joins when you want local access everywhere.

State in Kafka Streams

Kafka Streams stores state locally, usually in RocksDB, and makes it fault tolerant by logging changes back into Kafka.

On restart:

  • the app reloads state by replaying the changelog topic
  • progress is tracked using Kafka consumer offsets

This design is elegant because Kafka is both the event log and the recovery mechanism.

Exactly-Once in Kafka Streams

Kafka Streams achieves exactly-once through Kafka transactions:

All operations are atomic:

  • Read input
  • Update state
  • Write output
  • Commit offset

If the transaction commits, the update is visible and offsets move forward. If it fails, none of it becomes visible.

Key Kafka Streams Features

FeatureDescription
No clusterLibrary, deploy with application
Exactly-onceKafka transactions
Interactive queriesQuery local state via API
Fault toleranceChangelog topics for state
Simple deploymentStandard application deployment
Tight Kafka integrationNative Kafka performance

Apache Spark Structured Streaming

Spark Structured Streaming treats streams as a sequence of small batches. It is streaming built on top of Spark’s batch engine.

Architecture

  • A Spark driver plans and coordinates the query
  • Executors run the tasks
  • A micro-batch scheduler triggers execution repeatedly

Micro-Batch Processing

Instead of processing each event as soon as it arrives, Spark collects events for a short interval, then runs a batch computation.

How it works:

  • Collect events for a trigger interval (e.g., 100ms)
  • Process as a batch using Spark's batch engine
  • Write results, commit offsets
  • Repeat

Continuous Processing Mode

Spark also introduced a continuous processing mode aimed at lower latency. It can get closer to true streaming latency, but it comes with constraints and is not as broadly used as the standard micro-batch model.

Continuous mode trades throughput for lower latency.

State Management

Spark maintains state for aggregations and joins, and persists it via checkpointing to a durable location. It also supports event-time features such as watermarks and allowed lateness.

Key Spark Streaming Features

FeatureDescription
Unified APISame code for batch and stream
Mature ecosystemSparkSQL, MLlib, GraphX
High throughputBatch efficiency for each micro-batch
Late data handlingWatermarks and allowed lateness
Output modesAppend, Complete, Update
Managed cloud optionsDatabricks, EMR, Dataproc

Comparing Streaming Engines

Feature Comparison

FeatureFlinkKafka StreamsSpark Streaming
Processing modelTrue streamingTrue streamingMicro-batch
LatencyMillisecondsMilliseconds100ms+
DeploymentClusterLibraryCluster
State sizeVery largeLargeLarge
Exactly-onceCheckpointingKafka transactionsCheckpointing
SQL supportYesKSQL separateYes
Source flexibilityAny sourceKafka onlyMany sources

Decision Guide

When to Use Each

EngineBest For
FlinkLow latency, complex event processing, large state
Kafka StreamsKafka-centric, simple deployment, microservices
Spark StreamingUnified batch/stream, ML pipelines, SQL users

Summary

Streaming engines power real-time data processing with sophisticated mechanisms:

  • Core concepts: Event time, watermarks, windows, and processing guarantees are universal.
  • Apache Flink: True streaming, low latency, excellent state management, checkpoint-based exactly-once.
  • Kafka Streams: Library-based, simple deployment, Kafka-native, transaction-based exactly-once.
  • Spark Streaming: Micro-batch, unified with batch, mature ecosystem, higher latency.
  • Exactly-once: Achieved through checkpointing (Flink, Spark) or transactions (Kafka Streams).
  • State management: All engines support large state with persistence and recovery.
  • Choosing: Flink for low latency and complexity, Kafka Streams for simplicity, Spark for unified batch/stream.