We have now covered two distinct storage paradigms: data lakes for raw, flexible storage and data warehouses for structured, fast analytics. For years, organizations ran both systems in parallel, maintaining complex ETL pipelines to move data between them.
This dual architecture creates problems. Data is duplicated. ETL pipelines add latency. Governance becomes fragmented. Teams use different tools for different systems. The cost of running and maintaining both systems adds up.
The data lakehouse emerged as a solution: a single architecture that combines the flexibility of data lakes with the performance and governance of data warehouses. Store data once in open formats on cheap object storage, but add a transaction layer that enables warehouse-like reliability and performance.
In this chapter, you will learn:
- What a data lakehouse is and why it emerged
- The table formats that enable lakehouse architecture (Delta Lake, Iceberg, Hudi)
- How ACID transactions work on data lakes
- Key lakehouse features like time travel and schema evolution
- When to use a lakehouse vs traditional architectures
The Problem with Two Systems
Traditional architectures separate lakes and warehouses: raw data lands in the lake, and ETL pipelines copy curated subsets into the warehouse for analytics.
On paper, this looks clean. In practice, it creates a permanent gap between where data lands and where data gets used.
Problems with This Approach
| Problem | Description |
|---|---|
| Data duplication | Same data stored in lake and warehouse |
| ETL complexity | Constant movement between systems |
| Staleness | Warehouse lags behind lake |
| Cost | Two systems to pay for and maintain |
| Governance gaps | Different security models in each system |
| Tool fragmentation | Data scientists use lake, analysts use warehouse |
The Two-Tier Pain Points
When data is split across two systems, every group hits a different wall:
- Data Scientists: “The warehouse doesn’t have the raw data I need.”
- Analysts: “The lake is too slow and too unstructured.”
- Engineers: “Keeping two systems in sync is exhausting.”
- Finance: “Why are we paying for two platforms?”
The result is predictable: more pipelines, more copies, more confusion, and slower progress for everyone.
What is a Data Lakehouse?
A data lakehouse is a modern data platform that blends two worlds that used to be separate:
- Data lakes, which are great at storing huge amounts of raw data cheaply
- Data warehouses, which are great at making that data reliable, queryable, and easy for analytics teams to use
The goal is simple: one platform where data can land as raw files and still be trusted for BI, analytics, and machine learning.
A Typical Lakehouse Architecture
Think of it as three layers working together:
1. Storage Layer
This is where the actual data lives (object storage holding open file formats like Parquet). It is cheap, scalable, and flexible.
2. Table Format Layer
This is the layer that classic data lakes lacked (Delta Lake, Apache Iceberg, Apache Hudi). It adds rules and guarantees on top of raw files.
3. Compute Layer
Multiple engines can operate on the same data (BI queries, ETL workloads, ML training, streaming jobs). Different workloads, same underlying data.
Lakehouse = Lake + Warehouse Features
A lakehouse keeps what is good about lakes while borrowing what made warehouses reliable.
What it keeps from a Data Lake
- Open formats like Parquet
- Low-cost storage in object stores
- Supports all data types, structured and unstructured
- Works well for ML and AI, not just dashboards
- Handles batch and streaming data
What it adopts from a Data Warehouse
- ACID transactions so writes are safe and consistent
- Schema enforcement so “garbage in” is less likely
- Better SQL performance through table-level metadata and optimizations
- BI tool compatibility so analysts can use familiar tools
- Governance and security features that teams expect in production
The pitch is not “replace everything.” It is “stop copying the same data into two places just to make it usable.”
The Key Innovation: Table Formats
The lakehouse is powered by a simple but important idea: Keep data as files, but manage it like tables.
That is what table formats do.
In a traditional data lake, you typically have:
- Parquet files in object storage
- No real transactions
- No reliable schema rules
- Lots of custom glue code to keep things consistent
It works, but it is fragile at scale.
With a table format
You still have Parquet files, but now you also get:
- ACID transactions: Writes either fully succeed or do not happen at all.
- Schema enforcement and evolution: The table can reject bad writes, and schema changes are tracked cleanly.
- Time travel: You can query a table as it looked yesterday, last week, or before a bad job ran.
- Versioning and auditability: Every change becomes traceable. This is huge for debugging and governance.
- Safer concurrent writes: Multiple jobs can write without stepping on each other.
In other words, table formats turn a “folder of files” into a “managed dataset.”
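To make that concrete, here is a minimal PySpark sketch (using Delta Lake as one example) that writes the same DataFrame as plain Parquet and as a managed table. The session configuration and paths are illustrative assumptions, not a specific deployment.

```python
from pyspark.sql import SparkSession

# Assumes the open-source delta-spark package is on the classpath (e.g. via --packages)
spark = (SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

orders = spark.createDataFrame(
    [(1, "alice", 120.0), (2, "bob", 75.5)],
    ["id", "name", "amount"],
)

# Classic data lake write: just Parquet files in a folder, no table-level guarantees
orders.write.mode("overwrite").parquet("s3://demo-bucket/raw/orders")

# Same data as a Delta table: Parquet files plus a _delta_log transaction log
orders.write.format("delta").mode("overwrite").save("s3://demo-bucket/lakehouse/orders")
```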
Table Formats: Delta Lake, Iceberg, Hudi
Three open-source table formats dominate the lakehouse space: Delta Lake, Apache Iceberg, and Apache Hudi.
Delta Lake
Delta Lake started at Databricks and later became an open-source project with a broader ecosystem.
The defining idea is simple: All table changes are recorded in a transaction log.
Think of a Delta table as two things:
- Data files (usually Parquet)
- A transaction log that describes which files are part of the table at each version
How Delta Lake works:
- The transaction log records every change (adds, removes, schema updates).
- Each write creates new files rather than editing Parquet files in place.
- Old versions remain queryable, enabling time travel and rollback workflows.
- Optimistic concurrency control coordinates writes. Writers proceed assuming no conflict, then commit only if the table state has not changed in incompatible ways.
Delta is a strong “default” choice when you want a practical, production-friendly format with a lot of tooling and mindshare, especially if your core runtime is Spark.
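Continuing the sketch above, you can see the transaction log at work by asking the table for its history; DESCRIBE HISTORY surfaces the commits recorded under the table's _delta_log directory. The path is a placeholder.

```python
# Each commit is a JSON entry in the table's _delta_log directory;
# DESCRIBE HISTORY exposes that log as a queryable view of versions.
path = "s3://demo-bucket/lakehouse/orders"
history = spark.sql(f"DESCRIBE HISTORY delta.`{path}`")
history.select("version", "timestamp", "operation").show()
```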
Apache Iceberg
Iceberg was created at Netflix and is now an Apache project with broad adoption.
Its core philosophy is: Make metadata scalable and make reads fast across many engines.
Iceberg is built around a layered metadata structure that lets query engines plan efficiently without scanning the world.
How Iceberg works (high level)
- Hierarchical metadata enables fast planning. The engine reads metadata first to decide which files matter.
- Hidden partitioning. Users query the table normally without dealing with partition columns directly.
- Partition evolution. You can change partition strategies over time without rewriting existing data.
- Multi-engine access is a first-class goal. Iceberg is designed to work cleanly with many compute engines, not just one.
Iceberg is an excellent choice when you expect multiple query engines, large-scale tables, and frequent schema or partition changes over time.
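As a sketch of hidden partitioning, assuming a Spark session configured with the Iceberg runtime and a catalog named lake (both placeholder names): the table is partitioned by a transform of a timestamp column, and queries filter on that column directly without referencing any partition column.

```python
# Partition by a transform of event_ts; readers never touch partition columns directly
spark.sql("""
    CREATE TABLE lake.analytics.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Ordinary predicate on event_ts; Iceberg maps it to partitions internally
recent = spark.sql("""
    SELECT * FROM lake.analytics.events
    WHERE event_ts >= current_timestamp() - INTERVAL 7 DAYS
""")
```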
Apache Hudi
Hudi was created at Uber and is now an Apache project.
It is best known for one specific strength: Efficient incremental processing and record-level changes.
Key features
- Designed for incremental pipelines. You can pull only what changed since the last run.
- Record-level updates and deletes. This is a major differentiator versus append-only patterns.
- Built-in compaction and clustering. Helps manage file sizes and layout over time.
- Strong Spark integration. Hudi has deep support in Spark-centric ecosystems.
Hudi is a great fit for ingestion-heavy systems, CDC pipelines, upserts, and streaming-like workloads where incremental consumption matters.
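A hedged sketch of a Hudi upsert in PySpark, assuming a session with the Hudi Spark bundle available and a DataFrame of changed rows called changed_orders; the table name, key fields, and path are illustrative.

```python
# Upsert: rows whose record key already exists are updated, new keys are inserted
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(changed_orders.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://demo-bucket/lakehouse/orders_hudi"))
```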
Choosing Between Them
A practical way to decide:
- Pick Delta Lake if your stack is Spark-first and you want a straightforward, widely used option for analytics and ETL.
- Pick Iceberg if you expect multiple engines and long-term table evolution (schema and partitioning) to be a normal part of life.
- Pick Hudi if updates, deletes, and incremental consumption are central requirements, especially for ingestion and CDC-style pipelines.
ACID Transactions on Data Lakes
A traditional data lake is just files in object storage. That is great for cheap storage, but it comes with a big gap: there is no real transaction boundary. Writing a dataset often means writing many files, and file writes are not atomic as a group.
So you can end up in a state where the data is half-updated, and readers have no way to tell.
Imagine a job that needs to write 10 Parquet files for today’s partition.
- The job starts writing 10 files
- File 7 fails mid-write
- Some readers scan the location and find 6 complete files and one partially written file
- Now your table is inconsistent, even if the job eventually retries
This is not rare. At scale, failures happen all the time: spot interruptions, network blips, executor crashes, permission issues, out-of-disk, throttling. File-based storage does not give you a clean “commit” concept.
Table formats (Delta Lake, Iceberg, Hudi) introduce a real table abstraction on top of files by adding:
- a transaction protocol
- a commit log and metadata
- a way for readers to always see a consistent snapshot
That is how a lakehouse gets ACID-like behavior even though the underlying storage is still object storage.
Atomicity
All changes commit together or none do.
- If the commit succeeds: all new files appear at once
- If the job fails before commit: the files might exist, but they are not part of the table, so readers never see them
Consistency
The table stays valid according to its rules.
A lakehouse is not just storing files, it is enforcing a schema and constraints at the table level.
Example:
- Table schema: id INT, name STRING, amount DECIMAL
- Attempted write: id INT, name STRING, amount STRING
- Result: rejected due to type mismatch
This prevents slow-moving data corruption where one bad job quietly writes a wrong type and breaks downstream dashboards a week later.
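A minimal sketch of that rejection using Delta Lake (path and values are placeholders): the second write fails with an exception instead of silently appending mistyped data.

```python
from pyspark.sql.utils import AnalysisException

path = "s3://demo-bucket/lakehouse/payments"

good = spark.createDataFrame([(1, "alice", 9.99)], ["id", "name", "amount"])
good.write.format("delta").mode("overwrite").save(path)   # amount is a numeric type here

bad = spark.createDataFrame([(2, "bob", "not-a-number")], ["id", "name", "amount"])
try:
    bad.write.format("delta").mode("append").save(path)   # amount arrives as STRING
except AnalysisException as err:
    print(f"Write rejected by schema enforcement: {err}")
```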
Isolation
Reads and writes do not step on each other.
Table formats typically use snapshot-based isolation.
- Writers produce a new version, like V3
- Readers continue to read V1 or V2 without being affected
- Once V3 is committed, new readers can see it
- Existing readers still finish on the snapshot they started with
So while a writer is actively writing new files, readers do not see half of them. They see a stable snapshot.
Durability
Once a commit succeeds, it sticks.
Durability in a lakehouse comes from three things working together:
- Metadata stored in durable object storage
- Object storage replication and durability guarantees
- The transaction log or metadata history, which allows recovery and auditing
If a compute cluster dies right after commit, the commit record still exists and the table can be reconstructed from metadata.
Key Lakehouse Features
A lakehouse is not just “Parquet in object storage.” The table format layer turns raw files into something that behaves like a real database table.
Time Travel
Time travel lets you query the table as it existed at a specific time or version.
Common use cases:
- Debug production issues by comparing “before” and “after” states
- Reproduce ML training data so experiments are deterministic
- Audit history for compliance and reporting
- Rollback safely when a bad job corrupts a dataset
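For example, with Delta Lake (assuming an orders table is registered in the catalog; version numbers and timestamps are placeholders), a prior state can be queried by version or by time. Iceberg offers the same capability through its own FOR VERSION AS OF / FOR TIMESTAMP AS OF syntax.

```python
# Query the table as of a specific version number
before_fix = spark.sql("SELECT count(*) AS n FROM orders VERSION AS OF 12")

# Query the table as it looked at a point in time
last_week = spark.sql("SELECT count(*) AS n FROM orders TIMESTAMP AS OF '2024-06-01'")

# DataFrame reader equivalent for a path-based table
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2024-06-01")
            .load("s3://demo-bucket/lakehouse/orders"))
```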
Schema Evolution
In classic data lakes, schema changes are painful because your “schema” is implied by what files happen to exist. Lakehouse table formats make schema explicit and track it as part of the table’s metadata.
Partition evolution (Iceberg)
Iceberg goes a step further by letting you evolve partitioning over time without rewriting historical data.
Schema and partition evolution allows your tables to grow with the business:
- You can add fields as product requirements change
- You can rename and standardize without breaking downstream consumers
- You can update partition strategy as scale and query patterns evolve
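As a sketch of what this looks like with Iceberg's Spark SQL extensions (table and column names are illustrative), all three changes are metadata operations and do not rewrite existing data files:

```python
# Add a column: existing rows simply read it as NULL
spark.sql("ALTER TABLE lake.analytics.events ADD COLUMNS (country STRING)")

# Rename a column without rewriting data or breaking schema-aware readers
spark.sql("ALTER TABLE lake.analytics.events RENAME COLUMN user_id TO account_id")

# Evolve the partition spec: new writes use the new layout, old files stay as they are
spark.sql("ALTER TABLE lake.analytics.events ADD PARTITION FIELD bucket(16, account_id)")
```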
Upserts and Deletes
Raw Parquet is great for append-only workloads, but it falls apart when you need record-level changes. Table formats enable proper upserts and deletes by tracking file-level changes through metadata and commits.
With raw Parquet files, the only way to change a single record was to rewrite entire files or partitions.
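Here is a hedged sketch of a record-level upsert using Delta Lake's MERGE, assuming an orders table and an order_updates staging view exist in the catalog (both names are illustrative); Iceberg and Hudi expose equivalent merge and upsert paths.

```python
# Upsert: update matching rows, insert the rest, all in one atomic commit
spark.sql("""
    MERGE INTO orders AS t
    USING order_updates AS s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Record-level delete: only the affected files are rewritten
spark.sql("DELETE FROM orders WHERE status = 'cancelled'")
```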
Data Compaction
Small files destroy performance. Every query has to open more files, read more metadata, and do more planning. A lakehouse provides maintenance operations to keep file layout healthy.
What compaction does
- Combines many small files into fewer large files
- Removes physically deleted records when applicable
- Improves scan efficiency and query planning speed
- Reduces load on the object store
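A sketch of the maintenance commands involved, using Delta Lake's OPTIMIZE and Iceberg's rewrite_data_files procedure (table and catalog names are placeholders):

```python
# Delta Lake: rewrite many small files into fewer large ones,
# optionally co-locating related data for faster scans
spark.sql("OPTIMIZE orders")
spark.sql("OPTIMIZE orders ZORDER BY (customer_id)")

# Apache Iceberg: the equivalent maintenance runs as a stored procedure
spark.sql("CALL lake.system.rewrite_data_files(table => 'analytics.events')")
```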
Lakehouse Architecture Patterns
Medallion Architecture on Lakehouse
This pattern maps cleanly onto a lakehouse because every layer can use the same table format with the same guarantees.
- Bronze: raw ingestion, usually append-only
- Silver: cleaned, validated, deduplicated, often upserts
- Gold: aggregated business tables and metrics
The key benefit is consistency. You get transactions, time travel, and governance at every layer rather than only in the warehouse.
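A compressed sketch of the flow, assuming each layer is a Delta table and that the raw events carry event_id, event_ts, and country fields (paths and logic are illustrative):

```python
# Bronze: land raw events append-only, schema-on-read
raw = spark.read.json("s3://demo-bucket/landing/events/")
raw.write.format("delta").mode("append").save("s3://demo-bucket/bronze/events")

# Silver: clean, validate, and deduplicate, still in the same table format
bronze = spark.read.format("delta").load("s3://demo-bucket/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")
silver.write.format("delta").mode("overwrite").save("s3://demo-bucket/silver/events")

# Gold: business-level aggregate ready for BI
gold = silver.groupBy("country").count()
gold.write.format("delta").mode("overwrite").save("s3://demo-bucket/gold/events_by_country")
```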
Real-Time Lakehouse
You can also combine streaming and tables:
- Kafka (or another bus)
- Stream processing engine
- Writes into lakehouse tables
- BI and ML read from the same tables
Delta Lake and Hudi are commonly used for streaming ingestion patterns, enabling near-real-time analytics without copying data into a separate warehouse first.
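A minimal Structured Streaming sketch that lands a Kafka topic in a Delta table, assuming the Kafka and Delta connectors are on the classpath; brokers, topic, and paths are placeholders.

```python
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load())

# Transactional streaming sink into a lakehouse table;
# BI dashboards and ML jobs read the very same table
query = (events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://demo-bucket/checkpoints/orders")
    .start("s3://demo-bucket/lakehouse/orders_stream"))
```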
Lakehouse Platforms
Databricks Lakehouse Platform
A tightly integrated stack that builds around Delta Lake and adds platform features like:
- governance and cataloging
- query acceleration engines
- BI-friendly SQL layer
- ML workflows and tooling
- managed ETL orchestration
Snowflake Iceberg Tables
Snowflake has added Iceberg support so you can:
- read and write Iceberg tables stored in cloud object storage
- use Snowflake compute on external data
- keep interoperability with other engines that understand Iceberg
Amazon Athena + Iceberg
A serverless approach that is attractive when you want:
- SQL queries over Iceberg tables without managing clusters
- pay-per-query economics
- integration with AWS Glue catalog and broader AWS tooling
Lakehouse vs Traditional Architectures
When a lakehouse is a strong fit
- Unified BI + ML: one system for analytics and training pipelines
- Cost-sensitive stacks: one storage layer, fewer copies
- Multi-engine environments: Spark, Flink, Trino, Presto sharing the same tables
- Batch plus streaming: one table abstraction for both
- Auditing and rollback needs: time travel is a first-class feature
When a traditional warehouse may still be better
- Very heavy BI workloads where the warehouse is already highly optimized and tuned
- Existing investment where the warehouse ecosystem is mature and deeply integrated
- Simplicity for pure analytics where a warehouse-only approach is straightforward
- You prefer managed lock-in because “it just works” matters more than openness
The Convergence
The gap between warehouses and lakehouses keeps shrinking:
- Warehouses are supporting open table formats and external data
- Lakehouse platforms are adding warehouse-like performance and governance features
- More teams are aiming for a unified data platform, even if it is built from multiple products
The end state looks less like “lake versus warehouse” and more like “how do we build a unified, governed, high-performance data platform with open data at the center.”
Summary
The data lakehouse unifies data lakes and warehouses:
- The problem: Two-tier architectures with lake and warehouse create duplication, complexity, and cost.
- The solution: Table formats (Delta Lake, Iceberg, Hudi) add warehouse features to lakes.
- ACID transactions: Atomicity, consistency, isolation, and durability on object storage.
- Key features: Time travel, schema evolution, upserts, deletes, and compaction.
- Architecture: Bronze/silver/gold medallion architecture works on lakehouse.
- Platforms: Databricks, Snowflake, AWS, and GCP all offer lakehouse capabilities.
- Trade-offs: Lakehouses offer unified platform but warehouses may still edge ahead for pure BI.