Last Updated: January 15, 2026
Data rarely lives in the format you need. Your production database stores normalized rows optimized for transactions. Your analytics team needs denormalized tables optimized for queries. Your machine learning models need clean, feature-engineered datasets. Your data warehouse needs aggregated historical data.
Getting data from where it is to where it needs to be, in the format it needs to be in, is the job of ETL: Extract, Transform, Load. ETL pipelines are the plumbing of data infrastructure. They run silently in the background, moving terabytes of data every night, and only get attention when something breaks.
Understanding ETL is essential for system design. Almost every data architecture involves moving data between systems, and the patterns you learn here apply whether you are using batch processing, streaming, or hybrid approaches.
ETL stands for Extract, Transform, Load. It describes the process of moving data from source systems to destination systems while changing its format.
| Phase | Purpose | Example |
|---|---|---|
| Extract | Read data from source systems | Query production database, call APIs, read files |
| Transform | Convert data to desired format | Clean, validate, aggregate, join, denormalize |
| Load | Write data to destination system | Insert into warehouse, update tables |
Without ETL, every system that consumes data would need its own extraction logic, its own format conversions, and its own loading code, duplicated for every source and destination pair. ETL centralizes these concerns in dedicated pipelines that can be built, tested, and monitored in one place.
The Extract phase reads data from source systems. This sounds simple but involves significant complexity.
| Source Type | Examples | Challenges |
|---|---|---|
| Relational databases | MySQL, PostgreSQL, Oracle | Connection limits, query impact on production |
| NoSQL databases | MongoDB, Cassandra, DynamoDB | No standard query language, denormalized data |
| APIs | REST, GraphQL, SOAP | Rate limits, pagination, authentication |
| Files | CSV, JSON, Parquet, logs | Schema variations, encoding issues |
| Message queues | Kafka, RabbitMQ | Ordering, exactly-once processing |
| SaaS platforms | Salesforce, Stripe, HubSpot | API limits, data access restrictions |
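Rate limits and pagination, noted above for API sources, dominate most SaaS extracts. Below is a minimal sketch of a cursor-paginated extract with rate-limit backoff; the `/v1/orders` endpoint, `next_cursor` field, and response shape are hypothetical, not any particular vendor's API.

```python
import time
import requests

def extract_orders(base_url, api_key):
    """Pull every page from a hypothetical cursor-paginated orders endpoint."""
    headers = {"Authorization": f"Bearer {api_key}"}
    cursor = None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(f"{base_url}/v1/orders", headers=headers, params=params)
        if resp.status_code == 429:
            # Rate limited: wait as instructed (or 5s by default) and retry the same page.
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        body = resp.json()
        yield from body["data"]
        cursor = body.get("next_cursor")
        if not cursor:  # no more pages
            break
```

Whether the source is an API or a database, the next decision is how much data to pull on each run; the two basic approaches are compared below.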
| Method | Pros | Cons |
|---|---|---|
| Full extract | Simple, always consistent | Slow, wastes resources, strains source |
| Incremental extract | Fast, minimal source impact | Requires change tracking, can miss deletes |
Timestamp-based: Query rows where updated_at > last_run_time
Sequence-based: Track a monotonically increasing ID
Change Data Capture (CDC): Read from database transaction log
CDC is the most reliable for incremental extraction because it captures all changes, including deletes, without modifying the source schema.
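As a concrete illustration, here is a minimal sketch of timestamp-based incremental extraction with psycopg2, assuming a hypothetical orders table with an updated_at column and an etl_watermarks table that records the last successful run:

```python
import psycopg2

def extract_incremental(conn):
    """Fetch only the rows that changed since the previous successful run."""
    with conn.cursor() as cur:
        # The watermark table holds the high-water mark of the last successful extract.
        cur.execute("SELECT last_run_time FROM etl_watermarks WHERE job = 'orders'")
        (last_run_time,) = cur.fetchone()
        cur.execute(
            "SELECT id, customer_id, total, updated_at "
            "FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (last_run_time,),
        )
        rows = cur.fetchall()
    # Advance the watermark only after the downstream load succeeds,
    # otherwise a failed run would silently skip these rows next time.
    return rows
```

Note that this approach still misses hard deletes, which is exactly the gap CDC closes.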
Extracting data should not degrade production performance:
| Strategy | How It Works |
|---|---|
| Read replicas | Query a replica instead of primary |
| Off-peak scheduling | Run extracts during low-traffic hours |
| Rate limiting | Limit query concurrency and batch size |
| Snapshots | Extract from database snapshots |
| CDC | Read from log without querying tables |
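Combining a couple of these strategies, the sketch below pulls keyset-paginated batches with a pause between them, running against a read replica; the events table, DSN, and limits are hypothetical:

```python
import time
import psycopg2

def extract_in_batches(replica_dsn, batch_size=10_000, pause_seconds=1.0):
    """Pull rows in small keyset-paginated batches so no single query strains the source."""
    conn = psycopg2.connect(replica_dsn)  # a read replica, not the primary
    last_id = 0
    try:
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    "SELECT id, payload FROM events WHERE id > %s ORDER BY id LIMIT %s",
                    (last_id, batch_size),
                )
                rows = cur.fetchall()
                if not rows:
                    break
                yield rows
                last_id = rows[-1][0]      # keyset pagination: resume after the last id seen
                time.sleep(pause_seconds)  # crude rate limit between batches
    finally:
        conn.close()
```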
The Transform phase converts raw data into the format needed by destination systems. This is where most of the business logic lives.
| Type | Description | Example |
|---|---|---|
| Cleaning | Fix data quality issues | Remove nulls, fix encoding, trim whitespace |
| Validation | Ensure data meets requirements | Check required fields, validate formats |
| Standardization | Make data consistent | Convert dates to UTC, normalize phone numbers |
| Enrichment | Add derived data | Calculate age from birthdate, add geo data from IP |
| Aggregation | Summarize data | Sum daily sales, count monthly users |
| Joining | Combine data sources | Join orders with customers |
| Filtering | Remove unwanted data | Exclude test accounts, filter by date range |
| Deduplication | Remove duplicates | Dedupe by user email |
| Denormalization | Flatten for analytics | Embed customer data in order records |
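Several of these transformation types often happen in a single pass. A sketch with pandas, using hypothetical column names:

```python
import pandas as pd

def transform_orders(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    df = orders.copy()
    # Cleaning and standardization: trim and lowercase emails, force dates to UTC.
    df["email"] = df["email"].str.strip().str.lower()
    df["order_date"] = pd.to_datetime(df["order_date"], utc=True)
    # Filtering: drop test accounts.
    df = df[~df["email"].str.endswith("@example.com")]
    # Deduplication: keep the most recent row per order id.
    df = df.sort_values("order_date").drop_duplicates("order_id", keep="last")
    # Joining / denormalization: embed a few customer attributes in each order.
    df = df.merge(
        customers[["customer_id", "country", "segment"]],
        on="customer_id",
        how="left",
    )
    return df
```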
Transform is where you enforce data quality. Common quality checks include required fields being present, values matching expected formats and ranges, referential integrity between related tables, and row counts falling within expected bounds.
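A sketch of lightweight checks that fail the run rather than silently loading bad data; the column names and thresholds are hypothetical:

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> None:
    """Raise if the batch violates basic expectations; a failed check should stop the load."""
    errors = []
    if df["order_id"].isnull().any():
        errors.append("null order_id values")
    if df["order_id"].duplicated().any():
        errors.append("duplicate order_id values")
    if (df["total"] < 0).any():
        errors.append("negative order totals")
    if len(df) < 100:  # hypothetical lower bound on expected batch size
        errors.append(f"suspiciously small batch: {len(df)} rows")
    if errors:
        raise ValueError("data quality checks failed: " + "; ".join(errors))
```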
The Load phase writes transformed data to destination systems. The strategy depends on the destination type and use case.
| Strategy | When to Use | How It Works |
|---|---|---|
| Full load | Small tables, complete refresh needed | Truncate and reload |
| Incremental append | Event/transaction tables | Insert new rows only |
| Upsert (merge) | Dimension tables | Insert new, update existing |
| SCD Type 2 (slowly changing dimension) | Need historical tracking | New row for each change, with valid-from/valid-to dates |
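As an example of the upsert strategy, here is a sketch using PostgreSQL's INSERT ... ON CONFLICT from psycopg2; the dimension table and columns are hypothetical, and a unique constraint on customer_id is assumed:

```python
UPSERT_SQL = """
INSERT INTO dim_customer (customer_id, email, country, updated_at)
VALUES (%s, %s, %s, %s)
ON CONFLICT (customer_id) DO UPDATE
SET email = EXCLUDED.email,
    country = EXCLUDED.country,
    updated_at = EXCLUDED.updated_at
"""

def load_customers(conn, rows):
    """Insert new customers and update existing ones in a single transaction."""
    with conn:  # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
```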
For large data volumes, row-by-row inserts are too slow:
| Technique | Description | Speed |
|---|---|---|
| Batch INSERT | Multi-row INSERT statements | ~10x faster than single-row inserts |
| COPY command | Load from files (Postgres, Redshift) | ~100x faster than single-row inserts |
| Bulk loader | Native database utilities | Fastest |
| Parallel loading | Multiple concurrent loaders | Scales with connections |
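A sketch of the COPY path with psycopg2, streaming a CSV straight into a staging table; the table and file are hypothetical, and the table name is assumed to be trusted (not user input):

```python
def bulk_load_csv(conn, csv_path, table="staging_orders"):
    """Bulk-load a CSV via COPY, far faster than row-by-row INSERTs."""
    with open(csv_path, "r") as f, conn, conn.cursor() as cur:
        cur.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
```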
A common pattern is to load into staging tables first: write the batch to a staging table, validate it there, then publish it to the production tables in a single step. This keeps bad data away from consumers, makes the final publish atomic, and leaves a failed batch in place for inspection.
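A sketch of the validate-then-publish step, assuming the staging table was already populated (for example by the COPY load above) and using hypothetical table names:

```python
def publish_staged_orders(conn):
    """Validate staged rows, then move them into the target table in one transaction."""
    with conn:  # all-or-nothing: commit only if every statement succeeds
        with conn.cursor() as cur:
            # Reject the batch if staging contains obviously broken rows.
            cur.execute("SELECT count(*) FROM staging_orders WHERE total < 0")
            (bad_rows,) = cur.fetchone()
            if bad_rows:
                raise ValueError(f"{bad_rows} invalid rows in staging; aborting publish")
            cur.execute("INSERT INTO orders SELECT * FROM staging_orders")
            cur.execute("TRUNCATE staging_orders")
```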
Modern data platforms have shifted from ETL to ELT. In traditional ETL, transformations happen on dedicated ETL servers before loading. In ELT, raw data is loaded first and transformations happen inside the warehouse.
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | ETL server | Data warehouse |
| Compute cost | ETL infrastructure | Warehouse compute |
| Raw data | Not preserved | Preserved in warehouse |
| Flexibility | Changes require ETL updates | Transform with SQL anytime |
| Best for | Structured, well-defined transformations | Exploration, evolving requirements |
| Tools | Informatica, Talend, SSIS | dbt, Dataform, warehouse SQL |
Modern cloud warehouses (Snowflake, BigQuery, Redshift) have cheap storage and elastic, massively parallel compute, which makes transforming data after it lands practical. With ELT, you load raw data once and can re-transform it in SQL whenever requirements change, without going back to the source.
The medallion architecture is a popular pattern for organizing data in lakes and warehouses:
| Layer | Purpose | Users |
|---|---|---|
| Bronze (Raw) | Exact copy of source | Data engineers, debugging |
| Silver (Cleaned) | Validated, standardized | Data scientists, analysts |
| Gold (Aggregated) | Business metrics, KPIs | Dashboards, reports |
For real-time or near-real-time data sync, Change Data Capture (CDC) is the standard approach.
CDC captures every insert, update, and delete from the source database's transaction log, enabling near-real-time replication.
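A sketch of consuming a CDC stream with kafka-python, assuming a Debezium-style change event that carries an `op` code plus `before` and `after` row images; the topic name, event layout, and the two apply functions are hypothetical, and the exact envelope varies with connector configuration:

```python
import json
from kafka import KafkaConsumer  # kafka-python

def apply_upsert(row):  # placeholder: write the new row image to the destination
    print("upsert", row)

def apply_delete(row):  # placeholder: remove the row from the destination
    print("delete", row)

consumer = KafkaConsumer(
    "dbserver1.public.orders",  # hypothetical CDC topic for the orders table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone record, nothing to apply
        continue
    op = event["op"]  # c = insert, u = update, d = delete
    if op in ("c", "u"):
        apply_upsert(event["after"])
    elif op == "d":
        apply_delete(event["before"])  # deletes are captured too, unlike timestamp-based extracts
```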
Complex ETL involves many interdependent jobs: a daily aggregation cannot run until the extracts and transforms it depends on have finished. Orchestration tools (Airflow, Dagster, Prefect) model these dependencies as directed acyclic graphs (DAGs) and handle scheduling, retries, and alerting.
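A minimal Airflow sketch of the extract → transform → load dependency chain; the task bodies are placeholders, and parameter names vary slightly across Airflow versions (this is Airflow 2.x style):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract data from sources")

def transform():
    print("transform the extracted data")

def load():
    print("load the transformed data into the warehouse")

with DAG(
    dag_id="orders_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",  # one run per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # load waits for transform, which waits for extract
    t_extract >> t_transform >> t_load
```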
ETL pipelines fail. Networks drop, sources become unavailable, data has unexpected formats. Robust pipelines handle failures gracefully.
| Strategy | Description |
|---|---|
| Retry with backoff | Retry transient failures with increasing delays |
| Dead letter queue | Send failed records to separate storage for investigation |
| Checkpointing | Save progress so jobs can resume from last success |
| Idempotency | Design jobs to be safely re-run without side effects |
| Partial success | Allow job to complete with some failures, report issues |
| Circuit breaker | Stop trying after repeated failures to prevent cascade |
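A sketch of retry-with-backoff around a flaky step; the limits are purely illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry a callable on failure, doubling the wait each time, with a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the orchestrator
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

Wrap only idempotent steps this way; retrying a non-idempotent load can duplicate data, which is why idempotency matters so much.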
An idempotent operation produces the same result whether it runs once or many times. Making ETL jobs idempotent allows safe retries: a job that overwrites its output partition or upserts on a natural key can simply be re-run after a failure without creating duplicates.
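One common way to make a daily load idempotent is a delete-then-insert scoped to the run's date (effectively a partition overwrite), sketched here with a hypothetical fact table:

```python
def load_daily_partition(conn, run_date, rows):
    """Re-runnable daily load: wipe the day's slice, then insert it, in one transaction."""
    with conn:
        with conn.cursor() as cur:
            # Running this twice for the same run_date yields the same final state.
            cur.execute("DELETE FROM fact_orders WHERE order_date = %s", (run_date,))
            cur.executemany(
                "INSERT INTO fact_orders (order_id, customer_id, total, order_date) "
                "VALUES (%s, %s, %s, %s)",
                rows,
            )
```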
| Tool | Description | Best For |
|---|---|---|
| Informatica | Enterprise ETL, drag-and-drop | Large enterprises, complex mappings |
| Talend | Open source + enterprise | Cost-conscious organizations |
| SSIS | SQL Server Integration Services | Microsoft ecosystem |
| DataStage | IBM enterprise ETL | IBM shops |
| Tool | Description | Best For |
|---|---|---|
| dbt | SQL-based transformations | Warehouse-centric, version control |
| Dataform | SQL transformations (Google) | BigQuery users |
| Fivetran | Managed data ingestion | Quick setup, SaaS sources |
| Airbyte | Open source data integration | Custom connectors, self-hosted |
| Stitch | Managed data pipelines | Simple SaaS to warehouse |
| Tool | Description | Best For |
|---|---|---|
| Apache Airflow | DAG-based workflow orchestration | Complex dependencies, Python |
| Dagster | Data-aware orchestration | Modern data platform |
| Prefect | Dynamic workflows | Python, simpler than Airflow |
| AWS Step Functions | AWS-native orchestration | AWS infrastructure |
Assume every component will fail: build retries, checkpointing, idempotency, and alerting in from the start rather than after the first outage.
Track metrics over time: row counts, runtimes, and failure rates reveal drift and quality problems before they become incidents.
Treat ETL code like application code: keep it in version control, review it, test it, and deploy it through CI/CD.
Know where data comes from: track lineage so any value in a report can be traced back to its source system and the jobs that produced it.
Avoid full refreshes when possible: incremental processing is cheaper, faster, and gentler on source systems.
ETL pipelines are the backbone of data infrastructure: they extract data from diverse sources, transform it into clean, usable formats, and load it where analytics, reporting, and machine learning can reach it.
ETL moves data between systems, but where does all this data ultimately live? The next chapter explores Data Lakes, the foundational storage layer for modern big data architectures, where raw data lands before being refined for analytics.