
ETL vs ELT - Patterns and When to Use Each

Author: Venkata Sudhakar

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two dominant patterns for moving and processing data between systems. In ETL, data is extracted from the source, transformed in a separate processing layer (like Apache Spark or a dedicated ETL tool), and then loaded into the target in its final, clean form. In ELT, raw data is extracted and loaded directly into the target system (usually a cloud data warehouse like BigQuery or Snowflake) and then transformed inside the warehouse using SQL. The shift from ETL to ELT is one of the most significant trends in modern data engineering.

ETL was the dominant pattern when data warehouses were expensive and storage was the bottleneck: you had to clean and aggregate data before loading because storing raw, dirty data was too costly. Modern cloud data warehouses (BigQuery, Snowflake, Redshift) offer effectively unlimited storage at low cost and enormous compute capacity. This makes ELT the preferred pattern today: load raw data immediately (faster, with no data loss), then transform it inside the warehouse, where compute is cheap and scalable and the transformation logic is plain SQL, typically maintained with dbt.

The example below shows an Apache Spark ETL pipeline in Python that reads from MySQL, applies business transformations, and writes clean data to a Parquet data lake.


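A minimal sketch of such a pipeline is shown here. The connection details, table names, column names, and S3 bucket path are all illustrative assumptions; a real pipeline would pull credentials from a secrets store and tune partitioning for its data volume.

```python
# Sketch of a Spark ETL job: Extract from MySQL, Transform in Spark, Load to Parquet.
# All hostnames, credentials, tables, and paths below are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders_etl")
    # Ship the MySQL JDBC driver with the job
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

jdbc_url = "jdbc:mysql://db-host:3306/shop"
props = {"user": "etl_user", "password": "***", "driver": "com.mysql.cj.jdbc.Driver"}

# Extract: pull the source tables over JDBC
orders = spark.read.jdbc(jdbc_url, "orders", properties=props)
customers = spark.read.jdbc(jdbc_url, "customers", properties=props)
print(f"Extracted {orders.count()} orders, {customers.count()} customers")

# Transform: drop cancelled orders, join customer attributes,
# and derive a year_month column to partition the output by
clean = (
    orders.filter(F.col("status") != "cancelled")
    .join(customers.select("customer_id", "country", "segment"),
          on="customer_id", how="left")
    .withColumn("year_month", F.date_format("order_date", "yyyy-MM"))
)
print(f"After transform: {clean.count()} orders (cancelled excluded)")

# Load: write partitioned Parquet files to the data lake
(clean.write
      .mode("overwrite")
      .partitionBy("year_month")
      .parquet("s3://my-data-lake/orders/"))
print("ETL complete. Data written to S3 data lake.")
```

Note that the heavy lifting (the filter and join) happens in Spark's distributed engine, not in the source database or the target, which is the defining trait of ETL.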
It gives the following output,

Extracted 892341 orders, 125847 customers
After transform: 847219 orders (cancelled excluded)
ETL complete. Data written to S3 data lake.

# Parquet files created at:
s3://my-data-lake/orders/year_month=2024-01/part-00000.parquet
s3://my-data-lake/orders/year_month=2024-01/part-00001.parquet
... (8 partitions per month)

The example below shows the ELT approach using dbt (data build tool), where raw data lands in BigQuery first and transformations are written as SQL models that run inside the warehouse.


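A sketch of the two models from the run below follows. The source name, columns, and the cancelled-order filter are assumptions chosen to mirror the Spark example; a real project would define the `raw` source in a sources file.

```sql
-- models/staging/stg_orders.sql
-- Staging view: light cleanup over the raw table loaded into BigQuery
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    lower(status) as status,
    total_amount
from {{ source('raw', 'orders') }}

-- models/marts/fct_orders.sql
-- Fact table: exclude cancelled orders, materialized as a table
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    status,
    total_amount
from {{ ref('stg_orders') }}
where status != 'cancelled'
```

Running `dbt run` compiles these models and executes them as `CREATE VIEW` / `CREATE TABLE` statements inside the warehouse, so all compute happens in BigQuery.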
It gives the following output,

09:00:00  Running with dbt=1.7.0
09:00:01  Found 12 models, 34 tests

09:00:02  1 of 2 START sql view model staging.stg_orders
09:00:03  1 of 2 OK created sql view model staging.stg_orders [CREATE VIEW in 1.2s]
09:00:03  2 of 2 START sql table model marts.fct_orders
09:00:09  2 of 2 OK created sql table model marts.fct_orders [CREATE TABLE (847219 rows) in 5.8s]

09:00:10  Running 4 tests on fct_orders
09:00:12  PASS not_null_fct_orders_order_id
09:00:12  PASS unique_fct_orders_order_id
09:00:13  PASS not_null_fct_orders_total_amount
09:00:13  PASS accepted_values_fct_orders_status

Finished running 2 models, 4 tests in 12.4s. 0 errors, 0 warnings.
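The four tests in the run above would typically be declared alongside the model in a schema.yml file. The column names are inferred from the test names; the accepted status values are assumptions.

```yaml
# models/marts/schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: total_amount
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered']  # assumed statuses
```

Because these tests are version-controlled YAML next to the SQL, data quality checks evolve with the transformation logic itself, one of the auditability benefits of the ELT/dbt workflow.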

ETL vs ELT - when to choose each:

Choose ETL when: data must be cleaned or masked before it enters the target (PII removal, data residency compliance), the transformation is computationally heavy and better suited to a distributed system like Spark than a SQL engine, or you are loading into an OLTP database that cannot store raw data.

Choose ELT when: you are loading into a cloud data warehouse (BigQuery, Snowflake, Redshift), you want a fast time-to-insight (raw data is available immediately for analysts to query), or you want to maintain all transformation logic as version-controlled SQL with dbt, making it transparent and auditable.
