
ETL vs ELT - Patterns and When to Use Each

Author: Venkata Sudhakar

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two dominant patterns for moving and processing data between systems. In ETL, data is extracted from the source, transformed in a separate processing layer (like Apache Spark or a dedicated ETL tool), and then loaded into the target in its final, clean form. In ELT, raw data is extracted and loaded directly into the target system (usually a cloud data warehouse like BigQuery or Snowflake) and then transformed inside the warehouse using SQL. The shift from ETL to ELT is one of the most significant trends in modern data engineering.

ETL was the dominant pattern when data warehouses were expensive and storage was the bottleneck: you had to clean and aggregate data before loading because storing raw, dirty data was too costly. Modern cloud data warehouses (BigQuery, Snowflake, Redshift) offer effectively unlimited storage at low cost and enormous compute capacity. This makes ELT the preferred pattern today: load raw data immediately (faster, with no data loss), then transform it inside the warehouse, where compute is cheap and scalable and the transformation logic is plain SQL, typically maintained with dbt.

The example below shows an Apache Spark ETL pipeline in Python that reads from MySQL, applies business transformations, and writes clean data to a Parquet data lake.


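A minimal sketch of such a pipeline is shown here. The connection details, table names, column names, and S3 bucket path are all illustrative assumptions; a real pipeline would pull credentials from a secrets store and tune partitioning for its data volume.

```python
# Sketch of a Spark ETL job: Extract from MySQL, Transform in Spark, Load to Parquet.
# All hostnames, credentials, tables, and paths below are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders_etl")
    # Ship the MySQL JDBC driver with the job
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

jdbc_url = "jdbc:mysql://db-host:3306/shop"
props = {"user": "etl_user", "password": "***", "driver": "com.mysql.cj.jdbc.Driver"}

# Extract: pull the source tables over JDBC
orders = spark.read.jdbc(jdbc_url, "orders", properties=props)
customers = spark.read.jdbc(jdbc_url, "customers", properties=props)
print(f"Extracted {orders.count()} orders, {customers.count()} customers")

# Transform: drop cancelled orders, join customer attributes,
# and derive a year_month column to partition the output by
clean = (
    orders.filter(F.col("status") != "cancelled")
    .join(customers.select("customer_id", "country", "segment"),
          on="customer_id", how="left")
    .withColumn("year_month", F.date_format("order_date", "yyyy-MM"))
)
print(f"After transform: {clean.count()} orders (cancelled excluded)")

# Load: write partitioned Parquet files to the data lake
(clean.write
      .mode("overwrite")
      .partitionBy("year_month")
      .parquet("s3://my-data-lake/orders/"))
print("ETL complete. Data written to S3 data lake.")
```

Note that the heavy lifting (the filter and join) happens in Spark's distributed engine, not in the source database or the target, which is the defining trait of ETL.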
It gives the following output,

Extracted 892341 orders, 125847 customers
After transform: 847219 orders (cancelled excluded)
ETL complete. Data written to S3 data lake.

# Parquet files created at:
s3://my-data-lake/orders/year_month=2024-01/part-00000.parquet
s3://my-data-lake/orders/year_month=2024-01/part-00001.parquet
... (8 partitions per month)

The example below shows the ELT approach using dbt (data build tool), where raw data lands in BigQuery first and transformations are written as SQL models that run inside the warehouse.


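A sketch of the two models from the run below follows. The source name, columns, and the cancelled-order filter are assumptions chosen to mirror the Spark example; a real project would define the `raw` source in a sources file.

```sql
-- models/staging/stg_orders.sql
-- Staging view: light cleanup over the raw table loaded into BigQuery
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    lower(status) as status,
    total_amount
from {{ source('raw', 'orders') }}

-- models/marts/fct_orders.sql
-- Fact table: exclude cancelled orders, materialized as a table
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    status,
    total_amount
from {{ ref('stg_orders') }}
where status != 'cancelled'
```

Running `dbt run` compiles these models and executes them as `CREATE VIEW` / `CREATE TABLE` statements inside the warehouse, so all compute happens in BigQuery.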
It gives the following output,

09:00:00  Running with dbt=1.7.0
09:00:01  Found 12 models, 34 tests

09:00:02  1 of 2 START sql view model staging.stg_orders
09:00:03  1 of 2 OK created sql view model staging.stg_orders [CREATE VIEW in 1.2s]
09:00:03  2 of 2 START sql table model marts.fct_orders
09:00:09  2 of 2 OK created sql table model marts.fct_orders [CREATE TABLE (847219 rows) in 5.8s]

09:00:10  Running 4 tests on fct_orders
09:00:12  PASS not_null_fct_orders_order_id
09:00:12  PASS unique_fct_orders_order_id
09:00:13  PASS not_null_fct_orders_total_amount
09:00:13  PASS accepted_values_fct_orders_status

Finished running 2 models, 4 tests in 12.4s. 0 errors, 0 warnings.
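The four tests in the run above would typically be declared alongside the model in a schema.yml file. The column names are inferred from the test names; the accepted status values are assumptions.

```yaml
# models/marts/schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: total_amount
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered']  # assumed statuses
```

Because these tests are version-controlled YAML next to the SQL, data quality checks evolve with the transformation logic itself, one of the auditability benefits of the ELT/dbt workflow.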

ETL vs ELT - when to choose each:

Choose ETL when: data must be cleaned or masked before it enters the target (PII removal, data residency compliance), the transformation is computationally heavy and better suited to a distributed system like Spark than a SQL engine, or you are loading into an OLTP database that cannot store raw data.

Choose ELT when: you are loading into a cloud data warehouse (BigQuery, Snowflake, Redshift), you want a fast time-to-insight (raw data is available immediately for analysts to query), or you want to maintain all transformation logic as version-controlled SQL with dbt, making it transparent and auditable.
