
Data Validation with Pydantic v2

Author: Venkata Sudhakar

Pydantic is the most widely used data validation library in the Python ecosystem. It is the validation engine behind FastAPI, LangChain, and hundreds of other major Python projects. Pydantic uses Python type hints to define schemas and validates incoming data against them at runtime, coercing types where possible and raising detailed errors when data does not match. Pydantic v2 (released in 2023) had its core validation logic rewritten in Rust (the pydantic-core package), making it 5-50x faster than v1 while adding a richer validation API.

The core concept is the BaseModel: a class that inherits from pydantic.BaseModel and uses type-annotated class attributes to define the schema. When you instantiate a model with data (from a dict, JSON string, or keyword arguments), Pydantic validates every field, coerces compatible types (e.g. "42" to int), applies validators, and raises a ValidationError with detailed field-by-field error messages if anything is wrong. This makes Pydantic ideal for validating API request bodies, LLM-generated structured outputs, ETL pipeline records, and configuration files.
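A minimal sketch of that behaviour (the User model and field names here are hypothetical, chosen just to show coercion and error reporting):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int            # numeric strings like "42" are coerced to int
    active: bool = True

user = User(name="Ada", age="42")     # keyword arguments, with coercion
print(user.age, type(user.age))       # 42 <class 'int'>

try:
    User(name="Ada", age="not a number")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])     # detailed, field-level error message
```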

The example below demonstrates comprehensive Pydantic v2 usage for a data migration domain, including field validators, model validators, computed fields, and JSON serialisation.
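A sketch of such a model, reconstructed to be consistent with the output below (the names DatabaseConfig, MigrationJob, and source_url are assumptions, and the source_url driver prefix simply mirrors the printed output):

```python
from datetime import datetime
from typing import Literal

from pydantic import (BaseModel, Field, computed_field, field_validator,
                      model_validator)

class DatabaseConfig(BaseModel):
    host: str
    port: int = Field(ge=1, le=65535)
    database: str
    username: str
    password: str = Field(min_length=8, exclude=True)  # never serialised
    ssl_enabled: bool = True

    @field_validator("host")
    @classmethod
    def strip_host(cls, v: str) -> str:
        return v.strip()

class MigrationJob(BaseModel):
    job_id: str = Field(pattern=r"^MIG-\d+$")
    environment: Literal["dev", "staging", "prod"]
    source: DatabaseConfig
    target: DatabaseConfig
    batch_size: int = Field(default=10_000, ge=1)
    max_parallel_tables: int = 4
    dry_run: bool = False
    created_at: datetime

    @model_validator(mode="after")
    def source_differs_from_target(self) -> "MigrationJob":
        if (self.source.host, self.source.database) == (
                self.target.host, self.target.database):
            raise ValueError("Source and target cannot be the same database")
        return self

    @computed_field
    @property
    def source_url(self) -> str:
        # Password masked; driver prefix mirrors the article's printed output
        s = self.source
        return f"postgresql+psycopg2://{s.username}:***@{s.host}:{s.port}/{s.database}"

job = MigrationJob(
    job_id="MIG-1042",
    environment="prod",
    source={"host": "mysql-prod", "port": 3306, "database": "appdb",
            "username": "etl_user", "password": "s3cr3t-pass"},
    target={"host": "pg-prod", "port": 5432, "database": "appdb",
            "username": "etl_user", "password": "s3cr3t-pass"},
    batch_size=50_000,
    max_parallel_tables=4,
    created_at="2024-01-15T09:00:00",   # str is coerced to datetime
)
print(job.model_dump_json(indent=2, exclude={"source_url"}))
print("Source URL:", job.source_url)
```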


It gives the following output:

{
  "job_id": "MIG-1042",
  "environment": "prod",
  "source": {"host": "mysql-prod", "port": 3306, "database": "appdb",
             "username": "etl_user", "ssl_enabled": true},
  "target": {"host": "pg-prod", "port": 5432, "database": "appdb",
             "username": "etl_user", "ssl_enabled": true},
  "batch_size": 50000,
  "max_parallel_tables": 4,
  "dry_run": false,
  "created_at": "2024-01-15T09:00:00"
}
Source URL: postgresql+psycopg2://etl_user:***@mysql-prod:3306/appdb
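The second run feeds deliberately invalid data into the same kind of schema. A self-contained sketch of that error-handling pattern (model and field names are assumptions; note that in Pydantic v2 an after-mode model validator only runs once all field validation succeeds, so the model-level duplicate-database error shown below would come from a separate run whose field values are otherwise valid):

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, model_validator

class DatabaseConfig(BaseModel):
    host: str
    port: int = Field(ge=1, le=65535)
    database: str
    password: str = Field(min_length=8)

class MigrationJob(BaseModel):
    job_id: str = Field(pattern=r"^MIG-\d+$")
    environment: Literal["dev", "staging", "prod"]
    source: DatabaseConfig
    target: DatabaseConfig

    @model_validator(mode="after")
    def source_differs_from_target(self) -> "MigrationJob":
        if (self.source.host, self.source.database) == (
                self.target.host, self.target.database):
            raise ValueError("Source and target cannot be the same database")
        return self

bad_payload = {
    "job_id": "JOB-7",            # wrong prefix for the pattern
    "environment": "qa",          # not one of the allowed literals
    "source": {"host": "db1", "port": 99999, "database": "appdb",
               "password": "short"},
    "target": {"host": "db2", "port": 5432, "database": "appdb",
               "password": "long-enough-pw"},
}

try:
    MigrationJob.model_validate(bad_payload)
except ValidationError as exc:
    print("Validation errors:")
    for err in exc.errors():
        print(f'  Field: {err["loc"]} | Error: {err["msg"]}')
```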

With invalid input, validation fails with the following output:

Validation errors:
  Field: ("job_id",) | Error: String should match pattern "^MIG-\d+$"
  Field: ("environment",) | Error: Input should be "dev", "staging" or "prod"
  Field: ("source", "port") | Error: Input should be less than or equal to 65535
  Field: ("source", "password") | Error: String should have at least 8 characters
  Field: () | Error: Source and target cannot be the same database

A related pattern is a Settings class that reads configuration from environment variables with an APP_ prefix, using the pydantic-settings package.

Pydantic in AI and data pipelines:

Pydantic is used throughout the modern Python AI stack. LangChain uses Pydantic BaseModel for all tool input schemas - every @tool function's arguments are validated by Pydantic. FastAPI validates request and response bodies with Pydantic automatically. When using the OpenAI API in JSON mode (response_format={"type": "json_object"}), you can parse and validate the JSON response directly into a Pydantic model using model_validate_json(). In ETL pipelines, Pydantic models are excellent for validating each record as it flows through the pipeline, catching data quality issues early rather than letting bad data reach the target database.
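Validating an LLM's JSON reply with model_validate_json() could look like this sketch (the WeatherQuery model and the raw string are stand-ins for a real model and a real API response):

```python
from pydantic import BaseModel, ValidationError

class WeatherQuery(BaseModel):
    city: str
    days: int

# Stand-in for the message content returned by a JSON-mode chat completion
raw = '{"city": "Lisbon", "days": 3}'

query = WeatherQuery.model_validate_json(raw)
print(query.days)   # 3

try:
    WeatherQuery.model_validate_json('{"city": "Lisbon"}')  # "days" missing
except ValidationError as exc:
    print(exc.error_count())    # 1: days is a required field
```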
