In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > RAG Pipelines > Document Chunking Strategies for RAG

Document Chunking Strategies for RAG

Author: Venkata Sudhakar

Chunking is the process of splitting source documents into smaller pieces before embedding and storing them in a vector database. The quality of your RAG pipeline depends heavily on chunking strategy. Chunks that are too large may contain too much irrelevant information and exceed LLM context limits. Chunks that are too small may lose critical context, and the retrieved chunk alone may not contain enough information for the LLM to answer accurately. Finding the right chunk size and overlap for your specific documents and use case is one of the most important tuning decisions in RAG.

The three most widely used chunking strategies are: fixed-size character chunking (split every N characters, simplest but ignores document structure), recursive character splitting (tries to split on natural boundaries like paragraphs, sentences and words before falling back to character splits), and semantic chunking (uses embeddings to detect topic shifts and splits at semantic boundaries). For most production RAG systems, recursive character splitting with a chunk size of 500 to 1000 tokens and a 10 to 20 percent overlap is a solid starting point.

The below example shows all three chunking strategies using LangChain text splitters, with comparison of the chunks each produces from the same source document.

# pip install langchain langchain-openai
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

# Sample document
doc = """Change Data Capture (CDC) is a database design pattern that tracks and captures
changes made to data in a database so that other systems can respond to those changes.

CDC works by reading the database transaction log, which records every insert, update
and delete operation committed to the database. Tools like Debezium connect to the
database as a replication client and publish these change events to Apache Kafka topics.

Downstream consumers subscribe to these Kafka topics and react to changes in near
real time. Common use cases include syncing data to a data warehouse, invalidating
caches, updating search indexes, and triggering microservice workflows."""

# Strategy 1: Fixed-size character splitting (ignores structure)
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20
)
char_chunks = char_splitter.split_text(doc)
print("=== Fixed-size chunks:", len(char_chunks), "chunks ===")
for i, chunk in enumerate(char_chunks):
    print(f"[{i+1}] ({len(chunk)} chars) {chunk[:80]}...")

# Strategy 2: Recursive character splitting (preferred for most use cases)
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these delimiters in order
    chunk_size=250,
    chunk_overlap=30,
    length_function=len
)
recursive_chunks = recursive_splitter.split_text(doc)
print("\n=== Recursive chunks:", len(recursive_chunks), "chunks ===")
for i, chunk in enumerate(recursive_chunks):
    print(f"[{i+1}] ({len(chunk)} chars) {chunk[:80]}...")

It gives the following output,

=== Fixed-size chunks: 3 chunks ===
[1] (195 chars) Change Data Capture (CDC) is a database design pattern that tracks and captures...
[2] (198 chars) CDC works by reading the database transaction log, which records every insert,...
[3] (201 chars) Downstream consumers subscribe to these Kafka topics and react to changes in...

=== Recursive chunks: 3 chunks ===
[1] (231 chars) Change Data Capture (CDC) is a database design pattern that tracks and captures
changes made to data...
[2] (245 chars) CDC works by reading the database transaction log, which records every insert,
update and delete...
[3] (218 chars) Downstream consumers subscribe to these Kafka topics and react to changes in
near real time...

The below example shows how to chunk PDF documents loaded from disk and how to choose chunk size based on the embedding model token limit.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import chromadb

# Load a PDF document
loader = PyPDFLoader("company_policy.pdf")
pages = loader.load()  # Returns a list of Document objects (one per page)

print(f"Loaded {len(pages)} pages from PDF")
print(f"Page 1 sample: {pages[0].page_content[:200]}")

# Chunk with metadata preservation
# chunk_size=800 chars is roughly 200 tokens - good for text-embedding-3-small
# chunk_overlap=80 = 10% overlap ensures context is not lost at chunk boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len
)
chunks = splitter.split_documents(pages)  # Preserves page metadata

print(f"\nTotal chunks: {len(chunks)}")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
print(f"Sample chunk metadata: {chunks[0].metadata}")
print(f"Sample chunk content: {chunks[0].page_content[:200]}")

# Index all chunks into ChromaDB
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
chroma = chromadb.Client()
collection = chroma.create_collection("policy_docs")

for i, chunk in enumerate(chunks):
    embedding = embedder.embed_query(chunk.page_content)
    collection.add(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk.page_content],
        metadatas=[{"source": chunk.metadata.get("source", ""),
                    "page": chunk.metadata.get("page", 0)}]
    )

print(f"\nIndexed {len(chunks)} chunks into ChromaDB.")

It gives the following output,

Loaded 12 pages from PDF
Page 1 sample: COMPANY INFORMATION SECURITY POLICY
Version 3.2 | Approved January 2024

1. PURPOSE
This policy establishes the framework for...

Total chunks: 38
Avg chunk size: 612 chars
Sample chunk metadata: {"source": "company_policy.pdf", "page": 0}
Sample chunk content: COMPANY INFORMATION SECURITY POLICY
Version 3.2 | Approved January 2024

1. PURPOSE
This policy establishes the framework for protecting...

Indexed 38 chunks into ChromaDB.

Chunk size tuning guidelines:

chunk_size 200-400 chars - Good for FAQ-style documents where each answer is short and self-contained. Fast retrieval, but individual chunks may lack context for complex questions.

chunk_size 500-1000 chars - The most common production default. Balances retrieval precision with sufficient context. Works well for policy documents, technical manuals, and product documentation.

chunk_size 1500-3000 chars - Suited for long-form analysis, legal documents, and code files where entire sections need to be retrieved together. Use a larger top_k (retrieve more chunks) to avoid missing relevant content.

chunk_overlap 10-20% - Always set overlap to 10-20% of chunk_size to ensure that sentences spanning a chunk boundary are present in at least one chunk in full.

Send your comments, suggestions or queries regarding this site to [email protected].