
Document Chunking Strategies for RAG

Author: Venkata Sudhakar

Chunking is the process of splitting source documents into smaller pieces before embedding and storing them in a vector database. The quality of your RAG pipeline depends heavily on chunking strategy. Chunks that are too large may contain too much irrelevant information and exceed LLM context limits. Chunks that are too small may lose critical context, and the retrieved chunk alone may not contain enough information for the LLM to answer accurately. Finding the right chunk size and overlap for your specific documents and use case is one of the most important tuning decisions in RAG.

The three most widely used chunking strategies are: fixed-size character chunking (split every N characters, simplest but ignores document structure), recursive character splitting (tries to split on natural boundaries like paragraphs, sentences and words before falling back to character splits), and semantic chunking (uses embeddings to detect topic shifts and splits at semantic boundaries). For most production RAG systems, recursive character splitting with a chunk size of 500 to 1000 tokens and a 10 to 20 percent overlap is a solid starting point.

The example below shows all three chunking strategies using LangChain text splitters and compares the chunks each produces from the same source document.


It produces the following output:

=== Fixed-size chunks: 3 chunks ===
[1] (195 chars) Change Data Capture (CDC) is a database design pattern that tracks and captures...
[2] (198 chars) CDC works by reading the database transaction log, which records every insert,...
[3] (201 chars) Downstream consumers subscribe to these Kafka topics and react to changes in...

=== Recursive chunks: 3 chunks ===
[1] (231 chars) Change Data Capture (CDC) is a database design pattern that tracks and captures
changes made to data...
[2] (245 chars) CDC works by reading the database transaction log, which records every insert,
update and delete...
[3] (218 chars) Downstream consumers subscribe to these Kafka topics and react to changes in
near real time...

The example below shows how to chunk PDF documents loaded from disk and how to choose a chunk size based on the embedding model's token limit.
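The code block for this example also did not survive extraction, and it depended on a local PDF file and a ChromaDB instance, so the dependency-free sketch below covers just the chunk-size calculation it described. It uses the common rough heuristic of about 4 characters per token; the model names and token limits shown are illustrative assumptions:

```python
# Derive a character-based chunk_size from an embedding model's token limit.
# Heuristic: English text averages roughly 4 characters per token, so staying
# under (token_limit * 4) characters keeps a chunk within the model's window.

CHARS_PER_TOKEN = 4  # rough average for English prose


def chunk_size_for(token_limit: int, safety_margin: float = 0.8) -> int:
    """Return a character chunk_size that stays safely under token_limit."""
    return int(token_limit * CHARS_PER_TOKEN * safety_margin)


# Token limits below are illustrative values for common embedding models.
for model, token_limit in [
    ("all-MiniLM-L6-v2", 256),
    ("text-embedding-3-small", 8191),
]:
    size = chunk_size_for(token_limit)
    overlap = int(size * 0.15)  # 15% overlap, per the tuning guidelines
    print(f"{model}: chunk_size={size} chars, chunk_overlap={overlap} chars")
```

The safety margin matters because the characters-per-token ratio varies with content: code, tables, and non-English text tokenize more densely, so a chunk sized at exactly the limit can still overflow.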


It produces the following output:

Loaded 12 pages from PDF
Page 1 sample: COMPANY INFORMATION SECURITY POLICY
Version 3.2 | Approved January 2024

1. PURPOSE
This policy establishes the framework for...

Total chunks: 38
Avg chunk size: 612 chars
Sample chunk metadata: {"source": "company_policy.pdf", "page": 0}
Sample chunk content: COMPANY INFORMATION SECURITY POLICY
Version 3.2 | Approved January 2024

1. PURPOSE
This policy establishes the framework for protecting...

Indexed 38 chunks into ChromaDB.

Chunk size tuning guidelines:

chunk_size 200-400 chars - Good for FAQ-style documents where each answer is short and self-contained. Fast retrieval, but individual chunks may lack context for complex questions.

chunk_size 500-1000 chars - The most common production default. Balances retrieval precision with sufficient context. Works well for policy documents, technical manuals, and product documentation.

chunk_size 1500-3000 chars - Suited for long-form analysis, legal documents, and code files where entire sections need to be retrieved together. Use a larger top_k (retrieve more chunks) to avoid missing relevant content.

chunk_overlap 10-20% - Always set overlap to 10-20% of chunk_size to ensure that sentences spanning a chunk boundary are present in at least one chunk in full.
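The overlap rule above can be illustrated with a dependency-free fixed-size splitter (a sketch; the helper name is hypothetical):

```python
def fixed_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so text spanning a chunk boundary appears whole
    in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


text = "The quick brown fox jumps over the lazy dog. " * 20  # 900 chars
size, overlap = 200, int(200 * 0.15)  # chunk_size 200, 15% overlap = 30 chars

chunks = fixed_chunks(text, size, overlap)

# The last `overlap` characters of each chunk reappear at the start of the
# next one, so a sentence cut at the boundary survives intact in one of them.
print(chunks[0][-overlap:] == chunks[1][:overlap])  # True
```

Without overlap (step equal to size), a sentence that straddles a boundary would be split across two chunks and neither retrieved chunk would contain it in full.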


 
  


  