In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > RAG Pipelines > What is Retrieval-Augmented Generation (RAG)

What is Retrieval-Augmented Generation (RAG)

Author: Venkata Sudhakar

Retrieval-Augmented Generation (RAG) is an architecture pattern for building AI applications that combines the reasoning ability of a Large Language Model with the ability to retrieve and use information from your own private documents, databases, or knowledge bases. A plain LLM only knows what it learned during training - it cannot answer questions about your company's internal policies, a product manual published last month, or real-time data. RAG solves this by retrieving relevant information at query time and providing it to the LLM as context in the prompt.

The RAG pipeline has two main phases. The indexing phase (offline, run once) takes your source documents, splits them into smaller chunks, converts each chunk into an embedding vector using an embedding model, and stores those vectors in a vector database. The retrieval and generation phase (online, at query time) takes the user's question, converts it to a vector using the same embedding model, searches the vector database for the most semantically similar document chunks, and then passes those chunks together with the user's question to the LLM to generate a grounded, accurate answer. The LLM never makes up facts because all the relevant facts are supplied in the prompt.

The below example shows the complete RAG pipeline using Python: ingesting documents, creating embeddings, storing them in a ChromaDB vector database, and then querying with semantic search followed by LLM generation.

# Install: pip install openai chromadb
from openai import OpenAI
import chromadb

client = OpenAI(api_key="your-api-key-here")
chroma = chromadb.Client()
collection = chroma.create_collection("company_docs")

# ---- INDEXING PHASE (run once to build the knowledge base) ----

# Step 1: Your source documents (in practice these come from PDFs, wikis, DBs)
documents = [
    "Our refund policy allows returns within 30 days of purchase with original receipt.",
    "Premium plan customers get 24/7 phone support and a dedicated account manager.",
    "The API rate limit is 1000 requests per minute for Pro accounts.",
    "To reset your password, click Forgot Password on the login page and check your email.",
    "We offer a 14-day free trial with no credit card required for all plans."
]

# Step 2: Generate embeddings for each document chunk
def embed(text):
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

# Step 3: Store chunks and their embeddings in the vector database
for i, doc in enumerate(documents):
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embed(doc)],
        documents=[doc]
    )
print(f"Indexed {len(documents)} documents into ChromaDB.")

It gives the following output,

Indexed 5 documents into ChromaDB.

The below example shows the retrieval and generation phase - taking a user question, finding relevant chunks, and generating a grounded answer.

# ---- RETRIEVAL AND GENERATION PHASE (run at query time) ----

def rag_query(user_question, top_k=3):
    # Step 1: Embed the user question
    query_embedding = embed(user_question)

# Step 2: Retrieve the top_k most semantically similar document chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    retrieved_chunks = results["documents"][0]

print("Retrieved chunks:")
    for i, chunk in enumerate(retrieved_chunks):
        print(f"  [{i+1}] {chunk}")

# Step 3: Build the prompt with retrieved context
    context = "\n".join([f"- {chunk}" for chunk in retrieved_chunks])
    prompt = f"""You are a helpful customer support assistant.
Answer the question using ONLY the information provided in the context below.
If the answer is not in the context, say "I don't have that information."

Context:
{context}

Question: {user_question}
Answer:"""

# Step 4: Generate answer using the LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content

# Test the RAG pipeline
questions = [
    "Can I return a product after 30 days?",
    "What is the API rate limit for Pro accounts?",
    "Do I need a credit card for the trial?"
]

for q in questions:
    print(f"\nQ: {q}")
    print(f"A: {rag_query(q)}")
    print("-" * 50)

It gives the following output,

Q: Can I return a product after 30 days?
Retrieved chunks:
  [1] Our refund policy allows returns within 30 days of purchase with original receipt.
  [2] We offer a 14-day free trial with no credit card required for all plans.
  [3] Premium plan customers get 24/7 phone support and a dedicated account manager.
A: Our refund policy only allows returns within 30 days of purchase with the original receipt. Returns after 30 days are not covered by our policy.

Q: What is the API rate limit for Pro accounts?
Retrieved chunks:
  [1] The API rate limit is 1000 requests per minute for Pro accounts.
  [2] Premium plan customers get 24/7 phone support and a dedicated account manager.
  [3] Our refund policy allows returns within 30 days of purchase with original receipt.
A: The API rate limit for Pro accounts is 1000 requests per minute.

Q: Do I need a credit card for the trial?
Retrieved chunks:
  [1] We offer a 14-day free trial with no credit card required for all plans.
  [2] Our refund policy allows returns within 30 days of purchase with original receipt.
  [3] The API rate limit is 1000 requests per minute for Pro accounts.
A: No, you do not need a credit card to start the free trial. We offer a 14-day free trial with no credit card required for all plans.

Why RAG beats fine-tuning for knowledge-based Q&A:

Fine-tuning a model on your documents embeds knowledge into the model weights, which then becomes stale as your documents change. RAG always retrieves from the latest version of your document store, making it ideal for dynamic knowledge bases. RAG is also transparent - you can show users exactly which source chunks were retrieved to generate the answer, providing citations and explainability that fine-tuning cannot offer.

Send your comments, suggestions or queries regarding this site to [email protected].