Building a Production RAG API with FastAPI and ChromaDB

Author: Venkata Sudhakar

Building a RAG REST API with FastAPI gives ShopMax India a production-ready endpoint that their mobile app, website, and internal tools can all call for product Q&A. Rather than embedding the RAG pipeline inside each client application, a central FastAPI service manages the document index, handles retrieval, and calls the LLM, so all clients use the same retrieval logic and upgrades happen in one place. The API accepts a query string and returns a structured JSON response with the answer and source documents.

A well-designed RAG API typically exposes three endpoints: POST /query (submit a question and get an answer), GET /health (check that the service and its dependencies are up), and POST /index (add new documents to the knowledge base). FastAPI's async support lets async def handlers serve concurrent queries without blocking the event loop, while plain def handlers run in a thread pool. Pydantic models enforce the request and response schemas. ChromaDB or a similar vector store runs as a dependency, and the LLM client is initialized once at startup and reused across requests.
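
As a rough illustration of the state behind POST /index, the sketch below keeps documents in a lock-protected list and skips duplicates. The class name DocumentStore and its methods are hypothetical (they are not part of FastAPI or ChromaDB), and a real service would also rebuild or update its retrieval index after each add.

```python
import threading
from typing import Iterable, List


class DocumentStore:
    """Hypothetical in-memory document list that is safe to update from several threads."""

    def __init__(self, docs: Iterable[str] = ()) -> None:
        self._lock = threading.Lock()
        self._docs: List[str] = []
        self._seen = set()
        self.add(docs)

    def add(self, docs: Iterable[str]) -> int:
        """Append new, non-empty documents and return how many were added."""
        added = 0
        with self._lock:
            for doc in docs:
                text = doc.strip()
                if text and text not in self._seen:
                    self._seen.add(text)
                    self._docs.append(text)
                    added += 1
        return added

    def snapshot(self) -> List[str]:
        """Return a copy so callers can read without holding the lock."""
        with self._lock:
            return list(self._docs)


store = DocumentStore(["Sony WH-1000XM5: 30-hour battery."])
# The second document is a duplicate, so only the Dell entry is added.
added = store.add(["Dell XPS 15: 32GB RAM.", "Sony WH-1000XM5: 30-hour battery."])
print(added, len(store.snapshot()))  # 1 2
```

A POST /index handler could call store.add(...) and return the count, while GET /health could report len(store.snapshot()).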

The following example builds a compact FastAPI RAG service for ShopMax India. The API exposes a /query endpoint backed by BM25 retrieval and Claude for answer generation, with a health check and structured JSON responses. To stay short, it keeps the index in memory with BM25 rather than ChromaDB and leaves out POST /index.

import re
from typing import List

import anthropic
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from rank_bm25 import BM25Okapi

app = FastAPI(title="ShopMax India RAG API")
# Created once at import time and reused across requests.
# With no api_key argument, the SDK reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

product_docs = [
    "Sony WH-1000XM5: 30-hour battery, noise-cancelling, Rs 29990, Mumbai and Bangalore.",
    "Samsung Galaxy S24 Ultra: 200MP camera, 12GB RAM, Rs 134999, pan-India.",
    "Dell XPS 15: 32GB RAM, 1TB SSD, Rs 135000, Delhi and Mumbai.",
    "Apple iPhone 15 Pro: 48MP camera, titanium, Rs 134900, pan-India."
]

def tokenize(text: str) -> List[str]:
    # Lowercase and keep alphanumeric runs only, so "Mumbai." and "Mumbai?" both become "mumbai".
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the in-memory BM25 index once at startup.
bm25 = BM25Okapi([tokenize(doc) for doc in product_docs])

class QueryRequest(BaseModel):
    question: str
    top_k: int = Field(default=3, ge=1)  # reject zero or negative values

class QueryResponse(BaseModel):
    answer: str
    sources: List[str]
    query: str

class HealthResponse(BaseModel):
    status: str
    docs_indexed: int

@app.get("/health", response_model=HealthResponse)
def health():
    return {"status": "ok", "docs_indexed": len(product_docs)}

@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest):
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    # Score every document against the question and keep the top_k highest.
    scores = bm25.get_scores(tokenize(request.question))
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:request.top_k]
    sources = [product_docs[i] for i in idx]
    context = "\n".join(sources)
    try:
        msg = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=200,
            system="You are the ShopMax India assistant. Answer using only the provided context.",
            messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {request.question}"}]
        )
    except anthropic.APIError as exc:
        # Report upstream LLM failures as a gateway error rather than an unhandled 500.
        raise HTTPException(status_code=502, detail="LLM request failed") from exc
    return QueryResponse(
        answer=msg.content[0].text,
        sources=sources,
        query=request.question
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

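A sample exchange looks like the following. The answer text is generated by the model and will vary between runs, and with the default top_k of 3 the sources array holds three entries rather than the single one shown.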

# GET /health
{"status": "ok", "docs_indexed": 4}

# POST /query
# Request: {"question": "Which headphones are available in Mumbai?"}
{
  "answer": "The Sony WH-1000XM5 headphones are available in Mumbai at Rs 29,990 with 30-hour battery life and noise-cancelling.",
  "sources": [
    "Sony WH-1000XM5: 30-hour battery, noise-cancelling, Rs 29990, Mumbai and Bangalore."
  ],
  "query": "Which headphones are available in Mumbai?"
}
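
BM25 only matches exact tokens, so the tokenizer decides what a query can hit. The sketch below compares plain whitespace splitting with a regex tokenizer on the sample question. The helper names are illustrative, and the regex is one reasonable choice, not something rank_bm25 requires.

```python
import re
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    # Punctuation stays attached to words: "Mumbai." -> "mumbai."
    return text.lower().split()


def regex_tokens(text: str) -> List[str]:
    # Keep alphanumeric runs only: "Mumbai." -> "mumbai"
    return re.findall(r"[a-z0-9]+", text.lower())


doc = "Dell XPS 15: 32GB RAM, 1TB SSD, Rs 135000, Delhi and Mumbai."
question = "Which headphones are available in Mumbai?"

ws_overlap = set(whitespace_tokens(doc)) & set(whitespace_tokens(question))
rx_overlap = set(regex_tokens(doc)) & set(regex_tokens(question))

print(ws_overlap)  # set(): "mumbai." and "mumbai?" never match
print(rx_overlap)  # {'mumbai'}
```

Whichever tokenizer is used, the same function has to be applied to both the documents and the incoming question, otherwise the two sides will not line up.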

For ShopMax India in production, add request authentication using FastAPI's HTTPBearer dependency so that only authorized clients can query the API, and add rate limiting with slowapi to prevent abuse. Use async LLM calls (anthropic.AsyncAnthropic) so concurrent requests do not block each other. rank_bm25 has no async API, so either keep BM25 scoring synchronous, since it is fast on a small corpus, or offload it to a worker thread. Deploy on Cloud Run or Kubernetes with horizontal scaling. The service is stateless as long as the index is loaded at startup and treated as read-only, so any replica can serve any request; once POST /index changes the index at runtime, keep it in a shared store such as a ChromaDB server so replicas stay consistent. Log all queries, retrieved sources, and response times to BigQuery for monitoring and continuous improvement.
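
Two of these points, keeping blocking retrieval off the event loop and logging each query with its sources and latency, can be sketched with the standard library alone. Everything here is an assumption made for illustration: retrieve is a stand-in for BM25 scoring, answer_query is a hypothetical handler body, and the LLM call is left out so the sketch runs without network access.

```python
import asyncio
import json
import logging
import time
from typing import Dict, List

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("rag.queries")

DOCS = [
    "Sony WH-1000XM5: 30-hour battery, noise-cancelling, Rs 29990, Mumbai and Bangalore.",
    "Dell XPS 15: 32GB RAM, 1TB SSD, Rs 135000, Delhi and Mumbai.",
]


def retrieve(question: str, top_k: int) -> List[str]:
    """Stand-in for blocking, CPU-bound retrieval such as BM25 scoring."""
    words = set(question.lower().replace("?", "").split())
    # Crude substring match count; a real service would use its BM25 index here.
    ranked = sorted(DOCS, key=lambda d: -sum(w in d.lower() for w in words))
    return ranked[:top_k]


async def answer_query(question: str, top_k: int = 1) -> Dict[str, object]:
    start = time.perf_counter()
    # Run the blocking call in a worker thread so the event loop stays free.
    sources = await asyncio.to_thread(retrieve, question, top_k)
    latency_ms = round((time.perf_counter() - start) * 1000, 2)
    record = {"query": question, "sources": sources, "latency_ms": latency_ms}
    # One JSON object per line is easy to forward to a log sink later.
    logger.info(json.dumps(record))
    return record


async def main() -> List[Dict[str, object]]:
    questions = ["Which laptop has 32GB RAM?", "Which headphones have noise-cancelling?"]
    # gather() lets both queries make progress concurrently and keeps input order.
    return await asyncio.gather(*(answer_query(q) for q in questions))


results = asyncio.run(main())
```

In the real service, an awaited anthropic.AsyncAnthropic call would sit between retrieval and logging. Getting the JSON lines into BigQuery is a deployment concern (for example through a log sink) and is not shown here. asyncio.to_thread requires Python 3.9 or later.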
