|
|
RAG with Table Data - Extracting and Querying Tabular Content
Author: Venkata Sudhakar
RAG with table data extends retrieval pipelines beyond plain text to handle structured information like product pricing sheets, sales reports, and inventory tables. ShopMax India stores category-level pricing data in spreadsheets and database exports - these tabular sources contain precise numbers that LLMs must not hallucinate. Parsing tables into a retrievable format and injecting them with their row-column structure intact ensures the model can accurately answer questions like 'which laptop has the best price-to-RAM ratio under Rs 100000'.
The key challenge with tabular RAG is preserving row context during chunking. Naive text chunking splits a CSV row across chunks, losing the relationship between column headers and values. The recommended approach converts each row into a natural language sentence (e.g. 'Dell XPS 15: 32GB RAM, Rs 135000, in stock in Mumbai') and stores these as retrievable documents. For wide tables with many columns, a summary embedding plus the raw row stored as metadata gives the best retrieval accuracy.
The following example converts a ShopMax India product pricing table into sentence-format documents, retrieves relevant rows for a query, and uses Claude to answer questions about pricing and specifications from the structured data.
It gives the following output,
Q: Which laptops have 32GB RAM and what are their prices?
A: Two laptops have 32GB RAM: the Dell XPS 15 9530 at Rs 135,000 and the ASUS ROG Zephyrus at Rs 145,000.
Q: What is the cheapest laptop available in Delhi?
A: The Lenovo ThinkPad X1 is available in Delhi at Rs 98,000. The Acer Swift 5 is also available in Delhi at Rs 72,000, making it the cheapest option there.
Q: Which laptop has the most units in stock in Mumbai?
A: The Dell XPS 15 9530 has the most units in stock in Mumbai with 8 units available.
For ShopMax India at scale, keep the raw table rows in a structured database and regenerate sentence embeddings whenever prices or stock levels change. Do not embed prices as static text - they change daily. Instead, fetch the current row from the database after retrieval and inject it fresh into the context. For complex analytical queries like 'best price-per-GB laptop under Rs 100000', add a calculation step before the LLM call that computes derived metrics from the raw table data and appends them to the context.
|
|