Multimodal RAG - Retrieving Text and Image Content Together

Author: Venkata Sudhakar

Multimodal RAG extends retrieval pipelines to handle both text and image content, enabling ShopMax India to answer customer questions that require visual product information. When a customer asks 'does this TV have a slim bezel?' or 'show me the ports on the back of this laptop', text-only retrieval cannot help, because the answer lives in the product photos rather than in the spec sheet. Multimodal RAG indexes product images alongside their captions and retrieves both text specs and relevant images to ground the LLM's answer.

A common approach to multimodal RAG uses a CLIP-style model to embed both images and text into the same vector space, so a text query can retrieve images directly (cross-modal retrieval). Alternatively, product images can be captioned at index time using a vision LLM, and the captions stored as text embeddings. At query time, the matching captions are retrieved and the corresponding images are passed to the multimodal LLM for answer generation. The caption-based approach works with any text vector store and does not require a dedicated multimodal index.
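To make the shared-vector-space idea concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. It does not call a real CLIP model: the three-dimensional vectors, the `image_vectors` table, and the `retrieve_images` helper are all hypothetical stand-ins, assuming that some CLIP-style encoder has already produced the image and query embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical, hand-written vectors standing in for CLIP image embeddings.
# A real encoder would produce hundreds of dimensions per image.
image_vectors = {
    "samsung_tv_front.jpg": [0.9, 0.1, 0.0],
    "samsung_tv_ports.jpg": [0.2, 0.9, 0.1],
    "dell_xps_ports.jpg": [0.0, 0.6, 0.8],
}

def retrieve_images(query_vector, top_k=1):
    # Rank image paths by similarity to the query embedding, best first.
    ranked = sorted(
        image_vectors,
        key=lambda path: cosine(query_vector, image_vectors[path]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical text embedding for a query such as "slim bezel TV".
query_vector = [0.8, 0.2, 0.1]
print(retrieve_images(query_vector))
```

Because the query and the images live in the same space, no captions are needed at retrieval time. The trade-off is that the index must store image embeddings from the same encoder that embeds the queries.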

The following example demonstrates the caption-based multimodal RAG pattern for ShopMax India. Images are captioned upfront and stored with metadata. To keep the example small, captions are matched against the query with BM25 keyword scoring, which stands in for a text vector store. The retrieved image bytes and the query are then sent to Claude for a visually grounded answer.

import anthropic
import base64
from rank_bm25 import BM25Okapi

client = anthropic.Anthropic(api_key="sk-ant-...")

image_index = [
    {
        "product": "Samsung 55-inch QLED TV",
        "image_path": "samsung_tv_front.jpg",
        "caption": "Samsung 55-inch QLED TV front view showing ultra-slim 8mm bezel design, 4K display, and minimal frame."
    },
    {
        "product": "Samsung 55-inch QLED TV",
        "image_path": "samsung_tv_ports.jpg",
        "caption": "Samsung 55-inch QLED TV rear panel showing 4 HDMI 2.1 ports, 2 USB 3.0 ports, optical audio, and ethernet."
    },
    {
        "product": "Dell XPS 15 Laptop",
        "image_path": "dell_xps_ports.jpg",
        "caption": "Dell XPS 15 side panel with 2 Thunderbolt 4 ports, USB-C, SD card slot, and headphone jack."
    }
]

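def tokenize(text):
    # Lowercase and split on non-alphanumeric characters so that
    # "design?" and "ports," match "design" and "ports".
    return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()

# Build a BM25 keyword index over the captions.
captions = [item["caption"] for item in image_index]
bm25 = BM25Okapi([tokenize(c) for c in captions])

def load_image_base64(path):
    # The image block below carries the file contents as a base64 string.
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def multimodal_rag(query, top_k=1):
    # Rank captions against the query and keep the top_k matching images.
    scores = bm25.get_scores(tokenize(query))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = [image_index[i] for i in ranked[:top_k]]
    # Send each retrieved image, labelled with its product, then the question.
    # All sample images are assumed to be JPEG files.
    content = []
    for hit in hits:
        content.append({"type": "text", "text": f"Product: {hit['product']}"})
        content.append({"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": load_image_base64(hit["image_path"])}})
    content.append({"type": "text", "text": f"Question: {query}"})
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=[{"role": "user", "content": content}]
    )
    # Return the answer text and the product of the best-scoring image.
    return msg.content[0].text, hits[0]["product"]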

queries = [
    "Does the Samsung TV have a slim bezel design?",
    "What ports are available on the Dell XPS 15?"
]

for q in queries:
    answer, product = multimodal_rag(q)
    print(f"Q: {q}")
    print(f"Product: {product}")
    print(f"A: {answer}")
    print()

It gives output along the following lines (the exact wording depends on the model):

Q: Does the Samsung TV have a slim bezel design?
Product: Samsung 55-inch QLED TV
A: Yes, the Samsung 55-inch QLED TV features an ultra-slim 8mm bezel design with a minimal frame, giving the screen a near-borderless appearance.

Q: What ports are available on the Dell XPS 15?
Product: Dell XPS 15 Laptop
A: The Dell XPS 15 includes 2 Thunderbolt 4 ports, one USB-C port, an SD card slot, and a headphone jack on the side panel.

For ShopMax India at scale, automate the captioning step at product onboarding time: run each new product image through a vision model to generate structured captions covering design, ports, dimensions, and color. Store the captions in your existing text vector store alongside the spec documents so that both are retrieved in the same pipeline. For high-value products, maintain multiple image angles (front, rear, side, ports, packaging) as separate indexed entries so customers can ask about any visual aspect of the product.
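As a rough illustration of the onboarding step, the sketch below builds one index entry per image angle, in the same shape as the `image_index` entries used earlier plus an `angle` field. The `build_index_entries` helper, the `ANGLES` tuple, and the `caption_fn` hook are hypothetical names introduced here; `stub_caption` is only a placeholder for the point where a vision-model call would go.

```python
from typing import Callable, Dict, List

# Angles named in the text above; extend as needed.
ANGLES = ("front", "rear", "side", "ports", "packaging")

def build_index_entries(product: str, images: Dict[str, str],
                        caption_fn: Callable[[str, str, str], str]) -> List[dict]:
    """Return one index entry per (angle, image path) pair for a product."""
    entries = []
    for angle, path in images.items():
        if angle not in ANGLES:
            raise ValueError(f"unknown angle: {angle!r}")
        entries.append({
            "product": product,
            "angle": angle,
            "image_path": path,
            # caption_fn is a hypothetical hook for the vision-model call.
            "caption": caption_fn(product, angle, path),
        })
    return entries

def stub_caption(product: str, angle: str, path: str) -> str:
    # Placeholder only: a real pipeline would send the image at `path`
    # to a vision model and return its structured caption.
    return f"{product}, {angle} view (caption pending)"

entries = build_index_entries(
    "Samsung 55-inch QLED TV",
    {"front": "samsung_tv_front.jpg", "ports": "samsung_tv_ports.jpg"},
    stub_caption,
)
for entry in entries:
    print(entry["angle"], "->", entry["caption"])
```

Keeping each angle as its own entry means a question about ports retrieves the ports image rather than the front view of the same product.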
