Gemini Multimodal - Analysing Images and Text Together

Author: Venkata Sudhakar

Gemini was built to be multimodal from the ground up: it processes text, images, audio, and video natively in a single model call, without needing separate vision models. For businesses, this unlocks workflows that previously required expensive custom computer vision. A store manager photographs a shelf and Gemini identifies which products are out of stock; a supplier emails an invoice photo and Gemini extracts the line items into a structured record; a quality control team uploads a product photo and Gemini flags defects. Each of these is a single API call with the image passed alongside the text question.

Images are passed to Gemini as Part objects alongside text Part objects in the contents list. You can supply an image from raw bytes read from a file, from a URL whose content you download, or from a base64 string that you decode to bytes. Gemini 2.0 Flash handles images well and is a good default for image tasks where speed and cost matter. For highly detailed analysis of complex technical images or documents with dense text, Gemini 2.5 Pro tends to produce more thorough descriptions. You can include multiple images in a single request, which makes batch comparison tasks straightforward; per-request limits vary by model, so check the current documentation.
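Whichever source an image comes from, it ends up as bytes plus a MIME type. The following is a minimal sketch of two helpers for that step, using only the standard library. The names `IMAGE_MIME_TYPES`, `guess_image_mime_type`, and `decode_base64_image` are hypothetical and not part of the Gemini SDK; the extension table is an assumption you should extend for your own formats.

```python
import base64
from urllib.parse import urlparse

# Hypothetical lookup table (not from the SDK); extend it as needed.
IMAGE_MIME_TYPES = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "png": "image/png",
    "webp": "image/webp",
}


def guess_image_mime_type(path_or_url: str, default: str = "image/jpeg") -> str:
    """Guess a MIME type from the file extension, ignoring any URL query string."""
    path = urlparse(path_or_url).path  # drops "?width=800" and "#fragment"
    _, dot, ext = path.rpartition(".")
    if not dot:
        return default  # no extension at all
    return IMAGE_MIME_TYPES.get(ext.lower(), default)


def decode_base64_image(encoded: str) -> bytes:
    """Decode a base64 string, accepting an optional 'data:...;base64,' prefix."""
    if encoded.startswith("data:"):
        encoded = encoded.split(",", 1)[1]
    return base64.b64decode(encoded)
```

The bytes and MIME type produced this way can then be handed to `types.Part.from_bytes(data=..., mime_type=...)`, the same call the article's loaders use.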

The example below shows two retail image tasks: analysing a supermarket shelf photo to identify stock gaps and planogram violations, and reading a handwritten delivery receipt to extract structured data for the warehouse system.

# pip install google-genai
from google import genai
from google.genai import types
import httpx

client = genai.Client(api_key="your-gemini-api-key")

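def load_image_from_url(url: str) -> types.Part:
    # Download the image and fail loudly on HTTP errors (404, 500, ...)
    response = httpx.get(url, follow_redirects=True)
    response.raise_for_status()
    # Infer the MIME type from the extension, ignoring any query string
    ext = url.split("?")[0].split(".")[-1].lower()
    media_type = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
                  "png": "image/png", "webp": "image/webp"}.get(ext, "image/jpeg")
    return types.Part.from_bytes(data=response.content, mime_type=media_type)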

def load_image_from_file(path: str) -> types.Part:
    # Read the raw image bytes from disk
    with open(path, "rb") as f:
        image_bytes = f.read()
    # Infer the MIME type from the extension (defaults to JPEG)
    ext = path.split(".")[-1].lower()
    media_type = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
                  "png": "image/png", "webp": "image/webp"}.get(ext, "image/jpeg")
    return types.Part.from_bytes(data=image_bytes, mime_type=media_type)

# Task 1: Retail shelf analysis
shelf_image = load_image_from_file("shelf_photo.jpg")

shelf_response = client.models.generate_content(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=(
            "You are a retail operations analyst. Analyse shelf photos for "
            "a supermarket chain. Identify out-of-stock positions, misplaced "
            "products, and facing violations. Be specific about shelf level "
            "and product position (left/centre/right)."
        ),
        max_output_tokens=400
    ),
    contents=[
        shelf_image,
        # from_text takes the prompt as a keyword argument
        types.Part.from_text(
            text="Analyse this shelf photo and provide: "
                 "1) Out of stock positions "
                 "2) Any products in wrong location "
                 "3) Priority restocking actions"
        )
    ]
)
print("=== Shelf Analysis ===")
print(shelf_response.text)

It gives output like the following:

=== Shelf Analysis ===
SHELF ANALYSIS REPORT

1) OUT OF STOCK POSITIONS:
   - Middle shelf, left section: 3 consecutive empty facings where Maggi
     Masala Noodles should be (visible price tag with no product)
   - Bottom shelf, right: Aashirvaad Atta 5kg bay is completely empty

2) MISPLACED PRODUCTS:
   - Top shelf centre: A packet of Parle-G biscuits is placed in what
     appears to be the breakfast cereal section based on surrounding products

3) PRIORITY RESTOCKING ACTIONS:
   1. URGENT: Refill Aashirvaad Atta 5kg - full bay empty, high-velocity SKU
   2. HIGH:   Replenish Maggi Masala Noodles - 3 facings empty
   3. LOW:    Relocate Parle-G biscuits to correct aisle

# Task 2: Extract structured data from a handwritten delivery receipt photo
receipt_image = load_image_from_file("delivery_receipt.jpg")

receipt_response = client.models.generate_content(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=(
            "Extract delivery receipt data accurately. "
            "Return only valid JSON, no markdown formatting."
        ),
        max_output_tokens=300,
        temperature=0.0
    ),
    contents=[
        receipt_image,
        # from_text takes the prompt as a keyword argument
        types.Part.from_text(
            text="Extract all information from this delivery receipt. "
                 "Return JSON with: supplier_name, delivery_date (YYYY-MM-DD), "
                 "invoice_number, items (list of {name, quantity, unit}), "
                 "received_by, remarks"
        )
    ]
)

import json
try:
    data = json.loads(receipt_response.text)
    print("=== Delivery Receipt Extracted ===")
    print("Supplier:", data.get("supplier_name"))
    print("Date:    ", data.get("delivery_date"))
    print("Invoice: ", data.get("invoice_number"))
    print("Items:")
    for item in data.get("items", []):
        print(" -", item.get("quantity"), item.get("unit"), item.get("name"))
    print("Received by:", data.get("received_by"))
except json.JSONDecodeError:
    print("Raw response:", receipt_response.text)

It gives output like the following:

=== Delivery Receipt Extracted ===
Supplier: Fresh Farms Distributors Pvt Ltd
Date:     2025-03-28
Invoice:  FF-2025-08821
Items:
 - 50 crates Tomatoes (Grade A)
 - 30 kg Fresh Spinach
 - 20 dozen Bananas
 - 15 kg Green Capsicum
Received by: Ramesh Kumar (Warehouse In-charge)

# Handwritten receipt converted to structured data in one API call
# No OCR pipeline, no template matching, no custom model training needed
# Works with printed receipts, handwritten notes, and mixed formats
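The receipt example falls back to printing the raw response when `json.loads` fails. One common cause is the model wrapping its JSON in a Markdown code fence despite the instruction not to. The sketch below shows one way to tolerate that and to check that the requested keys are present. `strip_code_fence`, `parse_receipt_json`, and `REQUIRED_KEYS` are hypothetical names introduced for illustration; the key list is taken from the prompt in the receipt task.

```python
import json

FENCE = "`" * 3  # three backticks, i.e. a Markdown code fence marker

# Keys requested in the receipt prompt; treating all of them as required
# is an assumption made for this sketch.
REQUIRED_KEYS = ("supplier_name", "delivery_date", "invoice_number",
                 "items", "received_by", "remarks")


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence (with optional language tag)."""
    cleaned = text.strip()
    if (len(cleaned) >= 2 * len(FENCE)
            and cleaned.startswith(FENCE) and cleaned.endswith(FENCE)):
        body = cleaned[len(FENCE):-len(FENCE)]
        first_line, newline, rest = body.partition("\n")
        tag = first_line.strip()
        # Drop the opening line if it is empty or a language tag such as "json"
        if newline and (tag == "" or tag.isalpha()):
            body = rest
        cleaned = body.strip()
    return cleaned


def parse_receipt_json(text: str) -> dict:
    """Parse the model's reply into a dict, raising ValueError if it is unusable."""
    data = json.loads(strip_code_fence(text))  # JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

With this in place, the `try` block in the receipt task could call `parse_receipt_json(receipt_response.text)` and catch `ValueError` for both malformed JSON and incomplete records.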

Gemini multimodal works well for reading receipts and invoices photographed on mobile phones, checking product packaging quality from factory line cameras, analysing customer-submitted damage photos for insurance or returns, extracting data from scanned forms and handwritten documents, and verifying that store displays match planogram specifications. For high-volume image processing (thousands of images per day), group several images into each API call and run the calls in parallel using asyncio. Throughput then depends mainly on your API rate limits rather than on the local machine; a single machine can reach roughly 50,000 images per hour at Gemini Flash pricing, provided your quota allows it.
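The batching and parallelism described above can be sketched as follows. This is a minimal illustration, not a production pipeline: `chunked`, `process_in_parallel`, and `fake_analyse_batch` are hypothetical names, the batch size and concurrency limit are arbitrary defaults, and the stub coroutine stands in for a real Gemini request (which, as an assumption to verify against your SDK version, would go through the SDK's async client).

```python
import asyncio
from typing import Awaitable, Callable, Sequence


def chunked(items: Sequence[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


async def process_in_parallel(
    image_paths: Sequence[str],
    analyse_batch: Callable[[list[str]], Awaitable[str]],
    batch_size: int = 10,       # arbitrary default; tune to your model's limits
    max_concurrency: int = 8,   # arbitrary default; tune to your rate limits
) -> list[str]:
    """Run analyse_batch over all batches with a bounded number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch: list[str]) -> str:
        async with semaphore:  # at most max_concurrency calls at once
            return await analyse_batch(batch)

    batches = chunked(image_paths, batch_size)
    # gather preserves the order of the batches in its result list
    return await asyncio.gather(*(run_one(batch) for batch in batches))


async def fake_analyse_batch(batch: list[str]) -> str:
    """Stub standing in for a real API call that sends one batch of images."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"analysed {len(batch)} images"


if __name__ == "__main__":
    paths = [f"img_{i}.jpg" for i in range(40)]
    print(asyncio.run(process_in_parallel(paths, fake_analyse_batch)))
```

The semaphore keeps the number of in-flight requests below a fixed ceiling, which is usually the simplest way to stay within an API rate limit while still overlapping network waits.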
