Gemini Multimodal - Analysing Images and Text Together
Author: Venkata Sudhakar
Gemini was built to be multimodal from the ground up: it processes text, images, audio, and video natively in a single model call, without needing separate vision models. For businesses, this unlocks workflows that previously required expensive custom computer vision. A store manager photographs a shelf and Gemini identifies which products are out of stock; a supplier emails an invoice photo and Gemini extracts all line items into a structured record; a quality control team uploads a product photo and Gemini flags defects. Each of these is a single API call with the image passed alongside the text question.
Images are passed to Gemini as Part objects alongside text Part objects in the contents list. You can provide images as raw bytes (from a file), as a URL, or as a base64 string. Gemini 2.0 Flash handles images well and is the recommended model for image tasks where speed and cost matter. For highly detailed analysis of complex technical images or documents with dense text, Gemini 2.5 Pro produces more thorough descriptions. You can include multiple images in a single request (this article works with batches of up to 16; the exact limit depends on the model), which makes batch comparison tasks straightforward.
The following example covers two retail image tasks: analysing a supermarket shelf photo to identify stock gaps and planogram violations, and reading a handwritten delivery receipt to extract structured data for the warehouse system.
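A minimal sketch of the shelf task, assuming the google-genai Python SDK (`pip install google-genai`) and an API key in the `GEMINI_API_KEY` environment variable. The prompt wording, the helper names `guess_image_mime` and `analyse_shelf`, and the file name `shelf.jpg` are illustrative assumptions, not the author's original listing.

```python
import mimetypes
import os
from pathlib import Path

# Hypothetical prompt; adjust the wording to your own audit checklist.
SHELF_PROMPT = (
    "You are auditing a supermarket shelf photo. Report: "
    "1) out-of-stock positions, 2) misplaced products, "
    "3) priority restocking actions."
)


def guess_image_mime(path: str) -> str:
    """Guess the image MIME type from the file extension, defaulting to JPEG."""
    mime, _ = mimetypes.guess_type(path)
    return mime if mime and mime.startswith("image/") else "image/jpeg"


def analyse_shelf(image_path: str, model: str = "gemini-2.0-flash") -> str:
    """Send one shelf photo and the text prompt in a single model call."""
    # Imported lazily so the helper above works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    # The image travels as a Part built from raw bytes, next to the text prompt.
    image_part = types.Part.from_bytes(
        data=Path(image_path).read_bytes(),
        mime_type=guess_image_mime(image_path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[image_part, SHELF_PROMPT],
    )
    return response.text
```

With a photo on disk, `print(analyse_shelf("shelf.jpg"))` would print the model's report; the exact wording of the report varies from run to run.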
The shelf analysis gives the following output:
=== Shelf Analysis ===
SHELF ANALYSIS REPORT
1) OUT OF STOCK POSITIONS:
- Middle shelf, left section: 3 consecutive empty facings where Maggi
Masala Noodles should be (visible price tag with no product)
- Bottom shelf, right: Aashirvaad Atta 5kg bay is completely empty
2) MISPLACED PRODUCTS:
- Top shelf centre: A packet of Parle-G biscuits is placed in what
appears to be the breakfast cereal section based on surrounding products
3) PRIORITY RESTOCKING ACTIONS:
1. URGENT: Refill Aashirvaad Atta 5kg - full bay empty, high-velocity SKU
2. HIGH: Replenish Maggi Masala Noodles - 3 facings empty
3. LOW: Relocate Parle-G biscuits to correct aisle
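The receipt task can be sketched in the same way, this time asking the model for JSON so the reply can be loaded straight into a record. This again assumes the google-genai SDK and `GEMINI_API_KEY`; the `DeliveryReceipt` fields, the prompt, and the function names are hypothetical choices made for illustration.

```python
import json
import os
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical prompt describing the JSON keys we expect back.
RECEIPT_PROMPT = (
    "Read this delivery receipt and return JSON with the keys "
    "supplier, date (YYYY-MM-DD), invoice, items (list of strings), received_by."
)


@dataclass
class DeliveryReceipt:
    supplier: str
    date: str
    invoice: str
    items: list[str] = field(default_factory=list)
    received_by: str = ""


def parse_receipt(raw: str) -> DeliveryReceipt:
    """Turn the model's JSON reply into a DeliveryReceipt; missing keys become empty."""
    data = json.loads(raw)
    return DeliveryReceipt(
        supplier=str(data.get("supplier", "")),
        date=str(data.get("date", "")),
        invoice=str(data.get("invoice", "")),
        items=[str(item) for item in data.get("items", [])],
        received_by=str(data.get("received_by", "")),
    )


def extract_receipt(image_path: str, model: str = "gemini-2.0-flash") -> DeliveryReceipt:
    """Send a receipt photo and request a JSON reply in one model call."""
    # Imported lazily so parse_receipt can be used without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model=model,
        contents=[
            # Assumes a JPEG photo; change mime_type for other formats.
            types.Part.from_bytes(
                data=Path(image_path).read_bytes(), mime_type="image/jpeg"
            ),
            RECEIPT_PROMPT,
        ],
        # Ask for a JSON reply rather than free text.
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return parse_receipt(response.text)
```

Keeping `parse_receipt` separate from the API call makes it easy to validate stored replies, and to reject a receipt before it reaches the warehouse system if a field comes back empty.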
The receipt extraction gives the following output:
=== Delivery Receipt Extracted ===
Supplier: Fresh Farms Distributors Pvt Ltd
Date: 2025-03-28
Invoice: FF-2025-08821
Items:
- 50 crates Tomatoes (Grade A)
- 30 kg Fresh Spinach
- 20 dozen Bananas
- 15 kg Green Capsicum
Received by: Ramesh Kumar (Warehouse In-charge)
# Handwritten receipt converted to structured data in one API call
# No OCR pipeline, no template matching, no custom model training needed
# Works with printed receipts, handwritten notes, and mixed formats
Gemini multimodal works well for: reading receipts and invoices photographed on mobile phones, checking product packaging quality from factory line cameras, analysing customer-submitted damage photos for insurance or returns, extracting data from scanned forms and handwritten documents, and verifying that store displays match planogram specifications. For high-volume image processing (thousands of images per day), batch the images into groups of up to 16 per API call and run the calls in parallel using asyncio. With Gemini Flash, this approach can reach roughly 50,000 images per hour on a single machine, subject to your API rate limits and image sizes.
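The batching pattern described above can be sketched with a chunking helper and a semaphore that caps the number of calls in flight. The model call is passed in as a parameter (`call_model`, a hypothetical name) so the sketch stays independent of any SDK; the concurrency limit of 8 is an arbitrary assumption to tune against your own quota.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

MAX_IMAGES_PER_CALL = 16  # batch size used in this article; check your model's limit


def chunk(items: Sequence[str], size: int = MAX_IMAGES_PER_CALL) -> list[list[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


async def process_all(
    image_paths: Sequence[str],
    call_model: Callable[[list[str]], Awaitable[str]],
    max_concurrency: int = 8,
) -> list[str]:
    """Run one model call per batch, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_batch(batch: list[str]) -> str:
        async with semaphore:
            return await call_model(batch)

    # gather returns results in the same order as the batches were submitted.
    return await asyncio.gather(*(run_batch(b) for b in chunk(image_paths)))
```

In practice `call_model` would wrap an asynchronous request that attaches every image in the batch, for example through the google-genai SDK's `client.aio.models.generate_content`; that wiring is left out here and is an assumption about how you would connect the two.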