|
|
Gemini Live API - Real-Time Voice Conversations
Author: Venkata Sudhakar
The Gemini Live API enables real-time bidirectional audio conversations with Gemini. Unlike the standard API where you wait for a complete response, the Live API works like a phone call - you stream audio in and Gemini streams audio responses back with sub-second latency. This powers a new class of business applications: voice-first customer service agents that feel like speaking to a real person, real-time meeting transcription and action item extraction, voice-controlled enterprise workflows, and hands-free agents for field workers who cannot use a keyboard. The Live API uses WebSockets for bidirectional streaming. You open a session, configure the model and voice, then continuously send audio chunks while receiving audio response chunks in parallel. The session maintains full conversation context throughout - the agent remembers what was said earlier in the same call. Gemini supports multiple natural-sounding voices (Aoede, Charon, Fenrir, Kore, Puck) and can call your business tools while speaking, turning a voice conversation into a voice-driven agent workflow. The below example demonstrates the Live API pattern: opening a session, configuring voice output, sending a text message (simulating audio input), receiving the streamed audio response, and enabling tool calling within the voice session for a customer service use case.
It gives the following output showing the real-time voice session flow,
Opening Live API voice session...
Session open. Sending customer query...
Receiving voice response...
Tool call: get_order_status {"order_id": "ORD-88421"}
Audio chunk received: 4096 bytes
Audio chunk received: 4096 bytes
Audio chunk received: 3812 bytes
Transcript: Hi there! Great news - your order ORD-88421 is currently
out for delivery and should arrive today by 7pm. Is there
anything else I can help you with?
Total audio received: 12004 bytes across 3 chunks
# Audio chunks arrive in real-time as Gemini speaks
# Tool call happened mid-response to look up real order data
# In production: pipe audio_chunks to speakers for a natural phone call feel
# Sub-second latency: first audio chunk typically arrives within 300-500ms
Text mode Live API streams responses word by word,
Our return policy allows returns within 7 days of delivery for unused items
in original packaging. Large appliances get free pickup; smaller items can
be dropped at any ShopMax store. Refunds process in 3-5 business days.
# TEXT mode: same WebSocket session, tokens stream in real time
# Use TEXT mode for web chat, AUDIO mode for phone/voice applications
# Both support tool calling and session-level conversation memory
Live API production patterns: in a web application, open the WebSocket from the browser using the Gemini JavaScript SDK and stream microphone audio directly - no server round-trip needed. For a telephony integration (IVR system), connect the Live API WebSocket to your Twilio or Exotel media stream. For field worker applications, deploy a lightweight client on Android or iOS that streams audio to the Live API and plays the audio response through the device speaker. Always handle network interruptions gracefully - implement reconnection logic with exponential backoff and resume from the last conversation state. Use session_resumption in the config to enable seamless reconnection for long-running voice sessions.
|
|