Gemini Live API - Real-Time Voice Conversations

Author: Venkata Sudhakar

The Gemini Live API enables real-time, bidirectional audio conversations with Gemini. Unlike the standard API, where you wait for a complete response, the Live API works like a phone call: you stream audio in, and Gemini streams audio back with sub-second latency. This makes several kinds of business application practical: voice-first customer service agents that feel like speaking to a real person, real-time meeting transcription with action item extraction, voice-controlled enterprise workflows, and hands-free agents for field workers who cannot use a keyboard.

The Live API uses WebSockets for bidirectional streaming. You open a session, configure the model and voice, then continuously send audio chunks while receiving audio response chunks in parallel. The session maintains full conversation context throughout - the agent remembers what was said earlier in the same call. Gemini supports multiple natural-sounding voices (Aoede, Charon, Fenrir, Kore, Puck) and can call your business tools while speaking, turning a voice conversation into a voice-driven agent workflow.
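
The "send while receiving" shape described above can be sketched with plain asyncio before any real API is involved. This is only an illustration under stated assumptions: `EchoSession`, `send_audio`, `end`, `stream_microphone`, `play_responses` and `run_call` are hypothetical names, not part of the Gemini SDK, and the stand-in session simply echoes each chunk back instead of producing speech.

```python
import asyncio


class EchoSession:
    """Hypothetical stand-in for a live session: echoes every chunk back."""

    def __init__(self):
        self._outbox = asyncio.Queue()

    async def send_audio(self, chunk: bytes) -> None:
        # A real session would write the chunk to the WebSocket here.
        await self._outbox.put(chunk)

    async def end(self) -> None:
        # None marks the end of the stream for the receiver.
        await self._outbox.put(None)

    async def receive(self):
        while (chunk := await self._outbox.get()) is not None:
            yield chunk


async def stream_microphone(session, chunks):
    """Send audio chunks one by one, yielding control between sends."""
    for chunk in chunks:
        await session.send_audio(chunk)
        await asyncio.sleep(0)  # let the receiver task run
    await session.end()


async def play_responses(session, played):
    """Collect response chunks as they arrive (a real client would play them)."""
    async for chunk in session.receive():
        played.append(chunk)


async def run_call(chunks):
    session = EchoSession()
    played = []
    # Sender and receiver run concurrently over the same session.
    await asyncio.gather(
        stream_microphone(session, chunks),
        play_responses(session, played),
    )
    return played
```

Running the sender and receiver as two concurrent tasks is what lets response audio start arriving before the caller has finished sending.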

The example below demonstrates the Live API pattern for a customer service use case: opening a session, configuring voice output, sending a text message (simulating audio input), receiving the streamed audio response, and handling tool calls within the voice session.

import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="your-gemini-api-key")

# Business tools available to the voice agent
def get_order_status(order_id: str) -> dict:
    orders = {
        "ORD-88421": {"status": "Out for delivery", "eta": "today by 7pm"},
        "ORD-55987": {"status": "Delivered on 30 March 2025"}
    }
    return orders.get(order_id.upper(), {"error": "Order not found"})

# Live API session configuration
LIVE_CONFIG = types.LiveConnectConfig(
    response_modalities=["AUDIO"],       # respond with audio
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                voice_name="Aoede"       # warm natural voice
            )
        )
    ),
    system_instruction=(
        "You are Priya, a friendly ShopMax India customer service voice agent. "
        "Answer order queries clearly and concisely. "
        "Use get_order_status to look up real order data before answering. "
        "Speak naturally as if in a phone conversation - no bullet points or lists."
    ),
    tools=[types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_order_status",
            description="Look up delivery status for a customer order by order ID",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "order_id": types.Schema(
                        type=types.Type.STRING,
                        description="The order ID e.g. ORD-88421"
                    )
                },
                required=["order_id"]
            )
        )
    ])]
)

async def voice_session_demo():
    print("Opening Live API voice session...")
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=LIVE_CONFIG
    ) as session:
        print("Session open. Sending customer query...")

        # In production: send audio chunks from microphone
        # Here we send text to simulate voice input
        await session.send(
            input="Hi, can you check where my order ORD-88421 is?",
            end_of_turn=True
        )

        # Receive and process the streaming response
        print("Receiving voice response...")
        audio_chunks = []
        async for response in session.receive():
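            # model_turn can be absent (e.g. on a turn_complete-only message)
            if response.server_content and response.server_content.model_turn:
                for part in response.server_content.model_turn.parts: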
                    if hasattr(part, "inline_data") and part.inline_data:
                        # In production: play this audio chunk through speakers
                        audio_chunks.append(len(part.inline_data.data))
                        print("Audio chunk received:", len(part.inline_data.data), "bytes")
                    if hasattr(part, "text") and part.text:
                        print("Transcript:", part.text)
            if response.tool_call:
                # Agent is calling get_order_status
                for fc in response.tool_call.function_calls:
                    print("Tool call:", fc.name, fc.args)
                    result = get_order_status(**dict(fc.args))
                    await session.send(
                        input=types.LiveClientToolResponse(
                            function_responses=[types.FunctionResponse(
                                id=fc.id, name=fc.name, response=result
                            )]
                        )
                    )
            if response.server_content and response.server_content.turn_complete:
                break

    print("Total audio received:", sum(audio_chunks), "bytes across", len(audio_chunks), "chunks")

asyncio.run(voice_session_demo())

Running it produces output like the following, showing the real-time voice session flow:

Opening Live API voice session...
Session open. Sending customer query...
Receiving voice response...
Tool call: get_order_status {'order_id': 'ORD-88421'}
Audio chunk received: 4096 bytes
Audio chunk received: 4096 bytes
Audio chunk received: 3812 bytes
Transcript: Hi there! Great news - your order ORD-88421 is currently
            out for delivery and should arrive today by 7pm. Is there
            anything else I can help you with?
Total audio received: 12004 bytes across 3 chunks

# Audio chunks arrive in real-time as Gemini speaks
# Tool call happened mid-response to look up real order data
# In production: pipe audio_chunks to speakers for a natural phone call feel
# Sub-second latency: first audio chunk typically arrives within 300-500ms
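
To make the "pipe audio to speakers" step concrete, here is a minimal sketch that collects raw audio chunks and writes them to a WAV file using only the standard library. It assumes the response audio is 16-bit little-endian mono PCM at 24 kHz; check the Live API documentation for the format your model actually returns. `pcm_chunks_to_wav` is a hypothetical helper, and it expects the chunk bytes themselves (for example `part.inline_data.data`), not the chunk lengths that the demo above stores.

```python
import io
import wave

# Assumed output format: 16-bit little-endian mono PCM at 24 kHz.
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2
CHANNELS = 1


def pcm_chunks_to_wav(chunks, path_or_file):
    """Write raw PCM chunks to a WAV file and return the duration in seconds."""
    pcm = b"".join(chunks)
    with wave.open(path_or_file, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

Under that assumed format, the 12004 bytes in the sample output would correspond to roughly a quarter of a second of audio, so a real spoken reply would arrive as many more chunks than shown.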

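In text mode (response_modalities=["TEXT"]), the Live API streams the response as text, piece by piece. A return-policy question, for example, produces output like this: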

Our return policy allows returns within 7 days of delivery for unused items
in original packaging. Large appliances get free pickup; smaller items can
be dropped at any ShopMax store. Refunds process in 3-5 business days.

# TEXT mode: same WebSocket session, tokens stream in real time
# Use TEXT mode for web chat, AUDIO mode for phone/voice applications
# Both support tool calling and session-level conversation memory

Live API production patterns: in a web application, open the WebSocket from the browser using the Gemini JavaScript SDK and stream microphone audio directly, with no round-trip through your own server. For a telephony integration (IVR system), connect the Live API WebSocket to your Twilio or Exotel media stream. For field worker applications, deploy a lightweight client on Android or iOS that streams audio to the Live API and plays the audio response through the device speaker. In every case, handle network interruptions gracefully: implement reconnection logic with exponential backoff and resume from the last conversation state. Use session_resumption in the config to enable seamless reconnection for long-running voice sessions.
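
The reconnection advice can be sketched as a small retry helper. This is a minimal illustration rather than the SDK's own mechanism: `connect_with_backoff` and the `connect` callable are hypothetical names, the delay values are arbitrary, and in a real client `connect` would open the Live API session (passing along whatever resumption state you saved).

```python
import asyncio
import random


async def connect_with_backoff(
    connect,                 # hypothetical: async callable that opens a session
    max_attempts=5,
    base_delay=0.5,          # seconds before the first retry (arbitrary)
    max_delay=8.0,           # cap on the exponential part of the delay
    sleep=asyncio.sleep,     # injectable so tests need not really wait
):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% random jitter so many clients do not retry in lockstep.
            delay += random.uniform(0, delay / 2)
            await sleep(delay)
```

A caller would wrap its session-opening coroutine, for example `session = await connect_with_backoff(open_session)` where `open_session` is your own function. Only `ConnectionError` is retried here; which exceptions a dropped Live API connection raises depends on the SDK and transport, so adjust the `except` clause accordingly.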
