Gemini Streaming Responses for Chat Apps

Author: Venkata Sudhakar

When a Gemini response takes 3-5 seconds to generate, showing a blank screen until completion makes your application feel broken. Streaming delivers the first words to the user in under a second, with the rest flowing in continuously. For any customer-facing chat application - a support bot, a sales assistant, a product advisor - streaming is the difference between a product that feels alive and one that feels slow. Gemini 2.0 Flash streaming is particularly fast, with first token times often under 300 milliseconds for short prompts.

In the google-genai SDK, streaming uses generate_content_stream() instead of generate_content(). It returns an iterator of response chunks, each with a text attribute containing the next piece of generated text. You print or yield each chunk as it arrives. Response metadata such as token counts (usage_metadata) and the finish reason travels on the chunks themselves, typically the final one, so keep a reference to the last chunk if you need that information after the stream ends. For web applications, combine the stream iterator with FastAPI StreamingResponse and Server-Sent Events (SSE) to push chunks to the browser in real time.
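The consumption pattern described above can be sketched without calling the API. The helper below drains any iterable of chunk-like objects, joins their text, and keeps the last metadata it sees. The names collect_stream and StreamResult are hypothetical, the attribute names (text, usage_metadata, candidates[0].finish_reason) are assumed to match the SDK's response objects, and the chunks are fakes so the sketch runs without an API key.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable, Optional


@dataclass
class StreamResult:
    text: str
    usage_metadata: Optional[Any]
    finish_reason: Optional[Any]


def collect_stream(chunks: Iterable[Any]) -> StreamResult:
    """Drain a chunk iterator, keeping the text and the last metadata seen."""
    parts = []
    usage = None
    finish = None
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if text:
            parts.append(text)
        # Assumption: metadata, when present, is most complete on later chunks.
        chunk_usage = getattr(chunk, "usage_metadata", None)
        if chunk_usage is not None:
            usage = chunk_usage
        candidates = getattr(chunk, "candidates", None) or []
        if candidates and getattr(candidates[0], "finish_reason", None) is not None:
            finish = candidates[0].finish_reason
    return StreamResult("".join(parts), usage, finish)


# Fake chunks standing in for what generate_content_stream() yields.
fake_chunks = [
    SimpleNamespace(text="Hello", usage_metadata=None, candidates=[]),
    SimpleNamespace(
        text=" there",
        usage_metadata=SimpleNamespace(total_token_count=12),
        candidates=[SimpleNamespace(finish_reason="STOP")],
    ),
]
result = collect_stream(fake_chunks)
print(result.text)
```

With a real client, the same helper could be passed the iterator returned by generate_content_stream(), although it would then no longer display text progressively; it is meant for logging or accounting after the fact.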

The example below builds a streaming customer support agent for an insurance company. It shows how chunks arrive progressively, measures time-to-first-token (TTFT), and demonstrates the FastAPI SSE pattern for web deployment.

import time
from google import genai
from google.genai import types

client = genai.Client(api_key="your-gemini-api-key")

INSURANCE_SYSTEM = (
    "You are a helpful customer service agent for SafeGuard Insurance India. "
    "Answer policy and claims questions clearly and accurately. "
    "Always be warm and mention next steps the customer should take."
)

def stream_support_response(question: str) -> None:
    """Stream one answer to stdout and report time-to-first-token."""
    print("Agent: ", end="", flush=True)
    start = time.perf_counter()
    first_token_time = None
    full_response = ""

    for chunk in client.models.generate_content_stream(
        model="gemini-2.0-flash",
        config=types.GenerateContentConfig(
            system_instruction=INSURANCE_SYSTEM,
            max_output_tokens=300,
            temperature=0.3
        ),
        contents=[question]
    ):
        # Skip chunks that carry no text.
        if chunk.text:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            print(chunk.text, end="", flush=True)
            full_response += chunk.text

    total_time = time.perf_counter() - start
    ttft = round(first_token_time - start, 2) if first_token_time is not None else 0
    print()  # newline after stream ends
    print(f"[TTFT: {ttft}s | Total: {round(total_time, 2)}s | "
          f"{len(full_response.split())} words]")

questions = [
    "My car was scratched in a parking lot. How do I raise a claim?",
    "I missed my premium payment by 5 days. Has my policy lapsed?"
]

for q in questions:
    print("Customer:", q)
    stream_support_response(q)
    print()

It produces the following output, with the words appearing progressively as they are generated:

Customer: My car was scratched in a parking lot. How do I raise a claim?
Agent: I am sorry to hear about the scratch on your car. Here is how to
raise a claim with SafeGuard: First, take clear photos of the damage from
multiple angles before moving the vehicle. Then log in to the SafeGuard app
or call 1800-SAFEGUARD within 48 hours of the incident. You will need your
policy number, the date and location of the incident, and the photos. A
survey will be arranged within 24 hours at your preferred location.
[TTFT: 0.28s | Total: 3.1s | 78 words]

Customer: I missed my premium payment by 5 days. Has my policy lapsed?
Agent: Do not worry - SafeGuard provides a 30-day grace period for premium
payments, so a 5-day delay will not lapse your policy. Your coverage remains
active during this grace period. Please make the payment as soon as possible
through the SafeGuard app, net banking, or any UPI app. If you need to set up
auto-pay to avoid this in future, I can guide you through that.
[TTFT: 0.31s | Total: 2.8s | 72 words]

# First words appear in 0.28-0.31 seconds - customer sees immediate response
# Without streaming: customer would wait the full 3 seconds seeing nothing

# FastAPI endpoint - streams Gemini chunks to browser via SSE
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from google import genai
from google.genai import types

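app = FastAPI()
client = genai.Client(api_key="your-gemini-api-key")

# Same system prompt as the console example, repeated so this file runs on its own.
INSURANCE_SYSTEM = (
    "You are a helpful customer service agent for SafeGuard Insurance India. "
    "Answer policy and claims questions clearly and accurately. "
    "Always be warm and mention next steps the customer should take."
)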

@app.get("/support/chat")
async def chat_stream(question: str):
    def generate():
        for chunk in client.models.generate_content_stream(
            model="gemini-2.0-flash",
            config=types.GenerateContentConfig(
                system_instruction=INSURANCE_SYSTEM,
                max_output_tokens=300
            ),
            contents=[question]
        ):
            if chunk.text:
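                # SSE format: "data: <text>" followed by a blank line.
                # A raw newline inside a chunk would break the event framing,
                # so flatten it (this loses paragraph breaks in the answer).
                safe = chunk.text.replace("\n", " ")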
                yield "data: " + safe + "\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")

# Browser JavaScript to consume:
# const es = new EventSource("/support/chat?question=How+do+I+raise+a+claim");
# es.onmessage = e => {
#   if (e.data === "[DONE]") { es.close(); return; }
#   document.getElementById("response").innerText += e.data;
# };

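The FastAPI SSE endpoint pushes each Gemini chunk to the browser as its own SSE event. The output below shows one word per event for illustration; real chunks often contain several words: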

GET /support/chat?question=How+do+I+raise+a+claim

data: I
data:  am
data:  sorry
data:  to
data:  hear
...(tokens stream continuously)...
data: [DONE]

# Each SSE event fires onmessage in the browser
# Text accumulates word by word - smooth typewriter effect
# No polling, no long-held connections beyond the stream duration
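The endpoint above flattens newlines so that each chunk fits on a single data: line. An alternative, sketched here under the SSE rule that consecutive data: lines in one event are rejoined with newlines by the client, is to emit one data: line per line of text. The helper name sse_event is hypothetical and is not part of FastAPI or the Gemini SDK.

```python
def sse_event(text: str) -> str:
    """Frame text as a single SSE event, one 'data:' line per line of text."""
    # Normalise carriage returns, which SSE parsers also treat as line breaks.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = normalised.split("\n")
    # The blank line at the end terminates the event.
    return "".join(f"data: {line}\n" for line in lines) + "\n"


single = sse_event(" am")
multi = sse_event("Step 1\nStep 2")
print(repr(single))
print(repr(multi))
```

In the generator, yield sse_event(chunk.text) would then replace the manual string concatenation, and e.data in the browser would contain the original line breaks.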

Streaming best practices: always flush stdout when printing chunks (flush=True); otherwise the output is buffered and the benefit of streaming is lost. For FastAPI, return a StreamingResponse with media_type="text/event-stream". It accepts both regular and async generators; the example above uses a regular generator, which is iterated in a worker thread, while an async generator avoids occupying a thread for the lifetime of each stream. Add a heartbeat roughly every 15 seconds on long-running streams so that proxies do not drop an idle connection. Handle interruptions gracefully on the client side: if the user navigates away, close the EventSource to free server resources. For mobile apps without an EventSource client, read the chunked HTTP response body directly; the streaming pattern is the same, only the client-side parsing differs.
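The heartbeat advice can be sketched as a wrapper around any async iterator of SSE strings. This is one possible approach, not an SDK or FastAPI feature: the name with_heartbeat is hypothetical, and it relies on the SSE convention that lines starting with a colon are comments the browser ignores. The slow source is a stand-in for a model stream, and the interval is shortened so the demo finishes quickly.

```python
import asyncio
from typing import AsyncIterator, List


async def with_heartbeat(
    events: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[str]:
    """Yield items from `events`, inserting an SSE comment whenever it is idle."""
    iterator = events.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            # asyncio.wait() leaves the task running on timeout (no cancellation).
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": keepalive\n\n"  # SSE comment line, ignored by clients
                continue
            try:
                event = pending.result()
            except StopAsyncIteration:
                return
            yield event
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        # Stop waiting on the source if the consumer goes away early.
        pending.cancel()


async def _demo() -> List[str]:
    async def slow_source() -> AsyncIterator[str]:
        yield "data: first\n\n"
        await asyncio.sleep(0.25)  # simulated pause in generation
        yield "data: second\n\n"

    return [e async for e in with_heartbeat(slow_source(), interval=0.1)]


events = asyncio.run(_demo())
print(events)
```

In an endpoint, the wrapped generator would be passed to StreamingResponse in place of the raw one, with the interval left at a value shorter than the proxy's idle timeout.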
