Streaming Responses with Ollama in Python
Author: Venkata Sudhakar
Ollama supports streaming responses: the model output is delivered token by token as it is generated, instead of waiting for the full response to complete. This is essential for building chat interfaces and interactive tools where users expect to see output appear in real time. The Ollama Python client makes streaming simple by accepting a stream parameter in the chat call. At ShopMax India, the internal customer support bot uses streaming to display responses progressively, improving the perceived response speed.

When streaming is enabled, the ollama.chat() function returns a generator. Each item the generator yields is a chunk containing a message with a content field. You can accumulate the chunks to build the full response, or process each chunk as it arrives to display it immediately. The example below shows how to use Ollama streaming to print model output as it is generated.
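A minimal sketch of the streaming loop, assuming the official ollama Python package and a locally pulled model; the model name "llama3", the prompt, and the "ShopMax Assistant" prefix are illustrative choices, not fixed values:

```python
def collect_stream(chunks):
    """Print each streamed chunk as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]  # each chunk carries a message with a content field
        print(piece, end="", flush=True)     # flush=True so each token appears immediately
        parts.append(piece)
    print()
    return "".join(parts)


def ask_assistant(prompt, model="llama3"):
    """Stream a chat completion from a local Ollama server (pip install ollama)."""
    import ollama  # imported here so collect_stream stays usable without the package

    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # makes ollama.chat() return a generator of chunks
    )
    print("ShopMax Assistant: ", end="", flush=True)
    reply = collect_stream(stream)
    print(f"Total characters: {len(reply)}")
    return reply
```

Calling ask_assistant("Recommend a laptop for a student with a budget of Rs 50,000.") prints the reply progressively and then reports the character count.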
Running it produces output similar to the following:
ShopMax Assistant: For a student budget of Rs 50,000, consider the
ShopMax EduBook with an Intel Core i5, 8GB RAM, and 512GB SSD.
It handles college assignments, coding, and video calls comfortably.
Available at ShopMax stores in Mumbai, Bangalore, and Delhi.
Total characters: 231
The flush=True argument in the print call is important: without it, Python may buffer the output instead of displaying it immediately. With stream=True, the client uses the same Ollama REST endpoint but keeps the HTTP connection open and reads each JSON chunk as it arrives. For web applications at ShopMax India, the stream can be forwarded to the browser using server-sent events (SSE) to give customers a real-time typing effect in the chat UI.
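The SSE forwarding step can be kept framework-agnostic. The helper below, a sketch, converts Ollama stream chunks into SSE frames that any web framework's streaming response can send (for example FastAPI's StreamingResponse with media_type "text/event-stream"); the [DONE] sentinel is our own convention for illustration, not part of the SSE specification:

```python
def to_sse_frames(chunks):
    """Convert Ollama streaming chunks into server-sent-event frames.

    Each SSE frame has the form "data: <payload>\n\n"; browsers consume
    them with the EventSource API. Note that payloads containing newlines
    would need extra escaping or JSON-encoding in a production handler.
    """
    for chunk in chunks:
        yield f"data: {chunk['message']['content']}\n\n"
    # End-of-stream sentinel the browser can watch for (an assumed
    # convention, not defined by the SSE standard).
    yield "data: [DONE]\n\n"
```

In a FastAPI handler, to_sse_frames(ollama.chat(..., stream=True)) could be passed directly to a StreamingResponse, keeping the Ollama-specific chunk handling in one place.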