Streaming Responses with Ollama in Python
Author: Venkata Sudhakar
Ollama supports streaming responses: the model output is delivered token by token as it is generated, instead of waiting for the full response to complete. This is essential for building chat interfaces and interactive tools where users expect to see output appear in real time. The Ollama Python client makes streaming simple by accepting a stream parameter in the chat call. At ShopMax India, the internal customer support bot uses streaming to display responses progressively, improving the perceived response speed.

When streaming is enabled, the ollama.chat() function returns a generator. Each item the generator yields is a chunk containing a message with a content field. You can accumulate the chunks to build the full response, or process each chunk as it arrives to display it immediately. The example below shows how to use Ollama streaming to print model output as it is generated.
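A minimal sketch of the streaming loop, assuming the official ollama Python package and a locally pulled model; the model name "llama3", the prompt, and the "ShopMax Assistant" prefix are illustrative choices, not fixed values:

```python
def collect_stream(chunks):
    """Print each streamed chunk as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]  # each chunk carries a message with a content field
        print(piece, end="", flush=True)     # flush=True so each token appears immediately
        parts.append(piece)
    print()
    return "".join(parts)


def ask_assistant(prompt, model="llama3"):
    """Stream a chat completion from a local Ollama server (pip install ollama)."""
    import ollama  # imported here so collect_stream stays usable without the package

    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # makes ollama.chat() return a generator of chunks
    )
    print("ShopMax Assistant: ", end="", flush=True)
    reply = collect_stream(stream)
    print(f"Total characters: {len(reply)}")
    return reply
```

Calling ask_assistant("Recommend a laptop for a student with a budget of Rs 50,000.") prints the reply progressively and then reports the character count.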
Running it produces output similar to the following:
ShopMax Assistant: For a student budget of Rs 50,000, consider the
ShopMax EduBook with an Intel Core i5, 8GB RAM, and 512GB SSD.
It handles college assignments, coding, and video calls comfortably.
Available at ShopMax stores in Mumbai, Bangalore, and Delhi.
Total characters: 231
The flush=True argument in the print call is important: without it, Python may buffer the output instead of displaying it immediately. With stream=True, the client uses the same Ollama REST endpoint but keeps the HTTP connection open and reads each JSON chunk as it arrives. For web applications at ShopMax India, the stream can be forwarded to the browser using server-sent events (SSE) to give customers a real-time typing effect in the chat UI.
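The SSE forwarding step can be kept framework-agnostic. The helper below, a sketch, converts Ollama stream chunks into SSE frames that any web framework's streaming response can send (for example FastAPI's StreamingResponse with media_type "text/event-stream"); the [DONE] sentinel is our own convention for illustration, not part of the SSE specification:

```python
def to_sse_frames(chunks):
    """Convert Ollama streaming chunks into server-sent-event frames.

    Each SSE frame has the form "data: <payload>\n\n"; browsers consume
    them with the EventSource API. Note that payloads containing newlines
    would need extra escaping or JSON-encoding in a production handler.
    """
    for chunk in chunks:
        yield f"data: {chunk['message']['content']}\n\n"
    # End-of-stream sentinel the browser can watch for (an assumed
    # convention, not defined by the SSE standard).
    yield "data: [DONE]\n\n"
```

In a FastAPI handler, to_sse_frames(ollama.chat(..., stream=True)) could be passed directly to a StreamingResponse, keeping the Ollama-specific chunk handling in one place.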