Why Streaming Matters

Understand the difference between waiting for a complete response and printing tokens as they arrive

The problem with waiting

Right now, generate_answer waits for Gemini to finish generating the entire response before returning anything. For a three-second response, the terminal is blank the entire time — then the full text appears at once.

This feels slow even when it isn't. The model started generating text almost immediately, but the function buffered the entire response before returning it.

How streaming works

The Gemini API supports a streaming mode: instead of returning one complete response, it returns an iterator. As the model generates each piece of text (called a chunk), it sends that chunk to your code immediately. Your code prints it and moves on to the next chunk.

The result: text appears in the terminal word by word, the moment the model produces it — just like watching someone type.
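The consume-and-accumulate pattern is the heart of streaming. A minimal sketch, using a stand-in generator in place of the API's chunk iterator (`fake_stream` is a hypothetical placeholder, not part of the Gemini SDK):

```python
def fake_stream():
    # Stand-in for the API's chunk iterator: yields pieces of text
    # one at a time, the way the real stream delivers them.
    for piece in ["Streaming ", "prints ", "text ", "as it arrives."]:
        yield piece

full_text = ""
for chunk in fake_stream():
    print(chunk, end="", flush=True)  # show each piece the moment it arrives
    full_text += chunk                # accumulate so the caller gets the whole answer
print()
```

`flush=True` matters here: without it, the terminal may buffer the output and defeat the point of streaming.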

What changes

You will replace generate_answer with a new function called stream_answer. The key differences:

  • generate_answer: calls generate_content, waits for the full response, returns the text
  • stream_answer: calls generate_content_stream, receives an iterator, prints each chunk immediately, and returns the accumulated full text when the stream ends

The API call changes from generate_content to generate_content_stream. Everything else in the assistant — search, prompts, history — stays the same.
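A sketch of what `stream_answer` could look like, assuming the `google-genai` Python SDK and a `GEMINI_API_KEY` in the environment (the model name and exact argument shapes are assumptions, not taken from the lesson's codebase):

```python
def stream_answer(prompt: str, model: str = "gemini-2.0-flash") -> str:
    """Print the response chunk by chunk as it streams in, then
    return the accumulated full text when the stream ends."""
    # Imported lazily so the function can be defined without the SDK installed.
    from google import genai  # assumes the google-genai package

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    full_text = ""
    for chunk in client.models.generate_content_stream(
        model=model, contents=prompt
    ):
        if chunk.text:
            print(chunk.text, end="", flush=True)  # show text immediately
            full_text += chunk.text                # keep the complete answer
    print()
    return full_text
```

Callers that previously used `generate_answer` still receive the full text as a return value, so the rest of the assistant does not need to change.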
