LLM applications have two practical problems that show up as soon as you build a chat interface.
The first is waiting. Without streaming, your app sends a request and receives the answer only after the model finishes generating it. That is fine for short answers, but it feels slow for longer ones. Streaming lets your app show partial text as it arrives.
The second is memory. With the Chat Completions style API used in this course, the model does not remember earlier turns by itself. If you want the model to understand an ongoing conversation, your application must store the message history and send the relevant parts with each request.
In this chapter, you will learn how to stream responses, store conversation history, keep long chats within the model's context window, and decide what to do when old context no longer fits.