LLM calls are slow and expensive compared with ordinary API requests. Many applications also repeat the same work: they resend the same system prompts, answer similar user questions, fetch identical retrieval results, recompute the same embeddings, and regenerate common support answers. Caching can reduce that waste, but only when the cached result is still correct for the user, tenant, permissions, model, and data version.
In this chapter, we will look at caching strategies that work well for LLM applications and at the places where caching can go wrong.
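
To make the correctness condition concrete, here is a minimal sketch of one way to build a cache key that changes whenever the user, tenant, permissions, model, or data version changes. The names `CacheScope`, `cache_key`, and `cached_answer` are hypothetical and used only for illustration, and the in-memory dictionary is an assumed stand-in for whatever cache backend an application actually uses.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class CacheScope:
    """Hypothetical scope fields; adapt them to your own application."""
    tenant_id: str
    user_id: str
    permissions: FrozenSet[str]
    model: str
    data_version: str


def cache_key(scope: CacheScope, prompt: str) -> str:
    """Derive a deterministic key from the scope and the prompt."""
    payload = {
        "tenant": scope.tenant_id,
        "user": scope.user_id,
        # Sort so the same permission set always serializes identically.
        "permissions": sorted(scope.permissions),
        "model": scope.model,
        "data_version": scope.data_version,
        "prompt": prompt,
    }
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# In-memory stand-in for a real cache backend (an assumption of this sketch).
_cache: Dict[str, str] = {}


def cached_answer(
    scope: CacheScope, prompt: str, generate: Callable[[str], str]
) -> str:
    """Return a cached answer, calling `generate` only on a cache miss."""
    key = cache_key(scope, prompt)
    if key not in _cache:
        _cache[key] = generate(prompt)
    return _cache[key]
```

Because every scope field is part of the key, a change in any of them produces a miss rather than a stale or cross-tenant hit. The trade-off is a lower hit rate: an application whose answers depend only on permissions, not on the individual user, might reasonably leave `user_id` out of the key.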