An AI demo can work with one happy path. A production system has to handle provider errors, malformed outputs, stale retrieval, policy refusals, slow generations, tool failures, and users who do not ask questions the way your test cases do. Reliability is the practice of making those failures bounded, visible, and recoverable.
In this chapter, we will design AI systems that degrade deliberately instead of failing in confusing ways.