An AI application can be up and still be wrong.
The HTTP endpoint may return 200, the database may be healthy, and the model provider may be responding. Users can still receive stale answers, unsupported claims, malformed JSON, unsafe tool calls, or responses that cost ten times more than expected.
Monitoring tells you whether the system looks healthy. Observability helps you explain what happened when it does not, or when it looks healthy and the output is still wrong. AI systems need both traditional service telemetry and AI-specific signals: prompts, retrieved context, model configuration, token usage, tool calls, safety decisions, and sampled quality checks.
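To make the list of signals concrete, here is a minimal sketch of a per-request trace record that carries service telemetry and AI-specific fields side by side. Every name in it (`AIRequestTrace`, `hash_prompt`, `to_log_line`, the field names, and the example values) is hypothetical and not taken from any particular library. Storing a hash of the prompt rather than the raw text is one possible choice, assumed here for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AIRequestTrace:
    """One record per model call. All field names are illustrative."""
    request_id: str
    http_status: int          # traditional service telemetry
    latency_ms: float
    model: str                # model configuration
    temperature: float
    prompt_sha256: str        # a hash, not the raw prompt text
    retrieved_doc_ids: list[str] = field(default_factory=list)
    prompt_tokens: int = 0    # token usage
    completion_tokens: int = 0
    tool_calls: list[str] = field(default_factory=list)
    safety_decision: str = "allow"
    quality_score: Optional[float] = None  # set only for sampled requests


def hash_prompt(prompt: str) -> str:
    """Return a stable fingerprint so identical prompts can be grouped."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def to_log_line(trace: AIRequestTrace) -> str:
    """Serialize the trace as a single JSON line for a structured log."""
    return json.dumps(asdict(trace), sort_keys=True)


# Example values are made up for illustration.
trace = AIRequestTrace(
    request_id="req-001",
    http_status=200,
    latency_ms=840.0,
    model="example-model",
    temperature=0.2,
    prompt_sha256=hash_prompt("What is our refund policy?"),
    retrieved_doc_ids=["doc-17", "doc-42"],
    prompt_tokens=1200,
    completion_tokens=150,
    tool_calls=["search_orders"],
)
print(to_log_line(trace))
```

A record like this has `http_status` set to 200 and still carries enough context, such as which documents were retrieved and which tools were called, to investigate a wrong answer afterwards.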
This chapter covers what to capture in a production AI system, what to aggregate, what to alert on, and what to avoid logging unless you have a clear reason and permission.