Some constraints cannot be solved by tuning API calls. Data residency, strict isolation, offline operation, custom model weights, predictable high-volume traffic, and low-level latency control can all push a team toward self-hosting.
Self-hosting means your team runs the model server, chooses the hardware, manages scaling, and owns the failure modes. It can reduce cost at scale and give you more control, but it also moves work from the provider's platform team to yours.
The question is not whether self-hosting is "better." The question is whether your workload, risk profile, team, and traffic pattern justify owning inference infrastructure.