Last Updated: May 29, 2026
Embedding-based retrieval works at serving time by encoding queries and items into vectors, searching an approximate nearest neighbor (ANN) index, and returning candidates. That leaves open how to train encoders whose vectors actually capture relevance. The quality of every candidate depends on that training, and on whether the similarity measure used by the serving index matches the one the training objective optimized.
The two-tower architecture is the answer most large-scale retrieval systems converge on. It gives you a clean serving split: item representations are precomputed offline, while query or user representations are computed online.
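The serving split can be sketched in a few lines of NumPy. This is a minimal illustration, not a production design: the two "towers" are single random linear layers standing in for trained encoders, the feature dimensions and the names `encode_items`, `encode_query`, and `retrieve` are hypothetical, and an exact brute-force dot product stands in for the ANN index.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_ITEM_FEATURES, DIM_QUERY_FEATURES, DIM_EMBED = 16, 8, 4

# Hypothetical towers: one random linear layer each.
# In a real system these would be trained neural encoders.
W_item = rng.normal(size=(DIM_ITEM_FEATURES, DIM_EMBED))
W_query = rng.normal(size=(DIM_QUERY_FEATURES, DIM_EMBED))


def l2_normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def encode_items(item_features):
    """Item tower: maps item features into the shared embedding space."""
    return l2_normalize(item_features @ W_item)


def encode_query(query_features):
    """Query tower: maps query or user features into the same space."""
    return l2_normalize(query_features @ W_query)


# Offline: embed the whole catalog once and store the result.
item_features = rng.normal(size=(1000, DIM_ITEM_FEATURES))
item_index = encode_items(item_features)  # shape (1000, DIM_EMBED)


def retrieve(query_features, k=5):
    """Online: embed one query, score it against every stored item vector."""
    q = encode_query(query_features)
    scores = item_index @ q
    # Exact top-k by brute force; an ANN index would approximate this step.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]  # best candidate first


candidates = retrieve(rng.normal(size=DIM_QUERY_FEATURES))
```

The point of the split is that only `encode_query` and the index lookup run per request. The item tower never runs online, so it can be as expensive as the offline budget allows, while the query tower has to fit the serving latency budget.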