Last Updated: May 29, 2026
Single-GPU training runs into three distinct limits: the run takes too long, the model no longer fits in memory, or the input pipeline can't feed the device fast enough.
Distributed training spreads computation across multiple GPUs or machines. It also adds new concerns that don't exist on one device: gradient synchronization, larger effective batches, checkpoint coordination, stragglers, and network bottlenecks.
This chapter focuses on the choices an interviewer expects you to reason about: data parallelism, model-state sharding, model parallelism, communication, and the convergence trade-offs that appear at scale.
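To make two of the concerns above concrete, gradient synchronization and larger effective batches, here is a minimal sketch in plain Python. It simulates data-parallel workers on a toy one-parameter model and does not use real GPUs or a communication library. The helper names (`local_gradient`, `all_reduce_mean`, `effective_batch_size`) and the `accumulation_steps` parameter are illustrative assumptions, not part of any framework's API.

```python
# Toy simulation of data-parallel gradient synchronization.
# No real devices or communication are involved; all names are hypothetical.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)**2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)

def effective_batch_size(per_device_batch, world_size, accumulation_steps=1):
    """Examples contributing to one optimizer step across all workers."""
    return per_device_batch * world_size * accumulation_steps

# Eight (x, y) examples split evenly across four simulated workers.
data = [(float(i), 2.0 * i + 1.0) for i in range(8)]
world_size = 4
per_device_batch = len(data) // world_size
shards = [
    data[rank * per_device_batch:(rank + 1) * per_device_batch]
    for rank in range(world_size)
]

w = 0.5
# Each worker computes a gradient on its own shard only.
local_grads = [local_gradient(w, shard) for shard in shards]
# Synchronization step: average the per-worker gradients.
synced_grad = all_reduce_mean(local_grads)
# Reference: the gradient a single device would compute on all the data.
full_batch_grad = local_gradient(w, data)
```

Because the shards are equal in size, the averaged per-worker gradient matches the gradient over all eight examples. Each optimizer step therefore behaves like a step on a batch of `per_device_batch * world_size` examples, which is why adding workers changes the effective batch size even when the per-device batch stays fixed.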