Last Updated: May 29, 2026
Production datasets rarely arrive ready to train on. They tend to be too large to use in full, skewed heavily toward common cases, and shaped by how the data happened to be collected.
Sampling decides which examples the model sees and how often it sees them. Augmentation creates label-preserving variations of examples you already have. Both improve the training signal without simply asking for more labels.
What matters here is being able to build a training set that matches the product objective, covers the rare cases, and keeps the label intact, rather than reciting a catalog of techniques.