An LLM does not write a whole answer all at once. It builds the answer one token at a time.

At each step, the model reads the tokens it has so far, runs them through the transformer, and produces a raw score (a logit) for every token in its vocabulary. A softmax turns those scores into probabilities, and a decoding strategy then chooses the next token from them.
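To make the scoring-and-choosing step concrete, here is a minimal Python sketch. It assumes a made-up four-token vocabulary and invented scores, and `softmax` and `greedy_pick` are hypothetical helper names, not a library API. It converts raw scores into probabilities with a numerically stable softmax and then applies greedy decoding, the simplest strategy, which always takes the highest-probability token.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max avoids overflow in exp() without changing the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    """Greedy decoding: return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical 4-token vocabulary with invented scores for a single step.
vocab = ["the", "cat", "sat", "<eos>"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)
next_token = vocab[greedy_pick(probs)]
print(next_token)  # prints "the"
```

Greedy decoding is deterministic: the same context always yields the same next token. Sampling strategies trade that predictability for variety.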

That loop is the mechanical core of text generation:

context -> model -> token scores -> decoding -> next token -> updated context
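The loop above can be sketched in a few lines of Python. This is a toy under stated assumptions: `toy_model` is a hypothetical stand-in for a transformer that scores the vocabulary from a hard-coded bigram table (it looks only at the last token, whereas a real model reads the whole context), decoding is greedy, and `max_new_tokens` and `stop_token` are illustrative parameter names rather than any particular library's.

```python
import math

VOCAB = ["<bos>", "the", "cat", "sat", "down", "<eos>"]

# Hypothetical stand-in for a transformer: one row of scores per previous token.
BIGRAM_SCORES = {
    "<bos>": [0, 3, 1, 0, 0, 0],
    "the":   [0, 0, 3, 1, 0, 0],
    "cat":   [0, 0, 0, 3, 0, 1],
    "sat":   [0, 0, 0, 0, 3, 1],
    "down":  [0, 0, 0, 0, 0, 3],
}

def toy_model(context):
    """Return a score for every vocabulary token (toy: last token only)."""
    return BIGRAM_SCORES[context[-1]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(context, max_new_tokens=10, stop_token="<eos>"):
    context = list(context)
    for _ in range(max_new_tokens):                 # output limit
        logits = toy_model(context)                 # context -> model -> token scores
        probs = softmax(logits)                     # scores -> probabilities
        idx = max(range(len(probs)), key=probs.__getitem__)  # decoding (greedy)
        token = VOCAB[idx]                          # next token
        if token == stop_token:                     # stop token ends generation
            break
        context.append(token)                       # updated context
    return context

print(generate(["<bos>"]))  # ['<bos>', 'the', 'cat', 'sat', 'down']
```

Calling `generate(["<bos>"], max_new_tokens=2)` returns `["<bos>", "the", "cat"]`: the output limit ends the loop before the stop token appears, which is one mechanical way a model ends up stopping mid-sentence.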

The details matter in real applications. Temperature, top-p, stop tokens, context length, and output limits all shape this loop. When a model repeats itself, drifts away from the task, invents a citation, or stops mid-sentence, the cause is often tied to how generation is configured.
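Temperature and top-p act on the probabilities before a token is drawn. The sketch below is a minimal, self-contained illustration, assuming plain Python lists for the scores; `softmax_with_temperature`, `top_p_filter`, and `sample` are hypothetical helpers, and real inference libraries differ in edge-case details (for example, exactly which token at the cutoff is kept, or how a temperature of 0 is handled).

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then apply a numerically stable softmax.
    temperature < 1 sharpens the distribution; temperature > 1 flattens it."""
    if temperature <= 0:
        # Some APIs treat 0 as greedy decoding; this sketch simply rejects it.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of most probable tokens
    whose cumulative probability reaches top_p, zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_total
    return filtered

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Apply temperature, then top-p, then draw one token index at random."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up scores for a hypothetical 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: more mass on index 0
print(softmax_with_temperature(logits, 2.0))  # flatter: mass spread out
print(sample(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

With these example scores, `top_p=0.5` keeps only the top token (its probability alone is about 0.61), so sampling becomes effectively greedy. Raising the temperature flattens the distribution, which gives lower-ranked tokens a better chance of being chosen.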

The Next-Token Prediction Loop
