Last Updated: March 14, 2026
Large language models like GPT, Claude, and Llama are all built on the same core idea: the Transformer architecture. Introduced in the 2017 paper “Attention Is All You Need”, the transformer fundamentally changed how machines process language and is the foundation behind modern AI systems.
At a high level, a transformer reads a sequence of tokens and learns how each token relates to the others. Instead of processing text one word at a time like older models, it can look at the entire sequence simultaneously and decide which parts of the input matter most for predicting the next token.
The key innovation that enables this is self-attention.
In this chapter, we will walk through the transformer architecture in a simplified way and understand the core components that power modern large language models.
To appreciate what transformers solved, you need to understand what came before them. Before 2017, the dominant architecture for processing sequences of text was the Recurrent Neural Network (RNN) and its variants like LSTMs and GRUs.
RNNs process text one word at a time, left to right, passing a "hidden state" from one step to the next. Think of it like a game of telephone: the first word whispers its information to the second word, which combines it with its own information and whispers to the third, and so on.
This sequential processing had two major problems.
First, information degrades over distance. By the time the model reaches word 100, the information from word 1 has been passed through 99 intermediate steps. Like a game of telephone, the message degrades, and the model "forgets" earlier context. LSTMs improved this, but they never fully solved it for very long sequences.
Second, training could not be parallelized. Because each step depends on the previous step's output, you have to process words one at a time. You cannot use a GPU's thousands of cores to process them simultaneously, so training was painfully slow.
In 2017, a team at Google published "Attention Is All You Need" and introduced the transformer architecture. Its key innovation was simple but powerful: instead of processing words sequentially, let every word look at every other word directly, all at once. No more telephone. Every word gets to talk to every other word in a single step.
This is the attention mechanism, and it changed everything.
Attention is the core idea behind transformers. Before getting into formulas, it helps to build an intuition for what it is doing.
Imagine you are reading a sentence and trying to understand the meaning of a particular word. You do not treat every other word in the sentence as equally important. Instead, your brain naturally focuses on the words that provide the most useful context.
For example, when you read the word “it” in a sentence, you instinctively look back to earlier words to figure out what “it” refers to. Some words are clearly more relevant than others.
This is essentially what the attention mechanism allows a transformer to do.
For every token in the input, the model creates three vectors:

- a Query vector, representing what the token is looking for
- a Key vector, representing what the token offers to others
- a Value vector, representing the information the token actually carries
You can think of them as three different perspectives of the same token.
Every token produces all three vectors.
Once these vectors are created, the model performs a series of comparisons.
For a given token, its Query is compared with the Keys of every token in the sequence. This produces a set of attention scores that measure how relevant each token is to the current one.
Tokens with higher scores receive more attention. Their Values contribute more strongly when the model builds the final representation.
In simplified form, the process looks like this:
Input tokens → each token generates Q, K, V → compare Q with all K vectors → compute attention scores → combine V vectors using those scores
You can think of it as a weighted aggregation of information from the entire sentence.
Let’s make that concrete with a sentence like: "The cat sat on the mat because it was tired."
When the model processes the word “it,” it needs to figure out what “it” refers to. To do that, it compares the Query from “it” against the Keys of the other words in the sentence. If the model has learned useful language patterns, the Query for “it” will align more strongly with “cat” than with “mat.” As a result, “cat” receives a higher attention score, so its Value has a larger influence on how “it” is represented.
Internally, the model is not reasoning in plain English phrases like “this word refers to that noun.” Everything happens through learned numerical vectors.
But the effect is powerful.
Attention allows every token to dynamically gather information from the most relevant parts of the input. Instead of treating a sentence as a simple left-to-right sequence, the model can examine relationships across the entire context and focus on the words that matter most.
This ability to selectively focus on relevant context is what makes transformers so effective at understanding and generating language.
Now let’s walk through how self-attention works step by step. We will keep the numbers small so the process is easy to follow conceptually. Real models use vectors with hundreds or thousands of dimensions, but the idea is exactly the same.
Each input word starts as an embedding vector, which is simply a list of numbers representing the word in a high-dimensional space. These embeddings capture semantic meaning and are learned during training.
From this embedding, the model produces three new vectors:

- a Query vector
- a Key vector
- a Value vector
This is done using three learned weight matrices. In practice, this is just matrix multiplication.
Every token in the sentence goes through the same transformation, producing its own Query, Key, and Value vectors.
Next, the model determines how much attention one word should pay to another.
It does this by comparing a token’s Query with the Keys of all tokens in the sequence. The comparison is done using a dot product, which measures how similar two vectors are.
If two vectors point in similar directions, the dot product is large, indicating high relevance. If they point in different directions, the value is smaller.
These scores represent how strongly one token should attend to another.
The raw dot product values can become very large, especially when vector dimensions grow. Large values can make the next step unstable.
To avoid this, the scores are scaled down by dividing them by the square root of the key dimension.
Next, we apply softmax, which converts the scores into probabilities that sum to 1.
After softmax, the token “cat” holds a weight for every token in the sentence, and those weights sum to 1: relevant tokens such as “sat” receive relatively large weights, while filler words receive weights close to zero. These numbers represent how much influence each word will have on the final representation.
Finally, the model combines the Value vectors using the attention weights.
This produces a new representation for “cat” that incorporates information from other relevant tokens in the sentence.
Instead of treating words independently, the model now has a context-aware representation.
The model performs this entire process for every token in the sequence at the same time. This parallel computation is one reason transformers train efficiently on GPUs.
In research papers, the entire operation is often written as a single equation:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
You do not need to memorize this formula. The key idea is the four-step process:

- generate a Query, Key, and Value vector for every token
- compare each Query against every Key to get raw scores
- scale the scores and apply softmax to get attention weights
- combine the Value vectors using those weights
This mechanism allows every word to gather information from the most relevant parts of the sentence, giving the model a much richer understanding of context.
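The four-step process can be sketched in a few lines of plain Python. The dimensions, vectors, and values below are made up for illustration; real models operate on large matrices with learned weights, on GPUs.

```python
# A minimal sketch of scaled dot-product attention in pure Python.
# All vectors here are tiny and invented for illustration.
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """softmax(QK^T / sqrt(d_k)) V for one attention head."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # 1-2. Compare this Query with every Key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # 3. Softmax turns scores into attention weights that sum to 1.
        weights = softmax(scores)
        # 4. Weighted sum of the Value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy tokens with 2-dimensional Q/K/V vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
```

Because every output row is a weighted average of the Value vectors, each token's new representation blends information from the whole sequence.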
Single-head attention is useful, but it has a limitation: a token may need to relate to other tokens for different reasons at the same time.
Take this sentence: “The animal didn’t cross the street because it was too wide.”
To understand “it,” the model may need to pay attention to multiple parts of the sentence:

- “animal” and “street,” the two candidate referents
- “wide,” the clue that points to “street” rather than “animal”
A single attention operation produces one pattern of attention weights. That can be limiting, because language contains multiple overlapping relationships: syntax, meaning, reference, position, and more.
That is why transformers use multi-head attention. Instead of running one attention operation, the model runs several attention operations in parallel, each with its own learned projection matrices for Query, Key, and Value. The original Transformer paper introduced this idea specifically so the model could attend to information from different representation subspaces at different positions.
You can think of each head as giving the model a different perspective on the same sentence:

- one head might track syntactic relationships, such as which adjective modifies which noun
- another might track coreference, such as what a pronoun refers to
- another might track positional patterns, such as which words are nearby
These are useful intuitions, not hard-coded roles. Heads are learned during training, and real models do not come with a label saying “this is the syntax head.” But in practice, different heads often end up specializing in different kinds of patterns.
In practice, large models use many attention heads; GPT-3, for example, uses 96. Each head operates on a smaller slice of the embedding dimension: with a 12,288-dimensional embedding and 96 heads, each head works with 128 dimensions. The outputs of all heads are concatenated and projected back to the original dimension.
The key idea is simple:
Multi-head attention lets the model examine the same sentence through multiple learned lenses at once. No single head has to capture every relationship. Different heads can focus on different signals, and their combined output gives the model a richer, more nuanced understanding of the text.
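The dimension bookkeeping can be sketched in a few lines. This toy example only demonstrates the split-and-concatenate structure, using the GPT-3-style numbers from the text; it omits the per-head attention and the final projection.

```python
# A toy sketch of multi-head bookkeeping: split one token's embedding
# into per-head slices, then concatenate them back together. The real
# model runs attention on each slice and applies a learned projection.
d_model = 12288   # embedding dimension
n_heads = 96      # number of attention heads
d_head = d_model // n_heads  # 128 dimensions per head

def split_heads(vector, n_heads):
    """Slice one token's vector into n_heads equal chunks."""
    d = len(vector) // n_heads
    return [vector[i * d:(i + 1) * d] for i in range(n_heads)]

def concat_heads(chunks):
    """Concatenate per-head outputs back into one vector."""
    return [x for chunk in chunks for x in chunk]

token = list(range(d_model))          # stand-in for a real embedding
heads = split_heads(token, n_heads)   # 96 slices of 128 numbers each
merged = concat_heads(heads)          # back to 12288 numbers
```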
A transformer block contains more than just attention. A modern language model is built by stacking many of these blocks on top of one another, with each block refining the token representations a little further.
In decoder-only models such as GPT-style architectures, a block typically includes self-attention, a feed-forward network, residual connections, and layer normalization.
Each transformer block contains four components:

- a multi-head self-attention layer
- a feed-forward network
- residual connections around both sublayers
- layer normalization
The skip paths around attention and the feed-forward layer are the residual connections. They are important because they help information and gradients flow more easily through deep networks, which makes optimization more stable as the model gets deeper. Without them, the gradient signal (used for learning) would vanish after passing through dozens of layers, and the model would fail to learn. The residual connection provides a "shortcut" that lets the gradient flow directly through the network.
The feed-forward network is easy to overlook, but it is an important part of the block. Attention allows tokens to exchange information with one another. The feed-forward network then processes each token’s updated representation independently, helping the model reshape that information into something more useful for the next layer.
When many of these blocks are stacked, the representation becomes progressively richer. Earlier layers often capture more local or surface-level patterns, while deeper layers can represent more abstract relationships.
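The wiring of one block can be sketched as follows. The sublayers here are deliberately fake stand-ins (the "attention" just averages across tokens, and the "feed-forward" is a lone ReLU); what the sketch shows is the pre-norm residual structure: normalize, apply the sublayer, then add the result back to the input.

```python
# Structural sketch of a pre-norm transformer block. The sublayers are
# toy stand-ins; only the residual wiring mirrors the real architecture.
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token's vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_attention(tokens):
    # Stand-in for self-attention: every token receives the average of
    # all tokens (real attention uses learned, input-dependent weights).
    d = len(tokens[0])
    avg = [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]
    return [avg for _ in tokens]

def toy_ffn(x):
    # Stand-in for the feed-forward network, applied per token.
    return [max(0.0, v) for v in x]

def transformer_block(tokens):
    # Sublayer 1: self-attention, with a residual connection.
    attended = toy_attention([layer_norm(t) for t in tokens])
    tokens = [[a + b for a, b in zip(t, u)]
              for t, u in zip(tokens, attended)]
    # Sublayer 2: feed-forward, with a residual connection.
    return [[a + b for a, b in zip(t, toy_ffn(layer_norm(t)))]
            for t in tokens]

x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
y = transformer_block(x)
```

Note that the input is added back after each sublayer; this is the "shortcut" that lets gradients flow directly through deep stacks.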
There is an important limitation in the attention mechanism we have discussed so far. By itself, attention compares tokens to one another, but it does not inherently tell the model where those tokens appear in the sequence. If we only looked at token embeddings, the relationship between “cat” and “sat” would look the same whether the input was “the cat sat” or “sat cat the.”
That is a problem, because word order carries meaning. “The dog bit the man” and “The man bit the dog” use the same words, but mean very different things.
Older sequence models like RNNs picked up order naturally because they processed tokens one at a time. Transformers process tokens in parallel, so they need a separate way to represent position. That is why transformers add positional information to the input.
Before the first transformer block, each token already has a token embedding, which captures something about the token’s meaning. The model then combines that with a position signal so the representation contains both:

- what the token is (the token embedding)
- where the token appears in the sequence (the positional signal)
So instead of feeding only the embedding for “cat” into the model, we feed:
token embedding + positional information
That way, the model can distinguish:

- “cat” appearing as the second word of a sentence
- “cat” appearing as the seventh word
even though the token itself is the same.
The original transformer paper used fixed mathematical functions (sine and cosine waves of different frequencies) for positional encoding. Modern models like GPT and Llama use learned positional embeddings instead, where the position vectors are trained along with the rest of the model. Some newer models use Rotary Positional Encoding (RoPE), which encodes relative positions rather than absolute ones, making it easier to generalize to longer sequences than the model was trained on.
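The fixed sinusoidal scheme from the original paper can be sketched for a small dimension. Each position gets a unique pattern of sine and cosine values at different frequencies, and that pattern is added to the token embedding.

```python
# Sinusoidal positional encoding, sketched for a tiny dimension.
# Real models use hundreds or thousands of dimensions.
import math

def positional_encoding(position, d_model):
    # Even indices get a sine, odd indices a cosine, with frequencies
    # that decrease geometrically across the dimension.
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# The same token embedding at two positions yields different inputs.
embedding = [0.1, 0.2, 0.3, 0.4]
at_pos_0 = [e + p for e, p in zip(embedding, positional_encoding(0, 4))]
at_pos_5 = [e + p for e, p in zip(embedding, positional_encoding(5, 4))]
```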
Positional information is one of the reasons language models can understand order-sensitive patterns like:

- subject-object order (“dog bites man” versus “man bites dog”)
- which adjective modifies which noun
- where one clause ends and the next begins
It also connects directly to context length. Models are trained with some maximum context window, such as a few thousand tokens or much more. Performance beyond the training regime is not guaranteed, because positional behavior becomes harder to maintain as sequence length grows. This is one reason long-context modeling remains an active research and engineering area. RoPE-based methods help here, but they do not magically remove all long-context challenges.
You will commonly see three transformer variants: encoder-only, decoder-only, and encoder-decoder. They are built on the same core ideas, but they are optimized for different kinds of tasks.
Understanding the difference helps you choose the right model for the job.
An encoder-only transformer reads the full input at once. Each token can attend to the tokens on both its left and right, so the model builds a bidirectional understanding of the input. This makes encoder-only models especially good at tasks where the goal is to understand text rather than generate it. BERT is the classic example.
These models are commonly used for tasks like: classification, sentiment analysis, named entity recognition, semantic search, embeddings and similarity.
Examples include BERT, RoBERTa, and DeBERTa. BERT-style models are strong when you want a rich representation of the entire input.
A decoder-only transformer generates text one token at a time. It uses causal attention, which means each token can only attend to earlier tokens, not future ones. That fits text generation naturally: when predicting the next token, the model should only look at what has already been written. OpenAI’s original GPT paper explicitly describes GPT as a decoder-only transformer with masked self-attention.
This is the architecture family most people associate with modern LLMs. GPT models are decoder-only, and many widely used chat and completion models are built around the same autoregressive idea.
Decoder-only models are especially good at: text generation, conversation, code generation and instruction following.
An encoder-decoder transformer combines both pieces: an encoder first reads the full input bidirectionally, and a decoder then generates the output one token at a time while attending to the encoder’s representation.
This setup is ideal for sequence-to-sequence tasks, where you transform one piece of text into another. T5 is a well-known example, and Google describes it as a text-to-text framework that can be applied to translation, summarization, question answering, and classification.
Typical use cases include: translation, summarization, paraphrasing, and structured text transformation
Examples include T5, BART, and mBART. The original Transformer architecture was also introduced in an encoder-decoder form.
Here is a comparison table to help you remember:

| Variant | Attention pattern | Best at | Examples |
| --- | --- | --- | --- |
| Encoder-only | Bidirectional | Understanding text (classification, embeddings) | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right) | Generating text (chat, code, completion) | GPT family |
| Encoder-decoder | Bidirectional encoder + causal decoder | Transforming text (translation, summarization) | T5, BART, mBART |
For the rest of this course, we will almost exclusively work with decoder-only models, since they power the LLM APIs you use every day. But knowing the encoder and encoder-decoder architectures matters when you work with embeddings or fine-tune models for specific tasks.
When you see model specifications like “128K context window” or “200K context window,” that limit comes directly from how the attention mechanism works.
In self-attention, every token compares itself with every other token in the input. If the input contains n tokens, the model computes roughly n × n attention scores.
That means the amount of computation grows quadratically as the sequence gets longer.
In complexity terms, this is written as O(n²). It is one of the main practical constraints of the standard transformer architecture.
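A quick back-of-the-envelope calculation makes the quadratic growth concrete: doubling the sequence length quadruples the number of attention scores per layer per head.

```python
# Illustration of O(n^2) attention cost: the number of pairwise
# attention scores grows with the square of the sequence length.
def attention_scores(n_tokens):
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_scores(n):>18,} scores")
```

Going from 1K to 100K tokens multiplies the input by 100 but the score count by 10,000.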
This quadratic scaling affects three things that matter directly when you build applications with LLMs.
The first is cost. Longer inputs require more computation. API pricing usually scales with the number of tokens you send and receive, so a 100K-token prompt costs far more than a 1K-token prompt.
The attention computation also becomes heavier internally as the sequence grows, which is one reason long-context models are expensive to run.
The second is latency. Longer inputs take more time to process. Before a model can generate the first output token, it must compute attention across the entire prompt.
As a result, time-to-first-token increases significantly with longer inputs.
The third is memory. Attention requires storing intermediate data structures that scale with sequence length. In practice, this means the GPU must hold large attention-related tensors during computation.
For very long sequences, the memory requirements can exceed what is available even on high-end GPUs.
Because this quadratic cost is such a major limitation, researchers and engineers have developed several techniques to make long contexts more practical.
One approach is sparse attention: instead of allowing every token to attend to every other token, the model restricts attention to patterns such as local windows plus a few global tokens. This can reduce the computation from O(n²) to roughly O(n).
FlashAttention does not change the attention algorithm itself. Instead, it reorganizes the computation to be far more memory-efficient on GPUs, allowing larger sequences to be processed on the same hardware.
A related idea is sliding-window attention, where each token attends only to a fixed window of nearby tokens. Even though individual layers only see part of the sequence, stacking many layers allows information to propagate across longer contexts.
The attention computation can be split across multiple GPUs so that sequences longer than a single device’s memory limit can still be processed.
If your application processes long documents, you usually cannot send everything to the model at once. Instead, you will rely on techniques such as chunking documents, summarization pipelines, and retrieval-augmented generation (RAG).
These strategies let you work within context window limits while still giving the model access to the information it needs.
Let’s walk through what actually happens inside a large language model when you send a prompt like: “What is the capital of France?”
This example ties together everything we have discussed so far: tokenization, embeddings, attention, transformer layers, and token generation.
The first step is converting raw text into tokens.
The model cannot process plain text directly. Instead, a tokenizer splits the input into smaller units called tokens and maps each token to an integer ID from the model’s vocabulary.
For example, the prompt “What is the capital of France?” might be split into tokens roughly like:

“What”, “ is”, “ the”, “ capital”, “ of”, “ France”, “?”

Each of these tokens is then looked up in the vocabulary and replaced by its integer ID.
The exact tokens depend on the tokenizer used by the model.
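A toy word-level tokenizer makes the idea concrete. Real tokenizers use learned subword vocabularies (byte-pair encoding and similar schemes); the vocabulary and IDs below are entirely invented for illustration.

```python
# A toy word-level tokenizer. Real tokenizers operate on subwords and
# have vocabularies of tens of thousands of entries; these IDs are
# made up.
vocab = {"What": 0, "is": 1, "the": 2, "capital": 3,
         "of": 4, "France": 5, "?": 6}

def tokenize(text):
    # Split the question mark off as its own token, then split on
    # whitespace and look each piece up in the vocabulary.
    words = text.replace("?", " ?").split()
    return [vocab[w] for w in words]

ids = tokenize("What is the capital of France?")
```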
Each token ID is converted into a token embedding, which is a high-dimensional vector representing the token in a continuous space.
For large models, these vectors often have thousands of dimensions.
At this stage, the model also incorporates positional information, which allows it to understand the order of tokens in the sequence. Without positional signals, the model would not be able to distinguish between:

- “The dog bit the man”
- “The man bit the dog”
After this step, each token is represented by a vector that encodes both meaning and position.
The sequence of vectors is then processed by many stacked transformer blocks.
Each block contains:

- multi-head self-attention
- a feed-forward network
- residual connections
- layer normalization
During the attention step, every token can gather information from other tokens in the sequence. This allows the model to build contextual representations.
As the sequence moves through many layers, the representations become progressively richer.
You can loosely think of the layers as building understanding in stages:

- early layers pick up surface patterns, such as local word relationships
- middle layers capture syntax and how words relate across the sentence
- later layers represent more abstract meaning relevant to the task
This division is only approximate. In practice, the model distributes information across layers in complex ways.
After the final transformer layer, the model produces a vector for each token position.
For text generation, the model looks at the last token position and projects that vector into a space whose size equals the vocabulary size.
If the vocabulary contains 50,000 tokens, the model produces a vector of 50,000 scores.
Each score represents how likely that token is to appear next.
These scores are passed through a softmax function, which converts them into probabilities that sum to 1. For the prompt above, a token like “Paris” would receive a high probability, while most tokens receive extremely small probabilities. Only a handful are plausible next tokens.
The model then chooses the next token using a sampling strategy.
Common methods include:

- greedy decoding, which always picks the highest-probability token
- temperature sampling, which rescales the distribution before sampling
- top-k and top-p (nucleus) sampling, which restrict sampling to the most likely tokens
With low temperature and greedy decoding, the model will almost always output “Paris.”
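Greedy decoding and temperature sampling can be sketched over a toy next-token distribution. The vocabulary and raw scores below are invented; a real model produces one score per vocabulary entry.

```python
# Greedy decoding versus temperature sampling over a toy distribution.
# The vocabulary and logits are made up for illustration.
import math
import random

vocab = ["Paris", "Lyon", "London", "the"]
logits = [8.0, 2.0, 1.5, 0.5]  # raw model scores (illustrative)

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by the temperature before softmax sharpens (T < 1) or
    # flattens (T > 1) the resulting distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Always pick the highest-scoring token: deterministic.
    return vocab[logits.index(max(logits))]

def sample(logits, temperature):
    # Draw a token at random, weighted by the softmax probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(vocab, weights=probs)[0]

print(greedy(logits))        # deterministic choice
print(sample(logits, 0.7))   # usually the top token, occasionally not
```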
Once the model selects the next token, that token is appended to the input sequence.
So the prompt becomes:

“What is the capital of France? Paris”
The model then repeats the entire process to generate the next token, such as a period.
This loop continues until:

- the model produces a special end-of-sequence token
- the output reaches a maximum token limit
- a stop sequence specified by the caller appears
Unlike the prompt processing step, output tokens must be generated one at a time. Each new token requires another full forward pass through the transformer layers.
This is why longer responses take longer to produce.
It is also why streaming responses exist. Instead of waiting for the entire output to be generated, the system can send each token to the user as soon as it is produced.
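The whole autoregressive loop can be sketched as a skeleton. Here `next_token` is a hypothetical stub standing in for the full forward pass through every transformer layer; its canned answers exist only so the loop runs end to end.

```python
# Skeleton of the autoregressive generation loop. next_token() is a
# hypothetical stub; a real model runs all transformer layers over the
# whole sequence and samples from the softmax output.
def next_token(tokens):
    canned = {"What is the capital of France?": "Paris", "Paris": "."}
    return canned.get(tokens[-1], "<eos>")

def generate(prompt, max_tokens=10, stop="<eos>"):
    tokens = [prompt]
    for _ in range(max_tokens):       # stop condition: length cap
        tok = next_token(tokens)
        if tok == stop:               # stop condition: end token
            break
        tokens.append(tok)            # feed the new token back in
    return tokens

print(generate("What is the capital of France?"))
```

Each iteration appends one token and reruns the model on the extended sequence, which is exactly why output length drives latency and why streaming can emit tokens as they are produced.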