Last Updated: March 14, 2026
Large language models like GPT, Claude, and Llama are all built on the same core idea: the Transformer architecture. Introduced in the 2017 paper “Attention Is All You Need”, the transformer fundamentally changed how machines process language and is the foundation behind modern AI systems.
At a high level, a transformer reads a sequence of tokens and learns how each token relates to the others. Instead of processing text one word at a time like older models, it can look at the entire sequence simultaneously and decide which parts of the input matter most for predicting the next token.
The key innovation that enables this is self-attention.
In this chapter, we will walk through the transformer architecture in a simplified way and understand the core components that power modern large language models.
To appreciate what transformers solved, you need to understand what came before them. Before 2017, the dominant architecture for processing sequences of text was the Recurrent Neural Network (RNN) and its variants like LSTMs and GRUs.
RNNs process text one word at a time, left to right, passing a "hidden state" from one step to the next. Think of it like a game of telephone: the first word whispers its information to the second word, which combines it with its own information and whispers to the third, and so on.
This sequential processing had two major problems.
First, information degrades over distance. By the time the model reaches word 100, the information from word 1 has been passed through 99 intermediate steps. Like a game of telephone, the message degrades, and the model "forgets" earlier context. LSTMs improved this, but they never fully solved it for very long sequences.
Second, training could not be parallelized. Because each step depends on the previous step's output, you have to process words one at a time. You cannot use a GPU's thousands of cores to process them simultaneously, so training was painfully slow.
In 2017, a team at Google published "Attention Is All You Need" and introduced the transformer architecture. Its key innovation was simple but powerful: instead of processing words sequentially, let every word look at every other word directly, all at once. No more telephone. Every word gets to talk to every other word in a single step.
This is the attention mechanism, and it changed everything.
Attention is the core idea behind transformers. Before getting into formulas, it helps to build an intuition for what it is doing.
Imagine you are reading a sentence and trying to understand the meaning of a particular word. You do not treat every other word in the sentence as equally important. Instead, your brain naturally focuses on the words that provide the most useful context.
For example, when you read the word “it” in a sentence, you instinctively look back to earlier words to figure out what “it” refers to. Some words are clearly more relevant than others.
This is essentially what the attention mechanism allows a transformer to do.
For every token in the input, the model creates three vectors:

- a Query vector, representing what the token is looking for
- a Key vector, representing what the token offers to others
- a Value vector, representing the information the token actually carries
You can think of them as three different perspectives of the same token.
Every token produces all three vectors.
Once these vectors are created, the model performs a series of comparisons.
For a given token, its Query is compared with the Keys of every token in the sequence. This produces a set of attention scores that measure how relevant each token is to the current one.
Tokens with higher scores receive more attention. Their Values contribute more strongly when the model builds the final representation.
In simplified form, the process looks like this:
Input tokens → each token generates Q, K, V → compare Q with all K vectors → compute attention scores → combine V vectors using those scores
You can think of it as a weighted aggregation of information from the entire sentence.
Let’s make that concrete with a sentence like: "The cat sat on the mat because it was tired."
When the model processes the word “it,” it needs to figure out what “it” refers to. To do that, it compares the Query from “it” against the Keys of the other words in the sentence. If the model has learned useful language patterns, the Query for “it” will align more strongly with “cat” than with “mat.” As a result, “cat” receives a higher attention score, so its Value has a larger influence on how “it” is represented.
Internally, the model is not reasoning in plain English phrases like “this word refers to that noun.” Everything happens through learned numerical vectors.
But the effect is powerful.
Attention allows every token to dynamically gather information from the most relevant parts of the input. Instead of treating a sentence as a simple left-to-right sequence, the model can examine relationships across the entire context and focus on the words that matter most.
This ability to selectively focus on relevant context is what makes transformers so effective at understanding and generating language.
Now let’s walk through how self-attention works step by step. We will keep the numbers small so the process is easy to follow conceptually. Real models use vectors with hundreds or thousands of dimensions, but the idea is exactly the same.
Each input word starts as an embedding vector, which is simply a list of numbers representing the word in a high-dimensional space. These embeddings capture semantic meaning and are learned during training.
From this embedding, the model produces three new vectors:

- a Query vector
- a Key vector
- a Value vector
This is done using three learned weight matrices. In practice, this is just matrix multiplication.
Every token in the sentence goes through the same transformation, producing its own Query, Key, and Value vectors.
Next, the model determines how much attention one word should pay to another.
It does this by comparing a token’s Query with the Keys of all tokens in the sequence. The comparison is done using a dot product, which measures how similar two vectors are.
If two vectors point in similar directions, the dot product is large, indicating high relevance. If they point in different directions, the value is smaller.
These scores represent how strongly one token should attend to another.
The raw dot product values can become very large, especially when vector dimensions grow. Large values can make the next step unstable.
To avoid this, the scores are scaled down by dividing them by the square root of the key dimension.
Next, we apply softmax, which converts the scores into probabilities that sum to 1.
After softmax, the token “cat” holds a weight for every token in the sentence, and those weights sum to 1: relevant tokens such as “sat” receive relatively large weights, while filler words receive weights close to zero. These numbers represent how much influence each word will have on the final representation.
Finally, the model combines the Value vectors using the attention weights.
This produces a new representation for “cat” that incorporates information from other relevant tokens in the sentence.
Instead of treating words independently, the model now has a context-aware representation.
The model performs this entire process for every token in the sequence at the same time. This parallel computation is one reason transformers train efficiently on GPUs.
In research papers, the entire operation is often written as a single equation:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
You do not need to memorize this formula. The key idea is the four-step process:

- generate a Query, Key, and Value vector for every token
- compare each Query against every Key to get raw scores
- scale the scores and apply softmax to get attention weights
- combine the Value vectors using those weights
This mechanism allows every word to gather information from the most relevant parts of the sentence, giving the model a much richer understanding of context.
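The four-step process can be sketched in a few lines of plain Python. The dimensions, vectors, and values below are made up for illustration; real models operate on large matrices with learned weights, on GPUs.

```python
# A minimal sketch of scaled dot-product attention in pure Python.
# All vectors here are tiny and invented for illustration.
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """softmax(QK^T / sqrt(d_k)) V for one attention head."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # 1-2. Compare this Query with every Key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # 3. Softmax turns scores into attention weights that sum to 1.
        weights = softmax(scores)
        # 4. Weighted sum of the Value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy tokens with 2-dimensional Q/K/V vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
```

Because every output row is a weighted average of the Value vectors, each token's new representation blends information from the whole sequence.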
Single-head attention is useful, but it has a limitation: a token may need to relate to other tokens for different reasons at the same time.
Take this sentence: “The animal didn’t cross the street because it was too wide.”
To understand “it,” the model may need to pay attention to multiple parts of the sentence:

- “animal” and “street,” the two candidate referents
- “wide,” the clue that points to “street” rather than “animal”
A single attention operation produces one pattern of attention weights. That can be limiting, because language contains multiple overlapping relationships: syntax, meaning, reference, position, and more.
That is why transformers use multi-head attention. Instead of running one attention operation, the model runs several attention operations in parallel, each with its own learned projection matrices for Query, Key, and Value. The original Transformer paper introduced this idea specifically so the model could attend to information from different representation subspaces at different positions.
You can think of each head as giving the model a different perspective on the same sentence:

- one head might track syntactic relationships, such as which adjective modifies which noun
- another might track coreference, such as what a pronoun refers to
- another might track positional patterns, such as which words are nearby
These are useful intuitions, not hard-coded roles. Heads are learned during training, and real models do not come with a label saying “this is the syntax head.” But in practice, different heads often end up specializing in different kinds of patterns.
In practice, large models use many attention heads; GPT-3, for example, uses 96. Each head operates on a smaller slice of the embedding dimension: with a 12,288-dimensional embedding and 96 heads, each head works with 128 dimensions. The outputs of all heads are concatenated and projected back to the original dimension.
The key idea is simple:
Multi-head attention lets the model examine the same sentence through multiple learned lenses at once. No single head has to capture every relationship. Different heads can focus on different signals, and their combined output gives the model a richer, more nuanced understanding of the text.
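The dimension bookkeeping can be sketched in a few lines. This toy example only demonstrates the split-and-concatenate structure, using the GPT-3-style numbers from the text; it omits the per-head attention and the final projection.

```python
# A toy sketch of multi-head bookkeeping: split one token's embedding
# into per-head slices, then concatenate them back together. The real
# model runs attention on each slice and applies a learned projection.
d_model = 12288   # embedding dimension
n_heads = 96      # number of attention heads
d_head = d_model // n_heads  # 128 dimensions per head

def split_heads(vector, n_heads):
    """Slice one token's vector into n_heads equal chunks."""
    d = len(vector) // n_heads
    return [vector[i * d:(i + 1) * d] for i in range(n_heads)]

def concat_heads(chunks):
    """Concatenate per-head outputs back into one vector."""
    return [x for chunk in chunks for x in chunk]

token = list(range(d_model))          # stand-in for a real embedding
heads = split_heads(token, n_heads)   # 96 slices of 128 numbers each
merged = concat_heads(heads)          # back to 12288 numbers
```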
A transformer block contains more than just attention. A modern language model is built by stacking many of these blocks on top of one another, with each block refining the token representations a little further.
In decoder-only models such as GPT-style architectures, a block typically includes self-attention, a feed-forward network, residual connections, and layer normalization.
Each transformer block contains four components:

- a multi-head self-attention layer
- a feed-forward network
- residual connections around both sublayers
- layer normalization
The skip paths around attention and the feed-forward layer are the residual connections. They are important because they help information and gradients flow more easily through deep networks, which makes optimization more stable as the model gets deeper. Without them, the gradient signal (used for learning) would vanish after passing through dozens of layers, and the model would fail to learn. The residual connection provides a "shortcut" that lets the gradient flow directly through the network.
The feed-forward network is easy to overlook, but it is an important part of the block. Attention allows tokens to exchange information with one another. The feed-forward network then processes each token’s updated representation independently, helping the model reshape that information into something more useful for the next layer.
When many of these blocks are stacked, the representation becomes progressively richer. Earlier layers often capture more local or surface-level patterns, while deeper layers can represent more abstract relationships.
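The wiring of one block can be sketched as follows. The sublayers here are deliberately fake stand-ins (the "attention" just averages across tokens, and the "feed-forward" is a lone ReLU); what the sketch shows is the pre-norm residual structure: normalize, apply the sublayer, then add the result back to the input.

```python
# Structural sketch of a pre-norm transformer block. The sublayers are
# toy stand-ins; only the residual wiring mirrors the real architecture.
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token's vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_attention(tokens):
    # Stand-in for self-attention: every token receives the average of
    # all tokens (real attention uses learned, input-dependent weights).
    d = len(tokens[0])
    avg = [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]
    return [avg for _ in tokens]

def toy_ffn(x):
    # Stand-in for the feed-forward network, applied per token.
    return [max(0.0, v) for v in x]

def transformer_block(tokens):
    # Sublayer 1: self-attention, with a residual connection.
    attended = toy_attention([layer_norm(t) for t in tokens])
    tokens = [[a + b for a, b in zip(t, u)]
              for t, u in zip(tokens, attended)]
    # Sublayer 2: feed-forward, with a residual connection.
    return [[a + b for a, b in zip(t, toy_ffn(layer_norm(t)))]
            for t in tokens]

x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
y = transformer_block(x)
```

Note that the input is added back after each sublayer; this is the "shortcut" that lets gradients flow directly through deep stacks.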
There is an important limitation in the attention mechanism we have discussed so far. By itself, attention compares tokens to one another, but it does not inherently tell the model where those tokens appear in the sequence. If we only looked at token embeddings, the relationship between “cat” and “sat” would look the same whether the input was “the cat sat” or “sat cat the.”
That is a problem, because word order carries meaning. “The dog bit the man” and “The man bit the dog” use the same words, but mean very different things.
Older sequence models like RNNs picked up order naturally because they processed tokens one at a time. Transformers process tokens in parallel, so they need a separate way to represent position. That is why transformers add positional information to the input.
Before the first transformer block, each token already has a token embedding, which captures something about the token’s meaning. The model then combines that with a position signal so the representation contains both:

- what the token is (the token embedding)
- where the token appears in the sequence (the positional signal)
So instead of feeding only the embedding for “cat” into the model, we feed:
token embedding + positional information
That way, the model can distinguish:

- “cat” appearing as the second word of a sentence
- “cat” appearing as the seventh word
even though the token itself is the same.
The original transformer paper used fixed mathematical functions (sine and cosine waves of different frequencies) for positional encoding. Modern models like GPT and Llama use learned positional embeddings instead, where the position vectors are trained along with the rest of the model. Some newer models use Rotary Positional Encoding (RoPE), which encodes relative positions rather than absolute ones, making it easier to generalize to longer sequences than the model was trained on.
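The fixed sinusoidal scheme from the original paper can be sketched for a small dimension. Each position gets a unique pattern of sine and cosine values at different frequencies, and that pattern is added to the token embedding.

```python
# Sinusoidal positional encoding, sketched for a tiny dimension.
# Real models use hundreds or thousands of dimensions.
import math

def positional_encoding(position, d_model):
    # Even indices get a sine, odd indices a cosine, with frequencies
    # that decrease geometrically across the dimension.
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# The same token embedding at two positions yields different inputs.
embedding = [0.1, 0.2, 0.3, 0.4]
at_pos_0 = [e + p for e, p in zip(embedding, positional_encoding(0, 4))]
at_pos_5 = [e + p for e, p in zip(embedding, positional_encoding(5, 4))]
```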
Positional information is one of the reasons language models can understand order-sensitive patterns like:

- subject-object order (“dog bites man” versus “man bites dog”)
- which adjective modifies which noun
- where one clause ends and the next begins
It also connects directly to context length. Models are trained with some maximum context window, such as a few thousand tokens or much more. Performance beyond the training regime is not guaranteed, because positional behavior becomes harder to maintain as sequence length grows. This is one reason long-context modeling remains an active research and engineering area. RoPE-based methods help here, but they do not magically remove all long-context challenges.
You will commonly see three transformer variants: encoder-only, decoder-only, and encoder-decoder. They are built on the same core ideas, but they are optimized for different kinds of tasks.
Understanding the difference helps you choose the right model for the job.
An encoder-only transformer reads the full input at once. Each token can attend to the tokens on both its left and right, so the model builds a bidirectional understanding of the input. This makes encoder-only models especially good at tasks where the goal is to understand text rather than generate it. BERT is the classic example.
These models are commonly used for tasks like: classification, sentiment analysis, named entity recognition, semantic search, embeddings and similarity.
Examples include BERT, RoBERTa, and DeBERTa. BERT-style models are strong when you want a rich representation of the entire input.
A decoder-only transformer generates text one token at a time. It uses causal attention, which means each token can only attend to earlier tokens, not future ones. That fits text generation naturally: when predicting the next token, the model should only look at what has already been written. OpenAI’s original GPT paper explicitly describes GPT as a decoder-only transformer with masked self-attention.
This is the architecture family most people associate with modern LLMs. GPT models are decoder-only, and many widely used chat and completion models are built around the same autoregressive idea.
Decoder-only models are especially good at: text generation, conversation, code generation and instruction following.
An encoder-decoder transformer combines both pieces: an encoder first reads the full input bidirectionally, and a decoder then generates the output one token at a time while attending to the encoder’s representation.
This setup is ideal for sequence-to-sequence tasks, where you transform one piece of text into another. T5 is a well-known example, and Google describes it as a text-to-text framework that can be applied to translation, summarization, question answering, and classification.
Typical use cases include: translation, summarization, paraphrasing, and structured text transformation
Examples include T5, BART, and mBART. The original Transformer architecture was also introduced in an encoder-decoder form.
Here is a comparison table to help you remember:

| Variant | Attention pattern | Best at | Examples |
| --- | --- | --- | --- |
| Encoder-only | Bidirectional | Understanding text (classification, embeddings) | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right) | Generating text (chat, code, completion) | GPT family |
| Encoder-decoder | Bidirectional encoder + causal decoder | Transforming text (translation, summarization) | T5, BART, mBART |
For the rest of this course, we will almost exclusively work with decoder-only models, since they power the LLM APIs you use every day. But knowing the encoder and encoder-decoder architectures matters when you work with embeddings or fine-tune models for specific tasks.
When you see model specifications like “128K context window” or “200K context window,” that limit comes directly from how the attention mechanism works.
In self-attention, every token compares itself with every other token in the input. If the input contains n tokens, the model computes roughly n × n attention scores.
That means the amount of computation grows quadratically as the sequence gets longer.
In complexity terms, this is written as O(n²). It is one of the main practical constraints of the standard transformer architecture.
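A quick back-of-the-envelope calculation makes the quadratic growth concrete: doubling the sequence length quadruples the number of attention scores per layer per head.

```python
# Illustration of O(n^2) attention cost: the number of pairwise
# attention scores grows with the square of the sequence length.
def attention_scores(n_tokens):
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_scores(n):>18,} scores")
```

Going from 1K to 100K tokens multiplies the input by 100 but the score count by 10,000.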
This quadratic scaling affects three things that matter directly when you build applications with LLMs.
The first is cost. Longer inputs require more computation. API pricing usually scales with the number of tokens you send and receive, so a 100K-token prompt costs far more than a 1K-token prompt.
The attention computation also becomes heavier internally as the sequence grows, which is one reason long-context models are expensive to run.
The second is latency. Longer inputs take more time to process. Before a model can generate the first output token, it must compute attention across the entire prompt.
As a result, time-to-first-token increases significantly with longer inputs.
The third is memory. Attention requires storing intermediate data structures that scale with sequence length. In practice, this means the GPU must hold large attention-related tensors during computation.
For very long sequences, the memory requirements can exceed what is available even on high-end GPUs.
Because this quadratic cost is such a major limitation, researchers and engineers have developed several techniques to make long contexts more practical.
One approach is sparse attention: instead of allowing every token to attend to every other token, the model restricts attention to patterns such as local windows plus a few global tokens. This can reduce the computation from O(n²) to roughly O(n).
FlashAttention does not change the attention algorithm itself. Instead, it reorganizes the computation to be far more memory-efficient on GPUs, allowing larger sequences to be processed on the same hardware.
A related idea is sliding-window attention, where each token attends only to a fixed window of nearby tokens. Even though individual layers only see part of the sequence, stacking many layers allows information to propagate across longer contexts.
The attention computation can be split across multiple GPUs so that sequences longer than a single device’s memory limit can still be processed.
If your application processes long documents, you usually cannot send everything to the model at once. Instead, you will rely on techniques such as chunking documents, summarization pipelines, and retrieval-augmented generation (RAG).
These strategies let you work within context window limits while still giving the model access to the information it needs.
Let’s walk through what actually happens inside a large language model when you send a prompt like: “What is the capital of France?”
This example ties together everything we have discussed so far: tokenization, embeddings, attention, transformer layers, and token generation.
The first step is converting raw text into tokens.
The model cannot process plain text directly. Instead, a tokenizer splits the input into smaller units called tokens and maps each token to an integer ID from the model’s vocabulary.
For example, the prompt “What is the capital of France?” might be split into tokens roughly like:

“What”, “ is”, “ the”, “ capital”, “ of”, “ France”, “?”

Each of these tokens is then looked up in the vocabulary and replaced by its integer ID.
The exact tokens depend on the tokenizer used by the model.
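A toy word-level tokenizer makes the idea concrete. Real tokenizers use learned subword vocabularies (byte-pair encoding and similar schemes); the vocabulary and IDs below are entirely invented for illustration.

```python
# A toy word-level tokenizer. Real tokenizers operate on subwords and
# have vocabularies of tens of thousands of entries; these IDs are
# made up.
vocab = {"What": 0, "is": 1, "the": 2, "capital": 3,
         "of": 4, "France": 5, "?": 6}

def tokenize(text):
    # Split the question mark off as its own token, then split on
    # whitespace and look each piece up in the vocabulary.
    words = text.replace("?", " ?").split()
    return [vocab[w] for w in words]

ids = tokenize("What is the capital of France?")
```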
Each token ID is converted into a token embedding, which is a high-dimensional vector representing the token in a continuous space.
For large models, these vectors often have thousands of dimensions.
At this stage, the model also incorporates positional information, which allows it to understand the order of tokens in the sequence. Without positional signals, the model would not be able to distinguish between:

- “The dog bit the man”
- “The man bit the dog”
After this step, each token is represented by a vector that encodes both meaning and position.
The sequence of vectors is then processed by many stacked transformer blocks.
Each block contains:

- multi-head self-attention
- a feed-forward network
- residual connections
- layer normalization
During the attention step, every token can gather information from other tokens in the sequence. This allows the model to build contextual representations.
As the sequence moves through many layers, the representations become progressively richer.
You can loosely think of the layers as building understanding in stages:

- early layers pick up surface patterns, such as local word relationships
- middle layers capture syntax and how words relate across the sentence
- later layers represent more abstract meaning relevant to the task
This division is only approximate. In practice, the model distributes information across layers in complex ways.
After the final transformer layer, the model produces a vector for each token position.
For text generation, the model looks at the last token position and projects that vector into a space whose size equals the vocabulary size.
If the vocabulary contains 50,000 tokens, the model produces a vector of 50,000 scores.
Each score represents how likely that token is to appear next.
These scores are passed through a softmax function, which converts them into probabilities that sum to 1. For the prompt above, a token like “Paris” would receive a high probability, while most tokens receive extremely small probabilities. Only a handful are plausible next tokens.
The model then chooses the next token using a sampling strategy.
Common methods include:

- greedy decoding, which always picks the highest-probability token
- temperature sampling, which rescales the distribution before sampling
- top-k and top-p (nucleus) sampling, which restrict sampling to the most likely tokens
With low temperature and greedy decoding, the model will almost always output “Paris.”
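Greedy decoding and temperature sampling can be sketched over a toy next-token distribution. The vocabulary and raw scores below are invented; a real model produces one score per vocabulary entry.

```python
# Greedy decoding versus temperature sampling over a toy distribution.
# The vocabulary and logits are made up for illustration.
import math
import random

vocab = ["Paris", "Lyon", "London", "the"]
logits = [8.0, 2.0, 1.5, 0.5]  # raw model scores (illustrative)

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by the temperature before softmax sharpens (T < 1) or
    # flattens (T > 1) the resulting distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Always pick the highest-scoring token: deterministic.
    return vocab[logits.index(max(logits))]

def sample(logits, temperature):
    # Draw a token at random, weighted by the softmax probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(vocab, weights=probs)[0]

print(greedy(logits))        # deterministic choice
print(sample(logits, 0.7))   # usually the top token, occasionally not
```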
Once the model selects the next token, that token is appended to the input sequence.
So the prompt becomes:

“What is the capital of France? Paris”
The model then repeats the entire process to generate the next token, such as a period.
This loop continues until:

- the model produces a special end-of-sequence token
- the output reaches a maximum token limit
- a stop sequence specified by the caller appears
Unlike the prompt processing step, output tokens must be generated one at a time. Each new token requires another full forward pass through the transformer layers.
This is why longer responses take longer to produce.
It is also why streaming responses exist. Instead of waiting for the entire output to be generated, the system can send each token to the user as soon as it is produced.
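The whole autoregressive loop can be sketched as a skeleton. Here `next_token` is a hypothetical stub standing in for the full forward pass through every transformer layer; its canned answers exist only so the loop runs end to end.

```python
# Skeleton of the autoregressive generation loop. next_token() is a
# hypothetical stub; a real model runs all transformer layers over the
# whole sequence and samples from the softmax output.
def next_token(tokens):
    canned = {"What is the capital of France?": "Paris", "Paris": "."}
    return canned.get(tokens[-1], "<eos>")

def generate(prompt, max_tokens=10, stop="<eos>"):
    tokens = [prompt]
    for _ in range(max_tokens):       # stop condition: length cap
        tok = next_token(tokens)
        if tok == stop:               # stop condition: end token
            break
        tokens.append(tok)            # feed the new token back in
    return tokens

print(generate("What is the capital of France?"))
```

Each iteration appends one token and reruns the model on the extended sequence, which is exactly why output length drives latency and why streaming can emit tokens as they are produced.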