What are Large Language Models (LLMs)?

Last Updated: March 8, 2026

Ashish Pratap Singh

Large Language Models (LLMs) are the engines behind modern AI systems like chatbots, coding assistants, and AI-powered search. They are trained on massive amounts of text data and learn patterns in language, allowing them to generate human-like responses, answer questions, write code, summarize documents, and much more.

At their core, LLMs are neural networks that predict the next word in a sequence. By repeatedly making these predictions, they can produce coherent paragraphs, follow instructions, and even reason through complex problems.

In this chapter, we will explore what large language models are, how they work, and why they have become such a powerful tool for building modern AI applications.

What is a Large Language Model?

A Large Language Model is a neural network trained to predict the next word (or more precisely, the next token) in a sequence.

That is it. At its core, an LLM is a next-token prediction machine.

You give it: "The cat sat on the"

It predicts: "mat" (or "roof", "floor", "couch", etc.)

This is the generation loop. The model tokenizes input, produces a probability distribution over its vocabulary, selects a token, appends it to the input, and repeats.
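The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: `predict_next` is a stand-in for a trained network's forward pass, and the hard-coded distribution is invented for the "The cat sat on the" example.

```python
import random

# Made-up stand-in for a trained model's output: a probability
# distribution over possible next tokens for one known context.
NEXT_TOKEN_PROBS = {
    ("The", "cat", "sat", "on", "the"): {"mat": 0.6, "roof": 0.2, "floor": 0.1, "couch": 0.1},
}

def predict_next(tokens):
    """Stand-in for a model forward pass: return a distribution
    over candidate next tokens given the context so far."""
    return NEXT_TOKEN_PROBS.get(tuple(tokens), {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        # Sample one token from the distribution...
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)  # ...append it, and repeat.
    return tokens

out = generate(["The", "cat", "sat", "on", "the"])
```

Everything interesting in a real LLM lives inside `predict_next`; the surrounding loop is genuinely this simple.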

On the surface, this objective sounds simple. But when you train a model like this on trillions of tokens from books, websites, code, and conversations, something remarkable happens. The model starts to learn much more than just word sequences. It picks up patterns in language, facts about the world, problem-solving behavior, and the ability to follow instructions.

That is what makes LLMs so fascinating. A seemingly simple task, predicting the next token, when scaled with enough data, compute, and parameters, leads to surprisingly powerful behavior.

Why "Large"?

The word large in Large Language Model does not refer to just one thing. It usually reflects three dimensions that scale together: model size, training data, and compute.

A model with a huge parameter count but too little data will underperform. A smaller model trained on enormous amounts of data will eventually hit a ceiling. And none of this is possible without massive compute infrastructure. In practice, modern LLMs become powerful because all three dimensions are pushed upward together.

1. Model Size (Parameters)

Parameters are the learned numerical weights inside the neural network. They store patterns the model picks up during training, including patterns about language, code, facts, and behavior.

In general, more parameters give a model more capacity to represent complex relationships, though parameter count alone does not determine quality. Data quality, training recipe, architecture, and post-training matter a lot too.

| Model | Year | Parameters |
| --- | --- | --- |
| GPT-2 | 2019 | 1.5 billion |
| GPT-3 | 2020 | 175 billion |
| Llama 3.1 | 2024 | 8B to 405B |
| DeepSeek-V3 | 2024 | 671B (37B active, MoE) |

DeepSeek-V3 is a good example of how the field has evolved. Because it uses a Mixture-of-Experts (MoE) design, only a subset of the model is active for each token. That lets researchers scale total parameter count without increasing inference cost in the same way a dense model would.
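The core MoE idea can be sketched with a toy router. Everything here is invented for illustration (the gate, the experts, the numbers); it shows top-k routing in general, not DeepSeek-V3's actual router design.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Score every expert with a gate, but run only the top-k.
    The outputs are combined weighted by renormalized gate scores."""
    scores = softmax([sum(w * xi for w, xi in zip(row, x)) for row in gate_weights])
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Only k experts are evaluated for this token; the rest stay idle.
    return sum(scores[i] / total * experts[i](x) for i in top), top

# Four toy "experts": each is just a scalar function of the input sum.
experts = [lambda x, c=c: c * sum(x) for c in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]

y, active = moe_forward([1.0, 1.0], experts, gate_weights, k=2)
```

The point is the ratio: four experts' worth of parameters exist, but each token pays the compute cost of only two.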

2. Training Data

LLMs are trained on vast amounts of text collected from sources such as web pages, books, code, reference material, and other public or licensed datasets. The training objective is simple, but the scale is enormous.

Some reference points:

  • GPT-3 was trained on roughly 300 billion tokens
  • Llama 3 was pretrained on more than 15 trillion tokens
  • Llama 3.1 405B was also trained on over 15 trillion tokens
  • DeepSeek-V3 was trained on 14.8 trillion tokens

A token is not exactly the same as a word, but as a rough rule of thumb, one token is often around three-quarters of a word in English text. That means trillions of tokens correspond to an almost unimaginable amount of reading material.
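A quick back-of-the-envelope calculation makes the scale concrete, using the rough three-quarters-of-a-word rule above. All figures here are approximations, including the 100,000-word novel used for comparison.

```python
# ~0.75 words per token is the rough English-text rule of thumb.
WORDS_PER_TOKEN = 0.75

tokens_llama3 = 15e12                     # Llama 3: over 15 trillion tokens
approx_words = tokens_llama3 * WORDS_PER_TOKEN

# A typical novel is on the order of 100,000 words.
novels_equivalent = approx_words / 100_000
```

That works out to roughly 11 trillion words, or on the order of a hundred million novels.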

3. Compute

Training frontier LLMs requires extraordinary compute. This usually means large clusters of high-end GPUs running for weeks or months, along with sophisticated networking, storage, checkpointing, and fault-tolerance systems.

For example, Meta says Llama 3.1 405B was trained using 16,384 H100 GPUs. DeepSeek says DeepSeek-V3 required 2.664 million H800 GPU hours for pretraining. These numbers give you a sense of the scale involved, even before you account for research overhead, failed runs, data processing, and post-training.

That is why only a small number of organizations can train frontier models from scratch. It is also why open-weight releases matter so much. When a company releases model weights publicly, it is effectively sharing the result of a training effort that required enormous data, engineering, and compute investment.

How Do LLMs Work?

Let's trace what actually happens when you send a prompt to an LLM. There are four key steps, and each one has its own dedicated lesson later in the course. Here, we will build just enough intuition to see the full picture.

Step 1: Tokenization

Before an LLM can process text, it needs to convert words into numbers. This is called tokenization.

But LLMs do not use whole words. Instead, they use subword tokens, pieces of words that are small enough to be reusable but large enough to be efficient.

For example, the word "understanding" might be split into subword tokens such as "under" and "standing" (the exact split depends on the tokenizer's vocabulary).

Why subwords instead of words?

  • Handles rare words and typos better
  • Keeps vocabulary size manageable (50,000-100,000 tokens)
  • Works across languages

The most common tokenization algorithm is Byte Pair Encoding (BPE), which iteratively merges frequent character pairs to build a vocabulary.
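The heart of BPE training can be sketched in a few lines: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The tiny corpus below is invented for illustration; real tokenizers train on billions of words and perform tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
merges = []
for _ in range(2):  # two merge steps, for illustration
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
```

After two merges, "l"+"o" and then "lo"+"w" have been fused, so the frequent word "low" is now a single vocabulary symbol while rarer words remain split into pieces.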

Step 2: Embeddings

Once tokenized, each token gets converted into an embedding, a dense vector of numbers (typically 768 to 12,288 dimensions) that represents its meaning in a high-dimensional space.

Similar words end up close together in this space. "King" and "queen" are neighbors. "King" and "banana" are far apart.

But here is the key insight: these embeddings are not static. They get updated as the model processes more context. The word "bank" has a different representation in "river bank" versus "bank account."
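"Close together" has a precise meaning: cosine similarity between the vectors. The 3-dimensional vectors below are invented for the example (real embeddings have hundreds or thousands of learned dimensions), but the computation is the standard one.

```python
import math

# Toy, hand-made embeddings; real ones are learned during training.
EMBED = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

king_queen = cosine(EMBED["king"], EMBED["queen"])
king_banana = cosine(EMBED["king"], EMBED["banana"])
```

With these toy vectors, "king" and "queen" score close to 1.0 while "king" and "banana" score much lower, which is exactly the neighborhood structure described above.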

Step 3: The Transformer Architecture

The core of every modern LLM is the Transformer architecture, introduced in the 2017 paper "Attention is All You Need."

Before Transformers, language models used RNNs (Recurrent Neural Networks) that processed text sequentially, one word at a time. By the time the model reached word 100, information from word 1 had degraded through 99 intermediate steps. And because each step depended on the previous one, you could not parallelize the computation.

Transformers solved both problems with a mechanism called self-attention. Instead of processing words one at a time, every token gets to look at every other token directly, all at once. The model learns which tokens are relevant to each other. When processing the sentence "The animal did not cross the street because it was too tired," self-attention lets the model figure out that "it" refers to "the animal," not "the street."

Each transformer layer refines the model's understanding. Early layers tend to capture basic syntax. Middle layers capture semantic relationships. Later layers handle complex reasoning and task-specific patterns. GPT-3 has 96 of these layers stacked on top of each other, with 96 attention heads per layer, meaning over 9,000 attention operations run every time your prompt is processed.
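The self-attention computation described above can be sketched directly: each position scores every other position, turns the scores into weights, and takes a weighted sum of values. This is a bare-bones sketch with toy 2-d vectors used as queries, keys, and values; a real layer first produces Q, K, and V through learned projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention: scores = q.k / sqrt(d),
    weights = softmax(scores), output = weighted sum of values."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[t] for w, vj in zip(weights, v)) for t in range(len(v[0]))])
    return out

# Three toy tokens; tokens 0 and 2 point the same way, so each
# attends strongly to the other and their outputs match.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = self_attention(x, x, x)
```

Note that nothing in the loop depends on sequence order or on previous steps, which is why attention over all positions can run in parallel, unlike an RNN.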

Step 4: Generating Output

Once the input passes through all transformer layers, the model produces a probability distribution over its entire vocabulary for the next token. For a vocabulary of 100,000 tokens, that means 100,000 probabilities, one for each possible next token.

The model selects a token using a sampling strategy, appends it to the input, and repeats the entire process. This is called autoregressive generation: generate one token, add it to the context, generate the next, repeat.

This is why LLM responses appear word by word when streaming is enabled: the model is literally producing one token at a time. It is also why longer responses cost more; each new token requires another full forward pass through the model.
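The final conversion from raw scores to a sampled token can be sketched as follows. The tiny vocabulary and logit values are invented for illustration; a real model emits one logit per entry in a vocabulary of tens of thousands of tokens.

```python
import math
import random

# Made-up vocabulary and raw model scores (logits) for one step.
VOCAB = ["mat", "roof", "floor", "couch"]
LOGITS = [2.0, 1.0, 0.5, 0.2]

def sample_token(logits, vocab, temperature=1.0, seed=0):
    """Softmax the logits into probabilities, then sample one token.
    Higher temperature flattens the distribution (more randomness);
    lower temperature sharpens it (more deterministic)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    return rng.choices(vocab, weights=probs, k=1)[0], probs

token, probs = sample_token(LOGITS, VOCAB)
```

The `temperature` knob here is the same parameter most LLM APIs expose: it rescales the logits before the softmax, trading predictability for variety.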

Training an LLM

Building an LLM is not a single training run. It is a pipeline with distinct phases, each serving a different purpose.

Phase 1: Pretraining

This is where the model learns language. It processes trillions of tokens and learns to predict the next token. The process is simple in concept: take a sequence like "The cat sat on the", predict the next token, compare against the actual next token, adjust parameters, repeat billions of times.
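The "compare against the actual next token" step has a standard form: cross-entropy, the negative log probability the model assigned to the token that actually came next. The toy distribution below is invented; a real model produces one like it for every position of every training sequence.

```python
import math

def next_token_loss(probs, actual):
    """Cross-entropy for one prediction: low when the model gave
    the actual next token high probability, high otherwise."""
    return -math.log(probs[actual])

# Model's (made-up) prediction for: "The cat sat on the ___"
probs = {"mat": 0.4, "roof": 0.3, "floor": 0.2, "dog": 0.1}

loss_good = next_token_loss(probs, "mat")  # actual token was likely
loss_bad = next_token_loss(probs, "dog")   # actual token was unlikely
# Training adjusts parameters to push this loss down, billions of times.
```
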

After pretraining, you have a base model. It can generate fluent, coherent text, but it is not helpful. It just continues text in the style of the internet. Ask it "What is 2+2?" and it might respond with another quiz question, because that is what web pages with quiz questions look like.

Phase 2: Supervised Fine-Tuning (SFT)

To teach the model the instruction-following format, humans create thousands of example conversations: "User asks X, assistant responds with Y." The model trains on these examples and learns to respond like a helpful assistant instead of autocompleting web text.
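The "User asks X, assistant responds with Y" examples are typically stored in a chat-style structure and flattened into training text. The schema and role markers below are one common convention, shown for illustration; the exact format varies between labs.

```python
# Illustrative shape of one SFT training example (schema varies by lab).
sft_example = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}

def to_training_text(example):
    """Flatten a conversation into a single training string with
    role markers; one simple formatting scheme among many."""
    parts = [f"<|{m['role']}|> {m['content']}" for m in example["messages"]]
    return "\n".join(parts)

text = to_training_text(sft_example)
```

After training on many such strings, the model learns that text following an assistant marker should answer the question rather than continue the quiz.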

Phase 3: Alignment (RLHF / DPO)

The final phase teaches the model human preferences. Generate multiple responses to the same prompt, have humans rank which is better, then train the model to produce responses that score higher. This is how models learn to be helpful, harmless, and honest rather than just predicting statistically likely text.
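The human rankings described above are usually collected as preference pairs: two candidate responses to the same prompt, one marked as better. The structure below is a made-up example of that shape, not any lab's actual format; methods like RLHF and DPO both start from data like this.

```python
# Illustrative shape of one alignment preference example.
preference_example = {
    "prompt": "Explain recursion to a beginner.",
    "chosen": "Recursion is when a function solves a problem by calling "
              "itself on a smaller piece of the same problem...",
    "rejected": "Recursion. Next question.",
}

def dpo_triple(example):
    """DPO-style training consumes (prompt, chosen, rejected) triples
    and nudges the model toward the chosen response."""
    return (example["prompt"], example["chosen"], example["rejected"])

prompt, chosen, rejected = dpo_triple(preference_example)
```

RLHF takes a longer route (train a reward model on these rankings, then optimize against it with reinforcement learning), while DPO trains on the pairs directly, but the raw ingredient is the same.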

The difference between a base model and an aligned model is dramatic. Base models are powerful but unpredictable. Aligned models are the polished assistants you interact with through APIs.

Further Reading