What are Large Language Models (LLMs)?

Last Updated: March 8, 2026

Ashish Pratap Singh

Large Language Models (LLMs) are the engines behind modern AI systems like chatbots, coding assistants, and AI-powered search. They are trained on massive amounts of text data and learn patterns in language, allowing them to generate human-like responses, answer questions, write code, summarize documents, and much more.

At their core, LLMs are neural networks that predict the next word in a sequence. By repeatedly making these predictions, they can produce coherent paragraphs, follow instructions, and even reason through complex problems.

In this chapter, we will explore what large language models are, how they work, and why they have become such a powerful tool for building modern AI applications.

What is a Large Language Model?

A Large Language Model is a neural network trained to predict the next word (or more precisely, the next token) in a sequence.

That is it. At its core, an LLM is a next-token prediction machine.

You give it: "The cat sat on the"

It predicts: "mat" (or "roof", "floor", "couch", etc.)

This is the generation loop. The model tokenizes input, produces a probability distribution over its vocabulary, selects a token, appends it to the input, and repeats.
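The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: `predict_next` is a stand-in for a trained network's forward pass, and the hard-coded distribution is invented for the "The cat sat on the" example.

```python
import random

# Made-up stand-in for a trained model's output: a probability
# distribution over possible next tokens for one known context.
NEXT_TOKEN_PROBS = {
    ("The", "cat", "sat", "on", "the"): {"mat": 0.6, "roof": 0.2, "floor": 0.1, "couch": 0.1},
}

def predict_next(tokens):
    """Stand-in for a model forward pass: return a distribution
    over candidate next tokens given the context so far."""
    return NEXT_TOKEN_PROBS.get(tuple(tokens), {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        # Sample one token from the distribution...
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)  # ...append it, and repeat.
    return tokens

out = generate(["The", "cat", "sat", "on", "the"])
```

Everything interesting in a real LLM lives inside `predict_next`; the surrounding loop is genuinely this simple.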

On the surface, this objective sounds simple. But when you train a model like this on trillions of tokens from books, websites, code, and conversations, something remarkable happens. The model starts to learn much more than just word sequences. It picks up patterns in language, facts about the world, problem-solving behavior, and the ability to follow instructions.

That is what makes LLMs so fascinating. A seemingly simple task, predicting the next token, when scaled with enough data, compute, and parameters, leads to surprisingly powerful behavior.

Why "Large"?

The word large in Large Language Model does not refer to just one thing. It usually reflects three dimensions that scale together: model size, training data, and compute.

A model with a huge parameter count but too little data will underperform. A smaller model trained on enormous amounts of data will eventually hit a ceiling. And none of this is possible without massive compute infrastructure. In practice, modern LLMs become powerful because all three dimensions are pushed upward together.

1. Model Size (Parameters)

Parameters are the learned numerical weights inside the neural network. They store patterns the model picks up during training, including patterns about language, code, facts, and behavior.

In general, more parameters give a model more capacity to represent complex relationships, though parameter count alone does not determine quality. Data quality, training recipe, architecture, and post-training matter a lot too.

| Model | Year | Parameters |
| --- | --- | --- |
| GPT-2 | 2019 | 1.5 billion |
| GPT-3 | 2020 | 175 billion |
| Llama 3.1 | 2024 | 8B to 405B |
| DeepSeek-V3 | 2024 | 671B (37B active, MoE) |

DeepSeek-V3 is a good example of how the field has evolved. Because it uses a Mixture-of-Experts (MoE) design, only a subset of the model is active for each token. That lets researchers scale total parameter count without increasing inference cost in the same way a dense model would.
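The core MoE idea can be sketched with a toy router. Everything here is invented for illustration (the gate, the experts, the numbers); it shows top-k routing in general, not DeepSeek-V3's actual router design.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Score every expert with a gate, but run only the top-k.
    The outputs are combined weighted by renormalized gate scores."""
    scores = softmax([sum(w * xi for w, xi in zip(row, x)) for row in gate_weights])
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Only k experts are evaluated for this token; the rest stay idle.
    return sum(scores[i] / total * experts[i](x) for i in top), top

# Four toy "experts": each is just a scalar function of the input sum.
experts = [lambda x, c=c: c * sum(x) for c in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]

y, active = moe_forward([1.0, 1.0], experts, gate_weights, k=2)
```

The point is the ratio: four experts' worth of parameters exist, but each token pays the compute cost of only two.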

2. Training Data

LLMs are trained on vast amounts of text collected from sources such as web pages, books, code, reference material, and other public or licensed datasets. The training objective is simple, but the scale is enormous.

Some reference points:

  • GPT-3 was trained on roughly 300 billion tokens
  • Llama 3 was pretrained on more than 15 trillion tokens
  • Llama 3.1 405B was also trained on over 15 trillion tokens
  • DeepSeek-V3 was trained on 14.8 trillion tokens

A token is not exactly the same as a word, but as a rough rule of thumb, one token is often around three-quarters of a word in English text. That means trillions of tokens correspond to an almost unimaginable amount of reading material.
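A quick back-of-the-envelope calculation makes the scale concrete, using the rough three-quarters-of-a-word rule above. All figures here are approximations, including the 100,000-word novel used for comparison.

```python
# ~0.75 words per token is the rough English-text rule of thumb.
WORDS_PER_TOKEN = 0.75

tokens_llama3 = 15e12                     # Llama 3: over 15 trillion tokens
approx_words = tokens_llama3 * WORDS_PER_TOKEN

# A typical novel is on the order of 100,000 words.
novels_equivalent = approx_words / 100_000
```

That works out to roughly 11 trillion words, or on the order of a hundred million novels.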

3. Compute

Training frontier LLMs requires extraordinary compute. This usually means large clusters of high-end GPUs running for weeks or months, along with sophisticated networking, storage, checkpointing, and fault-tolerance systems.

For example, Meta says Llama 3.1 405B was trained using 16,384 H100 GPUs. DeepSeek says DeepSeek-V3 required 2.664 million H800 GPU hours for pretraining. These numbers give you a sense of the scale involved, even before you account for research overhead, failed runs, data processing, and post-training.

That is why only a small number of organizations can train frontier models from scratch. It is also why open-weight releases matter so much. When a company releases model weights publicly, it is effectively sharing the result of a training effort that required enormous data, engineering, and compute investment.

How Do LLMs Work?

Let's trace what actually happens when you send a prompt to an LLM. There are four key steps, and each one has its own dedicated lesson later in the course. Here, we will build just enough intuition to see the full picture.

Step 1: Tokenization

Before an LLM can process text, it needs to convert words into numbers. This is called tokenization.

But LLMs do not use whole words. Instead, they use subword tokens, pieces of words that are small enough to be reusable but large enough to be efficient.

For example, the word "understanding" might be split into subword tokens such as "under" and "standing" (the exact split depends on the tokenizer's vocabulary).

Why subwords instead of words?

  • Handles rare words and typos better
  • Keeps vocabulary size manageable (50,000-100,000 tokens)
  • Works across languages

The most common tokenization algorithm is Byte Pair Encoding (BPE), which iteratively merges frequent character pairs to build a vocabulary.
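The heart of BPE training can be sketched in a few lines: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The tiny corpus below is invented for illustration; real tokenizers train on billions of words and perform tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
merges = []
for _ in range(2):  # two merge steps, for illustration
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
```

After two merges, "l"+"o" and then "lo"+"w" have been fused, so the frequent word "low" is now a single vocabulary symbol while rarer words remain split into pieces.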

Step 2: Embeddings

Once tokenized, each token gets converted into an embedding, a dense vector of numbers (typically 768 to 12,288 dimensions) that represents its meaning in a high-dimensional space.

Similar words end up close together in this space. "King" and "queen" are neighbors. "King" and "banana" are far apart.

But here is the key insight: these embeddings are not static. They get updated as the model processes more context. The word "bank" has a different representation in "river bank" versus "bank account."
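"Close together" has a precise meaning: cosine similarity between the vectors. The 3-dimensional vectors below are invented for the example (real embeddings have hundreds or thousands of learned dimensions), but the computation is the standard one.

```python
import math

# Toy, hand-made embeddings; real ones are learned during training.
EMBED = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

king_queen = cosine(EMBED["king"], EMBED["queen"])
king_banana = cosine(EMBED["king"], EMBED["banana"])
```

With these toy vectors, "king" and "queen" score close to 1.0 while "king" and "banana" score much lower, which is exactly the neighborhood structure described above.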

Step 3: The Transformer Architecture

The core of every modern LLM is the Transformer architecture, introduced in the 2017 paper "Attention is All You Need."

Before Transformers, language models used RNNs (Recurrent Neural Networks) that processed text sequentially, one word at a time. By the time the model reached word 100, information from word 1 had degraded through 99 intermediate steps. And because each step depended on the previous one, you could not parallelize the computation.

Transformers solved both problems with a mechanism called self-attention. Instead of processing words one at a time, every token gets to look at every other token directly, all at once. The model learns which tokens are relevant to each other. When processing the sentence "The animal did not cross the street because it was too tired," self-attention lets the model figure out that "it" refers to "the animal," not "the street."

Each transformer layer refines the model's understanding. Early layers tend to capture basic syntax. Middle layers capture semantic relationships. Later layers handle complex reasoning and task-specific patterns. GPT-3 has 96 of these layers stacked on top of each other, with 96 attention heads per layer, meaning over 9,000 attention operations run every time your prompt is processed.
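The self-attention computation described above can be sketched directly: each position scores every other position, turns the scores into weights, and takes a weighted sum of values. This is a bare-bones sketch with toy 2-d vectors used as queries, keys, and values; a real layer first produces Q, K, and V through learned projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention: scores = q.k / sqrt(d),
    weights = softmax(scores), output = weighted sum of values."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[t] for w, vj in zip(weights, v)) for t in range(len(v[0]))])
    return out

# Three toy tokens; tokens 0 and 2 point the same way, so each
# attends strongly to the other and their outputs match.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = self_attention(x, x, x)
```

Note that nothing in the loop depends on sequence order or on previous steps, which is why attention over all positions can run in parallel, unlike an RNN.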

Step 4: Generating Output

Once the input passes through all transformer layers, the model produces a probability distribution over its entire vocabulary for the next token. For a vocabulary of 100,000 tokens, that means 100,000 probabilities, one for each possible next token.

The model selects a token using a sampling strategy, appends it to the input, and repeats the entire process. This is called autoregressive generation: generate one token, add it to the context, generate the next, repeat.

This is why LLM responses appear word by word when streaming is enabled: the model is literally producing one token at a time. It is also why longer responses cost more; each new token requires another full forward pass through the model.
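The final conversion from raw scores to a sampled token can be sketched as follows. The tiny vocabulary and logit values are invented for illustration; a real model emits one logit per entry in a vocabulary of tens of thousands of tokens.

```python
import math
import random

# Made-up vocabulary and raw model scores (logits) for one step.
VOCAB = ["mat", "roof", "floor", "couch"]
LOGITS = [2.0, 1.0, 0.5, 0.2]

def sample_token(logits, vocab, temperature=1.0, seed=0):
    """Softmax the logits into probabilities, then sample one token.
    Higher temperature flattens the distribution (more randomness);
    lower temperature sharpens it (more deterministic)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    return rng.choices(vocab, weights=probs, k=1)[0], probs

token, probs = sample_token(LOGITS, VOCAB)
```

The `temperature` knob here is the same parameter most LLM APIs expose: it rescales the logits before the softmax, trading predictability for variety.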

Training an LLM

Building an LLM is not a single training run. It is a pipeline with distinct phases, each serving a different purpose.

Phase 1: Pretraining

This is where the model learns language. It processes trillions of tokens and learns to predict the next token. The process is simple in concept: take a sequence like "The cat sat on the", predict the next token, compare against the actual next token, adjust parameters, repeat billions of times.
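The "compare against the actual next token" step has a standard form: cross-entropy, the negative log probability the model assigned to the token that actually came next. The toy distribution below is invented; a real model produces one like it for every position of every training sequence.

```python
import math

def next_token_loss(probs, actual):
    """Cross-entropy for one prediction: low when the model gave
    the actual next token high probability, high otherwise."""
    return -math.log(probs[actual])

# Model's (made-up) prediction for: "The cat sat on the ___"
probs = {"mat": 0.4, "roof": 0.3, "floor": 0.2, "dog": 0.1}

loss_good = next_token_loss(probs, "mat")  # actual token was likely
loss_bad = next_token_loss(probs, "dog")   # actual token was unlikely
# Training adjusts parameters to push this loss down, billions of times.
```
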

After pretraining, you have a base model. It can generate fluent, coherent text, but it is not helpful. It just continues text in the style of the internet. Ask it "What is 2+2?" and it might respond with another quiz question, because that is what web pages with quiz questions look like.

Phase 2: Supervised Fine-Tuning (SFT)

To teach the model the instruction-following format, humans create thousands of example conversations: "User asks X, assistant responds with Y." The model trains on these examples and learns to respond like a helpful assistant instead of autocompleting web text.
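The "User asks X, assistant responds with Y" examples are typically stored in a chat-style structure and flattened into training text. The schema and role markers below are one common convention, shown for illustration; the exact format varies between labs.

```python
# Illustrative shape of one SFT training example (schema varies by lab).
sft_example = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}

def to_training_text(example):
    """Flatten a conversation into a single training string with
    role markers; one simple formatting scheme among many."""
    parts = [f"<|{m['role']}|> {m['content']}" for m in example["messages"]]
    return "\n".join(parts)

text = to_training_text(sft_example)
```

After training on many such strings, the model learns that text following an assistant marker should answer the question rather than continue the quiz.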

Phase 3: Alignment (RLHF / DPO)

The final phase teaches the model human preferences. Generate multiple responses to the same prompt, have humans rank which is better, then train the model to produce responses that score higher. This is how models learn to be helpful, harmless, and honest rather than just predicting statistically likely text.
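The human rankings described above are usually collected as preference pairs: two candidate responses to the same prompt, one marked as better. The structure below is a made-up example of that shape, not any lab's actual format; methods like RLHF and DPO both start from data like this.

```python
# Illustrative shape of one alignment preference example.
preference_example = {
    "prompt": "Explain recursion to a beginner.",
    "chosen": "Recursion is when a function solves a problem by calling "
              "itself on a smaller piece of the same problem...",
    "rejected": "Recursion. Next question.",
}

def dpo_triple(example):
    """DPO-style training consumes (prompt, chosen, rejected) triples
    and nudges the model toward the chosen response."""
    return (example["prompt"], example["chosen"], example["rejected"])

prompt, chosen, rejected = dpo_triple(preference_example)
```

RLHF takes a longer route (train a reward model on these rankings, then optimize against it with reinforcement learning), while DPO trains on the pairs directly, but the raw ingredient is the same.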

The difference between a base model and an aligned model is dramatic. Base models are powerful but unpredictable. Aligned models are the polished assistants you interact with through APIs.

Further Reading