Last Updated: March 14, 2026
Before a language model can understand or generate text, it must first convert that text into a form it can process. Neural networks do not work directly with characters or words. They operate on numbers. Tokenization is the step that bridges this gap.
Tokenization is the process of breaking raw text into smaller units called tokens and mapping those tokens to numeric IDs that the model can understand. A token might be a word, part of a word, a punctuation mark, or even a single character, depending on the tokenizer.
For example, the sentence: "LLMs are transforming software development" might be broken into tokens like:
["LL", "Ms", " are", " transforming", " software", " development", "."]
Each token is then converted into a number and fed into the model.
In this chapter, we will explore how tokenization works, why modern models use subword tokens instead of full words, and how tokenization affects context length, cost, and model performance.
The first question worth asking is: why do LLMs need tokens at all? Why not just work with words directly?
There are three fundamental problems with using words as the basic unit.
First, vocabulary size. English has roughly 170,000 words in current use. Add technical jargon, proper nouns, misspellings, slang, and other languages, and you are looking at millions of unique strings. Every word needs its own entry in the model's vocabulary, and every entry comes with parameters the model has to learn. A vocabulary of millions would make training impossibly expensive.
Second, unknown words. No matter how large your vocabulary is, users will always type something the model has never seen: a new product name, a typo, a word from a language that was underrepresented in training data. A word-level model has no way to handle these. It can only shrug and output an "unknown" placeholder.
Third, related forms. Words like "run", "running", "runner", and "runs" are clearly related, but a word-level model treats them as completely independent entries. It has to learn the meaning of each one separately, wasting capacity on patterns that are obvious to humans.
Tokens solve all three problems. Instead of splitting text at word boundaries, tokenization algorithms find subword units, chunks that are smaller than words but larger than individual characters. Common words like "the" stay as single tokens. Rare words get split into pieces. The word "unhappiness" might become ["un", "happiness"], or ["un", "happi", "ness"], depending on the algorithm.
This gives you the best of both worlds: a manageable vocabulary size (typically 32,000 to 100,000 entries) that can still represent any possible input, including words it has never seen before.
Character-level tokenization can handle any input, but it produces very long sequences (each character is a token), which makes the model slow and makes it harder to learn word-level meaning. Word-level tokenization is compact but brittle. Subword tokenization hits the sweet spot.
Byte Pair Encoding is the most widely used tokenization algorithm in modern LLMs. GPT-2, GPT-3, GPT-4, LLaMA, and Mistral all use variants of BPE. The algorithm is surprisingly simple, and understanding it gives you strong intuition for how tokenization behaves in practice.
BPE starts with the smallest possible vocabulary: individual bytes (or characters). Then it repeatedly finds the most frequent pair of adjacent tokens in the training data and merges them into a single new token. It keeps doing this until the vocabulary reaches a target size.
That is the entire algorithm. Let's walk through it step by step.
Imagine we have a tiny training corpus with just these words (with frequencies): low (5), lower (2), newest (6), widest (3).
We split every word into individual characters and add a special end-of-word marker (let's use _). Our initial vocabulary is every unique character in the corpus.
We look at every pair of neighboring tokens across all words, weighted by frequency:
The pairs (e, s), (s, t), and (t, _) are all tied at 9. We pick one, say (e, s), and merge it into a new token es.
Now we count pairs again with the updated tokens. The pair (es, t) appears 9 times. Merge it into est.
Next, (est, _) appears 9 times. Merge into est_.
Then (l, o) appears 7 times. Merge into lo.
Then (lo, w) appears 7 times. Merge into low.
And so on, until we hit our target vocabulary size.
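The walkthrough above can be reproduced in a few lines of Python. This is a minimal sketch of the BPE training loop, using the toy corpus whose frequencies produce the pair counts above (9 = 6 + 3, 7 = 5 + 2); ties are broken by first-seen order, which happens to reproduce the merges we chose.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with a single merged token."""
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

# Toy corpus matching the walkthrough's pair counts
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
# Split each word into characters plus an end-of-word marker "_"
vocab = {tuple(word) + ("_",): freq for word, freq in corpus.items()}

merges = []
for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # ties broken by first-seen order
    merges.append(best)
    print(f"merge {step + 1}: {best} (count {pairs[best]})")
    vocab = merge_pair(vocab, best)
```

Running this prints the same five merges as the walkthrough: es, est, est_, lo, low. A real tokenizer records this merge list and replays it, in order, to tokenize new text.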
The beauty of BPE is that it is purely data-driven. It does not need any linguistic knowledge. Common words like "the" get merged into single tokens early because their character pairs are extremely frequent. Rare or technical words get split into smaller pieces that the model can still process.
This means that common words cost a single token, rare and novel words cost several, and no input is ever out of vocabulary: in the worst case, a word falls back to individual characters or bytes.
The vocabulary size is a hyperparameter that model creators choose. GPT-2 uses about 50,257 tokens. GPT-4 uses roughly 100,000. LLaMA uses 32,000. Larger vocabularies mean fewer tokens per text (more efficient), but more parameters to train.
WordPiece is another subword tokenization algorithm that works similarly to BPE but with a key difference in how it chooses which pair to merge. BERT, DistilBERT, and other models from Google's ecosystem use WordPiece.
BPE merges the pair that appears most frequently in the corpus. WordPiece merges the pair that maximizes the likelihood of the training data. In simpler terms, WordPiece asks: "Which merge would make the training corpus most probable under our current model?"
The practical difference is subtle. BPE favors pairs that are common in absolute terms. WordPiece favors pairs where the combination appears much more often than you would expect given the individual frequencies of each piece. If token A appears 1,000 times and token B appears 1,000 times, but AB appears 900 times, WordPiece considers that a very strong signal to merge, even if another pair appears 950 times in absolute terms.
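That intuition can be made concrete. One common formulation scores a candidate merge as count(AB) / (count(A) × count(B)). This is a sketch of that scoring rule applied to the numbers above, not the exact implementation of any particular library:

```python
def wordpiece_score(count_ab, count_a, count_b):
    """Likelihood-based merge score: how much more often does the pair
    occur than its parts' individual frequencies would predict?"""
    return count_ab / (count_a * count_b)

# Pair 1: A and B each appear 1,000 times, and AB appears 900 times
score_1 = wordpiece_score(900, 1_000, 1_000)
# Pair 2: C and D each appear 50,000 times, and CD appears 950 times
score_2 = wordpiece_score(950, 50_000, 50_000)

print(f"score(AB) = {score_1:.2e}")  # high: A and B almost always co-occur
print(f"score(CD) = {score_2:.2e}")  # low, despite the higher absolute count
```

Raw BPE would merge CD first (950 > 900); WordPiece merges AB, because 900 co-occurrences out of 1,000 is a much stronger signal than 950 out of 50,000.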
WordPiece uses a special prefix ## to mark tokens that are continuations of a previous token (i.e., not the start of a word). For example, a word like "extra" might be split into ["extr", "##a"], depending on the vocabulary.
The ## prefix tells the model "this piece is attached to the previous token, not a standalone word." This helps the model distinguish between the word "a" (the article) and the "a" at the end of "extra".
For most practical work as an AI engineer, the difference between BPE and WordPiece rarely matters. What matters is knowing that different models use different tokenizers, so the same text produces different token counts and different token boundaries. This directly affects cost and behavior.
Enough theory. Let's get our hands dirty with actual tokenization libraries.
OpenAI's tiktoken library is the go-to tool for working with GPT tokenizers. It is fast (written in Rust under the hood) and straightforward to use.
Notice a few things after running this. "Tokenization" gets split into "Token" and "ization", which makes sense since "Token" is a common subword. Spaces are attached to the beginning of words, not treated as separate tokens. The period is its own token. "AI" stays as a single token because it is extremely common in the training data.
Different models use different tokenizers with different vocabularies. This means the same text produces different token counts. Let's see this in action.
For simple English text, the differences are usually small. But try it with code, JSON, or non-English text, and the gaps become significant. Let's see that.
You will see that GPT-2's tokenizer produces significantly more tokens for non-English text and special characters. GPT-4o's tokenizer has been trained on a more diverse dataset, so it handles these cases more efficiently. Fewer tokens means lower cost and more room within the context window.
For non-OpenAI models, the transformers library from Hugging Face gives you access to their tokenizers.
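A sketch using Hugging Face's AutoTokenizer (pip install transformers; the first call downloads the tokenizer files). bert-base-uncased is a public checkpoint; most LLaMA checkpoints are gated, so substitute any SentencePiece-based model you have access to if you want to see the ▁ convention yourself.

```python
from transformers import AutoTokenizer

# BERT's WordPiece tokenizer marks continuation pieces with "##"
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert.tokenize("unbelievable tokenization"))
```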
Notice how LLaMA uses ▁ (a special Unicode character) to mark the beginning of words, while BERT uses ## to mark continuations. Both are solving the same problem, "where do word boundaries fall?", but with different conventions.
Tokenization gets interesting (and frustrating) when you move beyond English. This is not an edge case. If your application serves a global audience, tokenization quirks directly affect cost and user experience.
BPE tokenizers are trained on large corpora. These corpora are heavily weighted toward English. As a result, the tokenizer learns efficient representations for English words but not for words in other languages.
A single English word like "hello" is typically one token. But the equivalent word in Hindi, "नमस्ते" (namaste), might be three or four tokens. The Chinese word for "hello", "你好", might be two tokens with a modern tokenizer but could be six or more with an older one.
This has real consequences: users writing in non-English languages pay more for the same content, their text consumes more of the context window, and cost estimates calibrated on English will be too low everywhere else.
The same greeting in three languages produces vastly different token counts. The exact numbers depend on the tokenizer, but the pattern is consistent: English is almost always the most efficient.
It is worth noting that this gap has been shrinking. GPT-4o's tokenizer was specifically trained on more multilingual data, and it handles non-Latin scripts significantly better than GPT-2 or even GPT-3.5. LLaMA 3's tokenizer also expanded its vocabulary to better cover non-English languages.
But the gap has not closed entirely. If you are building a multilingual application, you should always test token counts across your target languages and factor the overhead into your cost estimates.
Understanding tokenization is not just academic. It directly affects how much money you spend. Every LLM API charges by the token, and input tokens and output tokens often have different prices.
LLM providers publish prices in "dollars per million tokens." To estimate the cost of a single request:
Let's make this concrete with a Python function.
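A sketch of such a function. The prices below are placeholders, not current rates; check your provider's pricing page and plug in real numbers.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m):
    """Estimate request cost in dollars from token counts and
    per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Hypothetical prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens
cost = estimate_cost(input_tokens=1_500, output_tokens=500,
                     input_price_per_m=2.50, output_price_per_m=10.00)
print(f"${cost:.6f} per request")
print(f"${cost * 100_000 * 30:,.2f} per month at 100K requests/day")
```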
A single request is cheap. The cost adds up when you scale. If your application handles 100,000 requests per day, even small inefficiencies in prompt design matter. Shaving 100 tokens off your system prompt across 100K daily requests saves you real money over a month.
Every model has a context window, the maximum number of tokens it can process in a single request (input + output combined). Limits vary widely by model and generation, so check your provider's documentation for the current figure.
The context window must fit your entire input (system prompt + conversation history + user message) plus the model's output. If you set max_tokens to 4,000 and your input is 125,000 tokens, you need a context window of at least 129,000 tokens.
This is why tokenization matters for architecture decisions. If you are building a RAG system that stuffs retrieved documents into the prompt, you need to know exactly how many tokens those documents consume. Estimating by word count is unreliable because the words-to-tokens ratio varies by content type.
This kind of utility function is something you will use constantly in production AI applications. It is much better to check before sending a request than to get a truncation error back from the API.