
Tokenization

Last Updated: March 14, 2026


Ashish Pratap Singh

Before a language model can understand or generate text, it must first convert that text into a form it can process. Neural networks do not work directly with characters or words. They operate on numbers. Tokenization is the step that bridges this gap.

Tokenization is the process of breaking raw text into smaller units called tokens and mapping those tokens to numeric IDs that the model can understand. A token might be a word, part of a word, a punctuation mark, or even a single character, depending on the tokenizer.

For example, the sentence "LLMs are transforming software development." might be broken into tokens like:

["LL", "Ms", " are", " transforming", " software", " development", "."]

Each token is then converted into a number and fed into the model.

In this chapter, we will explore how tokenization works, why modern models use subword tokens instead of full words, and how tokenization affects context length, cost, and model performance.

Why Not Just Use Words?

The first question worth asking is: why do LLMs need tokens at all? Why not just work with words directly?

There are three fundamental problems with using words as the basic unit.

The vocabulary problem

English has roughly 170,000 words in current use. Add technical jargon, proper nouns, misspellings, slang, and other languages, and you are looking at millions of unique strings. Every word needs its own entry in the model's vocabulary, and every entry comes with parameters the model has to learn. A vocabulary of millions would make training impossibly expensive.

The unknown word problem

No matter how large your vocabulary is, users will always type something the model has never seen. A new product name, a typo, a word from a language that was underrepresented in training data. A word-level model has no way to handle these. It can only shrug and output an "unknown" placeholder.

The morphology problem

Words like "run", "running", "runner", and "runs" are clearly related, but a word-level model treats them as completely independent entries. It has to learn the meaning of each one separately, wasting capacity on patterns that are obvious to humans.

Tokens solve all three problems. Instead of splitting text at word boundaries, tokenization algorithms find subword units, chunks that are smaller than words but larger than individual characters. Common words like "the" stay as single tokens. Rare words get split into pieces. The word "unhappiness" might become ["un", "happiness"], or ["un", "happi", "ness"], depending on the algorithm.

This gives you the best of both worlds: a manageable vocabulary size (typically 32,000 to 100,000 entries) that can still represent any possible input, including words it has never seen before.

Character-level tokenization can handle any input, but it produces very long sequences (each character is a token), which makes the model slow and makes it harder to learn word-level meaning. Word-level tokenization is compact but brittle. Subword tokenization hits the sweet spot.

Byte Pair Encoding (BPE)

Byte Pair Encoding is the most widely used tokenization algorithm in modern LLMs. GPT-2, GPT-3, GPT-4, LLaMA, and Mistral all use variants of BPE. The algorithm is surprisingly simple, and understanding it gives you strong intuition for how tokenization behaves in practice.

The Core Idea

BPE starts with the smallest possible vocabulary: individual bytes (or characters). Then it repeatedly finds the most frequent pair of adjacent tokens in the training data and merges them into a single new token. It keeps doing this until the vocabulary reaches a target size.

That is the entire algorithm. Let's walk through it step by step.

BPE Step by Step

Imagine we have a tiny training corpus with just these words (with frequencies):

Word      Frequency
low       5
lower     2
newest    6
widest    3

Step 1: Start with characters

We split every word into individual characters and add a special end-of-word marker (let's use _). Our initial vocabulary is every unique character in the corpus.

Step 2: Count all adjacent pairs

We look at every pair of neighboring tokens across all words, weighted by word frequency. Counting them out for this corpus gives:

(e, s): 9    (s, t): 9    (t, _): 9
(w, e): 8
(l, o): 7    (o, w): 7
(n, e): 6    (e, w): 6
(w, _): 5
(w, i): 3    (i, d): 3    (d, e): 3
(e, r): 2    (r, _): 2

Step 3: Merge the most frequent pair

The pairs (e, s), (s, t), and (t, _) are all tied at 9. We pick one, say (e, s), and merge it into a new token es.

Step 4: Repeat

Now we count pairs again with the updated tokens. The pair (es, t) appears 9 times. Merge it into est.

Next, (est, _) appears 9 times. Merge into est_.

Then (l, o) appears 7 times. Merge into lo.

Then (lo, w) appears 7 times. Merge into low.

And so on, until we hit our target vocabulary size.
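The whole procedure fits in a short Python sketch. This follows the style of the classic reference implementation (space-separated symbols, a regex merge with whitespace lookarounds so that merged symbols are never split apart); it is for intuition, not performance:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of the pair into one new symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# The toy corpus from the walkthrough: characters separated by spaces,
# with "_" as the end-of-word marker
vocab = {"l o w _": 5, "l o w e r _": 2, "n e w e s t _": 6, "w i d e s t _": 3}

merges = []
for step in range(5):
    pair_counts = get_pair_counts(vocab)
    best = max(pair_counts, key=pair_counts.get)  # ties break by first-seen pair
    merges.append(best)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} ({pair_counts[best]} times)")
```

Running it reproduces the merge order from the walkthrough: (e, s), then (es, t), (est, _), (l, o), and (lo, w).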


Why BPE Works Well

The beauty of BPE is that it is purely data-driven. It does not need any linguistic knowledge. Common words like "the" get merged into single tokens early because their character pairs are extremely frequent. Rare or technical words get split into smaller pieces that the model can still process.

This means that:

  • Common words are single tokens (efficient, cheap)
  • Rare words are split into known subwords (still representable)
  • Completely new words fall back to character-level tokens (never truly "unknown")

The vocabulary size is a hyperparameter that model creators choose. GPT-2 uses about 50,257 tokens. GPT-4 uses roughly 100,000. LLaMA uses 32,000. Larger vocabularies mean fewer tokens per text (more efficient), but more parameters to train.

WordPiece

WordPiece is another subword tokenization algorithm that works similarly to BPE but with a key difference in how it chooses which pair to merge. BERT, DistilBERT, and other models from Google's ecosystem use WordPiece.

How It Differs from BPE

BPE merges the pair that appears most frequently in the corpus. WordPiece merges the pair that maximizes the likelihood of the training data. In simpler terms, WordPiece asks: "Which merge would make the training corpus most probable under our current model?"

The practical difference is subtle. BPE favors pairs that are common in absolute terms. WordPiece favors pairs where the combination appears much more often than you would expect given the individual frequencies of each piece. If token A appears 1,000 times and token B appears 1,000 times, but AB appears 900 times, WordPiece considers that a very strong signal to merge, even if another pair appears 950 times in absolute terms.
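The contrast can be made concrete with the hypothetical counts from the paragraph above. The formula count(AB) / (count(A) × count(B)) is the commonly cited simplification of WordPiece's likelihood criterion; treat it as illustrative:

```python
def bpe_score(pair_count: int) -> int:
    # BPE ranks candidate merges by raw co-occurrence count
    return pair_count

def wordpiece_score(pair_count: int, left_count: int, right_count: int) -> float:
    # WordPiece normalizes by the parts' own frequencies: a high score means
    # the pair occurs together far more often than its parts would suggest
    return pair_count / (left_count * right_count)

# A and B each appear 1,000 times; AB appears together 900 times
strong = wordpiece_score(900, 1_000, 1_000)
# Another pair appears 950 times, but its parts are extremely common
weak = wordpiece_score(950, 20_000, 15_000)

print(strong > weak)                      # WordPiece prefers the first merge
print(bpe_score(950) > bpe_score(900))    # BPE prefers the second
```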

The ## Prefix Convention

WordPiece uses a special prefix ## to mark tokens that are continuations of a previous token (i.e., not the start of a word). For example, the word "playing" might be split into ["play", "##ing"], and "extra" into pieces like ["extr", "##a"].

The ## prefix tells the model "this piece is attached to the previous token, not a standalone word." This helps the model distinguish between the word "a" (the article) and the "a" at the end of "extra".

BPE vs WordPiece

Aspect             BPE                          WordPiece
Merge criterion    Most frequent pair           Highest likelihood gain
Prefix marking     No prefix convention         Uses ## for continuations
Used by            GPT-2/3/4, LLaMA, Mistral    BERT, DistilBERT, ELECTRA
Vocabulary size    32K-100K typical             ~30K typical
Speed              Very fast                    Slightly slower (likelihood calculation)

For most practical work as an AI engineer, the difference between BPE and WordPiece rarely matters. What matters is knowing that different models use different tokenizers, so the same text produces different token counts and different token boundaries. This directly affects cost and behavior.

Tokenization in Practice: Using tiktoken

Enough theory. Let's get our hands dirty with actual tokenization libraries.

OpenAI's tiktoken library is the go-to tool for working with GPT tokenizers. It is fast (written in Rust under the hood) and straightforward to use.

Installation and Basic Usage


Notice a few things after running this. "Tokenization" gets split into "Token" and "ization", which makes sense since "Token" is a common subword. Spaces are attached to the beginning of words, not treated as separate tokens. The period is its own token. "AI" stays as a single token because it is extremely common in the training data.

Comparing Tokenizers Across Models

Different models use different tokenizers with different vocabularies. This means the same text produces different token counts. Let's see this in action.


For simple English text, the differences are usually small. But try it with code, JSON, or non-English text, and the gaps become significant. Let's see that.


You will see that GPT-2's tokenizer produces significantly more tokens for non-English text and special characters. GPT-4o's tokenizer has been trained on a more diverse dataset, so it handles these cases more efficiently. Fewer tokens means lower cost and more room within the context window.

Using Hugging Face Tokenizers

For non-OpenAI models, the transformers library from Hugging Face gives you access to their tokenizers.

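Assuming transformers is installed (pip install transformers), something like the following works. I use bert-base-uncased for WordPiece, and hf-internal-testing/llama-tokenizer (a small public copy of the LLaMA tokenizer used in Hugging Face's own documentation) because the official LLaMA checkpoints are gated:

```python
from transformers import AutoTokenizer

text = "Tokenization is fascinating"

# WordPiece (BERT): continuation pieces are marked with "##"
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print("BERT: ", bert.tokenize(text))

# SentencePiece (LLaMA-style): word starts are marked with "▁"
llama = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
print("LLaMA:", llama.tokenize(text))
```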


Notice how LLaMA uses ▁ (U+2581, the underscore-like character SentencePiece uses as a visible space) to mark the beginning of words, while BERT uses ## to mark continuations. Both are solving the same problem, "where do word boundaries fall?", but with different conventions.

The Multilingual Problem

Tokenization gets interesting (and frustrating) when you move beyond English. This is not an edge case. If your application serves a global audience, tokenization quirks directly affect cost and user experience.

Why Non-English Text Uses More Tokens

BPE tokenizers are trained on large corpora. These corpora are heavily weighted toward English. As a result, the tokenizer learns efficient representations for English words but not for words in other languages.

A single English word like "hello" is typically one token. But the equivalent word in Hindi, "नमस्ते" (namaste), might be three or four tokens. The Chinese word for "hello", "你好", might be two tokens with a modern tokenizer but could be six or more with an older one.

This has real consequences:

  • Cost: A Japanese user's request costs 2-3x more tokens than the equivalent English request, even though the semantic content is the same.
  • Context limits: That 128K token context window holds far less Japanese text than English text. A document that fits comfortably in English might get truncated in Korean.
  • Response quality: When the model uses more tokens to represent the input, it has fewer tokens left for generating the output, which can lead to shorter or lower-quality responses.

The same greeting in three languages produces vastly different token counts. The exact numbers depend on the tokenizer, but the pattern is consistent: English is almost always the most efficient.

Newer Models Are Getting Better

It is worth noting that this gap has been shrinking. GPT-4o's tokenizer was specifically trained on more multilingual data, and it handles non-Latin scripts significantly better than GPT-2 or even GPT-3.5. LLaMA 3's tokenizer also expanded its vocabulary to better cover non-English languages.

But the gap has not closed entirely. If you are building a multilingual application, you should always test token counts across your target languages and factor the overhead into your cost estimates.

Token Counting and Cost Estimation

Understanding tokenization is not just academic. It directly affects how much money you spend. Every LLM API charges by the token, and input tokens and output tokens often have different prices.

The Cost Formula

LLM providers publish prices in "dollars per million tokens." To estimate the cost of a single request:

cost = (input_tokens × input_price + output_tokens × output_price) / 1,000,000

where each price is in dollars per million tokens.

Let's make this concrete with a Python function.


A single request is cheap. The cost adds up when you scale. If your application handles 100,000 requests per day, even small inefficiencies in prompt design matter. Shaving 100 tokens off your system prompt saves 100 × 100,000 × 30 = 300 million input tokens a month; at, say, $2.50 per million input tokens, that is about $750.

Context Window Math

Every model has a context window, the maximum number of tokens it can process in a single request (input + output combined). Here are some common limits:

Model                Context Window
GPT-4o               128,000 tokens
Claude 3.5 Sonnet    200,000 tokens
LLaMA 3.1 (8B)       128,000 tokens
Gemini 1.5 Pro       1,000,000 tokens
GPT-3.5 Turbo        16,384 tokens

The context window must fit your entire input (system prompt + conversation history + user message) plus the model's output. If you set max_tokens to 4,000 and your input is 125,000 tokens, you need a context window of at least 129,000 tokens.

This is why tokenization matters for architecture decisions. If you are building a RAG system that stuffs retrieved documents into the prompt, you need to know exactly how many tokens those documents consume. Estimating by word count is unreliable because the words-to-tokens ratio varies by content type.


This kind of utility function is something you will use constantly in production AI applications. It is much better to check before sending a request than to get a truncation error back from the API.
