
Tokenization

Last Updated: March 14, 2026


Ashish Pratap Singh

Before a language model can understand or generate text, it must first convert that text into a form it can process. Neural networks do not work directly with characters or words. They operate on numbers. Tokenization is the step that bridges this gap.

Tokenization is the process of breaking raw text into smaller units called tokens and mapping those tokens to numeric IDs that the model can understand. A token might be a word, part of a word, a punctuation mark, or even a single character, depending on the tokenizer.

For example, the sentence "LLMs are transforming software development." might be broken into tokens like:

["LL", "Ms", " are", " transforming", " software", " development", "."]

Each token is then converted into a number and fed into the model.

In this chapter, we will explore how tokenization works, why modern models use subword tokens instead of full words, and how tokenization affects context length, cost, and model performance.

Why Not Just Use Words?

The first question worth asking is: why do LLMs need tokens at all? Why not just work with words directly?

There are three fundamental problems with using words as the basic unit.

The vocabulary problem

English has roughly 170,000 words in current use. Add technical jargon, proper nouns, misspellings, slang, and other languages, and you are looking at millions of unique strings. Every word needs its own entry in the model's vocabulary, and every entry comes with parameters the model has to learn. A vocabulary of millions would make training impossibly expensive.

The unknown word problem

No matter how large your vocabulary is, users will always type something the model has never seen. A new product name, a typo, a word from a language that was underrepresented in training data. A word-level model has no way to handle these. It can only shrug and output an "unknown" placeholder.

The morphology problem

Words like "run", "running", "runner", and "runs" are clearly related, but a word-level model treats them as completely independent entries. It has to learn the meaning of each one separately, wasting capacity on patterns that are obvious to humans.

Tokens solve all three problems. Instead of splitting text at word boundaries, tokenization algorithms find subword units, chunks that are smaller than words but larger than individual characters. Common words like "the" stay as single tokens. Rare words get split into pieces. The word "unhappiness" might become ["un", "happiness"], or ["un", "happi", "ness"], depending on the algorithm.

This gives you the best of both worlds: a manageable vocabulary size (typically 32,000 to 100,000 entries) that can still represent any possible input, including words it has never seen before.

Character-level tokenization can handle any input, but it produces very long sequences (each character is a token), which makes the model slow and makes it harder to learn word-level meaning. Word-level tokenization is compact but brittle. Subword tokenization hits the sweet spot.

Byte Pair Encoding (BPE)

Byte Pair Encoding is the most widely used tokenization algorithm in modern LLMs. GPT-2, GPT-3, GPT-4, LLaMA, and Mistral all use variants of BPE. The algorithm is surprisingly simple, and understanding it gives you strong intuition for how tokenization behaves in practice.

The Core Idea

BPE starts with the smallest possible vocabulary: individual bytes (or characters). Then it repeatedly finds the most frequent pair of adjacent tokens in the training data and merges them into a single new token. It keeps doing this until the vocabulary reaches a target size.

That is the entire algorithm. Let's walk through it step by step.

BPE Step by Step

Imagine we have a tiny training corpus with just these words (with frequencies):

Word      Frequency
low       5
lower     2
newest    6
widest    3

Step 1: Start with characters

We split every word into individual characters and add a special end-of-word marker (let's use _). Our initial vocabulary is every unique character in the corpus.

Step 2: Count all adjacent pairs

We look at every pair of neighboring tokens across all words, weighted by word frequency. Counting them out for this corpus gives:

(e, s): 9    (s, t): 9    (t, _): 9
(w, e): 8
(l, o): 7    (o, w): 7
(n, e): 6    (e, w): 6
(w, _): 5
(w, i): 3    (i, d): 3    (d, e): 3
(e, r): 2    (r, _): 2

Step 3: Merge the most frequent pair

The pairs (e, s), (s, t), and (t, _) are all tied at 9. We pick one, say (e, s), and merge it into a new token es.

Step 4: Repeat

Now we count pairs again with the updated tokens. The pair (es, t) appears 9 times. Merge it into est.

Next, (est, _) appears 9 times. Merge into est_.

Then (l, o) appears 7 times. Merge into lo.

Then (lo, w) appears 7 times. Merge into low.

And so on, until we hit our target vocabulary size.
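The whole procedure fits in a short Python sketch. This follows the style of the classic reference implementation (space-separated symbols, a regex merge with whitespace lookarounds so that merged symbols are never split apart); it is for intuition, not performance:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of the pair into one new symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# The toy corpus from the walkthrough: characters separated by spaces,
# with "_" as the end-of-word marker
vocab = {"l o w _": 5, "l o w e r _": 2, "n e w e s t _": 6, "w i d e s t _": 3}

merges = []
for step in range(5):
    pair_counts = get_pair_counts(vocab)
    best = max(pair_counts, key=pair_counts.get)  # ties break by first-seen pair
    merges.append(best)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} ({pair_counts[best]} times)")
```

Running it reproduces the merge order from the walkthrough: (e, s), then (es, t), (est, _), (l, o), and (lo, w).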


Why BPE Works Well

The beauty of BPE is that it is purely data-driven. It does not need any linguistic knowledge. Common words like "the" get merged into single tokens early because their character pairs are extremely frequent. Rare or technical words get split into smaller pieces that the model can still process.

This means that:

  • Common words are single tokens (efficient, cheap)
  • Rare words are split into known subwords (still representable)
  • Completely new words fall back to character-level tokens (never truly "unknown")

The vocabulary size is a hyperparameter that model creators choose. GPT-2 uses about 50,257 tokens. GPT-4 uses roughly 100,000. LLaMA uses 32,000. Larger vocabularies mean fewer tokens per text (more efficient), but more parameters to train.

WordPiece

WordPiece is another subword tokenization algorithm that works similarly to BPE but with a key difference in how it chooses which pair to merge. BERT, DistilBERT, and other models from Google's ecosystem use WordPiece.

How It Differs from BPE

BPE merges the pair that appears most frequently in the corpus. WordPiece merges the pair that maximizes the likelihood of the training data. In simpler terms, WordPiece asks: "Which merge would make the training corpus most probable under our current model?"

The practical difference is subtle. BPE favors pairs that are common in absolute terms. WordPiece favors pairs where the combination appears much more often than you would expect given the individual frequencies of each piece. If token A appears 1,000 times and token B appears 1,000 times, but AB appears 900 times, WordPiece considers that a very strong signal to merge, even if another pair appears 950 times in absolute terms.
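The contrast can be made concrete with the hypothetical counts from the paragraph above. The formula count(AB) / (count(A) × count(B)) is the commonly cited simplification of WordPiece's likelihood criterion; treat it as illustrative:

```python
def bpe_score(pair_count: int) -> int:
    # BPE ranks candidate merges by raw co-occurrence count
    return pair_count

def wordpiece_score(pair_count: int, left_count: int, right_count: int) -> float:
    # WordPiece normalizes by the parts' own frequencies: a high score means
    # the pair occurs together far more often than its parts would suggest
    return pair_count / (left_count * right_count)

# A and B each appear 1,000 times; AB appears together 900 times
strong = wordpiece_score(900, 1_000, 1_000)
# Another pair appears 950 times, but its parts are extremely common
weak = wordpiece_score(950, 20_000, 15_000)

print(strong > weak)                      # WordPiece prefers the first merge
print(bpe_score(950) > bpe_score(900))    # BPE prefers the second
```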

The ## Prefix Convention

WordPiece uses a special prefix ## to mark tokens that are continuations of a previous token (i.e., not the start of a word). For example, the word "playing" might be split into ["play", "##ing"], and "extra" into pieces like ["extr", "##a"].

The ## prefix tells the model "this piece is attached to the previous token, not a standalone word." This helps the model distinguish between the word "a" (the article) and the "a" at the end of "extra".

BPE vs WordPiece

Aspect             BPE                          WordPiece
Merge criterion    Most frequent pair           Highest likelihood gain
Prefix marking     No prefix convention         Uses ## for continuations
Used by            GPT-2/3/4, LLaMA, Mistral    BERT, DistilBERT, ELECTRA
Vocabulary size    32K-100K typical             ~30K typical
Speed              Very fast                    Slightly slower (likelihood calculation)

For most practical work as an AI engineer, the difference between BPE and WordPiece rarely matters. What matters is knowing that different models use different tokenizers, so the same text produces different token counts and different token boundaries. This directly affects cost and behavior.

Tokenization in Practice: Using tiktoken

Enough theory. Let's get our hands dirty with actual tokenization libraries.

OpenAI's tiktoken library is the go-to tool for working with GPT tokenizers. It is fast (written in Rust under the hood) and straightforward to use.

Installation and Basic Usage


Notice a few things after running this. "Tokenization" gets split into "Token" and "ization", which makes sense since "Token" is a common subword. Spaces are attached to the beginning of words, not treated as separate tokens. The period is its own token. "AI" stays as a single token because it is extremely common in the training data.

Comparing Tokenizers Across Models

Different models use different tokenizers with different vocabularies. This means the same text produces different token counts. Let's see this in action.


For simple English text, the differences are usually small. But try it with code, JSON, or non-English text, and the gaps become significant. Let's see that.


You will see that GPT-2's tokenizer produces significantly more tokens for non-English text and special characters. GPT-4o's tokenizer has been trained on a more diverse dataset, so it handles these cases more efficiently. Fewer tokens means lower cost and more room within the context window.

Using Hugging Face Tokenizers

For non-OpenAI models, the transformers library from Hugging Face gives you access to their tokenizers.

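Assuming transformers is installed (pip install transformers), something like the following works. I use bert-base-uncased for WordPiece, and hf-internal-testing/llama-tokenizer (a small public copy of the LLaMA tokenizer used in Hugging Face's own documentation) because the official LLaMA checkpoints are gated:

```python
from transformers import AutoTokenizer

text = "Tokenization is fascinating"

# WordPiece (BERT): continuation pieces are marked with "##"
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print("BERT: ", bert.tokenize(text))

# SentencePiece (LLaMA-style): word starts are marked with "▁"
llama = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
print("LLaMA:", llama.tokenize(text))
```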


Notice how LLaMA uses ▁ (U+2581, the underscore-like character SentencePiece uses as a visible space) to mark the beginning of words, while BERT uses ## to mark continuations. Both are solving the same problem, "where do word boundaries fall?", but with different conventions.

The Multilingual Problem

Tokenization gets interesting (and frustrating) when you move beyond English. This is not an edge case. If your application serves a global audience, tokenization quirks directly affect cost and user experience.

Why Non-English Text Uses More Tokens

BPE tokenizers are trained on large corpora. These corpora are heavily weighted toward English. As a result, the tokenizer learns efficient representations for English words but not for words in other languages.

A single English word like "hello" is typically one token. But the equivalent word in Hindi, "नमस्ते" (namaste), might be three or four tokens. The Chinese word for "hello", "你好", might be two tokens with a modern tokenizer but could be six or more with an older one.

This has real consequences:

  • Cost: A Japanese user's request costs 2-3x more tokens than the equivalent English request, even though the semantic content is the same.
  • Context limits: That 128K token context window holds far less Japanese text than English text. A document that fits comfortably in English might get truncated in Korean.
  • Response quality: When the model uses more tokens to represent the input, it has fewer tokens left for generating the output, which can lead to shorter or lower-quality responses.

The same greeting in three languages produces vastly different token counts. The exact numbers depend on the tokenizer, but the pattern is consistent: English is almost always the most efficient.

Newer Models Are Getting Better

It is worth noting that this gap has been shrinking. GPT-4o's tokenizer was specifically trained on more multilingual data, and it handles non-Latin scripts significantly better than GPT-2 or even GPT-3.5. LLaMA 3's tokenizer also expanded its vocabulary to better cover non-English languages.

But the gap has not closed entirely. If you are building a multilingual application, you should always test token counts across your target languages and factor the overhead into your cost estimates.

Token Counting and Cost Estimation

Understanding tokenization is not just academic. It directly affects how much money you spend. Every LLM API charges by the token, and input tokens and output tokens often have different prices.

The Cost Formula

LLM providers publish prices in "dollars per million tokens." To estimate the cost of a single request:

cost = (input_tokens × input_price + output_tokens × output_price) / 1,000,000

where each price is in dollars per million tokens.

Let's make this concrete with a Python function.


A single request is cheap. The cost adds up when you scale. If your application handles 100,000 requests per day, even small inefficiencies in prompt design matter. Shaving 100 tokens off your system prompt saves 100 × 100,000 × 30 = 300 million input tokens a month; at, say, $2.50 per million input tokens, that is about $750.

Context Window Math

Every model has a context window, the maximum number of tokens it can process in a single request (input + output combined). Here are some common limits:

Model                Context Window
GPT-4o               128,000 tokens
Claude 3.5 Sonnet    200,000 tokens
LLaMA 3.1 (8B)       128,000 tokens
Gemini 1.5 Pro       1,000,000 tokens
GPT-3.5 Turbo        16,384 tokens

The context window must fit your entire input (system prompt + conversation history + user message) plus the model's output. If you set max_tokens to 4,000 and your input is 125,000 tokens, you need a context window of at least 129,000 tokens.

This is why tokenization matters for architecture decisions. If you are building a RAG system that stuffs retrieved documents into the prompt, you need to know exactly how many tokens those documents consume. Estimating by word count is unreliable because the words-to-tokens ratio varies by content type.


This kind of utility function is something you will use constantly in production AI applications. It is much better to check before sending a request than to get a truncation error back from the API.
