Last Updated: March 18, 2026
When you call an LLM API with default parameters, you leave key generation behaviors in the model’s hands: how random the output is, how long it generates, and when it stops. That may be fine for a quick experiment, but in production, you usually need much tighter control.
LLM APIs expose a set of parameters that shape the generation process, such as temperature, top_p, max_tokens, stop sequences, and penalties. These settings influence how creative, focused, long, repetitive, or structured the output becomes.
In this chapter, we’ll cover the most important LLM parameters you should know and how to use them effectively.
Before we touch any parameters, you need a quick mental model of how text generation works. An LLM does not "write" text the way a human does. It predicts one token at a time.
At each step, the model looks at everything generated so far and produces a probability distribution over all possible next tokens. The word "the" might have a 15% chance, "a" might have 8%, "Hello" might have 0.001%. The model then picks one token from this distribution, appends it to the output, and repeats.
The parameters we cover in this chapter all influence this token selection process. Some change the probability distribution itself. Others control when the process stops.
This loop runs hundreds or thousands of times for a single response. Each parameter we discuss below affects some part of this loop.
Temperature is the single most important parameter you will use. It controls how "creative" or "deterministic" the model's output is.
Remember the probability distribution the model produces at each step? Temperature modifies that distribution before a token is sampled.
Here is the math, simplified. The model produces raw scores (called logits) for each possible next token. These get converted to probabilities using the softmax function:

P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Where z_i is the logit for token i and T is the temperature. Watch what happens at different values:

- T → 0: dividing by a tiny T stretches the gaps between logits, so the highest-scoring token dominates and output becomes (nearly) deterministic.
- T = 1: the logits pass through unchanged; this is the model's native distribution.
- T > 1: the gaps shrink, the distribution flattens, and low-probability tokens get selected far more often.
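To see the effect of T numerically, here is a small pure-Python sketch of temperature-scaled softmax (the logits are made up for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    # Divide each logit by T before exponentiating: T < 1 sharpens the
    # distribution, T > 1 flattens it.
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # raw scores for three candidate tokens
for T in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T=0.2 nearly all probability piles onto the first token; at T=2.0 the three tokens end up much closer together.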
Let us see this in practice. The following script sends the same prompt at five different temperature values and prints the results:
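A sketch of such a script, assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment (the model name `gpt-4o-mini` is an example; substitute your own):

```python
TEMPERATURES = [0.0, 0.4, 0.7, 1.0, 1.5]
PROMPT = "Suggest a name for a new coffee shop. Reply with the name only."

def run_sweep(model="gpt-4o-mini", runs=3):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    for t in TEMPERATURES:
        outputs = []
        for _ in range(runs):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": PROMPT}],
                temperature=t,
                max_tokens=20,
            )
            outputs.append(resp.choices[0].message.content.strip())
        print(f"temperature={t}: {outputs}")

# Example: run_sweep()  # uncomment to hit the API
```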
The pattern in the output is clear. At temperature 0, all three runs return the same name. At 0.7, outputs vary but remain coherent. At 1.5, the model produces unusual combinations as lower-probability tokens get selected more often.
Temperature is not the only way to control randomness. Top-p, also called nucleus sampling, takes a fundamentally different approach. Instead of scaling all probabilities, it restricts which tokens the model can even consider.
Top-p works by sorting all possible next tokens by probability (highest first), then adding up probabilities until the cumulative sum reaches the threshold p. Only tokens within this "nucleus" are eligible for selection.
Say the model produces these probabilities for the next token (illustrative numbers):

| Token | Probability | Cumulative |
|---|---|---|
| "the" | 0.45 | 0.45 |
| "a" | 0.20 | 0.65 |
| "one" | 0.15 | 0.80 |
| "this" | 0.10 | 0.90 |
| "my" | 0.05 | 0.95 |
| (all others) | 0.05 | 1.00 |
With top_p = 0.8, the model only considers "the", "a", and "one" (cumulative probability reaches 0.80). The remaining tokens are excluded. The model then samples from this reduced set.
With top_p = 0.95, the model considers the top 5 tokens. With top_p = 1.0 (the default), all tokens are eligible, so top_p has no effect.
Both parameters control randomness, but they work differently:

- Temperature reshapes the entire distribution: every token stays eligible, but the relative weights change.
- Top-p truncates the distribution: tokens outside the nucleus are excluded entirely, while the remaining tokens keep their relative weights.
The golden rule: Adjust one, keep the other at its default. Changing both at the same time makes it hard to predict behavior. OpenAI's own documentation recommends this. Most practitioners stick to temperature and leave top_p at 1.0.
Top-p dynamically adjusts how many tokens are eligible based on cumulative probability. Top-k takes a simpler approach: it keeps exactly the top K most probable tokens and discards the rest.
If top_k=50, the model only considers the 50 highest-probability tokens at each step, regardless of how much cumulative probability they cover. If top_k=1, the model always picks the single most likely token (equivalent to temperature 0).
Think of top_k as a fixed-size window and top_p as a variable-size window. Top-p adapts: when the model is confident, it might only consider 3 tokens; when it is uncertain, it might consider 200. Top-k always considers exactly K tokens, which can be too many when the model is confident or too few when it is unsure.
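The two filtering rules can be sketched side by side in a few lines of pure Python (the probabilities are invented for illustration):

```python
def top_k_filter(probs, k):
    # Keep exactly the k highest-probability tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs, p):
    # Keep tokens, highest first, until cumulative probability reaches p.
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

probs = {"the": 0.45, "a": 0.20, "one": 0.15, "this": 0.10, "my": 0.05, "we": 0.05}
print(top_k_filter(probs, 2))    # always exactly two tokens
print(top_p_filter(probs, 0.8))  # however many it takes to reach 0.8
```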
Temperature and top_p introduce randomness by design, which means the same prompt produces different outputs each run. But reproducibility matters in several real scenarios: writing tests for an LLM pipeline, debugging a prompt change to isolate whether a difference came from your edit or from random sampling, or demonstrating specific model behavior in a review.
The seed parameter solves this. It initializes the random number generator used during sampling, so the model follows the same "path" through its probability distributions.
When you pass a seed value, the model uses it to make its random choices deterministic. Same prompt + same seed + same parameters = same output (most of the time).
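A minimal sketch, assuming the OpenAI Python SDK (model name is an example):

```python
def generate_name(seed=None, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Suggest a coffee shop name. Name only."}],
        temperature=0.7,
        seed=seed,       # same seed -> same sampling path (best-effort)
        max_tokens=20,
    )
    return resp.choices[0].message.content.strip()

# Example: print(generate_name(seed=42)) three times -- expect the same name.
```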
Without the seed, temperature 0.7 would give you different names each time. With seed=42, every run follows the same sampling path.
Seed-based reproducibility is best-effort across most providers. OpenAI's docs say deterministic output is "not guaranteed." Why? Because model infrastructure changes (load balancing, GPU routing, numerical precision differences across hardware) can introduce tiny variations.
In practice, short outputs with the same model version are highly reproducible. Long outputs or requests made weeks apart may occasionally differ. For testing and debugging, seed is reliable enough to be extremely useful. For production logic, do not depend on exact output matching.
To verify reproducibility, check the system_fingerprint field in the response. If two responses have the same fingerprint and seed, they should be identical:
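A sketch of that check, again assuming the OpenAI SDK:

```python
def response_with_fingerprint(seed, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Name one prime number."}],
        seed=seed,
        temperature=0.7,
    )
    return resp.choices[0].message.content, resp.system_fingerprint

# Example:
#   out1, fp1 = response_with_fingerprint(seed=42)
#   out2, fp2 = response_with_fingerprint(seed=42)
#   If fp1 == fp2, out1 and out2 should be identical.
```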
Every LLM has a context window, a fixed budget of tokens shared between your input and the model's output. If you do not manage this budget, the model will either cut off mid-sentence or blow through your cost estimates.
The context window is the total number of tokens the model can handle in a single request. This includes everything: the system prompt, the conversation history, and the generated output.
Here is the relationship:

input_tokens + output_tokens ≤ context_window

Whatever your input does not use is the maximum room left for output.
If your input uses 120,000 tokens of a 128K context window, the model can only generate 8,000 tokens of output, regardless of what you set max_tokens to.
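That arithmetic as a tiny helper:

```python
def available_output_tokens(context_window, input_tokens, max_tokens=None):
    # Output space is whatever the input leaves over, further capped by max_tokens.
    room = context_window - input_tokens
    return room if max_tokens is None else min(room, max_tokens)

print(available_output_tokens(128_000, 120_000))                      # 8000
print(available_output_tokens(128_000, 120_000, max_tokens=16_000))   # still 8000
```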
The max_tokens parameter sets an upper limit on how many tokens the model can generate. It does not guarantee the model will use all of them. The model may stop earlier if it reaches a natural conclusion or hits a stop sequence.
What happens when the model hits the max_tokens limit? It stops generating immediately, even if it is mid-sentence. The API response includes a finish_reason field that tells you why generation stopped:

- "stop" means the model finished naturally
- "length" means it hit the max_tokens limit (you probably want more tokens)

Tokens are not words. The word "understanding" is one word but may consist of two tokens ("under" + "standing"). Knowing how many tokens your input uses helps you set appropriate limits and predict costs.
You can use the tiktoken library to count tokens before making an API call. It runs offline and returns results instantly.
Running this will show you that common words often map to single tokens, while less common words get split into pieces. This is important because it means token count does not scale linearly with word count. A rough rule of thumb: 1 token is approximately 0.75 words in English, or about 4 characters.
Sometimes you need the model to stop at a specific point. Maybe you want it to generate a single answer without explanation. Maybe you are building a multi-turn system where the model should stop when it reaches a delimiter. Stop sequences give you that control.
A stop sequence is a string that, when generated by the model, causes it to stop immediately. The stop sequence itself is not included in the output.
You can specify up to 4 stop sequences per request (depending on the API). The model stops as soon as it generates any one of them.
Stop sequences are especially useful when you are generating structured or delimited content:
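For example, a sketch that forces a single answer and stops before the model invents a follow-up question (OpenAI SDK assumed, model name is an example):

```python
def answer_only(question, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Q: {question}\nA:"}],
        stop=["\nQ:"],   # halt if the model starts a new question
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example: print(answer_only("What is the capital of France?"))
```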
Another common pattern is using stop sequences with multi-part generation, where you want the model to generate content in sections:
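One way to sketch that pattern, using a delimiter the prompt asks the model to emit (section names and delimiter are illustrative):

```python
SECTIONS = ["Introduction", "Body", "Conclusion"]

def generate_sections(topic, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    parts = {}
    for section in SECTIONS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (f"Write the {section} section of a short essay on {topic}. "
                            f"Write '---' when the section is complete."),
            }],
            stop=["---"],   # the delimiter ends each section cleanly
            temperature=0.7,
        )
        parts[section] = resp.choices[0].message.content.strip()
    return parts
```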
LLMs sometimes get stuck in loops, repeating the same phrase or idea over and over. This is especially common in longer outputs. Frequency and presence penalties give you two distinct ways to push the model away from repetition.
Both penalties modify the probability of tokens that have already appeared in the output. But they work differently:
Frequency penalty (range: -2.0 to 2.0, default: 0) reduces the probability of a token proportionally to how many times it has already appeared. If the word "the" has appeared 5 times, it gets penalized 5 times as much as a word that appeared once.
Presence penalty (range: -2.0 to 2.0, default: 0) applies a flat penalty to any token that has appeared at all, regardless of how many times. Whether a word appeared once or fifty times, it gets the same penalty.
Think of it this way: frequency penalty stops the model from saying "very very very very." Presence penalty stops the model from circling back to the same topic.
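To compare the two effects yourself, a sketch that runs the same prompt under each penalty (OpenAI SDK assumed; prompt and values are illustrative):

```python
PENALTY_PROMPT = "Write a paragraph about the benefits of exercise."

def compare_penalties(model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    settings = [
        {"frequency_penalty": 0.0, "presence_penalty": 0.0},  # baseline
        {"frequency_penalty": 1.5, "presence_penalty": 0.0},  # varied wording
        {"frequency_penalty": 0.0, "presence_penalty": 1.5},  # varied topics
    ]
    for params in settings:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PENALTY_PROMPT}],
            max_tokens=150,
            **params,
        )
        print(params, "->", resp.choices[0].message.content[:120], "...")
```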
With the frequency penalty cranked up, you will notice the model uses a wider variety of words and avoids repeating specific terms. With the presence penalty high, the model tends to jump between topics more aggressively, covering more ground but sometimes at the expense of coherence.
In practice, mild values (0.3 to 0.8) work well. Values above 1.5 can make the output feel forced and unnatural.
So far, every parameter we have covered changes how the model generates text. Logprobs does something different: it shows you the probability the model assigned to each token it chose, and what alternatives it considered.
This is incredibly useful for debugging and evaluation. Is the model confident in its answer, or is it basically guessing? When it classifies a support ticket as "billing," was "technical" a close second? Logprobs turn the model from a black box into something you can inspect.
When you set logprobs=True, the API returns the log-probability of each token in the output. A log-probability is just the natural logarithm of the probability: a value of 0 means 100% confidence, and more negative values mean lower confidence. You can convert to a regular probability with exp(logprob).
You can also set top_logprobs (1 to 20) to see the probabilities of the top alternative tokens at each position.
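A sketch of such a request and how to read the result (OpenAI SDK assumed; field names follow its chat completions response shape):

```python
import math

def ask_with_confidence(question, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,      # return the logprob of each chosen token
        top_logprobs=5,     # ...plus the 5 closest alternatives per position
        max_tokens=5,
    )
    first = resp.choices[0].logprobs.content[0]   # first generated token
    print(f"token={first.token!r} p={math.exp(first.logprob):.4f}")
    for alt in first.top_logprobs:                # alternatives it considered
        print(f"  alt {alt.token!r} p={math.exp(alt.logprob):.4f}")

# Example: ask_with_confidence("What is the capital of France? One word.")
```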
The model is 99.9% confident the answer is "Paris." That is a clear signal you can trust this output.
Where logprobs really shine is when the model is less certain:
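For instance, classifying a mixed review (a sketch; the exact confidence you see will vary by input and model):

```python
import math

def classify_sentiment(review, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Classify the sentiment as positive or negative. "
                        f"One word.\n\nReview: {review}"),
        }],
        logprobs=True,
        max_tokens=2,
        temperature=0,
    )
    first = resp.choices[0].logprobs.content[0]
    return first.token, math.exp(first.logprob)   # label and its probability

# Example:
#   classify_sentiment("The food was great but the service was painfully slow.")
```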
Only 52% confident. That tells you this input is genuinely ambiguous and you might want a human to review it, or you might want to return "mixed" as a category.
Here are the most common uses for logprobs:

- Classification confidence: flag low-confidence predictions for human review instead of trusting them blindly.
- Prompt comparison: check whether a prompt change makes the model more or less certain of the same answer.
- Ambiguity detection: when the top alternatives are close in probability, the input itself is ambiguous.
- Evaluation: score outputs by the probability the model assigned to them rather than by exact string match.
LLM APIs charge per token. Input tokens and output tokens often have different prices. If you are building a product, you need to predict costs before they surprise you. A chatbot that processes 10,000 conversations a day can run up a significant bill if you are not paying attention.
Most providers charge per million tokens, with separate rates for input and output. Output tokens are almost always more expensive because generation requires more compute than processing input.
You can find the pricing for popular models on each provider's pricing page. Choosing the right model for each task is one of the most effective ways to control costs.
Let us build a utility that estimates the cost of an API call before you make it:
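A minimal version. The prices below are placeholders, not real rates; check your provider's current pricing page before relying on any numbers:

```python
# Illustrative prices in USD per million tokens (NOT real rates).
PRICES = {
    "example-small-model": {"input": 0.15, "output": 0.60},
    "example-large-model": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model, input_tokens, max_output_tokens):
    """Worst-case estimate: assumes the model uses all of max_output_tokens."""
    p = PRICES[model]
    input_cost = input_tokens / 1_000_000 * p["input"]
    output_cost = max_output_tokens / 1_000_000 * p["output"]
    return input_cost + output_cost

print(f"${estimate_cost('example-small-model', 2_000, 500):.6f}")  # $0.000600
```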
Every API response includes a usage field with the actual token counts. Use this to track real costs:
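A small helper that reads it (field names follow the OpenAI SDK's usage object):

```python
def report_usage(response):
    # Every chat completion response carries a `usage` object with actual counts.
    u = response.usage
    print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} "
          f"total={u.total_tokens}")
    return u.prompt_tokens, u.completion_tokens
```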
For production applications, wrap this in a logging function that tracks costs across all API calls:
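One way to sketch such a tracker; pass in your provider's actual per-million-token rates:

```python
class CostTracker:
    """Accumulates token usage and cost across API calls."""

    def __init__(self, input_price, output_price):
        # Prices are USD per million tokens, from your provider's pricing page.
        self.input_price = input_price
        self.output_price = output_price
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, usage):
        # `usage` is the response.usage object from a chat completion.
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens

    @property
    def total_cost(self):
        return (self.input_tokens / 1_000_000 * self.input_price
                + self.output_tokens / 1_000_000 * self.output_price)
```

Call `tracker.record(response.usage)` after every API call, and log `tracker.total_cost` periodically.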
Here is a quick reference for all the parameters covered in this chapter:

| Parameter | What it does | Range | Default |
|---|---|---|---|
| temperature | Scales the probability distribution before sampling | 0 to 2 | 1.0 |
| top_p | Restricts sampling to tokens within cumulative probability p | 0 to 1 | 1.0 |
| top_k | Restricts sampling to the K most probable tokens | 1 and up | provider-specific (not offered by every API) |
| seed | Makes sampling reproducible (best-effort) | any integer | none |
| max_tokens | Caps the number of generated tokens | up to the remaining context | provider-specific |
| stop | Strings that halt generation when produced | up to 4 sequences | none |
| frequency_penalty | Penalizes tokens proportionally to how often they have appeared | -2.0 to 2.0 | 0 |
| presence_penalty | Flat penalty on any token that has appeared at all | -2.0 to 2.0 | 0 |
| logprobs / top_logprobs | Returns token probabilities and alternatives | top_logprobs: 1 to 20 | off |
These combinations work well for specific use cases:

- Deterministic extraction or classification: temperature 0, optionally a seed for reproducible tests.
- General chat or Q&A: temperature around 0.7, everything else at defaults.
- Creative writing or brainstorming: temperature 1.0 to 1.2, with a mild frequency penalty (0.3 to 0.5) to curb repetition.
- Structured or delimited output: low temperature plus stop sequences so generation ends at the delimiter.
Now that you understand each parameter individually, let us build the exercise that ties everything together. This script generates multiple outputs for the same prompt across different temperature values and measures how output diversity changes.
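A sketch of that exercise, assuming the OpenAI SDK (model name and prompt are examples):

```python
def distinct_ratio(outputs):
    # Fraction of unique responses: near 0 means identical, 1.0 means all differ.
    return len(set(outputs)) / len(outputs)

def diversity_sweep(model="gpt-4o-mini", runs=5):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    prompt = "Suggest a name for a coffee shop. Reply with the name only."
    for t in [0.0, 0.4, 0.7, 1.0, 1.5]:
        outputs = []
        for _ in range(runs):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=t,
                max_tokens=20,
            )
            outputs.append(resp.choices[0].message.content.strip())
        print(f"temperature={t}: {len(set(outputs))}/{runs} distinct -> {outputs}")

# Example: diversity_sweep()  # uncomment to hit the API
```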
When you run this, expect every run at temperature 0 to return the same output, with the number of distinct responses climbing as the temperature rises.
This is the core intuition. Temperature is a direct lever on how many different responses the model can give for the same input.