Last Updated: March 18, 2026
When you call an LLM API with default parameters, you leave key generation behaviors in the model’s hands: how random the output is, how long it generates, and when it stops. That may be fine for a quick experiment, but in production, you usually need much tighter control.
LLM APIs expose a set of parameters that shape the generation process, such as temperature, top_p, max_tokens, stop sequences, and penalties. These settings influence how creative, focused, long, repetitive, or structured the output becomes.
In this chapter, we’ll cover the most important LLM parameters you should know and how to use them effectively.
Before we touch any parameters, you need a quick mental model of how text generation works. An LLM does not "write" text the way a human does. It predicts one token at a time.
At each step, the model looks at everything generated so far and produces a probability distribution over all possible next tokens. The word "the" might have a 15% chance, "a" might have 8%, "Hello" might have 0.001%. The model then picks one token from this distribution, appends it to the output, and repeats.
The parameters we cover in this chapter all influence this token selection process. Some change the probability distribution itself. Others control when the process stops.
This loop runs hundreds or thousands of times for a single response. Each parameter we discuss below affects some part of this loop.
Temperature is the single most important parameter you will use. It controls how "creative" or "deterministic" the model's output is.
Remember the probability distribution the model produces at each step? Temperature modifies that distribution before a token is sampled.
Here is the math, simplified. The model produces raw scores (called logits) for each possible next token. These get converted to probabilities using the softmax function:

P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Where z_i is the logit for token i and T is the temperature. Watch what happens at different values:

- T → 0: dividing by a tiny T stretches the gaps between logits, so the highest-scoring token dominates and output becomes (nearly) deterministic.
- T = 1: the logits pass through unchanged; this is the model's native distribution.
- T > 1: the gaps shrink, the distribution flattens, and low-probability tokens get selected far more often.
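To see the effect of T numerically, here is a small pure-Python sketch of temperature-scaled softmax (the logits are made up for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    # Divide each logit by T before exponentiating: T < 1 sharpens the
    # distribution, T > 1 flattens it.
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # raw scores for three candidate tokens
for T in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T=0.2 nearly all probability piles onto the first token; at T=2.0 the three tokens end up much closer together.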
Let us see this in practice. The following script sends the same prompt at five different temperature values and prints the results:
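A sketch of such a script, assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment (the model name `gpt-4o-mini` is an example; substitute your own):

```python
TEMPERATURES = [0.0, 0.4, 0.7, 1.0, 1.5]
PROMPT = "Suggest a name for a new coffee shop. Reply with the name only."

def run_sweep(model="gpt-4o-mini", runs=3):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    for t in TEMPERATURES:
        outputs = []
        for _ in range(runs):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": PROMPT}],
                temperature=t,
                max_tokens=20,
            )
            outputs.append(resp.choices[0].message.content.strip())
        print(f"temperature={t}: {outputs}")

# Example: run_sweep()  # uncomment to hit the API
```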
The pattern in the output is clear. At temperature 0, all three runs return the same name. At 0.7, outputs vary but remain coherent. At 1.5, the model produces unusual combinations as lower-probability tokens get selected more often.
Temperature is not the only way to control randomness. Top-p, also called nucleus sampling, takes a fundamentally different approach. Instead of scaling all probabilities, it restricts which tokens the model can even consider.
Top-p works by sorting all possible next tokens by probability (highest first), then adding up probabilities until the cumulative sum reaches the threshold p. Only tokens within this "nucleus" are eligible for selection.
Say the model produces these probabilities for the next token (illustrative numbers):

| Token | Probability | Cumulative |
|---|---|---|
| "the" | 0.45 | 0.45 |
| "a" | 0.20 | 0.65 |
| "one" | 0.15 | 0.80 |
| "this" | 0.10 | 0.90 |
| "my" | 0.05 | 0.95 |
| (all others) | 0.05 | 1.00 |
With top_p = 0.8, the model only considers "the", "a", and "one" (cumulative probability reaches 0.80). The remaining tokens are excluded. The model then samples from this reduced set.
With top_p = 0.95, the model considers the top 5 tokens. With top_p = 1.0 (the default), all tokens are eligible, so top_p has no effect.
Both parameters control randomness, but they work differently:

- Temperature reshapes the entire distribution: every token stays eligible, but the relative weights change.
- Top-p truncates the distribution: tokens outside the nucleus are excluded entirely, while the remaining tokens keep their relative weights.
The golden rule: Adjust one, keep the other at its default. Changing both at the same time makes it hard to predict behavior. OpenAI's own documentation recommends this. Most practitioners stick to temperature and leave top_p at 1.0.
Top-p dynamically adjusts how many tokens are eligible based on cumulative probability. Top-k takes a simpler approach: it keeps exactly the top K most probable tokens and discards the rest.
If top_k=50, the model only considers the 50 highest-probability tokens at each step, regardless of how much cumulative probability they cover. If top_k=1, the model always picks the single most likely token (equivalent to temperature 0).
Think of top_k as a fixed-size window and top_p as a variable-size window. Top-p adapts: when the model is confident, it might only consider 3 tokens; when it is uncertain, it might consider 200. Top-k always considers exactly K tokens, which can be too many when the model is confident or too few when it is unsure.
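The two filtering rules can be sketched side by side in a few lines of pure Python (the probabilities are invented for illustration):

```python
def top_k_filter(probs, k):
    # Keep exactly the k highest-probability tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs, p):
    # Keep tokens, highest first, until cumulative probability reaches p.
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

probs = {"the": 0.45, "a": 0.20, "one": 0.15, "this": 0.10, "my": 0.05, "we": 0.05}
print(top_k_filter(probs, 2))    # always exactly two tokens
print(top_p_filter(probs, 0.8))  # however many it takes to reach 0.8
```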
Temperature and top_p introduce randomness by design, which means the same prompt produces different outputs each run. But reproducibility matters in several real scenarios: writing tests for an LLM pipeline, debugging a prompt change to isolate whether a difference came from your edit or from random sampling, or demonstrating specific model behavior in a review.
The seed parameter solves this. It initializes the random number generator used during sampling, so the model follows the same "path" through its probability distributions.
When you pass a seed value, the model uses it to make its random choices deterministic. Same prompt + same seed + same parameters = same output (most of the time).
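A minimal sketch, assuming the OpenAI Python SDK (model name is an example):

```python
def generate_name(seed=None, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Suggest a coffee shop name. Name only."}],
        temperature=0.7,
        seed=seed,       # same seed -> same sampling path (best-effort)
        max_tokens=20,
    )
    return resp.choices[0].message.content.strip()

# Example: print(generate_name(seed=42)) three times -- expect the same name.
```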
Without the seed, temperature 0.7 would give you different names each time. With seed=42, every run follows the same sampling path.
Seed-based reproducibility is best-effort across most providers. OpenAI's docs say deterministic output is "not guaranteed." Why? Because model infrastructure changes (load balancing, GPU routing, numerical precision differences across hardware) can introduce tiny variations.
In practice, short outputs with the same model version are highly reproducible. Long outputs or requests made weeks apart may occasionally differ. For testing and debugging, seed is reliable enough to be extremely useful. For production logic, do not depend on exact output matching.
To verify reproducibility, check the system_fingerprint field in the response. If two responses have the same fingerprint and seed, they should be identical:
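A sketch of that check, again assuming the OpenAI SDK:

```python
def response_with_fingerprint(seed, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Name one prime number."}],
        seed=seed,
        temperature=0.7,
    )
    return resp.choices[0].message.content, resp.system_fingerprint

# Example:
#   out1, fp1 = response_with_fingerprint(seed=42)
#   out2, fp2 = response_with_fingerprint(seed=42)
#   If fp1 == fp2, out1 and out2 should be identical.
```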
Every LLM has a context window, a fixed budget of tokens shared between your input and the model's output. If you do not manage this budget, the model will either cut off mid-sentence or blow through your cost estimates.
The context window is the total number of tokens the model can handle in a single request. This includes everything: the system prompt, the conversation history, and the generated output.
Here is the relationship:

input_tokens + output_tokens ≤ context_window

Whatever your input does not use is the maximum room left for output.
If your input uses 120,000 tokens of a 128K context window, the model can only generate 8,000 tokens of output, regardless of what you set max_tokens to.
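That arithmetic as a tiny helper:

```python
def available_output_tokens(context_window, input_tokens, max_tokens=None):
    # Output space is whatever the input leaves over, further capped by max_tokens.
    room = context_window - input_tokens
    return room if max_tokens is None else min(room, max_tokens)

print(available_output_tokens(128_000, 120_000))                      # 8000
print(available_output_tokens(128_000, 120_000, max_tokens=16_000))   # still 8000
```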
The max_tokens parameter sets an upper limit on how many tokens the model can generate. It does not guarantee the model will use all of them. The model may stop earlier if it reaches a natural conclusion or hits a stop sequence.
What happens when the model hits the max_tokens limit? It stops generating immediately, even if it is mid-sentence. The API response includes a finish_reason field that tells you why generation stopped:

- "stop" means the model finished naturally
- "length" means it hit the max_tokens limit (you probably want more tokens)

Tokens are not words. The word "understanding" is one word but may consist of two tokens ("under" + "standing"). Knowing how many tokens your input uses helps you set appropriate limits and predict costs.
You can use the tiktoken library to count tokens before making an API call. It runs offline and returns results instantly.
Running this will show you that common words often map to single tokens, while less common words get split into pieces. This is important because it means token count does not scale linearly with word count. A rough rule of thumb: 1 token is approximately 0.75 words in English, or about 4 characters.
Sometimes you need the model to stop at a specific point. Maybe you want it to generate a single answer without explanation. Maybe you are building a multi-turn system where the model should stop when it reaches a delimiter. Stop sequences give you that control.
A stop sequence is a string that, when generated by the model, causes it to stop immediately. The stop sequence itself is not included in the output.
You can specify up to 4 stop sequences per request (depending on the API). The model stops as soon as it generates any one of them.
Stop sequences are especially useful when you are generating structured or delimited content:
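For example, a sketch that forces a single answer and stops before the model invents a follow-up question (OpenAI SDK assumed, model name is an example):

```python
def answer_only(question, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Q: {question}\nA:"}],
        stop=["\nQ:"],   # halt if the model starts a new question
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example: print(answer_only("What is the capital of France?"))
```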
Another common pattern is using stop sequences with multi-part generation, where you want the model to generate content in sections:
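One way to sketch that pattern, using a delimiter the prompt asks the model to emit (section names and delimiter are illustrative):

```python
SECTIONS = ["Introduction", "Body", "Conclusion"]

def generate_sections(topic, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    parts = {}
    for section in SECTIONS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (f"Write the {section} section of a short essay on {topic}. "
                            f"Write '---' when the section is complete."),
            }],
            stop=["---"],   # the delimiter ends each section cleanly
            temperature=0.7,
        )
        parts[section] = resp.choices[0].message.content.strip()
    return parts
```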
LLMs sometimes get stuck in loops, repeating the same phrase or idea over and over. This is especially common in longer outputs. Frequency and presence penalties give you two distinct ways to push the model away from repetition.
Both penalties modify the probability of tokens that have already appeared in the output. But they work differently:
Frequency penalty (range: -2.0 to 2.0, default: 0) reduces the probability of a token proportionally to how many times it has already appeared. If the word "the" has appeared 5 times, it gets penalized 5 times as much as a word that appeared once.
Presence penalty (range: -2.0 to 2.0, default: 0) applies a flat penalty to any token that has appeared at all, regardless of how many times. Whether a word appeared once or fifty times, it gets the same penalty.
Think of it this way: frequency penalty stops the model from saying "very very very very." Presence penalty stops the model from circling back to the same topic.
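To compare the two effects yourself, a sketch that runs the same prompt under each penalty (OpenAI SDK assumed; prompt and values are illustrative):

```python
PENALTY_PROMPT = "Write a paragraph about the benefits of exercise."

def compare_penalties(model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    settings = [
        {"frequency_penalty": 0.0, "presence_penalty": 0.0},  # baseline
        {"frequency_penalty": 1.5, "presence_penalty": 0.0},  # varied wording
        {"frequency_penalty": 0.0, "presence_penalty": 1.5},  # varied topics
    ]
    for params in settings:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PENALTY_PROMPT}],
            max_tokens=150,
            **params,
        )
        print(params, "->", resp.choices[0].message.content[:120], "...")
```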
With the frequency penalty cranked up, you will notice the model uses a wider variety of words and avoids repeating specific terms. With the presence penalty high, the model tends to jump between topics more aggressively, covering more ground but sometimes at the expense of coherence.
In practice, mild values (0.3 to 0.8) work well. Values above 1.5 can make the output feel forced and unnatural.
So far, every parameter we have covered changes how the model generates text. Logprobs does something different: it shows you the probability the model assigned to each token it chose, and what alternatives it considered.
This is incredibly useful for debugging and evaluation. Is the model confident in its answer, or is it basically guessing? When it classifies a support ticket as "billing," was "technical" a close second? Logprobs turn the model from a black box into something you can inspect.
When you set logprobs=True, the API returns the log-probability of each token in the output. A log-probability is just the natural logarithm of the probability: a value of 0 means 100% confidence, and more negative values mean lower confidence. You can convert to a regular probability with exp(logprob).
You can also set top_logprobs (1 to 20) to see the probabilities of the top alternative tokens at each position.
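A sketch of such a request and how to read the result (OpenAI SDK assumed; field names follow its chat completions response shape):

```python
import math

def ask_with_confidence(question, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,      # return the logprob of each chosen token
        top_logprobs=5,     # ...plus the 5 closest alternatives per position
        max_tokens=5,
    )
    first = resp.choices[0].logprobs.content[0]   # first generated token
    print(f"token={first.token!r} p={math.exp(first.logprob):.4f}")
    for alt in first.top_logprobs:                # alternatives it considered
        print(f"  alt {alt.token!r} p={math.exp(alt.logprob):.4f}")

# Example: ask_with_confidence("What is the capital of France? One word.")
```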
The model is 99.9% confident the answer is "Paris." That is a clear signal you can trust this output.
Where logprobs really shine is when the model is less certain:
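For instance, classifying a mixed review (a sketch; the exact confidence you see will vary by input and model):

```python
import math

def classify_sentiment(review, model="gpt-4o-mini"):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Classify the sentiment as positive or negative. "
                        f"One word.\n\nReview: {review}"),
        }],
        logprobs=True,
        max_tokens=2,
        temperature=0,
    )
    first = resp.choices[0].logprobs.content[0]
    return first.token, math.exp(first.logprob)   # label and its probability

# Example:
#   classify_sentiment("The food was great but the service was painfully slow.")
```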
Only 52% confident. That tells you this input is genuinely ambiguous and you might want a human to review it, or you might want to return "mixed" as a category.
Here are the most common uses for logprobs:

- Classification confidence: flag low-confidence predictions for human review instead of trusting them blindly.
- Prompt comparison: check whether a prompt change makes the model more or less certain of the same answer.
- Ambiguity detection: when the top alternatives are close in probability, the input itself is ambiguous.
- Evaluation: score outputs by the probability the model assigned to them rather than by exact string match.
LLM APIs charge per token. Input tokens and output tokens often have different prices. If you are building a product, you need to predict costs before they surprise you. A chatbot that processes 10,000 conversations a day can run up a significant bill if you are not paying attention.
Most providers charge per million tokens, with separate rates for input and output. Output tokens are almost always more expensive because generation requires more compute than processing input.
You can find the pricing for popular models on each provider's pricing page. Choosing the right model for each task is one of the most effective ways to control costs.
Let us build a utility that estimates the cost of an API call before you make it:
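A minimal version. The prices below are placeholders, not real rates; check your provider's current pricing page before relying on any numbers:

```python
# Illustrative prices in USD per million tokens (NOT real rates).
PRICES = {
    "example-small-model": {"input": 0.15, "output": 0.60},
    "example-large-model": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model, input_tokens, max_output_tokens):
    """Worst-case estimate: assumes the model uses all of max_output_tokens."""
    p = PRICES[model]
    input_cost = input_tokens / 1_000_000 * p["input"]
    output_cost = max_output_tokens / 1_000_000 * p["output"]
    return input_cost + output_cost

print(f"${estimate_cost('example-small-model', 2_000, 500):.6f}")  # $0.000600
```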
Every API response includes a usage field with the actual token counts. Use this to track real costs:
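A small helper that reads it (field names follow the OpenAI SDK's usage object):

```python
def report_usage(response):
    # Every chat completion response carries a `usage` object with actual counts.
    u = response.usage
    print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} "
          f"total={u.total_tokens}")
    return u.prompt_tokens, u.completion_tokens
```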
For production applications, wrap this in a logging function that tracks costs across all API calls:
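One way to sketch such a tracker; pass in your provider's actual per-million-token rates:

```python
class CostTracker:
    """Accumulates token usage and cost across API calls."""

    def __init__(self, input_price, output_price):
        # Prices are USD per million tokens, from your provider's pricing page.
        self.input_price = input_price
        self.output_price = output_price
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, usage):
        # `usage` is the response.usage object from a chat completion.
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens

    @property
    def total_cost(self):
        return (self.input_tokens / 1_000_000 * self.input_price
                + self.output_tokens / 1_000_000 * self.output_price)
```

Call `tracker.record(response.usage)` after every API call, and log `tracker.total_cost` periodically.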
Here is a quick reference for all the parameters covered in this chapter:

| Parameter | What it does | Range | Default |
|---|---|---|---|
| temperature | Scales the probability distribution before sampling | 0 to 2 | 1.0 |
| top_p | Restricts sampling to tokens within cumulative probability p | 0 to 1 | 1.0 |
| top_k | Restricts sampling to the K most probable tokens | 1 and up | provider-specific (not offered by every API) |
| seed | Makes sampling reproducible (best-effort) | any integer | none |
| max_tokens | Caps the number of generated tokens | up to the remaining context | provider-specific |
| stop | Strings that halt generation when produced | up to 4 sequences | none |
| frequency_penalty | Penalizes tokens proportionally to how often they have appeared | -2.0 to 2.0 | 0 |
| presence_penalty | Flat penalty on any token that has appeared at all | -2.0 to 2.0 | 0 |
| logprobs / top_logprobs | Returns token probabilities and alternatives | top_logprobs: 1 to 20 | off |
These combinations work well for specific use cases:

- Deterministic extraction or classification: temperature 0, optionally a seed for reproducible tests.
- General chat or Q&A: temperature around 0.7, everything else at defaults.
- Creative writing or brainstorming: temperature 1.0 to 1.2, with a mild frequency penalty (0.3 to 0.5) to curb repetition.
- Structured or delimited output: low temperature plus stop sequences so generation ends at the delimiter.
Now that you understand each parameter individually, let us build the exercise that ties everything together. This script generates multiple outputs for the same prompt across different temperature values and measures how output diversity changes.
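A sketch of that exercise, assuming the OpenAI SDK (model name and prompt are examples):

```python
def distinct_ratio(outputs):
    # Fraction of unique responses: near 0 means identical, 1.0 means all differ.
    return len(set(outputs)) / len(outputs)

def diversity_sweep(model="gpt-4o-mini", runs=5):
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()
    prompt = "Suggest a name for a coffee shop. Reply with the name only."
    for t in [0.0, 0.4, 0.7, 1.0, 1.5]:
        outputs = []
        for _ in range(runs):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=t,
                max_tokens=20,
            )
            outputs.append(resp.choices[0].message.content.strip())
        print(f"temperature={t}: {len(set(outputs))}/{runs} distinct -> {outputs}")

# Example: diversity_sweep()  # uncomment to hit the API
```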
When you run this, expect every run at temperature 0 to return the same output, with the number of distinct responses climbing as the temperature rises.
This is the core intuition. Temperature is a direct lever on how many different responses the model can give for the same input.