Decoding Strategies: Temperature, Sampling, and Constrained Generation
Master every LLM decoding parameter: greedy vs beam search vs sampling, temperature scaling, top-k and nucleus sampling, repetition penalties, and the 2024 min-p sampler. Understand when each strategy is optimal and why temperature doesn't change what the model 'knows'.
How LLMs Actually Generate Text
Language model generation is fundamentally a search problem: at each step, the model assigns a probability to every token in its vocabulary (typically 32,000–128,000 tokens), and the decoding strategy decides which token to select next.
The model outputs raw logits — unnormalized scores for each vocabulary token. These are converted to probabilities via softmax: p(token_i) = exp(logit_i) / Σ exp(logit_j). A logit of 5.0 for "dog" and 3.0 for "cat" gives probabilities of 0.88 and 0.12 respectively. The decoding strategy then samples or selects from this distribution.
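As a quick numerical check of that softmax step, here is a standalone sketch (NumPy, toy two-token vocabulary):

```python
import numpy as np

logits = np.array([5.0, 3.0])                  # raw scores for "dog" and "cat"
probs = np.exp(logits) / np.exp(logits).sum()  # softmax: exponentiate, then normalize
print(probs)                                   # approximately [0.88, 0.12], as above
```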
The key insight most engineers miss: the model's logit distribution captures its "beliefs" about what token should come next given all prior context. This distribution is fixed once the forward pass is complete. Decoding parameters do not change what the model knows or believes — they only change how we sample from the distribution the model already produced. Temperature=0.1 and temperature=1.0 query the exact same logit distribution; temperature only sharpens or flattens that distribution before sampling.
This has a critical implication: if the model assigns 40% probability to a wrong answer and 35% to a correct answer, lowering the temperature to near-zero will confidently select the wrong answer. Temperature cannot fix a model that has the wrong beliefs — only training can do that.
Three fundamental decoding families:
- Deterministic: always select the same token (greedy, beam search). Reproducible, consistent, inflexible.
- Stochastic sampling: sample from the probability distribution (temperature, top-k, top-p). Diverse, creative, variable.
- Constrained decoding: modify the logit distribution to enforce output structure before sampling. Guarantees format compliance.
Greedy Decoding and Beam Search: Deterministic Approaches
Greedy decoding selects the single highest-probability token at each step: token_t = argmax p(token | context). It is deterministic (same input always produces the same output), fast (no additional computation beyond the forward pass), and consistent.
Where greedy fails: greedy is myopic. It maximizes local probability at each step without considering how that choice affects future tokens. A token that is slightly less probable now might enable a much more probable sequence later. The classic illustration: in "I want to go to the store to buy ___", greedy picks "milk" because it has the highest immediate probability (say 0.4), even though a slightly less likely token such as "groceries" could open a continuation with higher total sequence probability. Over long sequences, greedy accumulates these locally optimal but globally suboptimal choices, leading to generic, repetitive text.
Beam search maintains B hypotheses ("beams") in parallel, typically B=4 or B=8. At each step, it expands every beam with its top candidate tokens, scores each expanded hypothesis by total log-probability, and keeps the best B. This is an approximate tree search over the token sequence.
Length normalization: without it, beam search systematically prefers shorter sequences (longer sequences accumulate lower log-probabilities from multiplying many values < 1). The fix: divide the total log-probability by sequence length^α, where α ≈ 0.6–0.7. Too low α: still prefers short sequences. Too high α: prefers nonsensically long sequences.
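To see the effect concretely, here is a small sketch with made-up per-token log-probabilities (illustrative numbers, not from a real model):

```python
def beam_score(token_log_probs, alpha=0.7):
    """Length-normalized beam score: total log-probability divided by length**alpha."""
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

short = [-0.5, -0.6, -0.4]                  # 3 tokens, total log-prob = -1.5
long = [-0.3, -0.3, -0.4, -0.3, -0.35]      # 5 tokens, total log-prob = -1.65

print(sum(short), sum(long))                # unnormalized: the short beam wins (-1.5 > -1.65)
print(beam_score(short), beam_score(long))  # normalized: the longer beam wins (about -0.53 vs -0.70)
```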
Why beam search ≠ better for open-ended generation: Holtzman et al. (2020) show that beam search produces text that is more likely according to the model but less preferred by humans for open-ended generation. High-probability text is predictable and generic. Human-preferred text explores lower-probability continuations that are surprising and natural. Beam search optimizes the wrong objective for creative tasks. Use beam search for: machine translation, summarization with high factual requirement, code generation where you want the most likely correct solution. Do not use for: dialogue, creative writing, any open-ended generation.
Temperature Scaling: The Most Misunderstood Parameter
Temperature modifies the softmax before sampling. The standard softmax is p_i = exp(logit_i) / Σ exp(logit_j). Temperature-scaled softmax is p_i = exp(logit_i / T) / Σ exp(logit_j / T).
Effect of temperature:
- T → 0: dividing by a very small number amplifies differences between logits. The highest-logit token gets probability ~1.0; all others get ~0. Equivalent to greedy decoding. Deterministic, repetitive, conservative.
- T = 1.0: original model distribution unchanged. Samples as the model "intends."
- T > 1.0: dividing by a number > 1 compresses the logit differences. The distribution flattens toward uniform. All tokens become equally likely. At T = ∞, the distribution is exactly uniform — every token in the vocabulary has the same probability.
The non-obvious insight: temperature doesn't change the model's beliefs — it only sharpens or flattens the probability distribution. At T=0.1, the model is not "more confident" — the logits are identical. You are just sampling more deterministically from the same distribution. This means T=0.1 and T=1.0 will produce different outputs but both can be equally "correct" given a model with well-calibrated probabilities. T=0.1 samples the model's most likely response; T=1.0 samples a more representative response.
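A minimal sketch of temperature scaling on a fixed logit vector (NumPy, illustrative logits):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax. Same logits in, different sharpness out."""
    z = logits / T
    z = z - z.max()                   # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([4.0, 3.0, 1.0])    # fixed after the forward pass: the model's "beliefs"

for T in (0.1, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.1 -> the top token gets essentially all the mass (near-greedy)
# T=1.0 -> roughly [0.71, 0.26, 0.04] (the model's original distribution)
# T=2.0 -> roughly [0.55, 0.33, 0.12] (flatter: low-probability tokens gain mass)
```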
Practical guidance:
- T=0 (or 0.0): code generation, structured extraction, factual QA — where correctness matters and diversity does not.
- T=0.3–0.5: summarization, translation — low diversity acceptable, slight variation to avoid repetition.
- T=0.7–0.9: dialogue, Q&A assistants — natural variation without becoming incoherent.
- T=1.0–1.2: creative writing, brainstorming, story generation — maximize diversity.
- T > 1.5: generally incoherent for most models. Use only for intentional randomness or ablation studies.
(Figure: temperature-scaled softmax and the top-p threshold)
Top-K, Nucleus (Top-P), and Min-P Sampling
The problem with pure temperature sampling: even at T=0.8, the full vocabulary can be sampled from — 50,257 tokens for GPT-2, 128,256 for LLaMA-3. This includes tokens with probability 0.000001 that produce incoherent continuations. Truncation samplers solve this by restricting sampling to a high-probability subset.
Top-k sampling: sort tokens by probability, keep the top-k, renormalize, sample. Simple, fast. The problem: k is a fixed count, which is inappropriate when the probability distribution has different shapes. When the model is uncertain (flat distribution), k=50 might still include only 30% of the probability mass. When the model is highly confident (peaked distribution), k=50 might include tokens with probabilities as low as 0.00001 — introducing incoherence.
Nucleus / top-p sampling (Holtzman et al. 2020): instead of a fixed count, keep the smallest set of tokens whose cumulative probability is at least p. If p=0.9, include tokens in order of decreasing probability until their cumulative probability reaches 0.9. When the model is confident, this might be just 2–3 tokens. When uncertain, it might be 500 tokens. The cutoff adapts to the model's certainty. Top-p=0.9–0.95 is the production standard for most generation tasks.
Min-p sampling (2024): a newer approach that has shown strong empirical performance for creative tasks. Include token i if and only if its probability exceeds p_min × max_prob, where max_prob is the probability of the most likely token. This is a relative threshold: if the top token has probability 0.6 and p_min=0.1, include all tokens with probability ≥ 0.06. When the top token is highly probable (peaked distribution), the threshold is high — few tokens included. When no token is dominant (flat distribution), the threshold is low — more tokens included. Min-p better preserves coherence than top-p at high temperatures and outperforms top-p on creative writing benchmarks according to Llama community evaluations.
Combining parameters: in production, top-p and temperature are used together. Temperature=0.8 + top-p=0.9 is more controlled than temperature=0.8 alone because top-p cuts off low-probability nonsense. Temperature=1.2 + top-p=0.95 enables high diversity while staying within the nucleus. The order of operations: apply temperature to logits → compute softmax probabilities → apply top-p (or top-k or min-p) truncation → renormalize → sample.
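The sketch below implements that order of operations end to end (NumPy; the top_k, top_p, and min_p names mirror common serving-framework parameters, but exact names and defaults vary by framework):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=0, top_p=1.0, min_p=0.0, rng=None):
    """Temperature -> softmax -> truncation (top-k / top-p / min-p) -> renormalize -> sample."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:
        # keep only the k most probable tokens (ties may keep a few extra)
        keep &= probs >= np.sort(probs)[-top_k]
    if top_p < 1.0:
        # keep the smallest set of tokens whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        nucleus = np.zeros_like(keep)
        nucleus[order[:cutoff]] = True
        keep &= nucleus
    if min_p > 0.0:
        # keep tokens whose probability is at least min_p times the top token's probability
        keep &= probs >= min_p * probs.max()

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()              # renormalize over the surviving tokens
    return rng.choice(len(probs), p=probs)

# Toy 5-token vocabulary
logits = np.array([4.0, 3.0, 2.0, 0.5, -1.0])
print(sample_next_token(logits, temperature=0.8, top_p=0.9))
```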
(Figure: decoding strategy decision tree)
Repetition Penalties: Presence vs Frequency
Repetition in generated text is a common failure mode, especially with greedy decoding and long outputs. At each step, the model's highest-probability next token is often the same token it just generated — "the the the the" in pathological cases, or more subtly, repeating the same phrase every few sentences.
Two distinct penalty mechanisms (exposed as presence_penalty and frequency_penalty in the OpenAI API):
Presence penalty (range -2.0 to 2.0, typical use: 0.1–0.6): subtracts a fixed value from the logit of any token that has appeared at least once in the output. A token seen once gets the same penalty as a token seen ten times. Effect: reduces the probability of reusing any previously used token, regardless of how many times it was used. Good for: encouraging diverse vocabulary, reducing any repetition. Bad for: necessary repetitions (code variable names, proper nouns) are penalized the same as filler words.
Frequency penalty (range -2.0 to 2.0, typical use: 0.1–0.4): subtracts a value proportional to how many times the token has appeared. A token seen once gets a small penalty; seen five times gets a 5× larger penalty. Effect: allows tokens to appear 1–2 times naturally but discourages them from dominating the output. More nuanced than presence penalty. Good for: long-form generation where some repetition is acceptable but spiraling repetition is not.
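A sketch of the adjustment both penalties make to the logits, following the formula described in OpenAI's API documentation (a flat subtraction for presence, a count-proportional subtraction for frequency):

```python
import numpy as np
from collections import Counter

def apply_penalties(logits, generated_ids, presence_penalty=0.0, frequency_penalty=0.0):
    """Subtract penalties from the logits of tokens already present in the output so far."""
    penalized = logits.copy()
    for token_id, count in Counter(generated_ids).items():
        penalized[token_id] -= presence_penalty           # flat: token appeared at least once
        penalized[token_id] -= frequency_penalty * count  # scales with number of appearances
    return penalized
```

The penalized logits then go through the usual temperature, truncation, and sampling pipeline described above.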
Production recommendations:
- Code generation: presence_penalty=0, frequency_penalty=0. Penalties disrupt variable name consistency and break syntactic patterns.
- Dialogue: presence_penalty=0.1–0.2. Discourages the model from repeating the same phrase it used two turns ago.
- Long-form creative writing: frequency_penalty=0.2–0.4. Allows natural repetition of key nouns but prevents paragraphs that recycle the same vocabulary.
- Never use penalties > 0.8: aggressive penalties make the model avoid even correct token choices and degrade output quality significantly.
Decoding Strategy Comparison by Use Case
| Strategy | Diversity | Coherence | Repetition Risk | Latency | Best Use Case |
|---|---|---|---|---|---|
| Greedy (T=0) | None — single output | Highest | High on long outputs | Fastest | Code generation, structured extraction, factual QA |
| Beam Search B=4 | Low — near-greedy | Very High | High — generic sentences | 2–4× greedy | Machine translation, abstractive summarization |
| Temperature T=0.7 + Top-P=0.9 | Moderate — controlled | High | Low to moderate | Same as greedy (no extra passes) | Dialogue, chat assistants, Q&A |
| Temperature T=1.0 + Top-P=0.95 | High — varied output | Moderate | Moderate without rep penalty | Same as greedy | Creative writing, brainstorming, story generation |
| Min-P (p_min=0.05–0.1) | High — adaptive nucleus | Higher than top-p at same T | Low — adaptive cutoff prevents incoherence | Same as greedy | Creative tasks where top-p loses coherence at T > 1.0 |
| Constrained Decoding (Outlines) | Controlled by schema | Schema-dependent | None for structured fields | 5–15% overhead per token step | JSON / SQL output, production format-critical APIs |
Constrained Decoding: Guaranteeing Output Format
Instruction-based format constraints ("always output JSON") achieve ~85–90% compliance in practice. For production systems where downstream code parses the output, 10–15% failure rate is unacceptable.
Constrained decoding solves this by modifying the logit distribution at each generation step. At step t, a finite-state machine (or LALR parser for grammars) tracks which tokens are valid given the tokens generated so far. Invalid tokens get logit = -∞ (effectively zero probability). The model can only generate tokens that advance the finite state machine toward a valid final state.
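Conceptually, the masking step reduces to the following (a toy sketch; real implementations precompile the schema or regex into a token-level FSM so the valid-token set is a cheap state lookup):

```python
import numpy as np

def mask_invalid_tokens(logits, valid_token_ids):
    """Give structurally invalid tokens logit = -inf, so softmax assigns them probability 0."""
    masked = np.full_like(logits, -np.inf)
    masked[valid_token_ids] = logits[valid_token_ids]
    return masked

# Example: midway through '{"age": ' the FSM state would only allow digit tokens,
# so valid_token_ids would be the tokenizer ids for "0" through "9" (plus whitespace).
```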
Outlines library (dottxt-ai): the leading open-source constrained decoding library. Accepts a Pydantic model or a regex and compiles it into a token-level finite-state machine. At each step, it masks the logit vector to allow only valid next tokens. Compatible with llama.cpp, vLLM, and Hugging Face Transformers.
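For orientation, a minimal usage sketch in the style of the pre-1.0 Outlines API (illustrative only: the interface has changed across releases and the schema and model name here are just examples, so check the current docs):

```python
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):            # hypothetical schema for illustration
    vendor: str
    total_usd: float

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")  # any local HF model
generator = outlines.generate.json(model, Invoice)  # compiles the schema into a token-level FSM

result = generator("Extract the invoice fields: ACME Corp billed a total of $1,200.50.")
print(result)  # an Invoice instance; schema compliance is enforced token by token
```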
Key insight: constrained decoding doesn't change the model's behavior for the valid parts of the output — it only prevents structurally invalid tokens. The model still makes its own choices among the valid tokens. So a constrained-decoded JSON output will have the correct schema but the content of each field still depends on the model's generation quality.
OpenAI and Anthropic alternatives: for API-served models, function calling / tool use provides similar guarantees without requiring local model hosting. The API enforces schema conformance server-side. The main limitation: you cannot use arbitrary regex constraints — only the JSON schema format they support.
When constrained decoding adds meaningful latency: for schemas with many possible valid tokens at each step (large JSON schemas, complex grammars), the finite-state machine lookup adds ~5–15% overhead per token. For simple schemas (enums, fixed-format outputs), overhead is < 2%. Not a significant concern for most production use cases.
Decoding Parameters for Different Tasks
from openai import OpenAI
from collections import Counter
from dataclasses import dataclass
from typing import Optional

client = OpenAI()


@dataclass
class DecodingConfig:
    """
    Production-tested decoding configurations per task type.
    All configs assume GPT-4o but parameters are API-portable.
    """
    temperature: float
    top_p: float
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    max_tokens: Optional[int] = None


# Task-specific configurations (empirically validated)
CONFIGS = {
    # Code generation: deterministic, no penalties (breaks variable consistency)
    "code": DecodingConfig(temperature=0.0, top_p=1.0, max_tokens=2048),
    # Structured extraction: deterministic, strict format
    "extraction": DecodingConfig(temperature=0.0, top_p=1.0, max_tokens=512),
    # Factual QA: very low temperature, slight top-p truncation
    "factual_qa": DecodingConfig(temperature=0.1, top_p=0.9, max_tokens=1024),
    # Summarization: low temperature, some variation, frequency penalty for freshness
    "summarization": DecodingConfig(
        temperature=0.3, top_p=0.85,
        frequency_penalty=0.2, max_tokens=512
    ),
    # Dialogue/chat: balanced temperature, presence penalty to avoid repeating phrases
    "dialogue": DecodingConfig(
        temperature=0.7, top_p=0.9,
        presence_penalty=0.15, max_tokens=1024
    ),
    # Creative writing: high temperature + top-p, frequency penalty for vocabulary diversity
    "creative": DecodingConfig(
        temperature=1.1, top_p=0.95,
        frequency_penalty=0.3, max_tokens=2048
    ),
    # Brainstorming: high diversity, want multiple distinct ideas
    "brainstorming": DecodingConfig(
        temperature=1.2, top_p=0.95,
        presence_penalty=0.4, max_tokens=1024
    ),
}


def generate(prompt: str, task: str, system_prompt: str = "") -> str:
    """Generate with task-appropriate decoding configuration."""
    config = CONFIGS.get(task, CONFIGS["dialogue"])  # default to dialogue
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=config.temperature,
        top_p=config.top_p,
        presence_penalty=config.presence_penalty,
        frequency_penalty=config.frequency_penalty,
        max_tokens=config.max_tokens,
    )
    return response.choices[0].message.content


# Self-consistency: sample multiple times and take majority vote (for reasoning tasks)
def self_consistency_generate(
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7
) -> str:
    """
    Self-consistency decoding: sample n times, take majority vote.
    Adds ~3-7% accuracy on reasoning benchmarks at n=5 cost.
    Use only when accuracy >> latency and cost.
    """
    responses = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n_samples,  # generate n samples in one API call
        max_tokens=256,
    )
    answers = [choice.message.content.strip() for choice in responses.choices]
    # Simple majority vote (for classification/short answer tasks)
    most_common_answer, count = Counter(answers).most_common(1)[0]
    print(f"Self-consistency: {count}/{n_samples} agreed on: {most_common_answer}")
    return most_common_answer


# Usage examples
code_output = generate(
    "Write a Python function to compute Levenshtein distance.",
    task="code"
)
story_start = generate(
    "Begin a story about a cartographer who discovers maps that show the future.",
    task="creative",
    system_prompt="You are a literary fiction writer with a lyrical style."
)
The Interview Answer on Temperature That Impresses
Most candidates say 'higher temperature = more creative, lower = more focused.' This is true but shallow. The answer that impresses:
'Temperature T scales the logits before the softmax: p_i = exp(logit_i / T) / Σ_j exp(logit_j / T). As T → 0, the distribution collapses onto the argmax token, equivalent to greedy decoding. As T → ∞, the distribution flattens to uniform over the vocabulary.
The critical insight is that temperature does not change the model's underlying beliefs. It cannot make a model more or less confident about facts — it only changes the sampling sharpness from the fixed logit distribution produced by the forward pass. A model that assigns 30% to a wrong token and 40% to the right token will, at T=0.0, always pick the right token — but it will also confidently pick the wrong token 30% of the time at T=1.0. Temperature is not a confidence dial — it's a sampling sharpness dial.
For production: use T=0 for any task where there is a correct answer (code, extraction, factual QA). Use T=0.7–1.0 for generation tasks where diversity is desirable. Never use temperature to compensate for a poorly performing model — improve the model or prompting instead.'
Common Decoding Mistakes in Production Systems
Mistake 1: High temperature for code generation. Setting temperature=0.7 for code generation introduces random token choices that break syntax. Use T=0 for code. The only exception: when generating multiple diverse solutions for a test suite — sample at T=0.8 to get variety, then unit-test all solutions.
Mistake 2: Ignoring top-p when setting high temperature. Temperature=1.5 without top-p truncation samples from the entire vocabulary including low-probability nonsense tokens. Always pair high temperature with top-p=0.9–0.95.
Mistake 3: Repetition penalties for all tasks. Repetition penalties hurt code generation (consistent variable names are 'repetition'). They hurt extraction (you may need to repeat the same value in multiple fields). Apply penalties only to open-ended generation tasks.
Mistake 4: Using beam search for dialogue. Beam search produces generic, safe text. Dialogue systems trained with RLHF use stochastic sampling, not beam search. Beam search is the right choice for MT and constrained summarization — not for conversational AI.
Mistake 5: Not fixing the random seed for reproducibility in testing. When debugging or A/B testing prompt changes, non-deterministic sampling makes it impossible to attribute output changes to the prompt vs. random sampling. Set seed=42 during evaluation; use temperature=0 for strict comparisons.
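A minimal sketch of such an evaluation call, reusing the client from the configuration example above (eval_prompt is a placeholder for whatever prompt is under test; the seed parameter gives best-effort rather than guaranteed determinism):

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": eval_prompt}],
    temperature=0.0,  # strict comparisons: remove sampling variance entirely
    seed=42,          # best-effort reproducibility across identical requests
)
print(response.system_fingerprint)  # log this: outputs are only comparable within the same fingerprint
```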