Decoding Strategies: Temperature, Sampling, and Constrained Generation
Master every LLM decoding parameter: greedy vs beam search vs sampling, temperature scaling, top-k and nucleus sampling, repetition penalties, and the 2024 min-p sampler. Understand when each strategy is optimal and why temperature doesn't change what the model 'knows'.
How LLMs Actually Generate Text
Language model generation is fundamentally a search problem: at each step, the model assigns a probability to every token in its vocabulary (typically 32,000–128,000 tokens), and the decoding strategy decides which token to select next.
The model outputs raw logits — unnormalized scores for each vocabulary token. These are converted to probabilities via softmax: p(token_i) = exp(logit_i) / Σ exp(logit_j). A logit of 5.0 for "dog" and 3.0 for "cat" gives probabilities of 0.88 and 0.12 respectively. The decoding strategy then samples or selects from this distribution.
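As a quick numerical check of that softmax step, here is a standalone sketch (NumPy, toy two-token vocabulary):

```python
import numpy as np

logits = np.array([5.0, 3.0])                  # raw scores for "dog" and "cat"
probs = np.exp(logits) / np.exp(logits).sum()  # softmax: exponentiate, then normalize
print(probs)                                   # approximately [0.88, 0.12], as above
```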
The key insight most engineers miss: the model's logit distribution captures its "beliefs" about what token should come next given all prior context. This distribution is fixed once the forward pass is complete. Decoding parameters do not change what the model knows or believes — they only change how we sample from the distribution the model already produced. Temperature=0.1 and temperature=1.0 query the exact same logit distribution; temperature only sharpens or flattens that distribution before sampling.
This has a critical implication: if the model assigns 40% probability to a wrong answer and 35% to a correct answer, lowering the temperature to near-zero will confidently select the wrong answer. Temperature cannot fix a model that has the wrong beliefs — only training can do that.
Three fundamental decoding families:
- Deterministic: always select the same token (greedy, beam search). Reproducible, consistent, inflexible.
- Stochastic sampling: sample from the probability distribution (temperature, top-k, top-p). Diverse, creative, variable.
- Constrained decoding: modify the logit distribution to enforce output structure before sampling. Guarantees format compliance.
Greedy Decoding and Beam Search: Deterministic Approaches
Greedy decoding selects the single highest-probability token at each step: token_t = argmax p(token | context). It is deterministic (same input always produces the same output), fast (no additional computation beyond the forward pass), and consistent.
Where greedy fails: greedy is myopic. It maximizes local probability at each step without considering how that choice affects future tokens. A token that is slightly less probable now might enable a much more probable sequence later. The classic illustration: in "I want to go to the store to buy ___", greedy picks "milk" because it has the highest immediate probability (say 0.4), even though a slightly less likely token such as "groceries" could open a continuation with higher total sequence probability. Over long sequences, greedy accumulates these locally optimal but globally suboptimal choices, leading to generic, repetitive text.
Beam search maintains B hypotheses ("beams") in parallel, typically B=4 or B=8. At each step, it expands every beam with its top candidate tokens, scores each expanded hypothesis by total log-probability, and keeps the best B. This is an approximate tree search over the token sequence.
Length normalization: without it, beam search systematically prefers shorter sequences (longer sequences accumulate lower log-probabilities from multiplying many values < 1). The fix: divide the total log-probability by sequence length^α, where α ≈ 0.6–0.7. Too low α: still prefers short sequences. Too high α: prefers nonsensically long sequences.
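To see the effect concretely, here is a small sketch with made-up per-token log-probabilities (illustrative numbers, not from a real model):

```python
def beam_score(token_log_probs, alpha=0.7):
    """Length-normalized beam score: total log-probability divided by length**alpha."""
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

short = [-0.5, -0.6, -0.4]                  # 3 tokens, total log-prob = -1.5
long = [-0.3, -0.3, -0.4, -0.3, -0.35]      # 5 tokens, total log-prob = -1.65

print(sum(short), sum(long))                # unnormalized: the short beam wins (-1.5 > -1.65)
print(beam_score(short), beam_score(long))  # normalized: the longer beam wins (about -0.53 vs -0.70)
```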
Why beam search ≠ better for open-ended generation: Holtzman et al. (2020) show that beam search produces text that is more likely according to the model but less preferred by humans for open-ended generation. High-probability text is predictable and generic. Human-preferred text explores lower-probability continuations that are surprising and natural. Beam search optimizes the wrong objective for creative tasks. Use beam search for: machine translation, summarization with high factual requirement, code generation where you want the most likely correct solution. Do not use for: dialogue, creative writing, any open-ended generation.
Temperature Scaling: The Most Misunderstood Parameter
Temperature modifies the softmax before sampling. The standard softmax is p_i = exp(logit_i) / Σ exp(logit_j). Temperature-scaled softmax is p_i = exp(logit_i / T) / Σ exp(logit_j / T).
Effect of temperature:
- T → 0: dividing by a very small number amplifies differences between logits. The highest-logit token gets probability ~1.0; all others get ~0. Equivalent to greedy decoding. Deterministic, repetitive, conservative.
- T = 1.0: original model distribution unchanged. Samples as the model "intends."
- T > 1.0: dividing by a number > 1 compresses the logit differences. The distribution flattens toward uniform. All tokens become equally likely. At T = ∞, the distribution is exactly uniform — every token in the vocabulary has the same probability.
The non-obvious insight: temperature doesn't change the model's beliefs — it only sharpens or flattens the probability distribution. At T=0.1, the model is not "more confident" — the logits are identical. You are just sampling more deterministically from the same distribution. This means T=0.1 and T=1.0 will produce different outputs but both can be equally "correct" given a model with well-calibrated probabilities. T=0.1 samples the model's most likely response; T=1.0 samples a more representative response.
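A minimal sketch of temperature scaling on a fixed logit vector (NumPy, illustrative logits):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax. Same logits in, different sharpness out."""
    z = logits / T
    z = z - z.max()                   # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([4.0, 3.0, 1.0])    # fixed after the forward pass: the model's "beliefs"

for T in (0.1, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.1 -> the top token gets essentially all the mass (near-greedy)
# T=1.0 -> roughly [0.71, 0.26, 0.04] (the model's original distribution)
# T=2.0 -> roughly [0.55, 0.33, 0.12] (flatter: low-probability tokens gain mass)
```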
Practical guidance:
- T=0 (or 0.0): code generation, structured extraction, factual QA — where correctness matters and diversity does not.
- T=0.3–0.5: summarization, translation — low diversity acceptable, slight variation to avoid repetition.
- T=0.7–0.9: dialogue, Q&A assistants — natural variation without becoming incoherent.
- T=1.0–1.2: creative writing, brainstorming, story generation — maximize diversity.
- T > 1.5: generally incoherent for most models. Use only for intentional randomness or ablation studies.
(Figure: temperature-scaled softmax and the top-p threshold)
Top-K, Nucleus (Top-P), and Min-P Sampling
The problem with pure temperature sampling: even at T=0.8, the full vocabulary can be sampled from — 50,257 tokens for GPT-2, 128,256 for LLaMA-3. This includes tokens with probability 0.000001 that produce incoherent continuations. Truncation samplers solve this by restricting sampling to a high-probability subset.
Top-k sampling: sort tokens by probability, keep the top-k, renormalize, sample. Simple, fast. The problem: k is a fixed count, which is inappropriate when the probability distribution has different shapes. When the model is uncertain (flat distribution), k=50 might still include only 30% of the probability mass. When the model is highly confident (peaked distribution), k=50 might include tokens with probabilities as low as 0.00001 — introducing incoherence.
Nucleus / top-p sampling (Holtzman et al. 2020): instead of a fixed count, keep the smallest set of tokens whose cumulative probability is at least p. If p=0.9, include tokens in order of decreasing probability until their cumulative probability reaches 0.9. When the model is confident, this might be just 2–3 tokens. When uncertain, it might be 500 tokens. The cutoff adapts to the model's certainty. Top-p=0.9–0.95 is the production standard for most generation tasks.
Min-p sampling (2024): a newer approach that has shown strong empirical performance for creative tasks. Include token i if and only if its probability exceeds p_min × max_prob, where max_prob is the probability of the most likely token. This is a relative threshold: if the top token has probability 0.6 and p_min=0.1, include all tokens with probability ≥ 0.06. When the top token is highly probable (peaked distribution), the threshold is high — few tokens included. When no token is dominant (flat distribution), the threshold is low — more tokens included. Min-p better preserves coherence than top-p at high temperatures and outperforms top-p on creative writing benchmarks according to Llama community evaluations.
Combining parameters: in production, top-p and temperature are used together. Temperature=0.8 + top-p=0.9 is more controlled than temperature=0.8 alone because top-p cuts off low-probability nonsense. Temperature=1.2 + top-p=0.95 enables high diversity while staying within the nucleus. The order of operations: apply temperature to logits → compute softmax probabilities → apply top-p (or top-k or min-p) truncation → renormalize → sample.
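The sketch below implements that order of operations end to end (NumPy; the top_k, top_p, and min_p names mirror common serving-framework parameters, but exact names and defaults vary by framework):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=0, top_p=1.0, min_p=0.0, rng=None):
    """Temperature -> softmax -> truncation (top-k / top-p / min-p) -> renormalize -> sample."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:
        # keep only the k most probable tokens (ties may keep a few extra)
        keep &= probs >= np.sort(probs)[-top_k]
    if top_p < 1.0:
        # keep the smallest set of tokens whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        nucleus = np.zeros_like(keep)
        nucleus[order[:cutoff]] = True
        keep &= nucleus
    if min_p > 0.0:
        # keep tokens whose probability is at least min_p times the top token's probability
        keep &= probs >= min_p * probs.max()

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()              # renormalize over the surviving tokens
    return rng.choice(len(probs), p=probs)

# Toy 5-token vocabulary
logits = np.array([4.0, 3.0, 2.0, 0.5, -1.0])
print(sample_next_token(logits, temperature=0.8, top_p=0.9))
```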
(Figure: decoding strategy decision tree)
Repetition Penalties: Presence vs Frequency
Repetition in generated text is a common failure mode, especially with greedy decoding and long outputs. At each step, the model's highest-probability next token is often the same token it just generated — "the the the the" in pathological cases, or more subtly, repeating the same phrase every few sentences.
Two distinct penalty mechanisms (exposed as presence_penalty and frequency_penalty in the OpenAI API):
Presence penalty (range -2.0 to 2.0, typical use: 0.1–0.6): subtracts a fixed value from the logit of any token that has appeared at least once in the output. A token seen once gets the same penalty as a token seen ten times. Effect: reduces the probability of reusing any previously used token, regardless of how many times it was used. Good for: encouraging diverse vocabulary, reducing any repetition. Bad for: necessary repetitions (code variable names, proper nouns) are penalized the same as filler words.
Frequency penalty (range -2.0 to 2.0, typical use: 0.1–0.4): subtracts a value proportional to how many times the token has appeared. A token seen once gets a small penalty; seen five times gets a 5× larger penalty. Effect: allows tokens to appear 1–2 times naturally but discourages them from dominating the output. More nuanced than presence penalty. Good for: long-form generation where some repetition is acceptable but spiraling repetition is not.
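A sketch of the adjustment both penalties make to the logits, following the formula described in OpenAI's API documentation (a flat subtraction for presence, a count-proportional subtraction for frequency):

```python
import numpy as np
from collections import Counter

def apply_penalties(logits, generated_ids, presence_penalty=0.0, frequency_penalty=0.0):
    """Subtract penalties from the logits of tokens already present in the output so far."""
    penalized = logits.copy()
    for token_id, count in Counter(generated_ids).items():
        penalized[token_id] -= presence_penalty           # flat: token appeared at least once
        penalized[token_id] -= frequency_penalty * count  # scales with number of appearances
    return penalized
```

The penalized logits then go through the usual temperature, truncation, and sampling pipeline described above.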
Production recommendations:
- Code generation: presence_penalty=0, frequency_penalty=0. Penalties disrupt variable name consistency and break syntactic patterns.
- Dialogue: presence_penalty=0.1–0.2. Discourages the model from repeating the same phrase it used two turns ago.
- Long-form creative writing: frequency_penalty=0.2–0.4. Allows natural repetition of key nouns but prevents paragraphs that recycle the same vocabulary.
- Never use penalties > 0.8: aggressive penalties make the model avoid even correct token choices and degrade output quality significantly.
Decoding Strategy Comparison by Use Case
| Strategy | Diversity | Coherence | Repetition Risk | Latency | Best Use Case |
|---|---|---|---|---|---|
| Greedy (T=0) | None — single output | Highest | High on long outputs | Fastest | Code generation, structured extraction, factual QA |
| Beam Search B=4 | Low — near-greedy | Very High | High — generic sentences | 2–4× greedy | Machine translation, abstractive summarization |
| Temperature T=0.7 + Top-P=0.9 | Moderate — controlled | High | Low to moderate | Same as greedy (no extra passes) | Dialogue, chat assistants, Q&A |
| Temperature T=1.0 + Top-P=0.95 | High — varied output | Moderate | Moderate without rep penalty | Same as greedy | Creative writing, brainstorming, story generation |
| Min-P (p_min=0.05–0.1) | High — adaptive nucleus | Higher than top-p at same T | Low — adaptive cutoff prevents incoherence | Same as greedy | Creative tasks where top-p loses coherence at T > 1.0 |
| Constrained Decoding (Outlines) | Controlled by schema | Schema-dependent | None for structured fields | 5–15% overhead per token step | JSON / SQL output, production format-critical APIs |
Constrained Decoding: Guaranteeing Output Format
Instruction-based format constraints ("always output JSON") achieve ~85–90% compliance in practice. For production systems where downstream code parses the output, 10–15% failure rate is unacceptable.
Constrained decoding solves this by modifying the logit distribution at each generation step. At step t, a finite-state machine (or LALR parser for grammars) tracks which tokens are valid given the tokens generated so far. Invalid tokens get logit = -∞ (effectively zero probability). The model can only generate tokens that advance the finite state machine toward a valid final state.
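Conceptually, the masking step reduces to the following (a toy sketch; real implementations precompile the schema or regex into a token-level FSM so the valid-token set is a cheap state lookup):

```python
import numpy as np

def mask_invalid_tokens(logits, valid_token_ids):
    """Give structurally invalid tokens logit = -inf, so softmax assigns them probability 0."""
    masked = np.full_like(logits, -np.inf)
    masked[valid_token_ids] = logits[valid_token_ids]
    return masked

# Example: midway through '{"age": ' the FSM state would only allow digit tokens,
# so valid_token_ids would be the tokenizer ids for "0" through "9" (plus whitespace).
```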
Outlines library (dottxt-ai): the leading open-source constrained decoding library. Accepts a Pydantic model or a regex and compiles it into a token-level finite-state machine. At each step, it masks the logit vector to allow only valid next tokens. Compatible with llama.cpp, vLLM, and Hugging Face Transformers.
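For orientation, a minimal usage sketch in the style of the pre-1.0 Outlines API (illustrative only: the interface has changed across releases and the schema and model name here are just examples, so check the current docs):

```python
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):            # hypothetical schema for illustration
    vendor: str
    total_usd: float

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")  # any local HF model
generator = outlines.generate.json(model, Invoice)  # compiles the schema into a token-level FSM

result = generator("Extract the invoice fields: ACME Corp billed a total of $1,200.50.")
print(result)  # an Invoice instance; schema compliance is enforced token by token
```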
Key insight: constrained decoding doesn't change the model's behavior for the valid parts of the output — it only prevents structurally invalid tokens. The model still makes its own choices among the valid tokens. So a constrained-decoded JSON output will have the correct schema but the content of each field still depends on the model's generation quality.
OpenAI and Anthropic alternatives: for API-served models, function calling / tool use provides similar guarantees without requiring local model hosting. The API enforces schema conformance server-side. The main limitation: you cannot use arbitrary regex constraints — only the JSON schema format they support.
When constrained decoding adds meaningful latency: for schemas with many possible valid tokens at each step (large JSON schemas, complex grammars), the finite-state machine lookup adds ~5–15% overhead per token. For simple schemas (enums, fixed-format outputs), overhead is < 2%. Not a significant concern for most production use cases.
Decoding Parameters for Different Tasks
from openai import OpenAI
from collections import Counter
from dataclasses import dataclass
from typing import Optional

client = OpenAI()


@dataclass
class DecodingConfig:
    """
    Production-tested decoding configurations per task type.
    All configs assume GPT-4o but parameters are API-portable.
    """
    temperature: float
    top_p: float
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    max_tokens: Optional[int] = None


# Task-specific configurations (empirically validated)
CONFIGS = {
    # Code generation: deterministic, no penalties (breaks variable consistency)
    "code": DecodingConfig(temperature=0.0, top_p=1.0, max_tokens=2048),
    # Structured extraction: deterministic, strict format
    "extraction": DecodingConfig(temperature=0.0, top_p=1.0, max_tokens=512),
    # Factual QA: very low temperature, slight top-p truncation
    "factual_qa": DecodingConfig(temperature=0.1, top_p=0.9, max_tokens=1024),
    # Summarization: low temperature, some variation, frequency penalty for freshness
    "summarization": DecodingConfig(
        temperature=0.3, top_p=0.85,
        frequency_penalty=0.2, max_tokens=512
    ),
    # Dialogue/chat: balanced temperature, presence penalty to avoid repeating phrases
    "dialogue": DecodingConfig(
        temperature=0.7, top_p=0.9,
        presence_penalty=0.15, max_tokens=1024
    ),
    # Creative writing: high temperature + top-p, frequency penalty for vocabulary diversity
    "creative": DecodingConfig(
        temperature=1.1, top_p=0.95,
        frequency_penalty=0.3, max_tokens=2048
    ),
    # Brainstorming: high diversity, want multiple distinct ideas
    "brainstorming": DecodingConfig(
        temperature=1.2, top_p=0.95,
        presence_penalty=0.4, max_tokens=1024
    ),
}


def generate(prompt: str, task: str, system_prompt: str = "") -> str:
    """Generate with task-appropriate decoding configuration."""
    config = CONFIGS.get(task, CONFIGS["dialogue"])  # default to dialogue
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=config.temperature,
        top_p=config.top_p,
        presence_penalty=config.presence_penalty,
        frequency_penalty=config.frequency_penalty,
        max_tokens=config.max_tokens,
    )
    return response.choices[0].message.content


# Self-consistency: sample multiple times and take majority vote (for reasoning tasks)
def self_consistency_generate(
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7
) -> str:
    """
    Self-consistency decoding: sample n times, take majority vote.
    Adds ~3-7% accuracy on reasoning benchmarks at n=5 cost.
    Use only when accuracy >> latency and cost.
    """
    responses = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n_samples,  # generate n samples in one API call
        max_tokens=256,
    )
    answers = [choice.message.content.strip() for choice in responses.choices]
    # Simple majority vote (for classification/short answer tasks)
    most_common_answer, count = Counter(answers).most_common(1)[0]
    print(f"Self-consistency: {count}/{n_samples} agreed on: {most_common_answer}")
    return most_common_answer


# Usage examples
code_output = generate(
    "Write a Python function to compute Levenshtein distance.",
    task="code"
)
story_start = generate(
    "Begin a story about a cartographer who discovers maps that show the future.",
    task="creative",
    system_prompt="You are a literary fiction writer with a lyrical style."
)
The Interview Answer on Temperature That Impresses
Most candidates say 'higher temperature = more creative, lower = more focused.' This is true but shallow. The answer that impresses:
'Temperature T scales the logits before the softmax: p_i = exp(logit_i / T) / Σ_j exp(logit_j / T). As T → 0, the distribution collapses onto the argmax token, equivalent to greedy decoding. As T → ∞, the distribution flattens to uniform over the vocabulary.
The critical insight is that temperature does not change the model's underlying beliefs. It cannot make a model more or less confident about facts — it only changes the sampling sharpness from the fixed logit distribution produced by the forward pass. A model that assigns 30% to a wrong token and 40% to the right token will, at T=0.0, always pick the right token — but it will also confidently pick the wrong token 30% of the time at T=1.0. Temperature is not a confidence dial — it's a sampling sharpness dial.
For production: use T=0 for any task where there is a correct answer (code, extraction, factual QA). Use T=0.7–1.0 for generation tasks where diversity is desirable. Never use temperature to compensate for a poorly performing model — improve the model or prompting instead.'
Common Decoding Mistakes in Production Systems
Mistake 1: High temperature for code generation. Setting temperature=0.7 for code generation introduces random token choices that break syntax. Use T=0 for code. The only exception: when generating multiple diverse solutions for a test suite — sample at T=0.8 to get variety, then unit-test all solutions.
Mistake 2: Ignoring top-p when setting high temperature. Temperature=1.5 without top-p truncation samples from the entire vocabulary including low-probability nonsense tokens. Always pair high temperature with top-p=0.9–0.95.
Mistake 3: Repetition penalties for all tasks. Repetition penalties hurt code generation (consistent variable names are 'repetition'). They hurt extraction (you may need to repeat the same value in multiple fields). Apply penalties only to open-ended generation tasks.
Mistake 4: Using beam search for dialogue. Beam search produces generic, safe text. Dialogue systems trained with RLHF use stochastic sampling, not beam search. Beam search is the right choice for MT and constrained summarization — not for conversational AI.
Mistake 5: Not fixing the random seed for reproducibility in testing. When debugging or A/B testing prompt changes, non-deterministic sampling makes it impossible to attribute output changes to the prompt vs. random sampling. Set seed=42 during evaluation; use temperature=0 for strict comparisons.
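A minimal sketch of such an evaluation call, reusing the client from the configuration example above (eval_prompt is a placeholder for whatever prompt is under test; the seed parameter gives best-effort rather than guaranteed determinism):

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": eval_prompt}],
    temperature=0.0,  # strict comparisons: remove sampling variance entirely
    seed=42,          # best-effort reproducibility across identical requests
)
print(response.system_fingerprint)  # log this: outputs are only comparable within the same fingerprint
```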