Tech_Nuggets

Posted on Jun 12

Sampling strategies compared: temperature, top-p, top-k, min-p, and what actually works in production

#ai #machinelearning #opensource #llm

You deployed a chatbot, picked temperature 0.7 because every blog post says that, and the first live user sends back screenshots of responses that drift into gibberish mid-sentence. A colleague suggests top-p 0.9. Another says top-k 50. Someone new to the team mentions min-p and claims it solves everything. You have no benchmark, no test set, and no way to tell whether any of these knobs actually fix your specific problem instead of just making the outputs shorter.

This is the state of sampling parameter selection for most teams shipping LLM products. The parameters are poorly documented, they interact in non-intuitive ways, and the default values in every inference engine are tuned for general-purpose chat benchmarks, not for your use case. This post maps the four most common sampling knobs -- temperature, top-p, top-k, and min-p -- to the concrete effects they have on the output distribution, so you can pick the right one (or combination) without guessing.

Why sampling parameters matter

Every LLM generates text one token at a time by choosing from a probability distribution over the vocabulary. The raw distribution (the logits from the final transformer layer, passed through softmax) is rarely sampled as-is. It might assign 0.3 to the top token while fifty thousand tail tokens get around 0.00001 each; individually negligible, those tail tokens together hold half the probability mass. Sample directly from that and you regularly land on a token the model considers implausible. Always take the top token instead and you get a narrow band of high-probability continuations that sound repetitive and robotic.

Sampling parameters reshape this distribution. The goal is to widen the distribution enough for creative or useful variation, but not so much that the model assigns meaningful probability to tokens that make no sense. Each parameter attacks a different failure mode:

Temperature controls the overall sharpness of the distribution.
Top-p (nucleus sampling) truncates the distribution to the smallest set of tokens whose cumulative probability reaches a threshold.
Top-k keeps only the k highest-probability tokens and renormalizes.
Min-p scales a probability floor relative to the top token's probability, keeping tokens whose probability is at least that fraction of the top token.

The following diagram shows how each strategy transforms the same logit distribution:

flowchart LR
    A[Raw logits<br/>from model] --> D{Temperature}
    D -->|tau < 1| E[Sharpened<br/>logits]
    D -->|tau > 1| F[Flattened<br/>logits]
    E --> B[Softmax]
    F --> B
    B --> C[Probability<br/>distribution]
    C --> G{Top-p / Top-k / Min-p}
    G --> H[Truncated<br/>distribution]
    H --> I[Sample<br/>next token]
    A --> J[Greedy argmax<br/>tau = 0]

The diamonds above are the tunable steps. The order matters: in the conventional pipeline, temperature divides the logits before softmax, while top-p, top-k, and min-p are defined on the probability distribution that comes out of softmax. Some engines let you reorder these steps, so check the documentation for yours. If you set temperature to 0, the truncation parameters have no effect because the distribution is already a delta function on the argmax token.
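
To make the ordering concrete, here is a minimal sketch of one decoding step in plain Python: temperature on the logits, then softmax, then a single truncation (top-p), then sampling. The function name, the default values, and the toy logits are assumptions for illustration; real engines work on tensors and may order or implement the steps differently.

```python
import math
import random

def sample_next_token(logits, tau=0.7, top_p=0.9, rng=random):
    """Temperature -> softmax -> top-p truncation -> sample. tau = 0 means greedy."""
    if tau == 0:
        # Greedy: the distribution collapses onto the argmax, so truncation is moot.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature divides the logits before softmax.
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-p keeps the most probable tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 3. Sample from the survivors; random.choices normalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [1.0, 3.0, 2.0, -5.0]  # made-up logits for a 4-token vocabulary
rng = random.Random(0)          # seeded only so the demo is repeatable
print([sample_next_token(logits, rng=rng) for _ in range(10)])
print(sample_next_token(logits, tau=0))  # greedy path ignores top_p entirely
```

Swapping the top-p loop for a top-k slice or a min-p floor changes only step 2; the temperature and sampling steps stay the same.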

The four knobs, explained from the inside

Temperature

Temperature is the oldest and most widely understood parameter. It divides the logits by tau before softmax:

P(token_i) = exp(logit_i / tau) / sum_j exp(logit_j / tau)

When tau = 1, this is the standard softmax. When tau approaches 0, the distribution converges to a one-hot vector on the highest-probability token (greedy decoding). When tau is above 1, the distribution flattens, making low-probability tokens more likely than the raw model intended.
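
The formula is short enough to try by hand. The following is a minimal sketch in plain Python, not any engine's implementation; the function name and the four logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau) as a list of probabilities (tau must be > 0)."""
    if tau <= 0:
        raise ValueError("tau must be positive; treat tau = 0 as greedy argmax instead")
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.0]  # made-up logits for a 4-token vocabulary
for tau in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}:", [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as tau grows while the ranking of the tokens never changes.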

Practical ranges: tau = 0 (deterministic, good for code generation or factual QA), tau = 0.1-0.3 (near-deterministic, useful for classification), tau = 0.6-0.9 (creative writing, conversational), tau = 1.0-1.5 (brainstorming, diverse generations). Above 1.5, output becomes increasingly incoherent because the flattened distribution gives meaningful probability to tokens the model itself rated as unlikely.

The critical property of temperature is that it is a distribution-wide transform. It does not prune any tokens; it just makes the probabilities more equal (tau > 1) or more unequal (tau < 1). This means tau > 1 can activate tokens that were essentially zero-probability in the raw distribution, including tokens that are misspellings, in the wrong language, or hallucinated -- because the model gave them low probability for a reason, and temperature is overriding that signal.

Top-p (nucleus sampling)

Top-p, introduced by Holtzman et al. in 2019, solves a specific problem with temperature: temperature alone does not truncate the vocabulary. At tau = 0.8, the model still assigns tiny nonzero probability to thousands of tokens, and sampling from that long tail produces unexpected tokens.

Top-p works by sorting tokens by probability descending, then keeping tokens from the top until their cumulative probability reaches p. If p = 0.9, it keeps the smallest set of top tokens that collectively account for at least 90% of the probability mass. This is adaptive: when the model is confident, top-p keeps few tokens; when uncertain, it keeps more.
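
The sort-and-accumulate procedure can be sketched in a few lines of plain Python. The function name and both example distributions are assumptions chosen to show the adaptive behavior; production engines do the same thing on tensors.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # made-up peaked distribution
uncertain = [0.30, 0.26, 0.20, 0.16, 0.08]  # made-up flat distribution

print("confident keeps", len(top_p_filter(confident, 0.9)), "token(s)")
print("uncertain keeps", len(top_p_filter(uncertain, 0.9)), "token(s)")
```

The same p yields a different number of survivors for each distribution, which is the adaptivity described above.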

Practical ranges: p = 0.8-0.95 for most generation tasks. Lower values (0.5-0.7) produce more focused outputs useful for factual QA. Values above 0.95 are close to no truncation at all. The surprising property of top-p is that it can be less restrictive than top-k in high-entropy distributions, because it adapts to the distribution shape.

Top-k

Top-k is the simplest truncation: keep only the k tokens with the highest probability and renormalize. A common default is k = 40 or k = 50, inherited from the early GPT-2 days.

The problem with top-k is that it is static. When the distribution is peaked (model is confident), k = 50 keeps many low-probability tokens that should have been truncated. When the distribution is flat (model is uncertain), k = 50 cuts off tokens that carry meaningful probability. Top-k works acceptably when you have tuned k for a specific domain and model, but it is fragile across models and tasks.
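
To see the static-k problem on a small scale, here is a minimal sketch in plain Python. The function name, the value k = 3, and the two five-token distributions are made up for illustration; the same effect appears with k = 50 on a real vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; also report the mass removed."""
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in kept)
    renormalized = {i: probs[i] / kept_mass for i in kept}
    return renormalized, 1.0 - kept_mass

peaked = [0.97, 0.01, 0.01, 0.005, 0.005]  # model is confident
flat = [0.2, 0.2, 0.2, 0.2, 0.2]           # model is uncertain

for name, probs in (("peaked", peaked), ("flat", flat)):
    survivors, removed = top_k_filter(probs, k=3)
    print(name, "kept tokens:", sorted(survivors), "mass removed:", round(removed, 3))
```

With the peaked distribution, the fixed k lets two 1% tokens through; with the flat one, it discards tokens that together carry a large share of the mass.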

Practical ranges: k = 10-50 for general generation. k = 1 is greedy (effectively tau = 0). k above 100 rarely removes much probability mass, because the tokens beyond the top 100 usually hold very little of it.

Min-p

Min-p, proposed by Nguyen et al. in 2024 (arXiv 2407.01082), addresses the static nature of top-k with an adaptive threshold. It works by setting a floor at (min_p * P_max), where P_max is the probability of the most likely token. Tokens below this floor are discarded, and the remaining distribution is renormalized.

If min_p = 0.1 and the top token has probability 0.6, the floor is 0.06. Any token below 0.06 probability is pruned. When the model is confident (top token near 1), the floor is high and few tokens survive. When the model is uncertain (top token at 0.3), the floor drops and more tokens pass through.
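
The worked example above translates directly into code. This is a minimal sketch in plain Python with made-up distributions and a hypothetical function name, not the reference implementation from the paper.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * P_max, then renormalize."""
    floor = min_p * max(probs)
    kept = [i for i, q in enumerate(probs) if q >= floor]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}, floor

confident = [0.60, 0.20, 0.10, 0.05, 0.03, 0.02]  # top token at 0.6
uncertain = [0.30, 0.25, 0.20, 0.15, 0.06, 0.04]  # top token at 0.3

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    survivors, floor = min_p_filter(probs, min_p=0.1)
    print(name, "floor:", round(floor, 3), "tokens kept:", len(survivors))
```

The floor moves with the top token's probability, so the same min_p prunes harder when the model is confident and more gently when it is not.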

Practical ranges: min_p = 0.01-0.2. Default recommendations from the paper are around 0.05-0.1 for a good balance of creativity and coherence. Values below 0.01 are close to no truncation. Values above 0.2 become very restrictive.

Comparison table

Parameter	What it does	Adaptive?	Common range	Best for	Key failure mode
Temperature	Scales logits before softmax	No	0 - 1.5	Controlling randomness/creativity	Enables low-probability tokens without discrimination
Top-p (nucleus)	Keeps top tokens up to cumulative probability p	Yes (adaptive count)	0.8 - 0.95	General generation when model confidence varies	Can be too permissive in flat (high-entropy) distributions
Top-k	Keeps only k highest-probability tokens	No (fixed count)	10 - 50	Legacy compatibility, simple tuning	Static; either too restrictive or too permissive
Min-p	Keeps tokens with prob >= min_p * P_max	Yes (adaptive threshold)	0.01 - 0.2	Production systems needing coherence + creativity	Less tested at very large scales

Sampling in practice: what combinations work

In production systems, sampling parameters are almost never used alone. The most common production recipe is:

Default for conversational agents: temperature = 0.7, top-p = 0.9, min-p = 0.05. This gives enough randomness for natural variation while the min-p floor prevents the model from wandering into very low-probability regions. Top-k is usually turned off (set to 0 or -1, depending on the engine, or to a high value like 200) because min-p and top-p already handle truncation more adaptively.

For code generation or structured output: temperature = 0.1-0.2, top-p = 0.95, min-p = 0.01. The low temperature concentrates most probability on the top few tokens while still leaving room to pick something other than the argmax when the model is truly uncertain (e.g., choosing a variable name). Top-p at 0.95 and min-p at 0.01 only trim the far tail.

For creative writing or brainstorming: temperature = 0.9-1.1, top-p = 0.95, min-p = 0.02. Slightly elevated temperature encourages variety. The generous top-p keeps the distribution wide. The low min-p exists mainly as a safety net against the worst long-tail tokens.

For classification or extraction: temperature = 0 (greedy), no truncation parameters needed. When the output space is a fixed set of labels, any sampling at all reduces accuracy. This is the rare case where no sampling parameter needs tuning at all.

Here is a Python snippet showing how to set these combinations with vLLM's SamplingParams:

from vllm import SamplingParams

# Conversational agent
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    min_p=0.05,
    max_tokens=1024,
    stop=["<|im_end|>"]
)

# Code generation
code_params = SamplingParams(
    temperature=0.1,
    top_p=0.95,
    min_p=0.01,
    max_tokens=2048
)

# Classification (deterministic)
classify_params = SamplingParams(
    temperature=0.0,
    max_tokens=16
)

Common pitfalls

Stacking truncation parameters without understanding the interaction. Setting top-p to 0.9 and top-k to 50 at the same time means two truncations fire sequentially. If top-p runs first and keeps 30 tokens, top-k at 50 then does nothing. If top-k runs first and keeps 50, top-p may trim them further. The effective behavior depends on which truncation applies first. Many engines apply top-k first, then top-p, then min-p, but the order is not universal. If you set all three, you are relying on an ordering you may not remember next month. Pick at most two truncation methods.
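
The order dependence is easy to reproduce on a toy distribution. The sketch below assumes each truncation renormalizes before the next one runs, which is effectively what happens when an engine masks logits and recomputes softmax; the helper names and the five-token distribution are made up for illustration.

```python
def renormalize(dist):
    total = sum(dist.values())
    return {tok: q / total for tok, q in dist.items()}

def top_k(dist, k):
    """Keep the k most probable tokens, then renormalize."""
    kept = sorted(dist, key=dist.get, reverse=True)[:k]
    return renormalize({tok: dist[tok] for tok in kept})

def top_p(dist, p):
    """Keep top tokens until their cumulative probability reaches p, then renormalize."""
    kept, cumulative = {}, 0.0
    for tok in sorted(dist, key=dist.get, reverse=True):
        kept[tok] = dist[tok]
        cumulative += dist[tok]
        if cumulative >= p:
            break
    return renormalize(kept)

dist = {"a": 0.4, "b": 0.3, "c": 0.1, "d": 0.1, "e": 0.1}

k_then_p = top_p(top_k(dist, 3), 0.75)  # top-k first, then top-p
p_then_k = top_k(top_p(dist, 0.75), 3)  # top-p first, then top-k

print("top-k then top-p keeps:", sorted(k_then_p))
print("top-p then top-k keeps:", sorted(p_then_k))
```

Same two parameters, different surviving sets: after top-k renormalizes, the first two tokens already cover the top-p threshold, so the third is dropped in one order and kept in the other.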

Setting temperature above 1.5 and expecting coherence. Temperature is not a creativity dial. Above 1.5, the model assigns significant probability to tokens it considers extremely unlikely. The outputs may appear creative but are actually random. If you need diverse outputs, try increasing top-p or lowering min-p instead of pushing temperature beyond 1.2.
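
A toy calculation shows why high temperature degrades so quickly. The sketch below uses a made-up vocabulary of five plausible tokens and 5000 implausible ones; the logit values are assumptions chosen for illustration, not measurements from a real model.

```python
import math

def softmax(logits, tau):
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary: 5 plausible tokens followed by 5000 implausible ones.
logits = [10.0, 9.0, 8.5, 8.0, 7.5] + [0.0] * 5000

def tail_mass(tau):
    """Total probability that lands on the 5000 implausible tokens."""
    return sum(softmax(logits, tau)[5:])

for tau in (0.7, 1.0, 1.5, 2.0):
    print(f"tau={tau}: probability of sampling an implausible token = {tail_mass(tau):.3f}")
```

Each tail token stays individually unlikely, but there are thousands of them, so their combined mass grows quickly once the distribution flattens.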

Using top-k as the only sampler. This is the most common mistake I see in deployed services. A static k cannot adapt to the distribution. At k=50, sometimes you keep garbage and sometimes you cut off the valid tail. If you must use top-k alone, set k conservatively (10-20) and accept that you are leaving performance on the table.

Forgetting that temperature 0 disables all sampling. If temperature is 0, the model always picks the argmax token. Top-p, top-k, and min-p have no effect because there is no distribution to truncate. If you see "temperature=0, top_p=0.95" in a config, the top_p is dead code.
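
A config check for this is a few lines. The sketch below is a hypothetical helper, not part of any inference engine; it assumes the config is a plain dict using the key names temperature, top_p, top_k, and min_p, and that a missing temperature means the engine default is nonzero.

```python
def dead_sampling_params(config):
    """Return truncation keys that cannot matter because temperature is 0."""
    # Assumption: a missing temperature means a nonzero engine default.
    if config.get("temperature", 1.0) != 0:
        return []
    truncation_keys = ("top_p", "top_k", "min_p")
    return [key for key in truncation_keys if key in config]

print(dead_sampling_params({"temperature": 0, "top_p": 0.95}))
print(dead_sampling_params({"temperature": 0.7, "top_p": 0.9, "min_p": 0.05}))
```

Running a check like this in CI or at service startup flags configs where someone tuned a parameter that greedy decoding ignores.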

Applying sampling parameters incorrectly in batched inference. Some inference engines share sampling parameters across all sequences in a batch. In such setups, a per-request temperature override that conflicts with the batch default can silently fall back to the default. Always verify that per-request sampling overrides are actually wired through the batching layer.

When NOT to use it

Sampling parameters should not be the primary tool for improving output quality if:

Your outputs are incoherent at temperature 0. Sampling parameters cannot fix a model that produces bad output even when it is maximally deterministic. If greedy decoding gives poor results, the problem is in the model, the prompt, or the training data, not in the sampling strategy. Add more examples to the prompt or improve the fine-tuning data before touching sampling parameters.
You need guaranteed structured output. Sampling introduces nondeterminism. If the application requires valid JSON, a specific schema, or exact string matching, use constrained decoding (grammar-guided generation or JSON mode) instead of hoping the right parameters keep the output valid. Sampling parameters can reduce the rate of malformed output but cannot eliminate it.
You are running a benchmark or eval. Most papers and leaderboards use greedy decoding (temperature 0) or a tightly controlled sampling procedure. If you compare a model at temperature 0.7 against another at temperature 0, you are measuring sampling strategy differences, not model quality differences. For evaluation, use deterministic settings and control for temperature as a variable.
You have not measured the output quality. Before tuning sampling parameters, establish a metric -- accuracy on a held-out set, human preference ratings, or a task-specific score. Without a metric, every sampling parameter change is cargo-culting. Measure first, tune second.
Your application uses speculative decoding. Speculative decoding's acceptance rate is typically highest at or near greedy decoding and tends to fall as temperature rises. If throughput is critical and you use speculation, the optimal temperature may be lower than you would choose for quality alone. Benchmark the throughput-quality tradeoff explicitly.
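
The "measure first, tune second" point from the list above can be sketched as a small sweep harness. Everything here is hypothetical: `sweep_sampling_settings` is an illustrative name, and `fake_generate` and `exact_match_upper` are placeholder stand-ins with no real meaning that exist only so the sketch runs. Replace them with your actual model call and task metric.

```python
def sweep_sampling_settings(settings, generate, score, prompts):
    """Average a task metric over the prompts for each named sampling setting."""
    results = {}
    for name, params in settings.items():
        scores = [score(prompt, generate(prompt, **params)) for prompt in prompts]
        results[name] = sum(scores) / len(scores)
    return results

# Placeholder model call: NOT a real model, just deterministic toy behavior.
def fake_generate(prompt, temperature=1.0, top_p=1.0):
    return prompt.upper() if temperature == 0 else prompt

# Placeholder metric: 1.0 when the output is the upper-cased prompt.
def exact_match_upper(prompt, output):
    return 1.0 if output == prompt.upper() else 0.0

settings = {
    "greedy": {"temperature": 0.0},
    "chat": {"temperature": 0.7, "top_p": 0.9},
}
print(sweep_sampling_settings(settings, fake_generate, exact_match_upper, ["abc", "def"]))
```

The shape is what matters: a fixed prompt set, a fixed metric, and one score per named setting, so a parameter change shows up as a number rather than an impression.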

TL;DR

Temperature scales logits before softmax. It is the only knob that reshapes the entire distribution without pruning any tokens. Use it to control randomness, from 0 (deterministic) to ~1.2 (max practical creativity).
Top-p keeps the smallest set of top tokens that covers a fraction p of the probability mass. It adapts to distribution shape and is the most popular general-purpose truncation.
Top-k keeps the top k tokens regardless of their probabilities. It is simple but fragile across inputs. Prefer top-p or min-p unless you have a specific reason for a fixed count.
Min-p keeps tokens whose probability is at least a fraction of the top-token probability. It is the most adaptive truncation and works well as a safety net alongside temperature and top-p.
Best production combo for most use cases: temperature 0.7 + top-p 0.9 + min-p 0.05. Drop top-k entirely. For structured output, use constrained decoding instead of sampling tricks.
Never tune sampling parameters without a metric. Greedy decoding (tau=0) is the first thing to check. If greedy fails, sampling parameters will not save you.

The MCP (Model Context Protocol) has been called the missing standard for tool integration, but the real question is what it costs in latency, reliability, and debuggability. Next post: a production-oriented walkthrough of MCP -- how tool calls flow through the protocol, where the serialization overhead lives, and what the current ecosystem actually supports.
