Chain-of-Thought, Test-Time Compute & Multi-Step Reasoning
From CoT and self-consistency to tree search over verbal states and modern reasoning-tuned models. When step-by-step prompts help, when they add latency and variance, and how to evaluate and ship reasoning in production (verifiers, judges, routing) without leaking competitive or customer detail.
Reasoning Is a Systems Problem, Not a Prompting Meme
Chain-of-thought (CoT) prompting (Wei et al., 2022) elicits intermediate natural-language steps before the final answer. It improves many multi-step tasks (math word problems, structured logic, some coding) because the autoregressive model spends more serial decode steps before committing to a label: a form of implicit test-time compute, not a proof of reasoning.
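To make the mechanism concrete, here is a minimal sketch of the two prompt shapes; the `ANSWER:` convention and the example question are illustrative, not from any paper:

```python
# Direct vs. zero-shot-CoT prompt shapes. Pure strings, no API call;
# plug them into whatever chat-completion client you actually use.

QUESTION = ("A train leaves at 9:40 and arrives at 12:05. "
            "How long is the trip in minutes?")

direct_prompt = f"{QUESTION}\nAnswer with a single number."

cot_prompt = (
    f"{QUESTION}\n"
    "Think step by step, then give the final answer on the last line, "
    "formatted exactly as: ANSWER: <number>"
)
# The CoT variant buys extra serial decode steps (and output tokens)
# before the model commits to a label -- that is the whole mechanism.
```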
Senior interviews expect you to go beyond "add think step by step." You must connect CoT to cost and P95 latency (operations), sampling variance and judge design (evaluation), overthinking and systematic algebraic mistakes (reliability), and user-visible vs. internal scratchpads (product, legal). Reasoning-tuned models (in industry practice: RL or process-supervised post-training on math/code) make the same trade as test-time compute: spend more wall-clock at inference to raise quality on the hard tail, without necessarily increasing parameter count (the typical framing since ~2024).
What interviewers are really testing
- **Mid level:** defines CoT; names GSM8K/MATH; knows that higher temperature adds variance across samples.
- **Senior level:** explains self-consistency (k samples, vote on the parsed answer), when majority voting is invalid, and that CoT can hurt simple tasks.
- **Staff level:** describes routers (easy path vs. high-compute path), executable verifiers vs. LLM judges, regression decks for reasoning, and a threat model for exposed chains (PII, IP, prompt injection via tools).
Clarify before you enable heavy reasoning
- **Task class:** is the correct behavior truly multi-step? A single-fact lookup should not pay a 500-token CoT tax.
- **Ground truth:** can you verify with code, a SQL result, or a unit test, or only with another LLM? Deterministic checks define what you can ship under SLAs.
- **User-visible chain:** support and education may need explanations; a pricing or negotiation agent may be forbidden to leak its scratchpad to end users or competitors.
- **Latency and dollars:** each added reasoning token is serial decode time, and k-sample self-consistency multiplies output-token spend by roughly k in the naive design (see the cost sketch after this list).
- **Tool and injection risk:** long CoT in agent loops mixes user text, RAG chunks, and tool output, putting it in the same untrusted-text-in-a-trusted-context risk class as any agent.
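Back-of-the-envelope numbers for the latency-and-dollars bullet; every constant below is a made-up placeholder, not a vendor price:

```python
# Naive k-sample self-consistency cost/latency, under placeholder
# assumptions (NOT real prices or speeds -- substitute your own).

PRICE_PER_1K_OUT = 0.015   # $ per 1K output tokens (placeholder)
TOKENS_PER_CHAIN = 400     # average CoT length in output tokens (placeholder)
DECODE_TOK_PER_S = 60      # serial decode speed, tokens/s (placeholder)

def sc_cost(k: int, parallel: bool = True) -> tuple[float, float]:
    """Return ($ per query, wall-clock seconds) for k sampled chains."""
    dollars = k * TOKENS_PER_CHAIN / 1000 * PRICE_PER_1K_OUT
    # Parallel sampling pays k x tokens but ~1 chain of latency;
    # sequential sampling multiplies wall-clock by k as well.
    seconds = TOKENS_PER_CHAIN / DECODE_TOK_PER_S * (1 if parallel else k)
    return dollars, seconds

for k in (1, 5, 20):
    print(k, sc_cost(k))
# k=1 -> ($0.006, ~6.7s); k=20 -> ($0.12, ~6.7s parallel, ~133s sequential)
```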
From one CoT trace to self-consistency and search
Self-Consistency and When Voting Fails
Self-consistency (Wang et al., 2022) samples k independent CoT completions (non-zero temperature or nucleus sampling), parses a final answer from each chain, and returns the majority answer. It cuts uncorrelated slip errors on tasks with a discrete, canonical final label (e.g. normalized numerics, multiple-choice letter).
Voting fails when: (1) the output is free-form and there is no stable equality check over answers; (2) all chains make the same systematic error (independence is violated); (3) the SLA is too tight for k full decodes; (4) you need calibrated probabilities, not a mode. Using an LLM-as-judge to pick the best chain is better than nothing but reintroduces position bias and model blind spots; prefer verifiers (Python, SQL, typecheck) wherever the environment allows.
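A minimal self-consistency sketch, assuming the `ANSWER: <number>` convention from the earlier prompt example and a caller-supplied `complete(prompt, temperature=...)` function wrapping your actual client:

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"ANSWER:\s*(-?\d+(?:\.\d+)?)")

def parse_answer(chain: str) -> str | None:
    """Extract and normalize the final label so '7' and '7.0' vote together."""
    m = ANSWER_RE.search(chain)
    return str(float(m.group(1))) if m else None

def self_consistency(prompt: str, complete, k: int = 10,
                     temperature: float = 0.7) -> str | None:
    """Sample k chains at non-zero temperature, majority-vote parsed answers."""
    votes = Counter()
    for _ in range(k):
        ans = parse_answer(complete(prompt, temperature=temperature))
        if ans is not None:
            votes[ans] += 1
    # Valid only because answers are discrete and comparable; a vote
    # over free-form text or correlated errors proves nothing.
    return votes.most_common(1)[0][0] if votes else None
```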
Reasoning pattern vs. when to use it in production
| Pattern | Best for | Main cost | Caveat |
|---|---|---|---|
| Zero-shot CoT | Wide baseline on reasoning benchmarks; quick A/B | 2–4× output tokens vs. direct answer; higher P95 latency | Can add spurious steps on *easy* questions — route away |
| Few-shot CoT (exemplar rationales in prompt) | Stabilizes *format* and decomposition | Longer *input* context; cache exemplars in a prompt library | Template drift when the base model is upgraded |
| Self-consistency (k=5–20 is common in papers) | Math / MCQ with parseable final answers | ~k× generation spend + aggregation CPU | Useless for essays; correlated errors break the vote |
| Tree-of-thought / search over verbal steps | Tasks with *scorable* partial states (games, code with tests) | Branching factor × depth × model calls | Without a **prune signal**, this is just expensive fan-out |
| Reasoning-tuned + long hidden chains (vendor 'reasoning' SKUs) | Hard single-call math/code without hand-built orchestration | High $ and latency per *hard* query (typical) | Must **route** *easy* traffic to a cheaper path |
| Staff default | Short LM steps *around* executors: REPL, sympy, SQL | You pay engineering for verifiers, not for English paragraphs | Most reliable *truth* is still **symbols + tools** (see the sketch below) |
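One way to make the "symbols + tools" row concrete: treat the model's proposal as untrusted text and let an executor decide. A sketch with sympy as the verifier; the quadratic and the accept/reject policy are illustrative:

```python
# Verifier-first pattern: the LM proposes, symbols decide.
import sympy as sp

x = sp.Symbol("x")
equation = sp.Eq(x**2 - 5*x + 6, 0)   # illustrative task

def verify_roots(proposed: list[int]) -> bool:
    """True iff the proposed roots are exactly the equation's solutions."""
    return set(map(sp.Integer, proposed)) == set(sp.solve(equation, x))

print(verify_roots([2, 3]))   # True  -> ship the answer
print(verify_roots([2, 4]))   # False -> reject, resample, or escalate
```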
Test-Time Compute and Reasoning-Tuned Models
Test-time compute is the 2024+ industry term for spending more inference work (longer internal chains, more samples, or search) to improve the per-query outcome. Reasoning-tuned models (post-trained with RL on outcome labels, process supervision, or long-CoT distillation on math/code) often raise ceilings on contest math and code beyond what parameter scaling alone delivers.
In PEARL terms: the L (latency / cost) conversation must include routing: send a query to the expensive reasoning tier only when a cheap classifier, a heuristic, or a first-pass cheap model flags it as hard. Track $ per resolved task, not only $/1M tokens.
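A minimal routing sketch; the hardness heuristic, the two-tier split, and the metric helper are illustrative assumptions, not any vendor's API or a real product's policy:

```python
# Routing sketch: only flagged-hard queries reach the expensive tier.
# The marker list is a stand-in for a cheap classifier or a
# first-pass small model.

HARD_MARKERS = ("prove", "derive", "step by step", "optimize", "debug")

def looks_hard(query: str) -> bool:
    q = query.lower()
    return len(q.split()) > 40 or any(m in q for m in HARD_MARKERS)

def route(query: str, cheap_model, reasoning_model) -> str:
    """cheap_model / reasoning_model are callables wrapping your two tiers."""
    return (reasoning_model if looks_hard(query) else cheap_model)(query)

def dollars_per_resolved(total_spend: float, resolved_tasks: int) -> float:
    """The metric that matters: $ per resolved task, not $/1M tokens."""
    return total_spend / max(resolved_tasks, 1)
```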
Where the answer must come from (verify plane)
CoT can *reduce* quality on the wrong tasks
For single-step or format-trivial tasks, forcing a long scratchpad adds verbosity and can worsen accuracy when the model invents irrelevant constraints (overthinking). Always A/B on your distribution; GSM8K-style curves do not replace product metrics.
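The only defensible answer to "does CoT help here" is an A/B over your own labeled canary set. A minimal harness, reusing `parse_answer` and the hypothetical `complete` from the self-consistency sketch; gold labels are assumed to use the same normalization:

```python
def accuracy(arm_prompt_fn, dataset, complete) -> float:
    """dataset: list of (question, gold) pairs, gold normalized like parse_answer."""
    hits = 0
    for question, gold in dataset:
        pred = parse_answer(complete(arm_prompt_fn(question), temperature=0))
        hits += (pred == gold)
    return hits / len(dataset)

# direct_arm = lambda q: f"{q}\nAnswer with: ANSWER: <number>"
# cot_arm    = lambda q: f"{q}\nThink step by step, end with: ANSWER: <number>"
# Ship the CoT arm only if it wins here AND the extra tokens fit the SLA.
```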
60-second interview close
"I route easy queries to a cheap path, spend k-sample or long-chain budget only on flagged hard prompts, verify with code/SQL when possible, regress on a private canary deck, and treat exposed CoT as a leak and injection surface — tools and hidden scratchpads are the default for prod unless the product needs a teaching trace."