
Chain-of-Thought, Test-Time Compute & Multi-Step Reasoning

From CoT and self-consistency to tree search over verbal states and modern reasoning-tuned models. When step-by-step prompts help, when they add latency and variance, and how to evaluate and ship reasoning in production (verifiers, judges, routing) without leaking competitive or customer detail.


Reasoning Is a Systems Problem, Not a Prompting Meme

Chain-of-thought (CoT) prompting (Wei et al., 2022) elicits intermediate natural-language steps before the final answer. It improves many multi-step tasks (math word problems, structured logic, some coding) because the autoregressive model spends more serial decode steps on the generation plane before committing to a label — a form of implicit test-time work, not a proof.
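Mechanically, zero-shot CoT is nothing more than a suffix on the prompt. A minimal sketch (the provider call is deliberately omitted, so nothing here is vendor-specific; the function names are illustrative):

```python
# Zero-shot CoT vs. direct prompting: the only difference is the trigger
# phrase appended after the question, which makes the model decode
# intermediate steps before committing to a final label.

COT_TRIGGER = "Let's think step by step."

def direct_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # Same question; the trigger elicits intermediate reasoning tokens.
    return f"Q: {question}\nA: {COT_TRIGGER}"

q = "A train travels 60 km in 45 minutes. What is its speed in km/h?"
print(cot_prompt(q))
```

Either string would then be sent to the same completion endpoint; the CoT variant simply buys more serial decode steps per query.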

Senior interviews expect you to go beyond "add think step by step." You must connect CoT to cost and P95 latency (operations), sampling variance and judge design (evaluation), overthinking and systematic algebraic mistakes (reliability), and user-visible vs. internal scratchpads (product, legal). Reasoning-tuned models (industry: RL or process-supervised post-training on math/code) make the same trade as test-time compute: spend more wall-clock at inference to raise quality on the hard tail, without always increasing parameter count (typical framing since ~2024).

IMPORTANT

What interviewers are really testing

Mid level: defines CoT; names GSM8K/MATH; knows high temperature adds variance to samples.

Senior level: explains self-consistency (k samples, vote on parsed answer), when majority vote is invalid, and that CoT can hurt simple tasks.

Staff level: describes routers (easy path vs. high-compute path), executable verifiers vs. LLM judges, regression decks for reasoning, and a threat model for exposed chains (PII, IP, prompt injection in tools).

Clarify before you enable heavy reasoning

01

Task class

Is the correct behavior truly multi-step? A single-fact lookup should not pay a 500-token CoT tax.

02

Ground truth

Can you verify with code, a SQL result, a unit test, or only with another LLM? Deterministic checks define what you can ship under SLAs.

03

User-visible chain

Support and education may need explanations; a pricing or negotiation agent may be forbidden to leak its scratchpad to end users or competitors.

04

Latency and dollars

Each added reasoning token is serial decode time. k-sample self-consistency multiplies output token spend by ~k in the naïve design.

05

Tool and injection risk

Long CoT in agent loops mixes user text, RAG content, and tool output: the same untrusted-text-in-a-trusted-context risk class as any other agent.
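The ground-truth question (item 02 above) is the one that most changes the architecture. When a deterministic check exists, the verifier is trivial; a minimal sketch, where `candidate_sort` is a hypothetical stand-in for model-generated code:

```python
# Executable verifier sketch: accept a candidate only if a deterministic
# check passes. Here the check is a unit-test table; in production it could
# equally be a SQL row count, a typecheck, or an assertion suite.

def run_verifier(fn, cases) -> bool:
    """Return True iff fn reproduces every (input, expected) pair."""
    try:
        return all(fn(x) == want for x, want in cases)
    except Exception:
        return False  # crashes count as failures, never as passes

candidate_sort = lambda xs: sorted(xs)  # pretend this came from the model
cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([5], [5])]
print(run_verifier(candidate_sort, cases))  # → True
```

Note the `except` clause: a verifier that lets exceptions propagate can be gamed by candidates that crash before the check runs.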

Self-Consistency and When Voting Fails

Self-consistency (Wang et al., 2022) samples k independent CoT completions (non-zero temperature or nucleus sampling), parses a final answer from each chain, and returns the majority answer. It cuts uncorrelated slip errors on tasks with a discrete, canonical final label (e.g. normalized numerics, multiple-choice letter).

Voting fails when: (1) the output is free-form and there is no stable equality on answers; (2) all chains make the same systematic error (violated independence); (3) the SLA is too tight for k full decodes; (4) you need calibrated probabilities, not a mode. LLM-as-judge to pick the best chain is better than nothing but reintroduces position bias and model blind spots — prefer verifiers (Python, SQL, typecheck) wherever the environment allows.
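The parse-then-vote loop is small enough to sketch end to end. The hard-coded chains below stand in for k temperature-sampled completions, and the answer regex is an illustrative choice, not a standard:

```python
import re
from collections import Counter

# Self-consistency sketch: parse a canonical final answer from each sampled
# chain, then return the mode. Unparseable chains abstain rather than vote.

ANSWER_RE = re.compile(r"answer is\s*\$?(-?\d+(?:\.\d+)?)", re.IGNORECASE)

def parse_answer(chain: str):
    m = ANSWER_RE.search(chain)
    return float(m.group(1)) if m else None  # None = abstain

def majority_vote(chains):
    votes = Counter(a for a in map(parse_answer, chains) if a is not None)
    return votes.most_common(1)[0][0] if votes else None

chains = [  # stand-ins for k independent sampled CoT completions
    "16 - 3 - 4 = 9, 9 * 2 = 18, so the answer is 18",
    "16 - 7 = 9 eggs sold at $2 each: the answer is 18",
    "16 - 3 = 13 sold at $2: the answer is 26",  # one slip error
]
print(majority_vote(chains))  # → 18.0
```

The normalization step (here, `float(...)`) is where voting quietly breaks in practice: "18", "$18", and "18.0" must collapse to one equivalence class or the vote fragments.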

Reasoning pattern vs. when to use it in production

| Pattern | Best for | Main cost | Caveat |
| --- | --- | --- | --- |
| Zero-shot CoT | Wide baseline on reasoning benchmarks; quick A/B | 2–4× output tokens vs. direct answer; higher P95 latency | Can add spurious steps on *easy* questions; route them away |
| Few-shot CoT (exemplar rationales in prompt) | Stabilizes *format* and decomposition | Longer *input* context; cache exemplars in a prompt library | Template drift when the base model is upgraded |
| Self-consistency (k=5–20 is common in papers) | Math / MCQ with parseable final answers | ~k× generation spend + aggregation CPU | Useless for essays; correlated errors break the vote |
| Tree-of-thought / search over verbal steps | Tasks with *scorable* partial states (games, code with tests) | Branching factor × depth × model calls | Without a **prune signal**, this is just expensive fan-out |
| Reasoning-tuned + long hidden chains (vendor 'reasoning' SKUs) | Hard single-step math/code without hand-built orchestration | High $ and latency per *hard* query (typical) | Must **route** *easy* traffic to a cheaper path |
| Staff default | Short LM steps *around* executors: REPL, sympy, SQL | You pay engineering for verifiers, not for English paragraphs | Most reliable *truth* is still **symbols + tools** |
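The tree-of-thought row's "prune signal" caveat is worth making concrete. A toy beam search, where the task (reach a target number from 1 via +3 / *2 steps) and the distance-to-target scorer are both illustrative assumptions:

```python
import heapq

# Tree-of-thought-style beam search over partial states. The point is
# structural: `score` is the prune signal that keeps the frontier narrow;
# without it, this degenerates into the expensive fan-out the table warns
# about.

def expand(state):
    value, steps = state
    return [(value + 3, steps + ["+3"]), (value * 2, steps + ["*2"])]

def score(state, target):
    return abs(target - state[0])  # lower is better: the prune signal

def beam_search(target, beam_width=3, max_depth=6):
    frontier = [(1, [])]  # start from value 1 with an empty plan
    for _ in range(max_depth):
        children = [c for s in frontier for c in expand(s)]
        for value, steps in children:
            if value == target:
                return steps
        # keep only the beam_width most promising partial states
        frontier = heapq.nsmallest(beam_width, children,
                                   key=lambda s: score(s, target))
    return None  # budget exhausted without reaching the target

print(beam_search(11))  # → ['+3', '*2', '+3']  (1+3=4, 4*2=8, 8+3=11)
```

In LLM-backed ToT, `expand` is a sampling call and `score` is a value prompt or a partial executable check; the control loop is unchanged.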

Test-Time Compute and Reasoning-Tuned Models

Test-time compute is the 2024+ industry term for: spend more inference work (longer internal chains, more samples, or search) to improve the per-query outcome. Reasoning-tuned models (post-trained with RL on outcome labels, process supervision, or long CoT distillation on math/code) often raise ceilings on contest math and code without matching parameter scaling alone.

In PEARL terms: the L (latency / cost) interview must include routing: only send a query to an expensive reasoning tier when a cheap classifier, a heuristic, or a first-pass cheap model flags it as hard. Track $ per resolved task, not only $/1M tokens.
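A router can be sketched in a few lines. The lexical difficulty cues and the per-query prices below are illustrative assumptions, not vendor numbers; real gates use a small model or first-pass confidence:

```python
# Routing sketch: a cheap difficulty gate decides which queries pay for
# the reasoning tier, and spend is tracked in dollars across the mix.

CHEAP_COST, REASONING_COST = 0.002, 0.06  # $ per query, made-up figures

def looks_hard(query: str) -> bool:
    # Stand-in classifier: crude lexical cues for multi-step work.
    cues = ("prove", "step", "how many", "optimize", "why")
    return len(query) > 200 or any(c in query.lower() for c in cues)

def route(query: str) -> str:
    return "reasoning_tier" if looks_hard(query) else "cheap_tier"

def spend(queries) -> float:
    # Dollars across the traffic mix, not just $/1M tokens.
    return sum(REASONING_COST if route(q) == "reasoning_tier" else CHEAP_COST
               for q in queries)

batch = ["capital of France?", "Prove that the sum of two odds is even."]
print(route(batch[0]), route(batch[1]), round(spend(batch), 3))
```

The per-resolved-task metric falls out naturally: divide `spend` by the number of queries the downstream verifier actually accepts.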

Where the answer must come from (verify plane)

⚠ WARNING

CoT can *reduce* quality on the wrong tasks

For single-step or format-trivial tasks, forcing a long scratchpad adds verbosity and can worsen accuracy when the model invents irrelevant constraints (overthinking). Always A/B on your distribution; GSM8K-style curves do not replace product metrics.

TIP

60-second interview close

"I route easy queries to a cheap path, spend k-sample or long-chain budget only on flagged hard prompts, verify with code/SQL when possible, regress on a private canary deck, and treat exposed CoT as a leak and injection surfacetools and hidden scratchpads are the default for prod unless the product needs a teaching trace."
