
GenAI & Agents · Advanced

Knowledge Distillation for LLMs: Logit KD, Context Distillation, and Speculative Decoding Pairing

Classical KD minimized KL between student and teacher logits on a fixed dataset; LLM-era variants distill **reasoning traces**, **tool-use format**, or **chain-of-thought** into smaller models — or pair student draft models with teacher verification in speculative decoding. Covers when offline KD beats RLHF, sequence-level distillation pitfalls, and latency-quality tradeoffs for on-device assistants.

46 min read · 2 sections · 1 interview question
Knowledge Distillation · LLM Compression · Logit Distillation · Context Distillation · Speculative Decoding · Student–Teacher · KL Divergence · On-Device LLM · DistilBERT · MiniCPM · vLLM · Inference Optimization

What Changes When the Teacher Is a Frontier LLM

Classic KD (Hinton et al., 2015): train the student to match the teacher's temperature-smoothed softmax outputs on the same inputs — the "dark knowledge" in the teacher's inter-class probabilities gives the student a richer training signal than hard labels alone.
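A minimal pure-Python sketch of that loss (the function names here are illustrative, not from any library): temperature-smoothed softmax on both sets of logits, then KL(teacher ‖ student), scaled by T² per the Hinton et al. convention so gradient magnitudes stay comparable across temperatures.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-smoothed softmax: higher T flattens the distribution,
    # exposing "dark knowledge" in the non-argmax classes.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-smoothed distributions,
    # scaled by T^2 so the gradient scale is comparable across T values.
    p = softmax(teacher_logits, T)  # teacher = target distribution
    q = softmax(student_logits, T)  # student = model being trained
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl
```

In a real training loop this term is typically mixed with the ordinary cross-entropy on gold labels, weighted by a hyperparameter α.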

LLM KD often targets sequences instead: distill rationales, JSON tool-call syntax, or refusal styles via supervised fine-tuning on teacher-generated outputs — closer to imitation learning than to literal logit matching, since teacher APIs often expose text but not logits.
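When only the teacher's sampled tokens are available, sequence-level KD collapses to next-token cross-entropy on those tokens as hard labels. A toy sketch under that assumption (the function and its argument layout are illustrative): `student_token_probs[t]` is the student's distribution over the vocabulary at step t, and the loss is the negative log-likelihood of the teacher's token at each step.

```python
import math

def sequence_nll(student_token_probs, teacher_tokens):
    # Sequence-level KD with an API teacher: no soft targets, just the
    # teacher's sampled token ids as hard labels. The loss is the
    # student's negative log-likelihood of the teacher's sequence.
    return -sum(math.log(student_token_probs[t][tok])
                for t, tok in enumerate(teacher_tokens))
```

This is the loss behind "train on teacher generations" SFT pipelines; the pitfall flagged in the title is that the student imitates surface form and can inherit teacher errors without the calibration signal that soft logits would carry.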

Interviewers probe whether you know the tradeoffs between logit KD and behavior cloning, and whether you understand speculative decoding, where a draft student accelerates a target teacher without degrading output quality — the target model verifies every drafted token, so the output distribution is preserved.
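A minimal sketch of one speculative-decoding round, assuming the draft model has already proposed tokens and the target model has scored the same positions (names and the single-round framing are illustrative; the residual-distribution resample after a rejection is omitted for brevity):

```python
import random

def speculative_step(draft_probs, target_probs, draft_tokens, rng=random.random):
    # Accept each drafted token with probability min(1, p_target / p_draft);
    # the first rejection truncates the draft. This acceptance rule is what
    # keeps the final samples distributed as if the target model had decoded
    # alone (after resampling the rejected position from the residual
    # distribution, not shown here).
    accepted = []
    for t, tok in enumerate(draft_tokens):
        p = target_probs[t][tok]  # target model's prob of the drafted token
        q = draft_probs[t][tok]   # draft model's prob of the same token
        if rng() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted
```

The latency win comes from verifying a whole draft in one parallel target forward pass instead of one pass per token; a well-distilled student raises the acceptance rate, which is where KD and speculative decoding pair up.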
