Knowledge Distillation for LLMs: Logit KD, Context Distillation, and Speculative Decoding Pairing
Classical KD minimizes the KL divergence between the teacher's and student's temperature-softened output distributions on a fixed dataset. LLM-era variants instead distill **reasoning traces**, **tool-use format**, or **chain-of-thought** into smaller models, or pair a student draft model with teacher verification in speculative decoding. Covers when offline KD beats RLHF, sequence-level distillation pitfalls, and latency-quality tradeoffs for on-device assistants.
What Changes When the Teacher Is a Frontier LLM
Classic KD (Hinton et al., 2015): train the student to match the teacher's temperature-softened softmax outputs on the same inputs; the "dark knowledge" in the teacher's inter-class probabilities gives the student a richer training signal than hard labels alone.
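A minimal sketch of that loss in PyTorch, assuming `student_logits` and `teacher_logits` are `[batch, num_classes]` tensors computed on the same inputs; `T` (temperature) and `alpha` (mixing weight) are the usual hyperparameters:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soften both distributions with temperature T; the soft teacher
    # probabilities carry the "dark knowledge" about inter-class similarity.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Scale the KL term by T^2 so gradient magnitudes stay comparable
    # across temperatures (as in Hinton et al., 2015).
    distill = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
    # Standard cross-entropy on hard labels keeps the student grounded.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard
```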
LLM KD often targets sequences instead: distill rationales, JSON tool syntax, or refusal styles via supervised fine-tuning on teacher-generated outputs. This is closer to imitation learning (behavior cloning) than literal logit matching, since teacher APIs frequently hide logits.
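A hedged sketch of that sequence-level recipe, assuming a Hugging Face-style causal LM as the student and a hypothetical `teacher_generate` callable wrapping an API-only teacher (note that no teacher logits are needed at any point):

```python
import torch

def distill_step(student, tokenizer, prompts, teacher_generate, optimizer):
    # 1) Query the (possibly API-only) teacher for completions.
    completions = [teacher_generate(p) for p in prompts]
    texts = [p + c for p, c in zip(prompts, completions)]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    # 2) Ordinary next-token cross-entropy on the teacher's text; padding
    # is masked out with -100. (Many recipes also mask the prompt tokens
    # so the student is only trained on the completion.)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    out = student(**batch, labels=labels)  # HF models shift labels internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```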
Interviewers probe whether you know the tradeoffs between logit KD and behavior cloning, and whether you understand speculative decoding, where a draft student accelerates a target teacher at inference time; because the teacher verifies every drafted token, output quality is preserved rather than permanently traded away as in standalone distillation.
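A toy sketch of the accept/reject rule at the heart of speculative decoding (Leviathan et al., 2023), assuming `p_draft` and `p_target` hold the probabilities each model assigned to the drafted tokens; in a real decoder, the first rejected position is resampled from the target's residual distribution, which preserves the target model's exact output distribution:

```python
import torch

def verify_draft(draft_tokens, p_draft, p_target):
    """Accept each drafted token with probability min(1, p_target / p_draft).
    On the first rejection, the caller resamples from the normalized
    residual max(0, p_target - p_draft), so outputs match the target."""
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        if torch.rand(()) < min(1.0, (p / q).item()):
            accepted.append(tok)  # distributed as if sampled from the target
        else:
            break  # reject here; target resamples this position itself
    return accepted
```

The payoff: the cheap draft model proposes several tokens per step, and the expensive target scores them in one parallel forward pass, so latency drops while the sampled distribution stays exactly the target's.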