Knowledge Distillation: Temperature, Soft Targets, and Students
ML interview: Hinton 2015 soft labels, softmax temperature T and T² scaling, dark knowledge, FitNets, DistilBERT, student–teacher training, pruning/quantization stacks, and when distillation fails versus pruning.
The Core Idea — What the Student Actually Learns
Knowledge distillation trains a small student model to match a large teacher (or ensemble) on the same inputs. The key observation of Hinton, Vinyals, and Dean (2015, arXiv:1503.02531) is that a heavy teacher's class probabilities carry more information than a one-hot label because they encode the relative confusions between wrong classes. Example: a digit classifier that assigns probability 10⁻⁶ to class '2' and 10⁻⁵ to class '3' on the same image is telling you something about the geometry of the mistake that cross-entropy on hard labels never rewards the student for learning; that structure is the 'dark knowledge' in the name.
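A minimal sketch of that effect, assuming PyTorch; the logits below are invented for illustration, not taken from a real model:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image of a handwritten "7" (class index 7).
# The wrong classes are not equally wrong: "1" and "2" carry more logit mass than "8".
teacher_logits = torch.tensor([[1.0, 4.5, 3.0, 0.5, 0.0, -1.0, -0.5, 9.0, -2.0, 2.0]])

def soft_targets(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Softmax with temperature T: higher T flattens the distribution."""
    return F.softmax(logits / T, dim=-1)

print(soft_targets(teacher_logits, T=1.0))  # nearly one-hot on class 7
print(soft_targets(teacher_logits, T=4.0))  # wrong-class structure becomes visible
```

At T = 1 the distribution is nearly one-hot on the correct class; at T = 4 the relative ordering of the wrong classes becomes visible, which is exactly the structure the student is asked to copy.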
Deployment motivation: a 12-layer Transformer may be the accuracy champion, but latency, batch cost, and edge deployment often call for a 4- or 6-layer student that preserves most of the teacher's behavior. Distillation is one leg of a compression stool alongside pruning, low-rank factorization, and quantization; in practice, teams often stack them (first distill, then quantize) because aggressive quantization on a poor student can still collapse quality.
What Interviewers Test (DRIFT on Distillation)
Define soft vs hard labels, temperature T in softmax, KL between teacher and student.
Reason why a high T softens the distribution, exposing rich negative-class structure, and why soft-target gradients scale as 1/T², so the distillation term is usually multiplied by T² when mixed with a hard-label loss (implementation detail: watch the learning rate when changing T); see the loss sketch after this list.
Identify failure — student too small, teacher miscalibrated, distilling from a weak teacher, domain shift between train and deploy, over-regularizing the student, or distilling from an aggressively quantized teacher and inheriting its broken behavior.
Fix — multi-teacher, intermediate feature matching (not only logits), self-distillation, unlabeled distillation (dark knowledge on extra data), temperature annealing.
Test — compare student vs teacher on in-distribution and OOD slices, track ECE if probabilities matter, and measure p99 latency not just quality.
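A minimal sketch of the standard soft-plus-hard loss referenced above, assuming PyTorch; the temperature T, mixing weight alpha, and reduction choice are illustrative defaults rather than prescribed values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    """Hinton-style loss: KL(teacher || student) on temperature-softened
    probabilities, plus ordinary cross-entropy on the hard labels.

    The KL term is multiplied by T**2 because the gradients it produces
    scale as 1/T**2; without this factor, changing T silently changes the
    effective learning rate of the soft-target term.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # "batchmean" averages the KL over the batch, matching the CE reduction.
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    ce_term = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In a training loop the teacher runs under torch.no_grad() and only the student's parameters are updated; at inference the student uses T = 1, since temperature is a training-time device, which is also the setting to use for the ECE and latency checks above.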