
Machine Learning · Intermediate

Knowledge Distillation: Temperature, Soft Targets, and Students

ML interview: Hinton 2015 soft labels, softmax temperature T and T² scaling, dark knowledge, FitNets, DistilBERT, student–teacher training, pruning/quantization stacks, and when distillation fails versus pruning.

60 min read · 3 sections · 1 interview question
Knowledge Distillation · Softmax Temperature · Hinton 2015 · Student Teacher · Dark Knowledge · FitNets · DistilBERT · TinyLLaMA · Quantization · Machine Learning Interview · PyTorch · KL Divergence · BERT · Transfer Learning · Attention Transfer

The Core Idea — What the Student Actually Learns

Knowledge distillation trains a small student model to match a large teacher (or ensemble) on the same inputs. The key observation of Hinton, Vinyals, and Dean (2015, arXiv:1503.02531) is that a heavy teacher's class probabilities carry more information than a one-hot label because they encode the relative confusions between wrong classes. Example: a digit classifier that assigns 10⁻⁶ to a '2' and 10⁻⁵ to a '3' is telling you something about the geometry of the mistake that cross-entropy on hard labels never rewards the student for learning; that structure is the 'dark knowledge' in the name.
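
A minimal sketch of this loss in PyTorch, assuming `student_logits` and `teacher_logits` are `[batch, num_classes]` tensors and `labels` holds the hard targets; the function name and the defaults `T=4.0` and `alpha=0.5` are illustrative choices, not canonical values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KD loss (Hinton et al. 2015) plus hard-label cross-entropy."""
    # Teacher probabilities softened at temperature T: the "soft targets".
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # Student log-probabilities at the same temperature.
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student); 'batchmean' gives the per-example average.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean")
    # The soft term's gradients shrink as 1/T^2, so multiply by T^2 to keep
    # it on the same footing as the hard-label cross-entropy term.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * (T ** 2) * kd + (1.0 - alpha) * ce
```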

Deployment motivation: a 12-layer Transformer may be the accuracy champion, but latency, batch cost, and edge deployment often demand a 4- or 6-layer student that keeps most of the teacher's behavior. Distillation is one leg of a compression stool alongside pruning, low-rank factorization, and quantization; in practice, teams often stack them (first distill, then quantize) because aggressive quantization applied to an already-weak student can still collapse quality.
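
As an illustration of that stacking, here is a hedged sketch of the second step: dynamic INT8 quantization of an already-distilled student's Linear layers. The toy `student` module is a hypothetical stand-in for a trained model, and any real pipeline should re-validate quality after this step:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a distilled student (assume already trained).
student = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))
student.eval()

# Dynamic quantization: Linear weights stored as INT8, activations
# quantized on the fly at inference time.
quantized_student = torch.ao.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
```

Dynamic quantization is the lightest-weight option since it needs no calibration data; static or quantization-aware approaches trade more work for better accuracy at low bit widths.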

IMPORTANT

What Interviewers Test (DRIFT on Distillation)

Define soft vs. hard labels, the temperature T in the softmax, and the KL divergence between teacher and student distributions.

Reason about why a high T softens the distribution, exposing rich negative-class structure, and why the soft-target gradients scale ∝ 1/T² (implementation detail: watch the learning rate when changing T; see the sketch after this list).

Identify failure modes: a student with too little capacity, a miscalibrated teacher, distilling from a weak teacher, domain shift between training and deployment, over-regularizing the student, or distilling from an aggressively quantized teacher and inheriting its damage.

Fix: multi-teacher distillation, intermediate feature matching (not only logits), self-distillation, distillation on unlabeled data (dark knowledge on extra data), temperature annealing.

Test: compare student vs. teacher on in-distribution and OOD slices, track ECE (expected calibration error) if probabilities matter, and measure p99 latency, not just quality.
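
A small numeric sketch of the two claims in the "Reason" bullet, using arbitrary example logits: raising T flattens the teacher's distribution, and the gradient of the soft-target KL with respect to the student logits shrinks roughly like 1/T², which is exactly why the loss above carries a T² multiplier and why the learning rate deserves a second look whenever T changes:

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([6.0, 2.0, 1.0, -1.0])  # arbitrary example logits

for T in (1.0, 4.0):
    # Untrained student stand-in: uniform logits, so its softmax is flat.
    student_logits = torch.zeros(4, requires_grad=True)
    # Teacher probabilities flatten as T grows, exposing the small classes.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  soft_targets, reduction="sum")
    kd.backward()
    probs = [round(p, 3) for p in soft_targets.tolist()]
    print(f"T={T}: teacher probs {probs}, "
          f"grad norm {student_logits.grad.norm().item():.4f}")
```

Running this shows the teacher's near-one-hot distribution at T=1 spreading mass onto the negative classes at T=4, while the gradient norm drops sharply, the effect the T² rescaling compensates for.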
