Instruction Tuning: Teaching LLMs to Follow Instructions
How SFT on (instruction, response) pairs transforms a base LLM into an instruction-following assistant. Covers FLAN's multi-task discovery, Alpaca self-instruct, the Lima quality-over-quantity finding, catastrophic forgetting mitigations, and when instruction tuning beats RAG.
What Instruction Tuning Actually Does — and Doesn't Do
A base LLM is a text completion engine. Given 'Translate to French: Hello', a base model will often output more English text that plausibly follows — perhaps another example sentence in English, or a tangential continuation — because that pattern appears in pretraining data. The model has never been explicitly trained to recognize 'Translate to French:' as a directive requiring a specific response type.
Instruction tuning is SFT on (instruction, response) pairs where the instruction is a natural language task description. After training on diverse instruction-following examples, the model learns the meta-skill of following directives rather than just continuing text. The same model can then execute novel instructions it has never seen — 'Summarize this in bullet points', 'Write a SQL query to...', 'Explain this concept to a 10-year-old' — without task-specific fine-tuning.
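To make "SFT on (instruction, response) pairs" concrete, here is a minimal sketch of how one pair might become a training example. It assumes one common choice (not the only one): the prompt and response are concatenated, and prompt positions in the labels are set to -100 so that only response tokens contribute to the loss. The whitespace tokenizer and the `build_example` helper are hypothetical stand-ins for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_tokenize(text, vocab):
    """Hypothetical whitespace tokenizer: assigns each new word the next integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(instruction, response, vocab):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    prompt_ids = toy_tokenize(instruction, vocab)
    response_ids = toy_tokenize(response, vocab)
    input_ids = prompt_ids + response_ids
    # Loss is computed only where labels != IGNORE_INDEX, i.e. on the response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


vocab = {}
example = build_example("Translate to French: Hello", "Bonjour", vocab)
print(example)
```

The model still sees the instruction as context; it is simply not penalized for failing to predict the instruction itself.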
Critically, instruction tuning teaches format and behavior, not new facts. The Lima paper (Zhou et al., 2023) states this as the Superficial Alignment Hypothesis: a model's knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches it which format and style to use when interacting with users. A 7B model instruction-tuned on medical Q&A is not acquiring much new medical knowledge; it is learning to surface its pretraining knowledge in a clinically useful format. If a fact was not in the pretraining data, instruction tuning will not reliably add it. This distinction determines when instruction tuning is the right tool versus RAG.
The practical consequence: instruction tuning is powerful for tasks where the model already has relevant knowledge but needs behavioral shaping (output format, reasoning style, response length, tone). It is the wrong tool for injecting private data, recent events, or domain-specific facts absent from the pretraining corpus.
What Interviewers Want to Hear About Instruction Tuning
The single insight that separates strong candidates: instruction tuning changes behavior, not knowledge. Interviewers at Google and Meta specifically test whether candidates understand when to reach for instruction tuning versus RAG versus prompting. The correct answer is almost never 'fine-tune to inject facts' — that's RAG's job. The correct answer for instruction tuning: consistent output format, domain reasoning style, persona, and task adherence that prompting alone cannot produce reliably.
FLAN — The Discovery That Multi-Task Diversity Drives Zero-Shot Transfer
FLAN (Fine-tuned Language Net, Wei et al., 2021) established the foundational insight behind instruction tuning: fine-tuning on a diverse collection of tasks with natural language instruction templates produces dramatically better zero-shot generalization to unseen tasks compared to task-specific fine-tuning.
FLAN-137B was fine-tuned on 62 NLP datasets (sentiment, translation, QA, summarization, etc.), each expressed through multiple natural language instruction templates. The key finding: performance on held-out task types improved substantially, and the improvement grew with the number of task clusters in the training mix. In zero-shot evaluation, FLAN improved on its untuned 137B base model and surpassed zero-shot GPT-3 175B on 20 of the 25 datasets evaluated, without having seen those task types during fine-tuning.
The instruction template diversity was critical. Representing each task with 10 different phrasings of the instruction ('Summarize the following:', 'Write a brief summary of:', 'Provide a concise overview of:') prevented the model from fitting to specific trigger phrases. FLAN-T5 (Chung et al., 2022) extended this to 1,800+ tasks and showed that held-out performance keeps improving as the number of fine-tuning tasks grows: task and template diversity matters more than piling up data on a few tasks.
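The template idea can be sketched in a few lines. This is an illustration, not FLAN's actual code: the three phrasings are the ones quoted above, while the helper names, the row schema (`text`, `summary`), and the random assignment of one template per example are assumptions.

```python
import random

# Paraphrase templates for a single task (summarization).
SUMMARIZATION_TEMPLATES = [
    "Summarize the following:\n{text}",
    "Write a brief summary of:\n{text}",
    "Provide a concise overview of:\n{text}",
]


def render_with_random_template(text, target, templates, rng):
    """Pick one phrasing at random so no single trigger phrase dominates."""
    template = rng.choice(templates)
    return {"instruction": template.format(text=text), "response": target}


def expand_dataset(rows, templates, seed=0):
    """Turn raw (text, summary) rows into instruction-formatted examples."""
    rng = random.Random(seed)  # seeded for reproducible template assignment
    return [
        render_with_random_template(row["text"], row["summary"], templates, rng)
        for row in rows
    ]


rows = [{"text": f"Document {i}.", "summary": f"Summary {i}."} for i in range(6)]
expanded = expand_dataset(rows, SUMMARIZATION_TEMPLATES, seed=0)
for item in expanded[:2]:
    print(item)
```

Repeating this across many tasks, each with its own template set, yields the kind of multi-task mix described above.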
The non-obvious insight: FLAN's success comes from teaching the model a meta-skill of instruction interpretation, not from the information in any individual task. The model learns 'when text has this structure (instruction + content), produce output that satisfies the instruction' as a generalizable pattern. This is why FLAN-style models generalize to unseen tasks while task-specific fine-tuned models do not — they learned the meta-skill rather than the task.
Alpaca and Self-Instruct: GPT-3.5 as a Data Factory
Alpaca (Taori et al., Stanford, 2023) demonstrated that instruction-following capability from a stronger model can be transferred to a weaker model via data distillation — at dramatically lower cost than human annotation.
The self-instruct pipeline: (1) Seed with 175 human-written tasks covering diverse topics. (2) Prompt GPT-3.5 (text-davinci-003) to generate novel instructions given seed examples as demonstrations. (3) For each generated instruction, prompt the same model to generate a response. (4) Filter low-quality outputs (too short, repetitive, non-English). This produced 52,000 instruction-following pairs for under $500 in API cost. Fine-tuning LLaMA-7B on this data for 3 epochs took about 3 hours on 8× A100-80GB GPUs and cost under $100, for a total of roughly $600. In the Stanford team's blind pairwise human evaluation, Alpaca-7B performed comparably to text-davinci-003 (winning 90 comparisons to 89).
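The four steps can be sketched as a loop around a teacher model. This is a simplified illustration, not the Alpaca codebase: `generate_instructions` and `generate_response` are hypothetical callables wrapping whatever teacher API is used, and the filters (exact-duplicate and minimum-length checks) are stand-ins for the real quality filters.

```python
def self_instruct(seed_tasks, generate_instructions, generate_response,
                  rounds=2, min_words=3):
    """Grow an instruction dataset from seed tasks using a teacher model."""
    pool = list(seed_tasks)                      # demonstrations shown to the teacher
    seen = {task.lower().strip() for task in pool}
    dataset = []
    for _ in range(rounds):
        for instruction in generate_instructions(pool):
            key = instruction.lower().strip()
            if key in seen:                      # filter: exact repeats
                continue
            response = generate_response(instruction)
            if len(response.split()) < min_words:  # filter: too-short outputs
                continue
            seen.add(key)
            pool.append(instruction)             # new instruction becomes a demonstration
            dataset.append({"instruction": instruction, "response": response})
    return dataset


# Fake teacher functions so the sketch runs without any API access.
def fake_generate_instructions(demos):
    return [f"Rewrite task {len(demos)} politely", "Summarize a news article", "Say hi"]


def fake_generate_response(instruction):
    if instruction == "Say hi":
        return "hi"
    return f"Here is a response to: {instruction}"


data = self_instruct(
    ["Summarize a news article", "Translate a sentence"],
    fake_generate_instructions,
    fake_generate_response,
)
print([row["instruction"] for row in data])
```

In a real pipeline the two callables would issue API requests, and the filters would typically include similarity-based deduplication rather than exact string matching.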
The catch: self-instruct data has systematic quality limitations. Teacher-generated responses inherit the teacher's failure modes: hallucinated facts stated confidently, superficial reasoning that sounds plausible, and limited coverage of long-tail or technical domains. The Alpaca authors themselves report hallucination as a common failure mode, even compared to text-davinci-003. The dataset is a demonstration of the technique's power, not a production-quality training set.
Practical guideline: self-instruct is appropriate for behavioral tasks (format, tone, style), where factual accuracy is less critical. For domain-specific knowledge tasks (medical, legal, scientific), self-instruct from a generalist model introduces systematic errors that domain experts would catch but automated quality checks miss.
Orca and Lima: Quality Over Quantity in Instruction Data
Two 2023 papers established that instruction data quality dominates quantity in determining fine-tuning outcomes.
Orca (Microsoft, Mukherjee et al., 2023): Fine-tuned LLaMA-13B on explanation traces from stronger teachers, roughly 5 million from ChatGPT and 1 million from GPT-4, containing not just final answers but step-by-step reasoning chains ('Think step by step: [reasoning]... Therefore the answer is...'). Despite sharing the LLaMA-13B base with Vicuna-13B, Orca substantially outperformed it on reasoning benchmarks (BigBench Hard, AGIEval). The teacher signal quality, explicit reasoning steps rather than bare answers, was the key variable.
Lima (Zhou et al., 2023): Curated exactly 1,000 instruction-following examples: 750 selected from community sources such as Stack Exchange and wikiHow, plus 250 written by the authors, all chosen for diversity and quality. In human preference studies, responses from Lima (LLaMA-65B fine-tuned on these 1K examples) were judged equivalent or preferred to GPT-4's in 43% of cases. Lima was also preferred over the same base model fine-tuned on the 52,000 Alpaca self-instruct examples.
The Lima principle: diversity and quality of examples matters more than volume. The implication for practitioners: spend annotation budget on curation and quality verification, not on generating more examples. A dataset of 2,000 carefully vetted examples from domain experts will outperform 50,000 auto-generated examples from a weaker model on almost any domain-specific task. This runs counter to the default ML instinct ('more data is always better'), which is why it's a high-signal interview answer.
The Lima Insight: What to Tell Interviewers About Data Strategy
When asked how you would build instruction-tuning data for a domain-specific task, answer with the Lima principle: 1,000 high-quality, diverse, expert-curated examples can outperform 50,000 auto-generated examples in instruction-following quality. Budget accordingly: pay domain experts for 1–2K curated examples rather than running mass self-instruct pipelines. Filter aggressively: keep the top 10% of any generated dataset rather than using all of it. Monitor response quality on a held-out eval set, not just training loss. The signal: if training loss decreases but held-out instruction-following scores plateau, you are overfitting to low-quality data patterns.
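The two operational rules in this advice, keep only the best slice of generated data and watch held-out scores rather than training loss, can be sketched as follows. The `score_fn` is an assumption: it stands in for whatever quality signal is available (a judge model, heuristics, or human ratings), and both helper names are hypothetical.

```python
def keep_top_fraction(examples, score_fn, fraction=0.10):
    """Return the highest-scoring fraction of examples (at least one)."""
    ranked = sorted(examples, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def overfit_signal(train_losses, eval_scores):
    """True when training loss fell but the held-out score did not improve."""
    return train_losses[-1] < train_losses[0] and eval_scores[-1] <= eval_scores[0]


# Toy data: the "score" field stands in for a judge-model or human rating.
examples = [{"id": i, "score": i / 20} for i in range(20)]
best = keep_top_fraction(examples, score_fn=lambda ex: ex["score"])
print([ex["id"] for ex in best])

# Loss keeps falling while the held-out score stays flat: a warning sign.
print(overfit_signal([2.0, 1.5, 1.1], [0.61, 0.62, 0.61]))
```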
Conversation Templates and Why Template Mismatch Degrades Inference
Instruction-tuned models are trained with a specific conversation template that delineates turns. The template must be applied identically at training and inference — mismatch is one of the most common causes of degraded instruction-following in deployed models.
Major conversation templates:
- ChatML (OpenAI):
<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n
- Llama 2:
[INST] {user} [/INST] {assistant} </s>
- Llama 3:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
- Mistral/Mixtral: an [INST]-style format close to Llama 2, though exact spacing and system-prompt handling differ by model version
Why mismatch fails: the model has learned to detect the delimiter tokens as context-switching signals. If you train with Llama 3 format but infer with Llama 2 [INST] tags, the model sees unfamiliar control tokens and falls back to base-model-style continuation rather than instruction-following mode. This appears as responses that ignore the instruction, repeat the prompt, or produce off-format outputs.
The fix: always use the HuggingFace tokenizer.apply_chat_template() method, which encodes the correct template for the model. Do not manually construct conversation strings. During fine-tuning, apply the same chat template to training data using the same tokenizer — this is where most practitioners make errors, hand-crafting conversation strings that differ from what apply_chat_template produces.
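To see how different two templates are at the string level, here is an illustrative sketch with hand-written renderers for the ChatML and Llama 3 layouts. These renderers are assumptions written for demonstration only (exact whitespace and special tokens vary by model, which is precisely why `apply_chat_template` is the safer route); `templates_match` is a hypothetical helper showing the kind of check worth running between a training pipeline and an inference server.

```python
def render_chatml(messages):
    """ChatML-style layout, ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"


def render_llama3(messages):
    """Llama-3-style layout, ending with an open assistant header."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def templates_match(train_render, infer_render, probe_messages):
    """Check that training and inference produce byte-identical prompts."""
    return train_render(probe_messages) == infer_render(probe_messages)


probe = [{"role": "user", "content": "Hi"}]
print(repr(render_chatml(probe)))
print(repr(render_llama3(probe)))
print(templates_match(render_llama3, render_llama3, probe))
print(templates_match(render_chatml, render_llama3, probe))
```

A check like this on a handful of probe conversations, using the real rendering paths on both sides, catches most mismatches before deployment.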
Instruction Tuning Pipeline: From Data Collection to Deployment
Instruction Data Collection Strategies: Quality, Cost, and Scale Tradeoffs
| Strategy | Data Quality | Cost per Example | Scalability | Best Use Case |
|---|---|---|---|---|
| Human expert annotation | Highest — expert-level accuracy, nuance | $15–$50 USD per example | Low — bottlenecked on expert time | Medical, legal, scientific domains where factual accuracy is critical; seed data for self-instruct pipelines |
| Crowd-sourced (MTurk) | Medium — general tasks only | $0.50–$5 per example | High — thousands/day | General instruction-following, summarization, simple QA; NOT suitable for technical domains |
| Self-Instruct (Alpaca-style) | Medium-low — inherits teacher model errors | $0.001–$0.01 via API | Very high — 52K examples for $600 | Behavioral tasks: format, tone, style; NOT for factual injection; requires quality filtering |
| Distillation with reasoning traces (Orca-style) | High — step-by-step reasoning transfers | $0.05–$0.20 per example (GPT-4 API) | Medium — expensive at scale | Math, coding, multi-step reasoning; explicit CoT traces dramatically improve reasoning quality |
| Domain expert + quality filter (Lima-style) | Very high — curated diversity | $30–$100 per example all-in | Low — 500–2K examples is the target | Production-quality domain-specific models; when benchmark scores must match or exceed GPT-3.5 |
5 Questions: Should I Instruction-Tune or Use RAG?
Is the capability gap about knowledge or behavior?
If the model fails because it lacks specific facts (company policies, product catalog, recent events), use RAG — instruction tuning cannot reliably inject facts. If the model fails because it ignores your format requirements, uses the wrong tone, or doesn't follow task structure, instruction tuning is the right tool.
Does the required knowledge change frequently?
If your knowledge base is updated daily or weekly (news, pricing, inventory), RAG is required — you cannot retrain a fine-tuned model on every update. Instruction tuning is appropriate for stable behavioral patterns that do not change: 'always respond as JSON', 'write in clinical SOAP note format', 'use Hemingway's style'.
Is source attribution required?
Legal, medical, and compliance applications often require citing the specific document that grounded a claim. RAG enables this naturally (retrieved documents are explicit sources). Instruction tuning internalizes knowledge without provenance — you cannot attribute a fine-tuned model's output to a specific training example.
What is your latency budget?
RAG adds a retrieval round-trip: typically 50–200ms for vector search + reranking. Instruction tuning bakes behavior into model weights — zero retrieval overhead. For latency-critical applications (real-time chat, code autocomplete), instruction tuning is preferred. For knowledge-intensive applications where 200ms retrieval latency is acceptable, RAG is usually better value.
Do you have enough high-quality training examples?
Instruction tuning needs a minimum of ~200–500 high-quality (instruction, response) pairs to produce consistent behavioral change — fewer than this and results are erratic. RAG requires a document corpus but no labeled examples. If you have under 200 labeled examples, use few-shot prompting or RAG rather than fine-tuning.
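The five questions can be folded into a rough decision helper. This is a sketch under stated assumptions, not a definitive rule: the `UseCase` fields and `recommend` function are hypothetical, and the precedence (knowledge-related reasons win over latency, with latency reported as a caveat) is one possible ordering that the questions above do not themselves fix.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    gap_is_knowledge: bool          # model lacks facts (vs. ignores format or tone)
    knowledge_changes_often: bool   # daily or weekly updates
    needs_source_attribution: bool  # claims must cite a document
    latency_critical: bool          # e.g. real-time chat, autocomplete
    labeled_examples: int           # high-quality (instruction, response) pairs


def recommend(case, min_examples=200):
    """Return (recommendation, reasons) following the five questions."""
    reasons = []
    if case.gap_is_knowledge:
        reasons.append("gap is missing facts")
    if case.knowledge_changes_often:
        reasons.append("knowledge changes frequently")
    if case.needs_source_attribution:
        reasons.append("sources must be cited")
    if reasons:
        if case.latency_critical:
            reasons.append("budget for retrieval latency")
        return "RAG", reasons
    if case.labeled_examples < min_examples:
        return "few-shot prompting or RAG", ["too few labeled examples to fine-tune"]
    return "instruction tuning", ["behavioral gap with enough labeled data"]


print(recommend(UseCase(False, False, False, True, 1500)))
print(recommend(UseCase(True, True, False, True, 1500)))
```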
Catastrophic Forgetting Is Systematic, Not Random
Forgetting is not uniformly distributed across skills. The skills most vulnerable to forgetting are the ones least represented in your instruction data. If your instruction dataset is predominantly English general language tasks (summarization, QA, creative writing), expect degradation in code generation, mathematical reasoning, low-resource languages, and specialized domains (medical, legal, scientific terminology). Illustrative magnitudes: HumanEval down around 15%, MATH down around 10%, and perplexity on low-resource languages up 20–50%; actual figures depend on the base model and the data mix.
Diagnosis: always evaluate on a benchmark battery covering both instruction-following AND base capabilities before and after fine-tuning. If code scores drop more than 5%, add code examples to the instruction mix. Do not assume instruction tuning is free — it always trades off some base capability. The goal is to find the data mixture where the instruction-following gain exceeds the forgetting cost across your task distribution.
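The before-and-after comparison can be sketched as a small regression check. The benchmark names and scores below are made-up placeholders, the helper name is hypothetical, and the 5% threshold is interpreted here as a relative drop, which is an assumption.

```python
def capability_regressions(before, after, max_relative_drop=0.05):
    """Return benchmarks whose score fell by more than the allowed relative drop."""
    flagged = {}
    for name, base_score in before.items():
        if base_score <= 0:
            continue  # relative drop is undefined for a zero baseline
        drop = (base_score - after[name]) / base_score
        if drop > max_relative_drop:
            flagged[name] = round(drop, 3)
    return flagged


# Placeholder scores for illustration only, not measured results.
before = {"humaneval": 0.40, "math": 0.20, "instruction_following": 0.50}
after = {"humaneval": 0.34, "math": 0.195, "instruction_following": 0.70}

regressions = capability_regressions(before, after)
print(regressions)
```

Any benchmark that shows up in the result is a candidate for more coverage in the instruction mix, for example by adding code examples when the coding score regresses.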
LoRA for Instruction Tuning: Hardware Reality
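Full fine-tuning of Llama-3-70B requires approximately 140GB of GPU memory for the BF16 weights alone. Gradients add another 140GB, and Adam keeps two FP32 states per parameter (8 bytes per parameter, about 560GB). That is roughly 840GB before activations, and over 1TB if an FP32 master copy of the weights is kept, so training has to be sharded across something like two 8× A100-80GB nodes. For most practitioners, this is either inaccessible or economically unjustifiable for instruction tuning experiments.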
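The memory arithmetic for full fine-tuning is easiest to check as bytes per parameter. This sketch assumes BF16 weights and gradients (2 bytes each) and two FP32 Adam states (4 bytes each), optionally adds an FP32 master copy of the weights, and ignores activations and framework overhead. The function name is hypothetical.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2,
                            adam_state_bytes=4, master_weight_bytes=0):
    """Estimate training memory (GB) from per-parameter byte counts."""
    bytes_per_param = (
        weight_bytes
        + grad_bytes
        + 2 * adam_state_bytes      # Adam keeps first and second moments
        + master_weight_bytes       # optional FP32 copy used in mixed precision
    )
    return n_params * bytes_per_param / 1e9


N_70B = 70_000_000_000

print(full_finetune_memory_gb(N_70B))                         # weights + grads + Adam
print(full_finetune_memory_gb(N_70B, master_weight_bytes=4))  # plus FP32 master weights
print(full_finetune_memory_gb(N_70B) / 80)                    # minimum count of 80GB GPUs
```

Under these assumptions the 70B case comes to 840 GB (1,120 GB with master weights), which is more than ten 80GB GPUs before any activation memory is counted.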
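LoRA (rank=16, alpha=32) trains only small low-rank adapters while the base weights stay frozen. The parameter math: for each target weight matrix of size d×d, LoRA adds two matrices of size d×r and r×d. For d=4096, r=16: (4096×16 + 16×4096) = 131,072 parameters per matrix, versus 16.8M for the full matrix. Applied to the Q, K, V, O projections across 32 layers (a 7B-class model): ~16.8M trainable parameters, about 0.25% of the total. Llama-3-70B is larger (d=8192, 80 layers, with narrower K and V projections due to grouped-query attention), so the same configuration gives roughly 65M trainable parameters, under 0.1% of the total. Because gradients and optimizer states are kept only for the adapters, memory is dominated by the frozen weights: about 140GB in BF16 for a 70B model, which still needs at least two A100-80GB GPUs. Fitting on a single 80GB GPU requires quantizing the base model. LoRA is commonly reported to reach instruction-following quality within a few percent of full fine-tuning on standard benchmarks.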
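The parameter count can be verified with a short calculation. The helper names are hypothetical; the first configuration is the d=4096, 32-layer case from the text, and the second assumes Llama-3-70B's published shape (hidden size 8192, 80 layers, K and V projections with output width 1024 under grouped-query attention).

```python
def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


def lora_total(projection_shapes, rank, num_layers):
    """Total adapter parameters for the given projections in every layer."""
    per_layer = sum(lora_params(i, o, rank) for i, o in projection_shapes)
    return num_layers * per_layer


# Q, K, V, O all 4096 x 4096, 32 layers (7B-class model).
square_7b = [(4096, 4096)] * 4
print(lora_params(4096, 4096, 16))      # 131072 per matrix
print(lora_total(square_7b, 16, 32))    # 16777216

# Assumed Llama-3-70B shapes: Q and O are 8192 x 8192, K and V are 8192 -> 1024.
gqa_70b = [(8192, 8192), (8192, 1024), (8192, 1024), (8192, 8192)]
print(lora_total(gqa_70b, 16, 80))      # 65536000
```

The count scales linearly with rank, so moving from rank 16 to rank 64 multiplies each total by four.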
QLoRA (Dettmers et al., 2023) extends this to smaller hardware by quantizing the base model to 4-bit (NF4 format) and training BF16 LoRA adapters on top. A 70B model in 4-bit requires ~35GB for weights, which can fit on a single 48GB GPU or across 2× RTX 4090 (48GB combined) with short sequences and small batches. The QLoRA paper reports quality on par with 16-bit fine-tuning; in practice, budget for a small accuracy cost (a few percent on instruction-following benchmarks) and measure it on your own eval set. For a 7B model, QLoRA fits on a single RTX 4090 (24GB).
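The 4-bit footprint is slightly above 4 bits per parameter because each block of weights also stores a scaling constant. The sketch below follows the accounting described in the QLoRA paper (64-weight blocks with a 32-bit constant each; with double quantization those constants are stored in 8 bits with a 32-bit second-level constant per 256 blocks). The function names are hypothetical, and the result covers weights only.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Effective storage bits per weight for NF4, including scaling constants."""
    if not double_quant:
        return 4 + 32 / block_size
    return 4 + 8 / block_size + 32 / (block_size * dq_block_size)


def quantized_weight_gb(n_params, bits_per_param):
    """Weight storage in GB for a given effective bit width."""
    return n_params * bits_per_param / 8 / 1e9


plain = nf4_bits_per_param(double_quant=False)
nested = nf4_bits_per_param(double_quant=True)
print(plain, nested, plain - nested)

N_70B = 70_000_000_000
print(quantized_weight_gb(N_70B, 4))       # idealized 4 bits per weight
print(quantized_weight_gb(N_70B, nested))  # with double-quantized constants
```

Under these assumptions double quantization saves about 0.37 bits per parameter, and a 70B model needs roughly 36 GB for weights rather than the idealized 35 GB.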
LoRA rank selection: rank 16 is a reasonable starting point. Increasing to rank 64 approaches full fine-tuning quality on most instruction-following tasks and is worth the 4× parameter cost if you have sufficient data (>5K examples). Ranks above 128 provide minimal additional benefit and suggest the task requires full fine-tuning or a larger base model.
QLoRA Instruction Tuning with PEFT and TRL
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import Dataset
import torch

# QLoRA: 4-bit base model + BF16 LoRA adapters
# An 8B model fits on a single RTX 4090 (24GB); a 70B model needs roughly 48GB or more
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4: 4-bit type designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # nested quantization saves ~0.4 bits/param
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepares the quantized model for training (e.g. enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# The tokenizer is loaded from the Instruct checkpoint so that a chat template is
# defined; the vocabulary is shared with the base model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration: target the Q and V projections
# (adding k_proj and o_proj often helps, at a higher parameter cost)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                   # rank: 8-64; higher = more params, often better quality
    lora_alpha=32,                          # scale factor: typically 2 x rank
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For Llama-3-8B with r=16 on q_proj and v_proj this is about 6.8M trainable
# parameters, under 0.1% of the model.


# The dataset needs a 'text' column with the chat template already applied
def format_example(example):
    # Apply the Llama 3 chat template; it must match inference exactly
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}


# Replace [...] with your own list of {"instruction": ..., "response": ...} dicts
dataset = Dataset.from_list([...])
dataset = dataset.map(format_example)

# Argument names follow the TRL release this example was written for; newer
# releases rename max_seq_length to max_length and the trainer's tokenizer
# argument to processing_class. Check the version you have installed.
sft_config = SFTConfig(
    max_seq_length=2048,
    num_train_epochs=2,                     # 1-3 epochs; monitor eval loss for early stopping
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch = 16
    learning_rate=2e-4,                     # LoRA typically uses a higher LR than full fine-tuning
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    output_dir="./qlora_output",
    bf16=True,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()