
Instruction Tuning: Teaching LLMs to Follow Instructions

How SFT on (instruction, response) pairs transforms a base LLM into an instruction-following assistant. Covers FLAN's multi-task discovery, Alpaca's self-instruct pipeline, the LIMA quality-over-quantity finding, catastrophic forgetting mitigations, and when instruction tuning beats RAG.

Tags: Instruction Tuning, SFT, FLAN, Alpaca, Self-Instruct, LoRA, QLoRA, Catastrophic Forgetting, LIMA, Orca, ChatML, Fine-Tuning

What Instruction Tuning Actually Does — and Doesn't Do

A base LLM is a text completion engine. Given 'Translate to French: Hello', a base model will often output more English text that plausibly follows — perhaps another example sentence in English, or a tangential continuation — because that pattern appears in pretraining data. The model has never been explicitly trained to recognize 'Translate to French:' as a directive requiring a specific response type.

Instruction tuning is SFT on (instruction, response) pairs where the instruction is a natural language task description. After training on diverse instruction-following examples, the model learns the meta-skill of following directives rather than just continuing text. The same model can then execute novel instructions it has never seen — 'Summarize this in bullet points', 'Write a SQL query to...', 'Explain this concept to a 10-year-old' — without task-specific fine-tuning.
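
To see the contrast concretely, here is a minimal sketch using Hugging Face transformers that sends the same prompt to a base checkpoint and its instruction-tuned sibling; the model ids and generation settings are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

def complete(model_id: str, prompt: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Base model: tends to continue the pattern (e.g., more example sentences)
print(complete("meta-llama/Meta-Llama-3-8B", "Translate to French: Hello"))
# Instruct model: executes the directive (for real use, wrap the prompt with
# tokenizer.apply_chat_template; see the conversation template section below)
print(complete("meta-llama/Meta-Llama-3-8B-Instruct", "Translate to French: Hello"))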

Critically, instruction tuning teaches format and behavior, not new facts. The LIMA paper (2023) made this explicit with its superficial alignment hypothesis: a model's knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches which format to use when interacting with users. A 7B model instruction-tuned on medical Q&A is not acquiring new medical knowledge — it is learning to surface its pretraining knowledge in a clinically useful format. If the fact wasn't in pretraining, instruction tuning won't add it. This distinction determines when instruction tuning is the right tool versus RAG.

The practical consequence: instruction tuning is powerful for tasks where the model already has relevant knowledge but needs behavioral shaping (output format, reasoning style, response length, tone). It is the wrong tool for injecting private data, recent events, or domain-specific facts absent from the pretraining corpus.

IMPORTANT

What Interviewers Want to Hear About Instruction Tuning

The single insight that separates strong candidates: instruction tuning changes behavior, not knowledge. Interviewers at Google and Meta specifically test whether candidates understand when to reach for instruction tuning versus RAG versus prompting. The correct answer is almost never 'fine-tune to inject facts' — that's RAG's job. The correct answer for instruction tuning: consistent output format, domain reasoning style, persona, and task adherence that prompting alone cannot produce reliably.

FLAN — The Discovery That Multi-Task Diversity Drives Zero-Shot Transfer

FLAN (Fine-tuned Language Net, Wei et al., 2021) established the foundational insight behind instruction tuning: fine-tuning on a diverse collection of tasks with natural language instruction templates produces dramatically better zero-shot generalization to unseen tasks compared to task-specific fine-tuning.

FLAN-137B was fine-tuned on 62 NLP datasets (sentiment, translation, QA, summarization, etc.), each expressed in multiple natural language instruction phrasings ('paraphrase templates'). The key finding: held-out task performance improved dramatically, and the improvement scaled with the number of tasks in the training mix. In zero-shot evaluation, FLAN surpassed zero-shot GPT-3 (175B) on 20 of 25 held-out datasets — without ever having seen those specific tasks.

The instruction template diversity was critical. Representing each task with 10 different phrasings of the instruction ('Summarize the following:', 'Write a brief summary of:', 'Provide a concise overview of:') prevented the model from fitting to specific trigger phrases. FLAN-T5 (Chung et al., 2022) extended this to 1,800+ tasks and demonstrated the scaling law: more tasks with diverse templates consistently beats more data on fewer tasks.
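
A minimal sketch of FLAN-style template diversity, using the summarization phrasings quoted above (the exact template set is illustrative, not FLAN's):

# Each training example is rendered under several instruction phrasings so the
# model learns the task, not a single trigger phrase.
SUMMARIZE_TEMPLATES = [
    "Summarize the following:\n{text}",
    "Write a brief summary of:\n{text}",
    "Provide a concise overview of:\n{text}",
]

def render_with_templates(example: dict) -> list[dict]:
    return [
        {"instruction": t.format(text=example["text"]), "response": example["summary"]}
        for t in SUMMARIZE_TEMPLATES
    ]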

The non-obvious insight: FLAN's success comes from teaching the model a meta-skill of instruction interpretation, not from the information in any individual task. The model learns 'when text has this structure (instruction + content), produce output that satisfies the instruction' as a generalizable pattern. This is why FLAN-style models generalize to unseen tasks while task-specific fine-tuned models do not — they learned the meta-skill rather than the task.

Alpaca and Self-Instruct: GPT-3.5 as a Data Factory

Alpaca (Taori et al., Stanford, 2023) demonstrated that instruction-following capability from a stronger model can be transferred to a weaker model via data distillation — at dramatically lower cost than human annotation.

The self-instruct pipeline: (1) Seed with 175 human-written task-instruction pairs covering diverse topics. (2) Prompt GPT-3.5 (text-davinci-003) to generate novel instructions, given the seed examples as demonstrations. (3) For each generated instruction, prompt the same model to generate a response. (4) Filter low-quality outputs (too short, repetitive, non-English). This produced 52,000 instruction-following pairs at a total API cost of under $500. Fine-tuning LLaMA-7B on this data for 3 epochs on 8× A100 GPUs cost under $100. The resulting Alpaca-7B behaved comparably to text-davinci-003 on Stanford's human evaluation across many general language tasks.
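
A compressed sketch of the four pipeline stages, using the openai Python client; the prompt wording, model name, and filter thresholds are illustrative, not Alpaca's exact pipeline:

import random
from openai import OpenAI

client = OpenAI()
seed_tasks = [...]  # stage 1: the 175 human-written seed instructions (elided)

def propose_instructions(n_demos: int = 3) -> str:
    # Stage 2: sample seed tasks as in-context demonstrations of what a task looks like
    demos = "\n".join(random.sample(seed_tasks, n_demos))
    prompt = f"Here are example task instructions:\n{demos}\nWrite 5 new, diverse task instructions:"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer(instruction: str) -> str:
    # Stage 3: the teacher model answers its own generated instruction
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": instruction}]
    )
    return resp.choices[0].message.content

def keep(response: str) -> bool:
    # Stage 4: crude quality filter, dropping too-short or highly repetitive outputs
    words = response.lower().split()
    return len(words) >= 10 and len(set(words)) / len(words) > 0.3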

The catch: self-instruct data has systematic quality limitations. GPT-3.5-generated responses inherit GPT-3.5's failure modes: hallucinated facts stated confidently, superficial reasoning that sounds plausible, and limited coverage of long-tail or technical domains. Alpaca-7B was later found to hallucinate at higher rates than ChatGPT-3.5 on factual QA benchmarks. The $600 dataset is a demonstration of the technique's power, not a production-quality training set.

Practical guideline: self-instruct is appropriate for behavioral tasks (format, tone, style), where factual accuracy is less critical. For domain-specific knowledge tasks (medical, legal, scientific), self-instruct from a generalist model introduces systematic errors that domain experts would catch but automated quality checks miss.

Orca and LIMA: Quality Over Quantity in Instruction Data

Two 2023 papers established that instruction data quality dominates quantity in determining fine-tuning outcomes.

Orca (Microsoft, Mukherjee et al., 2023): Fine-tuned LLaMA-13B on millions of teacher explanation traces (roughly 5 million from ChatGPT and 1 million from GPT-4) — not just final answers but step-by-step reasoning chains ('Think step by step: [reasoning]... Therefore the answer is...'). Despite sharing the 13B base model with Vicuna, Orca dramatically outperformed both Vicuna and Alpaca on reasoning benchmarks (BIG-Bench Hard, AGIEval). The teacher signal quality — explicit reasoning steps rather than bare answers — was the key variable.

LIMA (Zhou et al., 2023): Curated exactly 1,000 instruction-following examples: roughly 750 drawn from community Q&A sources (primarily Stack Exchange) and 250 written manually, all hand-selected for diversity and quality. LIMA (LLaMA-65B fine-tuned on these 1K examples) produced responses that human raters judged equivalent or strictly preferred to GPT-4's in 43% of comparisons. In the paper's head-to-head evaluation, the 1,000 curated examples beat an Alpaca-style 65B baseline trained on 52,000 self-instruct examples.

The LIMA principle: diversity and quality of examples matter more than volume. The implication for practitioners: spend annotation budget on curation and quality verification, not on generating more examples. A dataset of 2,000 carefully vetted examples from domain experts will outperform 50,000 auto-generated examples from a weaker model on almost any domain-specific task. This runs counter to the default ML instinct ('more data is always better'), which is why it's a high-signal interview answer.

TIP

The LIMA Insight: What to Tell Interviewers About Data Strategy

When asked how you would build instruction-tuning data for a domain-specific task, answer with the LIMA principle: 1,000 high-quality, diverse, expert-curated examples outperform 50,000 auto-generated examples in instruction-following quality. Budget accordingly: pay domain experts for 1–2K curated examples rather than running mass self-instruct pipelines. Filter aggressively — keep the top 10% of any generated dataset rather than using all of it. Monitor response quality on a held-out eval set, not just training loss. The signal: if training loss decreases but held-out instruction-following scores plateau, you are overfitting to low-quality data patterns.
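
A minimal sketch of the 'keep the top 10%' filter, assuming score_example is whatever quality scorer you trust (an LLM judge, heuristics, or human spot-check ratings):

def top_decile(examples: list[dict], score_example) -> list[dict]:
    # Sort by quality score, keep only the best tenth of the generated dataset
    ranked = sorted(examples, key=score_example, reverse=True)
    return ranked[: max(1, len(ranked) // 10)]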

Conversation Templates and Why Template Mismatch Degrades Inference

Instruction-tuned models are trained with a specific conversation template that delineates turns. The template must be applied identically at training and inference — mismatch is one of the most common causes of degraded instruction-following in deployed models.

Major conversation templates:

  • ChatML (OpenAI): <|im_start|>system\n{system}\n<|im_end|>\n<|im_start|>user\n{user}\n<|im_end|>\n<|im_start|>assistant\n
  • Llama 2: <s>[INST] {user} [/INST] {assistant} </s>
  • Llama 3: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
  • Mistral/Mixtral: Same as Llama 2 format

Why mismatch fails: the model has learned to detect the delimiter tokens as context-switching signals. If you train with Llama 3 format but infer with Llama 2 [INST] tags, the model sees unfamiliar control tokens and falls back to base-model-style continuation rather than instruction-following mode. This appears as responses that ignore the instruction, repeat the prompt, or produce off-format outputs.

The fix: always use the HuggingFace tokenizer.apply_chat_template() method, which encodes the correct template for the model. Do not manually construct conversation strings. During fine-tuning, apply the same chat template to training data using the same tokenizer — this is where most practitioners make errors, hand-crafting conversation strings that differ from what apply_chat_template produces.
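
A minimal example of rendering a conversation correctly (the model id is illustrative; any instruction-tuned checkpoint with a chat template works the same way):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Translate to French: Hello"}]

# add_generation_prompt=True appends the assistant header so the model knows to respond
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTranslate to French: Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n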

Instruction Tuning Pipeline: From Data Collection to Deployment

[Diagram: the instruction tuning pipeline, from data collection to deployment]

Instruction Data Collection Strategies: Quality, Cost, and Scale Tradeoffs

  • Human expert annotation: quality highest (expert-level accuracy and nuance); cost $15–$50 per example; scalability low (bottlenecked on expert time). Best for medical, legal, and scientific domains where factual accuracy is critical, and for seed data for self-instruct pipelines.
  • Crowd-sourced (MTurk): quality medium (general tasks only); cost $0.50–$5 per example; scalability high (thousands per day). Best for general instruction-following, summarization, and simple QA; not suitable for technical domains.
  • Self-instruct (Alpaca-style): quality medium-low (inherits teacher-model errors); cost $0.001–$0.01 per example via API; scalability very high (52K examples for under $500). Best for behavioral tasks (format, tone, style); not for factual injection; requires quality filtering.
  • Distillation with reasoning traces (Orca-style): quality high (step-by-step reasoning transfers); cost $0.05–$0.20 per example (GPT-4 API); scalability medium (expensive at scale). Best for math, coding, and multi-step reasoning; explicit CoT traces dramatically improve reasoning quality.
  • Domain expert + quality filter (LIMA-style): quality very high (curated diversity); cost $30–$100 per example all-in; scalability low (500–2K examples is the target). Best for production-quality domain-specific models where benchmark scores must match or exceed GPT-3.5.

5 Questions: Should I Instruction-Tune or Use RAG?

01

Is the capability gap about knowledge or behavior?

If the model fails because it lacks specific facts (company policies, product catalog, recent events), use RAG — instruction tuning cannot reliably inject facts. If the model fails because it ignores your format requirements, uses the wrong tone, or doesn't follow task structure, instruction tuning is the right tool.

02

Does the required knowledge change frequently?

If your knowledge base is updated daily or weekly (news, pricing, inventory), RAG is required — you cannot retrain a fine-tuned model on every update. Instruction tuning is appropriate for stable behavioral patterns that do not change: 'always respond as JSON', 'write in clinical SOAP note format', 'use Hemingway's style'.

03

Is source attribution required?

Legal, medical, and compliance applications often require citing the specific document that grounded a claim. RAG enables this naturally (retrieved documents are explicit sources). Instruction tuning internalizes knowledge without provenance — you cannot attribute a fine-tuned model's output to a specific training example.

04

What is your latency budget?

RAG adds a retrieval round-trip: typically 50–200ms for vector search + reranking. Instruction tuning bakes behavior into model weights — zero retrieval overhead. For latency-critical applications (real-time chat, code autocomplete), instruction tuning is preferred. For knowledge-intensive applications where 200ms retrieval latency is acceptable, RAG is usually better value.

05

Do you have enough high-quality training examples?

Instruction tuning needs a minimum of ~200–500 high-quality (instruction, response) pairs to produce consistent behavioral change — fewer than this and results are erratic. RAG requires a document corpus but no labeled examples. If you have under 200 labeled examples, use few-shot prompting or RAG rather than fine-tuning.

Catastrophic Forgetting: The Hidden Cost of Instruction Tuning

Catastrophic forgetting occurs when fine-tuning on one distribution degrades performance on tasks from the pretraining distribution. For instruction tuning, this typically manifests as degraded performance on specialized skills — code generation, mathematical reasoning, rare language support — that were present in the base model but underrepresented in the instruction tuning dataset.

Quantified impact: LLaMA-1 instruction-tuned on general instruction-following data showed 15–20% degradation on HumanEval (code benchmark) and 10–15% on MATH benchmark compared to the base model. This is the alignment tax — the cost in specialized capability of teaching general instruction-following behavior.

The mechanism: gradient descent during SFT updates the model parameters to minimize instruction-tuning loss. Parameters encoding specialized skills that are orthogonal to the instruction-following objective drift away from their optimal values. With limited instruction data, the model 'overwrites' pretraining representations rather than augmenting them.

Mitigations, ranked by effectiveness:

  1. Include specialized data in the SFT mix: Llama-3-instruct was trained with ~40% code examples mixed into the instruction data. This directly prevents code skill forgetting.
  2. LoRA instead of full fine-tuning: LoRA updates only ~0.1% of parameters, leaving 99.9% frozen. This dramatically limits the surface area for forgetting. Empirically, LoRA-tuned models show 5–8× less catastrophic forgetting than fully fine-tuned models on same-scale experiments.
  3. Low learning rate: Use 1e-5 instead of 1e-4 (typical for full fine-tuning). Smaller gradient steps reduce the magnitude of forgetting per epoch.
  4. Short training: 1–3 epochs maximum. Beyond 3 epochs on a fixed instruction dataset, forgetting accelerates while instruction-following improvement plateaus.
  5. Replay pretraining data: Mix 5–10% raw pretraining tokens into every training batch (see the sketch below). This requires access to pretraining-distribution data, which makes it expensive, but it is highly effective where feasible.
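
A minimal sketch of replay mixing (mitigation 5), assuming instruction_ds and pretrain_ds are Hugging Face datasets already formatted identically:

from datasets import concatenate_datasets

def mix_with_replay(instruction_ds, pretrain_ds, replay_frac: float = 0.07):
    # Choose n_replay so replay examples make up replay_frac of the final mixture
    n_replay = int(len(instruction_ds) * replay_frac / (1 - replay_frac))
    replay = pretrain_ds.shuffle(seed=0).select(range(n_replay))
    return concatenate_datasets([instruction_ds, replay]).shuffle(seed=0)
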
⚠ WARNING

Catastrophic Forgetting Is Systematic, Not Random

Forgetting is not uniformly distributed across skills. The skills most vulnerable to forgetting are the ones least represented in your instruction data. If your instruction dataset is predominantly English general language tasks (summarization, QA, creative writing), expect degradation in: code generation (HumanEval -15%), mathematical reasoning (MATH -10%), low-resource languages (perplexity increases of 20-50%), and specialized domains (medical, legal, scientific terminology).

Diagnosis: always evaluate on a benchmark battery covering both instruction-following AND base capabilities before and after fine-tuning. If code scores drop more than 5%, add code examples to the instruction mix. Do not assume instruction tuning is free — it always trades off some base capability. The goal is to find the data mixture where the instruction-following gain exceeds the forgetting cost across your task distribution.
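
A sketch of the before/after regression check described above; benchmark names, scores, and the 5% threshold are placeholders for your own eval battery:

def forgetting_report(base_scores: dict, tuned_scores: dict, threshold: float = 0.05):
    # Flag any benchmark where the tuned model regressed past the threshold
    for bench, base in base_scores.items():
        drop = (base - tuned_scores[bench]) / base
        if drop > threshold:
            print(f"{bench}: -{drop:.0%} regression; add {bench}-style data to the SFT mix")

forgetting_report(
    {"humaneval": 0.33, "math": 0.12, "mmlu": 0.65},   # base model (placeholder scores)
    {"humaneval": 0.27, "math": 0.11, "mmlu": 0.66},   # after instruction tuning
)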

LoRA for Instruction Tuning: Hardware Reality

Full fine-tuning Llama-3-70B requires approximately 140GB of GPU memory for the BF16 weights alone. Add BF16 gradients (another 140GB) and Adam's two FP32 moment estimates (8 bytes per parameter, roughly 560GB), and the total is on the order of 840GB before activations, i.e. at least 11× A100-80GB GPUs. For most practitioners, this is either inaccessible or economically unjustifiable for instruction tuning experiments.

LoRA (rank=16, alpha=32) freezes the base weights and trains only small adapter matrices, producing instruction-following quality within 2–3% of full fine-tuning on standard benchmarks. Because only the adapters need gradients and optimizer states, Llama-3-70B in BF16 plus adapters fits on 2× A100-80GB with gradient checkpointing, versus 11+ GPUs for full fine-tuning. The parameter math, using Llama-3-8B dimensions for concreteness: for each target weight matrix of size d×d, LoRA adds two matrices of size d×r and r×d. For d=4096, r=16: 4096×16 + 16×4096 = 131,072 parameters per matrix, versus 16.8M for the full 4096×4096 matrix. Applied to the Q, K, V, O projections across 32 layers: roughly 16.8M trainable params, about 0.2% of the 8B model's total.
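
The arithmetic, as a quick sanity check (this ignores GQA: Llama 3's K and V projections are smaller than d×d, so the true count is somewhat lower):

d, r, layers = 4096, 16, 32          # Llama-3-8B dimensions
per_matrix = d * r + r * d           # A (d×r) + B (r×d) = 131,072
print(4 * per_matrix * layers)       # Q, K, V, O across 32 layers: 16,777,216 (~16.8M)
print(d * d)                         # 16,777,216, one full d×d matrix for comparison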

QLoRA (Dettmers et al., 2023) extends this to consumer hardware by quantizing the base model to 4-bit (NF4 format) before applying LoRA adapters in BF16. A 70B model in 4-bit requires ~35GB — fitting on 2× RTX 4090 (48GB combined). Empirical accuracy cost: approximately 2–5% on instruction-following benchmarks versus LoRA in BF16. For a 7B model, QLoRA fits on a single RTX 4090 (24GB).
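
The memory arithmetic behind the 4-bit numbers (ignoring the small overhead of quantization constants, which double quantization shrinks further):

params_70b, params_7b = 70e9, 7e9
print(params_70b * 0.5 / 1e9)   # 4 bits = 0.5 bytes/param: ~35 GB for a 70B base
print(params_7b * 0.5 / 1e9)    # ~3.5 GB for 7B; adapters + activations fit in 24 GB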

LoRA rank selection: rank 16 is a reasonable starting point. Increasing to rank 64 approaches full fine-tuning quality on most instruction-following tasks and is worth the 4× parameter cost if you have sufficient data (>5K examples). Ranks above 128 provide minimal additional benefit and suggest the task requires full fine-tuning or a larger base model.

QLoRA Instruction Tuning with PEFT and TRL

qlora_sft.py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import Dataset
import torch

# QLoRA: 4-bit base model + BF16 LoRA adapters
# 7B model fits on single RTX 4090 (24GB); 70B on 2x A100-80GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # NF4 quantization (better than int4)
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization saves ~0.4 bits/param
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # hub id includes the "Meta-" prefix
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing, norms cast for stability
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token
# Base-model tokenizers may lack a chat_template; if apply_chat_template errors below,
# set tokenizer.chat_template from the Instruct variant's tokenizer config.

# LoRA configuration: target Q, V projections (K, O optional +5% quality)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank: 8-64; higher = more params, better quality
    lora_alpha=32,      # scale factor: typically 2×rank
    target_modules=["q_proj", "v_proj"],  # add k_proj, o_proj for +5%
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total params; expect well under 0.1% trainable at r=16 on Q/V

# Dataset must have 'text' column with pre-formatted chat template applied
def format_example(example):
    # Apply Llama 3 chat template — must match inference exactly
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = Dataset.from_list([...])  # your instruction pairs
dataset = dataset.map(format_example)

sft_config = SFTConfig(
    max_seq_length=2048,
    num_train_epochs=2,          # 1-3 epochs; monitor eval loss for early stop
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16
    learning_rate=2e-4,          # LoRA uses higher LR than full FT
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    output_dir="./qlora_output",
    bf16=True,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
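
After training, inference must reuse the exact same chat template. A sketch continuing from the block above, assuming the final adapter was saved under ./qlora_output:

from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./qlora_output")  # attach LoRA adapters

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this in bullet points: ..."}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant header so the model responds
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))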
