NLP Fundamentals: Tokenization, Embeddings, BERT vs GPT, and Fine-Tuning
The NLP concepts every ML engineer must know to work with language models. Covers subword tokenization (BPE, WordPiece, SentencePiece), static vs contextual embeddings, BERT vs GPT architectural differences, fine-tuning strategies from full fine-tuning to LoRA, and the practical tradeoffs that come up in every production NLP system.
Why Tokenization Matters More Than Most Engineers Realize
Before any text enters a neural network, it must be converted to numbers via tokenization. The choice of tokenization algorithm has concrete, measurable effects on model quality, vocabulary coverage, and production behavior — yet most engineers treat it as a black box.
The core tradeoff: vocabulary size vs sequence length
- Word-level tokenization: 'running', 'runs', 'ran' are all different tokens. Vocabulary can be 500K+, most tokens are rare (long tail), and unknown words become [UNK] and lose all signal. Used in: classical models (TF-IDF, word2vec).
- Character-level tokenization: Zero unknown words, tiny vocabulary (~100 chars). But 'transformer' becomes 11 tokens → sequence lengths roughly 10× longer → attention cost grows quadratically with length → impractical for long documents.
- Subword tokenization: The right tradeoff. Common words are single tokens ('the', 'model'). Rare words are split into meaningful pieces ('unbelievable' → 'un', '##believ', '##able'). Vocabulary 30K–100K. Used by all modern transformers.
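To make the tradeoff concrete, here is a minimal sketch that counts tokens at each granularity for the same sentence. It assumes the Hugging Face `transformers` package is installed; GPT-2's BPE tokenizer stands in for "a modern subword tokenizer", and the exact subword splits depend on the released vocabulary.

```python
# Compare token counts at word, character, and subword granularity.
from transformers import AutoTokenizer

text = "The transformer architecture is unbelievable"

word_tokens = text.split()       # word-level: split on whitespace
char_tokens = list(text)         # character-level: one token per character

gpt2 = AutoTokenizer.from_pretrained("gpt2")   # downloads the GPT-2 BPE tokenizer
subword_tokens = gpt2.tokenize(text)           # subword-level: BPE pieces

print(len(word_tokens), word_tokens)       # 5 tokens, but 'unbelievable' has no internal structure
print(len(char_tokens))                    # ~44 tokens — the sequence blows up
print(len(subword_tokens), subword_tokens) # in between: common words stay whole, rare words split
```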
The three dominant algorithms:
BPE (Byte-Pair Encoding) — used by GPT-2/3/4, LLaMA, RoBERTa: Bottom-up: start with characters, iteratively merge the most frequent adjacent pair. Trained on the corpus frequency distribution. GPT-4 uses cl100k_base: ~100,277 tokens, trained on a mix weighted toward English.
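A toy sketch of the BPE training loop in pure Python. The four-word corpus and the '</w>' end-of-word marker are illustrative conventions, not any particular library's format; real implementations (tiktoken, Hugging Face tokenizers) add byte-level handling and much larger merge budgets.

```python
from collections import Counter

# Toy corpus: each word is pre-split into characters plus an end-of-word marker,
# mapped to its frequency in the corpus.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def most_frequent_pair(vocab):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

merges = []
for _ in range(10):                   # the merge budget sets the final vocabulary size
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = merge_pair(pair, vocab)

print(merges)   # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```

The learned merge list is the whole artifact: at inference time, text is split into characters and the merges are replayed in order, which is why the tokenizer is cheap to apply but expensive to change.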
WordPiece — used by BERT, DistilBERT, ELECTRA: Similar to BPE but merges based on the pair that maximizes the likelihood of the training corpus (not just frequency). Subword pieces prefixed with '##' indicate continuation tokens ('##ing', '##tion'). Vocabulary: 30,522 for bert-base-uncased.
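A small illustration of WordPiece output, assuming the `transformers` package. The exact splits depend on the released vocabulary, but the '##' continuation convention is the thing to notice.

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")

print(bert.vocab_size)                 # 30522
print(bert.tokenize("unbelievable"))   # ['un', '##believ', '##able']
print(bert.tokenize("tokenization"))   # ['token', '##ization']
```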
SentencePiece / Unigram LM — used by T5, ALBERT, mBART (LLaMA-2 also uses SentencePiece, but in its BPE mode): Language-agnostic (no whitespace-tokenization assumption). The Unigram approach trains a unigram language model on the corpus and prunes the vocabulary to maximize the total likelihood. Works well for multilingual models and morphologically complex languages (Japanese, Korean).
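A minimal training sketch, assuming the `sentencepiece` package and a hypothetical corpus.txt; the same API covers both the Unigram and BPE modes.

```python
import sentencepiece as spm

# Train a Unigram-LM vocabulary on raw text (one sentence per line, no pre-tokenization).
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical corpus file
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",        # "bpe" is also supported here
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
# Whitespace is handled as part of the text itself (the '▁' marker),
# so no language-specific pre-tokenizer is needed.
print(sp.encode("unbelievable results", out_type=str))
```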
Production implication: A model's tokenizer is trained once and fixed. Domain-specific terms (medical ICD codes, programming identifiers) that the tokenizer splits aggressively mean longer sequences → slower inference → potentially worse performance. Domain-specific tokenizers (Med-BERT, CodeBERT) retrain the vocabulary on domain data.
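A quick way to see this in practice, assuming `tiktoken` is installed; the terms below are illustrative, not from the original guide.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4's tokenizer

for term in [
    "E11.9",                                              # ICD-10 code
    "pneumonoultramicroscopicsilicovolcanoconiosis",      # rare medical term
    "torch.nn.functional.scaled_dot_product_attention",   # programming identifier
]:
    ids = enc.encode(term)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{term!r}: {len(ids)} tokens -> {pieces}")

# More tokens per term means longer sequences, higher latency, and more
# opportunities for the model to mis-handle the term's internal structure.
```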