
Machine Learning · Intermediate

NLP Fundamentals: Tokenization, Embeddings, BERT vs GPT, and Fine-Tuning

The NLP concepts every ML engineer must know to work with language models. Covers subword tokenization (BPE, WordPiece, SentencePiece), static vs contextual embeddings, BERT vs GPT architectural differences, fine-tuning strategies from full fine-tuning to LoRA, and the practical tradeoffs that come up in every production NLP system.

45 min read · 2 sections · 1 interview question

Tags: NLP, Tokenization, BPE, WordPiece, Embeddings, Word2Vec, BERT, GPT, Fine-Tuning, LoRA, Sentence Embeddings, Contextual Embeddings, Language Models, Hugging Face

Why Tokenization Matters More Than Most Engineers Realize

Before any text enters a neural network, it must be converted to numbers via tokenization. The choice of tokenization algorithm has concrete, measurable effects on model quality, vocabulary coverage, and production behavior — yet most engineers treat it as a black box.

The core tradeoff: vocabulary size vs sequence length

  • Word-level tokenization: 'running', 'runs', 'ran' are all different tokens. Vocabulary can be 500K+, most tokens are rare (long tail), unknown words become [UNK] and lose all signal. Used in: classical models (TF-IDF, word2vec).
  • Character-level tokenization: Zero unknown words, tiny vocabulary (~100 chars). But 'transformer' becomes 11 tokens → sequence lengths 10× longer → attention cost quadratic with length → impractical for long documents.
  • Subword tokenization: The right tradeoff. Common words are single tokens ('the', 'model'). Rare words are split into meaningful pieces ('unbelievable' → 'un', '##believ', '##able'). Vocabulary 30K–100K. Used by all modern transformers.
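To make the subword behavior concrete, here is a minimal greedy longest-match splitter in the WordPiece style, using a toy vocabulary (the pieces and the `[UNK]` fallback are illustrative, not a real model's vocabulary):

```python
def subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right.

    Continuation pieces carry the '##' prefix, WordPiece-style.
    Falls back to [UNK] when no piece matches.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:          # no vocabulary piece matched
            return ["[UNK]"]
        start = end
    return tokens

vocab = {"the", "model", "un", "##believ", "##able"}
print(subword_tokenize("model", vocab))         # ['model']
print(subword_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```

Common words come back as one token while rare words decompose into meaningful morphemes, which is exactly the vocabulary-size vs sequence-length compromise described above.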

The three dominant algorithms:

BPE (Byte-Pair Encoding) — used by GPT-2/3/4, LLaMA, RoBERTa: Bottom-up: start with characters, iteratively merge the most frequent adjacent pair. Trained on the corpus frequency distribution. GPT-4 uses cl100k_base: ~100,277 tokens, trained on a mix weighted toward English.
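The BPE training loop can be sketched in a few lines. This toy version works on characters over a made-up corpus; production BPE (e.g. GPT-2's) operates on bytes and stores the learned merges as ranked rules for encoding new text:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols, weighted by corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

corpus = ["low", "low", "lower", "lowest", "new", "newer"]
print(train_bpe(corpus, 3))  # first merges: ('l', 'o'), then ('lo', 'w'), ...
```

Because merges are learned from corpus frequencies, the algorithm naturally allocates whole tokens to common strings and leaves rare strings to be assembled from pieces.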

WordPiece — used by BERT, DistilBERT, ELECTRA: Similar to BPE but merges based on the pair that maximizes the likelihood of the training corpus (not just frequency). Subword pieces prefixed with '##' indicate continuation tokens ('##ing', '##tion'). Vocabulary: 30,522 for bert-base-uncased.
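The "maximizes the likelihood" criterion reduces to a simple score: freq(ab) / (freq(a) × freq(b)). A sketch with made-up frequencies shows how WordPiece can disagree with raw-frequency BPE:

```python
def wordpiece_score(pair_freq, left_freq, right_freq):
    """WordPiece merge score: likelihood gain of fusing the pair,
    normalized by how common its parts already are."""
    return pair_freq / (left_freq * right_freq)

# A very frequent pair of very frequent symbols...
common = wordpiece_score(pair_freq=900, left_freq=1000, right_freq=1000)
# ...can lose to a rarer pair whose parts almost always co-occur.
cohesive = wordpiece_score(pair_freq=50, left_freq=60, right_freq=55)
print(common < cohesive)  # True: WordPiece merges the cohesive pair first
```

BPE would merge the 900-count pair first; WordPiece prefers the pair whose parts rarely appear apart, which tends to produce more linguistically coherent pieces.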

SentencePiece / Unigram LM — used by T5, ALBERT, mBART: Language-agnostic (no whitespace-tokenization assumption). Trains a unigram language model on the corpus and prunes the vocabulary to maximize the total likelihood. Works well for multilingual models and morphologically complex languages (Japanese, Korean). Note that SentencePiece is a framework supporting both Unigram and BPE; LLaMA-2's tokenizer is SentencePiece running in BPE mode.
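At inference time, a Unigram LM tokenizer picks the segmentation that maximizes total piece likelihood, which is a Viterbi search over split points. A sketch with made-up piece probabilities (real models learn these during training):

```python
import math

def unigram_segment(text, piece_logprob):
    """Viterbi over split points: best[i] = (score, backpointer) for the
    highest-likelihood segmentation of text[:i]."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logprob:
                score = best[start][0] + piece_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, pos = [], n          # backtrack from the end
    while pos > 0:
        start = best[pos][1]
        if start is None:
            return None          # no full segmentation exists
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

probs = {"un": math.log(0.05), "believ": math.log(0.01),
         "able": math.log(0.03), "unbeliev": math.log(0.0001)}
print(unigram_segment("unbelievable", probs))  # ['un', 'believ', 'able']
```

Unlike BPE's deterministic merge replay, the Unigram model scores whole segmentations, which is also what enables subword regularization (sampling alternative splits during training).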

Production implication: A model's tokenizer is trained once and fixed. Adding domain-specific terms (medical ICD codes, programming identifiers) that the tokenizer splits aggressively → longer sequences → slower inference → potentially worse performance. Domain-specific tokenizers (Med-BERT, CodeBERT) retrain the vocabulary on domain data.
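The sequence-length cost is easy to quantify with a toy greedy splitter and two illustrative vocabularies (neither is a real tokenizer's vocabulary; the ICD-10 code is just an example domain term):

```python
def count_pieces(word, vocab):
    """Count greedy longest-match pieces, falling back to single characters."""
    n, start = 0, 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:      # nothing matched: emit one character
            end = start + 1
        n += 1
        start = end
    return n

general_vocab = {"e", "11", ".", "9"}   # general-purpose: splits the code apart
domain_vocab = {"e11.9"}                # domain tokenizer: one token per code
code = "e11.9"
print(count_pieces(code, general_vocab), count_pieces(code, domain_vocab))  # 4 1
```

A 4× token count on domain terms compounds across a document: more tokens per example means quadratically more attention compute and fewer examples per context window, which is exactly why Med-BERT and CodeBERT retrain their vocabularies.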
