
Transformers: Self-Attention, Architecture & Modern LLMs

The architecture that powers all modern LLMs. Covers self-attention derivation with complexity analysis, multi-head attention, positional encodings (absolute, RoPE, ALiBi), encoder vs decoder architectures, modern improvements (GQA, RMSNorm, SwiGLU), and how to count parameters and FLOPs. 8 hard interview questions.


Why Transformers Replaced RNNs

Before Transformers (2017), sequence models were dominated by RNNs and LSTMs. They had two fundamental limitations:

  1. Sequential processing: Each token must wait for the previous token before it can be computed — a 10,000-token sequence requires 10,000 sequential steps. Modern GPUs are designed for massive parallelism, so RNNs utilize ~5% of available FLOPs.
  2. Long-range dependencies: To connect token 1 to token 1,000, the gradient must flow through 999 recurrent steps — vanishing gradients make learning these dependencies nearly impossible, even with LSTM gates.

Transformers solve both:

  1. Parallel processing: All tokens are processed in parallel in each layer — full GPU utilization.
  2. Direct attention: A single attention operation connects any two tokens regardless of distance.

The 'Attention is All You Need' paper (Vaswani et al., 2017) eliminated recurrence entirely and achieved SOTA on machine translation with dramatically shorter training time.

Self-Attention — The Core Operation

Self-attention allows each position in a sequence to attend to all other positions in parallel. For an input sequence X ∈ Rⁿˣᵈ (n tokens, d dimensions each):

  1. Create three projections: Q = XW_Q (queries), K = XW_K (keys), V = XW_V (values). Each W_Q, W_K, W_V ∈ Rᵈˣᵈₖ.

  2. Compute attention scores: A = QKᵀ/√dₖ ∈ Rⁿˣⁿ. The √dₖ scaling keeps the scores from growing with dimension: if the components of a query and key are roughly independent with zero mean and unit variance, their dot product has variance dₖ, so dividing by √dₖ restores unit variance and keeps softmax out of its saturated region (where gradients vanish).

  3. Apply softmax per row: Ã = softmax(A) ∈ Rⁿˣⁿ. Each row is one token's attention weight distribution over all tokens in the sequence (including itself).

  4. Weighted sum of values: Output = ÃV ∈ Rⁿˣᵈₖ.

Intuition: Q is 'what am I looking for?', K is 'what information do I have?', V is 'what information do I send if matched?'. The attention score Aᵢⱼ measures how much token i wants to attend to token j. High Aᵢⱼ → token j's value is heavily incorporated into token i's output.
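Putting the four steps together, here is a minimal NumPy sketch of single-head self-attention (no masking, no multi-head splitting). The shapes and names (n, d, dₖ, W_Q, W_K, W_V) follow the notation above; the random inputs and weights are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d) token embeddings -> (n, d_k) attended outputs."""
    Q = X @ W_Q                        # (n, d_k) queries
    K = X @ W_K                        # (n, d_k) keys
    V = X @ W_V                        # (n, d_k) values
    d_k = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)         # (n, n) scaled attention scores
    A_tilde = softmax(A, axis=-1)      # each row is a distribution over all tokens
    return A_tilde @ V                 # (n, d_k) weighted sum of values

# Illustrative usage with toy dimensions and random weights
n, d, d_k = 5, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (5, 8)
```

Note that the O(n²) score matrix A is where the quadratic memory and compute cost of attention comes from, and the softmax is applied row-wise so every output is a convex combination of the value vectors.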
