Transformers: Self-Attention, Architecture & Modern LLMs
The architecture that powers all modern LLMs. Covers self-attention derivation with complexity analysis, multi-head attention, positional encodings (absolute, RoPE, ALiBi), encoder vs decoder architectures, modern improvements (GQA, RMSNorm, SwiGLU), and how to count parameters and FLOPs. 8 hard interview questions.
Why Transformers Replaced RNNs
Before Transformers (2017), sequence models were dominated by RNNs and LSTMs. They had two fundamental limitations:
- Sequential processing: Each token must wait for the previous token before it can be computed — a 10,000-token sequence requires 10,000 sequential steps. Modern GPUs are designed for massive parallelism, so RNNs utilize ~5% of available FLOPs.
- Long-range dependencies: To connect token 1 to token 1,000, the gradient must flow through 999 recurrent steps — vanishing gradients make learning these dependencies nearly impossible, even with LSTM gates.
Transformers solve both:
- Parallel processing: All tokens are processed in parallel in each layer — full GPU utilization.
- Direct attention: A single attention operation connects any two tokens regardless of distance.
The 'Attention Is All You Need' paper (Vaswani et al., 2017) eliminated recurrence entirely and achieved SOTA on machine translation with dramatically shorter training time.
Self-Attention — The Core Operation
Self-attention allows each position in a sequence to attend to all other positions in parallel. For an input sequence X ∈ Rⁿˣᵈ (n tokens, d dimensions each):
- Create three projections: Q = XW^Q (queries), K = XW^K (keys), V = XW^V (values), with W^Q, W^K, W^V ∈ Rᵈˣᵈₖ.
- Compute attention scores: A = QKᵀ/√dₖ ∈ Rⁿˣⁿ. The √dₖ scaling prevents dot products from growing large when dₖ is high (which would otherwise push softmax into saturation, killing gradients).
- Apply softmax per row: Ã = softmax(A) ∈ Rⁿˣⁿ. Each row is the attention weight distribution of one token over all tokens (including itself).
- Weighted sum of values: Output = ÃV ∈ Rⁿˣᵈₖ.
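The need for the √dₖ scaling can be checked numerically: a dot product of two dₖ-dimensional unit-variance vectors has standard deviation about √dₖ, so dividing by √dₖ restores unit scale. A quick NumPy check (the sample size and dₖ value here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# Dot products of independent unit-variance vectors have variance ~ d_k,
# so their standard deviation grows like sqrt(d_k).
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))
dots = (q * k).sum(axis=-1)

print(dots.std())                   # ≈ sqrt(512) ≈ 22.6 — large enough to saturate softmax
print((dots / np.sqrt(d_k)).std())  # ≈ 1.0 after scaling
```

Without the division, softmax over scores with standard deviation ~22 puts nearly all mass on one token, and its gradients vanish.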
Intuition: Q is 'what am I looking for?', K is 'what information do I have?', V is 'what information do I send if matched?'. The attention score Aᵢⱼ measures how much token i wants to attend to token j. High Aᵢⱼ → token j's value is heavily incorporated into token i's output.
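The four steps above can be sketched as a single NumPy function. This is a minimal single-head version for intuition only (the function name and shapes are illustrative; real implementations add masking, batching, and multiple heads):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X            : (n, d)   input token embeddings
    W_q, W_k, W_v: (d, d_k) learned projection matrices
    returns      : (n, d_k) attended outputs
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # step 1: projections
    d_k = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)                       # step 2: scaled scores, (n, n)
    A = A - A.max(axis=-1, keepdims=True)            # shift rows for numerical stability
    weights = np.exp(A)
    weights /= weights.sum(axis=-1, keepdims=True)   # step 3: row-wise softmax
    return weights @ V                               # step 4: weighted sum of values
```

Row i of `weights` is token i's attention distribution Ãᵢ, and each output row is a convex combination of the value vectors.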