Positional Encoding — Sinusoidal, RoPE, ALiBi & Context Length Extrapolation
Deep-dive into why self-attention is permutation-invariant and how sinusoidal, learned, RoPE, and ALiBi positional encodings solve this — with production guidance on context length extrapolation, YaRN scaling, and why RoPE is now the default for new LLM architectures.
The Permutation-Invariance Problem — Why Attention Needs Position
Self-attention computes outputs as a weighted sum of values: Attention(Q, K, V) = softmax(QK^T / √d_k) × V. Notice what is absent: the index of each token. The attention score between token i and token j depends only on their content vectors (Q_i · K_j), not on whether j appears two positions before i or two thousand positions before i.
Without any positional information, the model cannot distinguish "the dog bit the man" from "the man bit the dog" — the same tokens, the same attention scores, the same output. Formally, self-attention is permutation-equivariant: if you permute the input sequence, the output is permuted identically. This is catastrophic for language, where word order is syntactic structure.
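A minimal numpy sketch makes this concrete (an illustration, not production code: the Q/K/V projections and multi-head structure are omitted, so queries, keys, and values are just the raw token vectors). It implements the formula above and then checks that permuting the input rows permutes the output rows identically.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Note that no token index appears anywhere in this computation."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 tokens, model dim 8 (projections omitted)
perm = rng.permutation(5)        # an arbitrary reordering of the sequence

out = attention(x, x, x)
out_shuffled = attention(x[perm], x[perm], x[perm])

# Permutation equivariance: shuffling the input rows merely shuffles the output
# rows, so each token's representation is identical regardless of where it sits.
assert np.allclose(out[perm], out_shuffled)
```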
The permutation-invariance problem has three distinct solutions, each with different inductive biases: (1) add a position-dependent signal to the token embeddings before attention (sinusoidal PE, learned absolute PE), (2) modify the attention score to depend on relative position (RoPE, ALiBi), or (3) implicitly encode position through causal masking alone (this works poorly without explicit PE — the model can infer some order from causality but lacks metric position information).
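To make the two main injection points concrete, here is a compact numpy sketch, again a simplification rather than a production implementation: a sinusoidal table added to token embeddings before attention (solution 1), a RoPE-style rotation applied to queries and keys, and an ALiBi-style distance bias added directly to the attention scores (both solution 2). The slope value and the symmetric, non-causal distance penalty are illustrative choices.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Solution 1: absolute position signal added to token embeddings before attention."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, base=10000.0):
    """Solution 2 (RoPE-style): rotate adjacent feature pairs of q/k by a
    position-dependent angle, so the q.k inner product depends on the
    relative offset (i - j) rather than on absolute positions."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)              # (d/2,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def alibi_bias(seq_len, slope=0.5):
    """Solution 2 (ALiBi-style): bias the attention scores by a penalty
    proportional to query-key distance (symmetric here for brevity)."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])    # (seq_len, seq_len)

# Injection point 1 (sinusoidal / learned): x = token_embeddings + sinusoidal_pe(T, d)
# Injection point 2 (RoPE):  scores = rope(q) @ rope(k).T / sqrt(d_k)
# Injection point 2 (ALiBi): scores = q @ k.T / sqrt(d_k) + alibi_bias(T)
```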
Why does this matter in interviews? Understanding that positional information is injected into the attention score (RoPE, ALiBi) versus added to the input embeddings (sinusoidal, learned) is the difference between giving a surface-level answer and demonstrating that you understand how position interacts with the attention mechanism at a mathematical level.