
Positional Encoding — Sinusoidal, RoPE, ALiBi & Context Length Extrapolation

Deep-dive into why self-attention is permutation-invariant and how sinusoidal, learned, RoPE, and ALiBi positional encodings solve this — with production guidance on context length extrapolation, YaRN scaling, and why RoPE is now the default for new LLM architectures.

Tags: Positional Encoding, RoPE, ALiBi, Sinusoidal, Transformers, Context Length, YaRN, LLaMA, Attention Mechanism, LLM Architecture, Context Extrapolation, NTK Scaling

The Permutation-Invariance Problem — Why Attention Needs Position

Self-attention computes outputs as a weighted sum of values: Attention(Q, K, V) = softmax(QK^T / √d_k) × V. Notice what is absent: the index of each token. The attention score between token i and token j depends only on their content vectors (Q_i · K_j), not on whether j appears two positions before i or two thousand positions before i.
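
To make the absence of position concrete, here is a minimal single-head sketch of that formula in NumPy. It is illustrative only: real implementations add learned projections, multiple heads, masking, and batching.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention over (seq_len, d_k) arrays, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # depends only on token content
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values
```

Nothing in the score matrix refers to a token index; position never enters the computation.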

Without any positional information, the model cannot distinguish "the dog bit the man" from "the man bit the dog" — the same tokens, the same attention scores, the same output. Formally, self-attention is permutation-equivariant: if you permute the input sequence, the output is permuted identically. This is catastrophic for language, where word order is syntactic structure.
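
Continuing the sketch above, a quick numerical check makes the equivariance explicit: permuting the input rows permutes the output rows in exactly the same way (the sizes and random data here are arbitrary).

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))        # 6 tokens with d_k = 16, arbitrary choices
perm = rng.permutation(6)

out = attention(X, X, X)            # self-attention: Q = K = V = X
out_shuffled = attention(X[perm], X[perm], X[perm])

# The shuffled output is just the original output with its rows shuffled.
assert np.allclose(out[perm], out_shuffled)
```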

The permutation-invariance problem has three distinct solutions, each with a different inductive bias:

(1) Add a position-dependent signal to the token embeddings before attention (sinusoidal PE, learned absolute PE); a minimal sketch of this approach follows this list.

(2) Modify the attention score itself so that it depends on the relative position between query and key (RoPE, ALiBi).

(3) Rely on causal masking alone, with no explicit PE. In practice this works poorly: the model can infer some ordering from the causal mask, but it has no metric notion of how far apart two tokens are.
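
As a concrete instance of approach (1), here is a sketch of the sinusoidal table from the original Transformer paper, reusing the NumPy import above; d_model is assumed even, and 10000 is the standard frequency base.

```python
def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) table: sine on even dims, cosine on odd dims."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model / 2)
    angles = positions / (10000.0 ** (dims / d_model))     # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Usage: token_embeddings + sinusoidal_pe(seq_len, d_model) feeds the first layer.
```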

Why does this matter in interviews? Knowing whether position is injected into the attention score (RoPE, ALiBi) or into the input embedding (sinusoidal, learned absolute) is the difference between a surface-level answer and one that shows you understand how position interacts with the attention mechanism at a mathematical level.
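
To see what injecting position into the score (rather than the embedding) looks like mechanically, here is a hedged single-head sketch of an ALiBi-style additive bias. The slope value is an illustrative assumption; ALiBi assigns each head its own fixed slope from a geometric sequence.

```python
def alibi_bias(seq_len: int, slope: float = 0.5) -> np.ndarray:
    """(seq_len, seq_len) additive bias: linear penalty on distance plus a causal mask."""
    i = np.arange(seq_len)[:, None]          # query positions
    j = np.arange(seq_len)[None, :]          # key positions
    bias = -slope * (i - j).astype(float)    # 0 on the diagonal, more negative further back
    bias[j > i] = -np.inf                    # forbid attending to future tokens
    return bias

# The bias is added to QK^T / sqrt(d_k) before the softmax; the embeddings are untouched.
```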

Sinusoidal PE and RoPE Rotation Formulas
