
Long-Context LLMs — Lost in the Middle, RAG vs. Natively Long, KV Cache & Packing

What 32K–1M+ token support really means: attention and KV cache economics, the lost-in-the-middle result (Liu et al., TACL 2024, arXiv:2307.03172), needle-in-haystack and chunk reordering for RAG, and when retrieval still wins on cost and proof. Ties the five GenAI planes together for staff interviews.

Tags: Long Context · Lost in the Middle · RAG · KV Cache · RoPE · Needle in Haystack · RAGAS · Context Window · Prefill · PagedAttention · vLLM · Reordering · PEARL · FlashAttention · TACL 2024

Bigger context windows are a budget line item

Dense self-attention costs O(L²) FLOPs per layer in the naïve accounting, and long prompts mean a long prefill phase before the first new token is generated. FlashAttention (Dao et al.) reduces HBM round-trips and avoids materializing the L×L attention matrix; it does not make “infinite” context free in FLOPs or in wall-clock time. At serving time, the KV cache for prior tokens also grows linearly with context length and often dominates GPU memory in long interactive chats.
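
To make those economics concrete, here is a minimal back-of-envelope sketch. The model dimensions below (layers, heads, head size, fp16 storage) are illustrative assumptions in the rough shape of a 7B-parameter decoder, not measurements from any particular deployment, and the FLOP count covers only the quadratic attention terms.

```python
# Back-of-envelope KV cache size and prefill attention cost.
# Dimensions are assumed (roughly 7B-class: 32 layers, 32 KV heads,
# head_dim 128, fp16), purely for illustration.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to store K and V for `seq_len` cached tokens."""
    # 2 tensors (K and V) per layer, each [seq_len, n_kv_heads * head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def naive_attention_flops(seq_len: int,
                          n_layers: int = 32,
                          d_model: int = 4096) -> int:
    """Rough FLOPs for the quadratic prefill terms only.

    Q·K^T and softmax·V each cost about 2 * L^2 * d_model multiply-adds
    per layer; projections and the MLP are omitted.
    """
    return n_layers * 4 * seq_len ** 2 * d_model


if __name__ == "__main__":
    for tokens in (4_000, 32_000, 128_000):
        gib = kv_cache_bytes(tokens) / 2**30
        tflop = naive_attention_flops(tokens) / 1e12
        print(f"{tokens:>7} tokens: KV cache ≈ {gib:5.1f} GiB, "
              f"quadratic prefill ≈ {tflop:8.1f} TFLOPs")
```

Under these assumptions a 32K-token prompt already holds roughly 16 GiB of fp16 KV cache, which is why paged KV management (e.g. PagedAttention in vLLM) and cache reuse matter as much as raw FLOPs.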

Liu et al., Lost in the Middle (TACL 2024; arXiv:2307.03172) showed a U-shaped pattern: in their controlled multi-document QA and key–value retrieval settings, models use evidence placed at the start or end of a long prompt better than evidence buried in the middle. RAG systems should therefore reorder packed chunks, tighten K after reranking, and not treat “paste the whole drive” as a product strategy.
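
One common mitigation is to pack reranked chunks so the strongest evidence lands at the two ends of the prompt and the weakest sits in the middle. The sketch below is a minimal illustration of that zigzag placement, assuming you already have chunks sorted best-first by a reranker; the `Chunk` type and function name are hypothetical, not any library's API.

```python
# Minimal lost-in-the-middle reordering sketch (assumed helper, not a
# specific library's API): best chunks go to the front and back of the
# packed context, weakest chunks end up in the middle.

from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # reranker score, higher is better


def reorder_for_long_context(chunks_best_first: list[Chunk]) -> list[Chunk]:
    """Place ranks 1, 3, 5, ... from the front and ranks 2, 4, 6, ...
    from the back, so the lowest-ranked chunks sit in the middle."""
    front: list[Chunk] = []
    back: list[Chunk] = []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    ranked = [Chunk(f"doc-{i}", score=1.0 - 0.1 * i) for i in range(6)]
    packed = reorder_for_long_context(ranked)
    # -> doc-0, doc-2, doc-4, doc-5, doc-3, doc-1
    print([c.text for c in packed])
```

The same idea appears in off-the-shelf RAG stacks as a “long-context reorder” step applied after reranking and before prompt packing; trimming K first keeps the middle from filling up with marginal chunks in the first place.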

