GenAI & Agents
The stack reshaping interviews in 2026: LLMs, RAG, fine-tuning, inference optimization, agent frameworks, and evaluation — built for candidates targeting AI-first roles.
Guides
Agent Memory Systems: In-Context, Semantic, Episodic, and Procedural
LLM context windows are finite and expensive. Learn how production agents implement hierarchical memory — in-context buffers, vector DB semantic retrieval, episodic event logs, and procedural fine-tuning — and when each layer is worth the engineering investment.
Instruction Tuning: Teaching LLMs to Follow Instructions
How SFT on (instruction, response) pairs transforms a base LLM into an instruction-following assistant. Covers FLAN's multi-task discovery, Alpaca self-instruct, the LIMA quality-over-quantity finding, catastrophic forgetting mitigations, and when instruction tuning beats RAG.
Embeddings — From word2vec to Instruction-Tuned Vectors & Production RAG
Trace the evolution from word2vec through BERT to modern instruction-tuned embeddings (text-embedding-3, E5-mistral, BGE-M3), understand Matryoshka Representation Learning for cost-latency tradeoffs, and master the production decisions that determine retrieval quality in RAG systems.
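A quick sketch of the Matryoshka idea: an MRL-trained embedding can be truncated to a prefix of its dimensions and re-normalized, trading a little recall for much cheaper storage. The 3072-dim size and random vector below are illustrative stand-ins, not a specific model's output.

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` coordinates of a Matryoshka-trained
    embedding and re-normalize to unit length for cosine search."""
    v = embedding[:dims]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)      # stand-in for a real embedding
small = truncate_mrl(full, 256)   # ~12x less index storage per vector
```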
Tokenization — BPE, WordPiece, SentencePiece & Production Artifacts
Master the three dominant tokenization algorithms (BPE, WordPiece, SentencePiece) used in GPT-4, BERT, and LLaMA, understand why tokenization causes subtle failures like the r-counting problem, and learn how vocabulary design directly impacts context window costs and multilingual quality.
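To see the failure mode concretely, here is a minimal sketch using the open-source tiktoken library (assumed installed); the exact split depends on the vocabulary, but "strawberry" never tokenizes into single letters:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4-era BPE vocabulary
tokens = enc.encode("strawberry")
print([enc.decode_single_token_bytes(t) for t in tokens])
# Multi-character chunks, not letters: the model never "sees"
# individual r's it could count.
```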
Decoding Strategies: Temperature, Sampling, and Constrained Generation
Master every LLM decoding parameter: greedy vs beam search vs sampling, temperature scaling, top-k and nucleus sampling, repetition penalties, and the 2024 min-p sampler. Understand when each strategy is optimal and why temperature doesn't change what the model 'knows'.
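As a minimal sketch of how temperature and nucleus sampling compose (assuming raw next-token logits as a NumPy array):

```python
import numpy as np

def sample(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.95) -> int:
    """Temperature-scaled nucleus (top-p) sampling over raw logits."""
    z = logits / temperature                      # reshapes, never reorders, the distribution
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most likely first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest nucleus covering top_p
    p = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=p))
```

Note that temperature rescales the distribution but never reorders it; the greedy argmax is identical at any temperature.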
Prompt Engineering: From Zero-Shot to Production Systems
Master the techniques that separate good prompts from great ones: few-shot examples, Chain-of-Thought reasoning, system prompt design, structured output, and token budget management. Understand when prompting beats fine-tuning and how to debug bad outputs systematically.
AI Agents & Agentic Systems Framework
Comprehensive guide to building production agentic AI systems — from ReAct patterns and tool design to multi-agent orchestration, memory, and evaluation. The fastest-growing area in AI engineering.
LLM & Agent Evaluation: Trajectories, RAGAS, LLM-as-Judge, and Hallucination Mitigation
Evaluating agents and LLM systems requires evaluating trajectories, not just outputs. Learn trajectory vs outcome evaluation, tool call accuracy, the GAIA benchmark gap (humans: 92%, best agents: ~50%), LLM-as-judge biases, RAGAS metrics for grounded generation, layered hallucination defense, non-deterministic CI gates, and how production teams run shadow evaluation and regression suites.
Multi-Agent Systems: Orchestration, LangGraph, and Production Patterns
Single agents hit context limits and accumulate errors on complex tasks. Learn orchestrator-worker architectures, LangGraph state machines, AutoGen debate patterns, parallelization, and why most multi-agent demos break beyond 5 steps in production.
Agentic RAG: ReAct, Self-RAG, and Multi-Step Retrieval
Single-shot RAG fails on multi-hop questions and self-correction. Master agentic RAG patterns — ReAct loops, Self-RAG reflection tokens, FLARE, and tool-augmented retrieval — with latency budgets and failure modes every FAANG candidate must know.
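The core of every agentic RAG system is a bounded ReAct loop. A minimal sketch, where `llm` and the `tools` dict are hypothetical stand-ins for your model client and retrievers:

```python
def react(question: str, llm, tools: dict, max_steps: int = 6) -> str:
    """Minimal ReAct loop: Thought -> Action -> Observation until Final Answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):                    # hard cap bounds latency and cost
        step = llm(transcript + "Thought:")       # model emits a thought + optional action
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:                     # e.g. "Action: search[multi-hop query]"
            name, _, arg = step.split("Action:", 1)[1].strip().partition("[")
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"  # feed evidence back in
    return "ESCALATE: step budget exhausted"      # a real system falls back, not loops
```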
RLHF and DPO: Aligning LLMs with Human Preferences
How RLHF transforms a base LLM into a helpful, harmless assistant — and why DPO has largely replaced PPO for this task. Covers reward model training, PPO instability and reward hacking, Constitutional AI, and when PPO still wins.
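The reason DPO displaced PPO is visible in its loss: no reward model, no rollouts, just a classification-style objective over preference pairs. A sketch in PyTorch, where the arguments are summed log-probabilities of each response under the policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy's margin on (chosen - rejected) above the
    reference model's margin; beta controls the implicit KL penalty."""
    margins = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margins).mean()
```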
LLM Fine-Tuning: LoRA, QLoRA, PEFT & RLHF
How to adapt pre-trained LLMs for specific tasks without catastrophic forgetting. Covers full fine-tuning vs PEFT, LoRA math and implementation, QLoRA for consumer hardware, instruction tuning, RLHF with PPO, DPO as the modern alternative, and when fine-tuning actually helps vs. when RAG or prompting is better.
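The LoRA math fits in a few lines: freeze W, learn a low-rank update BA scaled by alpha/r. The sizes below are illustrative; a 4096-wide layer at rank 16 trains under 1% of the layer's parameters:

```python
import torch

d, r, alpha = 4096, 16, 32
W = torch.randn(d, d)                 # frozen pretrained weight
A = torch.randn(r, d) * 0.01          # trainable, small random init
B = torch.zeros(d, r)                 # trainable, zero init => update starts as a no-op

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Only A and B train: 2*d*r = 131K params vs. 16.8M in W (~0.8%)
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```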
Knowledge Distillation for LLMs: Logit KD, Context Distillation, and Speculative Decoding Pairing
Classical KD minimizes the KL divergence between teacher and student output distributions on a fixed dataset; LLM-era variants distill **reasoning traces**, **tool-use formats**, or **chain-of-thought** into smaller models — or pair a student draft model with teacher verification in speculative decoding. Covers when offline KD beats RLHF, sequence-level distillation pitfalls, and latency-quality tradeoffs for on-device assistants.
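The classical objective is itself only a few lines; LLM-era variants mostly change what data the teacher labels, not the loss. A sketch assuming per-example logits in PyTorch:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """Hinton-style logit distillation: KL between temperature-softened
    teacher and student distributions, scaled by T^2 to keep gradients stable."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T
```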
RAFT — Retrieval-Augmented Fine-Tuning: When RAFT Beats RAG and When It Does Not
RAFT (UC Berkeley, with industry follow-ons) trains models to ignore distractor documents and cite the right passages — closing the gap where vanilla RAG retrieves noise and the model hedges. This guide covers the distractor-augmented training recipe, comparison to supervised fine-tuning without retrieval, evaluation on open-book QA, and failure modes when your doc corpus drifts faster than your retraining cadence.
How to Approach a GenAI / LLM System Interview
The mindset and signal-management playbook for generative AI and large language model interviews. Covers the research-track vs production-track split, time budgeting for LLM design, recovery when math comes up, and the trap of pattern-matching RAG or fine-tuning without failure modes.
How to Design GenAI Systems: From Blank Whiteboard to Production
The mechanical playbook for designing LLM applications in interviews. Covers the prompt-vs-RAG-vs-fine-tune decision tree, RAG pipeline (chunking, embeddings, FAISS, reranking), LoRA, the vLLM serving stack (KV cache, PagedAttention), and reference architectures for code assistants and document QA.
Chain-of-Thought, Test-Time Compute & Multi-Step Reasoning
From CoT and self-consistency to tree search over verbal states and modern reasoning-tuned models. When step-by-step prompts help, when they add latency and variance, and how to evaluate and ship reasoning in production (verifiers, judges, routing) without leaking competitive or customer detail.
Diffusion Models for Images — DDPM, Latent Diffusion, CFG, Stable Training
How denoising diffusion and latent diffusion power modern image gen (DALL·E, Stable Diffusion class systems): forward noise, score matching, DDIM-style fast sampling, classifier-free guidance, and production concerns — VRAM, latency, safety filters, and eval (FID, CLIP score, red-team). Connects the five GenAI planes for *generation-first* (non-LLM) stacks.
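The whole forward process collapses to one closed-form line, which is what makes training tractable: you can sample any timestep directly instead of iterating. A sketch in PyTorch, where `alphas_cumprod` is the precomputed noise schedule:

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Forward diffusion: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)    # per-sample noise level
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return x_t, eps                               # the UNet is trained to predict eps
```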
Structured Output, Function & Tool Calling — JSON Schema, Strict Mode, Agent Safety
How tool calling actually works in production — OpenAI-style function tools, JSON Schema constraints, strict structured outputs, parallel vs. sequential tools, and the auth/idempotency story interviewers expect. Connects the five GenAI planes from gateway design to eval and incident response.
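For orientation, here is the shape of an OpenAI-style function tool with strict structured outputs; `get_weather` and its fields are illustrative, not a real API:

```python
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "strict": True,                    # constrain the call to this exact schema
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],  # strict mode: every key listed...
            "additionalProperties": False, # ...and no extras allowed
        },
    },
}
```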
LLM Evaluation & Benchmarking — HELM, MMLU, MT-Bench, Arena, LLM-as-Judge
How to evaluate foundation and chat models without fooling yourself — HELM’s multi-metric design, instruction-following suites, chat leaderboards, RAGAS for grounded systems, contamination, and when human eval still wins. Connects public benchmarks to a private canary stack you can ship against.
LLM Fundamentals — Transformers, Attention & Architecture
Deep understanding of how large language models work — from self-attention and the transformer architecture to modern optimizations (KV cache, Flash Attention, RoPE, GQA). Essential for senior AI/ML engineer interviews.
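The attention core is small enough to write from memory, and interviewers often ask for exactly this. A NumPy sketch without masking or multi-head plumbing:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # each row: weighted mix of values
```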
Long-Context LLMs — Lost in the Middle, RAG vs. Natively Long, KV Cache & Packing
What 32K–1M+ token support really means: attention and KV cache economics, the lost-in-the-middle result (Liu et al., TACL 2024, arXiv:2307.03172), needle-in-haystack and chunk reordering for RAG, and when retrieval still wins on cost and proof. Connects the five GenAI planes for staff-level interviews.
Multimodal LLMs — CLIP, Vision-Language Models & Production Vision APIs
How image+text models work at scale — contrastive pretraining, projection layers, and LLaVA-style instruction tuning. Covers evaluation (MMMU, VQA, retrieval), latency and token economics, and failure modes interviewers expect you to name (hallucinated objects, OCR brittleness, eval contamination).
Positional Encoding — Sinusoidal, RoPE, ALiBi & Context Length Extrapolation
Deep-dive into why self-attention is permutation-invariant and how sinusoidal, learned, RoPE, and ALiBi positional encodings solve this — with production guidance on context length extrapolation, YaRN scaling, and why RoPE is now the default for new LLM architectures.
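A sketch of RoPE on a single query/key vector: rotate consecutive dimension pairs by position-dependent angles, so dot products between queries and keys depend only on their relative offset:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to one d-dim query/key vector."""
    half = x.shape[-1] // 2
    theta = pos * base ** (-np.arange(half) / half)   # one frequency per pair
    x1, x2 = x[0::2], x[1::2]                         # interleaved dimension pairs
    rotated = np.stack([x1 * np.cos(theta) - x2 * np.sin(theta),
                        x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)
    return rotated.reshape(x.shape)
```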
Advanced RAG: Hybrid Retrieval, Reranking, and Production Architecture
Go beyond naive vector search: master hybrid retrieval with BM25 and dense embeddings fused via Reciprocal Rank Fusion, two-stage reranking with cross-encoders, HyDE, RAPTOR, and the RAG vs fine-tuning decision framework. Includes production failure modes and RAGAS evaluation.
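Reciprocal Rank Fusion itself is almost trivially small, which is part of why it is the default fusion choice: no score normalization across BM25 and cosine scales is needed.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in rankings:                           # e.g. [bm25_ids, dense_ids]
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]])  # ['d1', 'd3', 'd9', 'd7']
```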
RAG Architecture: From Basics to Production
Retrieval-Augmented Generation is the most common GenAI system design topic. Master chunking strategies, embedding models, vector databases, hybrid search, reranking, advanced retrieval patterns (HyDE, RAPTOR), agentic RAG, guardrails, and production evaluation with RAGAS.
Vector Search for GenAI: HNSW, IVF-PQ, FAISS, and ScaNN in Production
Standalone deep dive on vector search systems for GenAI workloads. Learn how HNSW, IVF, IVF-PQ, and ScaNN differ on recall-latency-cost, how to tune parameters like efSearch and nprobe, and how to choose the right index for million-to-billion scale retrieval.
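A minimal FAISS sketch of the knobs the guide covers; the corpus here is random stand-in data, and the index string and sizes are illustrative:

```python
import faiss
import numpy as np

d = 768
xb = np.random.rand(100_000, d).astype("float32")   # stand-in corpus embeddings
xq = np.random.rand(10, d).astype("float32")

index = faiss.index_factory(d, "IVF1024,PQ64")      # 1024 coarse clusters, 64-byte codes
index.train(xb)                                     # learn centroids + PQ codebooks
index.add(xb)
index.nprobe = 32                                   # clusters scanned per query:
D, I = index.search(xq, 10)                         # raise for recall, lower for latency
```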
LLM Guardrails and Safety: Input/Output Filters, Red-Teaming, and Constitutional AI
Production LLM systems require multi-layer safety mechanisms: prompt injection defenses, content classifiers, PII detection, output moderation, and red-teaming pipelines. This guide covers the defense-in-depth safety architecture used at OpenAI, Anthropic, Meta, and Google — the techniques increasingly tested in AI engineer interviews at companies building LLM-powered products.
LLM Gateway: Routing, Guardrails, Quotas, and Observability for Production GenAI
An LLM gateway is the control plane between applications and model providers. Learn architecture patterns for model routing, rate limiting, policy enforcement, prompt/response filtering, caching, fallback handling, and cost governance at scale.
LLM Observability & Monitoring: Traces, Cost and Latency SLOs, Eval Harnesses, and Alerting
Traditional APM misses token streams, tool loops, and judge drift. This guide covers OpenTelemetry-style traces for LLM chains, LangSmith / Phoenix-style eval sessions, cost attribution per tenant, latency SLOs by model tier, golden-set regression, and production alerts for schema drift in structured outputs.
LLM Serving at Scale: vLLM, KV Cache, Batching, and LLMOps
The engineering behind serving large language models at high throughput and low latency. Covers prefill/decode distinction, KV cache memory math, PagedAttention, continuous batching, speculative decoding, Flash Attention, MQA/GQA, quantization, and the LLMOps discipline (versioning, release gates, canary, rollback) needed to deploy them safely.
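The KV cache memory math is worth having at your fingertips: 2 tensors (K and V) x layers x KV heads x head dim x tokens x bytes per element. The defaults below sketch a Llama-3-8B-like config at fp16, as an assumption for illustration:

```python
def kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                seq_len=8192, batch=16, bytes_per_el=2):
    """KV cache size in GB for one model instance."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

print(f"{kv_cache_gb():.1f} GB")   # ~17 GB of cache alone, which is why PagedAttention exists
```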
LLM Quantization: INT4/INT8, GPTQ, AWQ, and bitsandbytes
How to compress LLMs from 140GB to 35GB without destroying quality. Covers PTQ vs QAT, INT8 absmax/zero-point methods, GPTQ Hessian-based INT4, AWQ salient-weight protection, bitsandbytes mixed-precision, and the calibration dataset trap most engineers miss.
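The absmax method in miniature, plus the outlier problem in one line: a single large weight stretches the scale for the whole tensor.

```python
import numpy as np

def quantize_absmax(w: np.ndarray):
    """Symmetric INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = 127.0 / np.max(np.abs(w))
    return np.round(w * scale).astype(np.int8), scale

w = np.random.randn(4096).astype(np.float32)
w[0] = 40.0                               # one outlier...
q, scale = quantize_absmax(w)
print(np.abs(w - q / scale).max())        # ...degrades precision for all 4096 weights
```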
Speculative Decoding: 2-4x LLM Inference Speedup Without Quality Loss
How speculative decoding exploits GPU underutilization in the decode phase to achieve 2-4x speedup with mathematically guaranteed output distribution equivalence. Covers draft-verify mechanics, acceptance probability, Medusa heads, Lookahead decoding, and the non-obvious constraints around tokenizer matching.
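The distribution guarantee comes from a rejection step: accept each drafted token with probability min(1, p_target/p_draft), otherwise resample from the normalized residual. A sketch over full next-token distributions:

```python
import numpy as np

def verify_token(token: int, p_target: np.ndarray, p_draft: np.ndarray) -> int:
    """Accept the draft's token or resample so output matches the target
    model's distribution exactly (Leviathan et al., 2023)."""
    if np.random.rand() < min(1.0, p_target[token] / p_draft[token]):
        return token                                  # accepted: free speedup
    residual = np.maximum(p_target - p_draft, 0.0)    # rejected: sample the
    residual /= residual.sum()                        # leftover probability mass
    return int(np.random.choice(len(residual), p=residual))
```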