How to Design GenAI Systems: From Blank Whiteboard to Production
The mechanical playbook for designing LLM applications in interviews. Covers the prompt-vs-RAG-vs-fine-tune decision tree, RAG pipeline (chunking, embeddings, FAISS, reranking), LoRA, the vLLM serving stack (KV cache, PagedAttention), and reference architectures for code assistants and document QA.
GenAI Design Is a Mechanical Skill — Treat It That Way
Strong GenAI candidates do not invent architectures from scratch in 45 minutes. They run a playbook: a sequence of decisions and patterns that produce defensible designs reliably. The skill is not creativity — it is knowing which technique fits which constraint, then assembling the components under time pressure.
This page is the mechanical playbook. It assumes you already know how to approach the interview (covered on the companion page) and focuses on what to design and why.
The rule that organises everything: constraints drive technique selection, not the other way around. Knowledge gap → RAG. Style gap → fine-tuning. Latency budget → quantisation, GQA, speculative decoding, and small-model routing. Cost ceiling → self-host vs hosted, batch vs real-time, model size. If you cannot tie a component to a specific constraint, that component should not be in your design.
The second rule: technique decision first, architecture second, deep-dive third. Most candidates draw boxes (retrieval → LLM → response) before they've decided whether RAG is even the right tool. The reverse is correct: settle the prompt-vs-few-shot-vs-RAG-vs-fine-tune-vs-pretrain question, then draw the architecture that the chosen technique implies.
The 6-Phase GenAI Design Playbook
Phase 1 — Translate the product ask into 4 numbers (5 min)
Accuracy bar (acceptance rate or factual error tolerance) · latency budget (TTFT, ITL, total response) · cost ceiling ($ per request or per active user per month) · scale (QPS, concurrent users, corpus size). Without these, every choice downstream is a guess.
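The four numbers can be captured in a small budget object before any architecture is drawn. A minimal sketch (field names and values are illustrative, not from the guide):

```python
from dataclasses import dataclass

@dataclass
class DesignBudget:
    """The four Phase-1 numbers. Field names are illustrative."""
    accuracy_bar: float          # e.g. minimum acceptance rate, 0..1
    ttft_ms: int                 # time-to-first-token budget
    cost_per_request_usd: float  # cost ceiling per request
    peak_qps: int                # scale at peak

    def tokens_per_second_needed(self, avg_output_tokens: int) -> float:
        # Rough generation throughput the serving stack must sustain at peak.
        return self.peak_qps * avg_output_tokens

budget = DesignBudget(accuracy_bar=0.90, ttft_ms=500,
                      cost_per_request_usd=0.002, peak_qps=50)
# 50 QPS x 300 output tokens -> 15,000 tokens/s sustained.
```

Every later phase can then be checked against this object instead of against vibes.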
Phase 2 — Climb the technique escalation ladder (5 min)
Prompt → few-shot → RAG → fine-tune (LoRA) → continued pretraining. Each rung is ~10x more expensive and slower to iterate. Pick the lowest rung that meets the bar. Hybrids are common: fine-tune for style + RAG for knowledge.
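The ladder logic, including hybrids, can be sketched as a constraint-to-technique mapping (flag names are illustrative; a real decision also weighs cost and iteration speed):

```python
def pick_technique(needs_private_knowledge: bool,
                   needs_style_or_format: bool,
                   needs_new_domain_language: bool) -> list[str]:
    """Return the lowest rung(s) of the escalation ladder that the
    constraints force; an empty constraint set stays on the bottom rung."""
    chosen = []
    if needs_private_knowledge:
        chosen.append("RAG")                    # knowledge gap
    if needs_style_or_format:
        chosen.append("fine-tune (LoRA)")       # style/format gap
    if needs_new_domain_language:
        chosen.append("continued pretraining")  # vocabulary/domain gap
    return chosen or ["prompt / few-shot"]
```

Note the hybrid falls out naturally: a knowledge gap plus a style gap yields RAG + LoRA together.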
Phase 3 — Pick the model class (5 min)
Hosted (GPT-4 / Claude / Gemini) for fastest time-to-quality, no infra burden, but per-request cost. Self-hosted (Llama 3, Mistral, Qwen, Mixtral) for cost control at scale, customisation, on-prem. Multi-model routing (small for easy, large for hard) cuts cost 5-10x in production.
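A multi-model router can be as simple as a difficulty gate in front of two models. This toy heuristic stands in for the trained classifier a production router would use (marker list and threshold are invented for illustration):

```python
def route_query(query: str, max_easy_len: int = 200) -> str:
    """Toy router: short queries with no hard markers go to the small model.
    Production systems replace this heuristic with a trained classifier."""
    hard_markers = ("prove", "debug", "refactor", "step by step")
    q = query.lower()
    if len(query) <= max_easy_len and not any(m in q for m in hard_markers):
        return "small"   # e.g. a self-hosted 7-8B model
    return "large"       # hosted frontier model for hard queries
```

The cost win comes from the traffic mix: if most queries route small, blended cost per request drops sharply even though the large model is unchanged.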
Phase 4 — Design the data + retrieval path if RAG (10 min)
Chunking strategy (fixed vs semantic vs hierarchical) · embedding model (BGE-M3, GTE-large, E5-mistral, Cohere v3) · ANN index (FAISS flat / IVF / IVF-PQ vs HNSW vs ScaNN) · reranker (BM25 hybrid + cross-encoder). Anchor with chunk count and recall@k targets.
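The first two decisions on this list can be sketched in a few lines; here NumPy brute-force cosine search stands in for a FAISS flat index (which is also exact search), and chunk sizes are typical values rather than recommendations:

```python
import numpy as np

def fixed_chunks(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap; the simplest rung of the
    fixed vs semantic vs hierarchical choice."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact cosine top-k; equivalent to a FAISS flat index on
    normalised vectors. Swap in IVF/HNSW once the corpus outgrows this."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(index_n @ q)[::-1][:k]
```

Measuring recall@k on a golden set against this exact baseline is what justifies (or rules out) the move to an approximate index.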
Phase 5 — Design the serving stack (10 min)
vLLM with PagedAttention + continuous batching · GPU sizing from weights + KV cache math · quantisation (INT8 / FP8 / INT4 / AWQ / GPTQ) · speculative decoding for ITL · prefix caching for repeated system prompts. Anchor with concurrent request capacity and $ per 1M tokens.
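The GPU sizing math mentioned here is mechanical. Per token, the KV cache stores a K and a V tensor for every layer; with GQA only the KV heads count. A sketch using Llama-3-8B-like shapes (the model shape and 80 GB GPU are illustrative assumptions):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """The leading 2 is for K plus V; bytes_per_elem=2 assumes FP16/BF16."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_concurrent_requests(gpu_gb: float, weights_gb: float,
                            ctx_tokens: int, kv_bytes_per_token: int) -> int:
    """Full-context requests that fit in GPU memory left after weights.
    PagedAttention lets vLLM pack real (shorter) sequences far more densely."""
    free_bytes = (gpu_gb - weights_gb) * 1024**3
    return int(free_bytes // (ctx_tokens * kv_bytes_per_token))

# Llama-3-8B-like: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 weights ~16 GB.
per_tok = kv_cache_bytes_per_token(32, 8, 128)        # 131072 B = 128 KiB/token
fits = max_concurrent_requests(80, 16, 8192, per_tok)  # 64 full-context requests
```

This is the "anchor with concurrent request capacity" number: quoting it, and showing how quantisation or a shorter context moves it, is what the deep-dive expects.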
Phase 6 — Design eval + failure modes (10 min)
Offline golden set (recall@k, RAGAS faithfulness, RAGAS answer relevance) · online metrics (thumbs-up, escalation rate, abandonment) · LLM-as-judge for continuous monitoring · red-team set for jailbreaks and PII extraction · failure modes (hallucination, injection, retrieval miss, KV cache OOM, model drift).
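The offline metrics above are cheap to compute once a golden set exists. A minimal recall@k, the backbone of the retrieval eval (relevance judgments are assumed to come from the golden set):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of golden-set relevant chunks found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)
```

Run it per query and average across the golden set; a retrieval miss (the listed failure mode) shows up here long before it shows up as a hallucinated answer online.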