RNNs, LSTMs & GRUs: Sequence Models Before Transformers
Deep dive into recurrent neural networks for FAANG ML interviews. Covers vanilla RNN recurrence and BPTT, vanishing/exploding gradients (Pascanu 2013), LSTM cell state and gates (Hochreiter & Schmidhuber 1997), GRU (Cho 2014), seq2seq with Bahdanau attention, why transformers replaced RNNs in 2017, and where RNN-shaped models still win (streaming inference, Mamba 2023). 8 interview questions with answers.
Why Sequence Models Matter — And Why RNNs Were a Big Deal
Before 2017, every serious sequence task — machine translation, speech recognition, language modeling, time-series forecasting — ran on RNNs or their gated variants (LSTM, GRU). Google Translate's GNMT (2016) was an 8-layer stacked LSTM with attention serving billions of queries. Alexa's wake-word detector was a tiny LSTM on device. Character-level language models from Karpathy's 2015 blog post were single-layer LSTMs.
RNNs matter conceptually because they are the minimal viable model for sequential data: they maintain a hidden state that evolves over time, letting the model 'remember' prior context. Unlike a feedforward network that sees a fixed-size input, an RNN processes arbitrary-length sequences by sharing weights across timesteps.
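To make that weight sharing and evolving hidden state concrete, here is a minimal NumPy sketch of the vanilla RNN recurrence. The dimensions, initialization scale, and inputs are arbitrary placeholders for illustration, not taken from any particular paper or library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
input_dim, hidden_dim, seq_len = 8, 16, 20

# One set of weights, reused at every timestep (weight sharing).
W_x = rng.normal(scale=0.3, size=(hidden_dim, input_dim))   # input -> hidden
W_h = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Vanilla RNN recurrence: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = np.zeros(hidden_dim)          # initial hidden state
    states = []
    for x_t in xs:                    # works for any sequence length
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(seq_len, input_dim))   # an arbitrary-length input sequence
hs = rnn_forward(xs)
print(hs.shape)   # (20, 16): one hidden state per timestep
```

The same two weight matrices are applied at every step, which is exactly what lets the model consume sequences of any length, and exactly what makes gradients through time a repeated product, the problem analyzed below.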
Interviewers still ask about RNNs for three reasons. First, the vanishing/exploding gradient analysis is the canonical example of why depth is hard — understanding it is table-stakes for any deep learning role. Second, LSTMs are still production-critical in streaming, on-device, and extreme-long-sequence workloads where transformers are impractical. Third, state-space models (Mamba 2023, RWKV, RetNet) are RNNs in disguise — candidates who don't understand the recurrent formulation can't reason about the modern revival.
The common misconception: RNNs are dead. They aren't — transformers won the parallelism war but RNN-shaped models are winning the linear-time-inference war.
What Interviewers Evaluate on RNN Questions
A 6/10 answer describes an RNN as 'a neural network with a loop' and an LSTM as 'an RNN with gates to remember things.' A 9/10 answer does four things:
1. Derives the vanishing gradient mathematically — BPTT unrolls the recurrence, so the gradient w.r.t. an early timestep contains a product of Jacobians ∂h_t/∂h_{t-1} = diag(tanh'(z_t)) · W_h; when the largest singular value of W_h sits below 1, the norm of that product decays exponentially in the number of timesteps it spans. This is not hand-wavy; you write the product (see the sketch after this list).
2. Explains the LSTM fix precisely — the cell state has an additive update c_t = f_t · c_{t-1} + i_t · g_t, so ∂c_t/∂c_{t-1} = diag(f_t), an element-wise gate rather than a matrix product. When the forget gate stays near 1, gradients flow almost unchanged across hundreds of timesteps (also shown in the sketch after this list). This is the Constant Error Carousel (CEC) — the single most important concept in the 1997 paper.
3. Knows when LSTM still beats transformer — streaming inference (a fixed-size state and O(1) cost per step, versus a transformer KV cache that grows with every token; see the streaming sketch below), tiny on-device models (keyword spotting, gesture recognition), and extremely long sequences that don't fit in a transformer's context window.
4. Places attention correctly in history — Bahdanau 2014 attention was bolted onto recurrent seq2seq models to fix the fixed-size context-vector bottleneck. Transformers (2017) dropped recurrence entirely and kept only the attention. Candidates who don't know this miss the story arc.
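Points 1 and 2 are easiest to internalize with numbers in front of you. The sketch below is a toy NumPy illustration, with made-up pre-activations and gate values rather than anything from a trained model: it multiplies the BPTT Jacobians of a vanilla RNN, then contrasts them with the LSTM cell-state path.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 16, 100

# --- Vanilla RNN: the BPTT term for an early timestep contains a product of
#     T Jacobians, dh_t/dh_{t-1} = diag(tanh'(z_t)) @ W_h.
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_h *= 0.9 / np.linalg.norm(W_h, 2)      # pin the largest singular value at 0.9

prod = np.eye(hidden_dim)
for t in range(1, T + 1):
    z_t = rng.normal(size=hidden_dim)    # stand-in pre-activations, not from a real model
    prod = np.diag(1.0 - np.tanh(z_t) ** 2) @ W_h @ prod
    if t in (1, 10, 50, 100):
        print(f"t={t:3d}  ||Jacobian product|| = {np.linalg.norm(prod):.2e}")
# The norm decays exponentially (vanishing). Rescale W_h so its largest
# singular value is ~2 instead, and the same product blows up (exploding).

# --- LSTM cell state: c_t = f_t * c_{t-1} + i_t * g_t,
#     so dc_t/dc_{t-1} = diag(f_t): an element-wise gate, not a matrix product.
f = 0.99                                  # forget gate sitting near 1
print(f"forget-gate factor over {T} steps: {f ** T:.2f}")   # ~0.37, not ~1e-40
```

With the singular value pinned below 1, the Jacobian product collapses by orders of magnitude within a few dozen steps, while the LSTM path attenuates only by the product of forget-gate values, which is the CEC argument an interviewer wants to hear stated explicitly.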
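For point 3, the per-step asymmetry can be shown schematically. This is not a real LSTM or transformer implementation, just a sketch of what each one has to keep around as a stream of tokens grows.

```python
import numpy as np

hidden_dim, stream_len = 16, 10_000

# Recurrent streaming inference: the entire memory is a fixed-size state,
# so every new token costs the same compute and memory as the previous one.
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)

# Transformer-style streaming inference: the key/value of every past token
# stays in the cache, so step t attends over t entries and the cache grows.
kv_cache = []
for t in range(stream_len):
    kv_cache.append(np.zeros((2, hidden_dim)))    # (key, value) for token t

print("recurrent state:", h.size + c.size, "numbers, regardless of stream length")
print("KV cache:       ", sum(kv.size for kv in kv_cache), "numbers and growing")
```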