
RNNs, LSTMs & GRUs: Sequence Models Before Transformers

Deep dive into recurrent neural networks for FAANG ML interviews. Covers vanilla RNN recurrence and BPTT, vanishing/exploding gradients (Pascanu 2013), LSTM cell state and gates (Hochreiter & Schmidhuber 1997), GRU (Cho 2014), seq2seq with Bahdanau attention, why transformers replaced RNNs in 2017, and where RNN-shaped models still win (streaming inference, Mamba 2023). 8 interview questions with answers.

55 min read · 3 sections · 8 interview questions
Tags: RNN, LSTM, GRU, BPTT, Vanishing Gradients, Gradient Clipping, Seq2Seq, Bahdanau Attention, Sequence Models, Teacher Forcing, Mamba, State Space Models, GNMT, Recurrent Networks

Why Sequence Models Matter — And Why RNNs Were a Big Deal

Before 2017, every serious sequence task — machine translation, speech recognition, language modeling, time-series forecasting — ran on RNNs or their gated variants (LSTM, GRU). Google Translate's GNMT (2016) was an 8-layer stacked LSTM with attention serving billions of queries. Alexa's wake-word detector was a tiny LSTM running on device. The character-level language models in Karpathy's 2015 blog post were small multi-layer LSTMs.

RNNs matter conceptually because they are the minimum viable model for sequential data: they maintain a hidden state that evolves over time, letting the model 'remember' prior context. Unlike a feedforward network that sees a fixed-size input, an RNN processes arbitrary-length sequences by sharing weights across timesteps.
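
To make the recurrence concrete, here is a minimal NumPy sketch of the vanilla RNN forward pass. The function and parameter names (rnn_forward, W_h, W_x) are illustrative, not from any particular library:

```python
import numpy as np

def rnn_forward(x_seq, h0, W_h, W_x, b):
    """Vanilla RNN: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b).

    The same (W_h, W_x, b) are reused at every timestep; this weight
    sharing is what lets one model handle arbitrary-length sequences.
    """
    h, states = h0, []
    for x_t in x_seq:                # strictly sequential: O(T) steps
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return states

# Toy usage: 10 timesteps of 4-dim inputs, 8-dim hidden state.
rng = np.random.default_rng(0)
x_seq = [rng.normal(size=4) for _ in range(10)]
hs = rnn_forward(x_seq, np.zeros(8),
                 rng.normal(scale=0.3, size=(8, 8)),
                 rng.normal(scale=0.3, size=(8, 4)),
                 np.zeros(8))
```

Note the strictly sequential loop: each h_t depends on h_{t-1}, so timesteps cannot be computed in parallel at training time. That constraint is the seed of the parallelism problem discussed below.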

Interviewers still ask about RNNs for three reasons. First, the vanishing/exploding gradient analysis is the canonical example of why depth is hard — understanding it is table-stakes for any deep learning role. Second, LSTMs are still production-critical in streaming, on-device, and extreme-long-sequence workloads where transformers are impractical. Third, state-space models (Mamba 2023, RWKV, RetNet) are RNNs in disguise — candidates who don't understand the recurrent formulation can't reason about the modern revival.
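
A quick way to see the 'RNNs in disguise' claim: a discretized linear state-space layer is a recurrence h_t = A·h_{t-1} + B·x_t with readout y_t = C·h_t, i.e. an RNN whose transition is linear. A minimal sketch with illustrative matrices (Mamba's 'selective' twist makes these parameters input-dependent):

```python
import numpy as np

def ssm_scan(x_seq, A, B, C):
    """Discretized linear SSM: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.
    Structurally a vanilla RNN with a linear transition, which is why it
    inherits O(1)-per-step streaming inference like an LSTM."""
    h = np.zeros(A.shape[0])
    out = []
    for x_t in x_seq:
        h = A @ h + B @ x_t          # constant-size state carried forward
        out.append(C @ h)
    return out
```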

The common misconception: RNNs are dead. They aren't — transformers won the parallelism war but RNN-shaped models are winning the linear-time-inference war.

IMPORTANT

What Interviewers Evaluate on RNN Questions

A 6/10 answer describes an RNN as 'a neural network with a loop' and an LSTM as 'an RNN with gates to remember things.' A 9/10 answer does four things:

1. Derives the vanishing gradient mathematically — BPTT unrolls the recurrence, so the gradient w.r.t. an early timestep contains the product of Jacobians ∂h_T/∂h_t = ∏_{k=t+1..T} diag(tanh′(z_k)) · W_h. If the spectral norm of W_h stays below 1, this product decays exponentially; if it stays well above 1, gradients can explode (Pascanu 2013). This is not hand-wavy; you write the product (the sketch after this list computes it numerically).

2. Explains the LSTM fix precisely — the cell state has an additive update c_t = f_t · c_{t-1} + i_t · g_t, so ∂c_t/∂c_{t-1} = diag(f_t), an element-wise gate rather than another matrix product. When the forget gate stays near 1, gradients flow almost unchanged across hundreds of timesteps. This is the Constant Error Carousel (CEC) — the single most important concept in the 1997 paper, and the second half of the sketch below.

3. Knows when LSTM still beats transformer — streaming inference (O(1) per step vs transformer's growing KV cache), tiny on-device models (keyword spotting, gesture), and extremely long sequences that don't fit transformer context.

4. Places attention correctly in history — Bahdanau 2014 attention was bolted onto seq2seq RNNs (GRU-style gated units in the original paper, LSTMs in GNMT) to fix the fixed-size context-vector bottleneck. Transformers (2017) dropped recurrence entirely and kept only the attention. Candidates who don't know this miss the story arc.
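
A minimal numerical sketch of points 1 and 2, using synthetic weights and gate values chosen only to illustrate the two regimes (nothing here comes from a real trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 100  # hidden size, timesteps unrolled by BPTT

# Point 1 -- vanilla RNN: dh_T/dh_0 is a product of T Jacobians
# diag(tanh'(z_k)) @ W_h. With W_h's spectral norm below 1 (~0.8 here),
# the norm of the product collapses exponentially.
W_h = rng.normal(scale=0.4 / np.sqrt(d), size=(d, d))
grad = np.eye(d)
for _ in range(T):
    z = rng.normal(size=d)                    # synthetic pre-activations
    grad = np.diag(1.0 - np.tanh(z) ** 2) @ W_h @ grad
print("vanilla RNN ||dh_T/dh_0||:", np.linalg.norm(grad))   # ~0: vanished

# Point 2 -- LSTM cell-state path: dc_t/dc_{t-1} = diag(f_t), so the
# end-to-end gradient is an element-wise product of gates, not matrices.
grad = np.ones(d)
for _ in range(T):
    f_t = 0.97 * np.ones(d)                   # forget gate held near 1
    grad = f_t * grad                         # the Constant Error Carousel
print("LSTM cell   ||dc_T/dc_0||:", np.linalg.norm(grad))   # ~0.97**T: gentle decay
```

Note the asymmetry: gradient clipping, the standard mitigation Pascanu 2013 recommends for the exploding case, rescales an oversized gradient but does nothing for the vanishing case; that is exactly the gap the gated cell path closes.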
