PCA & Dimensionality Reduction: PCA, t-SNE, UMAP, Autoencoders
PCA from first principles — variance maximization, eigendecomposition, SVD formulation, and when SVD beats eigendecomposition of the covariance matrix. Plus kernel PCA, t-SNE pitfalls, UMAP, autoencoders, and embeddings as the modern alternative. 10 hard interview questions with production-grade answers.
Why Dimensionality Reduction Is Harder Than It Looks
Dimensionality reduction sounds like a preprocessing afterthought — compress d features into k < d — but it is one of the easiest ways to silently destroy signal in a pipeline. PCA is the canonical tool (Pearson 1901, Hotelling 1933), yet most candidates can't explain why it finds eigenvectors of the covariance matrix, when SVD is mandatory over eigendecomposition, or why t-SNE cluster sizes are meaningless. The deeper issue: PCA preserves variance, not signal. If your target y correlates with a low-variance feature direction, PCA can throw away the single most predictive axis while keeping noise. This is why feature selection via L1 regularization usually beats PCA for downstream supervised models, and why tree-based methods like XGBoost should never be fed PCA-transformed inputs — rotating the axes destroys the axis-aligned splits trees rely on. Dimensionality reduction done right means knowing when variance is the right objective, when neighborhood preservation is, and when to skip classical DR entirely in favor of learned embeddings.
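A minimal synthetic sketch of that failure mode (the feature distributions and sizes are invented for illustration): the label depends only on a low-variance direction, and one-component PCA keeps the high-variance noise axis instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(0, 10.0, size=n)      # std 10, irrelevant to y
signal = rng.normal(0, 0.5, size=n)      # std 0.5, fully determines y
X = np.column_stack([noise, signal])
y = (signal > 0).astype(int)

# One-component PCA keeps the high-variance noise axis and drops the signal axis.
X_pca = PCA(n_components=1).fit_transform(X)

acc_raw = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
acc_pca = cross_val_score(LogisticRegression(), X_pca, y, cv=5).mean()
print(f"raw features:            {acc_raw:.2f}")   # close to 1.0
print(f"after 1-component PCA:   {acc_pca:.2f}")   # close to 0.5 (chance)
```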
What Interviewers Actually Test on PCA
A 6/10 answer states 'PCA finds directions of maximum variance' and stops. A 9/10 answer derives it from the Lagrangian, explains why SVD of X beats eigendecomposition of X^T X (numerical stability when d >> n, avoids forming the covariance matrix which squares the condition number), connects principal components to the right-singular vectors, and knows that centering is mandatory but scaling is conditional — scale when features have different units (z-score), never scale image pixels. Staff-level signals: randomized SVD complexity (Halko 2011), why PCA hurts random forests, the difference between unsupervised DR (PCA) and supervised DR (LDA, embeddings), and why t-SNE/UMAP are visualization-only tools that must never be used as features for downstream ML.
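A sketch of the two routes on synthetic data, using plain NumPy: it checks that the right-singular vectors of the centered matrix match the covariance eigenvectors up to sign, and that the squared singular values recover the explained variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# Centering is mandatory; scaling is a separate, conditional decision.
Xc = X - X.mean(axis=0)

# Route 1: SVD of the centered data matrix itself.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components_svd = Vt                         # rows are principal directions
explained_var_svd = s**2 / (len(X) - 1)

# Route 2: eigendecomposition of the covariance matrix (explicitly forms X^T X).
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Same spectrum, same directions up to per-component sign flips.
assert np.allclose(explained_var_svd, eigvals)
assert np.allclose(np.abs(components_svd), np.abs(eigvecs.T), atol=1e-6)

# Scores (projections) used as downstream features:
scores = Xc @ components_svd.T
```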
Clarifying Questions Before You Reach for PCA
What is the downstream task?
Visualization (use t-SNE/UMAP), preprocessing for a linear model (PCA or L1 regularization), compression for ANN indexing (PCA to 128d before FAISS IVF), denoising (reconstruct from top-k components), or feature engineering for a tree model (skip PCA — it destroys axis-aligned splits). Each task implies a different tool.
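A hedged sketch of the ANN-indexing case (faiss and the 128-d target come from the answer above; the shapes and nlist value are illustrative): reduce first, index the reduced vectors, and push queries through the same fitted PCA.

```python
import numpy as np
import faiss
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 768)).astype("float32")   # stand-in for raw embeddings

# Reduce to 128d before indexing: smaller index, faster search.
pca = PCA(n_components=128)
X_red = pca.fit_transform(X).astype("float32")

d, nlist = X_red.shape[1], 100
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(X_red)
index.add(X_red)

# Queries must go through the same PCA fitted on the indexed vectors.
queries = pca.transform(X[:5]).astype("float32")
distances, ids = index.search(queries, 10)
```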
Is the structure linear or nonlinear?
PCA assumes the data lies on a linear subspace. For Swiss roll, ring, or manifold-structured data, use kernel PCA (Schölkopf 1998), UMAP (McInnes 2018), or an autoencoder. Visualize a 2D PCA scatter first — if classes overlap heavily but you know they're separable, linear DR is the problem.
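A quick illustration with scikit-learn's make_circles (gamma=10 is illustrative, not tuned): linear PCA leaves the two rings concentric, while RBF kernel PCA pulls them apart.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# Linear PCA cannot unfold a ring: classes stay concentric after projection.
X_lin = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel maps the ring structure to a separable layout.
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(X_lin[:, 0], X_lin[:, 1], c=y, s=8)
axes[0].set_title("linear PCA")
axes[1].scatter(X_rbf[:, 0], X_rbf[:, 1], c=y, s=8)
axes[1].set_title("kernel PCA (RBF)")
plt.show()
```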
What is `n` vs `d`?
When d >> n (genomics: n=200, d=20000), never form X^T X (d×d matrix of size 400M entries, numerically unstable). Use SVD directly on X (n×d), or randomized SVD for d > 10^4. When n >> d (typical tabular data), either works.
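A sketch of the d >> n case with scikit-learn, whose PCA runs SVD on X directly rather than building the d×d covariance; the randomized solver and the shapes mirror the genomics numbers above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d = 200, 20_000                       # far more features than samples
X = rng.normal(size=(n, d))

# Rank is at most n - 1 after centering, so asking for more components is pointless.
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
Z = pca.fit_transform(X)                 # (200, 50) scores; no 20000 x 20000 covariance is formed

print(Z.shape, pca.explained_variance_ratio_[:5])
```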
Do features share units?
Image pixels (0–255) share units — do not z-score, it amplifies noise in dark pixels. Mixed features (age in years + income in dollars) require z-score standardization or PCA will be dominated by the high-variance feature regardless of predictive value.
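A small sketch of the mixed-units case (the age and income distributions are invented for illustration): without z-scoring, the dollar-scale feature absorbs essentially all of the variance.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=1000)               # years, variance on the order of 1e2
income = rng.normal(60_000, 20_000, size=1000)     # dollars, variance on the order of 1e8
X = np.column_stack([age, income])

# Without scaling, PC1 is essentially "income" because its variance dwarfs age's.
raw = PCA(n_components=2).fit(X)
print(raw.explained_variance_ratio_)               # roughly [1.0, 0.0]

# With z-scoring, each feature contributes on a comparable scale.
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
print(scaled.named_steps["pca"].explained_variance_ratio_)   # roughly [0.5, 0.5]
```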
Is this for features or for humans?
If humans will look at the output (visualization, exploratory analysis), use UMAP/t-SNE and accept the caveats. If a model will consume the output, prefer PCA (deterministic, invertible, interpretable) or learned embeddings (Word2Vec, BERT, sentence-transformers) — never t-SNE features.
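One way to keep the two uses separate, sketched on scikit-learn's digits dataset (umap-learn is assumed installed; the component counts are illustrative): the UMAP layout feeds a plot and nothing else, while the model consumes PCA scores fitted inside a pipeline.

```python
import matplotlib.pyplot as plt
import umap                                          # umap-learn package
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# For humans: a 2D UMAP layout, used for the plot only.
layout = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
plt.scatter(layout[:, 0], layout[:, 1], c=y, s=5)
plt.title("UMAP layout (visualization only)")
plt.show()

# For the model: PCA scores, fit inside the pipeline so new data maps through cleanly.
model = make_pipeline(PCA(n_components=30), LogisticRegression(max_iter=2000))
print(cross_val_score(model, X, y, cv=5).mean())
```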