

Machine Learning · Intermediate

Linear & Logistic Regression: From OLS to FTRL

Master linear and logistic regression from first principles — OLS derivation via normal equation and MLE, Gauss-Markov (BLUE), why MSE fails for classification, IRLS, L1 vs L2 geometry, and production patterns like FTRL-Proximal used at Google Ads and Facebook CTR.

60 min read · 4 sections · 1 interview question
Linear Regression · Logistic Regression · OLS · MLE · Ridge · Lasso · ElasticNet · FTRL-Proximal · GLM · Regularization · Gauss-Markov · IRLS · Calibration · Sigmoid · Log-Odds

Why Linear and Logistic Regression Still Matter

Linear and logistic regression look simple — they fit a line or an S-curve. That deceptive simplicity is why most candidates fail the interview: they can recite the formula but can't derive it, don't know why Mean Squared Error breaks logistic regression, and can't explain why L1 produces sparse solutions while L2 does not.

These models remain the workhorse at the scale where tree ensembles become impractical. Google Ads trained trillion-parameter logistic regression models with FTRL-Proximal (McMahan 2013) for CTR prediction. Facebook's ads system combined boosted trees with logistic regression as the final layer (He et al. 2014). Vowpal Wabbit streams billion-sample datasets through online logistic regression at sub-ms latency. Credit scoring, medical risk models, and every regulated domain defaults to logistic regression because it is naturally calibrated, analytically interpretable, and regulator-friendly.

The interview trap: you will be asked to derive — the normal equation from MLE, the log-loss from the Bernoulli likelihood, the L1 sparsity property from subdifferentials. If you cannot show the math, you fail the round regardless of how many models you have shipped.
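A compact sketch of the derivations in question, in standard notation (design matrix X, targets y, weights β or w); this is a summary of the steps, not a full interview answer:

```latex
% OLS from least squares: set the gradient of the residual sum of squares to zero.
\min_{\beta} \|y - X\beta\|_2^2
  \;\Longrightarrow\; \nabla_\beta = -2X^\top (y - X\beta) = 0
  \;\Longrightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y

% The same estimator as MLE under Gaussian noise, y = X\beta + \varepsilon,
% \varepsilon \sim \mathcal{N}(0, \sigma^2 I):
\log L(\beta) = -\tfrac{n}{2}\log(2\pi\sigma^2)
              - \tfrac{1}{2\sigma^2}\,\|y - X\beta\|_2^2
% so maximizing the likelihood minimizes the squared error; the two derivations coincide.

% Log-loss from the Bernoulli likelihood with p_i = \sigma(x_i^\top w):
\log L(w) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
% negating and averaging gives the log-loss (cross-entropy) objective.
```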

IMPORTANT

What Separates a 9/10 from a 6/10 Answer

A 6/10 candidate writes down y = Xβ + ε and solves the normal equation. A 9/10 candidate (a) derives OLS two ways — as the least-squares projection AND as MLE under Gaussian noise — and explains why they coincide; (b) states the Gauss-Markov theorem and names its assumptions (linearity, exogenous zero-mean errors, homoscedastic and uncorrelated errors, no perfect multicollinearity; normality is not required for BLUE); (c) explains why MSE on logistic outputs is non-convex via the sigmoid sandwich and why log-loss is the natural choice from Bernoulli MLE; (d) draws the L1-ball diamond and L2-ball circle to argue geometrically why L1 produces sparsity; (e) names FTRL-Proximal and knows why it beats SGD for sparse online learning. Staff-tier answers connect the regularization story to the exponential family GLM unification.
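A quick numerical illustration of point (c), on a made-up one-feature dataset: the second derivative of MSE-on-sigmoid with respect to the weight goes negative somewhere, while the log-loss curvature stays non-negative. Only numpy is assumed; the data is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data (hypothetical): a single feature x and binary labels y.
x = np.array([-4.0, -2.0, -1.0, 1.0, 2.0, 4.0])
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

def mse_loss(w):
    p = sigmoid(w * x)
    return np.mean((p - y) ** 2)

def log_loss(w):
    p = np.clip(sigmoid(w * x), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def curvature(f, ws, h=0.05):
    # Central-difference estimate of the second derivative at each weight value.
    return np.array([(f(w + h) - 2 * f(w) + f(w - h)) / h ** 2 for w in ws])

ws = np.linspace(-6, 6, 241)
print("min MSE curvature:     ", curvature(mse_loss, ws).min())  # negative: non-convex
print("min log-loss curvature:", curvature(log_loss, ws).min())  # >= 0 up to noise: convex
```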

Clarifying Questions Before You Model

01

What is the target — continuous, binary, count, or multi-class?

Continuous → linear regression (OLS). Binary → logistic regression. Count (Poisson) or right-skewed positive → GLM with log link. Multi-class → multinomial logistic / softmax regression. Do NOT use linear regression for probabilities — predictions can fall outside [0,1].
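As a sketch, assuming scikit-learn, the mapping above might look like this; the estimator choices and the helper name are illustrative, not a prescribed API:

```python
from sklearn.linear_model import (
    LinearRegression,    # continuous target (OLS)
    LogisticRegression,  # binary target; handles multi-class via softmax as well
    PoissonRegressor,    # count target (GLM with log link)
)

def pick_estimator(target_type: str):
    """Return an unfitted estimator matched to the target type (illustrative helper)."""
    if target_type == "continuous":
        return LinearRegression()
    if target_type == "binary":
        return LogisticRegression(max_iter=1000)
    if target_type == "count":
        return PoissonRegressor()  # Poisson GLM, log link
    if target_type == "multiclass":
        # multinomial (softmax) is the default with the lbfgs solver in recent versions
        return LogisticRegression(max_iter=1000)
    raise ValueError(f"unknown target type: {target_type!r}")
```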

02

How many features and how sparse?

For dense features with n ≫ p, OLS via the normal equation is fine. For p > n (more features than samples, common in genomics and text) the normal-equation matrix X^T X is singular — you MUST regularize (Ridge, Lasso, or ElasticNet). For billion-feature sparse CTR models, use online FTRL-Proximal rather than batch solvers.
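A minimal numpy check of the p > n failure mode, with made-up shapes: the normal-equation matrix is rank-deficient, while adding a ridge term λI makes the system solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # more features than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

XtX = X.T @ X                        # p x p, but rank <= n < p, so singular
print(np.linalg.matrix_rank(XtX))    # 50, not 200

# np.linalg.solve(XtX, X.T @ y)      # singular system: raises or returns garbage

lam = 1.0
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)  # solvable for any lam > 0
print(beta_ridge.shape)              # (200,)
```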

03

Are features correlated or is the true model sparse?

High multicollinearity inflates OLS variance (large condition number of X^T X). Ridge stabilizes correlated features by shrinking their coefficients toward each other. Lasso on correlated features is arbitrary — it tends to pick one and drop the rest. ElasticNet (Zou & Hastie 2005) is the compromise when you believe the truth is sparse but features are correlated.
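A hedged scikit-learn sketch of that behavior on two nearly duplicated features; the data is synthetic, so exact coefficients will vary run to run:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
# Two nearly identical (highly correlated) features plus a little noise.
X = np.column_stack([z + 0.01 * rng.standard_normal(n),
                     z + 0.01 * rng.standard_normal(n)])
y = 2.0 * z + 0.1 * rng.standard_normal(n)

print(Ridge(alpha=1.0).fit(X, y).coef_)                      # weight shared roughly equally
print(Lasso(alpha=0.1).fit(X, y).coef_)                      # typically one coefficient near zero
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # sparse-leaning but more grouped
```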

04

Do you need calibrated probabilities or just ranking?

Logistic regression trained on log-loss is naturally well calibrated — predicted probabilities track empirical frequencies without post-hoc Platt or isotonic scaling, provided the model is not badly misspecified. Critical for ad auctions (bid = probability × value), medical risk, and any downstream expected-value decision. Tree-based rankers (XGBoost, random forest) often need recalibration.
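One way to see this, assuming scikit-learn and a synthetic dataset, is to compare reliability curves; on data like this the logistic model's calibration gap is usually the smaller one, though results depend on the data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # calibration_curve bins predictions and returns observed vs predicted frequency.
    frac_pos, mean_pred = calibration_curve(y_te, p, n_bins=10)
    gap = np.abs(frac_pos - mean_pred).mean()
    print(type(model).__name__, "mean calibration gap:", round(gap, 4))
```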

05

What is the latency and update cadence?

Logistic regression inference is a single dot product and sigmoid — under 1 microsecond per sample on commodity CPU. It supports online updates via SGD/FTRL at millions of events per second. Trees are ~10-100× slower at inference and cannot be incrementally updated without retraining. For real-time ad CTR, linear models win on latency alone.
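The FTRL-Proximal update itself is short. Below is a minimal per-coordinate sketch for sparse inputs, following the notation of McMahan et al. 2013; numpy only, and a teaching sketch rather than production code:

```python
import numpy as np

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for online logistic regression (sketch)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)   # accumulated adjusted gradients
        self.n = np.zeros(dim)   # accumulated squared gradients

    def _weights(self, idx):
        z, n = self.z[idx], self.n[idx]
        w = np.zeros_like(z)
        active = np.abs(z) > self.l1                  # L1 threshold gives exact zeros (sparsity)
        step = (self.beta + np.sqrt(n[active])) / self.alpha + self.l2
        w[active] = -(z[active] - np.sign(z[active]) * self.l1) / step
        return w

    def predict(self, idx, x):
        """idx: indices of the non-zero features, x: their values (sparse input)."""
        return 1.0 / (1.0 + np.exp(-np.dot(self._weights(idx), x)))

    def update(self, idx, x, y):
        w = self._weights(idx)
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
        g = (p - y) * x                               # per-coordinate log-loss gradient
        sigma = (np.sqrt(self.n[idx] + g * g) - np.sqrt(self.n[idx])) / self.alpha
        self.z[idx] += g - sigma * w                  # FTRL-Proximal state update
        self.n[idx] += g * g
        return p
```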

06

Is the task regulated?

Credit scoring (FCRA, ECOA in the US), healthcare (FDA), and insurance require interpretable, auditable models. Logistic regression coefficients map directly to odds ratios that can be disclosed to regulators. A +0.7 coefficient on a standardized feature means a ~2× odds increase per standard deviation — that is the kind of statement regulators can sign off on.
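The arithmetic behind that statement; the fitted `model` in the comment is a placeholder name:

```python
import numpy as np

# Odds ratio for a +1 standard-deviation move in a standardized feature:
print(np.exp(0.7))   # ~2.01, i.e. the odds roughly double

# For a fitted scikit-learn LogisticRegression named `model` (placeholder):
# odds_ratios = np.exp(model.coef_[0])
```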
