
Machine Learning·Intermediate

Regression Statistics: Coefficients, Confidence Intervals, and Diagnostics

The complete regression statistics reference for data science interviews — OLS assumptions (LINE acronym), coefficient interpretation for continuous and categorical predictors, confidence vs prediction intervals, R² and its limitations, residual diagnostics (Q-Q plots, heteroscedasticity, Cook's distance), multicollinearity and VIF, and logistic regression odds ratios. Covers the most common interviewer traps around R² and model validation.

38 min read · 3 sections · 1 interview question
OLS Regression · Linear Regression · Logistic Regression · R-squared · Residual Diagnostics · Heteroscedasticity · Multicollinearity · VIF · Confidence Interval · Prediction Interval · Odds Ratio · Cook's Distance · Q-Q Plot · Regression Assumptions

OLS Assumptions — The LINE Acronym

Ordinary Least Squares regression gives the Best Linear Unbiased Estimator (BLUE) when the Gauss–Markov conditions hold; normality of the errors is not needed for BLUE, but it makes t- and F-based inference exact in small samples. The standard four-assumption checklist is the acronym LINE:

L — Linearity: The relationship between X and E[Y|X] is linear. The model is misspecified if a curved relationship is forced into a line. Diagnosed with: residual vs fitted plot (should show random scatter, not a curve).

I — Independence: Observations are independent of each other. Violated by: time series data (autocorrelation), clustered data (students within schools), repeated measures (same user measured multiple times). Consequence: standard errors are underestimated, p-values are too small. Fix: clustered standard errors, mixed-effects models, or GEE.

N — Normality of residuals: The residuals (Y - Ŷ) are normally distributed. NOT a requirement on X or Y themselves — only on the errors. With large n (typically n > 100), the central limit theorem makes this assumption benign. Violated by: heavy-tailed distributions, outliers. Diagnosed with: Q-Q plot (quantile-quantile plot of residuals vs normal quantiles).

E — Equal variance (Homoscedasticity): The variance of residuals is constant across all levels of X (σ² does not depend on X). Violated by: heteroscedasticity — larger residuals at higher fitted values (common in financial data, count data). Consequence: standard errors are wrong, confidence intervals and p-values are invalid. Diagnosed with: residual vs fitted plot (fan shape = heteroscedasticity). Fix: robust standard errors (HC3), WLS (Weighted Least Squares), or log-transforming the outcome.

Critical point: Violations of L and E are serious (biased estimates or invalid inference). Violations of N are often minor at large n (CLT saves you). Violations of I are serious and often ignored by practitioners to their detriment.
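The E check above can be made concrete without plotting. A minimal numpy sketch (synthetic data; the 0.3·x noise scale is an assumed heteroscedastic pattern, not from any real dataset): fit OLS, then measure the correlation between fitted values and absolute residuals — near zero under homoscedasticity, clearly positive when the residuals fan out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)
# Deliberately heteroscedastic: noise scale grows with x (violates E)
y = 3 + 0.5 * x + rng.normal(0, 0.3 * x, n)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# Crude fan-shape check: under homoscedasticity, |residuals| should be
# roughly uncorrelated with fitted values
fan = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(f"corr(fitted, |resid|) = {fan:.2f}")
```

In practice you would follow a positive check like this with a formal test (Breusch–Pagan) or go straight to robust standard errors.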

IMPORTANT

The #1 Regression Interview Trap: R² Does Not Validate a Model

R² measures the proportion of variance in Y explained by the model. A model with R² = 0.95 can still be completely wrong.

What R² does NOT tell you:

  • ❌ Whether the relationship is causal
  • ❌ Whether predictions are accurate (a model can overfit perfectly: R²=1.0 but terrible out-of-sample MSE)
  • ❌ Whether the LINE assumptions are satisfied
  • ❌ Whether you have the right variables
  • ❌ Whether predictions are unbiased

Anscombe's Quartet (1973): Four datasets with identical R² (0.67), identical regression coefficients (ŷ = 3 + 0.5x), and identical standard errors — but radically different scatter plots. One is linear, one is curved, one has a single outlier driving the fit, one has all X values identical except one. R² cannot distinguish these.
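The first two quartet members are enough to demonstrate the point. Using the published Anscombe values for sets I and II (shared x), both fits land on ŷ ≈ 3 + 0.5x with R² ≈ 0.67, even though set II is a smooth curve:

```python
import numpy as np

# Anscombe's quartet, sets I (linear) and II (curved); x is shared
x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def fit_stats(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2

s1, b1, r2_1 = fit_stats(x, y1)
s2, b2, r2_2 = fit_stats(x, y2)
print(f"set I : yhat = {b1:.2f} + {s1:.2f}x, R^2 = {r2_1:.3f}")
print(f"set II: yhat = {b2:.2f} + {s2:.2f}x, R^2 = {r2_2:.3f}")
```

Only a residual plot (or the raw scatter) reveals that set II's relationship is quadratic, not linear.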

Adjusted R²: R²_adj = 1 - (1 - R²) × (n-1)/(n-k-1). Penalizes for the number of predictors k. Unlike R², adjusted R² typically decreases when you add a useless predictor (it rises only if the new predictor's t-statistic exceeds 1 in absolute value). Always prefer adjusted R² when comparing models with different numbers of predictors.
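The penalty is easy to see numerically. A one-line implementation of the formula above, with illustrative (made-up) values showing a near-useless sixth predictor lowering adjusted R² even though raw R² ticked up:

```python
import numpy as np

def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (excluding intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: a 6th predictor nudges R^2 from 0.600 to 0.601,
# but the degrees-of-freedom penalty outweighs the tiny gain
before = adjusted_r2(0.600, n=100, k=5)
after = adjusted_r2(0.601, n=100, k=6)
print(f"k=5: {before:.4f}   k=6: {after:.4f}")
```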

The correct approach: Use R² as one signal alongside residual diagnostics, out-of-sample validation (MAE, RMSE on test set), and assumption checks. A model with R² = 0.60 but clean diagnostics is more trustworthy than a model with R² = 0.92 and severe heteroscedasticity.
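The overfitting failure mode from the list above ("R² = 1.0 but terrible out-of-sample MSE") can be reproduced in a few lines. A sketch on synthetic data (assumed setup: 15 training points from a truly linear process, degree-12 vs degree-1 polynomial fits): the high-degree fit wins on in-sample R² and loses badly on test RMSE.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = 3 + 0.5 * x_train + rng.normal(0, 0.2, 15)   # true model is linear
x_test = rng.uniform(0, 1, 200)
y_test = 3 + 0.5 * x_test + rng.normal(0, 0.2, 200)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

results = {}
for degree in (1, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_r2 = r2(y_train, np.polyval(coefs, x_train))
    test_rmse = np.sqrt(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))
    results[degree] = (train_r2, test_rmse)
    print(f"degree {degree:2d}: train R^2 = {train_r2:.3f}, test RMSE = {test_rmse:.3f}")
```

The degree-12 model memorizes the training noise; only the held-out RMSE exposes it, which is exactly why out-of-sample validation belongs alongside R².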
