Time Series Forecasting: ARIMA, Prophet, LightGBM, and Deep Learning
Master time series forecasting for ML interviews — stationarity, ARIMA/ETS, Prophet, gradient-boosted lag features (the M5 winner), DeepAR/N-BEATS/TFT/PatchTST, MASE evaluation, hierarchical reconciliation, and why classical methods still beat neural nets on small data.
Why Time Series Forecasting Breaks Standard ML Intuition
Time series forecasting is the one ML problem where your instinct to "just throw a neural net at it" is almost always wrong. The M4 competition (Makridakis 2018, 100K series) and M5 competition (2020, Walmart sales) both demonstrated that the winning approaches were either statistical models (ETS, ARIMA) ensembled with ML, or gradient-boosted trees on carefully engineered lag features — not deep learning. The 2020 M5 winner used LightGBM with lag, rolling, and calendar features, beating every neural architecture submitted.
Three properties of time series break the IID assumption that powers most ML:
1. Temporal dependence. Observations are correlated with their own past. You cannot shuffle-and-split a time series. A random 80/20 split leaks future information into training — the model sees tomorrow's seasonality pattern while predicting yesterday. The only valid evaluation is expanding or rolling-window backtest with a strict temporal cutoff.
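An expanding-window backtest can be sketched in a few lines of plain Python. The series, fold sizes, and the `expanding_window_splits` helper below are all illustrative, not a library API:

```python
# Minimal sketch of an expanding-window backtest with a strict temporal cutoff.
# Hypothetical series and fold sizes, chosen only for illustration.
series = list(range(100))  # stand-in for an ordered time series

def expanding_window_splits(n, initial_train, horizon, step):
    """Yield (train_idx, test_idx) pairs; the train window grows, the
    test window is always the `horizon` points immediately after it."""
    cutoff = initial_train
    while cutoff + horizon <= n:
        train_idx = list(range(0, cutoff))                 # everything before cutoff
        test_idx = list(range(cutoff, cutoff + horizon))   # the next `horizon` points
        yield train_idx, test_idx
        cutoff += step

folds = list(expanding_window_splits(len(series), initial_train=60, horizon=10, step=10))
# Every test window starts strictly after its training window ends:
assert all(max(tr) < min(te) for tr, te in folds)
```

sklearn's `TimeSeriesSplit` implements the same idea; the point is that no shuffle ever crosses the cutoff.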
2. Non-stationarity. The statistical properties (mean, variance, autocovariance) drift over time. Classical models like ARIMA require stationarity and force you to explicitly difference the series. Modern ML models tolerate non-stationarity but still degrade silently when the generating process changes — COVID-19 broke every pre-2020 retail forecast.
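Differencing, the "d" in ARIMA(p,d,q), can be shown on a toy trending series (the numbers are illustrative):

```python
# First differencing: y'_t = y_t - y_{t-1}. A deterministic linear trend
# becomes a constant series, i.e. the mean no longer drifts.
y = [2 * t + 5 for t in range(10)]  # toy series with slope 2 -> non-stationary mean

def difference(series, lag=1):
    """Return the lag-differenced series (length shrinks by `lag`)."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

dy = difference(y)
# dy == [2, 2, 2, 2, 2, 2, 2, 2, 2] -- constant, so the trend is gone
```

In practice you would confirm stationarity with a unit-root test such as `statsmodels`' `adfuller` rather than by eye.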
3. Autocorrelation in residuals. OLS assumes independent errors. Time series residuals are autocorrelated, which invalidates standard confidence intervals and p-values. This is why a naive sklearn regression of y on t reports overconfident intervals and misleadingly optimistic error estimates.
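A quick hand-rolled diagnostic for this (in practice you would use a Ljung-Box test, e.g. `statsmodels`' `acorr_ljungbox`) is the lag-1 autocorrelation of the residuals. The residual series below is synthetic, standing in for a model that missed a cycle:

```python
import math

# Lag-1 autocorrelation of residuals. Values far from 0 mean the
# "independent errors" assumption behind OLS intervals is violated.
def lag1_autocorr(resid):
    n = len(resid)
    mean = sum(resid) / n
    num = sum((resid[t] - mean) * (resid[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in resid)
    return num / den

# Synthetic residuals from a model that missed a slow cycle: strongly
# autocorrelated, so standard OLS inference would be invalid here.
resid = [math.sin(t / 3) for t in range(200)]
rho = lag1_autocorr(resid)  # close to 1 for this smooth cyclic residual
```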
The interview trap: a junior candidate says "I'd use an LSTM." A senior says "I'd start with ETS or Prophet as a baseline, engineer lag and calendar features, try LightGBM, and only reach for DeepAR or N-BEATS if I have many series and exogenous regressors."
What Interviewers Actually Evaluate
Mid-level: Knows ARIMA exists. Can name Prophet. Uses train_test_split with shuffle=False. Reports RMSE.
Senior: Explains stationarity and why ARIMA requires differencing. Reads ACF/PACF plots to choose (p,q). Uses expanding-window backtest. Knows MAPE breaks on zero-demand series and defaults to MASE or WAPE. Engineers lag features for LightGBM. Acknowledges that Prophet and ETS often beat neural nets on clean seasonal data.
Staff: Cites M4/M5 competition results — hybrid ensembles of statistical + ML models win. Uses global models (one LightGBM across 100K SKUs) instead of per-series ARIMA at scale. Designs hierarchical reconciliation (MinT, Wickramasuriya 2019) to make SKU/store/region forecasts add up consistently. Knows Croston's method for intermittent demand. Discusses quantile pinball loss for probabilistic forecasts and mentions Chronos / TimesFM (2024) foundation models.
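MASE, which appears at both the senior and staff bar, is simple to compute by hand: scale the forecast's MAE by the in-sample MAE of a (seasonal) naive baseline. A sketch with illustrative numbers (the `mase` helper is hypothetical, not a library call):

```python
# MASE = MAE(forecast) / MAE(in-sample naive forecast).
# MASE < 1 means you beat the naive baseline, and it stays defined
# when actuals contain zeros, unlike MAPE.
def mase(y_train, y_test, y_pred, m=1):
    """m = seasonal period (m=1 -> plain one-step naive baseline)."""
    naive_errors = [abs(y_train[t] - y_train[t - m]) for t in range(m, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(y_test, y_pred)) / len(y_test)
    return mae / scale

# Toy history, actuals, and forecast (illustrative numbers only):
y_train = [10, 12, 11, 13, 12, 14]
y_test = [13, 15]
y_pred = [13.5, 14.0]
print(round(mase(y_train, y_test, y_pred), 3))  # → 0.469, i.e. beats naive
```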
Clarifying Questions — Ask These Before Modeling
How many series and what cardinality?
One series (a single country's GDP) → classical ARIMA/ETS. Hundreds (store-level sales) → mixed local + global approach. 100K+ series (SKU × store) → global LightGBM or DeepAR. The cardinality determines whether you train one model per series (local) or one shared model (global).
What is the forecast horizon and cadence?
One-step-ahead (h=1) is easy; most models look good. Multi-step (h=24 hours, h=28 days for M5) exposes error compounding. Ask: is this intraday (5-min bars), daily, weekly? Horizon drives model choice — direct multi-output beats recursive for long horizons.
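The direct-vs-recursive distinction comes down to how training targets are built: recursive fits one one-step model and feeds its own predictions back in, compounding error; direct fits a separate model per horizon step. A sketch of the direct setup, with a toy series and a hypothetical helper:

```python
# Direct multi-step setup: for each horizon step h, build (lags -> y at t+h-1)
# pairs and fit one model per h. Recursive would fit only h=1 and chain
# predictions, which compounds error over long horizons.
def make_direct_dataset(series, n_lags, horizon):
    """Return {h: (X, y)}: X rows are the n_lags values before time t,
    y is the value h steps ahead."""
    datasets = {}
    for h in range(1, horizon + 1):
        X, y = [], []
        for t in range(n_lags, len(series) - h + 1):
            X.append(series[t - n_lags:t])  # lag window ending just before t
            y.append(series[t + h - 1])     # target h steps ahead
        datasets[h] = (X, y)
    return datasets

series = list(range(20))  # toy series for illustration
ds = make_direct_dataset(series, n_lags=3, horizon=2)
# h=1: lags [0, 1, 2] predict 3; h=2: the same lags predict 4.
```

Any regressor (LightGBM included) can then be fit once per `h` on `ds[h]`.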
Are exogenous regressors available at prediction time?
Holidays, promotions, weather forecasts, upstream demand — these can move MAE 20-40% if available. But they must be known at forecast time (future holidays are, future weather is only probabilistic). ARIMAX, Prophet (regressors), and LightGBM all support exogenous features.
What is the sparsity / intermittency?
Dense daily sales → ARIMA/ETS/LightGBM all work. Sparse (many zeros — slow-moving SKUs) → switch to Croston's, SBA, or TSB. MAPE is undefined when actuals are zero; use MASE or WAPE.
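Croston's method itself fits in a dozen lines: smooth the non-zero demand sizes and the inter-demand intervals separately with simple exponential smoothing, then divide. A sketch with toy data (the series and alpha are illustrative):

```python
# Croston's method sketch: the flat per-period forecast is
# (smoothed demand size) / (smoothed inter-demand interval).
def croston(demand, alpha=0.1):
    first = next(i for i, d in enumerate(demand) if d > 0)
    z = demand[first]   # smoothed non-zero demand size
    p = first + 1       # smoothed inter-demand interval
    q = 1               # periods since last non-zero demand
    for d in demand[first + 1:]:
        if d > 0:
            z = alpha * d + (1 - alpha) * z
            p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return z / p        # flat forecast of mean demand per period

sparse = [0, 0, 3, 0, 0, 0, 2, 0, 4, 0, 0, 1]  # toy slow-moving SKU
rate = croston(sparse)  # just under 1 unit per period for this toy series
```

SBA and TSB are bias-corrected variants of the same decomposition.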
What is the business cost of over- vs under-forecasting?
Inventory understock costs $X per stockout; overstock costs $Y holding cost. Asymmetric → use quantile (pinball) loss at the relevant quantile, not MAE. Staffing decisions often care about P90, not P50.
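Pinball loss encodes exactly that asymmetry: it is minimized by the q-th quantile of the outcome distribution, so q=0.9 penalizes under-forecasting nine times harder than over-forecasting. A sketch with illustrative numbers:

```python
# Pinball (quantile) loss: q * (y - f) when under-forecasting (y >= f),
# (q - 1) * (y - f) when over-forecasting. Minimized by the q-th quantile.
def pinball_loss(y_true, y_pred, q):
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

y_true = [100, 120, 90]   # toy demand
low = [80, 80, 80]        # chronically under-forecasts
high = [130, 130, 130]    # chronically over-forecasts
# At q=0.9 the under-forecaster is punished far more than the over-forecaster:
assert pinball_loss(y_true, low, 0.9) > pinball_loss(y_true, high, 0.9)
```

LightGBM's quantile objective and DeepAR's sampled quantiles both optimize this family of losses.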
Is the hierarchy important?
Retail forecasts must roll up: SKU → store → region → total. Independent forecasts at each level don't add up. If yes → hierarchical reconciliation (MinT) is required, not optional.
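The coherence requirement is easiest to see with bottom-up aggregation, the simplest (if statistically suboptimal) reconciliation: forecast only the leaves and sum upward. MinT instead combines forecasts from all levels using the error covariance; the toy sketch below shows only the coherence idea, not MinT itself, and the hierarchy is invented for illustration:

```python
# Bottom-up reconciliation: forecast SKUs only, then sum to store and total,
# so every level is consistent by construction.
leaf_forecasts = {
    ("store_a", "sku_1"): 10.0,
    ("store_a", "sku_2"): 5.0,
    ("store_b", "sku_1"): 7.0,
}

store_forecasts = {}
for (store, _sku), f in leaf_forecasts.items():
    store_forecasts[store] = store_forecasts.get(store, 0.0) + f
total_forecast = sum(store_forecasts.values())

# Coherence holds by construction: stores sum to the total, SKUs to stores.
assert total_forecast == sum(leaf_forecasts.values())
```

Bottom-up throws away information in the upper-level forecasts, which is the gap MinT closes.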