ML System Design: Demand Forecasting System
Design the demand forecasting system used by Uber, Lyft, DoorDash, and Amazon — end-to-end. Covers the three hard problems that make this uniquely challenging: spatial ML (demand is correlated across geography, not independent per zone), online learning (marketplace conditions change faster than batch retraining), and the feedback loop where demand forecasts drive pricing which affects actual demand. Includes H3 spatial graphs, temporal GNNs, online adaptation with drift detection, and why WMAPE beats MAPE for imbalanced demand.
Why Demand Forecasting Isn't Just Time Series Regression
Demand forecasting appears on the surface to be a classical time series problem: given historical demand at location X, predict future demand at location X. ARIMA, Prophet, LSTM — the standard toolkit. But at production scale in a marketplace, three properties of the problem break this framing entirely.
Demand is spatially correlated, not independent per zone. Rain hitting downtown Seattle affects demand across 50 adjacent H3 zones simultaneously. A Taylor Swift concert in zone A creates surge demand in zones A, B, and C (pre-concert) and then in zones D and E (post-concert exits). A road closure in one zone spills demand into neighboring zones. Models that treat each zone independently miss this spatial structure entirely — and the spatial correlations are the most valuable predictive signal at short time horizons.
The marketplace changes faster than batch retraining. A weather event, a sports game, a power outage — any of these can shift demand dramatically within minutes. A model retrained yesterday is already wrong. Production demand forecasting requires online adaptation: continuous or near-continuous model updates as real-world conditions evolve. The Lyft spatiotemporal forecasting system (published 2025) explicitly addresses this: "a model trained on data from a month ago performs significantly worse than one trained on the most recent data."
Demand forecasts drive actions that change demand. This is the feedback loop that makes demand forecasting unique. A high forecast activates surge pricing → drivers are repositioned → riders respond to the higher price → actual demand diverges from the counterfactual demand without intervention. If the model learns from data generated by its own pricing decisions, it learns correlations that are partially self-caused. The result is a confounded training set that produces subtly wrong future forecasts.
This problem is asked at Uber, Lyft, DoorDash, Instacart, Amazon Logistics, Walmart, Target, Starbucks, airline companies, power grid operators, and any company that needs to allocate resources in advance across geographies.
What Interviewers Are Evaluating
Mid-level: Knows demand forecasting is a time series problem. Lists relevant features (time of day, day of week, weather, events). Can describe a Prophet or LSTM baseline. Knows train/test split must be temporal.
Senior-level: Designs spatial features using H3 geohashing and neighbor spillover. Explains why per-zone models fail (misses spatial correlation). Describes online learning with drift detection. Identifies the feedback loop between forecasts and pricing. Names production tools: H3, Apache Flink, Feast, ClickHouse.
Staff-level: Proposes Temporal GNN architecture for joint spatial-temporal learning. Designs the online adaptation strategy with drift detection triggers. Addresses the causal confounding problem (forecast → pricing → demand → training data bias). Proposes the counterfactual evaluation framework. Reasons about ensemble architecture (classical models + GNN + ensemble meta-learner).
Clarifying Questions — Ask These First
What type of demand are we forecasting?
Rideshare pickup demand (number of ride requests)? Food delivery order volume? Inventory demand (units sold per SKU per store)? Each has different features, spatial structures, and feedback mechanisms. For this design: rideshare demand per geographic zone per 5-minute interval.
What is the forecast horizon and resolution?
1-hour ahead at 5-minute intervals for driver repositioning? 24-hour ahead for staffing? 7-day ahead for inventory planning? Horizon determines model architecture and acceptable latency. Short-horizon (1h) requires real-time features and online learning. Long-horizon (7d) allows batch models.
What is the spatial granularity?
City level (1 number per city)? Neighborhood level (~50 zones per city)? Block level (~5,000 zones per city)? Finer granularity requires spatial models to avoid sparse training data per zone. Establish this upfront.
What actions will be taken on the forecast?
Driver repositioning (suggests routes to drivers)? Surge pricing (shown to riders)? Staffing (schedule drivers for peak hours)? The action determines the accuracy requirement and latency budget. Driver repositioning needs forecasts every 5 minutes. Staffing can use 24-hour-ahead forecasts.
What is the cost of overestimation vs underestimation?
Overestimation (too many drivers → driver earnings decrease, platform costs increase). Underestimation (too few drivers → high wait times, poor rider experience). These asymmetric costs determine the loss function design and acceptable bias direction.
ML Objective Options — Why WMAPE Beats MAPE
| Metric | Definition | Problem | Use When |
|---|---|---|---|
| RMSE | √(mean((y - ŷ)²)) | Heavily penalizes outliers; sensitive to scale; dominated by high-demand events | Features have consistent scale; outliers matter equally |
| MAPE | mean(|y - ŷ| / y) | Undefined when y=0; penalizes overforecasting more heavily than underforecasting (per-point errors are capped at 100% on the low side but unbounded on the high side), biasing models toward underprediction; fails for sparse demand zones | All zones have positive demand of comparable magnitude |
| WMAPE (Weighted MAPE) | Σ|y - ŷ| / Σy (weighted by actual demand) | None for demand forecasting | Gold standard for demand forecasting; naturally handles near-zero demand zones; focuses on high-demand periods |
| Quantile loss (pinball) | q×max(y-ŷ,0) + (1-q)×max(ŷ-y,0) | Requires choosing q; each quantile is still a single point estimate | When under/overforecasting costs are asymmetric; uncertainty quantification needed |
| Business metric: driver utilization | % of drivers with rides / total on-platform | Delayed; requires marketplace simulation to optimize | Staff-level; when optimizing the full system rather than the forecast alone |
Why MAPE Fails for Demand Forecasting
MAPE = mean(|y - ŷ| / y) fails when actual demand y is 0 (division by zero) or near-zero (inflated percentage errors for tiny absolute errors). In a rideshare network, a zone at 3 AM has near-zero demand. A forecast of 0.1 vs actual 0.2 rides creates a 50% error — indistinguishable from a forecast of 500 vs actual 1000 in a peak zone. MAPE treats these equally. WMAPE = Σ|y-ŷ| / Σy normalizes by total demand volume, not per-zone demand. Low-demand zones contribute little to the metric because their demand is small. High-demand zones (airports, downtown areas at rush hour) dominate. This mirrors business impact: getting rush hour right matters far more than getting 3 AM right.
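A two-zone example makes the distortion concrete (numbers are illustrative):

```python
import numpy as np

actual   = np.array([0.2, 1000.0])  # 3 AM zone vs rush-hour zone
forecast = np.array([0.4,  950.0])  # 100% relative error vs 5% relative error

mape = np.mean(np.abs(actual - forecast) / actual)
# 0.525 — the near-empty zone contributes half the metric

wmape = np.abs(actual - forecast).sum() / actual.sum()
# ≈ 0.050 — dominated by the zone that actually carries demand
```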
Demand Forecasting Production Architecture
Spatial ML — Why Independent Zone Models Fail
The naive approach to demand forecasting is: train one model per zone. Zone A has its own time series; predict it independently. This is wrong for three reasons:
Spatial autocorrelation: demand in adjacent zones is strongly correlated. If zone A is experiencing a surge, zones B and C (adjacent) are likely also experiencing elevated demand. A flood of workers leaving office buildings at 5 PM affects a connected cluster of zones, not a single zone in isolation. Independent models miss this — they can't use "zone B is surging" as a feature for predicting zone A.
Spillover effects: demand spills between zones. If zone A becomes fully served (drivers meet demand), excess demand doesn't disappear — it flows to adjacent zones. A model that doesn't capture spillover will systematically underestimate demand in zones adjacent to well-served areas.
New zone generalization: independent models have no data for new zones (new city launch, new neighborhood expansion). A spatial model trained on other zones can transfer learned patterns to the new zone via its graph connections to neighboring zones.
The solution: model all zones jointly using a graph where nodes are H3 cells and edges connect adjacent cells. A Temporal GNN (Graph Neural Network with temporal dynamics) learns to predict demand at each node by aggregating information from neighboring nodes at each timestep.
Spatial Graph Construction — H3 Zones as Graph Nodes
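A minimal sketch of the graph construction, assuming the h3-py v4 API (latlng_to_cell, grid_disk): nodes are H3 cells (resolution 8, roughly 0.7 km² hexagons) and edges connect ring-1 neighbors. Correlation-based edge weights can be computed alongside, though the SAGEConv layers below would not consume them.

```python
import h3
import torch

def build_h3_graph(cells: list[str]) -> torch.Tensor:
    """Bidirectional edge_index over H3 cells from ring-1 adjacency."""
    index = {cell: i for i, cell in enumerate(cells)}
    edges = []
    for cell, i in index.items():
        # grid_disk(cell, 1) returns the cell itself plus its ~6 neighbors
        for neighbor in h3.grid_disk(cell, 1):
            j = index.get(neighbor)
            if j is not None and j != i:
                edges.append((i, j))  # both directions emitted as every cell is visited
    return torch.tensor(edges, dtype=torch.long).t().contiguous()  # (2, num_edges)

# Example: ~331 resolution-8 zones covering a 10-ring disk around downtown Seattle
center = h3.latlng_to_cell(47.6062, -122.3321, 8)
cells = sorted(h3.grid_disk(center, 10))
edge_index = build_h3_graph(cells)
```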
Temporal GNN Architecture — SAGEConv + GRU
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv
class DemandForecastGNN(nn.Module):
    """
    Temporal GNN for spatial demand forecasting.

    SAGEConv aggregates spatial (neighbor) information.
    GRU captures temporal dynamics (sequential patterns).
    Outputs: a forecast for each node (H3 zone) for the next T timesteps.
    """

    def __init__(
        self,
        node_features: int,          # features per zone per timestep
        hidden_dim: int = 128,
        n_spatial_layers: int = 2,
        forecast_horizon: int = 12,  # 12 × 5 min = 60 min ahead
        seq_len: int = 24,           # 24 × 5 min = 2 h of history
    ):
        super().__init__()
        self.seq_len = seq_len
        self.forecast_horizon = forecast_horizon

        # Node feature projection
        self.node_embed = nn.Linear(node_features, hidden_dim)

        # Spatial aggregation: SAGEConv gathers neighbor signals,
        # applied per-timestep to capture spatial correlations
        self.spatial_layers = nn.ModuleList([
            SAGEConv(hidden_dim, hidden_dim)
            for _ in range(n_spatial_layers)
        ])

        # Temporal dynamics: GRU over the spatially-enriched sequence.
        # After spatial aggregation, each node carries a time series of
        # spatially-aware features.
        self.temporal_gru = nn.GRU(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.1,
        )

        # Output heads: point estimate + quantile estimates
        self.head_p50 = nn.Linear(hidden_dim, forecast_horizon)  # median
        self.head_p10 = nn.Linear(hidden_dim, forecast_horizon)  # optimistic
        self.head_p90 = nn.Linear(hidden_dim, forecast_horizon)  # conservative

    def forward(
        self,
        x: torch.Tensor,            # (num_nodes, seq_len, node_features)
        edge_index: torch.Tensor,   # (2, num_edges) — graph adjacency
        edge_weight: torch.Tensor | None = None,  # (num_edges,) spatial weights
    ):
        # NOTE: SAGEConv does not consume edge weights; edge_weight is kept in
        # the signature for graph builders that produce them, but a weighted
        # operator (e.g. GraphConv or GCNConv) would be needed to use them.
        num_nodes, seq_len, _ = x.shape

        # Step 1: Spatial aggregation at each timestep.
        # For each timestep, aggregate neighbor information via SAGEConv.
        spatial_out = []
        for t in range(seq_len):
            xt = self.node_embed(x[:, t, :])  # (num_nodes, hidden_dim)
            for layer in self.spatial_layers:
                xt = torch.relu(layer(xt, edge_index))
            spatial_out.append(xt)

        # Stack the per-timestep outputs into (num_nodes, seq_len, hidden_dim)
        spatial_sequence = torch.stack(spatial_out, dim=1)

        # Step 2: Temporal modeling — GRU over the spatially-enriched sequence
        gru_out, _ = self.temporal_gru(spatial_sequence)

        # Use the final hidden state as the "current state" for forecasting
        final_state = gru_out[:, -1, :]  # (num_nodes, hidden_dim)

        # Step 3: Multi-horizon, multi-quantile forecast
        return {
            "p50": self.head_p50(final_state),  # (num_nodes, horizon)
            "p10": self.head_p10(final_state),
            "p90": self.head_p90(final_state),
        }


def quantile_loss(pred: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    """Pinball loss for quantile regression."""
    error = target - pred
    return torch.mean(torch.max(q * error, (q - 1) * error))
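A smoke test of the model and loss above, with every shape and number invented for illustration (random features, a random graph in place of the H3 adjacency):

```python
num_nodes, num_features = 500, 32
model = DemandForecastGNN(node_features=num_features)

x = torch.randn(num_nodes, 24, num_features)         # 2 h of history per zone
edge_index = torch.randint(0, num_nodes, (2, 3000))  # stand-in for the H3 graph
target = torch.rand(num_nodes, 12) * 40              # demand over the next hour

out = model(x, edge_index)
loss = (
    quantile_loss(out["p50"], target, 0.5)
    + quantile_loss(out["p10"], target, 0.1)
    + quantile_loss(out["p90"], target, 0.9)
)
loss.backward()
```

Training all three heads against the same target yields P10/P50/P90 bands, which the evaluation section below checks via interval coverage.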
Feature Engineering — Node Features for Each H3 Zone
| Feature | Type | Update Frequency | What It Captures |
|---|---|---|---|
| Recent demand (last 5/15/30 min) | Online, sliding window | Real-time via Flink | Immediate demand trend; leading indicator of short-horizon future demand |
| Historical demand (same hour, same day of week) | Offline, from ClickHouse | Hourly | Seasonal baseline: Tuesday 8 AM always looks like last Tuesday 8 AM |
| Driver supply (available drivers in zone) | Online | Real-time (GPS pings) | Supply-demand imbalance; surplus supply predicts demand will be served; deficit predicts frustrated riders |
| Weather: precipitation, visibility, temperature | Offline (API) | 15-minute forecast updates | Rain increases demand 15–30%; heavy snow decreases it; extreme heat or cold increases it |
| Active events nearby | Offline (event calendar + geocoding) | Daily pre-computation + real-time detection | Concert, game, marathon end → demand spike in surrounding zones in 15–30 min window |
| Time features (cyclical) | Online (from request) | Per-request | Hour of day (sin/cos), day of week, is_holiday, week_of_year — all cyclical encoded to avoid discontinuity |
| Zone type (static) | Offline | Daily | Airport, transit hub, residential, commercial, university — learned zone archetype from historical behavior |
| Neighboring zone demand (graph aggregation) | Computed by GNN | Per-forward-pass | Core spatial feature: SAGEConv aggregates neighbor demand signals |
| Surge pricing state | Online | Real-time | Current surge multiplier affects demand — must be included so model doesn't confuse price effect with demand change |
Cyclical Encoding — Why sin/cos Beats Integer Encoding for Time
Integer encoding for hour of day: 23 and 0 (midnight and 12:01 AM) have integer distance 23. But they're adjacent in time. A linear model or distance-based algorithm treats them as far apart. Cyclical encoding uses sin(2π×hour/24) and cos(2π×hour/24). Hour 23 becomes (sin(2π×23/24), cos(2π×23/24)) ≈ (−0.26, 0.97). Hour 0 becomes (0, 1). These are geometrically close in 2D — correctly representing their temporal proximity. Apply the same transformation to day of week, month, and week of year. This single feature engineering choice improves model performance measurably on time-periodic patterns.
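A minimal sketch of the encoding:

```python
import numpy as np

def encode_cyclical(value: np.ndarray, period: int) -> np.ndarray:
    """Map a periodic integer feature onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * value / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

print(encode_cyclical(np.array([23, 0, 12]), 24).round(3))
# [[-0.259  0.966]   hour 23 ...
#  [ 0.     1.   ]   ... sits next to hour 0,
#  [ 0.    -1.   ]]  while hour 12 is diametrically opposite
```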
Online Learning — Adapting to a Non-Stationary World
The core problem: marketplace conditions change faster than batch retraining cycles. A model trained on historical data from last week doesn't know about:
- A new housing development that opened last month, adding 5,000 residents to zone A
- A subway line closure that started yesterday, routing commuters through surface streets
- A new corporate campus that began operations this week, creating a new daily demand pattern
Lyft's production finding (2025): "a model trained on data from a month ago performs significantly worse than one trained on the most recent data — often by 30–40% in WMAPE terms during rapid marketplace changes."
Online learning strategies, ordered by complexity:
1. Incremental batch retraining (simplest): retrain the model every hour using the latest N hours of data (sliding window). For classical models (ARIMA, Prophet), this is computationally cheap. For GNNs, hourly full retraining is feasible with small models but becomes expensive at scale.
2. Warm-start fine-tuning: keep the model weights from the previous training run and initialize the next run from them. Only a few gradient steps are needed to adapt to new data — 10–100× faster than training from scratch. A sketch follows this list.
3. Triggered drift-based retraining: only retrain when drift is detected. Reduces unnecessary computation while ensuring freshness during rapid changes.
4. Online gradient descent: update model weights continuously on each incoming data point or mini-batch. Effective for linear models and shallow networks. Requires careful learning rate management to avoid catastrophic forgetting.
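A minimal sketch of warm-start fine-tuning (strategy 2), reusing the DemandForecastGNN and quantile_loss defined earlier; recent_loader is a hypothetical iterable yielding the freshest (x, edge_index, target) batches:

```python
def warm_start_update(model, optimizer, recent_loader, n_steps: int = 50):
    """A handful of gradient steps on the newest data window only.

    `model` already holds the previous run's weights — that is the warm start.
    """
    model.train()
    for step, (x, edge_index, target) in enumerate(recent_loader):
        if step >= n_steps:
            break
        out = model(x, edge_index)
        loss = quantile_loss(out["p50"], target, 0.5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```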
Online Adaptation — Drift Detection and Retraining
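A sketch of the drift-triggered variant (strategy 3) using the river library's ADWIN detector, assuming river's current drift API (detector.update, detector.drift_detected); the retrain hook is a hypothetical placeholder for the training service:

```python
from river import drift

detector = drift.ADWIN(delta=0.002)  # smaller delta → fewer false alarms

def on_forecast_scored(forecast: float, actual: float) -> None:
    """Feed the live absolute-error stream to ADWIN; retrain on a detected shift."""
    detector.update(abs(actual - forecast))
    if detector.drift_detected:
        trigger_warm_start_retrain()  # hypothetical hook into the training service
```

Page-Hinkley (drift.PageHinkley) is a cheaper alternative when only shifts in the error mean matter; ADWIN also adapts its window size, which suits bursty marketplace data.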
The Feedback Loop — When Your Forecast Changes What It's Predicting
Demand forecasting in a marketplace is unique: the forecast is not just an observation — it drives actions that change actual demand. This creates a causal loop that standard ML training assumes doesn't exist.
The loop:
- Forecast predicts high demand in zone A at 5 PM Friday
- Surge pricing activated in zone A: 1.8× price multiplier
- Some riders choose not to book (price-sensitive demand elasticity)
- Actual demand is 20% lower than forecast
- Training data: "forecast 1,000 rides at 5 PM, observed 800 rides"
- Model learns: "I was wrong, reduce Friday 5 PM forecast by 20%"
- New forecast: 800 rides at 5 PM Friday next week
- No surge activated (forecast doesn't cross threshold)
- Actual demand: 1,000 rides (without price dampening)
- Model: "I was wrong again, increase forecast by 25%"
- Oscillation continues indefinitely
Why this is a confounding problem: the model is learning the correlation between "forecasted demand" and "observed demand under that forecast's pricing." This is not the same as "underlying demand under zero-price intervention." The model cannot recover the counterfactual (what demand would have been without surge pricing) from observational data alone.
The Demand-Pricing Feedback Loop
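The oscillation above can be reproduced with a toy simulation (all parameters invented for illustration):

```python
# Toy model: latent demand is constant; surge suppresses a fraction of requests;
# the naive forecaster copies last week's observation.
base_demand = 1000       # true latent demand, Friday 5 PM
surge_threshold = 900    # surge activates when the forecast exceeds this
elasticity = 0.2         # fraction of requests suppressed under surge

forecast = 1000.0
for week in range(6):
    surged = forecast > surge_threshold
    observed = base_demand * (1 - elasticity) if surged else base_demand
    print(f"week {week}: forecast={forecast:6.0f}  surge={surged}  observed={observed:6.0f}")
    forecast = observed  # the model "learns" from confounded observations
```

The forecast flips between 1,000 and 800 indefinitely; more data cannot fix it, because the observations themselves depend on the forecast.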
Evaluation Metrics — Three Layers
| Category | Metric | Target | Why It Matters |
|---|---|---|---|
| Accuracy (primary) | WMAPE | < 10% for 1h-ahead, < 20% for 24h-ahead | Gold standard for demand; not distorted by zero-demand zones |
| Accuracy (tail) | P90 absolute error by zone-hour bucket | P90 < 2× median error | Tail failures cause the worst driver/rider experience; the fat tail of the error distribution matters, not just the mean |
| Spatial accuracy | MAE breakdown by zone type (airport, transit, residential) | No zone type with MAE > 2× global | Spatially uneven errors → systematic misallocation of driver supply |
| Calibration (quantile) | Coverage: % of actuals within [P10, P90] interval | Coverage ≈ 80% | Quantile forecasts enable risk-aware supply decisions; must be calibrated |
| Bias | Mean signed error (forecast - actual) by hour of day | < 5% of mean demand | Systematic underestimation at rush hour = guaranteed supply shortage |
| Temporal eval split | Train on months 1–10; evaluate on months 11–12 | Required | Temporal leakage check — future data must never inform past predictions |
| Online (business) | Driver utilization rate | +2% target | Directly measures supply-demand match; most downstream measure of forecast quality |
| Online (business) | Average pickup wait time | No increase | If forecast underestimates demand → not enough drivers → long waits → rider churn |
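The calibration row can be checked directly against the model's quantile heads; a minimal sketch using the tensor shapes from the architecture section:

```python
import torch

def interval_coverage(p10: torch.Tensor, p90: torch.Tensor, actual: torch.Tensor) -> float:
    """Fraction of actuals inside the [P10, P90] band; well-calibrated ≈ 0.80."""
    inside = (actual >= p10) & (actual <= p90)
    return inside.float().mean().item()
```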
A/B Testing Demand Forecasting — The Unique Challenge
Zone-level holdout, not user-level
Unlike recommendation A/B tests (user splits), demand forecasting A/B tests should use geographic zone holdouts. Randomly assign H3 zones to control (old model) and treatment (new model) groups. Actions (pricing, driver repositioning) are driven by the respective model for each zone. This isolates the effect without confounding user-level behavior.
Suppress spillover effects between zones
Demand spills between adjacent zones. If treatment zone A is well-served (good forecast) but control zone B (adjacent) is under-served, drivers from A may migrate to B, confounding the measurement. Use buffer zones between treatment and control clusters. Never place adjacent zones in different experimental groups.
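One way to implement the buffer rule, assuming the h3-py v4 API (grid_disk, grid_ring) — a greedy sketch, not a production randomization scheme:

```python
import random
import h3

def assign_zones(cells: set[str], cluster_k: int = 3) -> dict[str, str]:
    """Cluster-randomized assignment with buffer rings.

    Each seed claims its k-disk as one experimental arm; the ring just outside
    the cluster becomes a buffer excluded from the analysis.
    """
    assignment: dict[str, str] = {}
    arms = ["treatment", "control"]
    unassigned = set(cells)
    i = 0
    while unassigned:
        seed = random.choice(sorted(unassigned))
        cluster = set(h3.grid_disk(seed, cluster_k)) & unassigned
        buffer_ring = set(h3.grid_ring(seed, cluster_k + 1)) & unassigned
        for cell in cluster:
            assignment[cell] = arms[i % 2]
        for cell in buffer_ring:
            assignment[cell] = "buffer"
        unassigned -= cluster | buffer_ring
        i += 1
    return assignment
```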
Measure business outcomes, not just WMAPE
WMAPE improvement doesn't automatically translate to business improvement. Measure: driver utilization rate (supply-demand match), average pickup wait time (rider experience), driver earnings per hour (marketplace health). A model with 15% WMAPE that improves driver utilization by 3% is better than one with 10% WMAPE that doesn't affect utilization.
30-day minimum for seasonal patterns
Demand has strong weekly and monthly seasonal patterns. A 1-week A/B test confounds treatment effect with day-of-week variation. Run for at least 30 days to capture at least 4 complete weekly cycles. For annual seasonal products (holiday demand), even 30 days may be insufficient — use historical season-matched comparisons.
Failure Mode Catalog
| Failure Mode | Manifestation | Detection | Mitigation |
|---|---|---|---|
| Feedback loop confounding | Model oscillates between over/underestimation at surge thresholds; systematic bias at high-demand periods | Signed error analysis at different surge multiplier levels; bias in surge vs non-surge periods | Train on requests (not completions); include surge as explicit feature; holdout experiment zones for causal data |
| Spatial independence failure | High driver wait times in zones adjacent to well-served zones; spatial clustering of errors | Spatial correlation of errors (Moran's I statistic on residuals > 0.3) | GNN with spatial aggregation; zone-type-aware model; explicit spillover feature |
| Distribution shift — new events | Model wildly underestimates demand at major events (concerts, games); surge activates too late | Error spikes on event days; post-event audit of missed surges | Event calendar features; online learning adapts quickly after first event |
| Online learning catastrophic forgetting | GNN fine-tuning on recent data degrades long-term seasonality patterns | Model accuracy on 7-day-ahead horizon declines while 1h-ahead improves | Elastic weight consolidation; separate short-horizon and long-horizon models; learning rate warmup |
| Cold start — new zones | New neighborhood added; model has no spatial data for those H3 cells | WMAPE >> 20% for new zones in first 2 weeks | Graph-based transfer from neighbors; zone type prior (residential → typical residential pattern) |
| Temporal leakage in training | Model uses future data as features during training; overly optimistic offline metrics | Offline WMAPE >> online WMAPE; performance drops significantly at deployment | Strict point-in-time correct feature joins in Feast; temporal feature computation window validation |
| Stale weather features | Forecast wrong during sudden weather events (flash storms) because weather API is delayed | Model error spikes correlate with weather severity | Sub-15-minute weather updates; ensemble weight toward classical models during extreme weather (they're more conservative) |
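The Moran's I residual check from the spatial-independence row takes a few lines given a binary zone-adjacency matrix (zero diagonal); a minimal numpy sketch:

```python
import numpy as np

def morans_i(residuals: np.ndarray, adjacency: np.ndarray) -> float:
    """Moran's I of forecast residuals: ~0 → spatially random errors;
    > ~0.3 → errors cluster in space, i.e. missing spatial structure."""
    n = len(residuals)
    z = residuals - residuals.mean()
    num = (adjacency * np.outer(z, z)).sum()
    den = (z ** 2).sum()
    return (n / adjacency.sum()) * (num / den)
```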
System Evolution — v1 to v4
What Each Level Should Cover
Mid-level: Knows demand forecasting is time series. Lists features: time of day, day of week, weather, events. Can describe Prophet or LSTM baseline. Knows train/test split must be temporal.
Senior-level adds: Explains why per-zone independent models fail (spatial correlation, spillover). Designs H3 spatial graph and neighbor feature aggregation. Explains online learning with hourly fine-tuning. Identifies the feedback loop (forecast → pricing → demand → training label). Names production tools: H3, Apache Flink, ClickHouse, Feast. Uses WMAPE instead of MAPE and explains why.
Staff-level adds: Proposes Temporal GNN (SAGEConv + GRU) as the joint spatial-temporal model. Designs drift-triggered retraining with ADWIN/Page-Hinkley detectors. Addresses the causal confounding problem and proposes training on requests (not completions). Designs zone-level holdout A/B test with spillover buffer zones. Reasons about ensemble weighting: when to trust classical models (stable seasonality) vs GNN (rapid recent changes).