
ML System Design: Demand Forecasting System

Design the demand forecasting system used by Uber, Lyft, DoorDash, and Amazon — end-to-end. Covers the three hard problems that make this uniquely challenging: spatial ML (demand is correlated across geography, not independent per zone), online learning (marketplace conditions change faster than batch retraining), and the feedback loop where demand forecasts drive pricing which affects actual demand. Includes H3 spatial graphs, temporal GNNs, online adaptation with drift detection, and why WMAPE beats MAPE for imbalanced demand.

55 min read · 20 sections · 10 interview questions

Demand Forecasting · Spatial ML · Graph Neural Networks · Online Learning · Feedback Loops · H3 Geohashing · Temporal Forecasting · Drift Detection · WMAPE · Surge Pricing · Spatiotemporal · Distribution Shift

Why Demand Forecasting Isn't Just Time Series Regression

Demand forecasting appears on the surface to be a classical time series problem: given historical demand at location X, predict future demand at location X. ARIMA, Prophet, LSTM — the standard toolkit. But at production scale in a marketplace, three properties of the problem break this framing entirely.

Demand is spatially correlated, not independent per zone. Rain hitting downtown Seattle affects demand across 50 adjacent H3 zones simultaneously. A Taylor Swift concert in zone A creates surge demand in zones A, B, and C (pre-concert) and then in zones D and E (post-concert exits). A road closure in one zone spills demand into neighboring zones. Models that treat each zone independently miss this spatial structure entirely — and the spatial correlations are the most valuable predictive signal at short time horizons.

The marketplace changes faster than batch retraining. A weather event, a sports game, a power outage — any of these can shift demand dramatically within minutes. A model retrained yesterday is already wrong. Production demand forecasting requires online adaptation: continuous or near-continuous model updates as real-world conditions evolve. The Lyft spatiotemporal forecasting system (published 2025) explicitly addresses this: "a model trained on data from a month ago performs significantly worse than one trained on the most recent data."

Demand forecasts drive actions that change demand. This is the feedback loop that makes demand forecasting unique: a high forecast activates surge pricing → drivers are repositioned → demand responds to price → actual demand differs from the counterfactual. If the model learns from data generated by its own pricing decisions, it learns correlations that are partially self-caused. This creates a confounded training set that produces subtly wrong future forecasts.

This problem is asked at Uber, Lyft, DoorDash, Instacart, Amazon Logistics, Walmart, Target, Starbucks, airline companies, power grid operators, and any company that needs to allocate resources in advance across geographies.

TIP

What Interviewers Are Evaluating

Mid-level: Knows demand forecasting is a time series problem. Lists relevant features (time of day, day of week, weather, events). Can describe a Prophet or LSTM baseline. Knows train/test split must be temporal.

Senior-level: Designs spatial features using H3 geohashing and neighbor spillover. Explains why per-zone models fail (misses spatial correlation). Describes online learning with drift detection. Identifies the feedback loop between forecasts and pricing. Names production tools: H3, Apache Flink, Feast, ClickHouse.

Staff-level: Proposes Temporal GNN architecture for joint spatial-temporal learning. Designs the online adaptation strategy with drift detection triggers. Addresses the causal confounding problem (forecast → pricing → demand → training data bias). Proposes the counterfactual evaluation framework. Reasons about ensemble architecture (classical models + GNN + ensemble meta-learner).

Clarifying Questions — Ask These First

01

What type of demand are we forecasting?

Rideshare pickup demand (number of ride requests)? Food delivery order volume? Inventory demand (units sold per SKU per store)? Each has different features, spatial structures, and feedback mechanisms. For this design: rideshare demand per geographic zone per 5-minute interval.

02

What is the forecast horizon and resolution?

1-hour ahead at 5-minute intervals for driver repositioning? 24-hour ahead for staffing? 7-day ahead for inventory planning? Horizon determines model architecture and acceptable latency. Short-horizon (1h) requires real-time features and online learning. Long-horizon (7d) allows batch models.

03

What is the spatial granularity?

City level (1 number per city)? Neighborhood level (~50 zones per city)? Block level (~5,000 zones per city)? Finer granularity requires spatial models to avoid sparse training data per zone. Establish this upfront.

04

What actions will be taken on the forecast?

Driver repositioning (suggests routes to drivers)? Surge pricing (shown to riders)? Staffing (schedule drivers for peak hours)? The action determines the accuracy requirement and latency budget. Driver repositioning needs forecasts every 5 minutes. Staffing can use 24-hour-ahead forecasts.

05

What is the cost of overestimation vs underestimation?

Overestimation (too many drivers → driver earnings decrease, platform costs increase). Underestimation (too few drivers → high wait times, poor rider experience). These asymmetric costs determine the loss function design and acceptable bias direction.

ML Objective Options — Why WMAPE Beats MAPE

| Metric | Definition | Problem | Use When |
|---|---|---|---|
| RMSE | √(mean((y − ŷ)²)) | Heavily penalizes outliers; sensitive to scale; dominated by high-demand events | Features have consistent scale; outliers matter equally |
| MAPE | mean(\|y − ŷ\| / y) | Undefined when y = 0; penalizes overforecasting more than underforecasting (biasing models toward underforecasts); fails for sparse-demand zones | All zones have positive demand; symmetric cost |
| WMAPE (weighted MAPE) | Σ\|y − ŷ\| / Σy (weighted by actual demand) | None significant for demand forecasting | Gold standard for demand forecasting; naturally handles near-zero-demand zones; focuses on high-demand periods |
| Quantile loss (pinball) | q·max(y − ŷ, 0) + (1 − q)·max(ŷ − y, 0) | Requires choosing q; point estimates lose information | When under/overforecasting costs are asymmetric; uncertainty quantification needed |
| Business metric: driver utilization | % of drivers with rides / total on-platform | Delayed; requires marketplace simulation to optimize | Staff-level; when optimizing the full system rather than the forecast alone |
⚠ WARNING

Why MAPE Fails for Demand Forecasting

MAPE = mean(|y - ŷ| / y) fails when actual demand y is 0 (division by zero) or near-zero (inflated percentage errors for tiny absolute errors). In a rideshare network, a zone at 3 AM has near-zero demand. A forecast of 0.1 vs actual 0.2 rides creates a 50% error — indistinguishable from a forecast of 500 vs actual 1000 in a peak zone. MAPE treats these equally. WMAPE = Σ|y-ŷ| / Σy normalizes by total demand volume, not per-zone demand. Low-demand zones contribute little to the metric because their demand is small. High-demand zones (airports, downtown areas at rush hour) dominate. This mirrors business impact: getting rush hour right matters far more than getting 3 AM right.
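A toy computation (illustrative numbers, not from any production system) makes the contrast concrete — MAPE is dominated by the quiet zone's relative error, while WMAPE reflects the zone that carries the demand:

```python
def mape(actuals, forecasts):
    """Per-observation mean absolute percentage error (zones with y == 0 skipped)."""
    terms = [abs(y - f) / y for y, f in zip(actuals, forecasts) if y > 0]
    return sum(terms) / len(terms)

def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total demand."""
    return sum(abs(y - f) for y, f in zip(actuals, forecasts)) / sum(actuals)

# A quiet 3 AM zone and a peak airport zone:
actuals   = [0.2, 1000]   # rides observed
forecasts = [0.1, 990]    # 50% relative error vs 1% relative error

print(round(mape(actuals, forecasts), 3))   # → 0.255 — dominated by the tiny zone
print(round(wmape(actuals, forecasts), 4))  # → 0.0101 — reflects the airport zone
```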

Demand Forecasting Production Architecture


Spatial ML — Why Independent Zone Models Fail

The naive approach to demand forecasting is: train one model per zone. Zone A has its own time series; predict it independently. This is wrong for three reasons:

Spatial autocorrelation: demand in adjacent zones is strongly correlated. If zone A is experiencing a surge, zones B and C (adjacent) are likely also experiencing elevated demand. A flood of workers leaving office buildings at 5 PM affects a connected cluster of zones, not a single zone in isolation. Independent models miss this — they can't use "zone B is surging" as a feature for predicting zone A.

Spillover effects: demand spills between zones. If zone A becomes fully served (drivers meet demand), excess demand doesn't disappear — it flows to adjacent zones. A model that doesn't capture spillover will systematically underestimate demand in zones adjacent to well-served areas.

New zone generalization: independent models have no data for new zones (new city launch, new neighborhood expansion). A spatial model trained on other zones can transfer learned patterns to the new zone via its graph connections to neighboring zones.

The solution: model all zones jointly using a graph where nodes are H3 cells and edges connect adjacent cells. A Temporal GNN (Graph Neural Network with temporal dynamics) learns to predict demand at each node by aggregating information from neighboring nodes at each timestep.

Spatial Graph Construction — H3 Zones as Graph Nodes

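A sketch of the graph construction. In production the neighbor sets would come from Uber's h3 library (e.g. `h3.grid_disk(cell, 1)` in h3 v4); here a hypothetical hand-written adjacency map stands in, so the focus is the COO `edge_index` layout that `torch_geometric` expects:

```python
# Hypothetical H3 cell ids and adjacency; in production, neighbor sets
# come from the h3 library (e.g. h3.grid_disk(cell, 1) in h3 v4).
neighbors = {
    "zone_a": ["zone_b", "zone_c"],
    "zone_b": ["zone_a", "zone_c"],
    "zone_c": ["zone_a", "zone_b", "zone_d"],
    "zone_d": ["zone_c"],
}

# Stable integer ids per node
node_id = {cell: i for i, cell in enumerate(sorted(neighbors))}

# Directed edge list in COO layout — the (2, num_edges) shape used for
# edge_index. Both directions are emitted because adjacency is symmetric.
src, dst = [], []
for cell, adj in neighbors.items():
    for nbr in adj:
        src.append(node_id[cell])
        dst.append(node_id[nbr])

edge_index = [src, dst]   # torch.tensor(edge_index) in the actual model
print(len(src))  # → 8 directed edges
```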

Temporal GNN Architecture — SAGEConv + GRU

demand_gnn.py
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class DemandForecastGNN(nn.Module):
    """
    Temporal GNN for spatial demand forecasting.
    SAGEConv aggregates spatial (neighbor) information.
    GRU captures temporal dynamics (sequential patterns).
    Outputs: forecast for each node (H3 zone) for next T timesteps.
    """
    def __init__(
        self,
        node_features: int,      # features per zone per timestep
        hidden_dim: int = 128,
        n_spatial_layers: int = 2,
        forecast_horizon: int = 12,  # 12 × 5min = 60min ahead
        seq_len: int = 24,           # 24 × 5min = 2h of history
    ):
        super().__init__()
        self.seq_len = seq_len
        self.forecast_horizon = forecast_horizon

        # Node feature projection
        self.node_embed = nn.Linear(node_features, hidden_dim)

        # Spatial aggregation: SAGEConv gathers neighbor signals
        # Applied per-timestep to capture spatial correlations
        self.spatial_layers = nn.ModuleList([
            SAGEConv(hidden_dim, hidden_dim)
            for _ in range(n_spatial_layers)
        ])

        # Temporal dynamics: GRU over the spatially-enriched sequence
        # After spatial aggregation, each node has a time series of spatial-aware features
        self.temporal_gru = nn.GRU(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.1,
        )

        # Output heads: point estimate + quantile estimates
        self.head_p50 = nn.Linear(hidden_dim, forecast_horizon)   # median
        self.head_p10 = nn.Linear(hidden_dim, forecast_horizon)   # optimistic
        self.head_p90 = nn.Linear(hidden_dim, forecast_horizon)   # conservative

    def forward(
        self,
        x: torch.Tensor,         # (num_nodes, seq_len, node_features)
        edge_index: torch.Tensor, # (2, num_edges) — graph adjacency
        edge_weight: torch.Tensor, # (num_edges,) — spatial correlation weights (not consumed by SAGEConv; kept for weighted-GNN variants)
    ):
        num_nodes, seq_len, _ = x.shape

        # Step 1: Spatial aggregation at each timestep
        # For each timestep, aggregate neighbor information via SAGEConv
        spatial_out = []
        for t in range(seq_len):
            xt = self.node_embed(x[:, t, :])   # (num_nodes, hidden_dim)
            for layer in self.spatial_layers:
                xt = torch.relu(layer(xt, edge_index))
            spatial_out.append(xt)

        # spatial_out is a list of seq_len tensors, each (num_nodes, hidden_dim)
        # Stack along dim=1 → (num_nodes, seq_len, hidden_dim) for the GRU
        spatial_sequence = torch.stack(spatial_out, dim=1)

        # Step 2: Temporal modeling — GRU over spatially-enriched sequence
        gru_out, _ = self.temporal_gru(spatial_sequence)
        # Use final hidden state as the "current state" for forecasting
        final_state = gru_out[:, -1, :]   # (num_nodes, hidden_dim)

        # Step 3: Multi-horizon, multi-quantile forecast
        return {
            "p50": self.head_p50(final_state),   # (num_nodes, horizon)
            "p10": self.head_p10(final_state),
            "p90": self.head_p90(final_state),
        }

def quantile_loss(pred: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    """Pinball loss for quantile regression."""
    error = target - pred
    return torch.mean(torch.max(q * error, (q - 1) * error))

Feature Engineering — Node Features for Each H3 Zone

| Feature | Type | Update Frequency | What It Captures |
|---|---|---|---|
| Recent demand (last 5/15/30 min) | Online, sliding window | Real-time via Flink | Immediate demand trend; leading indicator of short-horizon future demand |
| Historical demand (same hour, same day of week) | Offline, from ClickHouse | Hourly | Seasonal baseline: Tuesday 8 AM always looks like last Tuesday 8 AM |
| Driver supply (available drivers in zone) | Online | Real-time (GPS pings) | Supply-demand imbalance; surplus supply predicts demand will be served; deficit predicts frustrated riders |
| Weather: precipitation, visibility, temperature | Offline (API) | 15-minute forecast updates | Rain increases demand 15–30%; heavy snow decreases it; extreme heat or cold increases it |
| Active events nearby | Offline (event calendar + geocoding) | Daily pre-computation + real-time detection | Concert, game, or marathon end → demand spike in surrounding zones within a 15–30 min window |
| Time features (cyclical) | Online (from request) | Per-request | Hour of day (sin/cos), day of week, is_holiday, week_of_year — all cyclically encoded to avoid discontinuity |
| Zone type (static) | Offline | Daily | Airport, transit hub, residential, commercial, university — learned zone archetype from historical behavior |
| Neighboring zone demand (graph aggregation) | Computed by GNN | Per forward pass | Core spatial feature: SAGEConv aggregates neighbor demand signals |
| Surge pricing state | Online | Real-time | Current surge multiplier affects demand — must be included so the model doesn't confuse price effects with demand changes |
EXAMPLE

Cyclical Encoding — Why sin/cos Beats Integer Encoding for Time

Integer encoding for hour of day: 23 and 0 (midnight and 12:01 AM) have integer distance 23. But they're adjacent in time. A linear model or distance-based algorithm treats them as far apart. Cyclical encoding uses sin(2π×hour/24) and cos(2π×hour/24). Hour 23 becomes (sin(2π×23/24), cos(2π×23/24)) ≈ (−0.26, 0.97). Hour 0 becomes (0, 1). These are geometrically close in 2D — correctly representing their temporal proximity. Apply the same transformation to day of week, month, and week of year. This single feature engineering choice improves model performance measurably on time-periodic patterns.
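The encoding and the distances quoted above, as a runnable sketch:

```python
import math

def cyclical(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

h23 = cyclical(23, 24)   # ≈ (-0.259, 0.966)
h0 = cyclical(0, 24)     # (0.0, 1.0)

# Hours 23 and 0 end up geometrically adjacent...
print(round(math.dist(h23, h0), 3))   # → 0.261
# ...whereas integer encoding puts them 23 apart.
```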

Online Learning — Adapting to a Non-Stationary World

The core problem: marketplace conditions change faster than batch retraining cycles. A model trained on historical data from last week doesn't know about:

  • A new housing development that opened last month, adding 5,000 residents to zone A
  • A subway line closure that started yesterday, routing commuters through surface streets
  • A new corporate campus that began operations this week, creating a new daily demand pattern

Lyft's production finding (2025): "a model trained on data from a month ago performs significantly worse than one trained on the most recent data — often by 30–40% in WMAPE terms during rapid marketplace changes."

Online learning strategies, ordered by complexity:

1. Incremental batch retraining (simplest): retrain the model every hour using the latest N hours of data (sliding window). For classical models (ARIMA, Prophet), this is computationally cheap. For GNNs, hourly full retraining is feasible with small models but becomes expensive at scale.

2. Warm-start fine-tuning: keep the model weights from the previous training run. Initialize the next training run from those weights. Only a few gradient steps needed to adapt to new data. 10–100× faster than training from scratch.

3. Triggered drift-based retraining: only retrain when drift is detected. Reduces unnecessary computation while ensuring freshness during rapid changes.

4. Online gradient descent: update model weights continuously on each incoming data point or mini-batch. Effective for linear models and shallow networks. Requires careful learning rate management to avoid catastrophic forgetting.
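Strategy 4 can be sketched with a pure-Python linear forecaster updated one observation at a time — a toy illustration with made-up numbers, not a production design:

```python
class OnlineLinearForecaster:
    """Linear demand model updated by plain online SGD, one observation at a time."""
    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        """One SGD step on squared error, called for every new observation."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err

model = OnlineLinearForecaster(n_features=2, lr=0.1)
# Streamed (features, demand) pairs; the latent relationship is y = 3*x0 + x1
stream = [([1.0, 2.0], 5.0), ([2.0, 1.0], 7.0), ([1.0, 0.0], 3.0)] * 500
for x, y in stream:
    model.update(x, y)
print(round(model.predict([1.0, 1.0]), 2))  # → 4.0 — recovers 3*1 + 1*1
```

The learning rate is the catastrophic-forgetting knob mentioned above: too high and every new observation overwrites the seasonal pattern; too low and the model never catches a regime change.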

Online Adaptation — Drift Detection and Retraining

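One common retraining trigger is the Page-Hinkley test (named alongside ADWIN later in this article), run over the live error stream. A minimal sketch — the thresholds are illustrative, not tuned:

```python
class PageHinkley:
    """Page-Hinkley test: flags a sustained upward shift in a metric stream
    (e.g. per-interval WMAPE). `delta` absorbs noise; `lambda_` is the
    alarm threshold."""
    def __init__(self, delta: float = 0.005, lambda_: float = 0.5):
        self.delta = delta
        self.lambda_ = lambda_
        self.mean = 0.0      # running mean of the stream
        self.n = 0
        self.cum = 0.0       # cumulative deviation above the running mean
        self.cum_min = 0.0   # historical minimum of the cumulative sum

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lambda_   # drift detected?

detector = PageHinkley()
stream = [0.08] * 50 + [0.20] * 20    # WMAPE jumps at interval 50
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
print(alarms[0])  # → 54: drift flagged a few intervals after the shift
```

In this architecture the alarm would kick off a warm-start fine-tune rather than a full retrain.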

The Feedback Loop — When Your Forecast Changes What It's Predicting

Demand forecasting in a marketplace is unique: the forecast is not just an observation — it drives actions that change actual demand. This creates a causal loop that standard ML training assumes doesn't exist.

The loop:

  1. Forecast predicts high demand in zone A at 5 PM Friday
  2. Surge pricing activated in zone A: 1.8× price multiplier
  3. Some riders choose not to book (price-sensitive demand elasticity)
  4. Actual demand is 20% lower than forecast
  5. Training data: "forecast 1,000 rides at 5PM, observed 800 rides"
  6. Model learns: "I was wrong, reduce Friday 5PM forecast by 20%"
  7. New forecast: 800 rides at 5PM Friday next week
  8. No surge activated (forecast doesn't cross threshold)
  9. Actual demand: 1,000 rides (without price dampening)
  10. Model: "I was wrong again, increase forecast by 25%"
  11. Oscillation continues indefinitely

Why this is a confounding problem: the model is learning the correlation between "forecasted demand" and "observed demand under that forecast's pricing." This is not the same as "underlying demand under zero-price intervention." The model cannot recover the counterfactual (what demand would have been without surge pricing) from observational data alone.
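The oscillation in steps 1–11 can be reproduced with a toy simulator (all numbers illustrative): latent demand is fixed, surge activates above a forecast threshold and suppresses 20% of bookings, and a naive learner sets next week's forecast to last week's observation.

```python
LATENT_DEMAND = 1000      # true demand with no price intervention
SURGE_THRESHOLD = 900     # surge activates when forecast >= threshold
PRICE_DAMPENING = 0.8     # surge suppresses 20% of bookings

def observed_demand(forecast: float) -> float:
    """Realized demand depends on the action the forecast triggered."""
    surge = forecast >= SURGE_THRESHOLD
    return LATENT_DEMAND * (PRICE_DAMPENING if surge else 1.0)

# Naive learner: next week's forecast = last week's observation.
forecast = 1000.0
history = []
for week in range(6):
    actual = observed_demand(forecast)
    history.append((round(forecast), round(actual)))
    forecast = actual   # "corrects" toward confounded observations

print(history)
# → [(1000, 800), (800, 1000), (1000, 800), (800, 1000), (1000, 800), (800, 1000)]
```

The learner never converges because it is chasing an observation that its own forecast keeps changing — exactly the confounding described above.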

The Demand-Pricing Feedback Loop


Evaluation Metrics — Three Layers

| Category | Metric | Target | Why It Matters |
|---|---|---|---|
| Accuracy (primary) | WMAPE | < 10% for 1h-ahead; < 20% for 24h-ahead | Gold standard for demand; not distorted by zero-demand zones |
| Accuracy (tail) | P90 absolute error by zone-hour bucket | P90 < 2× median error | Tail failures cause the worst driver/rider experience; optimize fat tails |
| Spatial accuracy | MAE breakdown by zone type (airport, transit, residential) | No zone type with MAE > 2× global | Spatially uneven errors → systematic misallocation of driver supply |
| Calibration (quantile) | Coverage: % of actuals within the [P10, P90] interval | Coverage ≈ 80% | Quantile forecasts enable risk-aware supply decisions; must be calibrated |
| Bias | Mean signed error (forecast − actual) by hour of day | < 5% of mean demand | Systematic underestimation at rush hour = guaranteed supply shortage |
| Temporal eval split | Train on months 1–10; evaluate on months 11–12 | Required | Temporal leakage check — future data must never inform past predictions |
| Online (business) | Driver utilization rate | +2% target | Directly measures supply-demand match; most downstream measure of forecast quality |
| Online (business) | Average pickup wait time | No increase | If the forecast underestimates demand → not enough drivers → long waits → rider churn |
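Two of these checks — quantile coverage and signed bias — reduce to a few lines (toy data; the targets quoted are the ones from the table):

```python
def coverage(actuals, p10, p90):
    """Fraction of actuals inside the [P10, P90] interval — target ≈ 0.8."""
    inside = sum(1 for y, lo, hi in zip(actuals, p10, p90) if lo <= y <= hi)
    return inside / len(actuals)

def mean_signed_error(forecasts, actuals):
    """Positive → systematic overforecasting; negative → underforecasting."""
    return sum(f - y for f, y in zip(forecasts, actuals)) / len(actuals)

actuals   = [100, 120, 80, 95, 200]
p10       = [ 90, 100, 60, 85, 150]
p90       = [130, 140, 95, 110, 190]
forecasts = [110, 118, 78, 99, 170]

print(coverage(actuals, p10, p90))            # → 0.8 — 4 of 5 inside the interval
print(mean_signed_error(forecasts, actuals))  # → -4.0 — slight underforecast bias
```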

A/B Testing Demand Forecasting — The Unique Challenge

01

Zone-level holdout, not user-level

Unlike recommendation A/B tests (user splits), demand forecasting A/B tests should use geographic zone holdouts. Randomly assign H3 zones to control (old model) and treatment (new model) groups. Actions (pricing, driver repositioning) are driven by the respective model for each zone. This isolates the effect without confounding user-level behavior.

02

Suppress spillover effects between zones

Demand spills between adjacent zones. If treatment zone A is well-served (good forecast) but control zone B (adjacent) is under-served, drivers from A may migrate to B, confounding the measurement. Use buffer zones between treatment and control clusters. Never place adjacent zones in different experimental groups.

03

Measure business outcomes, not just WMAPE

WMAPE improvement doesn't automatically translate to business improvement. Measure: driver utilization rate (supply-demand match), average pickup wait time (rider experience), driver earnings per hour (marketplace health). A model with 15% WMAPE that improves driver utilization by 3% is better than one with 10% WMAPE that doesn't affect utilization.

04

30-day minimum for seasonal patterns

Demand has strong weekly and monthly seasonal patterns. A 1-week A/B test confounds treatment effect with day-of-week variation. Run for at least 30 days to capture at least 4 complete weekly cycles. For annual seasonal products (holiday demand), even 30 days may be insufficient — use historical season-matched comparisons.
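The buffer-zone rule from point 02 can be enforced mechanically: any control zone adjacent to a treatment zone is excluded from measurement. A sketch over a hypothetical adjacency map (zone ids are made up):

```python
# Toy zone adjacency (symmetric); in production this comes from H3 neighbors.
adjacency = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
treatment = {"a", "b"}
control = set(adjacency) - treatment

# Buffer: control zones touching any treatment zone are measured by neither arm.
buffer = {z for z in control if adjacency[z] & treatment}
measurable_control = control - buffer

print(sorted(buffer))              # → ['c'] — adjacent to treatment zone b
print(sorted(measurable_control))  # → ['d', 'e']
```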

Failure Mode Catalog

| Failure Mode | Manifestation | Detection | Mitigation |
|---|---|---|---|
| Feedback loop confounding | Model oscillates between over/underestimation at surge thresholds; systematic bias at high-demand periods | Signed error analysis at different surge multiplier levels; bias in surge vs non-surge periods | Train on requests (not completions); include surge as an explicit feature; holdout experiment zones for causal data |
| Spatial independence failure | High driver wait times in zones adjacent to well-served zones; spatial clustering of errors | Spatial correlation of errors (Moran's I statistic on residuals > 0.3) | GNN with spatial aggregation; zone-type-aware model; explicit spillover feature |
| Distribution shift — new events | Model wildly underestimates demand at major events (concerts, games); surge activates too late | Error spikes on event days; post-event audit of missed surges | Event calendar features; online learning adapts quickly after the first event |
| Online learning catastrophic forgetting | GNN fine-tuning on recent data degrades long-term seasonality patterns | Accuracy on the 7-day-ahead horizon declines while 1h-ahead improves | Elastic weight consolidation; separate short-horizon and long-horizon models; learning rate warmup |
| Cold start — new zones | New neighborhood added; model has no spatial data for those H3 cells | WMAPE >> 20% for new zones in the first 2 weeks | Graph-based transfer from neighbors; zone-type prior (residential → typical residential pattern) |
| Temporal leakage in training | Model uses future data as features during training; overly optimistic offline metrics | Offline WMAPE >> online WMAPE; performance drops sharply at deployment | Strict point-in-time-correct feature joins in Feast; validation of temporal feature computation windows |
| Stale weather features | Forecast wrong during sudden weather events (flash storms) because the weather API is delayed | Model error spikes correlate with weather severity | Sub-15-minute weather updates; shift ensemble weight toward classical models during extreme weather (they're more conservative) |

System Evolution — v1 to v4

EXAMPLE

What Each Level Should Cover

Mid-level: Knows demand forecasting is time series. Lists features: time of day, day of week, weather, events. Can describe Prophet or LSTM baseline. Knows train/test split must be temporal.

Senior-level adds: Explains why per-zone independent models fail (spatial correlation, spillover). Designs H3 spatial graph and neighbor feature aggregation. Explains online learning with hourly fine-tuning. Identifies the feedback loop (forecast → pricing → demand → training label). Names production tools: H3, Apache Flink, ClickHouse, Feast. Uses WMAPE instead of MAPE and explains why.

Staff-level adds: Proposes Temporal GNN (SAGEConv + GRU) as the joint spatial-temporal model. Designs drift-triggered retraining with ADWIN/Page-Hinkley detectors. Addresses the causal confounding problem and proposes training on requests (not completions). Designs zone-level holdout A/B test with spillover buffer zones. Reasons about ensemble weighting: when to trust classical models (stable seasonality) vs GNN (rapid recent changes).

Interview Questions
