Probability Distributions: The Production ML Engineer's Reference
The 10 distributions that appear in every ML system — not as textbook formulas but as modeling tools. Covers when each distribution arises naturally, its connection to ML algorithms (Bernoulli→logistic regression, Poisson→NLP count models, Gaussian→linear regression, Log-normal→revenue modeling, Beta→Thompson Sampling, Dirichlet→LDA). Includes the Central Limit Theorem, heavy tails, and the Maximum Likelihood Estimation framework.
Why Distributions Matter for ML Practitioners
Every ML model makes assumptions about the distribution of errors, targets, and latent variables — whether or not you're explicit about it. Linear regression assumes Gaussian residuals (MSE loss = MLE under Gaussian noise). Logistic regression assumes Bernoulli outputs (cross-entropy = MLE under Bernoulli). Poisson regression assumes Poisson-distributed counts; survival models (exponential, Weibull, Cox) assume structure over event times.
When you use the wrong distribution for your data, you're solving the wrong optimization problem. A model trained with MSE on log-normal revenue data is maximizing likelihood under the wrong model — it'll be miscalibrated, its confidence intervals will be wrong, and its predictions in the tails will be unreliable.
Beyond model choice, distributions are the language of probability in interviews. 'What distribution would you use to model daily page views?' 'Why is revenue typically log-normal?' 'How would you model time between system failures?' — these come up in every quant/DS/MLE interview.
The 10 Essential Distributions — Quick Reference
| Distribution | Type | Parameters | Mean | Variance | ML Connection | Production Use Case |
|---|---|---|---|---|---|---|
| Bernoulli | Discrete | p ∈ [0,1] | p | p(1-p) | Logistic regression, BCE loss | Binary labels: click/no-click, fraud/not-fraud |
| Binomial | Discrete | n, p | np | np(1-p) | Aggregated binary outcomes | k successes in n trials: CTR measurement |
| Poisson | Discrete | λ > 0 | λ | λ | Poisson regression, NLP count models | Events per unit time: requests/sec, word counts |
| Gaussian (Normal) | Continuous | μ, σ² | μ | σ² | Linear regression (MLE), batch norm, VAE latent | Measurement errors, CLT approximations |
| Log-Normal | Continuous | μ, σ (of log) | e^(μ+σ²/2) | Complex | Regression on log-transformed targets | Revenue, session duration, file sizes |
| Exponential | Continuous | λ > 0 | 1/λ | 1/λ² | Survival analysis, MTBF | Time between events: service failures, inter-arrivals |
| Beta | Continuous | α, β > 0 | α/(α+β) | αβ/[(α+β)²(α+β+1)] | Conjugate prior for Bernoulli, Thompson Sampling | CTR posterior, multi-armed bandit arms |
| Gamma | Continuous | α, β | α/β | α/β² | Conjugate prior for Poisson, waiting times | Aggregated service times, overdispersed counts |
| Dirichlet | Continuous | α₁,...,αₖ | αᵢ/Σαⱼ | Complex | Conjugate prior for Multinomial, LDA topic model | Topic distributions, class priors |
| Student's t | Continuous | ν (df) | 0 (ν>1) | ν/(ν-2) (ν>2) | Small-sample inference, robust regression | Two-sample t-test for A/B experiments |
The Distributions You'll Model Production Data With
Bernoulli and Binomial: A Bernoulli trial is a single binary outcome with success probability p. Logistic regression models P(Y=1|X) as a Bernoulli probability — the sigmoid output is the probability of success. Binary cross-entropy loss = negative log-likelihood of the Bernoulli distribution.
Binomial(n, p): k successes in n independent Bernoulli trials. When n is large: Binomial(n,p) ≈ Normal(np, np(1-p)) by CLT. Approximation holds when min(np, n(1-p)) > 5.
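A two-line check of that BCE ↔ Bernoulli equivalence (numpy only; the labels and probabilities are made up):

import numpy as np

# Bernoulli log-likelihood: Σ [y·log(p) + (1-y)·log(1-p)] — its negation is BCE
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # e.g. sigmoid outputs

bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
neg_log_lik = -np.sum(np.log(np.where(y == 1, p, 1 - p))) / len(y)

assert np.isclose(bce, neg_log_lik)  # identical up to floating point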
Poisson: Models the number of events in a fixed time/space interval. The single-parameter distribution: both mean and variance = λ. Key property: independent increments — counts in disjoint intervals are independent (given λ is constant). (The often-quoted "memoryless" property belongs to the Exponential inter-arrival times, covered below.) Used in: NLP word frequency models (unigram language models), web traffic modeling, queueing theory.
When not to use Poisson: when variance >> mean (overdispersion). If Var(Y) > E[Y], use Negative Binomial (adds a dispersion parameter). Twitter follower counts, Airbnb bookings per listing — both heavily overdispersed.
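Why real count data overdisperses, in a few lines (a simulation sketch; the rates are made up): give each user their own Poisson rate drawn from a Gamma and the marginal counts become Negative Binomial, with variance well above the mean:

import numpy as np

rng = np.random.default_rng(0)

# Pure Poisson: every user shares one rate, so Var ≈ Mean
poisson_counts = rng.poisson(lam=5.0, size=100_000)

# Gamma-Poisson mixture: per-user rates, marginally Negative Binomial
rates = rng.gamma(shape=2.0, scale=2.5, size=100_000)  # E[rate] = 5
nb_counts = rng.poisson(lam=rates)

for name, x in [("Poisson", poisson_counts), ("Gamma-Poisson", nb_counts)]:
    print(f"{name}: mean={x.mean():.2f}, var={x.var():.2f}, "
          f"dispersion={x.var() / x.mean():.2f}")
# Poisson dispersion ≈ 1; the mixture's is ≈ 3.5 (overdispersed)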
Gaussian (Normal): Defined by mean μ and variance σ². The Central Limit Theorem: the standardized sum (or mean) of n i.i.d. random variables with finite mean and variance converges to a Normal as n → ∞, regardless of the underlying distribution. CLT is why so many real-world averages are approximately normal — and why t-tests and Z-tests work for large samples.
Gaussian is the maximum entropy distribution given a fixed mean and variance — it makes the fewest additional assumptions beyond mean and variance. This is why linear regression (minimizes MSE = maximizes Gaussian likelihood) is the right choice when residuals are symmetric and light-tailed.
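A small numerical check of the MSE ↔ Gaussian MLE equivalence (grid and parameter values arbitrary): the μ that minimizes MSE is the same μ that maximizes the Gaussian log-likelihood.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid_data = rng.normal(loc=3.0, scale=2.0, size=1_000)

mus = np.linspace(2.0, 4.0, 201)
mse = [np.mean((resid_data - m) ** 2) for m in mus]
log_lik = [stats.norm.logpdf(resid_data, loc=m, scale=2.0).sum() for m in mus]

# Both objectives are optimized at the same point: the sample mean
assert np.isclose(mus[np.argmin(mse)], mus[np.argmax(log_lik)])
print(mus[np.argmin(mse)], resid_data.mean())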
Log-Normal: If log(X) ~ Normal(μ, σ²), then X ~ Log-Normal. Products of many independent positive factors converge to log-normal (the CLT applied in log-space), just as sums converge to normal. Arises naturally for: revenue per user (large outliers from power users), session duration (most users have short sessions, a few very long), file sizes, insurance claims.
Log-normal vs Gaussian: Log-normal is right-skewed with heavy right tail. Median << mean (the mean is pulled up by rare extreme values). Standard practice: log-transform before modeling → work with Normal in log-space → exponentiate predictions back. This is why e-commerce demand forecasting models often predict log(revenue) with MSE loss rather than raw revenue with MSE loss.
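A sketch of that workflow, with one subtlety worth knowing: exponentiating a log-space mean prediction recovers the median, not the mean; the mean needs the e^(σ²/2) correction from the table above. (Parameter values here are illustrative.)

import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 4.0, 1.2
revenue = rng.lognormal(mean=mu, sigma=sigma, size=500_000)

log_rev = np.log(revenue)
mu_hat, sigma_hat = log_rev.mean(), log_rev.std()

naive = np.exp(mu_hat)                         # ≈ median, underestimates the mean
corrected = np.exp(mu_hat + sigma_hat**2 / 2)  # ≈ true mean

print(f"median={np.median(revenue):.1f}  naive={naive:.1f}")
print(f"mean={revenue.mean():.1f}  corrected={corrected:.1f}")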
Exponential: Models time between events in a Poisson process. Memoryless: P(X > s+t | X > s) = P(X > t). The only continuous memoryless distribution. Used for: time between system failures (MTBF), inter-arrival times (queueing), user session duration modeling (as a rough approximation).
Exponential has a thin right tail — insufficient for modeling human behavior (some users stay for very long times). For user session data: Weibull distribution (more flexible) or empirical distribution is typically better.
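An empirical check of memorylessness (a simulation sketch; s, t, and the scale are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=10.0, size=1_000_000)  # mean 1/λ = 10

s, t = 5.0, 8.0
p_uncond = (x > t).mean()
p_cond = (x[x > s] > s + t).mean()

print(f"P(X > {t}) = {p_uncond:.4f}")
print(f"P(X > {s + t} | X > {s}) = {p_cond:.4f}  # ≈ equal: memoryless")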
Distribution Relationships — How They Connect
These distributions form one family tree. A Bernoulli is a Binomial(1, p), and a sum of n independent Bernoullis is Binomial(n, p). Binomial(n, p) converges to Poisson(np) as n → ∞ with np fixed (the law of rare events). In a Poisson process, inter-arrival times are Exponential, and the sum of k i.i.d. Exponential(λ) waiting times is Gamma(k, λ). Exponentiating a Normal gives a Log-Normal, Student's t approaches the Normal as ν → ∞, and the Dirichlet generalizes the Beta to K categories. On the Bayesian side, Beta is conjugate to Bernoulli/Binomial, Gamma to Poisson, and Dirichlet to Multinomial.
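Two of these relationships, checked numerically (a simulation sketch; sample sizes and parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(4)

# Binomial(n, p) -> Poisson(np) as n grows with np fixed (law of rare events)
n, lam = 10_000, 3.0
binom = rng.binomial(n=n, p=lam / n, size=200_000)
pois = rng.poisson(lam=lam, size=200_000)
print(binom.mean(), binom.var())  # both ≈ 3, matching Poisson(3)
print(pois.mean(), pois.var())

# Sum of k i.i.d. Exponential(rate λ) waiting times is Gamma(k, λ)
k, rate = 5, 0.5
waits = rng.exponential(scale=1 / rate, size=(200_000, k)).sum(axis=1)
print(waits.mean(), k / rate)    # Gamma mean α/β = 10
print(waits.var(), k / rate**2)  # Gamma variance α/β² = 20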
Heavy Tails, Power Laws, and When Normal Fails
Many real-world quantities have heavy tails — extreme values occur far more often than a Gaussian would predict. Revenue, social media followers, earthquake magnitudes, city populations — all follow power laws (Pareto distributions) where a small fraction of units accounts for most of the total.
The 80/20 rule is a Pareto distribution: the top 20% of customers generate 80% of revenue. Formally, the Pareto(α) survival function is P(X > x) = (x_min/x)^α for x ≥ x_min. For α ≤ 2, the variance is infinite — CLT doesn't apply, sample means are unstable.
Consequences for ML:
- MSE is inappropriate for heavy-tailed targets (dominated by rare extremes)
- Feature scaling that assumes Gaussian distribution (standardization) will fail — outliers dominate
- Log transformation is the standard fix: log(1+X) compresses power-law features toward Gaussian, usually enough for standard preprocessing (see the sketch after this list)
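A quick demonstration of that fix (a sketch; the Pareto shape parameter is arbitrary). The raw skew is enormous and unstable since the third moment does not even exist at α = 1.5, while the log1p-transformed values have a small, stable skew:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.pareto(a=1.5, size=100_000) + 1.0  # classical Pareto, x_min = 1

print(f"raw skew:   {stats.skew(x):.1f}")            # huge, sample-dependent
print(f"log1p skew: {stats.skew(np.log1p(x)):.2f}")  # far smaller and stable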
The log-normal vs Pareto distinction: Log-normal has a lighter tail than Pareto. For most product metrics (revenue, engagement), log-normal is a reasonable approximation. For very extreme distributions (wealth, social media virality), Pareto is more appropriate.
KL Divergence between distributions:
KL(P||Q) = Σ P(x) log(P(x)/Q(x))
Measures how much information is lost when using Q to approximate P. Not symmetric: KL(P||Q) ≠ KL(Q||P). In VAEs, the KL loss term penalizes the encoder for deviating from the prior N(0,I). In production monitoring, the KL divergence between training and serving feature distributions is used to detect data drift. KL divergence between two Gaussians has a closed form: KL(N(μ₁,σ₁²) || N(μ₂,σ₂²)) = log(σ₂/σ₁) + (σ₁² + (μ₁-μ₂)²)/(2σ₂²) - 1/2
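A minimal implementation of that closed form (numpy only), showing the asymmetry:

import numpy as np

def kl_gaussian(mu1: float, s1: float, mu2: float, s2: float) -> float:
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats, from the closed form above."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

print(kl_gaussian(0, 1, 0, 1))  # 0.0: identical distributions
print(kl_gaussian(0, 1, 0, 2))  # ≈ 0.318
print(kl_gaussian(0, 2, 0, 1))  # ≈ 0.807: asymmetry in action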
Distribution Fitting and Visualization — Production Pattern
import numpy as np
from scipy import stats
import warnings
# --- Identifying which distribution fits your data ---
def fit_distribution(data: np.ndarray) -> dict:
    """
    Fit multiple distributions to data and compare AIC.
    Returns the best-fitting distribution.
    """
    distributions = {
        'normal': stats.norm,
        'lognormal': stats.lognorm,
        'exponential': stats.expon,
        'gamma': stats.gamma,
        'pareto': stats.pareto,
        'weibull_min': stats.weibull_min,
    }
    results = {}
    for name, dist in distributions.items():
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                params = dist.fit(data)
            log_lik = np.sum(dist.logpdf(data, *params))
            k = len(params)
            aic = 2 * k - 2 * log_lik
            results[name] = {'params': params, 'aic': aic}
        except Exception:
            pass  # skip distributions that fail to fit (e.g. wrong support)
    best = min(results, key=lambda name: results[name]['aic'])
    return {'best_distribution': best, 'aic': results[best]['aic'],
            'all_fits': {name: fit['aic'] for name, fit in results.items()}}
# --- Revenue modeling: log-normal is almost always right ---
def analyze_revenue(revenue: np.ndarray) -> dict:
    """
    Check if revenue follows a log-normal distribution.
    Log-normal: log(revenue) ~ Normal.
    """
    log_rev = np.log(revenue[revenue > 0])
    stat, pval = stats.normaltest(log_rev)  # D'Agostino K² test
    return {
        'mean': revenue.mean(),
        'median': np.median(revenue),
        'p90': np.percentile(revenue, 90),
        'p99': np.percentile(revenue, 99),
        'log_normal_pval': pval,  # p > 0.05 → fail to reject log-normality
        'skewness': stats.skew(revenue),
        'kurtosis': stats.kurtosis(revenue),
        'recommendation': 'log-normal' if pval > 0.05 else 'investigate further'
    }
# --- Poisson vs Negative Binomial for count data ---
def check_overdispersion(counts: np.ndarray) -> dict:
    """
    Poisson assumes Var(X) = E(X). If Var >> E, use Negative Binomial.
    Dispersion ratio = Var / Mean. >1.5 suggests significant overdispersion.
    """
    mean = counts.mean()
    var = counts.var()
    dispersion = var / mean
    return {
        'mean': mean,
        'variance': var,
        'dispersion_ratio': dispersion,
        'recommendation': 'Negative Binomial' if dispersion > 1.5 else 'Poisson',
        'note': 'Poisson assumes Var = Mean. Use NB if Var >> Mean.'
    }
# --- CLT verification: when is n 'large enough'? ---
def clt_convergence_check(dist_samples_fn, n_values=(10, 30, 100, 1000)):
    """
    Empirically verify CLT: does the sample mean converge to Normal?
    Tests normality of 10,000 sample means at each n.
    Note: with 10,000 means, normaltest has very high power; it flags even
    tiny departures from normality, so treat the p-values as a rough guide.
    """
    results = {}
    for n in n_values:
        sample_means = [dist_samples_fn(n).mean() for _ in range(10_000)]
        _, pval = stats.normaltest(sample_means)
        results[n] = {'normal_pval': pval, 'approximately_normal': pval > 0.05}
    return results
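A quick usage sketch tying these together (synthetic data; all parameter values are illustrative):

rng = np.random.default_rng(42)

revenue = rng.lognormal(mean=3.5, sigma=1.0, size=5_000)
print(fit_distribution(revenue)['best_distribution'])  # typically 'lognormal'
print(analyze_revenue(revenue)['recommendation'])      # usually 'log-normal'

sessions = rng.poisson(lam=rng.gamma(2.0, 2.0, size=5_000))  # overdispersed counts
print(check_overdispersion(sessions)['recommendation'])      # 'Negative Binomial'

# CLT check on a skewed source distribution
print(clt_convergence_check(lambda n: rng.exponential(size=n)))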
Distribution Interview Quick-Reference
Q: 'What distribution models the number of user sessions per day?' → Poisson (if low variance) or Negative Binomial (if overdispersed, which is typical for real user behavior).
Q: 'What distribution models revenue per user?' → Log-Normal. The median should be much lower than the mean (positive skew). Verify: log-transform → check for Gaussian.
Q: 'What distribution would you use as a prior for A/B test conversion rates?' → Beta distribution. Beta(1,1) = uniform prior. Beta(α, β) where α + β = effective prior sample size.
Q: 'You're building a Thompson Sampling bandit. What maintains the posterior over each arm's CTR?' → Beta distribution (conjugate to Bernoulli/Binomial). Update: Beta(α+clicks, β+non-clicks).
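A minimal Beta-Bernoulli Thompson Sampling loop illustrating that update (a sketch; the arm CTRs and trial count are made up):

import numpy as np

rng = np.random.default_rng(6)
true_ctr = [0.04, 0.05, 0.07]          # hidden arm CTRs (made up)
alpha = np.ones(3)                     # Beta(1,1) uniform priors per arm
beta = np.ones(3)

for _ in range(10_000):
    arm = int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, pick max
    click = rng.random() < true_ctr[arm]         # simulate user feedback
    alpha[arm] += click                          # Beta(α+clicks, β+non-clicks)
    beta[arm] += 1 - click

print(alpha / (alpha + beta))  # posterior means ≈ true CTRs; best arm dominates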
Q: 'What is the Central Limit Theorem?' → For i.i.d. random variables with finite mean μ and variance σ², √n(X̄ − μ) converges in distribution to N(0, σ²) as n→∞; equivalently, the sample mean is approximately N(μ, σ²/n) for large n. Requires finite variance (fails for Pareto distributions with α ≤ 2).
Q: 'What is KL divergence?' → KL(P||Q) = E_P[log(P/Q)]. Measures information lost when approximating P with Q. Not symmetric. KL(P||Q) = 0 iff P=Q. Used in: VAE loss, model monitoring for distributional shift.