Probability Distributions: The Production ML Engineer's Reference
The 10 distributions that appear in every ML system — not as textbook formulas but as modeling tools. Covers when each distribution arises naturally, its connection to ML algorithms (Bernoulli→logistic regression, Poisson→NLP count models, Gaussian→linear regression, Log-normal→revenue modeling, Beta→Thompson Sampling, Dirichlet→LDA). Includes the Central Limit Theorem, heavy tails, and the Maximum Likelihood Estimation framework.
Why Distributions Matter for ML Practitioners
Every ML model makes assumptions about the distribution of errors, targets, and latent variables — whether or not you're explicit about it. Linear regression assumes Gaussian residuals (MSE loss = MLE under Gaussian noise). Logistic regression assumes Bernoulli outputs (cross-entropy = MLE under Bernoulli). Poisson regression assumes Poisson-distributed counts; survival models (exponential, Weibull, Cox) assume structure over event times.
When you use the wrong distribution for your data, you're solving the wrong optimization problem. A model trained with MSE on log-normal revenue data is maximizing likelihood under the wrong model — it'll be miscalibrated, its confidence intervals will be wrong, and its predictions in the tails will be unreliable.
Beyond model choice, distributions are the language of probability in interviews. 'What distribution would you use to model daily page views?' 'Why is revenue typically log-normal?' 'How would you model time between system failures?' — these come up in every quant/DS/MLE interview.
The 10 Essential Distributions — Quick Reference
| Distribution | Type | Parameters | Mean | Variance | ML Connection | Production Use Case |
|---|---|---|---|---|---|---|
| Bernoulli | Discrete | p ∈ [0,1] | p | p(1-p) | Logistic regression, BCE loss | Binary labels: click/no-click, fraud/not-fraud |
| Binomial | Discrete | n, p | np | np(1-p) | Aggregated binary outcomes | k successes in n trials: CTR measurement |
| Poisson | Discrete | λ > 0 | λ | λ | Poisson regression, NLP count models | Events per unit time: requests/sec, word counts |
| Gaussian (Normal) | Continuous | μ, σ² | μ | σ² | Linear regression (MLE), batch norm, VAE latent | Measurement errors, CLT approximations |
| Log-Normal | Continuous | μ, σ (of log) | e^(μ+σ²/2) | Complex | Regression on log-transformed targets | Revenue, session duration, file sizes |
| Exponential | Continuous | λ > 0 | 1/λ | 1/λ² | Survival analysis, MTBF | Time between events: service failures, inter-arrivals |
| Beta | Continuous | α, β > 0 | α/(α+β) | αβ/[(α+β)²(α+β+1)] | Conjugate prior for Bernoulli, Thompson Sampling | CTR posterior, multi-armed bandit arms |
| Gamma | Continuous | α, β | α/β | α/β² | Conjugate prior for Poisson, waiting times | Aggregated service times, overdispersed counts |
| Dirichlet | Continuous | α₁,...,αₖ | αᵢ/Σαⱼ | Complex | Conjugate prior for Multinomial, LDA topic model | Topic distributions, class priors |
| Student's t | Continuous | ν (df) | 0 (ν>1) | ν/(ν-2) (ν>2) | Small-sample inference, robust regression | Two-sample t-test for A/B experiments |
The Distributions You'll Model Production Data With
Bernoulli and Binomial: A Bernoulli trial is a single binary outcome with success probability p. Logistic regression models P(Y=1|X) as a Bernoulli probability — the sigmoid output is the probability of success. Binary cross-entropy loss = negative log-likelihood of the Bernoulli distribution.
Binomial(n, p): k successes in n independent Bernoulli trials. When n is large: Binomial(n,p) ≈ Normal(np, np(1-p)) by CLT. Approximation holds when min(np, n(1-p)) > 5.
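A two-line check of that BCE ↔ Bernoulli equivalence (numpy only; the labels and probabilities are made up):

import numpy as np

# Bernoulli log-likelihood: Σ [y·log(p) + (1-y)·log(1-p)] — its negation is BCE
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # e.g. sigmoid outputs

bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
neg_log_lik = -np.sum(np.log(np.where(y == 1, p, 1 - p))) / len(y)

assert np.isclose(bce, neg_log_lik)  # identical up to floating point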
Poisson: Models the number of events in a fixed time/space interval. The single-parameter distribution: both mean and variance = λ. Key property: independent increments — counts in disjoint intervals are independent (given λ is constant). (The often-quoted "memoryless" property belongs to the Exponential inter-arrival times, covered below.) Used in: NLP word frequency models (unigram language models), web traffic modeling, queueing theory.
When not to use Poisson: when variance >> mean (overdispersion). If Var(Y) > E[Y], use Negative Binomial (adds a dispersion parameter). Twitter follower counts, Airbnb bookings per listing — both heavily overdispersed.
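Why real count data overdisperses, in a few lines (a simulation sketch; the rates are made up): give each user their own Poisson rate drawn from a Gamma and the marginal counts become Negative Binomial, with variance well above the mean:

import numpy as np

rng = np.random.default_rng(0)

# Pure Poisson: every user shares one rate, so Var ≈ Mean
poisson_counts = rng.poisson(lam=5.0, size=100_000)

# Gamma-Poisson mixture: per-user rates, marginally Negative Binomial
rates = rng.gamma(shape=2.0, scale=2.5, size=100_000)  # E[rate] = 5
nb_counts = rng.poisson(lam=rates)

for name, x in [("Poisson", poisson_counts), ("Gamma-Poisson", nb_counts)]:
    print(f"{name}: mean={x.mean():.2f}, var={x.var():.2f}, "
          f"dispersion={x.var() / x.mean():.2f}")
# Poisson dispersion ≈ 1; the mixture's is ≈ 3.5 (overdispersed)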
Gaussian (Normal): Defined by mean μ and variance σ². The Central Limit Theorem: the standardized sum (or mean) of n i.i.d. random variables with finite mean and variance converges to a Normal as n → ∞, regardless of the underlying distribution. CLT is why so many real-world averages are approximately normal — and why t-tests and Z-tests work for large samples.
Gaussian is the maximum entropy distribution given a fixed mean and variance — it makes the fewest additional assumptions beyond mean and variance. This is why linear regression (minimizes MSE = maximizes Gaussian likelihood) is the right choice when residuals are symmetric and light-tailed.
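A small numerical check of the MSE ↔ Gaussian MLE equivalence (grid and parameter values arbitrary): the μ that minimizes MSE is the same μ that maximizes the Gaussian log-likelihood.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid_data = rng.normal(loc=3.0, scale=2.0, size=1_000)

mus = np.linspace(2.0, 4.0, 201)
mse = [np.mean((resid_data - m) ** 2) for m in mus]
log_lik = [stats.norm.logpdf(resid_data, loc=m, scale=2.0).sum() for m in mus]

# Both objectives are optimized at the same point: the sample mean
assert np.isclose(mus[np.argmin(mse)], mus[np.argmax(log_lik)])
print(mus[np.argmin(mse)], resid_data.mean())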
Log-Normal: If log(X) ~ Normal(μ, σ²), then X ~ Log-Normal. Products of many independent positive factors converge to log-normal (the CLT applied in log-space), just as sums converge to normal. Arises naturally for: revenue per user (large outliers from power users), session duration (most users have short sessions, a few very long), file sizes, insurance claims.
Log-normal vs Gaussian: Log-normal is right-skewed with heavy right tail. Median << mean (the mean is pulled up by rare extreme values). Standard practice: log-transform before modeling → work with Normal in log-space → exponentiate predictions back. This is why e-commerce demand forecasting models often predict log(revenue) with MSE loss rather than raw revenue with MSE loss.
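A sketch of that workflow, with one subtlety worth knowing: exponentiating a log-space mean prediction recovers the median, not the mean; the mean needs the e^(σ²/2) correction from the table above. (Parameter values here are illustrative.)

import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 4.0, 1.2
revenue = rng.lognormal(mean=mu, sigma=sigma, size=500_000)

log_rev = np.log(revenue)
mu_hat, sigma_hat = log_rev.mean(), log_rev.std()

naive = np.exp(mu_hat)                         # ≈ median, underestimates the mean
corrected = np.exp(mu_hat + sigma_hat**2 / 2)  # ≈ true mean

print(f"median={np.median(revenue):.1f}  naive={naive:.1f}")
print(f"mean={revenue.mean():.1f}  corrected={corrected:.1f}")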
Exponential: Models time between events in a Poisson process. Memoryless: P(X > s+t | X > s) = P(X > t). The only continuous memoryless distribution. Used for: time between system failures (MTBF), inter-arrival times (queueing), user session duration modeling (as a rough approximation).
Exponential has a thin right tail — insufficient for modeling human behavior (some users stay for very long times). For user session data: Weibull distribution (more flexible) or empirical distribution is typically better.
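An empirical check of memorylessness (a simulation sketch; s, t, and the scale are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=10.0, size=1_000_000)  # mean 1/λ = 10

s, t = 5.0, 8.0
p_uncond = (x > t).mean()
p_cond = (x[x > s] > s + t).mean()

print(f"P(X > {t}) = {p_uncond:.4f}")
print(f"P(X > {s + t} | X > {s}) = {p_cond:.4f}  # ≈ equal: memoryless")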
Distribution Relationships — How They Connect
These distributions form one family tree. A Bernoulli is a Binomial(1, p), and a sum of n independent Bernoullis is Binomial(n, p). Binomial(n, p) converges to Poisson(np) as n → ∞ with np fixed (the law of rare events). In a Poisson process, inter-arrival times are Exponential, and the sum of k i.i.d. Exponential(λ) waiting times is Gamma(k, λ). Exponentiating a Normal gives a Log-Normal, Student's t approaches the Normal as ν → ∞, and the Dirichlet generalizes the Beta to K categories. On the Bayesian side, Beta is conjugate to Bernoulli/Binomial, Gamma to Poisson, and Dirichlet to Multinomial.
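Two of these relationships, checked numerically (a simulation sketch; sample sizes and parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(4)

# Binomial(n, p) -> Poisson(np) as n grows with np fixed (law of rare events)
n, lam = 10_000, 3.0
binom = rng.binomial(n=n, p=lam / n, size=200_000)
pois = rng.poisson(lam=lam, size=200_000)
print(binom.mean(), binom.var())  # both ≈ 3, matching Poisson(3)
print(pois.mean(), pois.var())

# Sum of k i.i.d. Exponential(rate λ) waiting times is Gamma(k, λ)
k, rate = 5, 0.5
waits = rng.exponential(scale=1 / rate, size=(200_000, k)).sum(axis=1)
print(waits.mean(), k / rate)    # Gamma mean α/β = 10
print(waits.var(), k / rate**2)  # Gamma variance α/β² = 20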
Heavy Tails, Power Laws, and When Normal Fails
Many real-world quantities have heavy tails — extreme values occur far more often than a Gaussian would predict. Revenue, social media followers, earthquake magnitudes, city populations — all follow power laws (Pareto distributions) where a small fraction of units accounts for most of the total.
The 80/20 rule is a Pareto distribution: the top 20% of customers generate 80% of revenue. Formally, the Pareto(α) survival function is P(X > x) = (x_min/x)^α for x ≥ x_min. For α ≤ 2, the variance is infinite — CLT doesn't apply, sample means are unstable.
Consequences for ML:
- MSE is inappropriate for heavy-tailed targets (dominated by rare extremes)
- Feature scaling that assumes Gaussian distribution (standardization) will fail — outliers dominate
- Log transformation is the standard fix: log(1+X) compresses power-law features toward Gaussian, usually enough for standard preprocessing (see the sketch after this list)
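A quick demonstration of that fix (a sketch; the Pareto shape parameter is arbitrary). The raw skew is enormous and unstable since the third moment does not even exist at α = 1.5, while the log1p-transformed values have a small, stable skew:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.pareto(a=1.5, size=100_000) + 1.0  # classical Pareto, x_min = 1

print(f"raw skew:   {stats.skew(x):.1f}")            # huge, sample-dependent
print(f"log1p skew: {stats.skew(np.log1p(x)):.2f}")  # far smaller and stable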
The log-normal vs Pareto distinction: Log-normal has a lighter tail than Pareto. For most product metrics (revenue, engagement), log-normal is a reasonable approximation. For very extreme distributions (wealth, social media virality), Pareto is more appropriate.
KL Divergence between distributions:
KL(P||Q) = Σ P(x) log(P(x)/Q(x))
Measures how much information is lost when using Q to approximate P. Not symmetric: KL(P||Q) ≠ KL(Q||P). In VAEs, the KL loss term penalizes the encoder for deviating from the prior N(0,I). In production monitoring, the KL divergence between training and serving feature distributions is used to detect data drift. KL divergence between two Gaussians has a closed form: KL(N(μ₁,σ₁²) || N(μ₂,σ₂²)) = log(σ₂/σ₁) + (σ₁² + (μ₁-μ₂)²)/(2σ₂²) - 1/2
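A minimal implementation of that closed form (numpy only), showing the asymmetry:

import numpy as np

def kl_gaussian(mu1: float, s1: float, mu2: float, s2: float) -> float:
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats, from the closed form above."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

print(kl_gaussian(0, 1, 0, 1))  # 0.0: identical distributions
print(kl_gaussian(0, 1, 0, 2))  # ≈ 0.318
print(kl_gaussian(0, 2, 0, 1))  # ≈ 0.807: asymmetry in action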
Distribution Fitting and Visualization — Production Pattern
import numpy as np
from scipy import stats
import warnings
# --- Identifying which distribution fits your data ---
def fit_distribution(data: np.ndarray) -> dict:
    """
    Fit multiple distributions to data and compare AIC.
    Returns the best-fitting distribution.
    """
    distributions = {
        'normal': stats.norm,
        'lognormal': stats.lognorm,
        'exponential': stats.expon,
        'gamma': stats.gamma,
        'pareto': stats.pareto,
        'weibull_min': stats.weibull_min,
    }
    results = {}
    for name, dist in distributions.items():
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                params = dist.fit(data)
            log_lik = np.sum(dist.logpdf(data, *params))
            k = len(params)
            aic = 2 * k - 2 * log_lik
            results[name] = {'params': params, 'aic': aic}
        except Exception:
            pass  # skip distributions that fail to fit (e.g. wrong support)
    best = min(results, key=lambda name: results[name]['aic'])
    return {'best_distribution': best, 'aic': results[best]['aic'],
            'all_fits': {name: fit['aic'] for name, fit in results.items()}}
# --- Revenue modeling: log-normal is almost always right ---
def analyze_revenue(revenue: np.ndarray) -> dict:
    """
    Check if revenue follows a log-normal distribution.
    Log-normal: log(revenue) ~ Normal.
    """
    log_rev = np.log(revenue[revenue > 0])
    stat, pval = stats.normaltest(log_rev)  # D'Agostino K² test
    return {
        'mean': revenue.mean(),
        'median': np.median(revenue),
        'p90': np.percentile(revenue, 90),
        'p99': np.percentile(revenue, 99),
        'log_normal_pval': pval,  # p > 0.05 → fail to reject log-normality
        'skewness': stats.skew(revenue),
        'kurtosis': stats.kurtosis(revenue),
        'recommendation': 'log-normal' if pval > 0.05 else 'investigate further'
    }
# --- Poisson vs Negative Binomial for count data ---
def check_overdispersion(counts: np.ndarray) -> dict:
    """
    Poisson assumes Var(X) = E(X). If Var >> E, use Negative Binomial.
    Dispersion ratio = Var / Mean. >1.5 suggests significant overdispersion.
    """
    mean = counts.mean()
    var = counts.var()
    dispersion = var / mean
    return {
        'mean': mean,
        'variance': var,
        'dispersion_ratio': dispersion,
        'recommendation': 'Negative Binomial' if dispersion > 1.5 else 'Poisson',
        'note': 'Poisson assumes Var = Mean. Use NB if Var >> Mean.'
    }
# --- CLT verification: when is n 'large enough'? ---
def clt_convergence_check(dist_samples_fn, n_values=(10, 30, 100, 1000)):
    """
    Empirically verify CLT: does the sample mean converge to Normal?
    Tests normality of 10,000 sample means at each n.
    Note: with 10,000 means, normaltest has very high power; it flags even
    tiny departures from normality, so treat the p-values as a rough guide.
    """
    results = {}
    for n in n_values:
        sample_means = [dist_samples_fn(n).mean() for _ in range(10_000)]
        _, pval = stats.normaltest(sample_means)
        results[n] = {'normal_pval': pval, 'approximately_normal': pval > 0.05}
    return results
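A quick usage sketch tying these together (synthetic data; all parameter values are illustrative):

rng = np.random.default_rng(42)

revenue = rng.lognormal(mean=3.5, sigma=1.0, size=5_000)
print(fit_distribution(revenue)['best_distribution'])  # typically 'lognormal'
print(analyze_revenue(revenue)['recommendation'])      # usually 'log-normal'

sessions = rng.poisson(lam=rng.gamma(2.0, 2.0, size=5_000))  # overdispersed counts
print(check_overdispersion(sessions)['recommendation'])      # 'Negative Binomial'

# CLT check on a skewed source distribution
print(clt_convergence_check(lambda n: rng.exponential(size=n)))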
Distribution Interview Quick-Reference
Q: 'What distribution models the number of user sessions per day?' → Poisson (if low variance) or Negative Binomial (if overdispersed, which is typical for real user behavior).
Q: 'What distribution models revenue per user?' → Log-Normal. The median should be much lower than the mean (positive skew). Verify: log-transform → check for Gaussian.
Q: 'What distribution would you use as a prior for A/B test conversion rates?' → Beta distribution. Beta(1,1) = uniform prior. Beta(α, β) where α + β = effective prior sample size.
Q: 'You're building a Thompson Sampling bandit. What maintains the posterior over each arm's CTR?' → Beta distribution (conjugate to Bernoulli/Binomial). Update: Beta(α+clicks, β+non-clicks).
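A minimal Beta-Bernoulli Thompson Sampling loop illustrating that update (a sketch; the arm CTRs and trial count are made up):

import numpy as np

rng = np.random.default_rng(6)
true_ctr = [0.04, 0.05, 0.07]          # hidden arm CTRs (made up)
alpha = np.ones(3)                     # Beta(1,1) uniform priors per arm
beta = np.ones(3)

for _ in range(10_000):
    arm = int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, pick max
    click = rng.random() < true_ctr[arm]         # simulate user feedback
    alpha[arm] += click                          # Beta(α+clicks, β+non-clicks)
    beta[arm] += 1 - click

print(alpha / (alpha + beta))  # posterior means ≈ true CTRs; best arm dominates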
Q: 'What is the Central Limit Theorem?' → For i.i.d. random variables with finite mean μ and variance σ², √n(X̄ − μ) converges in distribution to N(0, σ²) as n→∞; equivalently, the sample mean is approximately N(μ, σ²/n) for large n. Requires finite variance (fails for Pareto distributions with α ≤ 2).
Q: 'What is KL divergence?' → KL(P||Q) = E_P[log(P/Q)]. Measures information lost when approximating P with Q. Not symmetric. KL(P||Q) = 0 iff P=Q. Used in: VAE loss, model monitoring for distributional shift.