Bayesian Inference: Priors, Posteriors, MCMC, and Variational Inference
The Bayesian reasoning framework that underpins Thompson Sampling, Bayesian A/B testing, uncertainty-aware ML, and Bayesian optimization. Covers Bayes' theorem from first principles, conjugate priors, MCMC (Metropolis-Hastings, NUTS), variational inference (ELBO), and when Bayesian methods outperform frequentist approaches in production ML systems.
Bayesian Thinking: The Fundamental Shift from Frequentist Statistics
Frequentist statistics treats model parameters as fixed, unknown constants. The data is random — different experiments would give different data. Probability is a long-run frequency: if you repeat the experiment infinitely, 95% of confidence intervals contain the true parameter.
Bayesian statistics treats model parameters as random variables with probability distributions. You express your prior beliefs about the parameter, observe data, and update your beliefs to produce a posterior distribution. Probability is a degree of belief — subjective but coherent.
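To make the shift concrete, here is a minimal sketch in Python, assuming a coin-flip model with a Beta(1, 1) prior (the counts and the prior choice are illustrative, not from this guide): the frequentist MLE returns a single number, while the Bayesian posterior is a full distribution over θ.

```python
import numpy as np
from scipy import stats

# Illustrative data: 7 heads in 10 flips.
heads, flips = 7, 10

# Frequentist view: theta is a fixed constant; the MLE is one number.
theta_mle = heads / flips  # 0.7

# Bayesian view: theta is a random variable. With a Beta(1, 1) (uniform)
# prior, the Bernoulli likelihood is conjugate, so the posterior is
# Beta(1 + heads, 1 + tails) in closed form.
posterior = stats.beta(1 + heads, 1 + (flips - heads))

print(f"MLE point estimate:    {theta_mle:.3f}")
print(f"Posterior mean:        {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior has a closed form here, a preview of the conjugate-priors material this guide covers.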
Why this matters in ML:
- Uncertainty quantification: Bayesian models return distributions over predictions, not just point estimates. Critical for medical diagnosis, autonomous driving, financial risk — where knowing 'I'm 80% confident' vs 'I'm 50% confident' changes the decision.
- Small data regimes: When you have 50 samples, the prior contains useful information. Frequentist MLE may overfit; Bayesian posterior regularizes naturally via the prior.
- Online learning: Bayesian models update incrementally — the posterior becomes the prior for the next batch. Thompson Sampling (used in production recommendation systems) is pure Bayesian inference on reward distributions; see the sketch after this list.
- Bayesian optimization: Used by Google Vizier, Optuna, and most hyperparameter-tuning services to search expensive hyperparameter spaces with far fewer evaluations than grid or random search.
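To illustrate the online-learning point, here is a minimal Thompson Sampling sketch, assuming a hypothetical two-arm Bernoulli bandit (the arm count, click-through rates, and horizon are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-arm bandit with unknown click-through rates.
true_rates = np.array([0.04, 0.06])

# Beta(1, 1) prior on each arm's reward probability.
alpha = np.ones(2)
beta = np.ones(2)

for _ in range(10_000):
    # Sample one plausible rate per arm from its current posterior,
    # then play the arm whose sample is highest.
    arm = int(np.argmax(rng.beta(alpha, beta)))
    reward = float(rng.random() < true_rates[arm])

    # Conjugate Beta-Bernoulli update: this round's posterior is
    # next round's prior.
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))
print("pulls per arm:  ", alpha + beta - 2.0)
```

Note the loop structure: each update immediately reshapes the next round's sampling, which is exactly the "posterior becomes the prior" pattern described above.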
Bayes' Theorem:
P(θ | data) = P(data | θ) × P(θ) / P(data)
- P(θ) — prior: your belief about θ before seeing data
- P(data | θ) — likelihood: how probable is this data given parameter θ
- P(data) — evidence (marginal likelihood): normalizing constant, often intractable
- P(θ | data) — posterior: updated belief after observing data
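A standard worked example shows the pieces in action; the 1% prevalence, 95% sensitivity, and 5% false-positive rate below are illustrative numbers, not from this guide:

```python
# Worked Bayes' theorem example with illustrative numbers:
# theta = "patient has the disease", data = "test came back positive".
prior = 0.01            # P(theta): 1% prevalence
likelihood = 0.95       # P(data | theta): sensitivity
false_positive = 0.05   # P(data | not theta): false-positive rate

# Evidence P(data): sum over both values of theta — the discrete
# analogue of the integral in the denominator.
evidence = likelihood * prior + false_positive * (1 - prior)

posterior = likelihood * prior / evidence
print(f"P(disease | positive test) = {posterior:.3f}")  # ~0.161
```

Even with a 95%-sensitive test, the posterior is only about 16%, because the 1% prior prevalence dominates. This is the prior doing real work.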
The denominator P(data) = ∫ P(data | θ) P(θ) dθ is an integral over all possible values of θ. For most real models it is analytically intractable, which is why MCMC and variational inference exist.
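To see what that integral does, here is a grid-approximation sketch for the 1-D coin model used earlier (an illustrative toy, not a production technique): discretize θ, compute prior × likelihood pointwise, and sum to approximate the evidence.

```python
import numpy as np
from scipy import stats

# Same illustrative coin model as above: 7 heads in 10 flips, Beta(1, 1) prior.
heads, flips = 7, 10

# Discretize theta and evaluate prior x likelihood pointwise.
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]
prior = stats.beta(1, 1).pdf(theta)                # uniform prior density
likelihood = stats.binom.pmf(heads, flips, theta)  # P(data | theta)
unnormalized = likelihood * prior

# Evidence P(data) = ∫ P(data | θ) P(θ) dθ, approximated by a Riemann sum.
evidence = float(np.sum(unnormalized) * dtheta)
posterior = unnormalized / evidence  # now integrates to 1

print(f"evidence P(data) ≈ {evidence:.4f}")                               # ≈ 1/11
print(f"posterior mean   ≈ {float(np.sum(theta * posterior) * dtheta):.4f}")  # ≈ 8/12
```

A 1,000-point grid is trivial in one dimension, but grid size grows exponentially with the number of parameters; that gap is exactly what MCMC and variational inference fill.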