Reasoning about uncertainty with priors, likelihoods and posteriors
0.5.dev0+git.20260626.e137512 - June 26, 2026 18:41 UTC
Bayesian Data Analysis#
Bayesian analysis treats unknown quantities as probability distributions and updates them with data. Instead of a single “best” estimate, you get a full posterior — a principled account of what the data do and do not tell you. This hub builds from first principles up to the nonparametric models (mixtures, density estimation, Dirichlet processes) that the source corpus emphasises.
Three reading levels run through the page:
newcomers — the intuition of prior → likelihood → posterior;
practitioners — how to actually compute and check posteriors;
researchers — hierarchical and nonparametric (infinite-mixture) models.
Note
Open a dropdown for detail; follow See also links to related ideas. Code snippets use real scipy.stats / PyMC / ArviZ / scikit-learn calls. This page pairs with the Terminology reference (probability and distributions) and the Time Series hub (where Bayesian estimation also appears).
Discovery at a Glance#
Part 1, The Bayesian Idea: the one equation everything rests on.
Bayes’ Theorem: posterior ∝ likelihood × prior, or how belief is updated by evidence.
Prior, Likelihood & Posterior: the three ingredients, what each encodes, and where they come from.
Credible Intervals: a 95 % interval you can read as “95 % probability”, unlike a confidence interval.
Part 2, Computing Posteriors: from conjugate shortcuts to general-purpose sampling.
Conjugacy: when prior and posterior share a family, the update is exact and closed-form.
MCMC Sampling: drawing from any posterior when no formula exists, the workhorse of modern Bayes.
Posterior Predictive Checks: simulating new data to check the model and forecast.
Part 3, Hierarchies, Mixtures & Nonparametrics: sharing strength across groups; letting complexity grow with data.
Hierarchical Models: partial pooling, where groups borrow strength from each other.
Mixture Models & Label Switching: sub-populations, label switching, and choosing the number of components.
Dirichlet Processes: nonparametric priors that let the number of clusters grow with the data.
Part 1 — The Bayesian Idea#
Bayes’ Theorem#
What is it?
Bayes’ theorem inverts conditional probability to turn a model of “how data arise given parameters” into “what parameters are plausible given data”:
\[p(\theta\mid y) = \frac{p(y\mid\theta)\,p(\theta)}{p(y)} \;\propto\; p(y\mid\theta)\,p(\theta)\]
The denominator \(p(y)\) (the evidence) is a normalising constant; for inference about \(\theta\) the proportionality on the right is what matters.
When to use it — whenever you want to combine prior knowledge with observed data and quantify the remaining uncertainty as a distribution.
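A grid approximation makes the proportionality concrete. The sketch below is illustrative only: the data (8 successes in 10 trials), the flat prior and the grid resolution are assumptions chosen for the example, and the integral for the evidence is approximated by a simple sum.

```python
import numpy as np

# Grid of candidate values for the success probability theta
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]

# Assumed example data: k successes in n trials
k, n = 8, 10

prior = np.ones_like(theta)                      # flat prior p(theta)
likelihood = theta**k * (1.0 - theta)**(n - k)   # p(y | theta) up to a constant
unnormalised = likelihood * prior                # likelihood times prior

# Dividing by the numerically integrated evidence p(y) normalises the curve
evidence = unnormalised.sum() * dtheta
posterior = unnormalised / evidence

posterior_mean = (theta * posterior).sum() * dtheta
print(round(posterior_mean, 3))                  # close to 0.75 for this setup
```

The evidence only rescales the curve, so the shape of the posterior is already fixed by the product of likelihood and prior.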
Prior, Likelihood & Posterior#
Prior \(p(\theta)\) — belief about the parameter before seeing this data (from theory, past studies, or a deliberately weak “let the data speak” choice).
Likelihood \(p(y\mid\theta)\) — the data-generating model, read as a function of \(\theta\) for the observed \(y\).
Posterior \(p(\theta\mid y)\) — the updated belief; the output of the analysis and the input to every decision.
As data accumulate, the likelihood dominates and the posterior becomes insensitive to a reasonable prior.
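The claim that the likelihood eventually dominates can be checked numerically. This sketch assumes a Beta prior with a Binomial likelihood, for which the posterior mean is (alpha + k) / (alpha + beta + n). The two priors, the sample sizes and the value used to simulate data are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7                       # assumed value, used only to simulate data

# Two deliberately different Beta priors, given as (alpha, beta) pairs
priors = {"sceptical": (2.0, 8.0), "optimistic": (8.0, 2.0)}

gaps = []
for n in (10, 100, 10_000):
    k = rng.binomial(n, true_theta)    # simulated number of successes
    # Beta-Binomial posterior mean: (alpha + k) / (alpha + beta + n)
    means = {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}
    gap = abs(means["optimistic"] - means["sceptical"])
    gaps.append(gap)
    print(n, {name: round(m, 3) for name, m in means.items()}, round(gap, 4))
```

The gap between the two posterior means shrinks as n grows, which is the sense in which a reasonable prior stops mattering.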
Credible Intervals (and how they differ from CIs)#
What is it?
A 95 % credible interval is any region containing 95 % of the posterior probability mass. It supports the natural statement “there is a 95 % probability the parameter lies in this range” — which a frequentist confidence interval does not.
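import numpy as np
# stand-in posterior draws (assumed Beta(9, 3)); replace with your own samples
posterior_samples = np.random.default_rng(0).beta(9, 3, size=10_000)
# equal-tailed 95% credible interval from posterior samples
lo, hi = np.percentile(posterior_samples, [2.5, 97.5])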
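Because any region holding 95 % of the mass qualifies, the equal-tailed interval is only one choice. A common alternative is the highest-density interval (HDI), the shortest such region. The sketch below computes both from samples; the `hdi` helper is a hypothetical illustration, not a library function, and the Beta(9, 3) draws stand in for a real posterior.

```python
import numpy as np

def hdi(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (hypothetical helper)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    m = int(np.ceil(prob * n))            # number of draws the interval must cover
    widths = x[m - 1:] - x[: n - m + 1]   # width of every window of m consecutive draws
    i = int(np.argmin(widths))            # index where the narrowest window starts
    return x[i], x[i + m - 1]

rng = np.random.default_rng(0)
draws = rng.beta(9, 3, size=20_000)       # stand-in posterior draws (assumed Beta(9, 3))

eti = np.percentile(draws, [2.5, 97.5])   # equal-tailed interval
hdi_lo, hdi_hi = hdi(draws, 0.95)         # highest-density interval
print("equal-tailed:", eti.round(3))
print("HDI:         ", round(hdi_lo, 3), round(hdi_hi, 3))
```

For a skewed posterior such as this one the HDI is narrower and shifted toward the mode; for a symmetric posterior the two intervals nearly coincide.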
Part 2 — Computing Posteriors#
Conjugacy (the exact, closed-form case)#
What is it?
A prior is conjugate to a likelihood when the posterior stays in the same family. The classic example is Beta–Binomial: a \(\text{Beta}(\alpha, \beta)\) prior on a success probability, with \(k\) successes in \(n\) trials, yields the posterior
\[\theta \mid k, n \sim \text{Beta}(\alpha + k,\ \beta + n - k)\]
scipy
from scipy import stats
alpha, beta = 1, 1 # uniform prior
k, n = 8, 10
post = stats.beta(alpha + k, beta + n - k)
print(post.mean(), post.interval(0.95))
When to use it — quick, exact updates for simple models, and as building blocks inside larger samplers.
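One practical consequence of conjugacy is that updating in batches gives the same answer as updating once with all the data. The sketch below illustrates this; the split into two batches is an assumption made for the example, chosen so the totals match the 8 successes in 10 trials used above.

```python
from scipy import stats

a, b = 1.0, 1.0                 # Beta(1, 1) prior, as in the snippet above
batches = [(3, 4), (5, 6)]      # assumed (successes, trials) per batch; totals 8 of 10

# Sequential updating: the posterior after one batch is the prior for the next
for k, n in batches:
    a, b = a + k, b + (n - k)

sequential = stats.beta(a, b)
batch = stats.beta(1 + 8, 1 + 10 - 8)   # all ten trials at once

print(sequential.args, batch.args)
print(sequential.mean(), batch.mean())
```

Both routes end at the same Beta distribution, which is why conjugate updates are convenient for streaming data.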
See also
MCMC Sampling#
What is it?
When the posterior has no closed form, Markov chain Monte Carlo constructs a Markov chain whose stationary distribution is the posterior and uses its (correlated) draws as samples. Modern tools use Hamiltonian Monte Carlo / NUTS for efficient exploration.
PyMC + ArviZ
import pymc as pm
import arviz as az
with pm.Model() as model:
    theta = pm.Beta("theta", alpha=1, beta=1)        # uniform prior
    y = pm.Binomial("y", n=10, p=theta, observed=8)  # 8 successes in 10 trials
    idata = pm.sample(2000, tune=1000)               # NUTS by default
az.summary(idata) # means, sd, 94% HDI, r_hat
az.plot_trace(idata) # convergence diagnostics
Check before trusting: \(\hat{R}\) close to 1.0 (a common threshold is below 1.01), bulk and tail effective sample sizes in the hundreds or more, and no divergent transitions.
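To see what a sampler does under the hood, here is a minimal random-walk Metropolis sketch for the same Beta–Binomial problem. It is not NUTS and not what PyMC runs; it is the simplest MCMC algorithm, written with NumPy only. The step size, warm-up length and number of draws are assumed tuning choices.

```python
import numpy as np

def log_post(theta, k=8, n=10):
    """Unnormalised log posterior: Binomial likelihood with a flat Beta(1, 1) prior."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf                      # zero prior mass outside (0, 1)
    return k * np.log(theta) + (n - k) * np.log1p(-theta)

rng = np.random.default_rng(42)
n_warmup, n_draws, step = 1_000, 20_000, 0.2  # assumed tuning choices
theta, lp = 0.5, log_post(0.5)
draws = np.empty(n_draws)
accepted = 0

for i in range(n_warmup + n_draws):
    proposal = theta + step * rng.normal()  # symmetric random-walk proposal
    lp_prop = log_post(proposal)
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = proposal, lp_prop
        accepted += i >= n_warmup           # count acceptances after warm-up only
    if i >= n_warmup:
        draws[i - n_warmup] = theta         # keep draws only after warm-up

print("mean:", draws.mean().round(3), "accept rate:", accepted / n_draws)
```

The sample mean should land near 0.75, the mean of the exact Beta(9, 3) posterior. Successive draws are correlated, so the effective sample size is smaller than the number of draws.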
Posterior Predictive Checks#
What is it?
The posterior predictive distribution simulates new data by integrating the likelihood over the posterior:
\[p(\tilde{y}\mid y) = \int p(\tilde{y}\mid\theta)\,p(\theta\mid y)\,d\theta\]
Comparing simulated datasets to the real one is the primary Bayesian model-checking tool: systematic mismatch signals a misspecified model.
with model:
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)
az.plot_ppc(idata)
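The integral over the posterior can be approximated in two simulation steps: draw a parameter from the posterior, then draw data given that parameter. The NumPy sketch below does this for the Beta–Binomial example, where the posterior is available exactly. The number of draws and the choice of tail probability as a summary are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k_obs, n_trials = 8, 10                      # data from the example above

# Step 1: draw parameters from the posterior (exact here: Beta(1 + 8, 1 + 2))
theta_draws = rng.beta(1 + k_obs, 1 + n_trials - k_obs, size=50_000)

# Step 2: simulate one replicated dataset per posterior draw
y_rep = rng.binomial(n_trials, theta_draws)

# Compare replications with the observation
p_value = np.mean(y_rep >= k_obs)            # posterior predictive tail probability
print("predictive mean:", y_rep.mean(), "P(y_rep >= 8):", p_value)
```

A tail probability very close to 0 or 1 would suggest the model rarely reproduces data like the observation. Here the observation sits comfortably inside the replicated distribution, as expected when the model is fitted to a single count.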
See also
Part 3 — Hierarchies, Mixtures & Nonparametrics#
Hierarchical Models (Partial Pooling)#
What is it?
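When data come in groups (schools, patients, sites), a hierarchical model gives each group its own parameter while tying those parameters to a shared population distribution:
\[y_{ij} \mid \theta_j \sim p(y \mid \theta_j), \qquad \theta_j \mid \phi \sim p(\theta \mid \phi), \qquad \phi \sim p(\phi)\]
Here \(j\) indexes groups, \(i\) indexes observations within a group, and \(\phi\) collects the population-level hyperparameters.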
This partial pooling shrinks noisy small-group estimates toward the overall mean — between “one estimate for everyone” (complete pooling) and “every group alone” (no pooling). The source’s hierarchical dependence posts develop exactly this structure.
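The shrinkage can be written down exactly in the simplest case. The sketch below assumes a normal-normal model in which the population mean and spread are treated as known; a full hierarchical analysis would infer them too. All numbers are hypothetical and chosen only to show the effect.

```python
import numpy as np

# Hypothetical group summaries: observed means and their standard errors
y_bar = np.array([28.0, 8.0, -3.0, 7.0])
se = np.array([15.0, 10.0, 16.0, 5.0])

# Assumed population-level values; a full analysis would infer these as well
mu, tau = 8.0, 5.0

# Conditional posterior mean of each group effect under a normal-normal model:
# a precision-weighted average of the group's own mean and the population mean
w = (1 / se**2) / (1 / se**2 + 1 / tau**2)   # weight on the group's own data
theta_hat = w * y_bar + (1 - w) * mu

for yj, sj, wj, tj in zip(y_bar, se, w, theta_hat):
    print(f"y_bar={yj:6.1f}  se={sj:5.1f}  weight={wj:.2f}  partial-pooled={tj:6.2f}")
```

Groups with large standard errors get little weight on their own data and are pulled strongly toward the population mean, while precisely measured groups barely move.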
See also
Mixture Models & Label Switching#
What is it?
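A finite mixture models a population as a weighted blend of sub-populations:
\[p(y) = \sum_{k=1}^{K} \pi_k\, f(y \mid \theta_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1\]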
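Label switching is the identifiability problem that relabelling (permuting) the components leaves the likelihood unchanged. With an exchangeable prior the posterior therefore has up to \(K!\) symmetric modes, and per-component summaries are only meaningful once the labels are constrained or realigned. Choosing \(K\) is a model-selection problem: compare fits with AIC / BIC, or let the number of components be unbounded (see Dirichlet processes).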
scikit-learn + scikit-plots
from sklearn.mixture import GaussianMixture
import scikitplot as skplt
# X: array of shape (n_samples, n_features), assumed to be defined already
gmm = GaussianMixture(n_components=3).fit(X)
# scikit-plots: compare K via AIC / AICc / BIC
skplt.stats.plot_gaussian_mixture_models(X)
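Label switching is easy to verify directly: swapping the component labels leaves the mixture log-likelihood unchanged. The sketch below uses a one-dimensional two-component Gaussian mixture; `mixture_loglik` is a hypothetical helper, and the data and parameter values are simulated assumptions.

```python
import numpy as np

def mixture_loglik(y, weights, means, sds):
    """Log-likelihood of a 1-D Gaussian mixture (hypothetical helper)."""
    y = np.asarray(y)[:, None]                       # shape (n, 1) for broadcasting
    dens = np.exp(-0.5 * ((y - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return np.log((weights * dens).sum(axis=1)).sum()

rng = np.random.default_rng(0)
# Simulated data from two assumed sub-populations
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 50)])

weights = np.array([0.75, 0.25])
means = np.array([-2.0, 3.0])
sds = np.array([1.0, 0.5])

perm = [1, 0]                                        # swap the component labels
ll_original = mixture_loglik(y, weights, means, sds)
ll_swapped = mixture_loglik(y, weights[perm], means[perm], sds[perm])
print(ll_original, ll_swapped)
```

Because both labellings fit equally well, averaging "component 1" across MCMC draws can mix the two sub-populations together unless an ordering constraint or relabelling step is applied.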
Dirichlet Processes (Nonparametric Bayes)#
What is it?
A Dirichlet process \(\text{DP}(\alpha, G_0)\) is a prior over distributions, the foundation of infinite mixture / density-estimation models where the number of clusters is not fixed in advance but grows with the data. The stick-breaking view builds the mixing weights as
\[v_k \sim \text{Beta}(1, \alpha), \qquad \pi_k = v_k \prod_{j<k} (1 - v_j), \qquad k = 1, 2, \dots\]
The concentration \(\alpha\) controls how readily new clusters appear. This underpins the source’s Dirichlet process mixtures, Bayesian histograms, and density estimation posts.
When to use it — clustering / density estimation where you cannot commit to a fixed number of components a priori.
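The stick-breaking construction is short enough to simulate. The sketch below draws truncated weights for a few concentration values; the truncation at 500 atoms, the specific values of alpha and the `stick_breaking` helper itself are assumptions for illustration, since a true Dirichlet process has infinitely many atoms.

```python
import numpy as np

def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking weights (hypothetical helper, truncation assumed)."""
    v = rng.beta(1.0, alpha, size=n_atoms)                         # fraction broken off each time
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])  # stick left before each break
    return v * remaining

rng = np.random.default_rng(0)
for alpha in (0.5, 5.0, 50.0):
    w = stick_breaking(alpha, n_atoms=500, rng=rng)
    # Number of components needed to cover 99% of the mass
    n_eff = int(np.searchsorted(np.cumsum(w), 0.99) + 1)
    print(f"alpha={alpha:5.1f}  mass captured={w.sum():.4f}  components for 99%={n_eff}")
```

Small alpha concentrates almost all mass on a handful of components, while large alpha spreads it over many, which is how the concentration parameter governs the appearance of new clusters.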
See also
Map to scikit-plots & the Bayesian Stack#
scikit-plots’ role here is diagnostic and model-selection visual support; the heavy lifting is done by the probabilistic-programming stack.
scikit-plots: choose the number of mixture components by information criteria.
scikit-plots: distributional / Q–Q checks on fitted models.
PyMC: probabilistic programming for building and sampling models.
ArviZ: diagnostics, summaries and plots for Bayesian inference.
Sources#
Verified during preparation of this page; resolvable at build date.
Source context (framing only, re-expressed in our own words)
Bayesian Data Analysis category (144 posts): https://insightful-data-lab.com/category/bayesian-data-analysis/
Official documentation (API calls used above)
SciPy (scipy.stats distributions): https://docs.scipy.org/doc/scipy/reference/stats.html
scikit-learn (Gaussian mixture models): https://scikit-learn.org/stable/modules/mixture.html
PyMC (probabilistic programming): https://www.pymc.io/
ArviZ (exploratory analysis of Bayesian models): https://python.arviz.org/
scikit-plots (this project)
Example gallery: https://scikit-plots.github.io/dev/auto_examples/index.html
Terminology reference: terminology-index
Standard reference
Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian Data Analysis (3rd ed.): http://www.stat.columbia.edu/~gelman/book/