Building A Distributed Sampler From Scratch: Parallel Mcmc With Consensus And Mode Seeking

2021-09-24 · Leonardo Benicio
Summary

A comprehensive technical exploration of building a distributed sampler from scratch: parallel mcmc with consensus and mode seeking, covering key concepts, practical implementations, and real-world applications.

Contents

The God Problem: Why Your MCMC Chain is Crying in the Corner

Introduction: The 3 AM Revelation

There is a moment in the life of every statistician or machine learning engineer—usually around 3 AM, fueled by a cold brew and a dataset that refuses to cooperate—when they realize their Markov Chain Monte Carlo (MCMC) sampler is failing spectacularly. The trace plots are a horror show: chains that refuse to mix like oil and water, autocorrelations that laugh in the face of thinning, and a target posterior so multimodal it resembles a field of razor-sharp needles rather than a smooth, friendly landscape. You have increased the number of iterations to a million. You have tuned the step size until your fingers bleed. You have even whispered prayers to Thomas Bayes. And still, the sampler is trapped in a single mode, ignoring the vast, unexplored geography of the probability space.

This is the “God Problem” that torments anyone who dares to sample from complex, high-dimensional distributions. The name is borrowed from a thought experiment: if God were to look down upon the posterior landscape, They would see every mode, every ridge, every valley. But your MCMC chain is a lazy tourist, content to sunbathe on the same beach forever, never discovering the hidden coves and majestic mountains just over the horizon. The problem is not merely computational cost; it is a fundamental failure of exploration.

Why should you care? Because Bayesian inference is a powerful engine for uncertainty quantification, model averaging, and decision-making under uncertainty. Its workhorse—MCMC—is supposed to provide a representative sample from the posterior distribution. But the workhorse is fundamentally serial and path-dependent. A single chain, no matter how well-tuned, is a biased tourist guide. It walks a single path through the probability forest, and if that path starts in a local maximum, it may never discover the global palace of the posterior. For decades, the standard solution was to throw more compute at the problem: longer chains, more chains, better algorithms like Hamiltonian Monte Carlo (HMC). But these are additive improvements, not structural revolutions. With data scaling to millions of observations and models growing to billions of parameters, the limitations of traditional single-chain MCMC become not just annoying but crippling.

If you are a data scientist training Bayesian neural networks, a climatologist calibrating a complex simulation model, or a biologist inferring phylogenetic trees, you have faced this God Problem. You have stared at a trace plot that looks like a flat line punctuated by rare, desperate jumps. You have computed the effective sample size and found it to be a laughable fraction of your total iterations. You have run multiple chains from different starting points, only to have them all converge to the same local mode, giving you a false sense of confidence. The God Problem is real, and it is the silent killer of Bayesian workflows.

In this post, we will dissect the anatomy of this problem. We’ll start with a refresher on MCMC fundamentals, then dive deep into why multimodal and high-dimensional posteriors are so treacherous. We’ll explore traditional fixes and why they often fail, then introduce more advanced techniques like parallel tempering and replica exchange that offer a way out. Along the way, we’ll use concrete examples, code snippets, and mathematical intuition to illuminate the dark corners of MCMC. By the end, you’ll understand why your chain is crying—and how to dry its tears.

Background: The MCMC Toolbox

Before we diagnose the disease, let’s review the healthy functioning of MCMC. MCMC is a class of algorithms for sampling from probability distributions that are known only up to a normalizing constant. In Bayesian inference, we often have a posterior distribution:

[ p(\theta \mid D) \propto p(D \mid \theta) , p(\theta) ]

where (\theta) is a vector of parameters, (D) is the data, (p(D \mid \theta)) is the likelihood, and (p(\theta)) is the prior. The normalizing constant (the marginal likelihood) is typically intractable, so we rely on sampling methods that only require evaluating the unnormalized posterior.

The core idea of MCMC is to construct a Markov chain whose stationary distribution is the target posterior. By simulating the chain for many steps, we collect samples that approximate draws from the target. The most famous algorithm is Metropolis-Hastings (MH):

  1. Start at current state (\theta^{(t)}).
  2. Propose a new state (\theta^) from a proposal distribution (q(\theta^ \mid \theta^{(t)})).
  3. Compute the acceptance probability: [ \alpha = \min\left(1, \frac{p(\theta^_ \mid D) , q(\theta^{(t)} \mid \theta^_)}{p(\theta^{(t)} \mid D) , q(\theta^* \mid \theta^{(t)})}\right) ]
  4. Accept (\theta^*) with probability (\alpha), else stay at (\theta^{(t)}).

This simple algorithm guarantees that the stationary distribution is correct, provided the chain is ergodic (i.e., it can reach any state from any other state in finite steps). However, ergodicity is a theoretical property; in practice, the chain may mix extremely slowly if the posterior has multiple well-separated modes.

Gibbs sampling is a popular variant that updates one coordinate at a time using the full conditional distributions. It is efficient when the conditionals are easy to sample from, but it suffers from the same exploration problem in correlated or multimodal spaces.

Hamiltonian Monte Carlo (HMC) uses gradient information to propose distant states with high acceptance probability. It augments the parameter space with momentum variables and simulates Hamiltonian dynamics. HMC is far more efficient than random-walk MH in high-dimensional problems because it tends to move in directions of high probability mass. However, even HMC can get trapped in a single mode if the energy barriers between modes are too high. The leapfrog integrator used in HMC conserves energy approximately, and if the chain starts in a deep well, it may never have enough energy to climb out.

No-U-Turn Sampler (NUTS) is an adaptive variant of HMC that automatically tunes the step length. It is the default in probabilistic programming languages like Stan. NUTS avoids the need for manual tuning, but it does not fundamentally solve the multimodality problem.

Let’s illustrate with a simple yet pathological example. Consider a 1D mixture of two Gaussian distributions:

[ p(\theta) = 0.5 \cdot \mathcal{N}(\theta; -10, 0.5) + 0.5 \cdot \mathcal{N}(\theta; 10, 0.5) ]

These two modes are far apart relative to their widths. A random-walk Metropolis with a proposal standard deviation of, say, 2 will rarely propose a jump from one mode to the other because the probability of proposing a value near the other mode is tiny. Even if the proposal is accepted, the chain must pass through the low-probability region between modes, which has near-zero density. The result: the chain stays in the mode where it started.

Here’s a quick simulation in Python using PyMC:

import pymc as pm
import numpy as np
import arviz as az

# Define a bimodal likelihood (not really needed, just show sampling)
with pm.Model() as model:
    # Prior that is bimodal manually? Let's use a custom distribution
    # Actually, let's use a simple mixture via Potential
    mu = pm.Flat('mu')
    # Log-probability of mixture
    def logp_mixture(mu):
        logp1 = pm.logp(pm.Normal.dist(mu=-10, sigma=0.5), mu)
        logp2 = pm.logp(pm.Normal.dist(mu=10, sigma=0.5), mu)
        return pm.math.log(0.5 * pm.math.exp(logp1) + 0.5 * pm.math.exp(logp2))
    pm.Potential('mixture', logp_mixture(mu))
    trace = pm.sample(draws=5000, tune=1000, chains=1, step=pm.Metropolis())

If you run this, you’ll see the trace plot stuck at either -10 or 10. The effective sample size will be tiny because the chain never moves between modes. This is the God Problem in miniature.

The God Problem in Detail

What Makes a Posterior “God-like”?

The God Problem is not just about multimodality in one dimension. It becomes exponentially worse as dimensionality increases. In high-dimensional spaces, the posterior density is concentrated in thin manifolds—a phenomenon known as the “curse of dimensionality.” For a multimodal posterior with many parameters, the modes become isolated peaks separated by vast low-probability voids. The volume of the region where the density is appreciable shrinks dramatically.

Consider a Bayesian neural network with thousands of weights. The posterior over weights is typically highly multimodal: many different weight configurations yield similar predictive performance. These modes correspond to different local minima of the loss landscape. Traditional MCMC samplers, even with gradient information, can get stuck in a single mode because the energy barriers are enormous. The chain may drift slowly within a basin but never jump to another basin.

The term “God Problem” also hints at an epistemic limitation: you, as the sampler, do not know the full structure of the posterior. You are blindfolded, exploring a landscape you cannot see. A single chain gives you a single perspective. If God could see the entire landscape, they would know that there are multiple modes, but you don’t. You only see the trace of your chain, and if that trace is stationary, you might mistakenly think the chain has converged—when in fact it is trapped.

Diagnostics That Cry

How do you detect the God Problem? The usual diagnostic tools can be misleading:

  • Trace plots: If the chain is stuck, the trace plot will look like a noisy horizontal line. That might appear fine—it’s “mixing” locally. But the horizontal line is the wrong altitude.
  • Gelman-Rubin statistic ((\hat{R})): This compares within-chain variance to between-chain variance. If you run multiple chains from different starting points, a stuck chain will have low within-chain variance and high between-chain variance if chains end up in different modes. The (\hat{R}) will be >1.1, signaling non-convergence. However, if all chains start in the same mode (e.g., from the same initial guess), they may all converge to the same local mode, giving a deceivingly low (\hat{R}). The God Problem can be hidden if you don’t initialize chains in diverse regions.
  • Effective sample size (ESS): When the chain is stuck, the autocorrelation is high, and ESS is low. But even a local random walk within a single mode can have decent ESS if the mode is wide. ESS is a measure of independent information within the visited region, not of exploration across modes.
  • Energy diagnostics: In HMC, the energy (Hamiltonian) should be constant on average. If the chain is trapped, the energy trace may still look healthy because it is oscillating within the local well.

The most telling sign is a trace that never jumps between distinct values. For example, in a mixture model, the chain might never switch the label assignments of components. This is often called “label switching” in Bayesian mixture models—the posterior is symmetric under permutation of components, but the sampler may stay in one permutation. A proper sampler should visit all permutations equally.

A Concrete Example: Gaussian Mixture Model

Let’s consider a more realistic scenario: a 2-component Gaussian mixture model with unknown means and variances. We have 100 data points drawn equally from two well-separated clusters. The posterior over the component means can be symmetric: averaging over the two possible assignments of data to components. A standard MCMC sampler using Gibbs or Metropolis often gets stuck in one labeling because switching requires a low-probability intermediate step.

Here’s a PyMC implementation:

import pymc as pm
import numpy as np

# Generate data from two clusters
np.random.seed(42)
data = np.concatenate([np.random.normal(-3, 0.5, 50),
                       np.random.normal(3, 0.5, 50)])

with pm.Model() as mixture:
    # Priors for means
    mu1 = pm.Normal('mu1', mu=0, sigma=10)
    mu2 = pm.Normal('mu2', mu=0, sigma=10)
    # Priors for standard deviations
    sigma1 = pm.HalfNormal('sigma1', sigma=2)
    sigma2 = pm.HalfNormal('sigma2', sigma=2)
    # Mixing weight
    w = pm.Dirichlet('w', a=np.array([1, 1]))
    # Likelihood as a mixture
    y = pm.Mixture('y', w=w, comp_dists=[
        pm.Normal.dist(mu=mu1, sigma=sigma1),
        pm.Normal.dist(mu=mu2, sigma=sigma2)
    ], observed=data)
    trace = pm.sample(1000, tune=1000, chains=2, cores=1)

If you run this, the two chains may both assign the same labeling (e.g., mu1 ~ -3, mu2 ~ 3) and never switch. The trace plots for mu1 and mu2 will be stable, but if you look at the difference mu2 - mu1, it will be positive and never negative. The symmetry is broken. This is a classic instance of the God Problem: the posterior is multimodal (two symmetric labelings), but the sampler only reveals one.

Why Traditional Fixes Fail

When faced with a stuck chain, the typical instinct is to tweak the sampler. Let’s examine common “fixes” and why they don’t solve the core problem.

1. Run More Iterations

“If I just run the chain for a million iterations, it will eventually jump to the other mode.” This is an understandable hope, but it ignores the exponential scaling of tunneling times. For a simple 1D two-Gaussian mixture with modes at -10 and 10 and standard deviation 0.5, the probability of a random-walk Metropolis proposal landing near the other mode is roughly the tail probability of a Gaussian centered at current mode with proposal scale. If the proposal scale is 1, then the probability of proposing a value greater than 5 (given current is -10) is about 0 (numerically zero). So the waiting time is essentially infinite. Even with a larger proposal scale, the acceptance probability of a point in the low-density region is minute. The chain would need to “tunnel” through the valley, which requires many improbable steps—exponentially many in the distance between modes.

HMC can fare better because it can move in a straight line with momentum, but it still must cross the region of low probability. The leapfrog integrator will try to follow a trajectory that explores the local geometry; if the potential energy barrier is high, the Hamiltonian will not be conserved, and the trajectory will likely be rejected. In practice, HMC may jump between modes if the barriers are low or if there is a path that bypasses the low-density region (e.g., through a high-dimensional ridge). But for symmetric, isolated modes, HMC will also get stuck.

2. Increase Number of Chains

Running more chains from different starting points can help detect non-convergence via (\hat{R}), but it does not guarantee that any chain will explore both modes. If you start one chain near each mode, each chain will likely stay in its respective mode. The combined sample will be a mixture of samples from both modes—but the weighting may be off because the chains do not communicate. The stationary distribution of each individual chain is the correct posterior, but only locally. If you simply pool all chains, you will get a sample that approximates the posterior only if the chains have mixed globally. But they haven’t. The sampled points will cluster around the initial modes, and the relative weights of the two modes will be determined by the number of chains that started in each mode, not by the true posterior probabilities. This introduces bias.

In the mixture model example, if you run two chains starting from different initial means (e.g., mu1= -5, mu2=5 for chain1, and mu1=5, mu2=-5 for chain2), each chain will stay in its labeling. Pooling them gives a sample that has both labelings, but the assignment of “mu1” to the left mode versus right mode is inconsistent. You have to post-process to align labels, which is nontrivial.

3. Thin the Chain

Thinning—keeping every (k)-th sample to reduce autocorrelation—is often recommended, but it does nothing to fix the exploration problem. If the chain never moves between modes, thinning will just give you fewer samples from the same local region. It’s like taking a photo every hour of a house that never moves: you get many photos of the same house, not of the whole neighborhood.

4. Tune the Step Size

In Metropolis-Hastings, a step size that is too small leads to high acceptance but slow exploration. Too large leads to low acceptance and also slow exploration (because many proposals are rejected). The optimal acceptance rate for random-walk MH is about 23.4% in high dimensions. But even optimal tuning will not help the chain cross large energy barriers. The barrier crossing is not a function of step size; it’s a function of the distance to the other mode relative to the typical scale of the posterior. The chain will need to “randomly walk” through a low-density region, which is akin to a gambler’s ruin process with a drift towards the mode.

In HMC, tuning the step size and number of leapfrog steps can optimize the acceptance rate, but the inability to cross barriers persists. The difference is that HMC can take large steps that go far away in parameter space, but if the energy barrier is high, the momentum is not sufficient to climb.

5. Use Simulated Annealing

Simulated annealing (SA) gradually decreases a temperature parameter to “cool” the system into a global minimum. In MCMC, one can anneal the posterior by raising it to a power: (p(\theta|D)^\beta), where (\beta) is the inverse temperature. At high temperature ((\beta) small), the distribution flattens, allowing the chain to explore broadly. Then you gradually increase (\beta) (cool down), hoping the chain settles into the global mode. However, SA is not a sampling algorithm for the target distribution; it’s an optimization heuristic. The samples during the annealing schedule are not from the true posterior. One can use the final cooled state as a starting point for a standard MCMC run, but that only samples the mode it landed in. Moreover, SA can still get stuck if the cooling schedule is too fast, or if the landscape has many deep, narrow modes.

6. Use Variational Inference (VI)

VI turns sampling into optimization: approximate the posterior with a simpler distribution (e.g., a Gaussian or mixture of Gaussians). VI is much faster than MCMC and scales to big data. However, it underestimates uncertainty and can miss modes. The standard mean-field VI assumes independence across parameters, which is often wrong. Even when using a mixture of Gaussians as the variational family, the optimization can still settle into a single mode (a “local optimum”) because the ELBO is non-convex. VI does not solve the God Problem; it just hides it behind a cost function that may be easier to optimize.

The Structural Issue: Serial Path Dependence

All the above fixes fail because they do not address the fundamental structural issue: MCMC constructs a single trajectory in parameter space. The chain’s next state depends on the current state, and if the chain starts in a region that is isolated by low-probability valleys, it can never escape. The chain is path-dependent, and its long-run behavior is a random walk within a connected component of the posterior support. Multimodal distributions have disconnected high-probability regions (or weakly connected through thin bridges). The chain cannot jump between components unless the proposal distribution has support across the valley, which is rare.

The mathematical condition for the chain to be able to move between modes is that the proposal distribution must allow transitions that go through low-density regions with positive probability. But if the posterior is essentially zero there, the acceptance probability is also nearly zero. The expected hitting time to the other mode grows exponentially with the distance between modes squared (in units of the posterior standard deviation). This is a classic result from the theory of Markov chains on energy landscapes.

So what can we do? We need to change the underlying Markov chain dynamics, not just tune parameters. We need a way for the chain to “see” the global structure—or at least to be occasional jolted out of its local slumber. This is where advanced methods like parallel tempering come in.

Solutions: Parallel Tempering and Replica Exchange

The Core Idea

Parallel tempering (also known as replica exchange or Metropolis-coupled MCMC) runs multiple chains at different temperatures simultaneously. The key innovation is that chains at higher temperature have a flatter distribution and can more easily traverse low-probability regions. Periodically, we attempt to swap the states between a hot and a cold chain. If the swap is accepted (according to a Metropolis-like criterion), the cold chain can inherit a state from the hot chain, effectively jumping to a different mode.

The temperature parameter (\beta = 1/T) (inverse temperature) modifies the posterior to (p(\theta|D)^{\beta}). For (\beta = 1), we have the original target distribution. For (\beta < 1), the distribution is flattened; (\beta = 0) gives a uniform distribution (if the prior is proper). In practice, we use a ladder of temperatures: (\beta_1 = 1) (cold chain), (\beta_2, \beta_3, …, \beta_K) where (\beta_K) is close to 0 (hot chain). The number of chains and the spacing of temperatures is chosen so that the acceptance rate for swaps between adjacent chains is reasonable (e.g., 20-50%).

The algorithm works as follows:

  1. For each temperature (\beta_i), initialize a chain (\theta_i).
  2. For each iteration:
    • Perform a few MCMC steps (e.g., Metropolis or HMC) independently on each chain at its own temperature.
    • Attempt to swap states between two randomly chosen adjacent chains (e.g., (i) and (i+1)):
      • Compute the swap acceptance probability: [ \alpha = \min\left(1, \frac{p(\theta*{i+1}|D)^{\beta_i} , p(\theta_i|D)^{\beta*{i+1}}}{p(\thetai|D)^{\beta_i} , p(\theta{i+1}|D)^{\beta*{i+1}}}\right) ] This simplifies to (\min\left(1, \left(\frac{p(\theta*{i+1}|D)}{p(\thetai|D)}\right)^{\beta_i - \beta{i+1}}\right)).
      • With probability (\alpha), swap the states (\thetai) and (\theta{i+1}).

The cold chain (our target) mixes rapidly because it can occasionally receive states that have explored different modes via the hot chains. The hot chain explores globally, and the swaps propagate that diversity down the temperature ladder.

Why It Works

The key insight is that the hot chain is not subject to the same energy barriers. Since the density is raised to a fractional power, the valleys become less deep. The hot chain can comfortably traverse the whole space. The swaps are accepted with a probability that depends on the ratio of densities at the two temperatures. If the hot chain is in a high-probability region relative to the cold chain’s current state, the swap is likely accepted. This transfers the hot chain’s good state to the cold chain.

Parallel tempering respects detailed balance for the joint distribution over all replicas (the product of their tempered target distributions). Therefore, the marginal distribution of the cold chain is exactly the target posterior. We only need to collect samples from the cold chain; the other chains are auxiliary.

Implementation in PyMC

PyMC does not have built-in parallel tempering (though there are some external projects). However, we can implement a crude version using manual loops. Let’s illustrate with our 1D bimodal example:

import numpy as np
import pymc as pm

# Define the log-posterior (up to constant)
def logp(theta):
    # mixture of two Gaussians
    logp1 = -0.5 * ((theta + 10) / 0.5)**2 - np.log(0.5 * np.sqrt(2*np.pi))
    logp2 = -0.5 * ((theta - 10) / 0.5)**2 - np.log(0.5 * np.sqrt(2*np.pi))
    return np.log(0.5 * (np.exp(logp1) + np.exp(logp2)))

# Parallel tempering parameters
n_chains = 10
betas = np.linspace(0.1, 1.0, n_chains)  # inverse temperatures
theta = np.random.uniform(-15, 15, n_chains)  # initial states
n_iterations = 5000
swap_interval = 10  # try a swap every 10 MCMC steps
step_size = 1.0

# Store cold chain samples
cold_samples = []

for i in range(n_iterations):
    # MCMC step for each chain (simple Metropolis)
    for j in range(n_chains):
        proposal = theta[j] + np.random.normal(0, step_size)
        logp_current = betas[j] * logp(theta[j])
        logp_proposal = betas[j] * logp(proposal)
        if np.log(np.random.rand()) < logp_proposal - logp_current:
            theta[j] = proposal

    # Swap attempts between adjacent chains
    if i % swap_interval == 0:
        for j in range(n_chains - 1):
            logp_j = betas[j] * logp(theta[j])
            logp_j1 = betas[j+1] * logp(theta[j+1])
            # swap acceptance
            # alpha = min(1, exp((logp_j1 - logp_j)*(betas[j] - betas[j+1])))
            delta = (logp_j1 - logp_j) * (betas[j] - betas[j+1])
            if np.log(np.random.rand()) < delta:
                theta[j], theta[j+1] = theta[j+1], theta[j]

    # Record cold chain sample
    cold_samples.append(theta[0])  # coldest chain (beta=1)

After running this, the cold chain’s samples should include both modes. You can check the histogram. The key is that the hot chain (lowest beta) explores both modes easily, and swaps bring those modes to the cold chain.

Practical Considerations

Parallel tempering is not a silver bullet. It introduces overhead: you must simulate multiple chains, which increases computational cost by roughly the number of temperatures. However, you can often get away with a moderate number (e.g., 10-20) because the hot chains are cheap if the target evaluation is the bottleneck. Moreover, the cold chain samples are of higher quality, requiring fewer effective iterations.

Choosing the temperature ladder is critical. If the temperatures are too far apart, swap acceptance will be low; if too close, the hot chain may not explore enough. A common heuristic is to space temperatures geometrically so that the acceptance rate between adjacent chains is around 0.2-0.5. There are also adaptive schemes that adjust temperatures online.

Another nuance: in high dimensions, even hot chains can suffer from the curse of dimensionality. The distribution at high temperature may still be multimodal if the modes are separated by regions of near-zero density even when flattened. However, as the temperature increases, the distribution becomes more uniform, and eventually the modes merge. In the limit of infinite temperature, the distribution is flat, and the chain explores perfectly. So in principle, with enough temperature levels, parallel tempering can overcome any barrier. In practice, the number of replicas needed may scale exponentially with the barrier height. But for many real problems, it is feasible.

Beyond Parallel Tempering

Parallel tempering is not the only way to escape the God Problem. Here are a few other approaches:

  • Equi-Energy Sampling: A variant that uses energy bins to propose jumps between states with similar energy levels.
  • Mode-Jumping Proposals: Design proposals that directly attempt to move between modes, e.g., by using deterministic mappings or symmetry transformations. This is problem-specific.
  • Sequential Monte Carlo (SMC): Also known as particle filtering for static parameters. SMC evolves a population of particles through a sequence of tempered distributions. It naturally handles multimodality because particles can be resampled and moved. SMC is more robust than MCMC for multimodal problems, but requires careful tuning of the tempering schedule.
  • Stein Variational Gradient Descent (SVGD): A deterministic particle-based method that pushes a set of particles to approximate the posterior. It can capture multiple modes if initialized appropriately, but it may underrepresent uncertainty.
  • Ensemble MCMC (e.g., Affine Invariant MCMC): Uses multiple chains that interact via stretch moves. This can help when the modes are isotropic, but still struggles with isolated modes.

None of these methods are perfect. The God Problem remains a fundamental challenge in Bayesian computation. The best approach is to be aware of it, diagnose it diligently (using multiple diverse starting points and (\hat{R})), and apply methods like parallel tempering when necessary.

Conclusion: Accepting the Limitations of Single-Chain MCMC

The God Problem is a stark reminder that MCMC, for all its elegance, is a local exploration algorithm. It relies on local moves to build a global picture. When the posterior is rugged and high-dimensional, the local moves are insufficient. The chain cries because it knows it cannot see the whole truth.

As practitioners, we must adapt. We must not trust a single chain blindly. We must run multiple chains from overdispersed starting points, compute (\hat{R}) and ESS for each parameter, and look for signs of mode collapse. We must consider using parallel tempering or other ensemble methods when the posterior is suspected to be multimodal. We must also remember that for extremely large problems, Bayesian inference may be better served by variational approximations or by using optimization to find the mode and then using Laplace approximation. But those come with their own trade-offs.

The God Problem also has a philosophical side: it highlights the gap between the mathematical guarantee of MCMC convergence (which requires infinite time) and the practical reality of finite compute. It forces us to think about what we really want from a posterior sample. Do we need the exact global distribution, or is a local approximation sufficient? In many applications, the posterior is approximately unimodal (e.g., with large amounts of data), and single-chain MCMC works fine. But when the stakes are high, and uncertainty quantification is critical, we must respect the complexity of the landscape.

So next time you see your trace plot look like a flat line, don’t just blame the sampler. Blame the geometry of the posterior. And then reach for a multifocal lens: parallel tempering, multiple chains, and a good dose of skepticism. Your MCMC chain will stop crying—and start exploring.


This article is part of a series on practical Bayesian computation. In the next post, we will dive deeper into adaptive temperature ladders for parallel tempering and how to implement them efficiently in modern probabilistic programming frameworks.