MAP vs MMSE Estimation: Discrete Prior Examples

This notebook demonstrates the conceptual and practical differences between two fundamental Bayesian estimators:

The idea for this demonstration is taken from The Perception-Distortion Tradeoff by Blau and Michaeli.


PART 1 — Small 1D Demonstration

Problem Setup

We start with a simple discrete prior example to build intuition:

  • Discrete prior: X \in \{-1, 0, 1\} with probabilities p_X = [0.4, 0.2, 0.4]
  • Measurement model: Y = X + N where N \sim \mathcal{N}(0, 1)

Given an observation y, we compute:

  • Posterior: f(x_i \mid y) \propto f(y \mid x_i) f(x_i)
  • MAP estimator: Selects the single most probable dictionary element \hat{x}_{\text{MAP}}(y) = \arg\max_x f(x \mid y)
  • MMSE estimator: Computes the weighted average of all dictionary elements, weighted by their posterior probabilities \hat{x}_{\text{MMSE}}(y) = \mathcal{E}\{X \mid Y=y\} = \sum_i x_i f(x_i \mid y)

Likelihood and Posterior

The likelihood of observing y given candidate x_i is:

f(y \mid x_i) \propto \exp\left( -\frac{\|y - x_i\|^2}{2\sigma^2} \right)

Resulting Estimator

Summary: Why MAP and MMSE Differ

Key Observations:

  1. MAP estimator is piecewise constant: \hat{x}_{\text{MAP}}(y) takes discrete values \{-1, 0, 1\} and jumps between them as y crosses decision boundaries. This is because MAP selects the single most probable value from the discrete set.

  2. MMSE estimator is smooth: \hat{x}_{\text{MMSE}}(y) is a continuous, smooth function of y because it computes the weighted average \sum_i x_i p(x_i \mid y). As y changes, the posterior probabilities change smoothly, leading to a smooth estimate.

  3. Distribution differences:

    • The MAP estimate distribution is discrete (only takes values \{-1, 0, 1\}) and differs from the prior because MAP selects based on the posterior, which depends on the observation y.
    • The MMSE estimate distribution is continuous and centered around the prior mean, but with reduced variance (it’s a “shrunk” version of the prior).
  4. Why MMSE is smoother: MMSE integrates over uncertainty by averaging all candidates weighted by their posterior probability. MAP ignores this uncertainty and picks only the mode, leading to discrete jumps.

This simple 1D example illustrates the fundamental difference: MAP selects a single discrete candidate, while MMSE computes a weighted average, resulting in smooth, continuous estimates even when the prior is discrete.


PART 2 — MNIST Image Denoising Example

Problem Setup

We now apply the same concepts to a high-dimensional image denoising problem using MNIST images.

Likelihood and Posterior

The likelihood of observing y given candidate image x_i is:

f(y \mid x_i) \propto \exp\left( -\frac{\|y - x_i\|^2}{2\sigma^2} \right)

The posterior over the discrete set is:

p(x_i \mid y) \propto p(y \mid x_i) p(x_i)

where p(x_i) is the prior probability of image x_i (uniform in our main experiment).

1. Load MNIST Data

We’ll download MNIST and normalize pixel values to [0, 1]. Images will be flattened for computation but we’ll keep a reshape helper for visualization.

Training images shape: (60000, 784)
Test images shape: (10000, 784)
Pixel value range: [0.000, 1.000]

2. Select Test Image and Add Noise

We’ll pick a test image and add Gaussian white noise to create our noisy observation y.

Selected test image: digit 2
Noise standard deviation: 1.0

3. Build Discrete Prior Dictionary

We’ll use the first N_CANDIDATES training images as our dictionary \{x_1, \dots, x_N\}. This approximates a continuous prior by a finite set of candidate images. We assume a uniform prior over these images.

Dictionary size: 4000 images
Dictionary shape: (4000, 784)
Prior: Uniform

4. Compute Log-Likelihoods and Posterior Probabilities

Log-Likelihood Function

The log-likelihood is:

\log p(y \mid x_i) = -\frac{\|y - x_i\|^2}{2\sigma^2} + \text{constant}

We can ignore the constant term since it cancels out when normalizing the posterior.

Numerical Stability

For numerical stability, we work entirely in the log-domain and use the log-sum-exp trick to normalize probabilities:

\log p(x_i \mid y) = \log p(y \mid x_i) + \log p(x_i) - \log Z

where Z = \sum_j \exp(\log p(y \mid x_j) + \log p(x_j)) is the normalization constant.

Sum of posterior probabilities: 1.000000
Max posterior probability: 0.236598
Min posterior probability: 8.227841e-16
Number of non-negligible probabilities (>1e-6): 758

5. MAP Estimator

The MAP estimator selects the single most probable dictionary element:

\hat{x}_{\text{MAP}}(y) = \arg\max_i p(x_i \mid y)

This is equivalent to finding the index with maximum unnormalized log-posterior (since log is monotonic).

Key insight: MAP picks the single “best match” from the dictionary. If the true image is in the dictionary and noise is not too large, MAP will likely pick it or something visually very similar.

MAP estimate: dictionary index 1609
MAP posterior probability: 0.236598
MAP digit label: 2

6. MMSE Estimator

The MMSE estimator computes the posterior mean:

\hat{x}_{\text{MMSE}}(y) = \mathbb{E}[x \mid y] = \sum_i x_i p(x_i \mid y)

Key insight: MMSE averages over all candidate images weighted by their posterior probabilities. This often produces a smoother, more “blurred” result because it integrates over multiple plausible candidates, not just the single most probable one.

MMSE estimate computed as weighted average over 4000 images
Effective number of images (exp entropy): 13.30

7. Visualize Posterior Structure

Let’s examine the top K most probable dictionary images according to the posterior. This helps us understand:

  • MAP picks only the top-1 image (the single most probable)
  • MMSE integrates over the full posterior, including all of these top candidates

The MMSE estimate is a weighted average of these (and many other) images.

8. Direct Comparison: MAP vs MMSE

Let’s create a clean comparison figure showing both estimators side-by-side.

Qualitative Differences

MAP Estimator:

  • Picks the single most probable dictionary element
  • Often produces sharper results (since it’s a single clean image)
  • Can be “wrong” in a structured way if the true image isn’t in the dictionary
  • Computationally simpler (just find argmax)

MMSE Estimator:

  • Computes the weighted average over all dictionary elements
  • Often appears smoother/blurrier because it averages multiple plausible candidates
  • Optimal in the MSE sense (minimizes expected squared error)
  • Can produce “ghost” images that blend features from multiple digits

The key difference is that MAP selects a single discrete candidate, while MMSE integrates over the entire posterior distribution, producing a continuous-valued estimate that may not correspond to any single dictionary element.

9. MSE Comparison

Let’s compute the pixel-wise Mean Squared Error (MSE) for each estimator to quantitatively compare their performance.

Mean Squared Error (MSE) Comparison:
  Noisy observation y:     MSE = 0.264886
  MAP estimate:           MSE = 0.047396
  MMSE estimate:          MSE = 0.045642

Improvement over noisy:
  MAP:  82.1% reduction
  MMSE: 82.8% reduction

MMSE vs MAP:
  MMSE has lower MSE (as expected, by 3.7%)

Why MMSE Minimizes Expected MSE

The MMSE estimator \hat{x}_{\text{MMSE}} = \mathbb{E}[x \mid y] minimizes the expected squared error:

\mathbb{E}[\|x - \hat{x}\|^2 \mid y]

This is a fundamental result in Bayesian estimation theory. For any specific realization, MAP might occasionally achieve lower MSE, but on average (over many realizations), MMSE will have the lowest MSE.

In our discrete setting, MMSE computes the posterior mean, which is the optimal estimator under the squared error loss function.

10. Optional: True Image Not in Dictionary

Let’s explore what happens when the true image is not in the dictionary. We’ll create a modified test image (e.g., by masking part of it) and see how MAP and MMSE behave differently.

Modified test image (top half masked out)
This image is NOT in the dictionary


MSE for modified image:
  MAP:  0.055568
  MMSE: 0.047006

Discussion: When True Signal is Outside Prior Support

When the true image is not in the dictionary (outside the support of the prior):

  • MAP picks the single “closest” digit from the dictionary, which may be visually similar but structurally different
  • MMSE produces an average of multiple similar digits, which can create a “blended” or “ghost” image that doesn’t correspond to any single digit

This illustrates an important limitation: both estimators are constrained by the dictionary. If the true signal space is much richer than the dictionary, the estimates will be biased toward the dictionary elements.

In practice, this motivates:

  • Using larger, more diverse dictionaries
  • Using continuous priors (e.g., generative models) instead of discrete dictionaries
  • Understanding the trade-off between computational tractability and representational power