Overview of Estimators
This document provides a comprehensive overview of different classes of estimators, organized by their assumptions and computational requirements. We progress from the most general (requiring full probabilistic knowledge) to the most restrictive (requiring only moments or empirical data).
1. Problem Setup
We wish to estimate an unknown quantity X from observations Y, where Y is possibly a vector of observations. An estimator is a function g with
\hat{X} = g(Y)
The fundamental problem in estimation theory is: How do we design the estimator function g?
The answer depends on:
- What probabilistic information we have (full distributions, moments, or only data)
- What assumptions we’re willing to make about the form of g
- What criterion we use to measure estimation quality
2. Risk Minimization Framework
The quality of an estimator is measured by a loss function \ell(X, \hat{X}) that quantifies the cost of estimation error. Common loss functions include:
- Squared error: \ell(X, \hat{X}) = \|X - \hat{X}\|^2
- Absolute error: \ell(X, \hat{X}) = |X - \hat{X}|
- Hit-or-miss (0-1 loss): \ell(X, \hat{X}) = 1 - \delta(X - \hat{X})
The risk (or expected loss) of an estimator is:
\mathcal{R}(g) = \mathbb{E}[\ell(X, g(Y))] = \int \int \ell(x, g(y)) \, p_{X,Y}(x,y) \, dx \, dy
The optimal estimator minimizes this risk:
g^* = \arg\min_g \mathcal{R}(g)
Different classes of estimators arise from:
- Different assumptions about available probabilistic information
- Different constraints on the form of g
- Different loss functions \ell
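To make the risk concrete, it can be approximated by Monte Carlo: draw samples from the joint distribution and average the loss. A minimal sketch, assuming a toy Gaussian model and a placeholder estimator g (neither is taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint model (an assumption for illustration):
# X ~ N(0, 1), Y = X + noise with noise ~ N(0, 0.5^2)
x = rng.normal(0.0, 1.0, size=100_000)
y = x + rng.normal(0.0, 0.5, size=x.shape)

def g(y):
    """Candidate estimator: a simple (suboptimal) scaling of Y."""
    return 0.8 * y

# Monte Carlo approximation of the squared-error risk E[(X - g(Y))^2]
risk = np.mean((x - g(y)) ** 2)
print(f"estimated risk: {risk:.3f}")
```

For this particular model the exact risk of g(y) = 0.8y is 0.2^2 \cdot 1 + 0.8^2 \cdot 0.25 = 0.2, so the printed estimate should be close to that value.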
3. General Estimators (Full Probabilistic Knowledge)
When we have access to full probabilistic information, we can derive optimal estimators.
3.1 Minimum Mean Squared Error (MMSE)
Optimal solution: For squared error loss, the MMSE estimator minimizes the risk
\mathbb{E}[\|X - \hat{X}\|^2]
and the optimal estimator is
g^*(Y) = \hat{X}_{\text{MMSE}}(Y) = \mathbb{E}[X \mid Y] = \int x \, p_{X|Y}(x|Y) \, dx
Requires: posterior distribution p_{X|Y}
Properties:
- Mean of the posterior distribution
- Generally nonlinear (unless (X,Y) is jointly Gaussian)
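When the posterior is available, the conditional mean can be computed numerically. A sketch that integrates the posterior on a grid and checks it against the jointly Gaussian closed form (the model values below are assumed examples):

```python
import numpy as np

# MMSE estimate via numerical integration of the posterior, for the
# Gaussian model X ~ N(0, s_x^2), Y = X + N, N ~ N(0, s_n^2)
s_x, s_n, y_obs = 1.0, 0.5, 1.2

xs = np.linspace(-10.0, 10.0, 20_001)
dx = xs[1] - xs[0]
# Unnormalized posterior: prior * likelihood
post = np.exp(-xs**2 / (2 * s_x**2) - (y_obs - xs)**2 / (2 * s_n**2))
post /= post.sum() * dx                        # normalize the density

x_mmse = (xs * post).sum() * dx                # posterior mean = MMSE estimate
x_closed = s_x**2 / (s_x**2 + s_n**2) * y_obs  # Gaussian closed form
print(x_mmse, x_closed)                        # the two agree closely
```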
3.2 Maximum A Posteriori (MAP)
Optimal solution: For the hit-or-miss loss, the MAP estimator minimizes the risk
\mathbb{E}[1 - \delta(X - \hat{X})]
and the optimal estimator is
g^*(Y) = \hat{X}_{\text{MAP}}(Y) = \arg\max_x p_{X|Y}(x|Y) = \arg\max_x p_{Y|X}(Y|x) p_X(x)
Requires: posterior distribution p_{X|Y}
Properties:
- Maximizes the posterior probability p_{X|Y}
- Not necessarily MSE-optimal
- Can be strongly nonlinear depending on the prior p_X(x)
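A sketch of computing the MAP estimate by grid search over the log-posterior, assuming a toy Gaussian model X ~ N(0, s_x^2), Y = X + N with N ~ N(0, s_n^2) (the values are assumed examples):

```python
import numpy as np

s_x, s_n, y_obs = 1.0, 0.5, 1.2

xs = np.linspace(-5.0, 5.0, 10_001)
# Log-posterior up to a constant: log prior + log likelihood
log_post = -xs**2 / (2 * s_x**2) - (y_obs - xs)**2 / (2 * s_n**2)
x_map = xs[np.argmax(log_post)]
print(x_map)  # ≈ (s_x^2 / (s_x^2 + s_n^2)) * y_obs = 0.96 for these values
```

In this Gaussian case the posterior mode equals the posterior mean, so MAP coincides with MMSE; with a non-Gaussian prior the two generally differ.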
3.3 Maximum Likelihood Estimator (MLE)
Starting from the MAP estimator, we can discard the prior p_X(x) by assuming it is constant (a uniform, improper prior).
Optimal solution:
g^*(Y) = \hat{X}_{\text{MLE}}(Y) = \arg\max_x p_{Y|X}(Y|x)
Requires: Likelihood function p_{Y|X}(y|x).
Properties:
- No prior knowledge about X required
- Asymptotically optimal under regularity conditions
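A sketch of MLE by grid search over the log-likelihood, assuming a Gaussian observation model (values are assumed examples); note that the result is just the observation itself:

```python
import numpy as np

# Y = X + N with N ~ N(0, s_n^2); with no prior, maximize the likelihood only
s_n, y_obs = 0.5, 1.2

xs = np.linspace(-5.0, 5.0, 10_001)
loglik = -(y_obs - xs)**2 / (2 * s_n**2)  # log-likelihood up to a constant
x_mle = xs[np.argmax(loglik)]
print(x_mle)  # ≈ y_obs: the MLE ignores any prior and returns the observation
```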
4. Parametric Estimators
Instead of finding an arbitrary function g, we restrict ourselves to a parametric family:
\hat{X} = g_\theta(Y)
where \theta \in \Theta are parameters. The estimation problem becomes finding the optimal \theta.
Examples:
- Neural networks: g_\theta is a deep network with weights \theta
- Polynomial estimators: g_\theta(Y) = \sum_{k=0}^K \theta_k Y^k
- Kernel methods: g_\theta(Y) = \sum_{i=1}^n \theta_i K(Y, y_i)
Requirements: Similar to general estimators—we need probabilistic information to define the risk, but now we optimize over a constrained set.
Advantages:
- Reduces the search space (from all functions to a parameterized family)
- Can incorporate domain knowledge through architecture
- More computationally tractable
Disadvantages:
- May not achieve the optimal risk if the true g^* is not in the family
Extension:
- Finding the parameter \theta is a regression problem, which requires paired data (X, Y). Prior knowledge about \theta can be incorporated to regularize the regression.
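As a sketch of this regression view, the coefficients of a polynomial estimator can be fit to paired samples by least squares (the data-generating model below is an assumed example, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Paired samples from a hypothetical nonlinear model x = tanh(y) + noise
y = rng.uniform(-2.0, 2.0, size=500)
x = np.tanh(y) + 0.1 * rng.normal(size=y.shape)

K = 3                                        # polynomial degree
Phi = np.vander(y, K + 1, increasing=True)   # columns 1, y, y^2, y^3
theta, *_ = np.linalg.lstsq(Phi, x, rcond=None)

def g(y_new):
    """Fitted polynomial estimator g_theta(Y) = sum_k theta_k Y^k."""
    return np.vander(np.atleast_1d(y_new), K + 1, increasing=True) @ theta
```

Because tanh is not a cubic, the fitted g_\theta cannot reach the optimal risk exactly, illustrating the approximation error of a restricted family.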
5. Affine and Linear Estimators
We further restrict the estimator to be affine (linear + constant):
\hat{X} = A Y + b
or linear (no constant term):
\hat{X} = A Y
5.1 Affine Minimum Mean Squared Error (AMMSE)
We seek the optimal affine estimator \hat{X} = A Y + b that minimizes the mean squared error:
\min_{A, b} \mathbb{E}[\|X - (A Y + b)\|^2]
\hat{X}_{\text{AMMSE}}(Y) = A^* Y + b^* = \mathbf{C}_{XY} \mathbf{C}_{YY}^{-1} (Y - \boldsymbol{\mu}_Y) + \boldsymbol{\mu}_X
Requires: Only second-order moments
- Mean: \boldsymbol{\mu}_X = \mathbb{E}[X], \boldsymbol{\mu}_Y = \mathbb{E}[Y]
- Covariances: \mathbf{C}_{XX} = \text{Cov}(X,X), \mathbf{C}_{YY} = \text{Cov}(Y,Y), \mathbf{C}_{XY} = \text{Cov}(X,Y)
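A minimal sketch of evaluating the AMMSE formula from given moments (the numbers below are assumed examples):

```python
import numpy as np

# Assumed second-order moments for a scalar X and scalar Y
mu_x, mu_y = np.array([1.0]), np.array([2.0])
C_xy = np.array([[0.8]])
C_yy = np.array([[1.0]])

# AMMSE: A* = C_xy C_yy^{-1}, b* = mu_x - A* mu_y
A = C_xy @ np.linalg.inv(C_yy)
b = mu_x - A @ mu_y

def x_hat(y):
    """Affine estimator: shrinks (y - mu_y) toward mu_x."""
    return A @ np.atleast_1d(y) + b

print(x_hat(2.5))
```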
5.2 Linear Minimum Mean Squared Error (LMMSE)
If we further restrict to linear estimators (no constant term), we set b = 0 and optimize only over A:
\min_A \mathbb{E}[\|X - A Y\|^2]
The optimal linear estimator is then \hat{X}_{\text{LMMSE}}(Y) = \mathbf{R}_{XY} \mathbf{R}_{YY}^{-1} Y, where \mathbf{R}_{XY} = \mathbb{E}[X Y^T] and \mathbf{R}_{YY} = \mathbb{E}[Y Y^T] are the correlation matrices.
This is the Wiener filter.
Note: The AMMSE solution automatically handles the constant term optimally, so LMMSE is just AMMSE when means are zero (or when we center the data first).
Properties:
- Only requires second-order statistics (means and covariances)
- Affine/Linear in the observations
- Optimal among all affine estimators
- If (X,Y) is jointly Gaussian, AMMSE/LMMSE = MMSE
- Much simpler than general MMSE (no need for full distributions)
6. Empirical Risk Minimization
When we don’t have access to the true distribution, we can use empirical data to approximate the risk by replacing the expectation with a sample mean: \mathbb E[\cdot] \;\longrightarrow\; \frac{1}{N}\sum_{i=1}^N (\cdot).
6.1 Problem Setup
Instead of minimizing the true risk:
\mathcal{R}(g) = \mathbb{E}[\ell(X, g(Y))]
we minimize the empirical risk:
\hat{\mathcal{R}}(g) = \frac{1}{N} \sum_{i=1}^N \ell(x_i, g(y_i))
where \{(x_i, y_i)\}_{i=1}^N are i.i.d. training samples.
6.2 Parametric Empirical Risk Minimization
Example: Deep Neural Networks (DNN)
\hat{\theta} = \arg\min_\theta \frac{1}{N} \sum_{i=1}^N \ell(x_i, g_\theta(y_i)) + \lambda \, r(\theta)
where r(\theta) is a regularization term on the parameters.
- g_\theta is a deep neural network
- Optimized via gradient descent (backpropagation)
- Requires large amounts of training data
- Can approximate complex nonlinear functions
6.3 Linear Empirical Risk Minimization
Example: Least Squares (LS)
For linear estimators \hat{X} = A Y:
\hat{A} = \arg\min_A \frac{1}{N} \sum_{i=1}^N \|x_i - A y_i\|^2
Solution:
\hat{A} = \left(\frac{1}{N} \sum_{i=1}^N x_i y_i^T\right) \left(\frac{1}{N} \sum_{i=1}^N y_i y_i^T\right)^{-1} = \hat{\mathbf{R}}_{XY} \hat{\mathbf{R}}_{YY}^{-1}
This is the empirical version of LMMSE, where we replace true correlations with sample correlations.
Properties:
- Simple closed-form solution
- Consistent estimator (converges to LMMSE as N \to \infty)
- Requires N \geq \dim(Y) for invertibility
- Can be regularized (ridge regression, Lasso)
- To analyze the performance of ERM theoretically, treat the samples (x_i, y_i) as realizations of i.i.d. random variables (X_i, Y_i).
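The sample-correlation formula above can be sketched in a few lines; the dimensions and the ground-truth mixing matrix below are assumed examples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed ground truth: X = A_true Y + small noise
N, dim_x, dim_y = 2000, 2, 3
A_true = rng.normal(size=(dim_x, dim_y))
Y = rng.normal(size=(N, dim_y))
X = Y @ A_true.T + 0.1 * rng.normal(size=(N, dim_x))

# Empirical least squares: A_hat = R_xy_hat @ inv(R_yy_hat)
R_xy = X.T @ Y / N
R_yy = Y.T @ Y / N
A_hat = R_xy @ np.linalg.inv(R_yy)
print(np.max(np.abs(A_hat - A_true)))  # small for large N (consistency)
```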
7. Summary: Hierarchy of Estimators
The following table summarizes the different classes of estimators:
| Estimator Class | Assumptions | Form of g | Requirements |
|---|---|---|---|
| MMSE | Full joint p_{X,Y} | Arbitrary | Full posterior |
| MAP | Full joint p_{X,Y} | Arbitrary | Full posterior |
| MLE | Likelihood p_{Y|X} | Arbitrary | Likelihood only |
| Parametric | Parametric family | g_\theta | Similar to above |
| LMMSE | Second moments | Linear/Affine | Means, covariances |
| Empirical (DNN) | Training data | g_\theta (DNN) | Large dataset |
| Empirical (LS) | Training data | Linear | Dataset, N \geq \dim(Y) |
Key tradeoffs:
- More assumptions → Better performance (if assumptions hold)
- Fewer assumptions → More robust, but may require more data
- More restrictive g → Easier to compute, but potentially suboptimal
- Less restrictive g → Can achieve optimal performance, but harder to compute
Example: Comparing MMSE, MAP, MLE, and LMMSE
Let’s consider a simple 1D estimation problem where:
- X \sim \mathcal{N}(0, \sigma_X^2) (Gaussian prior)
- Y = X + N where N \sim \mathcal{N}(0, \sigma_N^2) (additive Gaussian noise)
- We observe Y and want to estimate X
This is a classic denoising problem. Let’s see how different estimators behave.
Observations:
- MMSE/MAP/LMMSE: Shrinks observations toward zero (the prior mean). The shrinkage factor is \sigma_X^2/(\sigma_X^2 + \sigma_N^2).
- MLE: No shrinkage—simply returns the observation. This ignores the prior information.
- For Gaussian priors and likelihoods, MMSE, MAP, and LMMSE all coincide and are linear. LMMSE only requires second-order moments (means and covariances), while MMSE/MAP require the full distribution.
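These closed forms are easy to evaluate; the variances and the observed value below are assumed example numbers:

```python
# Gaussian denoising: X ~ N(0, s_x2), Y = X + N, N ~ N(0, s_n2)
s_x2, s_n2, y_obs = 1.0, 0.25, 1.2

shrink = s_x2 / (s_x2 + s_n2)   # shrinkage factor sigma_X^2/(sigma_X^2 + sigma_N^2)
x_mmse = shrink * y_obs         # = MAP = LMMSE in the jointly Gaussian case
x_mle = y_obs                   # MLE: no shrinkage, returns the observation
print(shrink, x_mmse, x_mle)
```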
Now let’s see what happens with a non-Gaussian prior where MMSE and MAP differ:
Key difference: With a Laplace (sparse) prior:
- MAP: Performs soft-thresholding—small observations are set to zero (sparsity-promoting)
- MMSE: Also shrinks toward zero but more smoothly—it’s the expected value under the posterior, not the mode
- LMMSE: Linear estimator (straight line) that only uses second-order moments. It’s suboptimal for non-Gaussian priors but much simpler to compute—only requires means and covariances, not the full distribution
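For a Laplace prior p(x) \propto e^{-|x|/b} with Gaussian noise of variance \sigma_N^2, the MAP estimator reduces to soft-thresholding with threshold \sigma_N^2 / b. A sketch with assumed values:

```python
import numpy as np

# Assumed noise variance and Laplace scale
s2, b = 0.25, 1.0
t = s2 / b  # soft-threshold level

def x_map(y):
    """MAP under a Laplace prior: soft-thresholding of the observation."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

print(x_map(np.array([-1.0, -0.1, 0.2, 2.0])))
# small observations are set exactly to zero; large ones are shrunk by t
```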
Example: Empirical Risk Minimization (Least Squares)
Finally, let’s demonstrate empirical risk minimization with least squares:
True coefficient: A = 0.700
Least squares estimate: A_hat = 0.641
Error: |A - A_hat| = 0.059
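A sketch of the kind of script that could produce output like the above; the sample size, noise level, and seed are assumed, so the exact numbers will differ from run to run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed scalar model: x = A_true * y + noise
A_true = 0.7
N = 50
y = rng.normal(size=N)
x = A_true * y + 0.3 * rng.normal(size=N)

# Scalar least-squares solution: sample correlation ratio
A_hat = np.sum(x * y) / np.sum(y * y)
print(f"True coefficient: A = {A_true:.3f}")
print(f"Least squares estimate: A_hat = {A_hat:.3f}")
print(f"Error: |A - A_hat| = {abs(A_true - A_hat):.3f}")
```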
Key points:
- Least squares estimates the linear relationship from data
- As N \to \infty, the estimate converges to the true value (consistency)
- This is the empirical version of LMMSE—we replace true covariances with sample covariances
- With finite data, there’s estimation error, but it decreases with more samples
Summary
This overview has covered the main classes of estimators:
- General estimators (MMSE, MAP, MLE) require full probabilistic knowledge
- Parametric estimators restrict the function class but still need probabilistic information
- Linear estimators (LMMSE) only need second-order moments
- Empirical risk minimization uses data to approximate the risk
Each class makes different tradeoffs between:
- Assumptions (what probabilistic information is needed)
- Complexity (computational cost)
- Performance (how close to optimal)
- Robustness (sensitivity to model assumptions)
The choice of estimator depends on your specific problem, available information, and computational constraints.