Overview of Estimators
This document provides a comprehensive overview of different classes of estimators, organized by their assumptions and computational requirements. We progress from the most general (requiring full probabilistic knowledge) to the most restrictive (requiring only moments or empirical data).
1. Problem Setup
We wish to estimate an unknown quantity X from observations Y, where Y is possibly a vector of observations. An estimator is a function g with
\hat{X} = g(Y)
The fundamental problem in estimation theory is: How do we design the estimator function g?
The answer depends on:
- What probabilistic information we have (full distributions, moments, or only data)
- What assumptions we’re willing to make about the form of g
- What criterion we use to measure estimation quality
2. Risk Minimization Framework
The quality of an estimator is measured by a loss function \ell(X, \hat{X}) that quantifies the cost of estimation error. Common loss functions include:
- Squared error: \ell(X, \hat{X}) = \|X - \hat{X}\|^2
- Absolute error: \ell(X, \hat{X}) = |X - \hat{X}|
- Hit-or-miss (0-1 loss): \ell(X, \hat{X}) = 1 - \delta(X - \hat{X})
The risk (or expected loss) of an estimator is:
\mathcal{R}(g) = \mathbb{E}[\ell(X, g(Y))] = \int \int \ell(x, g(y)) \, p_{X,Y}(x,y) \, dx \, dy
The optimal estimator minimizes this risk:
g^* = \arg\min_g \mathcal{R}(g)
Different classes of estimators arise from:
- Different assumptions about available probabilistic information
- Different constraints on the form of g
- Different loss functions \ell
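To make the risk concrete, it can be approximated by Monte Carlo: draw samples from the joint distribution and average the loss. A minimal sketch, assuming a toy Gaussian model and a placeholder estimator g (neither is taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint model (an assumption for illustration):
# X ~ N(0, 1), Y = X + noise with noise ~ N(0, 0.5^2)
x = rng.normal(0.0, 1.0, size=100_000)
y = x + rng.normal(0.0, 0.5, size=x.shape)

def g(y):
    """Candidate estimator: a simple (suboptimal) scaling of Y."""
    return 0.8 * y

# Monte Carlo approximation of the squared-error risk E[(X - g(Y))^2]
risk = np.mean((x - g(y)) ** 2)
print(f"estimated risk: {risk:.3f}")
```

For this particular model the exact risk of g(y) = 0.8y is 0.2^2 \cdot 1 + 0.8^2 \cdot 0.25 = 0.2, so the printed estimate should be close to that value.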
3. General Estimators (Full Probabilistic Knowledge)
When we have access to full probabilistic information, we can derive optimal estimators.
3.1 Minimum Mean Squared Error (MMSE)
Optimal solution: For squared error loss, the MMSE estimator minimizes the risk
\mathbb{E}[\|X - \hat{X}\|^2]
and the optimal estimator is
g^*(Y) = \hat{X}_{\text{MMSE}}(Y) = \mathbb{E}[X \mid Y] = \int x \, p_{X|Y}(x|Y) \, dx
Requires: posterior distribution p_{X|Y}
Properties:
- Mean of the posterior distribution
- Generally nonlinear (unless (X,Y) is jointly Gaussian)
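When the posterior is available, the conditional mean can be computed numerically. A sketch that integrates the posterior on a grid and checks it against the jointly Gaussian closed form (the model values below are assumed examples):

```python
import numpy as np

# MMSE estimate via numerical integration of the posterior, for the
# Gaussian model X ~ N(0, s_x^2), Y = X + N, N ~ N(0, s_n^2)
s_x, s_n, y_obs = 1.0, 0.5, 1.2

xs = np.linspace(-10.0, 10.0, 20_001)
dx = xs[1] - xs[0]
# Unnormalized posterior: prior * likelihood
post = np.exp(-xs**2 / (2 * s_x**2) - (y_obs - xs)**2 / (2 * s_n**2))
post /= post.sum() * dx                        # normalize the density

x_mmse = (xs * post).sum() * dx                # posterior mean = MMSE estimate
x_closed = s_x**2 / (s_x**2 + s_n**2) * y_obs  # Gaussian closed form
print(x_mmse, x_closed)                        # the two agree closely
```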
3.2 Maximum A Posteriori (MAP)
Optimal solution: For the hit-or-miss loss, the MAP estimator minimizes the risk
\mathbb{E}[1 - \delta(X - \hat{X})]
and the optimal estimator is
g^*(Y) = \hat{X}_{\text{MAP}}(Y) = \arg\max_x p_{X|Y}(x|Y) = \arg\max_x p_{Y|X}(Y|x) p_X(x)
Requires: posterior distribution p_{X|Y}
Properties:
- Maximizes the posterior probability p_{X|Y}
- Not necessarily MSE-optimal
- Can be strongly nonlinear depending on the prior p_X(x)
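A sketch of computing the MAP estimate by grid search over the log-posterior, assuming a toy Gaussian model X ~ N(0, s_x^2), Y = X + N with N ~ N(0, s_n^2) (the values are assumed examples):

```python
import numpy as np

s_x, s_n, y_obs = 1.0, 0.5, 1.2

xs = np.linspace(-5.0, 5.0, 10_001)
# Log-posterior up to a constant: log prior + log likelihood
log_post = -xs**2 / (2 * s_x**2) - (y_obs - xs)**2 / (2 * s_n**2)
x_map = xs[np.argmax(log_post)]
print(x_map)  # ≈ (s_x^2 / (s_x^2 + s_n^2)) * y_obs = 0.96 for these values
```

In this Gaussian case the posterior mode equals the posterior mean, so MAP coincides with MMSE; with a non-Gaussian prior the two generally differ.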
3.3 Maximum Likelihood Estimator (MLE)
Starting from the MAP estimator, we can discard the prior p_X(x) by assuming it is constant (a uniform, improper prior).
Optimal solution:
g^*(Y) = \hat{X}_{\text{MLE}}(Y) = \arg\max_x p_{Y|X}(Y|x)
Requires: Likelihood function p_{Y|X}(y|x).
Properties:
- No prior knowledge about X required
- Asymptotically optimal under regularity conditions
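A sketch of MLE by grid search over the log-likelihood, assuming a Gaussian observation model (values are assumed examples); note that the result is just the observation itself:

```python
import numpy as np

# Y = X + N with N ~ N(0, s_n^2); with no prior, maximize the likelihood only
s_n, y_obs = 0.5, 1.2

xs = np.linspace(-5.0, 5.0, 10_001)
loglik = -(y_obs - xs)**2 / (2 * s_n**2)  # log-likelihood up to a constant
x_mle = xs[np.argmax(loglik)]
print(x_mle)  # ≈ y_obs: the MLE ignores any prior and returns the observation
```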
4. Parametric Estimators
Instead of finding an arbitrary function g, we restrict ourselves to a parametric family:
\hat{X} = g_\theta(Y)
where \theta \in \Theta are parameters. The estimation problem becomes finding the optimal \theta.
Examples:
- Neural networks: g_\theta is a deep network with weights \theta
- Polynomial estimators: g_\theta(Y) = \sum_{k=0}^K \theta_k Y^k
- Kernel methods: g_\theta(Y) = \sum_{i=1}^n \theta_i K(Y, y_i)
Requirements: Similar to general estimators—we need probabilistic information to define the risk, but now we optimize over a constrained set.
Advantages:
- Reduces the search space (from all functions to a parameterized family)
- Can incorporate domain knowledge through architecture
- More computationally tractable
Disadvantages:
- May not achieve the optimal risk if the true g^* is not in the family
Extension:
- Finding the parameter \theta is a regression problem, which requires paired data (X, Y). Prior knowledge about \theta can be incorporated to regularize the regression.
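As a sketch of this regression view, the coefficients of a polynomial estimator can be fit to paired samples by least squares (the data-generating model below is an assumed example, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Paired samples from a hypothetical nonlinear model x = tanh(y) + noise
y = rng.uniform(-2.0, 2.0, size=500)
x = np.tanh(y) + 0.1 * rng.normal(size=y.shape)

K = 3                                        # polynomial degree
Phi = np.vander(y, K + 1, increasing=True)   # columns 1, y, y^2, y^3
theta, *_ = np.linalg.lstsq(Phi, x, rcond=None)

def g(y_new):
    """Fitted polynomial estimator g_theta(Y) = sum_k theta_k Y^k."""
    return np.vander(np.atleast_1d(y_new), K + 1, increasing=True) @ theta
```

Because tanh is not a cubic, the fitted g_\theta cannot reach the optimal risk exactly, illustrating the approximation error of a restricted family.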
5. Affine and Linear Estimators
We further restrict the estimator to be affine (linear + constant):
\hat{X} = A Y + b
or linear (no constant term):
\hat{X} = A Y
5.1 Affine Minimum Mean Squared Error (AMMSE)
We seek the optimal affine estimator \hat{X} = A Y + b that minimizes the mean squared error:
\min_{A, b} \mathbb{E}[\|X - (A Y + b)\|^2]
\hat{X}_{\text{AMMSE}}(Y) = A^* Y + b^* = \mathbf{C}_{XY} \mathbf{C}_{YY}^{-1} (Y - \boldsymbol{\mu}_Y) + \boldsymbol{\mu}_X
Requires: Only second-order moments
- Mean: \boldsymbol{\mu}_X = \mathbb{E}[X], \boldsymbol{\mu}_Y = \mathbb{E}[Y]
- Covariances: \mathbf{C}_{XX} = \text{Cov}(X,X), \mathbf{C}_{YY} = \text{Cov}(Y,Y), \mathbf{C}_{XY} = \text{Cov}(X,Y)
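A minimal sketch of evaluating the AMMSE formula from given moments (the numbers below are assumed examples):

```python
import numpy as np

# Assumed second-order moments for a scalar X and scalar Y
mu_x, mu_y = np.array([1.0]), np.array([2.0])
C_xy = np.array([[0.8]])
C_yy = np.array([[1.0]])

# AMMSE: A* = C_xy C_yy^{-1}, b* = mu_x - A* mu_y
A = C_xy @ np.linalg.inv(C_yy)
b = mu_x - A @ mu_y

def x_hat(y):
    """Affine estimator: shrinks (y - mu_y) toward mu_x."""
    return A @ np.atleast_1d(y) + b

print(x_hat(2.5))
```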
5.2 Linear Minimum Mean Squared Error (LMMSE)
If we further restrict to linear estimators (no constant term), we set b = 0 and optimize only over A:
\min_A \mathbb{E}[\|X - A Y\|^2]
The optimal linear estimator is then \hat{X}_{\text{LMMSE}}(Y) = \mathbf{R}_{XY} \mathbf{R}_{YY}^{-1} Y, where \mathbf{R}_{XY} = \mathbb{E}[X Y^T] and \mathbf{R}_{YY} = \mathbb{E}[Y Y^T] are the correlation matrices.
This is the Wiener filter.
Note: The AMMSE solution automatically handles the constant term optimally, so LMMSE is just AMMSE when means are zero (or when we center the data first).
Properties:
- Only requires second-order statistics (means and covariances)
- Affine/Linear in the observations
- Optimal among all affine estimators
- If (X,Y) is jointly Gaussian, AMMSE/LMMSE = MMSE
- Much simpler than general MMSE (no need for full distributions)
6. Empirical Risk Minimization
When we don’t have access to the true distribution, we can use empirical data to approximate the risk by replacing the expectation with a sample mean: \mathbb E[\cdot] \;\longrightarrow\; \frac{1}{N}\sum_{i=1}^N (\cdot).
6.1 Problem Setup
Instead of minimizing the true risk:
\mathcal{R}(g) = \mathbb{E}[\ell(X, g(Y))]
we minimize the empirical risk:
\hat{\mathcal{R}}(g) = \frac{1}{N} \sum_{i=1}^N \ell(x_i, g(y_i))
where \{(x_i, y_i)\}_{i=1}^N are i.i.d. training samples.
6.2 Parametric Empirical Risk Minimization
Example: Deep Neural Networks (DNN)
\hat{\theta} = \arg\min_\theta \frac{1}{N} \sum_{i=1}^N \ell(x_i, g_\theta(y_i)) + \lambda \, r(\theta)
where r(\theta) is a regularization term on the parameters.
- g_\theta is a deep neural network
- Optimized via gradient descent (backpropagation)
- Requires large amounts of training data
- Can approximate complex nonlinear functions
6.3 Linear Empirical Risk Minimization
Example: Least Squares (LS)
For linear estimators \hat{X} = A Y:
\hat{A} = \arg\min_A \frac{1}{N} \sum_{i=1}^N \|x_i - A y_i\|^2
Solution:
\hat{A} = \left(\frac{1}{N} \sum_{i=1}^N x_i y_i^T\right) \left(\frac{1}{N} \sum_{i=1}^N y_i y_i^T\right)^{-1} = \hat{\mathbf{R}}_{XY} \hat{\mathbf{R}}_{YY}^{-1}
This is the empirical version of LMMSE, where we replace true correlations with sample correlations.
Properties:
- Simple closed-form solution
- Consistent estimator (converges to LMMSE as N \to \infty)
- Requires N \geq \dim(Y) for invertibility
- Can be regularized (ridge regression, Lasso)
- To analyze the performance of ERM theoretically, treat the samples (x_i, y_i) as realizations of i.i.d. random variables (X_i, Y_i).
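The sample-correlation formula above can be sketched in a few lines; the dimensions and the ground-truth mixing matrix below are assumed examples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed ground truth: X = A_true Y + small noise
N, dim_x, dim_y = 2000, 2, 3
A_true = rng.normal(size=(dim_x, dim_y))
Y = rng.normal(size=(N, dim_y))
X = Y @ A_true.T + 0.1 * rng.normal(size=(N, dim_x))

# Empirical least squares: A_hat = R_xy_hat @ inv(R_yy_hat)
R_xy = X.T @ Y / N
R_yy = Y.T @ Y / N
A_hat = R_xy @ np.linalg.inv(R_yy)
print(np.max(np.abs(A_hat - A_true)))  # small for large N (consistency)
```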
7. Summary: Hierarchy of Estimators
The following table summarizes the different classes of estimators:
| Estimator Class | Assumptions | Form of g | Requirements |
|---|---|---|---|
| MMSE | Full joint p_{X,Y} | Arbitrary | Full posterior |
| MAP | Full joint p_{X,Y} | Arbitrary | Full posterior |
| MLE | Likelihood p_{Y|X} | Arbitrary | Likelihood only |
| Parametric | Parametric family | g_\theta | Similar to above |
| LMMSE | Second moments | Linear/Affine | Means, covariances |
| Empirical (DNN) | Training data | g_\theta (DNN) | Large dataset |
| Empirical (LS) | Training data | Linear | Dataset, N \geq \dim(Y) |
Key tradeoffs:
- More assumptions → Better performance (if assumptions hold)
- Fewer assumptions → More robust, but may require more data
- More restrictive g → Easier to compute, but potentially suboptimal
- Less restrictive g → Can achieve optimal performance, but harder to compute
Example: Comparing MMSE, MAP, MLE, and LMMSE
Let’s consider a simple 1D estimation problem where:
- X \sim \mathcal{N}(0, \sigma_X^2) (Gaussian prior)
- Y = X + N where N \sim \mathcal{N}(0, \sigma_N^2) (additive Gaussian noise)
- We observe Y and want to estimate X
This is a classic denoising problem. Let’s see how different estimators behave.
Observations:
- MMSE/MAP/LMMSE: Shrinks observations toward zero (the prior mean). The shrinkage factor is \sigma_X^2/(\sigma_X^2 + \sigma_N^2).
- MLE: No shrinkage—simply returns the observation. This ignores the prior information.
- For Gaussian priors and likelihoods, MMSE, MAP, and LMMSE all coincide and are linear. LMMSE only requires second-order moments (means and covariances), while MMSE/MAP require the full distribution.
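These closed forms are easy to evaluate; the variances and the observed value below are assumed example numbers:

```python
# Gaussian denoising: X ~ N(0, s_x2), Y = X + N, N ~ N(0, s_n2)
s_x2, s_n2, y_obs = 1.0, 0.25, 1.2

shrink = s_x2 / (s_x2 + s_n2)   # shrinkage factor sigma_X^2/(sigma_X^2 + sigma_N^2)
x_mmse = shrink * y_obs         # = MAP = LMMSE in the jointly Gaussian case
x_mle = y_obs                   # MLE: no shrinkage, returns the observation
print(shrink, x_mmse, x_mle)
```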
Now let’s see what happens with a non-Gaussian prior where MMSE and MAP differ:
Key difference: With a Laplace (sparse) prior:
- MAP: Performs soft-thresholding—small observations are set to zero (sparsity-promoting)
- MMSE: Also shrinks toward zero but more smoothly—it’s the expected value under the posterior, not the mode
- LMMSE: Linear estimator (straight line) that only uses second-order moments. It’s suboptimal for non-Gaussian priors but much simpler to compute—only requires means and covariances, not the full distribution
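For a Laplace prior p(x) \propto e^{-|x|/b} with Gaussian noise of variance \sigma_N^2, the MAP estimator reduces to soft-thresholding with threshold \sigma_N^2 / b. A sketch with assumed values:

```python
import numpy as np

# Assumed noise variance and Laplace scale
s2, b = 0.25, 1.0
t = s2 / b  # soft-threshold level

def x_map(y):
    """MAP under a Laplace prior: soft-thresholding of the observation."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

print(x_map(np.array([-1.0, -0.1, 0.2, 2.0])))
# small observations are set exactly to zero; large ones are shrunk by t
```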
Example: Empirical Risk Minimization (Least Squares)
Finally, let’s demonstrate empirical risk minimization with least squares:
True coefficient: A = 0.700
Least squares estimate: A_hat = 0.641
Error: |A - A_hat| = 0.059
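A sketch of the kind of script that could produce output like the above; the sample size, noise level, and seed are assumed, so the exact numbers will differ from run to run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed scalar model: x = A_true * y + noise
A_true = 0.7
N = 50
y = rng.normal(size=N)
x = A_true * y + 0.3 * rng.normal(size=N)

# Scalar least-squares solution: sample correlation ratio
A_hat = np.sum(x * y) / np.sum(y * y)
print(f"True coefficient: A = {A_true:.3f}")
print(f"Least squares estimate: A_hat = {A_hat:.3f}")
print(f"Error: |A - A_hat| = {abs(A_true - A_hat):.3f}")
```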
Key points:
- Least squares estimates the linear relationship from data
- As N \to \infty, the estimate converges to the true value (consistency)
- This is the empirical version of LMMSE—we replace true covariances with sample covariances
- With finite data, there’s estimation error, but it decreases with more samples
Summary
This overview has covered the main classes of estimators:
- General estimators (MMSE, MAP, MLE) require full probabilistic knowledge
- Parametric estimators restrict the function class but still need probabilistic information
- Linear estimators (LMMSE) only need second-order moments
- Empirical risk minimization uses data to approximate the risk
Each class makes different tradeoffs between:
- Assumptions (what probabilistic information is needed)
- Complexity (computational cost)
- Performance (how close to optimal)
- Robustness (sensitivity to model assumptions)
The choice of estimator depends on your specific problem, available information, and computational constraints.