1. Warm-up: Recursive Mean Estimation
Before diving into the full parameter estimation problem, let’s start with the simplest case: recursively estimating the mean of a sequence of observations.
Problem Setup
We observe a sequence of noisy measurements: y_k = \mu + v_k, \quad k = 1, 2, \ldots
where:
- \mu is the unknown constant mean we want to estimate
- v_k \sim \mathcal{N}(0, \sigma_v^2) is white noise
Batch Solution (Non-recursive)
If we had all k observations, the optimal estimate would be: \hat{\mu}_k = \frac{1}{k} \sum_{i=1}^k y_i
This requires storing all past observations and recomputing the sum each time.
Recursive Solution
We can compute this recursively without storing all past data:
\hat{\mu}_k = \frac{1}{k} \sum_{i=1}^k y_i = \frac{1}{k} \left( \sum_{i=1}^{k-1} y_i + y_k \right)
= \frac{1}{k} \left( (k-1) \hat{\mu}_{k-1} + y_k \right)
= \hat{\mu}_{k-1} + \frac{1}{k} (y_k - \hat{\mu}_{k-1})
This is a recursive update! Notice the structure:
- Old estimate: \hat{\mu}_{k-1}
- Innovation: (y_k - \hat{\mu}_{k-1}) (the prediction error)
- Gain: \frac{1}{k} (decreases over time, giving less weight to new observations)
We will see that this recursive mean can be interpreted as a Kalman filter.
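This update is easy to sanity-check numerically. A minimal sketch (assuming numpy and a simulated Gaussian measurement sequence; all names are illustrative) comparing the batch and recursive formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 5.0
y = mu_true + rng.normal(0.0, 1.0, size=1000)   # y_k = mu + v_k

# Recursive update: mu_k = mu_{k-1} + (1/k) * (y_k - mu_{k-1})
mu_hat = 0.0
for k, y_k in enumerate(y, start=1):
    mu_hat += (y_k - mu_hat) / k                # gain 1/k times the innovation

print(f"batch mean:     {y.mean():.6f}")
print(f"recursive mean: {mu_hat:.6f}")          # agrees up to float rounding
```

The two estimates are algebraically identical, so they agree to floating-point precision while the recursive version stores only a single scalar.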
2. Simplifying the Kalman Filter: Static Parameter Estimation
The Simplification
When we estimate a static parameter vector \mathbf{x} (that doesn’t change over time) and scalar observations y, the Kalman filter equations simplify dramatically.
State Space Model for Static Parameters:
\mathbf{x}_{k+1} = \mathbf{x}_k \quad \text{(no process noise, } \mathbf{Q} = \mathbf{0} \text{)}
y_k = \mathbf{h}_k^T \mathbf{x}_k + v_k
where:
- \mathbf{x}_k \in \mathbb{R}^n is the static parameter vector we want to estimate
- \mathbf{h}_k \in \mathbb{R}^n is the observation vector at time k (regressor)
- y_k \in \mathbb{R} is the scalar observation at time k
- v_k is measurement noise with variance R = \sigma_v^2
This is called a static-state Kalman filter.
Simplified Kalman Filter Equations
Predict Step (since \mathbf{x} doesn’t change): \hat{\mathbf{x}}_{k|k-1} = \hat{\mathbf{x}}_{k-1|k-1}, \qquad \mathbf{P}_{k|k-1} = \mathbf{P}_{k-1|k-1}
Update Step: \mathbf{K}_k = \frac{\mathbf{P}_{k-1|k-1} \mathbf{h}_k}{\mathbf{h}_k^T \mathbf{P}_{k-1|k-1} \mathbf{h}_k + R} = \mathbf{P}_{k|k} \mathbf{h}_k R^{-1}
\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k-1|k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1|k-1})
\mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{h}_k^T) \mathbf{P}_{k-1|k-1}
(The second expression for the gain, \mathbf{K}_k = \mathbf{P}_{k|k} \mathbf{h}_k R^{-1}, uses the updated covariance and follows directly from the covariance update.)
Note: We can drop the |k-1 and |k notation since prediction is trivial, and write: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
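These simplified equations fit in a few lines of code. A minimal sketch of the measurement update (the function name `static_kf_update` and the simulated example are illustrative, not from any library):

```python
import numpy as np

def static_kf_update(x_hat, P, h, y, R):
    """One measurement update of the static-state Kalman filter.

    x_hat : (n,) current estimate     P : (n, n) error covariance
    h     : (n,) regressor            y : scalar observation
    R     : measurement noise variance sigma_v^2
    """
    K = P @ h / (h @ P @ h + R)                      # Kalman gain
    x_hat = x_hat + K * (y - h @ x_hat)              # estimate + gain * innovation
    P = (np.eye(len(x_hat)) - np.outer(K, h)) @ P    # covariance update
    return x_hat, P

# Illustrative run: recover a 2-parameter vector from noisy scalar observations
rng = np.random.default_rng(1)
x_true = np.array([1.0, -2.0])
x_hat, P = np.zeros(2), 1e6 * np.eye(2)              # huge P_0: weak prior
for _ in range(500):
    h = rng.normal(size=2)
    y = h @ x_true + rng.normal(0.0, 0.1)
    x_hat, P = static_kf_update(x_hat, P, h, y, R=0.01)
```

Because the predict step is trivial, the whole filter is just this update applied once per observation.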
Recursive Mean as a Static-state Kalman Filter
This recursive mean is exactly what we get from the Kalman filter when:
- \mathbf{x} = \mu (scalar)
- \mathbf{h}_k = 1 (scalar observation vector)
- \mathbf{P}_0 \to \infty (uninformative prior, i.e. no prior information)
The Kalman gain becomes: K_k = \frac{P_{k-1}}{P_{k-1} + R}
The error covariance update for the mean estimation problem is: P_k = (1 - K_k) P_{k-1} = \frac{P_{k-1} R}{P_{k-1} + R}. Taking the limit of an uninformative prior, P_1 = \lim_{P_0\to\infty}\frac{P_0 R}{P_0+R} = R and K_1 = \lim_{P_0\to\infty}\frac{P_0}{P_0+R} = 1.
This results in K_k = \frac{1}{k} and P_k = \frac{R}{k}, which recursively gives exactly the update from Section 1: \hat{\mu}_k = \hat{\mu}_{k-1} + \frac{1}{k} (y_k - \hat{\mu}_{k-1})
3. Recursive Least Squares (RLS) with Explicit Noise Variance
We extend the recursive mean example to a vector-valued state \mathbf{x} being observed through a linear combination \mathbf{h}_k. This results in a recursive formulation of the Least Squares (LS) problem.
Problem Setup
We observe linear measurements y_k = \mathbf{h}_k^T \mathbf{x} + v_k, where:
- \mathbf{x} \in \mathbb{R}^n is an unknown static parameter vector
- \mathbf{h}_k \in \mathbb{R}^n is the known regressor vector
- v_k \sim \mathcal{N}(0,\,\sigma_v^2) is measurement noise
Batch Solution (Maximum Likelihood / Least Squares)
Under Gaussian noise, the maximum likelihood estimator minimizes J(\mathbf{x}) = \sum_{i=1}^k \left( y_i - \mathbf{h}_i^T \mathbf{x} \right)^2.
Define the sample correlation quantities: \hat{\mathbf{R}}_k = \sum_{i=1}^k \mathbf{h}_i \mathbf{h}_i^T, \qquad \hat{\mathbf{r}}_{k} = \sum_{i=1}^k \mathbf{h}_i y_i.
The batch solution is \hat{\mathbf{x}}_k = \hat{\mathbf{R}}_k^{-1} \hat{\mathbf{r}}_{k}.
Recursive Updates
Correlation matrix update: \hat{\mathbf{R}}_k = \hat{\mathbf{R}}_{k-1} + \mathbf{h}_k \mathbf{h}_k^T
Cross-correlation update: \hat{\mathbf{r}}_{k} = \hat{\mathbf{r}}_{k-1} + \mathbf{h}_k y_k
Define the error covariance as the scaled inverse correlation matrix: \mathbf{P}_k \triangleq \sigma_v^2 \hat{\mathbf{R}}_k^{-1} (the \sigma_v^2 factor makes \mathbf{P}_k the Kalman-filter error covariance, and keeps the noise variance explicit in the recursions below)
Using the Matrix Inversion Lemma (Sherman–Morrison formula), we get
\mathbf{P}_k = \mathbf{P}_{k-1} - \frac{ \mathbf{P}_{k-1} \mathbf{h}_k \mathbf{h}_k^T \mathbf{P}_{k-1} }{ \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k + \sigma_v^2 }
Parameter Update (RLS Form)
The estimate update can be written as \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k \left( y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1} \right)
with the RLS gain: \boxed{ \mathbf{K}_k = \frac{ \mathbf{P}_{k-1} \mathbf{h}_k }{ \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k + \sigma_v^2 } }
and the covariance update: \boxed{ \mathbf{P}_k = \left( \mathbf{I} - \mathbf{K}_k \mathbf{h}_k^T \right) \mathbf{P}_{k-1} }
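The boxed equations translate directly into code. A minimal sketch (illustrative names and simulated data) that also verifies the recursion against the batch solution \hat{\mathbf{x}}_k = \hat{\mathbf{R}}_k^{-1} \hat{\mathbf{r}}_k:

```python
import numpy as np

def rls_step(x_hat, P, h, y, sigma_v2):
    """One RLS update with explicit noise variance (identical to the static KF)."""
    K = P @ h / (h @ P @ h + sigma_v2)               # RLS gain
    x_hat = x_hat + K * (y - h @ x_hat)              # innovation correction
    P = (np.eye(len(x_hat)) - np.outer(K, h)) @ P    # covariance update
    return x_hat, P

# Illustrative run: compare the recursion against the batch LS solution
rng = np.random.default_rng(42)
x_true = np.array([2.0, -1.5, 0.8])
H = rng.normal(size=(200, 3))
ys = H @ x_true + rng.normal(0.0, 0.3, size=200)

x_hat, P = np.zeros(3), 1e9 * np.eye(3)              # near-uninformative prior
for h, y in zip(H, ys):
    x_hat, P = rls_step(x_hat, P, h, y, sigma_v2=0.09)

x_batch = np.linalg.solve(H.T @ H, H.T @ ys)         # batch R_k^{-1} r_k
```

With a near-infinite prior covariance, the recursive estimate matches the batch least-squares solution to numerical precision, without ever forming or inverting \hat{\mathbf{R}}_k.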
Key Insight
Recursive Least Squares is exactly the Kalman filter for static parameter estimation. The classical RLS equations correspond to the special case \sigma_v^2 = 1, i.e. whitened measurements.
Keeping \sigma_v^2 explicit clarifies how measurement uncertainty controls the gain and the rate at which new data influence the estimate.
Example simulation output:
True parameters: [ 2. -1.5 0.8]
Final RLS estimates: [ 1.98886372 -1.45515 0.85961517]
Final error: 0.075429
RLS with Forgetting Factor
In practical applications, we often want the algorithm to “forget” old data in order to track parameters that change slowly over time. This is achieved by introducing a forgetting factor \lambda \in (0,1].
The modified cost function with forgetting factor is: J(\mathbf{x}) = \sum_{i=1}^k \lambda^{k-i} \left( y_i - \mathbf{h}_i^T \mathbf{x} \right)^2.
Here, more recent observations are given higher weight. The corresponding exponentially weighted correlation quantities are \hat{\mathbf{R}}_k = \sum_{i=1}^k \lambda^{k-i} \mathbf{h}_i \mathbf{h}_i^T, \qquad \hat{\mathbf{r}}_k = \sum_{i=1}^k \lambda^{k-i} \mathbf{h}_i y_i.
They satisfy the recursions \hat{\mathbf{R}}_k = \lambda \hat{\mathbf{R}}_{k-1} + \mathbf{h}_k \mathbf{h}_k^T,
\hat{\mathbf{r}}_k = \lambda \hat{\mathbf{r}}_{k-1} + \mathbf{h}_k y_k.
Defining \mathbf{P}_k = \hat{\mathbf{R}}_k^{-1}, the RLS update equations with forgetting factor become:
\mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k} {\mathbf{h}_k^\top \mathbf{P}_{k-1} \mathbf{h}_k + \lambda \sigma_v^2}
\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k \left( y_k - \mathbf{h}_k^\top \hat{\mathbf{x}}_{k-1} \right)
\mathbf{P}_k = \frac{1}{\lambda} \left(\mathbf{I} - \mathbf{K}_k \mathbf{h}_k^\top\right) \mathbf{P}_{k-1}
Thus, the factor \frac{1}{\lambda} leads to a growth in uncertainty at every step.
Connection to Kalman Filter with Process Noise:
When \lambda < 1, using a forgetting factor is equivalent to including process noise in the Kalman filter model:
\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{q}_k, \quad \mathbf{q}_k \sim \mathcal{N}(\mathbf{0},\, \mathbf{Q})
with measurement model y_k = \mathbf{h}_k^\top \mathbf{x}_k + v_k, \quad v_k \sim \mathcal{N}(0,\sigma_v^2).
The forgetting factor corresponds to the covariance prediction \mathbf{P}_{k|k-1} = \frac{1}{\lambda}\mathbf{P}_{k-1} = \mathbf{P}_{k-1} + \mathbf{Q},
which implies the effective process noise covariance \mathbf{Q} = \left(\frac{1}{\lambda} - 1\right)\mathbf{P}_{k-1}.
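A minimal sketch of RLS with forgetting (illustrative names and simulation; the \lambda \sigma_v^2 term in the gain denominator follows from applying the matrix inversion lemma to the exponentially weighted normal equations). The simulated parameters jump halfway through, which an algorithm with \lambda = 1 would track only sluggishly:

```python
import numpy as np

def rls_forgetting_step(x_hat, P, h, y, sigma_v2, lam):
    """One RLS update with forgetting factor lam in (0, 1]."""
    K = P @ h / (h @ P @ h + lam * sigma_v2)              # gain
    x_hat = x_hat + K * (y - h @ x_hat)                   # innovation correction
    P = (np.eye(len(x_hat)) - np.outer(K, h)) @ P / lam   # 1/lam inflates P
    return x_hat, P

# Illustrative run: the true parameters jump at k = 200
rng = np.random.default_rng(7)
x_hat, P = np.zeros(2), 1e3 * np.eye(2)
for k in range(400):
    x_true = np.array([1.0, -1.0]) if k < 200 else np.array([3.0, 0.5])
    h = rng.normal(size=2)
    y = h @ x_true + rng.normal(0.0, 0.1)
    x_hat, P = rls_forgetting_step(x_hat, P, h, y, sigma_v2=0.01, lam=0.95)
```

With \lambda = 0.95 the effective memory is roughly 1/(1-\lambda) = 20 samples, so the estimate locks onto the new parameter values well before the run ends.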
4. Deriving LMS from RLS
The Least Mean Squares (LMS) algorithm is a simplified, computationally cheaper version of RLS.
The Key Approximation
RLS requires updating the full covariance matrix \mathbf{P}_k at each step, which is O(n^2) operations. LMS approximates:
\mathbf{P}_{k-1} \approx \mu \mathbf{I}
where \mu > 0 is a step size parameter.
Derivation
Starting from the RLS update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
where, taking \sigma_v^2 = 1 (the whitened-measurement convention noted in the Key Insight above): \mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k}{1 + \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k}
Substituting \mathbf{P}_{k-1} \approx \mu \mathbf{I}:
\mathbf{K}_k \approx \frac{\mu \mathbf{h}_k}{1 + \mu \mathbf{h}_k^T \mathbf{h}_k}
If \mu is small (typical in LMS), then \mu \mathbf{h}_k^T \mathbf{h}_k \ll 1, so:
\mathbf{K}_k \approx \mu \mathbf{h}_k
Therefore, the LMS update is:
\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mu \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
This is the LMS algorithm!
Interpretation
- Gradient descent: LMS can be viewed as stochastic gradient descent on the instantaneous squared error
- Computational cost: O(n) per iteration (vs O(n^2) for RLS)
- Trade-off: Simpler but potentially slower convergence and less stable than RLS
Step Size Selection
The step size \mu must be chosen carefully:
- Too small: Slow convergence
- Too large: Instability, divergence
A common bound for stability is: 0 < \mu < \frac{2}{\lambda_{\max}(\mathbb{E}[\mathbf{h}_k \mathbf{h}_k^T])}
where \lambda_{\max} is the maximum eigenvalue of the input correlation matrix.
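A minimal LMS sketch under these assumptions (illustrative simulation; here \mathbb{E}[\mathbf{h}_k \mathbf{h}_k^T] = \mathbf{I}, so the stability bound allows any 0 < \mu < 2):

```python
import numpy as np

def lms_step(x_hat, h, y, mu):
    """One LMS update: O(n) per step, no covariance matrix."""
    return x_hat + mu * h * (y - h @ x_hat)

rng = np.random.default_rng(3)
x_true = np.array([2.0, -1.5, 0.8])
x_hat = np.zeros(3)
mu = 0.05   # well inside the stability bound; smaller mu = lower steady-state error
for _ in range(2000):
    h = rng.normal(size=3)                     # white input, E[h h^T] = I
    y = h @ x_true + rng.normal(0.0, 0.3)
    x_hat = lms_step(x_hat, h, y, mu)
```

Note the cost trade-off: each step is a single O(n) vector operation, but convergence takes many more samples than the RLS run above.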
Example simulation output:
Final RLS error: 0.075429
Final LMS error: 0.484112
5. Deriving Normalized LMS (NLMS) from LMS
Normalized LMS (NLMS) is an improvement over LMS that automatically adjusts the step size based on the input signal power.
Motivation
The LMS update is: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mu \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
The step size \mu is fixed, but the effective step size depends on \|\mathbf{h}_k\|^2:
- When \|\mathbf{h}_k\|^2 is large, the update is large (potentially unstable)
- When \|\mathbf{h}_k\|^2 is small, the update is small (slow convergence)
Derivation from RLS
Recall the RLS gain (again with \sigma_v^2 = 1): \mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k}{1 + \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k}
If we approximate \mathbf{P}_{k-1} \approx \mu \mathbf{I} (like in LMS), but keep the denominator:
\mathbf{K}_k \approx \frac{\mu \mathbf{h}_k}{1 + \mu \mathbf{h}_k^T \mathbf{h}_k}
This gives the NLMS update:
\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \frac{1}{1/\mu + \mathbf{h}_k^T \mathbf{h}_k} \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
where the term 1/\mu acts as a regularization constant, preventing division by zero when \|\mathbf{h}_k\|^2 is very small.
Normalized Step Size
The effective step size is now: \mu_{\text{eff}} =\frac{1}{1/\mu + \mathbf{h}_k^T \mathbf{h}_k}
This normalizes the update by the input power, making the algorithm:
- More stable (less sensitive to input scaling)
- Faster converging (larger effective step when input is small)
- More robust to different signal levels
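A minimal NLMS sketch (illustrative names and simulation), using the 1/\mu-regularized gain above; the input scale is varied over two orders of magnitude to exercise the normalization:

```python
import numpy as np

def nlms_step(x_hat, h, y, inv_mu=1e-3):
    """One NLMS update: the gain is normalized by the input power ||h||^2."""
    gain = h / (inv_mu + h @ h)                # 1/mu regularizes the denominator
    return x_hat + gain * (y - h @ x_hat)

# Illustrative run: input power varies over two orders of magnitude
rng = np.random.default_rng(5)
x_true = np.array([2.0, -1.5, 0.8])
x_hat = np.zeros(3)
for _ in range(500):
    scale = 10.0 ** rng.integers(0, 3)         # scale is 1, 10, or 100
    h = scale * rng.normal(size=3)
    y = h @ x_true + rng.normal(0.0, 0.05)
    x_hat = nlms_step(x_hat, h, y)
```

The same code with plain LMS would need a different \mu for each input scale; the normalization makes one setting work across all of them.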
Connection to Kalman Filter
NLMS can be seen as a better approximation to RLS/Kalman filter than standard LMS, because it preserves more of the structure of the optimal gain matrix.
Example simulation output:
Final RLS error: 0.075429
Final LMS error: 0.484112
Final NLMS error: 0.051615
6. Summary: The Hierarchy of Adaptive Filters
We have derived a complete hierarchy of adaptive filtering algorithms, all starting from the Kalman filter:
The Derivation Path
- Kalman Filter (Full dynamic state estimation)
- State: \mathbf{x}_{k+1} = \mathbf{F}_k \mathbf{x}_k + \mathbf{q}_k (with process noise \mathbf{q}_k)
- Observation: \mathbf{y}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{v}_k
- Complexity: O(n^2) per iteration
- Static Kalman Filter (Static parameter estimation)
- State: \mathbf{x}_{k+1} = \mathbf{x}_k (no process noise)
- Observation: y_k = \mathbf{h}_k^T \mathbf{x}_k + v_k
- Same complexity, simpler structure
- RLS (Recursive Least Squares)
- Equivalent to static Kalman filter
- Optimal for static parameters
- Complexity: O(n^2) per iteration
- Update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
- Gain (with forgetting factor): \mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k}{\lambda \sigma_v^2 + \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k}
- LMS (Least Mean Squares)
- Approximation: \mathbf{P}_{k-1} \approx \mu \mathbf{I}
- Simpler, but suboptimal
- Complexity: O(n) per iteration
- Update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mu \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
- NLMS (Normalized LMS)
- Better approximation: keeps normalization term
- More stable and faster than LMS
- Complexity: O(n) per iteration
- Update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \frac{1}{1/\mu + \|\mathbf{h}_k\|^2} \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
Key Insights
- All adaptive filters share the same structure: old estimate + gain × innovation
- The gain determines the algorithm: RLS uses optimal gain, LMS/NLMS use approximations
- Trade-off: Optimality vs. computational complexity
- Recursive mean is the simplest case (scalar, unit regressor)
When to Use Which?
- RLS: When you need optimal performance and can afford O(n^2) complexity
- NLMS: Good balance of performance and complexity, robust to input scaling
- LMS: Simplest, but requires careful step size tuning and may be slower