1. Warm-up: Recursive Mean Estimation
Before diving into the full parameter estimation problem, let’s start with the simplest case: recursively estimating the mean of a sequence of observations.
Problem Setup
We observe a sequence of noisy measurements: y_k = \mu + v_k, \quad k = 1, 2, \ldots
where:
- \mu is the unknown constant mean we want to estimate
- v_k \sim \mathcal{N}(0, \sigma_v^2) is white noise
Batch Solution (Non-recursive)
If we had all k observations, the optimal estimate would be: \hat{\mu}_k = \frac{1}{k} \sum_{i=1}^k y_i
This requires storing all past observations and recomputing the sum each time.
Recursive Solution
We can compute this recursively without storing all past data:
\hat{\mu}_k = \frac{1}{k} \sum_{i=1}^k y_i = \frac{1}{k} \left( \sum_{i=1}^{k-1} y_i + y_k \right)
= \frac{1}{k} \left( (k-1) \hat{\mu}_{k-1} + y_k \right)
= \hat{\mu}_{k-1} + \frac{1}{k} (y_k - \hat{\mu}_{k-1})
This is a recursive update! Notice the structure:
- Old estimate: \hat{\mu}_{k-1}
- Innovation: (y_k - \hat{\mu}_{k-1}) (the prediction error)
- Gain: \frac{1}{k} (decreases over time, giving less weight to new observations)
We will see that this recursive mean can be interpreted as a Kalman filter.
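This update is easy to sanity-check numerically. A minimal sketch (assuming numpy and a simulated Gaussian measurement sequence; all names are illustrative) comparing the batch and recursive formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 5.0
y = mu_true + rng.normal(0.0, 1.0, size=1000)   # y_k = mu + v_k

# Recursive update: mu_k = mu_{k-1} + (1/k) * (y_k - mu_{k-1})
mu_hat = 0.0
for k, y_k in enumerate(y, start=1):
    mu_hat += (y_k - mu_hat) / k                # gain 1/k times the innovation

print(f"batch mean:     {y.mean():.6f}")
print(f"recursive mean: {mu_hat:.6f}")          # agrees up to float rounding
```

The two estimates are algebraically identical, so they agree to floating-point precision while the recursive version stores only a single scalar.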
2. Simplifying the Kalman Filter: Static Parameter Estimation
The Simplification
When we estimate a static parameter vector \mathbf{x} (that doesn’t change over time) and scalar observations y, the Kalman filter equations simplify dramatically.
State Space Model for Static Parameters:
\mathbf{x}_{k+1} = \mathbf{x}_k \quad \text{(no process noise, } \mathbf{Q} = \mathbf{0} \text{)}
y_k = \mathbf{h}_k^T \mathbf{x}_k + v_k
where:
- \mathbf{x}_k \in \mathbb{R}^n is the static parameter vector we want to estimate
- \mathbf{h}_k \in \mathbb{R}^n is the observation vector at time k (regressor)
- y_k \in \mathbb{R} is the scalar observation at time k
- v_k is measurement noise with variance R = \sigma_v^2
This is called a static-state Kalman filter.
Simplified Kalman Filter Equations
Predict Step (since \mathbf{x} doesn’t change): \hat{\mathbf{x}}_{k|k-1} = \hat{\mathbf{x}}_{k-1|k-1}, \qquad \mathbf{P}_{k|k-1} = \mathbf{P}_{k-1|k-1}
Update Step: \mathbf{K}_k = \frac{\mathbf{P}_{k-1|k-1} \mathbf{h}_k}{\mathbf{h}_k^T \mathbf{P}_{k-1|k-1} \mathbf{h}_k + R} = \mathbf{P}_{k|k} \mathbf{h}_k R^{-1}
\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k-1|k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1|k-1})
\mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{h}_k^T) \mathbf{P}_{k-1|k-1}
(The second expression for the gain, \mathbf{K}_k = \mathbf{P}_{k|k} \mathbf{h}_k R^{-1}, uses the updated covariance and follows directly from the covariance update.)
Note: We can drop the |k-1 and |k notation since prediction is trivial, and write: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
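These simplified equations fit in a few lines of code. A minimal sketch of the measurement update (the function name `static_kf_update` and the simulated example are illustrative, not from any library):

```python
import numpy as np

def static_kf_update(x_hat, P, h, y, R):
    """One measurement update of the static-state Kalman filter.

    x_hat : (n,) current estimate     P : (n, n) error covariance
    h     : (n,) regressor            y : scalar observation
    R     : measurement noise variance sigma_v^2
    """
    K = P @ h / (h @ P @ h + R)                      # Kalman gain
    x_hat = x_hat + K * (y - h @ x_hat)              # estimate + gain * innovation
    P = (np.eye(len(x_hat)) - np.outer(K, h)) @ P    # covariance update
    return x_hat, P

# Illustrative run: recover a 2-parameter vector from noisy scalar observations
rng = np.random.default_rng(1)
x_true = np.array([1.0, -2.0])
x_hat, P = np.zeros(2), 1e6 * np.eye(2)              # huge P_0: weak prior
for _ in range(500):
    h = rng.normal(size=2)
    y = h @ x_true + rng.normal(0.0, 0.1)
    x_hat, P = static_kf_update(x_hat, P, h, y, R=0.01)
```

Because the predict step is trivial, the whole filter is just this update applied once per observation.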
Recursive Mean as a Static-state Kalman Filter
This recursive mean is exactly what we get from the Kalman filter when:
- \mathbf{x} = \mu (scalar)
- \mathbf{h}_k = 1 (scalar observation vector)
- \mathbf{P}_0 \to \infty (uninformative prior, i.e. no prior information)
The Kalman gain becomes: K_k = \frac{P_{k-1}}{P_{k-1} + R}
The error covariance update for the mean estimation problem is: P_k = (1 - K_k) P_{k-1} = \frac{P_{k-1} R}{P_{k-1} + R}. Taking the limit of an uninformative prior, P_1 = \lim_{P_0\to\infty}\frac{P_0 R}{P_0+R} = R and K_1 = \lim_{P_0\to\infty}\frac{P_0}{P_0+R} = 1.
This results in K_k = \frac{1}{k} and P_k = \frac{R}{k}, which recursively gives exactly the update from Section 1: \hat{\mu}_k = \hat{\mu}_{k-1} + \frac{1}{k} (y_k - \hat{\mu}_{k-1})
3. Recursive Least Squares (RLS) with Explicit Noise Variance
We extend the recursive mean example to a vector-valued state \mathbf{x} being observed through a linear combination \mathbf{h}_k. This results in a recursive formulation of the Least Squares (LS) problem.
Problem Setup
We observe linear measurements y_k = \mathbf{h}_k^T \mathbf{x} + v_k, where:
- \mathbf{x} \in \mathbb{R}^n is an unknown static parameter vector
- \mathbf{h}_k \in \mathbb{R}^n is the known regressor vector
- v_k \sim \mathcal{N}(0,\,\sigma_v^2) is measurement noise
Batch Solution (Maximum Likelihood / Least Squares)
Under Gaussian noise, the maximum likelihood estimator minimizes J(\mathbf{x}) = \sum_{i=1}^k \left( y_i - \mathbf{h}_i^T \mathbf{x} \right)^2.
Define the sample correlation quantities: \hat{\mathbf{R}}_k = \sum_{i=1}^k \mathbf{h}_i \mathbf{h}_i^T, \qquad \hat{\mathbf{r}}_{k} = \sum_{i=1}^k \mathbf{h}_i y_i.
The batch solution is \hat{\mathbf{x}}_k = \hat{\mathbf{R}}_k^{-1} \hat{\mathbf{r}}_{k}.
Recursive Updates
Correlation matrix update: \hat{\mathbf{R}}_k = \hat{\mathbf{R}}_{k-1} + \mathbf{h}_k \mathbf{h}_k^T
Cross-correlation update: \hat{\mathbf{r}}_{k} = \hat{\mathbf{r}}_{k-1} + \mathbf{h}_k y_k
Define the error covariance as the scaled inverse correlation matrix: \mathbf{P}_k \triangleq \sigma_v^2 \hat{\mathbf{R}}_k^{-1} (the \sigma_v^2 factor makes \mathbf{P}_k the Kalman-filter error covariance, and keeps the noise variance explicit in the recursions below)
Using the Matrix Inversion Lemma (Sherman–Morrison formula), we get
\mathbf{P}_k = \mathbf{P}_{k-1} - \frac{ \mathbf{P}_{k-1} \mathbf{h}_k \mathbf{h}_k^T \mathbf{P}_{k-1} }{ \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k + \sigma_v^2 }
Parameter Update (RLS Form)
The estimate update can be written as \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k \left( y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1} \right)
with the RLS gain: \boxed{ \mathbf{K}_k = \frac{ \mathbf{P}_{k-1} \mathbf{h}_k }{ \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k + \sigma_v^2 } }
and the covariance update: \boxed{ \mathbf{P}_k = \left( \mathbf{I} - \mathbf{K}_k \mathbf{h}_k^T \right) \mathbf{P}_{k-1} }
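The boxed equations translate directly into code. A minimal sketch (illustrative names and simulated data) that also verifies the recursion against the batch solution \hat{\mathbf{x}}_k = \hat{\mathbf{R}}_k^{-1} \hat{\mathbf{r}}_k:

```python
import numpy as np

def rls_step(x_hat, P, h, y, sigma_v2):
    """One RLS update with explicit noise variance (identical to the static KF)."""
    K = P @ h / (h @ P @ h + sigma_v2)               # RLS gain
    x_hat = x_hat + K * (y - h @ x_hat)              # innovation correction
    P = (np.eye(len(x_hat)) - np.outer(K, h)) @ P    # covariance update
    return x_hat, P

# Illustrative run: compare the recursion against the batch LS solution
rng = np.random.default_rng(42)
x_true = np.array([2.0, -1.5, 0.8])
H = rng.normal(size=(200, 3))
ys = H @ x_true + rng.normal(0.0, 0.3, size=200)

x_hat, P = np.zeros(3), 1e9 * np.eye(3)              # near-uninformative prior
for h, y in zip(H, ys):
    x_hat, P = rls_step(x_hat, P, h, y, sigma_v2=0.09)

x_batch = np.linalg.solve(H.T @ H, H.T @ ys)         # batch R_k^{-1} r_k
```

With a near-infinite prior covariance, the recursive estimate matches the batch least-squares solution to numerical precision, without ever forming or inverting \hat{\mathbf{R}}_k.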
Key Insight
Recursive Least Squares is exactly the Kalman filter for static parameter estimation. The classical RLS equations correspond to the special case \sigma_v^2 = 1, i.e. whitened measurements.
Keeping \sigma_v^2 explicit clarifies how measurement uncertainty controls the gain and the rate at which new data influence the estimate.
Example simulation output:
True parameters: [ 2. -1.5 0.8]
Final RLS estimates: [ 1.98886372 -1.45515 0.85961517]
Final error: 0.075429
RLS with Forgetting Factor
In practical applications, we often want the algorithm to “forget” old data in order to track parameters that change slowly over time. This is achieved by introducing a forgetting factor \lambda \in (0,1].
The modified cost function with forgetting factor is: J(\mathbf{x}) = \sum_{i=1}^k \lambda^{k-i} \left( y_i - \mathbf{h}_i^T \mathbf{x} \right)^2.
Here, more recent observations are given higher weight. The corresponding exponentially weighted correlation quantities are \hat{\mathbf{R}}_k = \sum_{i=1}^k \lambda^{k-i} \mathbf{h}_i \mathbf{h}_i^T, \qquad \hat{\mathbf{r}}_k = \sum_{i=1}^k \lambda^{k-i} \mathbf{h}_i y_i.
They satisfy the recursions \hat{\mathbf{R}}_k = \lambda \hat{\mathbf{R}}_{k-1} + \mathbf{h}_k \mathbf{h}_k^T,
\hat{\mathbf{r}}_k = \lambda \hat{\mathbf{r}}_{k-1} + \mathbf{h}_k y_k.
Defining \mathbf{P}_k = \hat{\mathbf{R}}_k^{-1}, the RLS update equations with forgetting factor become:
\mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k} {\mathbf{h}_k^\top \mathbf{P}_{k-1} \mathbf{h}_k + \lambda \sigma_v^2}
\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k \left( y_k - \mathbf{h}_k^\top \hat{\mathbf{x}}_{k-1} \right)
\mathbf{P}_k = \frac{1}{\lambda} \left(\mathbf{I} - \mathbf{K}_k \mathbf{h}_k^\top\right) \mathbf{P}_{k-1}
Thus, the factor \frac{1}{\lambda} leads to a growth in uncertainty at every step.
Connection to Kalman Filter with Process Noise:
When \lambda < 1, using a forgetting factor is equivalent to including process noise in the Kalman filter model:
\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{q}_k, \quad \mathbf{q}_k \sim \mathcal{N}(\mathbf{0},\, \mathbf{Q})
with measurement model y_k = \mathbf{h}_k^\top \mathbf{x}_k + v_k, \quad v_k \sim \mathcal{N}(0,\sigma_v^2).
The forgetting factor corresponds to the covariance prediction \mathbf{P}_{k|k-1} = \frac{1}{\lambda}\mathbf{P}_{k-1} = \mathbf{P}_{k-1} + \mathbf{Q},
which implies the effective process noise covariance \mathbf{Q} = \left(\frac{1}{\lambda} - 1\right)\mathbf{P}_{k-1}.
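A minimal sketch of RLS with forgetting (illustrative names and simulation; the \lambda \sigma_v^2 term in the gain denominator follows from applying the matrix inversion lemma to the exponentially weighted normal equations). The simulated parameters jump halfway through, which an algorithm with \lambda = 1 would track only sluggishly:

```python
import numpy as np

def rls_forgetting_step(x_hat, P, h, y, sigma_v2, lam):
    """One RLS update with forgetting factor lam in (0, 1]."""
    K = P @ h / (h @ P @ h + lam * sigma_v2)              # gain
    x_hat = x_hat + K * (y - h @ x_hat)                   # innovation correction
    P = (np.eye(len(x_hat)) - np.outer(K, h)) @ P / lam   # 1/lam inflates P
    return x_hat, P

# Illustrative run: the true parameters jump at k = 200
rng = np.random.default_rng(7)
x_hat, P = np.zeros(2), 1e3 * np.eye(2)
for k in range(400):
    x_true = np.array([1.0, -1.0]) if k < 200 else np.array([3.0, 0.5])
    h = rng.normal(size=2)
    y = h @ x_true + rng.normal(0.0, 0.1)
    x_hat, P = rls_forgetting_step(x_hat, P, h, y, sigma_v2=0.01, lam=0.95)
```

With \lambda = 0.95 the effective memory is roughly 1/(1-\lambda) = 20 samples, so the estimate locks onto the new parameter values well before the run ends.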
4. Deriving LMS from RLS
The Least Mean Squares (LMS) algorithm is a simplified, computationally cheaper version of RLS.
The Key Approximation
RLS requires updating the full covariance matrix \mathbf{P}_k at each step, which is O(n^2) operations. LMS approximates:
\mathbf{P}_{k-1} \approx \mu \mathbf{I}
where \mu > 0 is a step size parameter.
Derivation
Starting from the RLS update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
where, taking \sigma_v^2 = 1 (the whitened-measurement convention noted in the Key Insight above): \mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k}{1 + \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k}
Substituting \mathbf{P}_{k-1} \approx \mu \mathbf{I}:
\mathbf{K}_k \approx \frac{\mu \mathbf{h}_k}{1 + \mu \mathbf{h}_k^T \mathbf{h}_k}
If \mu is small (typical in LMS), then \mu \mathbf{h}_k^T \mathbf{h}_k \ll 1, so:
\mathbf{K}_k \approx \mu \mathbf{h}_k
Therefore, the LMS update is:
\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mu \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
This is the LMS algorithm!
Interpretation
- Gradient descent: LMS can be viewed as stochastic gradient descent on the instantaneous squared error
- Computational cost: O(n) per iteration (vs O(n^2) for RLS)
- Trade-off: Simpler but potentially slower convergence and less stable than RLS
Step Size Selection
The step size \mu must be chosen carefully:
- Too small: Slow convergence
- Too large: Instability, divergence
A common bound for stability is: 0 < \mu < \frac{2}{\lambda_{\max}(\mathbb{E}[\mathbf{h}_k \mathbf{h}_k^T])}
where \lambda_{\max} is the maximum eigenvalue of the input correlation matrix.
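A minimal LMS sketch under these assumptions (illustrative simulation; here \mathbb{E}[\mathbf{h}_k \mathbf{h}_k^T] = \mathbf{I}, so the stability bound allows any 0 < \mu < 2):

```python
import numpy as np

def lms_step(x_hat, h, y, mu):
    """One LMS update: O(n) per step, no covariance matrix."""
    return x_hat + mu * h * (y - h @ x_hat)

rng = np.random.default_rng(3)
x_true = np.array([2.0, -1.5, 0.8])
x_hat = np.zeros(3)
mu = 0.05   # well inside the stability bound; smaller mu = lower steady-state error
for _ in range(2000):
    h = rng.normal(size=3)                     # white input, E[h h^T] = I
    y = h @ x_true + rng.normal(0.0, 0.3)
    x_hat = lms_step(x_hat, h, y, mu)
```

Note the cost trade-off: each step is a single O(n) vector operation, but convergence takes many more samples than the RLS run above.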
Example simulation output:
Final RLS error: 0.075429
Final LMS error: 0.484112
5. Deriving Normalized LMS (NLMS) from LMS
Normalized LMS (NLMS) is an improvement over LMS that automatically adjusts the step size based on the input signal power.
Motivation
The LMS update is: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mu \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
The step size \mu is fixed, but the effective step size depends on \|\mathbf{h}_k\|^2:
- When \|\mathbf{h}_k\|^2 is large, the update is large (potentially unstable)
- When \|\mathbf{h}_k\|^2 is small, the update is small (slow convergence)
Derivation from RLS
Recall the RLS gain (again with \sigma_v^2 = 1): \mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k}{1 + \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k}
If we approximate \mathbf{P}_{k-1} \approx \mu \mathbf{I} (like in LMS), but keep the denominator:
\mathbf{K}_k \approx \frac{\mu \mathbf{h}_k}{1 + \mu \mathbf{h}_k^T \mathbf{h}_k}
This gives the NLMS update:
\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \frac{1}{1/\mu + \mathbf{h}_k^T \mathbf{h}_k} \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
where the term 1/\mu acts as a regularization constant, preventing division by zero when \|\mathbf{h}_k\|^2 is very small.
Normalized Step Size
The effective step size is now: \mu_{\text{eff}} =\frac{1}{1/\mu + \mathbf{h}_k^T \mathbf{h}_k}
This normalizes the update by the input power, making the algorithm:
- More stable (less sensitive to input scaling)
- Faster converging (larger effective step when input is small)
- More robust to different signal levels
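A minimal NLMS sketch (illustrative names and simulation), using the 1/\mu-regularized gain above; the input scale is varied over two orders of magnitude to exercise the normalization:

```python
import numpy as np

def nlms_step(x_hat, h, y, inv_mu=1e-3):
    """One NLMS update: the gain is normalized by the input power ||h||^2."""
    gain = h / (inv_mu + h @ h)                # 1/mu regularizes the denominator
    return x_hat + gain * (y - h @ x_hat)

# Illustrative run: input power varies over two orders of magnitude
rng = np.random.default_rng(5)
x_true = np.array([2.0, -1.5, 0.8])
x_hat = np.zeros(3)
for _ in range(500):
    scale = 10.0 ** rng.integers(0, 3)         # scale is 1, 10, or 100
    h = scale * rng.normal(size=3)
    y = h @ x_true + rng.normal(0.0, 0.05)
    x_hat = nlms_step(x_hat, h, y)
```

The same code with plain LMS would need a different \mu for each input scale; the normalization makes one setting work across all of them.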
Connection to Kalman Filter
NLMS can be seen as a better approximation to RLS/Kalman filter than standard LMS, because it preserves more of the structure of the optimal gain matrix.
Example simulation output:
Final RLS error: 0.075429
Final LMS error: 0.484112
Final NLMS error: 0.051615
6. Summary: The Hierarchy of Adaptive Filters
We have derived a complete hierarchy of adaptive filtering algorithms, all starting from the Kalman filter:
The Derivation Path
- Kalman Filter (Full dynamic state estimation)
- State: \mathbf{x}_{k+1} = \mathbf{F}_k \mathbf{x}_k + \mathbf{q}_k (with process noise \mathbf{q}_k)
- Observation: \mathbf{y}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{v}_k
- Complexity: O(n^2) per iteration
- Static Kalman Filter (Static parameter estimation)
- State: \mathbf{x}_{k+1} = \mathbf{x}_k (no process noise)
- Observation: y_k = \mathbf{h}_k^T \mathbf{x}_k + v_k
- Same complexity, simpler structure
- RLS (Recursive Least Squares)
- Equivalent to static Kalman filter
- Optimal for static parameters
- Complexity: O(n^2) per iteration
- Update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
- Gain (with forgetting factor): \mathbf{K}_k = \frac{\mathbf{P}_{k-1} \mathbf{h}_k}{\lambda \sigma_v^2 + \mathbf{h}_k^T \mathbf{P}_{k-1} \mathbf{h}_k}
- LMS (Least Mean Squares)
- Approximation: \mathbf{P}_{k-1} \approx \mu \mathbf{I}
- Simpler, but suboptimal
- Complexity: O(n) per iteration
- Update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \mu \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
- NLMS (Normalized LMS)
- Better approximation: keeps normalization term
- More stable and faster than LMS
- Complexity: O(n) per iteration
- Update: \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k-1} + \frac{1}{1/\mu + \|\mathbf{h}_k\|^2} \mathbf{h}_k (y_k - \mathbf{h}_k^T \hat{\mathbf{x}}_{k-1})
Key Insights
- All adaptive filters share the same structure: old estimate + gain × innovation
- The gain determines the algorithm: RLS uses optimal gain, LMS/NLMS use approximations
- Trade-off: Optimality vs. computational complexity
- Recursive mean is the simplest case (scalar, unit regressor)
When to Use Which?
- RLS: When you need optimal performance and can afford O(n^2) complexity
- NLMS: Good balance of performance and complexity, robust to input scaling
- LMS: Simplest, but requires careful step size tuning and may be slower