Regularized Regression
Regularized regression extends standard parameter estimation by incorporating prior knowledge or structural assumptions through a penalty on the model parameters. The setting is a parametric estimator g_\theta : Y \mapsto \hat X with parameters \theta, trained from supervised data (X, Y).
1. Problem Setup
Given a loss function \ell(X, g_\theta(Y)), the goal is to find parameters \theta that minimize the expected prediction error:
\min_\theta \; \mathbb{E}\big[\ell(X, g_\theta(Y))\big].
2. Adding Regularization
To prevent overfitting and impose structure, add a penalty term:
\hat{\theta} = \arg\min_\theta \left\{ \mathbb{E}\big[\ell(X, g_\theta(Y))\big] + \lambda\,\Omega(\theta) \right\}.
Typical choices of \Omega(\theta):
- Ridge: \Omega(\theta)=\|\theta\|_2^2
- Lasso: \Omega(\theta)=\|\theta\|_1
- Elastic Net: a weighted combination of the \ell_1 and \ell_2 penalties
- General priors: smoothness, group sparsity, low-rank, etc.
The parameter \lambda \ge 0 controls the trade-off between fitting the data and enforcing the penalty: \lambda = 0 recovers the unregularized estimator, while larger \lambda imposes the assumed structure more strongly.
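A minimal sketch of the objective above, assuming a linear estimator g_\theta(Y) = Y\theta, squared loss, and the ridge penalty (the data here is synthetic, chosen only for illustration). In this special case the regularized minimizer has a closed form:

```python
import numpy as np

# Assumed setup: g_theta(Y) = Y @ theta, squared loss, Omega(theta) = ||theta||_2^2.
# The minimizer of ||X - Y theta||^2 + lam * ||theta||_2^2 is
# theta_hat = (Y^T Y + lam * I)^{-1} Y^T X.
rng = np.random.default_rng(0)
n, d = 50, 5
Y = rng.normal(size=(n, d))                       # observed inputs
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = Y @ theta_true + 0.1 * rng.normal(size=n)     # noisy targets

def ridge(Y, X, lam):
    """Closed-form minimizer of ||X - Y theta||^2 + lam * ||theta||_2^2."""
    return np.linalg.solve(Y.T @ Y + lam * np.eye(Y.shape[1]), Y.T @ X)

theta_ols   = ridge(Y, X, 0.0)    # lam = 0 recovers ordinary least squares
theta_ridge = ridge(Y, X, 10.0)   # larger lam shrinks theta toward zero
```

Increasing \lambda shrinks the norm of the solution, which is the trade-off the text describes.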
3. Bayesian Interpretation
3.1 Likelihood Interpretation of the Loss
The parametric regression objective \ell(X, g_\theta(Y)) can be viewed as a negative log-likelihood for the parameter vector \theta. Define a conditional likelihood model:
p(X \mid Y, \theta) \;\propto\; \exp\!\big(-\ell(X, g_\theta(Y))\big).
Then minimizing empirical loss is equivalent to maximum likelihood estimation (MLE) of \theta:
\hat{\theta}_{\text{MLE}} = \arg\min_\theta \ell(X, g_\theta(Y)) = \arg\max_\theta p(X \mid Y, \theta).
Thus, empirical risk minimization can be reinterpreted as a likelihood-based estimation of the parameters.
3.2 Gaussian-Noise Interpretation for Squared Loss
For the common squared loss \ell(X,g_\theta(Y)) = \|X - g_\theta(Y)\|^2, the corresponding likelihood is
p(X\mid Y,\theta) \;\propto\; \exp\!\left(-\frac{1}{2\sigma^2}\,\|X - g_\theta(Y)\|^2\right).
This is exactly the likelihood of the additive-noise model
X = g_\theta(Y) + \varepsilon, \qquad \varepsilon \sim \mathcal N(0, \sigma^2 I).
The factor 1/(2\sigma^2) rescales the loss without changing its minimizer, so minimizing squared loss is MLE for \theta under the assumption of Gaussian prediction noise.
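The equivalence can be checked numerically (a sketch; sigma and the sample points are arbitrary): the exact Gaussian negative log-likelihood is the squared loss scaled by 1/(2\sigma^2) plus a constant that does not depend on the prediction, so the two objectives differ only by an affine transformation and share the same minimizer.

```python
import numpy as np

def gaussian_nll(x, mean, sigma):
    # -log of prod_i N(x_i; mean_i, sigma^2), written out term by term
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mean) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(1)
sigma = 0.7
x = rng.normal(size=4)
g_a, g_b = rng.normal(size=4), rng.normal(size=4)  # two candidate predictions

# Differences in NLL depend only on differences in squared loss,
# scaled by 1/(2 sigma^2): the constant term cancels.
delta_nll  = gaussian_nll(x, g_a, sigma) - gaussian_nll(x, g_b, sigma)
delta_loss = np.sum((x - g_a) ** 2) - np.sum((x - g_b) ** 2)
```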
3.3 Regularization as a Prior on the Parameters
In a Bayesian view, the penalty \Omega(\theta) corresponds to a prior
p(\theta) \;\propto\; \exp\!\big(-\lambda\,\Omega(\theta)\big).
Combining likelihood and prior gives the posterior
p(\theta\mid X,Y) \propto p(X\mid Y,\theta)\,p(\theta),
and the estimator becomes MAP:
\hat{\theta}_{\text{MAP}} = \arg\min_\theta \Big[ \ell(X,g_\theta(Y)) + \lambda\,\Omega(\theta) \Big].
Thus:
- loss term → negative log-likelihood
- regularizer → negative log-prior
- regularized regression → MAP estimation of \theta
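The MAP correspondence can be verified for the Gaussian special case (a sketch with invented dimensions and noise levels): for a linear model with Gaussian noise \sigma^2 and a Gaussian prior \theta \sim \mathcal N(0, \tau^2 I), the posterior mode is exactly the ridge solution with \lambda = \sigma^2/\tau^2.

```python
import numpy as np

# Assumed setup: X = Y theta + eps, eps ~ N(0, sigma^2 I), prior theta ~ N(0, tau^2 I).
# The MAP estimate maximizes the posterior, i.e. minimizes
# ||X - Y t||^2 / (2 sigma^2) + ||t||^2 / (2 tau^2),
# whose minimizer is the ridge solution with lam = sigma^2 / tau^2.
rng = np.random.default_rng(2)
n, d = 40, 3
Y = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma, tau = 0.5, 1.5
X = Y @ theta_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2
theta_map = np.linalg.solve(Y.T @ Y + lam * np.eye(d), Y.T @ X)

def neg_log_post(t):
    """Negative log-posterior up to an additive constant."""
    return np.sum((X - Y @ t) ** 2) / (2 * sigma**2) + np.sum(t**2) / (2 * tau**2)

# The closed-form ridge solution should beat random perturbations of itself.
for _ in range(20):
    t = theta_map + 0.1 * rng.normal(size=d)
    assert neg_log_post(theta_map) <= neg_log_post(t)
```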
4. Why Regularization Matters
- Controls overfitting by penalizing overly flexible parameterizations
- Incorporates prior knowledge, such as sparsity or smoothness
- Improves conditioning and numerical stability (e.g., ridge regression)
- Enhances generalization when sample size is limited
- Acts as Bayesian complexity control via priors on \theta
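The conditioning point above can be made concrete (the near-collinear features are invented for illustration): when two columns of Y are nearly identical, Y^T Y is close to singular, and adding \lambda I, as ridge regression effectively does, bounds the smallest eigenvalue away from zero and lowers the condition number.

```python
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.normal(size=100)
y2 = y1 + 1e-6 * rng.normal(size=100)   # near-duplicate column
Y = np.column_stack([y1, y2])

gram = Y.T @ Y
cond_plain = np.linalg.cond(gram)                     # huge: nearly singular
cond_ridge = np.linalg.cond(gram + 1.0 * np.eye(2))   # bounded by the lam*I term
```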
5. Connection to Standard Regression
| Method | Regularizer | Bayesian prior | Interpretation |
|---|---|---|---|
| Ordinary LS | none | flat prior | MLE of \theta |
| Ridge | \|\theta\|_2^2 | Gaussian prior | MAP |
| Lasso | \|\theta\|_1 | Laplace prior | MAP |
| Elastic Net | mix | mixed prior | MAP |
| General penalties | \Omega(\theta) | \propto e^{-\lambda \Omega(\theta)} | MAP |
Regularized regression therefore unifies classical optimization with Bayesian modeling of estimator parameters.
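The qualitative difference between the table's rows can be seen numerically (a sketch; the ISTA solver and synthetic data are written for illustration, not taken from the text): the \ell_1 penalty, i.e. the Laplace prior, drives some coordinates of \hat\theta exactly to zero, whereas the \ell_2 penalty, i.e. the Gaussian prior, only shrinks them.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrinks toward 0, exact zeros appear."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Y, X, lam, steps=5000):
    """Proximal gradient (ISTA) on ||X - Y theta||^2 + lam * ||theta||_1."""
    L = 2 * np.linalg.norm(Y, 2) ** 2     # Lipschitz constant of the gradient
    theta = np.zeros(Y.shape[1])
    for _ in range(steps):
        grad = 2 * Y.T @ (Y @ theta - X)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(4)
Y = rng.normal(size=(60, 8))
theta_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.5])   # sparse ground truth
X = Y @ theta_true + 0.05 * rng.normal(size=60)

theta_lasso = lasso_ista(Y, X, lam=5.0)   # expect exact zeros on the zero coords
```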