Regularized Regression

Regularized regression extends standard parameter estimation by incorporating prior knowledge or structural assumptions through a penalty on the model parameters. The setting is a parametric estimator g_\theta that maps an observation Y to an estimate \hat X = g_\theta(Y), with parameters \theta learned from supervised data (X, Y).

1. Problem Setup

Given a loss function \ell(X, g_\theta(Y)), the goal is to find parameters \theta that minimize the expected prediction error:

\min_\theta \; \mathbb{E}\big[\ell(X, g_\theta(Y))\big].
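As a concrete instance of this setup, here is a minimal sketch on synthetic data (all names and values are illustrative) with a linear estimator g_\theta(Y) = Y\theta and squared loss, for which the empirical risk minimizer is ordinary least squares:

```python
import numpy as np

# Minimal sketch: empirical risk minimization for a linear estimator
# g_theta(Y) = Y @ theta under squared loss. Data and names are illustrative.
rng = np.random.default_rng(0)
n, d = 200, 3
Y = rng.normal(size=(n, d))                      # observed inputs
theta_true = np.array([1.0, -2.0, 0.5])          # ground-truth parameters
X = Y @ theta_true + 0.1 * rng.normal(size=n)    # noisy targets

# Minimizing the empirical mean of ||X - Y theta||^2 is ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(Y, X, rcond=None)
```

With enough samples relative to the noise level, theta_hat recovers theta_true closely.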

2. Adding Regularization

To prevent overfitting and impose structure, add a penalty term:

\hat{\theta} = \arg\min_\theta \left\{ \mathbb{E}\big[\ell(X, g_\theta(Y))\big] + \lambda\,\Omega(\theta) \right\}.

Typical choices of \Omega(\theta):

  • Ridge: \Omega(\theta)=\|\theta\|_2^2
  • Lasso: \Omega(\theta)=\|\theta\|_1
  • Elastic Net: combination of \ell_1 and \ell_2
  • General priors: smoothness, group sparsity, low-rank, etc.

The parameter \lambda controls the trade-off between data fit and regularization strength.
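To make the penalties concrete, here is a sketch of ridge (which has a closed form) and lasso (solved by proximal gradient descent, ISTA) on synthetic data; the variable names and the value of \lambda are illustrative, not prescriptive:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
Y = rng.normal(size=(n, d))
theta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])   # sparse ground truth
X = Y @ theta_true + 0.1 * rng.normal(size=n)
lam = 5.0                                            # illustrative strength

# Ridge has the closed form (Y^T Y + lam I)^{-1} Y^T X.
theta_ridge = np.linalg.solve(Y.T @ Y + lam * np.eye(d), Y.T @ X)

# Lasso: proximal gradient (ISTA) on 0.5 ||X - Y theta||^2 + lam ||theta||_1.
L = np.linalg.eigvalsh(Y.T @ Y).max()                # gradient Lipschitz constant
theta_lasso = np.zeros(d)
for _ in range(1000):
    z = theta_lasso - (Y.T @ (Y @ theta_lasso - X)) / L
    theta_lasso = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
```

Note the qualitative difference: the \ell_1 penalty drives some coefficients to (essentially) exact zeros, while the \ell_2 penalty only shrinks all coefficients toward zero.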

3. Bayesian Interpretation

3.1 Likelihood Interpretation of the Loss

The regression loss \ell(X, g_\theta(Y)) can be viewed as a negative log-likelihood for the parameter vector \theta. Define the conditional likelihood model

p(X \mid Y, \theta) \;\propto\; \exp\!\big(-\ell(X, g_\theta(Y))\big).

Then minimizing the empirical loss is equivalent to maximum likelihood estimation (MLE) of \theta:

\hat{\theta}_{\text{MLE}} = \arg\min_\theta \ell(X, g_\theta(Y)) = \arg\max_\theta p(X \mid Y, \theta).

Thus, empirical risk minimization can be reinterpreted as likelihood-based estimation of the parameters.
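A tiny numerical illustration of this equivalence, using a scalar model g_\theta(y) = \theta y, a parameter grid, and synthetic data (all illustrative):

```python
import numpy as np

# Sketch: on a parameter grid, the loss minimizer and the maximizer of the
# induced likelihood exp(-loss) are the same point, since exp(-t) is
# strictly decreasing in t.
rng = np.random.default_rng(2)
y = rng.normal(size=50)
x = 1.5 * y + 0.1 * rng.normal(size=50)    # scalar model g_theta(y) = theta * y

thetas = np.linspace(0.0, 3.0, 301)
loss = np.array([np.sum((x - t * y) ** 2) for t in thetas])
likelihood = np.exp(-(loss - loss.min()))  # shifted to avoid underflow

assert np.argmin(loss) == np.argmax(likelihood)
```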

3.2 Gaussian-Noise Interpretation for Squared Loss

For the common squared loss \ell(X,g_\theta(Y)) = \|X - g_\theta(Y)\|^2, the likelihood becomes:

p(X\mid Y,\theta) \propto \exp\!\left(-\tfrac{1}{2\sigma^2}\,\|X - g_\theta(Y)\|^2\right).

This is exactly the likelihood of the additive-noise model

X = g_\theta(Y) + \varepsilon, \qquad \varepsilon \sim \mathcal N(0, \sigma^2 I).

The factor 1/(2\sigma^2) rescales the loss but does not change its minimizer over \theta. Therefore, minimizing squared loss is MLE for \theta under the assumption of Gaussian prediction noise.
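Writing out the Gaussian density makes the constants explicit: for X = g_\theta(Y) + \varepsilon with \varepsilon \sim \mathcal N(0,\sigma^2 I) in n dimensions,

-\log p(X \mid Y, \theta) \;=\; \frac{1}{2\sigma^2}\,\|X - g_\theta(Y)\|^2 \;+\; \frac{n}{2}\log(2\pi\sigma^2),

so for fixed \sigma the terms independent of \theta drop out, and the MLE coincides with the least-squares minimizer.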

3.3 Regularization as a Prior on the Parameters

In a Bayesian view, the penalty \Omega(\theta) corresponds to a prior

p(\theta) \;\propto\; \exp\!\big(-\lambda\,\Omega(\theta)\big).

Combining likelihood and prior gives the posterior

p(\theta\mid X,Y) \propto p(X\mid Y,\theta)\,p(\theta),

and the estimator becomes MAP:

\hat{\theta}_{\text{MAP}} = \arg\min_\theta \Big[ \ell(X,g_\theta(Y)) + \lambda\,\Omega(\theta) \Big].

Thus:

  • loss term → negative log-likelihood
  • regularizer → negative log-prior
  • regularized regression → MAP estimation of \theta
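A minimal numerical check of this correspondence for squared loss with a Gaussian prior (ridge): gradient descent on the MAP objective should converge to the closed-form ridge solution. All names and values below are illustrative synthetic data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 80, 4
Y = rng.normal(size=(n, d))
X = Y @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 2.0

# MAP objective for squared loss + Gaussian prior (ridge):
#   J(theta) = ||X - Y theta||^2 + lam ||theta||^2
theta = np.zeros(d)
step = 1.0 / (np.linalg.eigvalsh(Y.T @ Y).max() + lam)     # stable step size
for _ in range(2000):
    theta -= step * (Y.T @ (Y @ theta - X) + lam * theta)  # gradient of J / 2

# Closed-form ridge / MAP solution for comparison.
theta_map = np.linalg.solve(Y.T @ Y + lam * np.eye(d), Y.T @ X)
```

Both routes compute the same maximizer of the posterior, as the MAP interpretation predicts.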

4. Why Regularization Matters

  • Controls overfitting by penalizing overly flexible parameterizations
  • Incorporates prior knowledge, such as sparsity or smoothness
  • Improves conditioning and numerical stability (e.g., ridge regression)
  • Enhances generalization when sample size is limited
  • Acts as Bayesian complexity control via priors on \theta
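The conditioning point can be seen directly: with nearly collinear columns, Y^T Y is ill-conditioned, and adding \lambda I shrinks the condition number dramatically. A synthetic, illustrative example:

```python
import numpy as np

# Sketch: a near-collinear design makes the normal equations ill-conditioned;
# the ridge term lam * I bounds the smallest eigenvalue away from zero.
rng = np.random.default_rng(4)
n = 100
base = rng.normal(size=n)
Y = np.column_stack([base, base + 1e-4 * rng.normal(size=n)])  # near-collinear

A = Y.T @ Y
lam = 1.0
cond_plain = np.linalg.cond(A)               # huge: A is nearly singular
cond_ridge = np.linalg.cond(A + lam * np.eye(2))
```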

5. Connection to Standard Regression

Method             Regularizer         Bayesian prior                                  Interpretation
Ordinary LS        none                flat (improper) prior                           MLE of \theta
Ridge              \|\theta\|_2^2      Gaussian prior                                  MAP
Lasso              \|\theta\|_1        Laplace prior                                   MAP
Elastic Net        \ell_1 + \ell_2     product of Gaussian and Laplace priors          MAP
General penalty    \Omega(\theta)      p(\theta) \propto e^{-\lambda\,\Omega(\theta)}  MAP

Regularized regression therefore unifies classical optimization with Bayesian modeling of estimator parameters.