Regularized Regression
Regularized regression extends standard parameter estimation by incorporating prior knowledge or structural assumptions through a penalty on the model parameters. The setting is a parametric estimator g_\theta : Y \mapsto \hat X with parameters \theta, trained from supervised data (X, Y).
1. Problem Setup
Given a loss function \ell(X, g_\theta(Y)), the goal is to find parameters \theta that minimize the expected prediction error:
\min_\theta \; \mathbb{E}\big[\ell(X, g_\theta(Y))\big].
2. Adding Regularization
To prevent overfitting and impose structure, add a penalty term:
\hat{\theta} = \arg\min_\theta \left\{ \mathbb{E}\big[\ell(X, g_\theta(Y))\big] + \lambda\,\Omega(\theta) \right\}.
Typical choices of \Omega(\theta):
- Ridge: \Omega(\theta)=\|\theta\|_2^2
- Lasso: \Omega(\theta)=\|\theta\|_1
- Elastic Net: a weighted combination of the \ell_1 and \ell_2 penalties
- General priors: smoothness, group sparsity, low-rank, etc.
The parameter \lambda \ge 0 controls the trade-off between fitting the data and enforcing the penalty: \lambda = 0 recovers the unregularized estimator, while larger \lambda imposes the assumed structure more strongly.
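A minimal sketch of the objective above, assuming a linear estimator g_\theta(Y) = Y\theta, squared loss, and the ridge penalty (the data here is synthetic, chosen only for illustration). In this special case the regularized minimizer has a closed form:

```python
import numpy as np

# Assumed setup: g_theta(Y) = Y @ theta, squared loss, Omega(theta) = ||theta||_2^2.
# The minimizer of ||X - Y theta||^2 + lam * ||theta||_2^2 is
# theta_hat = (Y^T Y + lam * I)^{-1} Y^T X.
rng = np.random.default_rng(0)
n, d = 50, 5
Y = rng.normal(size=(n, d))                       # observed inputs
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = Y @ theta_true + 0.1 * rng.normal(size=n)     # noisy targets

def ridge(Y, X, lam):
    """Closed-form minimizer of ||X - Y theta||^2 + lam * ||theta||_2^2."""
    return np.linalg.solve(Y.T @ Y + lam * np.eye(Y.shape[1]), Y.T @ X)

theta_ols   = ridge(Y, X, 0.0)    # lam = 0 recovers ordinary least squares
theta_ridge = ridge(Y, X, 10.0)   # larger lam shrinks theta toward zero
```

Increasing \lambda shrinks the norm of the solution, which is the trade-off the text describes.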
3. Bayesian Interpretation
3.1 Likelihood Interpretation of the Loss
The parametric regression objective \ell(X, g_\theta(Y)) can be viewed as a negative log-likelihood for the parameter vector \theta. Define a conditional likelihood model:
p(X \mid Y, \theta) \;\propto\; \exp\!\big(-\ell(X, g_\theta(Y))\big).
Then minimizing empirical loss is equivalent to maximum likelihood estimation (MLE) of \theta:
\hat{\theta}_{\text{MLE}} = \arg\min_\theta \ell(X, g_\theta(Y)) = \arg\max_\theta p(X \mid Y, \theta).
Thus, empirical risk minimization can be reinterpreted as a likelihood-based estimation of the parameters.
3.2 Gaussian-Noise Interpretation for Squared Loss
For the common squared loss \ell(X,g_\theta(Y)) = \|X - g_\theta(Y)\|^2, the corresponding likelihood is
p(X\mid Y,\theta) \;\propto\; \exp\!\left(-\frac{1}{2\sigma^2}\,\|X - g_\theta(Y)\|^2\right).
This is exactly the likelihood of the additive-noise model
X = g_\theta(Y) + \varepsilon, \qquad \varepsilon \sim \mathcal N(0, \sigma^2 I).
The factor 1/(2\sigma^2) rescales the loss without changing its minimizer, so minimizing squared loss is MLE for \theta under the assumption of Gaussian prediction noise.
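The equivalence can be checked numerically (a sketch; sigma and the sample points are arbitrary): the exact Gaussian negative log-likelihood is the squared loss scaled by 1/(2\sigma^2) plus a constant that does not depend on the prediction, so the two objectives differ only by an affine transformation and share the same minimizer.

```python
import numpy as np

def gaussian_nll(x, mean, sigma):
    # -log of prod_i N(x_i; mean_i, sigma^2), written out term by term
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mean) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(1)
sigma = 0.7
x = rng.normal(size=4)
g_a, g_b = rng.normal(size=4), rng.normal(size=4)  # two candidate predictions

# Differences in NLL depend only on differences in squared loss,
# scaled by 1/(2 sigma^2): the constant term cancels.
delta_nll  = gaussian_nll(x, g_a, sigma) - gaussian_nll(x, g_b, sigma)
delta_loss = np.sum((x - g_a) ** 2) - np.sum((x - g_b) ** 2)
```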
3.3 Regularization as a Prior on the Parameters
In a Bayesian view, the penalty \Omega(\theta) corresponds to a prior
p(\theta) \;\propto\; \exp\!\big(-\lambda\,\Omega(\theta)\big).
Combining likelihood and prior gives the posterior
p(\theta\mid X,Y) \propto p(X\mid Y,\theta)\,p(\theta),
and the estimator becomes MAP:
\hat{\theta}_{\text{MAP}} = \arg\min_\theta \Big[ \ell(X,g_\theta(Y)) + \lambda\,\Omega(\theta) \Big].
Thus:
- loss term → negative log-likelihood
- regularizer → negative log-prior
- regularized regression → MAP estimation of \theta
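The MAP correspondence can be verified for the Gaussian special case (a sketch with invented dimensions and noise levels): for a linear model with Gaussian noise \sigma^2 and a Gaussian prior \theta \sim \mathcal N(0, \tau^2 I), the posterior mode is exactly the ridge solution with \lambda = \sigma^2/\tau^2.

```python
import numpy as np

# Assumed setup: X = Y theta + eps, eps ~ N(0, sigma^2 I), prior theta ~ N(0, tau^2 I).
# The MAP estimate maximizes the posterior, i.e. minimizes
# ||X - Y t||^2 / (2 sigma^2) + ||t||^2 / (2 tau^2),
# whose minimizer is the ridge solution with lam = sigma^2 / tau^2.
rng = np.random.default_rng(2)
n, d = 40, 3
Y = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma, tau = 0.5, 1.5
X = Y @ theta_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2
theta_map = np.linalg.solve(Y.T @ Y + lam * np.eye(d), Y.T @ X)

def neg_log_post(t):
    """Negative log-posterior up to an additive constant."""
    return np.sum((X - Y @ t) ** 2) / (2 * sigma**2) + np.sum(t**2) / (2 * tau**2)

# The closed-form ridge solution should beat random perturbations of itself.
for _ in range(20):
    t = theta_map + 0.1 * rng.normal(size=d)
    assert neg_log_post(theta_map) <= neg_log_post(t)
```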
4. Why Regularization Matters
- Controls overfitting by penalizing overly flexible parameterizations
- Incorporates prior knowledge, such as sparsity or smoothness
- Improves conditioning and numerical stability (e.g., ridge regression)
- Enhances generalization when sample size is limited
- Acts as Bayesian complexity control via priors on \theta
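The conditioning point above can be made concrete (the near-collinear features are invented for illustration): when two columns of Y are nearly identical, Y^T Y is close to singular, and adding \lambda I, as ridge regression effectively does, bounds the smallest eigenvalue away from zero and lowers the condition number.

```python
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.normal(size=100)
y2 = y1 + 1e-6 * rng.normal(size=100)   # near-duplicate column
Y = np.column_stack([y1, y2])

gram = Y.T @ Y
cond_plain = np.linalg.cond(gram)                     # huge: nearly singular
cond_ridge = np.linalg.cond(gram + 1.0 * np.eye(2))   # bounded by the lam*I term
```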
5. Connection to Standard Regression
| Method | Regularizer | Bayesian prior | Interpretation |
|---|---|---|---|
| Ordinary LS | none | flat prior | MLE of \theta |
| Ridge | \|\theta\|_2^2 | Gaussian prior | MAP |
| Lasso | \|\theta\|_1 | Laplace prior | MAP |
| Elastic Net | mix | mixed prior | MAP |
| General penalties | \Omega(\theta) | \propto e^{-\lambda \Omega(\theta)} | MAP |
Regularized regression therefore unifies classical optimization with Bayesian modeling of estimator parameters.
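The qualitative difference between the table's rows can be seen numerically (a sketch; the ISTA solver and synthetic data are written for illustration, not taken from the text): the \ell_1 penalty, i.e. the Laplace prior, drives some coordinates of \hat\theta exactly to zero, whereas the \ell_2 penalty, i.e. the Gaussian prior, only shrinks them.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrinks toward 0, exact zeros appear."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Y, X, lam, steps=5000):
    """Proximal gradient (ISTA) on ||X - Y theta||^2 + lam * ||theta||_1."""
    L = 2 * np.linalg.norm(Y, 2) ** 2     # Lipschitz constant of the gradient
    theta = np.zeros(Y.shape[1])
    for _ in range(steps):
        grad = 2 * Y.T @ (Y @ theta - X)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(4)
Y = rng.normal(size=(60, 8))
theta_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.5])   # sparse ground truth
X = Y @ theta_true + 0.05 * rng.normal(size=60)

theta_lasso = lasso_ista(Y, X, lam=5.0)   # expect exact zeros on the zero coords
```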