Ridge (L2) and Lasso (L1) Regression

Author

Ryan Giordano

Goals

  • Introduce ridge (L2) and Lasso (L1) regression
    • As a complexity penalty
    • As a tuneable hierarchy of models to be selected by cross-validation
    • As a Bayesian posterior estimate

Reading

These notes are a supplement to Section 6.2 of Gareth et al. (2021).

Approximately collinear regressors

Sometimes it makes sense to penalize very large values of $\hat\beta$.

Consider the following contrived example. Let

$$y_n = z_n + \varepsilon_n \quad\text{where}\quad z_n \sim \mathcal{N}(0,1) \quad\text{and}\quad \varepsilon_n \sim \mathcal{N}(0,1).$$

If we regress $y \sim \beta z$, then $\hat\beta \sim \mathcal{N}(1, 1/N)$ approximately, with no problems. But suppose we actually also observe $x_n = z_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \delta)$ for very small $\delta \ll 1$. Then $x_n$ and $z_n$ are highly correlated:

$$M_X := \mathbb{E}\left[\begin{pmatrix} x_n \\ z_n \end{pmatrix}\begin{pmatrix} x_n & z_n \end{pmatrix}\right] = \begin{pmatrix} 1 + \delta & 1 \\ 1 & 1 \end{pmatrix} \quad\Rightarrow\quad M_X^{-1} = \frac{1}{\delta}\begin{pmatrix} 1 & -1 \\ -1 & 1 + \delta \end{pmatrix}.$$

Therefore, if we try to regress $y \sim \beta_x x + \beta_z z$, we get

$$\mathrm{Cov}\left(\begin{pmatrix} \hat\beta_x \\ \hat\beta_z \end{pmatrix} \,\middle|\, X, Z\right) \approx \frac{1}{N} M_X^{-1} = \frac{1}{N\delta}\begin{pmatrix} 1 & -1 \\ -1 & 1 + \delta \end{pmatrix}.$$

Note that $\mathrm{Var}(\hat\beta_x) \approx \frac{1}{N\delta}$, which is very large for small $\delta$. With high probability, we will estimate a $\hat\beta$ with a very large magnitude $\|\hat\beta\|_2^2$, although its two components should be nearly the negatives of one another. This could be problematic in practice. For example, in our test set or application, if $x_n$ and $z_n$ are not as well-correlated as in our training set, this will lead to wild and highly variable predicted values.

Does it make sense to permit such a large variance? Wouldn’t it be better to choose a slightly smaller $\hat\beta$, which is in turn somewhat more “stable”?
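To make this concrete, here is a minimal numpy sketch of the contrived example above; the variable names and seed are purely illustrative. With $\delta$ small, the two OLS coefficients are individually huge and nearly opposite, even though their sum is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, delta = 100, 1e-4

z = rng.normal(0.0, 1.0, size=N)
x = z + rng.normal(0.0, np.sqrt(delta), size=N)   # x is nearly identical to z
y = z + rng.normal(0.0, 1.0, size=N)

# OLS on both regressors: beta_hat = (X'X)^{-1} X'y.
X = np.column_stack([x, z])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)        # typically two large coefficients of nearly opposite sign
print(beta_hat.sum())  # but their sum is stable (close to 1)
```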

Penalizing large regressors with ridge regression

Recall that one perspective on regression is that we choose $\hat\beta$ to minimize the loss

$$\hat\beta := \underset{\beta}{\mathrm{argmin}}\ \mathrm{RSS}(\beta), \quad\text{where}\quad \mathrm{RSS}(\beta) := \sum_{n=1}^N \left(y_n - \beta^\top x_n\right)^2.$$

We motivated this as an approximation to the expected predicted loss, $L(\beta) = \mathbb{E}_{(y_\mathrm{new}, x_\mathrm{new})}\left[(y_\mathrm{new} - \beta^\top x_\mathrm{new})^2\right]$. But that made sense when we had a fixed set of regressors, and we have shown that the correspondence breaks down when we are searching over the space of regressors. In particular, $\mathrm{RSS}(\hat\beta)$ always decreases as we add more regressors, but $L(\hat\beta)$ may not.

Instead, let us consider minimizing $\mathrm{RSS}(\beta)$, but with an additional penalty for a large $\hat\beta$. There are many ways to do this! But one convenient one is as follows. For now, pick a $\lambda$, and choose $\hat\beta$ to minimize:

$$\hat\beta(\lambda) := \underset{\beta}{\mathrm{argmin}}\ L_\mathrm{ridge}(\beta, \lambda), \quad\text{where}\quad L_\mathrm{ridge}(\beta, \lambda) := \mathrm{RSS}(\beta) + \lambda \|\beta\|_2^2 = \mathrm{RSS}(\beta) + \lambda \sum_{p=1}^P \beta_p^2.$$

This is known as ridge regression, or L2-penalized regression. The latter name refers to the fact that the penalty $\|\beta\|_2^2$ is the squared L2 norm of the coefficient vector; below we will study the L1 version, which is also known as the Lasso.

The term $\lambda \|\beta\|_2^2$ is known as a “regularizer,” since it imposes some “regularity” on the estimate $\hat\beta(\lambda)$. Note that

  • As $\lambda \to \infty$, $\hat\beta(\lambda) \to 0$.
  • When $\lambda = 0$, $\hat\beta(\lambda) = \hat\beta$, the OLS estimator.

So the effect of including $\lambda$ is to “shrink” the estimate $\hat\beta(\lambda)$ towards zero. Note that since the ridge loss adds an extra penalty on the norm, it is impossible for the OLS solution to have a smaller norm than the ridge solution.

The ridge regression regularizer has the considerable advantage that the optimum is available in closed form, since

$$
\begin{aligned}
L_\mathrm{ridge}(\beta, \lambda) &= (Y - X\beta)^\top (Y - X\beta) + \lambda \beta^\top \beta \\
&= Y^\top Y - 2 Y^\top X \beta + \beta^\top X^\top X \beta + \lambda \beta^\top \beta \\
&= Y^\top Y - 2 Y^\top X \beta + \beta^\top (X^\top X + \lambda I) \beta \\
\Rightarrow\quad \frac{\partial L_\mathrm{ridge}(\beta, \lambda)}{\partial \beta} &= -2 X^\top Y + 2 (X^\top X + \lambda I) \beta \\
\Rightarrow\quad \hat\beta(\lambda) &= (X^\top X + \lambda I)^{-1} X^\top Y.
\end{aligned}
$$

Note that $X^\top X + \lambda I$ is always invertible if $\lambda > 0$, even if $X$ is not of full column rank. In this sense, ridge regression can deal safely with collinearity.
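As a quick check of the closed form, here is a small numpy sketch (the simulated data and variable names are only for illustration). With $\lambda > 0$ the linear system is well conditioned even when the columns of $X$ are nearly collinear, and larger $\lambda$ shrinks the solution toward zero.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Compute beta_hat(lambda) = (X'X + lambda * I)^{-1} X'y."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

rng = np.random.default_rng(0)
N = 100
z = rng.normal(size=N)
X = np.column_stack([z + rng.normal(scale=1e-2, size=N), z])  # nearly collinear columns
y = z + rng.normal(size=N)

for lam in [0.0, 0.1, 1.0, 10.0]:
    # lam = 0 is OLS (wild, offsetting coefficients); larger lam shrinks them.
    print(lam, ridge_closed_form(X, y, lam))
```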

Exercise

Prove that $X^\top X + \lambda I$ is invertible if $\lambda > 0$. Hint: using the fact that $X^\top X$ is symmetric and positive semi-definite (since $v^\top X^\top X v = \|Xv\|_2^2 \ge 0$ for any $v$), find a lower bound on the smallest eigenvalue of $X^\top X + \lambda I$.

Standardizing regressors

Suppose we re-scale one of the regressors $x_{np}$ by some value $\alpha$ with $\alpha \ll 1$, i.e., we regress on $x'_{np} = \alpha x_{np}$ instead of $x_{np}$. As we know, the OLS minimizer becomes $\hat\beta'_p = \hat\beta_p / \alpha$ and the fitted values $\hat{Y}$ are unchanged at $\lambda = 0$. But for a particular $\lambda > 0$, how does this affect the ridge solution? We can write

$$\lambda \left(\hat\beta'_p\right)^2 = \frac{\lambda}{\alpha^2} \hat\beta_p^2.$$

That is, we will effectively “punish” large values of $\hat\beta'_p$ much more than we would “punish” the corresponding values of $\hat\beta_p$. In turn, for a particular $\lambda$, we will tend to shrink the rescaled coefficient more aggressively (although this is not necessarily the case).

The point is that re-scaling the regressors affects the meaning of $\lambda$. Correspondingly, if different regressors have very different typical scales, such as age versus income, then ridge regression will drive their coefficients to zero to very different degrees.
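A small sketch of this point, using simulated data and illustrative names: rescaling a column leaves the OLS fit unchanged but genuinely changes the ridge fit, because the same $\lambda$ now penalizes that regressor’s coefficient differently.

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=N)

alpha = 0.01
X_rescaled = X.copy()
X_rescaled[:, 0] *= alpha   # rescale the first regressor

# OLS (lam = 0): the coefficient rescales by 1/alpha and the fitted values are unchanged.
print(np.allclose(X @ ridge(X, y, 0.0), X_rescaled @ ridge(X_rescaled, y, 0.0)))  # True

# Ridge (lam > 0): the fitted values change, because lambda now "means" something different.
lam = 5.0
print(np.allclose(X @ ridge(X, y, lam), X_rescaled @ ridge(X_rescaled, y, lam)))  # False
```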

Similarly, it often doesn’t make sense to penalize the constant, so we might take $\beta_1$ to be the coefficient on the constant ($x_{n1} = 1$), and write

$$\hat\beta(\lambda) := \underset{\beta}{\mathrm{argmin}}\ \mathrm{RSS}(\beta) + \lambda \sum_{p=2}^P \beta_p^2.$$

But this gets tedious, and assumes we have included a constant.
Instead, we might invoke the FWL theorem, center the response and regressors at their mean values, and then do penalized regression.

For both these reasons, before performing ridge regression (or any other similar penalized regression), we typically standardize the regressors, defining

$$\tilde{x}_{np} := \frac{x_{np} - \bar{x}_p}{\sqrt{\frac{1}{N}\sum_{n=1}^N \left(x_{np} - \bar{x}_p\right)^2}}, \quad\text{where}\quad \bar{x}_p := \frac{1}{N}\sum_{n=1}^N x_{np}.$$

We then run ridge regression on $\tilde{x}$ rather than $x$ (see the sketch after this list), so that we

  • Don’t penalize the constant term and
  • Penalize every coefficient the same regardless of its regressor’s typical scale.
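Here is a minimal sketch of this preprocessing with numpy; the “age” and “income” variables are simulated purely to illustrate regressors with very different scales.

```python
import numpy as np

def standardize(X):
    """Center each column and divide by its (population) standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
N = 100
age = rng.uniform(20, 80, size=N)        # typical scale: tens
income = rng.uniform(2e4, 2e5, size=N)   # typical scale: tens of thousands
y = 0.5 * age + 1e-4 * income + rng.normal(size=N)

X_std = standardize(np.column_stack([age, income]))
y_centered = y - y.mean()                # centering y plays the role of the constant

print(ridge(X_std, y_centered, lam=1.0)) # both coefficients are now on a comparable scale
```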

A complexity penalty

Suppose that $y_n = \beta^\top x_n + \varepsilon_n$ for some true $\beta$. Note that, for a fixed $x_\mathrm{new}$ and fixed $X$, taking expectations over $y_\mathrm{new}$ and the training responses,

$$\mathbb{E}\left[y_\mathrm{new} - x_\mathrm{new}^\top \hat\beta(\lambda)\right] = x_\mathrm{new}^\top\left(\beta - \mathbb{E}[\hat\beta(\lambda)]\right) = x_\mathrm{new}^\top\left(I - (X^\top X + \lambda I)^{-1} X^\top X\right)\beta.$$

That is, as $\lambda$ grows, $\hat\beta(\lambda)$ becomes biased. However, by the same reasoning as in the standard case, under homoskedasticity,

$$\mathrm{Cov}(\hat\beta(\lambda)) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X\, (X^\top X + \lambda I)^{-1},$$

which is smaller than $\mathrm{Cov}(\hat\beta)$ in the sense that $\mathrm{Cov}(\hat\beta) - \mathrm{Cov}(\hat\beta(\lambda))$ is a positive definite matrix for $\lambda > 0$ and full-rank $X$. In this sense, $\hat\beta(\lambda)$ is a one-dimensional family of estimators that trades off bias and variance. We can thus use cross-validation to choose the $\lambda$ that minimizes the estimated MSE.
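A minimal sketch of choosing $\lambda$ by cross-validation with scikit-learn, whose `RidgeCV` calls the penalty weight `alpha`; the data here is simulated only for illustration, and the pipeline standardizes the regressors as discussed above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N, P = 200, 10
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(scale=2.0, size=N)

# Standardize, then pick lambda (sklearn's `alpha`) over a grid by 5-fold cross-validation.
lambdas = np.logspace(-3, 3, 25)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=lambdas, cv=5))
model.fit(X, y)

print(model.named_steps["ridgecv"].alpha_)  # the lambda chosen by cross-validation
```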

Constrained optimization and the minimum norm interpolator

We can rewrite the ridge loss in a suggestive way. Fix $\lambda = \hat\lambda$, write $b = \|\hat\beta(\hat\lambda)\|_2^2$, and write

$$L_\mathrm{ridge}(\beta, \hat\lambda) = \mathrm{RSS}(\beta) + \hat\lambda\left(\|\beta\|_2^2 - b\right) + \hat\lambda b.$$

Since $b$ is fixed (the final term does not depend on $\beta$), $\hat\beta(\hat\lambda)$ still satisfies

$$\left.\frac{\partial L_\mathrm{ridge}(\beta, \lambda)}{\partial \beta}\right|_{\hat\beta, \hat\lambda} = \left.\frac{\partial\, \mathrm{RSS}(\beta)}{\partial \beta}\right|_{\hat\beta} + \hat\lambda \left.\frac{\partial \|\beta\|_2^2}{\partial \beta}\right|_{\hat\beta} = 0.$$

So the optimum is actually the same. The loss $L_\mathrm{ridge}(\beta, \lambda)$ is (up to a constant) the Lagrange multiplier version of the constrained optimization problem

$$\hat\beta(b) := \underset{\beta:\ \|\beta\|_2^2 \le b}{\mathrm{argmin}}\ \mathrm{RSS}(\beta).$$

Taking $b = \|\hat\beta(\lambda)\|_2^2$, we see that for every $\lambda$ there is a $b$, and vice versa. Ridge regression is thus equivalent to minimizing the squared error loss subject to the constraint that $\beta$ lies within an L2 ball. Making the ball larger allows a better fit to the data, but at the cost of a larger $\beta$. This intuition is particularly useful when contrasting ridge regression with the Lasso.

This intuition is also useful for understanding what happens as $\lambda \to 0$. Write $r = \mathrm{RSS}(\hat\beta(\hat\lambda))$, and note that we could also have written

$$\frac{1}{\hat\lambda}\left(\mathrm{RSS}(\beta) - r\right) + \|\beta\|_2^2 = \frac{1}{\hat\lambda}\left(L_\mathrm{ridge}(\beta, \hat\lambda) - r\right),$$

which differs from the ridge loss only by a positive rescaling and a constant shift. As before, for fixed $\hat\lambda$ (and so fixed $r$), this objective has the same minimizer $\hat\beta(\hat\lambda)$. However, it is also the Lagrange multiplier version of

$$\hat\beta(r) := \underset{\beta:\ \mathrm{RSS}(\beta) \le r}{\mathrm{argmin}}\ \|\beta\|_2^2.$$

We can thus equivalently interpret the ridge estimator as the $\beta$ with the smallest L2 norm, subject to the RSS being no larger than $r$. As $\lambda \to 0$, infinite weight gets put on the $\mathrm{RSS}(\beta)$ term, and $r \to 0$ if $X$ has rank $N$ (which requires $P \ge N$). So we see that

$$\lim_{r \to 0} \hat\beta(r) = \underset{\beta:\ \mathrm{RSS}(\beta) = 0}{\mathrm{argmin}}\ \|\beta\|_2^2.$$

That is, when there are many $\beta$ with $\mathrm{RSS}(\beta) = 0$, i.e., which “interpolate” the data, the limiting ridge estimator chooses the one with minimum norm. This is called the “ridgeless interpolator.” (See the “ridgeless” lecture in Tibshirani (2023).)
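A small numerical sketch of this limit, with simulated data and more regressors than observations: the minimum-norm interpolator can be computed with the pseudoinverse, and ridge with a tiny $\lambda$ essentially reproduces it.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 50                    # more regressors than observations
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

# Minimum-norm interpolator: the least-squares solution of smallest L2 norm.
beta_min_norm = np.linalg.pinv(X) @ y

# Ridge with a tiny lambda approaches the same solution.
lam = 1e-8
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

print(np.allclose(X @ beta_min_norm, y))           # True: it interpolates the data exactly
print(np.max(np.abs(beta_ridge - beta_min_norm)))  # tiny: ridge -> "ridgeless" as lambda -> 0
```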

A Bayesian posterior

Another way to interpret the ridge penalty is as a Bayesian posterior mean. If

$$\beta \sim \mathcal{N}(0, \sigma_\beta^2 I) \quad\text{and}\quad Y \mid \beta, X \sim \mathcal{N}(X\beta, \sigma^2 I),$$

then

$$\beta \mid Y, X \sim \mathcal{N}\left(\left(X^\top X + \frac{\sigma^2}{\sigma_\beta^2} I\right)^{-1} X^\top Y,\ \ \sigma^2 \left(X^\top X + \frac{\sigma^2}{\sigma_\beta^2} I\right)^{-1}\right).$$

One way to derive this is to recognize that, if $\beta \sim \mathcal{N}(\mu, \Sigma)$, then

$$\log P(\beta \mid \mu, \Sigma) = -\frac{1}{2}\beta^\top \Sigma^{-1} \beta + \beta^\top \Sigma^{-1} \mu + \text{terms that do not depend on } \beta.$$

We can write out the distribution $P(\beta \mid Y) = P(\beta, Y) / P(Y)$, gather the terms that depend on $\beta$, and read off the mean and covariance:

$$
\begin{aligned}
\log P(\beta, Y) &= -\frac{1}{2\sigma^2}(Y - X\beta)^\top (Y - X\beta) - \frac{1}{2\sigma_\beta^2}\beta^\top \beta + \text{terms that do not depend on } \beta \\
&= -\frac{1}{2\sigma^2}\beta^\top X^\top X \beta + \frac{1}{\sigma^2}\beta^\top X^\top Y - \frac{1}{2\sigma_\beta^2}\beta^\top \beta + \text{terms that do not depend on } \beta \\
&= -\frac{1}{2}\beta^\top \left(\frac{1}{\sigma^2} X^\top X + \frac{1}{\sigma_\beta^2} I\right)\beta + \frac{1}{\sigma^2}\beta^\top X^\top Y + \text{terms that do not depend on } \beta.
\end{aligned}
$$

From this, we can read off Σ and μ, and get the above expression.

If we take $\lambda = \sigma^2 / \sigma_\beta^2$, then we can see that

$$\mathbb{E}[\beta \mid Y, X] = (X^\top X + \lambda I)^{-1} X^\top Y = \hat\beta(\lambda).$$

This gives the ridge procedure some interpretability. First of all, the use of the ridge penalty corresponds to a prior belief that β is not too large.

Second, the ridge penalty you use reflects the relative scale of the noise variance and the prior variance in a way that makes sense (a numerical check follows the list below):

  • If $\sigma \gg \sigma_\beta$, then the data is noisy (relative to our prior beliefs). We should not take fitting the data too seriously, and so should estimate a smaller $\beta$ than $\hat\beta$. And indeed, in this case $\lambda$ is large, and a large $\lambda$ shrinks the estimated coefficients.
  • If $\sigma_\beta \gg \sigma$, then we find it plausible that $\beta$ is very large (relative to the variability in our data). We should not take our prior beliefs too seriously, and should estimate a coefficient that matches $\hat\beta$. And indeed, in this case $\lambda$ is small, and we do not shrink the coefficients much.
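Here is the promised numerical check, with illustrative values of $\sigma$ and $\sigma_\beta$ and simulated data: the posterior mean computed from the standard Gaussian conditioning formula agrees with the ridge closed form at $\lambda = \sigma^2 / \sigma_\beta^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
Y = X @ rng.normal(size=P) + rng.normal(size=N)

sigma, sigma_beta = 1.0, 2.0    # noise and prior standard deviations (illustrative)
lam = sigma**2 / sigma_beta**2  # the implied ridge penalty

# Ridge estimate from the closed form derived earlier.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

# Posterior mean via Gaussian conditioning on the joint distribution of (beta, Y):
# E[beta | Y] = Cov(beta, Y) Cov(Y)^{-1} Y.
cov_Y = sigma_beta**2 * X @ X.T + sigma**2 * np.eye(N)
posterior_mean = sigma_beta**2 * X.T @ np.linalg.solve(cov_Y, Y)

print(np.allclose(beta_ridge, posterior_mean))  # True: the two computations agree
```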

Sparse regression with the L1 norm (lasso)

One problem with the L2 approach is that the solution $\hat\beta_{L_2}(\lambda)$ is still “dense,” meaning that, in general, every entry of it is nonzero, and we still have to invert a $P \times P$ matrix.

For example, consider our highly correlated regressor example from earlier. Ridge regression will still include both regressors, and their coefficient estimates will still be highly negatively correlated, but both will be shrunk towards zero. Maybe it would make more sense to select only one variable to include. Let us try to think of how we can change the penalty term to achieve this.

A “sparse” solution is an estimator $\hat\beta$ in which many of the entries are zero, that is, an estimated regression line that does not use many of the available regressors.

In short: ridge regression estimates are not sparse. Let’s try to derive an estimator that is, by changing the penalty.

A very intuitive way to produce a sparse estimate is as follows:

$$\hat\beta_{L_0}(\lambda) := \underset{\beta}{\mathrm{argmin}}\left(\mathrm{RSS}(\beta) + \lambda \sum_p \mathbb{1}\left(\beta_p \ne 0\right)\right) \quad\text{(practically difficult)}.$$

This trades off fit to the data against a penalty for using more regressors. This makes sense, but it is very difficult to compute; in particular, the objective is very non-convex. Bayesian statisticians do attempt to estimate models with a similar kind of penalty (they are called “spike and slab” models), but they are extremely computationally intensive and beyond the scope of this course.
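To see the computational difficulty concretely, here is a brute-force sketch of the L0 problem for a handful of regressors (the simulated data and penalty value are illustrative): it must enumerate all $2^P$ subsets, which is hopeless for even moderately large $P$.

```python
import itertools

import numpy as np

def best_subset_l0(X, y, lam):
    """Brute-force minimization of RSS(beta) + lam * (number of nonzero coefficients)."""
    N, P = X.shape
    best_loss, best_beta = np.inf, np.zeros(P)
    for k in range(P + 1):
        for subset in itertools.combinations(range(P), k):  # 2^P subsets in total
            beta = np.zeros(P)
            if subset:
                S = list(subset)
                beta[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
            loss = np.sum((y - X @ beta) ** 2) + lam * k
            if loss < best_loss:
                best_loss, best_beta = loss, beta
    return best_beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
print(best_subset_l0(X, y, lam=10.0))  # with this penalty, typically keeps only columns 0 and 2
```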

A convex approximation to the preceding loss is the L1 or Lasso loss, leading to Lasso or L1 regression:

$$\hat\beta_{L_1}(\lambda) := \underset{\beta}{\mathrm{argmin}}\left(\mathrm{RSS}(\beta) + \lambda \sum_p |\beta_p|\right) = \underset{\beta}{\mathrm{argmin}}\left(\mathrm{RSS}(\beta) + \lambda \|\beta\|_1\right).$$

This loss is convex (because it is the sum of two convex functions), and so is much easier to minimize. Furthermore, as $\lambda$ grows, it does produce sparser and sparser solutions, though this may not be obvious at first.

Just as with ridge regression, you should standardize variables before applying the Lasso.
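A minimal sketch with scikit-learn’s `Lasso` on simulated data; note that sklearn minimizes $\mathrm{RSS}/(2N) + \alpha\|\beta\|_1$, so its `alpha` is a rescaled version of the $\lambda$ above. Only the truly relevant coefficients typically survive.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N, P = 200, 20
X = rng.normal(size=(N, P))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)  # only the first two regressors matter

X_std = StandardScaler().fit_transform(X)

# sklearn's Lasso minimizes RSS / (2N) + alpha * ||beta||_1, so `alpha` is a rescaled lambda.
fit = Lasso(alpha=0.3).fit(X_std, y)
print(np.flatnonzero(fit.coef_))  # typically just [0 1]: the estimate is sparse
```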

The Lasso produces sparse solutions

One way to see that the Lasso produces sparse solutions is to start with a very large $\lambda$ and see what happens as it is slowly decreased.

Start at $\lambda$ very large, so that $\hat\beta_{L_1}(\lambda) = 0$. If we take a small (signed) step of size $\delta$ away from zero in entry $\beta_p$, then $\lambda \|\hat\beta\|_1$ increases by $|\delta|\lambda$, and the RSS changes by approximately $\delta$ times the derivative of the squared error,

$$-2\delta \sum_{n=1}^N \left(y_n - \hat\beta(\lambda)^\top x_n\right) x_{np} = -2\delta \sum_{n=1}^N \hat\varepsilon_n x_{np} = -2\delta \sum_{n=1}^N y_n x_{np} \quad \text{(because } \hat\beta(\lambda) = 0\text{)}.$$

As long as $2\left|\sum_{n=1}^N y_n x_{np}\right| < \lambda$ for all $p$, the decrease in the RSS cannot offset the increase in the penalty, so we cannot improve the loss by moving away from $0$. Since the loss is convex, that means $0$ is the minimum.

Eventually, we decrease $\lambda$ until $2\left|\sum_{n=1}^N y_n x_{np}\right| = \lambda$ for some $p$. At that point, $\beta_p$ moves away from zero as $\lambda$ decreases, and the $\hat\varepsilon_n$ also change. However, until $2\left|\sum_{n=1}^N \hat\varepsilon_n x_{nq}\right| = \lambda$ for some other $q$, only $\beta_p$ will be nonzero. As $\lambda$ decreases, more and more variables tend to get added to the model, until $\lambda = 0$, when of course $\hat\beta_{L_1}(0) = \hat\beta$, the OLS solution. Along the path, variables may come in and out of the regression in complex ways.
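Here is a sketch of this path using scikit-learn’s `lasso_path` (again, its `alpha` is a rescaled $\lambda$, and the data is simulated): as the penalty decreases, the number of nonzero coefficients (mostly) grows.

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N, P = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(N, P)))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(size=N)
y = y - y.mean()

# Compute the whole path of solutions as the penalty decreases.
alphas, coefs, _ = lasso_path(X, y)

# Count the nonzero coefficients at a few penalty values along the path.
for a, c in zip(alphas[::10], coefs[:, ::10].T):
    print(f"alpha = {a:8.4f}   nonzero coefficients = {np.count_nonzero(c)}")
```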

The Lasso as a constrained optimization problem

See Figure 6.7 of Gareth et al. (2021) for an interpretation of the Lasso as a constrained optimization problem. The shape of the L1 ball provides an intuitive way to understand the sparsity of the solution compared to ridge.

The Bayesian Lasso is not sparse

In contrast to the ridge case, it is not hard to show that if you take a prior

$$P(\beta) \propto \exp\left(-\lambda \|\beta\|_1\right),$$

you do not recover a sparse solution for the posterior mean! The difference is that, for the ridge prior, the posterior remained normal, so that the “maximum a posteriori” (MAP) estimator was equal to the mean. In the Lasso case, the MAP is sparse, but the mean is not, and the two do not coincide because the posterior is not normal.

A more Bayesian way to produce sparse posteriors is by setting a non-zero prior probability that $\beta_p = 0$. These are called “spike and slab priors,” and are beyond the scope of this course.

References

Gareth, J., W. Daniela, H. Trevor, and T. Robert. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.
Tibshirani, R. 2023. “Advanced Topics in Statistical Learning: Spring 2023.” https://www.stat.berkeley.edu/~ryantibs/statlearn-s23/.