Ridge (L2) and Lasso (L1) Regression
Goals
- Introduce ridge (L2) and Lasso (L1) regression:
  - As a complexity penalty
  - As a tuneable hierarchy of models to be selected by cross-validation
  - As a Bayesian posterior estimate
Reading
These notes are a supplement to Section 6.2 of Gareth et al. (2021).
Approximately collinear regressors
Sometimes it makes sense to penalize very large values of the estimated coefficients $\hat{\beta}$.
Consider the following contrived example. Let $x_{n2} = x_{n1} + \delta_n$, where $\delta_n$ is a small amount of noise, so that the two regressors are approximately collinear.
If we regress $y_n$ on $x_{n1}$ alone, we get a sensible, stable estimate; and since $x_{n2} \approx x_{n1}$, any pair of coefficients whose sum is close to that estimate fits the data about as well.
Therefore, if we try to regress $y_n$ on both regressors, the individual coefficients are poorly determined: we can add a large amount to $\hat{\beta}_1$ and subtract a similar amount from $\hat{\beta}_2$ with almost no change in the fit.
Note that this shows up in the covariance of the OLS estimator, $\mathrm{Cov}(\hat{\beta} \mid X) = \sigma^2 (X^\top X)^{-1}$, which becomes very large because $X^\top X$ is nearly singular.
Does it make sense to permit such a large variance? Wouldn’t it be better to choose slightly smaller coefficients that fit the data nearly as well, but vary far less from dataset to dataset?
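To make the instability concrete, here is a minimal simulation sketch (not from the notes; the design, with $x_2$ equal to $x_1$ plus a small amount of noise and a true model depending only on $x_1$, is an illustrative assumption). It shows that the individual OLS coefficients vary wildly across datasets even though their sum is stable.

```python
# A minimal simulation (illustrative; not from the notes): two nearly
# identical regressors make the individual OLS coefficients extremely
# variable across datasets, even though their sum stays stable.
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 100, 500
betas = []
for _ in range(n_sims):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)     # approximately collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(size=n)             # true relationship uses only x1
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))

betas = np.array(betas)
print("std dev of (beta1, beta2):", betas.std(axis=0))        # very large
print("std dev of beta1 + beta2: ", betas.sum(axis=1).std())  # much smaller
```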
Penalizing large coefficients with ridge regression
Recall that one perspective on regression is that we choose $\hat{\beta}$ to minimize the sum of squared residuals,
$$\hat{\beta} := \underset{\beta}{\mathrm{argmin}} \; \sum_{n=1}^N (y_n - \beta^\top x_n)^2.$$
We motivated this as (proportional to) an approximation of the expected prediction loss, $\mathbb{E}\left[(y^{\mathrm{new}} - \beta^\top x^{\mathrm{new}})^2\right]$.
Instead, let us consider minimizing the penalized loss
$$L_\lambda(\beta) := \sum_{n=1}^N (y_n - \beta^\top x_n)^2 + \lambda \|\beta\|_2^2 \quad \text{for some } \lambda \ge 0,$$
and write $\hat{\beta}_\lambda := \underset{\beta}{\mathrm{argmin}} \; L_\lambda(\beta)$ for its minimizer.
This is known as ridge regression, or L2–penalized regression. The latter name comes from the fact that the penalty is the squared L2 norm of the coefficient vector, $\|\beta\|_2^2 = \sum_{p=1}^P \beta_p^2$.
The penalty term $\lambda \|\beta\|_2^2$ discourages large coefficients, and the tuning parameter $\lambda$ controls how strongly:
- As $\lambda \to \infty$, then $\hat{\beta}_\lambda \to 0$.
- When $\lambda = 0$, then $\hat{\beta}_\lambda = \hat{\beta}_{\mathrm{OLS}}$, the OLS estimator.
So the inclusion of the penalty shrinks the estimated coefficients towards zero, with $\lambda$ interpolating between the unpenalized OLS fit and no fit at all.
The ridge regression regularizer has the considerable advantage that the optimum is available in closed form, since setting the gradient $\nabla_\beta L_\lambda(\beta) = -2 X^\top (Y - X\beta) + 2\lambda\beta$ to zero gives
$$\hat{\beta}_\lambda = (X^\top X + \lambda I_P)^{-1} X^\top Y.$$
Note that $X^\top X + \lambda I_P$ is invertible for every $\lambda > 0$, even when $X^\top X$ itself is not (for example, when the regressors are exactly collinear or when $P > N$), so the ridge estimator is well defined even when the OLS estimator is not.
Prove that $X^\top X + \lambda I_P$ is invertible for every $\lambda > 0$.
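As a quick sanity check on the closed form, the following sketch (simulated data; scikit-learn's `Ridge` is assumed available, with `fit_intercept=False` so the objectives match) compares the formula above to a library fit.

```python
# A minimal sketch (simulated data) checking the closed form
# (X^T X + lambda I)^{-1} X^T Y against a library ridge fit.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, P, lam = 50, 5, 2.0
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)

beta_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# sklearn's Ridge minimizes ||y - X beta||^2 + alpha * ||beta||^2, so alpha
# plays the role of lambda here; fit_intercept=False makes the comparison exact.
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed_form, beta_sklearn))  # True
```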
Standardizing regressors
Suppose we re-scale one of the regressors, say replacing $x_{np}$ with $x_{np} / 1000$ (switching from meters to kilometers, for example). To produce the same fitted values, the corresponding coefficient must become $1000 \beta_p$, and so it incurs a penalty that is $1000^2$ times larger.
That is, we will effectively “punish” large values of a coefficient simply because of the units in which its regressor happens to be measured, even though the fit to the data is unchanged.
The point is that re-scaling the regressors affects the meaning of the penalty $\lambda \|\beta\|_2^2$, and therefore the ridge solution, even though it has no effect on the unpenalized least-squares fit.
Similarly, it often doesn’t make sense to penalize the constant, so we might take the penalty to be $\lambda \sum_{p=2}^P \beta_p^2$, leaving the intercept unpenalized.
But this gets tedious, and assumes we have included a constant.
Instead, we might invoke the FWL theorem, center the response and regressors at their mean values, and then do penalized regression.
For both these reasons, before performing ridge regression (or any other similar penalized regression), we typically standardize the regressors, defining $\tilde{x}_{np} := (x_{np} - \bar{x}_p) / \hat{\sigma}_p$, where $\bar{x}_p$ and $\hat{\sigma}_p$ are the sample mean and standard deviation of the $p$-th regressor, and center the response, $\tilde{y}_n := y_n - \bar{y}$.
We then run ridge regression of $\tilde{y}_n$ on $\tilde{x}_n$, without an intercept. In this way, we
- Don’t penalize the constant term and
- Penalize every coefficient the same regardless of its regressor’s typical scale.
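Here is a minimal sketch of this recipe on simulated data (the variable names and scales are invented for illustration): standardize the regressors, center the response, and fit ridge without an intercept.

```python
# A minimal sketch of the standardization recipe: standardize the regressors,
# center the response, and run ridge regression without an intercept.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N = 200
income = rng.normal(50_000, 10_000, size=N)   # large-scale regressor
age = rng.normal(40, 10, size=N)              # small-scale regressor
X = np.column_stack([income, age])
y = 1e-4 * income + 0.5 * age + rng.normal(size=N)

X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized regressors
y_tilde = y - y.mean()                           # centered response

fit = Ridge(alpha=10.0, fit_intercept=False).fit(X_tilde, y_tilde)
print(fit.coef_)  # coefficients now on a comparable per-standard-deviation scale
```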
A complexity penalty
Suppose that we compute the ridge estimator $\hat{\beta}_\lambda$ over a range of values of $\lambda$. Each value of $\lambda$ effectively restricts how large the fitted coefficients can be, with larger $\lambda$ allowing the model less flexibility to track the data.
That is, as $\lambda$ increases, we fit from a more and more constrained class of models,
which is smaller than the class available to unpenalized least squares. In this sense the ridge penalty acts as a complexity penalty, and $\lambda$ indexes a tuneable hierarchy of models of decreasing complexity, from the OLS fit at $\lambda = 0$ down to the trivial fit $\hat{\beta}_\lambda \to 0$ as $\lambda \to \infty$. We can then select $\lambda$, and so the model complexity, by cross-validation.
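As a sketch of that selection step (simulated data; the $\lambda$ grid and the 5-fold split are arbitrary illustrative choices), scikit-learn's `RidgeCV` performs the grid search directly.

```python
# A minimal sketch of choosing lambda by cross-validation over a grid,
# using RidgeCV on standardized, simulated data.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
N, P = 100, 20
X = rng.normal(size=(N, P))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)

X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)
y_tilde = y - y.mean()

lambdas = np.logspace(-3, 3, 50)
cv_fit = RidgeCV(alphas=lambdas, fit_intercept=False, cv=5).fit(X_tilde, y_tilde)
print("lambda selected by 5-fold cross-validation:", cv_fit.alpha_)
```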
Constrained optimization and the minimum norm interpolator
We can rewrite the ridge loss in a suggestive way. Fix $\lambda$, let $\hat{\beta}_\lambda$ denote the ridge solution, and set $C := \|\hat{\beta}_\lambda\|_2^2$. Consider the constrained problem of minimizing $\sum_n (y_n - \beta^\top x_n)^2$ subject to $\|\beta\|_2^2 \le C$.
Since any $\beta$ satisfying the constraint pays no more penalty than $\hat{\beta}_\lambda$ does, a $\beta$ in the constraint set with a strictly smaller residual sum of squares would also have a strictly smaller penalized loss, contradicting the optimality of $\hat{\beta}_\lambda$.
So the optimum is actually the same: the ridge estimator solves the constrained problem with budget $C$. The loss can therefore be read as “fit the data as well as possible, subject to a budget on the overall size of the coefficients.”
Taking $\lambda$ larger corresponds to a smaller budget $C$, and taking $\lambda \to 0$ corresponds to a budget so large that the constraint no longer binds.
This intuition is also useful when understanding what happens as $\lambda \to 0$ in problems with more regressors than observations ($P > N$), where many different $\beta$ fit the training data exactly.
As before, for fixed $\lambda$ the ridge estimator can also be viewed as minimizing $\|\beta\|_2^2$ subject to the residual sum of squares being no larger than that achieved by $\hat{\beta}_\lambda$ itself.
We can thus equivalently interpret the ridge estimator, in the $\lambda \to 0$ limit, as the one that produces the smallest $\|\beta\|_2^2$ among all $\beta$ that minimize the residual sum of squares.
That is, when there are many $\beta$ that fit the training data perfectly, the small-$\lambda$ ridge estimator approaches the interpolating solution with the smallest L2 norm: the “minimum norm interpolator.”
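The following sketch illustrates this limit numerically (simulated data with $P > N$; the dimensions and the tiny value of $\lambda$ are arbitrary): a ridge fit with a very small penalty essentially coincides with the minimum-norm solution computed by the pseudoinverse.

```python
# A minimal sketch showing that as lambda -> 0 the ridge estimator approaches
# the minimum-norm interpolating solution, which the pseudoinverse computes directly.
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 50                       # more regressors than observations
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

beta_min_norm = np.linalg.pinv(X) @ y   # minimum L2-norm solution of X beta = y
beta_ridge = np.linalg.solve(X.T @ X + 1e-6 * np.eye(P), X.T @ y)

print(np.allclose(beta_min_norm, beta_ridge, atol=1e-5))   # True
print(np.allclose(X @ beta_ridge, y, atol=1e-4))           # fits the training data exactly
```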
A Bayesian posterior
Another way to interpret the ridge penalty is as a Bayesian posterior mean. If we assume the model $y_n \mid \beta \sim \mathcal{N}(\beta^\top x_n, \sigma^2)$ independently, and place the prior $\beta \sim \mathcal{N}(0, \tau^2 I_P)$,
then the posterior distribution of $\beta$ given the data is normal, with mean $\mathbb{E}[\beta \mid X, Y] = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_P\right)^{-1} X^\top Y$.
One way to derive this is to recognize that, if the likelihood and the prior are both Gaussian, then the log posterior density is, up to a constant, $-\frac{1}{2\sigma^2} \sum_n (y_n - \beta^\top x_n)^2 - \frac{1}{2\tau^2} \|\beta\|_2^2$, a negative quadratic in $\beta$.
We can write out the distribution of $\beta$ given the data by completing the square in this quadratic: the posterior is $\mathcal{N}\left( \left(X^\top X + \frac{\sigma^2}{\tau^2} I_P\right)^{-1} X^\top Y, \; \sigma^2 \left(X^\top X + \frac{\sigma^2}{\tau^2} I_P\right)^{-1} \right)$.
From this, we can read off both the posterior mean and the posterior mode, which coincide since the posterior is normal.
If we take $\lambda = \sigma^2 / \tau^2$, then the posterior mean is exactly the ridge estimator $\hat{\beta}_\lambda = (X^\top X + \lambda I_P)^{-1} X^\top Y$.
This gives the ridge procedure some interpretability. First of all, the use of the ridge penalty corresponds to a prior belief that the coefficients are centered at zero with variance $\tau^2$, so that very large values of $\beta$ are a priori implausible.
Second, the ridge penalty you use reflects the relative scale of the noise variance $\sigma^2$ and the prior variance $\tau^2$ in a way that makes sense:
- If $\sigma^2 \gg \tau^2$, then the data is noisy (relative to our prior beliefs). We should not take fitting the data too seriously, and so should estimate a $\hat{\beta}$ smaller than $\hat{\beta}_{\mathrm{OLS}}$. And indeed, in this case $\lambda = \sigma^2 / \tau^2$ is large, and a large $\lambda$ shrinks the estimated coefficients.
- If $\tau^2 \gg \sigma^2$, then we find it plausible that $\beta$ is very large (relative to the variability in our data). We should not take our prior beliefs too seriously, and should estimate a coefficient that matches $\hat{\beta}_{\mathrm{OLS}}$. And indeed, in this case, $\lambda$ is small, and we do not shrink the coefficients much.
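The correspondence $\lambda = \sigma^2 / \tau^2$ is easy to verify numerically. The sketch below (simulated data, with $\sigma^2$ and $\tau^2$ treated as known) computes the Gaussian posterior mean in its precision form and compares it to the ridge formula.

```python
# A minimal sketch checking that the Gaussian posterior mean equals the
# ridge estimator with lambda = sigma^2 / tau^2.
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
sigma2, tau2 = 4.0, 0.5
X = rng.normal(size=(N, P))
beta_true = rng.normal(0.0, np.sqrt(tau2), size=P)             # a draw from the prior
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=N)

# Posterior mean in precision form: (X^T X / sigma^2 + I / tau^2)^{-1} X^T y / sigma^2.
posterior_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(P) / tau2, X.T @ y / sigma2)

# Ridge estimator with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

print(np.allclose(posterior_mean, beta_ridge))  # True
```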
Sparse regression with the L1 norm (lasso)
One problem with the L2 solution might be that the solution $\hat{\beta}_\lambda$ is never exactly sparse: ridge regression shrinks all of the coefficients towards zero, but in general none of them is shrunk exactly to zero.
For example, consider our highly correlated regressor example from the previous lecture. The ridge regression will still include both regressors, and their coefficient estimates will still be highly negatively correlated, but both will be shrunk towards zero. Maybe it would make more sense to select only one variable to include. Let us try to think of how we can change the penalty term to achieve this.
A “sparse” solution is an estimator $\hat{\beta}$ in which many of the entries are exactly zero, so that only a subset of the regressors is actually used for prediction.
In a word — ridge regression estimates are not sparse. Let’s try to derive one that is by changing the penalty.
A very intuitive way to produce a sparse estimate is as follows: directly penalize the number of nonzero coefficients,
$$\hat{\beta} := \underset{\beta}{\mathrm{argmin}} \; \sum_{n=1}^N (y_n - \beta^\top x_n)^2 + \lambda \sum_{p=1}^P 1\{\beta_p \ne 0\}.$$
This trades off fit to the data against a penalty for using more regressors. It makes sense, but it is very difficult to compute: the objective is very non-convex, and in general one must search over all $2^P$ subsets of regressors. Bayesian statisticians do attempt to estimate models with a similar kind of penalty (they are called “spike and slab” models), but they are extremely computationally intensive and beyond the scope of this course.
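To see the computational issue concretely, here is a brute-force sketch of this estimator on a tiny simulated problem (the penalty value and true coefficients are arbitrary); it enumerates every subset of regressors, which is only feasible because $P$ is small.

```python
# A brute-force sketch of the best-subset (L0-penalized) estimator on a tiny
# simulated problem. It enumerates all 2^P subsets of regressors.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 100, 6, 5.0
X = rng.normal(size=(N, P))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=N)

best_score, best_subset = np.inf, ()
for k in range(P + 1):
    for subset in itertools.combinations(range(P), k):
        if k == 0:
            rss = np.sum(y ** 2)
        else:
            Xs = X[:, list(subset)]
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
            rss = np.sum((y - Xs @ beta) ** 2)
        score = rss + lam * k          # squared error plus penalty per regressor used
        if score < best_score:
            best_score, best_subset = score, subset

print("selected regressors:", best_subset)   # here, most likely (0, 3)
```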
A convex approximation to the preceding loss is the L1 or Lasso loss, leading to Lasso or L1 regression:
$$\hat{\beta}^{\mathrm{lasso}}_\lambda := \underset{\beta}{\mathrm{argmin}} \; \sum_{n=1}^N (y_n - \beta^\top x_n)^2 + \lambda \|\beta\|_1, \quad \text{where } \|\beta\|_1 = \sum_{p=1}^P |\beta_p|.$$
This loss is convex (because it is the sum of two convex functions), and so is much easier to minimize. Furthermore, as $\lambda$ increases, the Lasso sets more and more of the coefficients exactly to zero, so it performs variable selection as well as shrinkage.
Just as with ridge regression, you should standardize the variables before applying the Lasso.
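The following sketch (simulated data with two nearly identical regressors plus three irrelevant ones; the penalty values are arbitrary) illustrates the difference: the ridge coefficients are all nonzero, while the Lasso sets several exactly to zero.

```python
# A minimal sketch comparing ridge and Lasso fits on simulated data with two
# nearly identical regressors plus three irrelevant ones.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + 0.05 * rng.normal(size=N)          # nearly identical to x1
noise = rng.normal(size=(N, 3))              # irrelevant regressors
X = np.column_stack([x1, x2, noise])
y = x1 + rng.normal(size=N)

X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, as recommended above
y_tilde = y - y.mean()

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X_tilde, y_tilde)
lasso = Lasso(alpha=0.2, fit_intercept=False).fit(X_tilde, y_tilde)
print("ridge coefficients:", ridge.coef_)    # generically all nonzero
print("lasso coefficients:", lasso.coef_)    # note the exact zeros
```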
The Lasso produces sparse solutions
One way to see that the Lasso produces sparse solutions is to start with a very large $\lambda$, for which the solution is exactly $\hat{\beta} = 0$, and ask what happens as we decrease $\lambda$.
Start at $\hat{\beta} = 0$ and consider moving a single coefficient $\beta_p$ a small amount $\epsilon$ away from zero. Writing $X_p$ for the $p$-th column of the regressor matrix, the squared-error term changes by approximately $-2 \epsilon X_p^\top Y$, while the penalty increases by $\lambda |\epsilon|$, so the move only lowers the Lasso loss if $2 |X_p^\top Y| > \lambda$.
As long as $\lambda \ge 2 \max_p |X_p^\top Y|$, then, the penalty outweighs any possible improvement in fit, and the solution stays exactly at $\hat{\beta} = 0$.
Eventually, we decrease $\lambda$ below this threshold: the regressor with the largest $|X_p^\top Y|$ (the one most correlated with the response, for standardized regressors) enters the model with a nonzero coefficient while the rest remain exactly zero, and further decreases of $\lambda$ allow more coefficients to become nonzero. Contrast this with the ridge penalty, whose derivative $2\lambda\beta_p$ vanishes at $\beta_p = 0$: there is no such threshold, and the ridge solution is generically nonzero in every coordinate.
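Here is a small numerical sketch of this argument using scikit-learn's `lasso_path` (simulated data; the grid of penalty values is arbitrary). Note that scikit-learn scales the squared-error term by $1/(2N)$, so its zero-solution threshold is $\max_p |X_p^\top Y| / N$ rather than $2 \max_p |X_p^\top Y|$.

```python
# A minimal sketch: above a threshold penalty the Lasso solution is exactly
# zero, and coefficients enter as the penalty decreases.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
N, P = 100, 5
X = rng.normal(size=(N, P))
y = 2 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=N)

X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)
y_tilde = y - y.mean()

# sklearn's threshold for an all-zero solution (note the 1/(2N) scaling convention).
alpha_max = np.max(np.abs(X_tilde.T @ y_tilde)) / N
alphas, coefs, _ = lasso_path(X_tilde, y_tilde,
                              alphas=alpha_max * np.array([1.0, 0.5, 0.1, 0.01]))
for a, c in zip(alphas, coefs.T):
    print(f"alpha = {a:8.4f}  nonzero coefficients: {np.sum(np.abs(c) > 1e-10)}")
```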
The Lasso as a constrained optimization problem
See figure 6.7 from Gareth et al. (2021) for an interpretation of the Lasso as a constrained optimization problem. The shape of the L1 ball provides an intuitive way to understand the sparsity of the solution compared to ridge.
The Bayesian Lasso is not sparse
In contrast to the ridge case, it is not hard to show that if you take a prior under which each $\beta_p$ independently follows a Laplace (double exponential) distribution, whose log density is proportional to $-|\beta_p|$ and so matches the shape of the L1 penalty,
you do not recover a sparse solution for the posterior mean! The difference is that, for the ridge prior, the posterior remained normal, so that the “maximum a–posteriori” (MAP) estimator was equal to the mean. In the Lasso case, the MAP is sparse, but the mean is not, and the two do not coincide because the posterior is not normal.
A more Bayesian way to produce sparse posteriors is by setting a non-zero prior probability that $\beta_p$ is exactly zero, as in the “spike and slab” models mentioned above.