Ridge (L2) and Lasso (L1) Regression

Author

Ryan Giordano

Goals

  • Introduce ridge (L2) and Lasso (L1) regression
    • As a complexity penalty
    • As a tuneable hierarchy of models to be selected by cross-validation
    • As a Bayesian posterior estimate

Reading

These notes are a supplement to Section 6.2 of Gareth et al. (2021).

Approximately collinear regressors

Sometimes it makes sense to penalize very large values of $\hat\beta$.

Consider the following contrived example. Let

$$y_n = z_n + \varepsilon_n \quad\text{where}\quad z_n \sim \mathcal{N}(0,1) \quad\text{and}\quad \varepsilon_n \sim \mathcal{N}(0,1).$$

If we regress $y \sim \beta z$, then $\hat\beta \sim \mathcal{N}(1, 1/N)$ approximately, with no problems. But suppose we actually also observe $x_n = z_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \delta)$ for very small $\delta \ll 1$. Then $x_n$ and $z_n$ are highly correlated:

$$M_X := \mathbb{E}\left[\begin{pmatrix} x_n \\ z_n \end{pmatrix}\begin{pmatrix} x_n & z_n \end{pmatrix}\right] = \begin{pmatrix} 1 + \delta & 1 \\ 1 & 1 \end{pmatrix} \quad\Rightarrow\quad M_X^{-1} = \frac{1}{\delta}\begin{pmatrix} 1 & -1 \\ -1 & 1 + \delta \end{pmatrix}.$$

Therefore, if we try to regress $y \sim \beta_x x + \beta_z z$, we get

$$\mathrm{Cov}\left(\begin{pmatrix} \hat\beta_x \\ \hat\beta_z \end{pmatrix} \,\middle|\, X, Z\right) \approx \frac{1}{N} M_X^{-1} = \frac{1}{N\delta}\begin{pmatrix} 1 & -1 \\ -1 & 1 + \delta \end{pmatrix}.$$

Note that $\mathrm{Var}(\hat\beta_x) \approx \frac{1}{N\delta}$, which is very large for small $\delta$. With high probability, we will estimate a $\hat\beta$ with a very large magnitude $\|\hat\beta\|_2^2$, although its two components should be nearly the negatives of one another. This could be problematic in practice. For example, in our test set or application, if $x_n$ and $z_n$ are not as well-correlated as in our training set, this will lead to wild and highly variable predicted values.

Does it make sense to permit such a large variance? Wouldn’t it be better to choose a slightly smaller $\hat\beta$, which is in turn somewhat more “stable”?
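To make this concrete, here is a minimal numpy sketch of the contrived example above; the variable names and seed are purely illustrative. With $\delta$ small, the two OLS coefficients are individually huge and nearly opposite, even though their sum is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, delta = 100, 1e-4

z = rng.normal(0.0, 1.0, size=N)
x = z + rng.normal(0.0, np.sqrt(delta), size=N)   # x is nearly identical to z
y = z + rng.normal(0.0, 1.0, size=N)

# OLS on both regressors: beta_hat = (X'X)^{-1} X'y.
X = np.column_stack([x, z])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)        # typically two large coefficients of nearly opposite sign
print(beta_hat.sum())  # but their sum is stable (close to 1)
```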

Penalizing large regressors with ridge regression

Recall that one perspective on regression is that we choose $\hat\beta$ to minimize the loss

$$\hat\beta := \underset{\beta}{\mathrm{argmin}}\ \mathrm{RSS}(\beta), \quad\text{where}\quad \mathrm{RSS}(\beta) := \sum_{n=1}^N \left(y_n - \beta^\top x_n\right)^2.$$

We motivated this as an approximation to the expected predicted loss, $L(\beta) = \mathbb{E}_{(y_\mathrm{new}, x_\mathrm{new})}\left[(y_\mathrm{new} - \beta^\top x_\mathrm{new})^2\right]$. But that made sense when we had a fixed set of regressors, and we have shown that the correspondence breaks down when we are searching over the space of regressors. In particular, $\mathrm{RSS}(\hat\beta)$ always decreases as we add more regressors, but $L(\hat\beta)$ may not.

Instead, let us consider minimizing $\mathrm{RSS}(\beta)$, but with an additional penalty for a large $\hat\beta$. There are many ways to do this! But one convenient one is as follows. For now, pick a $\lambda$, and choose $\hat\beta$ to minimize:

$$\hat\beta(\lambda) := \underset{\beta}{\mathrm{argmin}}\ L_\mathrm{ridge}(\beta, \lambda), \quad\text{where}\quad L_\mathrm{ridge}(\beta, \lambda) := \mathrm{RSS}(\beta) + \lambda \|\beta\|_2^2 = \mathrm{RSS}(\beta) + \lambda \sum_{p=1}^P \beta_p^2.$$

This is known as ridge regression, or L2-penalized regression. The latter name refers to the fact that the penalty $\|\beta\|_2^2$ is the squared L2 norm of the coefficient vector; below we will study the L1 version, which is also known as the Lasso.

The term $\lambda \|\beta\|_2^2$ is known as a “regularizer,” since it imposes some “regularity” on the estimate $\hat\beta(\lambda)$. Note that

  • As $\lambda \to \infty$, $\hat\beta(\lambda) \to 0$.
  • When $\lambda = 0$, $\hat\beta(\lambda) = \hat\beta$, the OLS estimator.

So the effect of including $\lambda$ is to “shrink” the estimate $\hat\beta(\lambda)$ towards zero. Note that since the ridge loss adds an extra penalty on the norm, it is impossible for the OLS solution to have a smaller norm than the ridge solution.

The ridge regression regularizer has the considerable advantage that the optimum is available in closed form, since

$$
\begin{aligned}
L_\mathrm{ridge}(\beta, \lambda) &= (Y - X\beta)^\top (Y - X\beta) + \lambda \beta^\top \beta \\
&= Y^\top Y - 2 Y^\top X \beta + \beta^\top X^\top X \beta + \lambda \beta^\top \beta \\
&= Y^\top Y - 2 Y^\top X \beta + \beta^\top (X^\top X + \lambda I) \beta \\
\Rightarrow\quad \frac{\partial L_\mathrm{ridge}(\beta, \lambda)}{\partial \beta} &= -2 X^\top Y + 2 (X^\top X + \lambda I) \beta \\
\Rightarrow\quad \hat\beta(\lambda) &= (X^\top X + \lambda I)^{-1} X^\top Y.
\end{aligned}
$$

Note that $X^\top X + \lambda I$ is always invertible if $\lambda > 0$, even if $X$ is not of full column rank. In this sense, ridge regression can deal safely with collinearity.
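As a quick check of the closed form, here is a small numpy sketch (the simulated data and variable names are only for illustration). With $\lambda > 0$ the linear system is well conditioned even when the columns of $X$ are nearly collinear, and larger $\lambda$ shrinks the solution toward zero.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Compute beta_hat(lambda) = (X'X + lambda * I)^{-1} X'y."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

rng = np.random.default_rng(0)
N = 100
z = rng.normal(size=N)
X = np.column_stack([z + rng.normal(scale=1e-2, size=N), z])  # nearly collinear columns
y = z + rng.normal(size=N)

for lam in [0.0, 0.1, 1.0, 10.0]:
    # lam = 0 is OLS (wild, offsetting coefficients); larger lam shrinks them.
    print(lam, ridge_closed_form(X, y, lam))
```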

Exercise

Prove that $X^\top X + \lambda I$ is invertible if $\lambda > 0$. Hint: using the fact that $X^\top X$ is symmetric and positive semi-definite (since $v^\top X^\top X v = \|Xv\|_2^2 \ge 0$ for any $v$), find a lower bound on the smallest eigenvalue of $X^\top X + \lambda I$.

Standardizing regressors

Suppose we re-scale one of the regressors $x_{np}$ by some value $\alpha$ with $\alpha \ll 1$, i.e., we regress on $x'_{np} = \alpha x_{np}$ instead of $x_{np}$. As we know, the OLS minimizer becomes $\hat\beta'_p = \hat\beta_p / \alpha$ and the fitted values $\hat{Y}$ are unchanged at $\lambda = 0$. But for a particular $\lambda > 0$, how does this affect the ridge solution? We can write

$$\lambda \left(\hat\beta'_p\right)^2 = \frac{\lambda}{\alpha^2} \hat\beta_p^2.$$

That is, we will effectively “punish” large values of $\hat\beta'_p$ much more than we would “punish” the corresponding values of $\hat\beta_p$. In turn, for a particular $\lambda$, we will tend to shrink the rescaled coefficient more aggressively (although this is not necessarily the case).

The point is that re-scaling the regressors affects the meaning of $\lambda$. Correspondingly, if different regressors have very different typical scales, such as age versus income, then ridge regression will drive their coefficients to zero to very different degrees.
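A small sketch of this point, using simulated data and illustrative names: rescaling a column leaves the OLS fit unchanged but genuinely changes the ridge fit, because the same $\lambda$ now penalizes that regressor’s coefficient differently.

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=N)

alpha = 0.01
X_rescaled = X.copy()
X_rescaled[:, 0] *= alpha   # rescale the first regressor

# OLS (lam = 0): the coefficient rescales by 1/alpha and the fitted values are unchanged.
print(np.allclose(X @ ridge(X, y, 0.0), X_rescaled @ ridge(X_rescaled, y, 0.0)))  # True

# Ridge (lam > 0): the fitted values change, because lambda now "means" something different.
lam = 5.0
print(np.allclose(X @ ridge(X, y, lam), X_rescaled @ ridge(X_rescaled, y, lam)))  # False
```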

Similarly, it often doesn’t make sense to penalize the constant, so we might take $\beta_1$ to be the coefficient on the constant ($x_{n1} = 1$), and write

$$\hat\beta(\lambda) := \underset{\beta}{\mathrm{argmin}}\ \mathrm{RSS}(\beta) + \lambda \sum_{p=2}^P \beta_p^2.$$

But this gets tedious, and assumes we have included a constant.
Instead, we might invoke the FWL theorem, center the response and regressors at their mean values, and then do penalized regression.

For both these reasons, before performing ridge regression (or any other similar penalized regression), we typically standardize the regressors, defining

$$\tilde{x}_{np} := \frac{x_{np} - \bar{x}_p}{\sqrt{\frac{1}{N}\sum_{n=1}^N \left(x_{np} - \bar{x}_p\right)^2}}, \quad\text{where}\quad \bar{x}_p := \frac{1}{N}\sum_{n=1}^N x_{np}.$$

We then run ridge regression on $\tilde{x}$ rather than $x$ (see the sketch after this list), so that we

  • Don’t penalize the constant term and
  • Penalize every coefficient the same regardless of its regressor’s typical scale.
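Here is a minimal sketch of this preprocessing with numpy; the “age” and “income” variables are simulated purely to illustrate regressors with very different scales.

```python
import numpy as np

def standardize(X):
    """Center each column and divide by its (population) standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
N = 100
age = rng.uniform(20, 80, size=N)        # typical scale: tens
income = rng.uniform(2e4, 2e5, size=N)   # typical scale: tens of thousands
y = 0.5 * age + 1e-4 * income + rng.normal(size=N)

X_std = standardize(np.column_stack([age, income]))
y_centered = y - y.mean()                # centering y plays the role of the constant

print(ridge(X_std, y_centered, lam=1.0)) # both coefficients are now on a comparable scale
```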

A complexity penalty

Suppose that $y_n = \beta^\top x_n + \varepsilon_n$ for some true $\beta$. Note that, for a fixed $x_\mathrm{new}$ and fixed $X$, taking expectations over $y_\mathrm{new}$ and the training responses,

$$\mathbb{E}\left[y_\mathrm{new} - x_\mathrm{new}^\top \hat\beta(\lambda)\right] = x_\mathrm{new}^\top\left(\beta - \mathbb{E}[\hat\beta(\lambda)]\right) = x_\mathrm{new}^\top\left(I - (X^\top X + \lambda I)^{-1} X^\top X\right)\beta.$$

That is, as $\lambda$ grows, $\hat\beta(\lambda)$ becomes biased. However, by the same reasoning as in the standard case, under homoskedasticity,

$$\mathrm{Cov}(\hat\beta(\lambda)) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X\, (X^\top X + \lambda I)^{-1},$$

which is smaller than $\mathrm{Cov}(\hat\beta)$ in the sense that $\mathrm{Cov}(\hat\beta) - \mathrm{Cov}(\hat\beta(\lambda))$ is a positive definite matrix for $\lambda > 0$ and full-rank $X$. In this sense, $\hat\beta(\lambda)$ is a one-dimensional family of estimators that trades off bias and variance. We can thus use cross-validation to choose the $\lambda$ that minimizes the estimated MSE.
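A minimal sketch of choosing $\lambda$ by cross-validation with scikit-learn, whose `RidgeCV` calls the penalty weight `alpha`; the data here is simulated only for illustration, and the pipeline standardizes the regressors as discussed above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N, P = 200, 10
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(scale=2.0, size=N)

# Standardize, then pick lambda (sklearn's `alpha`) over a grid by 5-fold cross-validation.
lambdas = np.logspace(-3, 3, 25)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=lambdas, cv=5))
model.fit(X, y)

print(model.named_steps["ridgecv"].alpha_)  # the lambda chosen by cross-validation
```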

Constrained optimization and the minimum norm interpolator

We can rewrite the ridge loss in a suggestive way. Fix $\lambda = \hat\lambda$, write $b = \|\hat\beta(\hat\lambda)\|_2^2$, and write

$$L_\mathrm{ridge}(\beta, \hat\lambda) = \mathrm{RSS}(\beta) + \hat\lambda\left(\|\beta\|_2^2 - b\right) + \hat\lambda b.$$

Since $b$ is fixed (the final term does not depend on $\beta$), $\hat\beta(\hat\lambda)$ still satisfies

$$\left.\frac{\partial L_\mathrm{ridge}(\beta, \lambda)}{\partial \beta}\right|_{\hat\beta, \hat\lambda} = \left.\frac{\partial\, \mathrm{RSS}(\beta)}{\partial \beta}\right|_{\hat\beta} + \hat\lambda \left.\frac{\partial \|\beta\|_2^2}{\partial \beta}\right|_{\hat\beta} = 0.$$

So the optimum is actually the same. The loss $L_\mathrm{ridge}(\beta, \lambda)$ is (up to a constant) the Lagrange multiplier version of the constrained optimization problem

$$\hat\beta(b) := \underset{\beta:\ \|\beta\|_2^2 \le b}{\mathrm{argmin}}\ \mathrm{RSS}(\beta).$$

Taking $b = \|\hat\beta(\lambda)\|_2^2$, we see that for every $\lambda$ there is a $b$, and vice versa. Ridge regression is thus equivalent to minimizing the squared error loss subject to the constraint that $\beta$ lies within an L2 ball. Making the ball larger allows a better fit to the data, but at the cost of a larger $\beta$. This intuition is particularly useful when contrasting ridge regression with the Lasso.

This intuition is also useful for understanding what happens as $\lambda \to 0$. Write $r = \mathrm{RSS}(\hat\beta(\hat\lambda))$, and note that we could also have written

$$\frac{1}{\hat\lambda}\left(\mathrm{RSS}(\beta) - r\right) + \|\beta\|_2^2 = \frac{1}{\hat\lambda}\left(L_\mathrm{ridge}(\beta, \hat\lambda) - r\right),$$

which differs from the ridge loss only by a positive rescaling and a constant shift. As before, for fixed $\hat\lambda$ (and so fixed $r$), this objective has the same minimizer $\hat\beta(\hat\lambda)$. However, it is also the Lagrange multiplier version of

$$\hat\beta(r) := \underset{\beta:\ \mathrm{RSS}(\beta) \le r}{\mathrm{argmin}}\ \|\beta\|_2^2.$$

We can thus equivalently interpret the ridge estimator as the $\beta$ with the smallest L2 norm, subject to the RSS being no larger than $r$. As $\lambda \to 0$, infinite weight gets put on the $\mathrm{RSS}(\beta)$ term, and $r \to 0$ if $X$ has rank $N$ (which requires $P \ge N$). So we see that

$$\lim_{r \to 0} \hat\beta(r) = \underset{\beta:\ \mathrm{RSS}(\beta) = 0}{\mathrm{argmin}}\ \|\beta\|_2^2.$$

That is, when there are many $\beta$ with $\mathrm{RSS}(\beta) = 0$, i.e., which “interpolate” the data, the limiting ridge estimator chooses the one with minimum norm. This is called the “ridgeless interpolator.” (See the “ridgeless” lecture in Tibshirani (2023).)
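A small numerical sketch of this limit, with simulated data and more regressors than observations: the minimum-norm interpolator can be computed with the pseudoinverse, and ridge with a tiny $\lambda$ essentially reproduces it.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 50                    # more regressors than observations
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

# Minimum-norm interpolator: the least-squares solution of smallest L2 norm.
beta_min_norm = np.linalg.pinv(X) @ y

# Ridge with a tiny lambda approaches the same solution.
lam = 1e-8
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

print(np.allclose(X @ beta_min_norm, y))           # True: it interpolates the data exactly
print(np.max(np.abs(beta_ridge - beta_min_norm)))  # tiny: ridge -> "ridgeless" as lambda -> 0
```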

A Bayesian posterior

Another way to interpret the ridge penalty is as a Bayesian posterior mean. If

$$\beta \sim \mathcal{N}(0, \sigma_\beta^2 I) \quad\text{and}\quad Y \mid \beta, X \sim \mathcal{N}(X\beta, \sigma^2 I),$$

then

$$\beta \mid Y, X \sim \mathcal{N}\left(\left(X^\top X + \frac{\sigma^2}{\sigma_\beta^2} I\right)^{-1} X^\top Y,\ \ \sigma^2 \left(X^\top X + \frac{\sigma^2}{\sigma_\beta^2} I\right)^{-1}\right).$$

One way to derive this is to recognize that, if $\beta \sim \mathcal{N}(\mu, \Sigma)$, then

$$\log P(\beta \mid \mu, \Sigma) = -\frac{1}{2}\beta^\top \Sigma^{-1} \beta + \beta^\top \Sigma^{-1} \mu + \text{terms that do not depend on } \beta.$$

We can write out the distribution $P(\beta \mid Y) = P(\beta, Y) / P(Y)$, gather the terms that depend on $\beta$, and read off the mean and covariance:

$$
\begin{aligned}
\log P(\beta, Y) &= -\frac{1}{2\sigma^2}(Y - X\beta)^\top (Y - X\beta) - \frac{1}{2\sigma_\beta^2}\beta^\top \beta + \text{terms that do not depend on } \beta \\
&= -\frac{1}{2\sigma^2}\beta^\top X^\top X \beta + \frac{1}{\sigma^2}\beta^\top X^\top Y - \frac{1}{2\sigma_\beta^2}\beta^\top \beta + \text{terms that do not depend on } \beta \\
&= -\frac{1}{2}\beta^\top \left(\frac{1}{\sigma^2} X^\top X + \frac{1}{\sigma_\beta^2} I\right)\beta + \frac{1}{\sigma^2}\beta^\top X^\top Y + \text{terms that do not depend on } \beta.
\end{aligned}
$$

From this, we can read off Σ and μ, and get the above expression.

If we take $\lambda = \sigma^2 / \sigma_\beta^2$, then we can see that

$$\mathbb{E}[\beta \mid Y, X] = (X^\top X + \lambda I)^{-1} X^\top Y = \hat\beta(\lambda).$$

This gives the ridge procedure some interpretability. First of all, the use of the ridge penalty corresponds to a prior belief that β is not too large.

Second, the ridge penalty you use reflects the relative scale of the noise variance and the prior variance in a way that makes sense (a numerical check follows the list below):

  • If $\sigma \gg \sigma_\beta$, then the data is noisy (relative to our prior beliefs). We should not take fitting the data too seriously, and so should estimate a smaller $\beta$ than $\hat\beta$. And indeed, in this case $\lambda$ is large, and a large $\lambda$ shrinks the estimated coefficients.
  • If $\sigma_\beta \gg \sigma$, then we find it plausible that $\beta$ is very large (relative to the variability in our data). We should not take our prior beliefs too seriously, and should estimate a coefficient that matches $\hat\beta$. And indeed, in this case $\lambda$ is small, and we do not shrink the coefficients much.
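Here is the promised numerical check, with illustrative values of $\sigma$ and $\sigma_\beta$ and simulated data: the posterior mean computed from the standard Gaussian conditioning formula agrees with the ridge closed form at $\lambda = \sigma^2 / \sigma_\beta^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
Y = X @ rng.normal(size=P) + rng.normal(size=N)

sigma, sigma_beta = 1.0, 2.0    # noise and prior standard deviations (illustrative)
lam = sigma**2 / sigma_beta**2  # the implied ridge penalty

# Ridge estimate from the closed form derived earlier.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

# Posterior mean via Gaussian conditioning on the joint distribution of (beta, Y):
# E[beta | Y] = Cov(beta, Y) Cov(Y)^{-1} Y.
cov_Y = sigma_beta**2 * X @ X.T + sigma**2 * np.eye(N)
posterior_mean = sigma_beta**2 * X.T @ np.linalg.solve(cov_Y, Y)

print(np.allclose(beta_ridge, posterior_mean))  # True: the two computations agree
```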

Sparse regression with the L1 norm (lasso)

One problem with the L2 approach is that the solution $\hat\beta_{L_2}(\lambda)$ is still “dense,” meaning that, in general, every entry of it is nonzero, and we still have to invert a $P \times P$ matrix.

For example, consider our highly correlated regressor example from earlier. Ridge regression will still include both regressors, and their coefficient estimates will still be highly negatively correlated, but both will be shrunk towards zero. Maybe it would make more sense to select only one variable to include. Let us try to think of how we can change the penalty term to achieve this.

A “sparse” solution is an estimator $\hat\beta$ in which many of the entries are zero, that is, an estimated regression line that does not use many of the available regressors.

In short: ridge regression estimates are not sparse. Let’s try to derive an estimator that is, by changing the penalty.

A very intuitive way to produce a sparse estimate is as follows:

$$\hat\beta_{L_0}(\lambda) := \underset{\beta}{\mathrm{argmin}}\left(\mathrm{RSS}(\beta) + \lambda \sum_p \mathbb{1}\left(\beta_p \ne 0\right)\right) \quad\text{(practically difficult)}.$$

This trades off fit to the data against a penalty for using more regressors. This makes sense, but it is very difficult to compute; in particular, the objective is very non-convex. Bayesian statisticians do attempt to estimate models with a similar kind of penalty (they are called “spike and slab” models), but they are extremely computationally intensive and beyond the scope of this course.
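To see the computational difficulty concretely, here is a brute-force sketch of the L0 problem for a handful of regressors (the simulated data and penalty value are illustrative): it must enumerate all $2^P$ subsets, which is hopeless for even moderately large $P$.

```python
import itertools

import numpy as np

def best_subset_l0(X, y, lam):
    """Brute-force minimization of RSS(beta) + lam * (number of nonzero coefficients)."""
    N, P = X.shape
    best_loss, best_beta = np.inf, np.zeros(P)
    for k in range(P + 1):
        for subset in itertools.combinations(range(P), k):  # 2^P subsets in total
            beta = np.zeros(P)
            if subset:
                S = list(subset)
                beta[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
            loss = np.sum((y - X @ beta) ** 2) + lam * k
            if loss < best_loss:
                best_loss, best_beta = loss, beta
    return best_beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
print(best_subset_l0(X, y, lam=10.0))  # with this penalty, typically keeps only columns 0 and 2
```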

A convex approximation to the preceding loss is the L1 or Lasso loss, leading to Lasso or L1 regression:

$$\hat\beta_{L_1}(\lambda) := \underset{\beta}{\mathrm{argmin}}\left(\mathrm{RSS}(\beta) + \lambda \sum_p |\beta_p|\right) = \underset{\beta}{\mathrm{argmin}}\left(\mathrm{RSS}(\beta) + \lambda \|\beta\|_1\right).$$

This loss is convex (because it is the sum of two convex functions), and so is much easier to minimize. Furthermore, as $\lambda$ grows, it does produce sparser and sparser solutions, though this may not be obvious at first.

Just as with ridge regression, you should standardize variables before applying the Lasso.
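A minimal sketch with scikit-learn’s `Lasso` on simulated data; note that sklearn minimizes $\mathrm{RSS}/(2N) + \alpha\|\beta\|_1$, so its `alpha` is a rescaled version of the $\lambda$ above. Only the truly relevant coefficients typically survive.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N, P = 200, 20
X = rng.normal(size=(N, P))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)  # only the first two regressors matter

X_std = StandardScaler().fit_transform(X)

# sklearn's Lasso minimizes RSS / (2N) + alpha * ||beta||_1, so `alpha` is a rescaled lambda.
fit = Lasso(alpha=0.3).fit(X_std, y)
print(np.flatnonzero(fit.coef_))  # typically just [0 1]: the estimate is sparse
```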

The Lasso produces sparse solutions

One way to see that the Lasso produces sparse solutions is to start with a very large $\lambda$ and see what happens as it is slowly decreased.

Start at $\lambda$ very large, so that $\hat\beta_{L_1}(\lambda) = 0$. If we take a small (signed) step of size $\delta$ away from zero in entry $\beta_p$, then $\lambda \|\hat\beta\|_1$ increases by $|\delta|\lambda$, and the RSS changes by approximately $\delta$ times the derivative of the squared error,

$$-2\delta \sum_{n=1}^N \left(y_n - \hat\beta(\lambda)^\top x_n\right) x_{np} = -2\delta \sum_{n=1}^N \hat\varepsilon_n x_{np} = -2\delta \sum_{n=1}^N y_n x_{np} \quad \text{(because } \hat\beta(\lambda) = 0\text{)}.$$

As long as $2\left|\sum_{n=1}^N y_n x_{np}\right| < \lambda$ for all $p$, the decrease in the RSS cannot offset the increase in the penalty, so we cannot improve the loss by moving away from $0$. Since the loss is convex, that means $0$ is the minimum.

Eventually, we decrease $\lambda$ until $2\left|\sum_{n=1}^N y_n x_{np}\right| = \lambda$ for some $p$. At that point, $\beta_p$ moves away from zero as $\lambda$ decreases, and the $\hat\varepsilon_n$ also change. However, until $2\left|\sum_{n=1}^N \hat\varepsilon_n x_{nq}\right| = \lambda$ for some other $q$, only $\beta_p$ will be nonzero. As $\lambda$ decreases, more and more variables tend to get added to the model, until $\lambda = 0$, when of course $\hat\beta_{L_1}(0) = \hat\beta$, the OLS solution. Along the path, variables may come in and out of the regression in complex ways.
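Here is a sketch of this path using scikit-learn’s `lasso_path` (again, its `alpha` is a rescaled $\lambda$, and the data is simulated): as the penalty decreases, the number of nonzero coefficients (mostly) grows.

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N, P = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(N, P)))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(size=N)
y = y - y.mean()

# Compute the whole path of solutions as the penalty decreases.
alphas, coefs, _ = lasso_path(X, y)

# Count the nonzero coefficients at a few penalty values along the path.
for a, c in zip(alphas[::10], coefs[:, ::10].T):
    print(f"alpha = {a:8.4f}   nonzero coefficients = {np.count_nonzero(c)}")
```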

The Lasso as a constrained optimization problem

See Figure 6.7 of Gareth et al. (2021) for an interpretation of the Lasso as a constrained optimization problem. The shape of the L1 ball provides an intuitive way to understand the sparsity of the solution compared to ridge.

The Bayesian Lasso is not sparse

In contrast to the ridge case, it is not hard to show that if you take a prior

$$P(\beta) \propto \exp\left(-\lambda \|\beta\|_1\right),$$

you do not recover a sparse solution for the posterior mean! The difference is that, for the ridge prior, the posterior remained normal, so that the “maximum a posteriori” (MAP) estimator was equal to the mean. In the Lasso case, the MAP is sparse, but the mean is not, and the two do not coincide because the posterior is not normal.

A more Bayesian way to produce sparse posteriors is by setting a non-zero prior probability that $\beta_p = 0$. These are called “spike and slab priors,” and are beyond the scope of this course.

References

Gareth, J., W. Daniela, H. Trevor, and T. Robert. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.
Tibshirani, R. 2023. “Advanced Topics in Statistical Learning: Spring 2023.” https://www.stat.berkeley.edu/~ryantibs/statlearn-s23/.