Bias and variance tradeoff in prediction
Goals
- Study the dependence of regression mean squared error on model complexity
- Bias-variance tradeoff and mean squared error
- Simple expressions for bias and variance under simple assumptions
- More general interpretations
Reading
These lecture notes can be supplemented with
Consequences of model complexity
We have discussed how we can generate many features for linear regression. In principle, we can generate an arbitrarily large number of features — bounded only by the number of distinct values the regressors can take. Which features should we include in the model?
We will discuss this as a practical matter throughout the course. But here, we will show that in linear regression the choice of which features to include involves an explicit tradeoff between two quantities, “bias” and “variance,” which move in different directions as more regressors are included.
In particular, we know that the more features we include in the model, the closer our approximation can, in principle, be to the truth. (This is a general feature of modeling under misspecification.) So why not always include all the features? Today we provide a simple argument why not.
The bias–variance tradeoff
Recall that we want to use the training data to compute some \(\hat{f}(\boldsymbol{x})\) to minimize \(\mathbb{E}_{\mathrm{new}}\left[(\hat{f}(\boldsymbol{x}_\mathrm{new}) - y_\mathrm{new})^2\right]\).
How well does it do? Since the training data is also random, we want to evaluate it in terms of the expectation over the data \(\mathcal{D}\) as well:
\[ \begin{aligned} \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[(\hat{f}(\boldsymbol{x}_\mathrm{new}) - y_\mathrm{new})^2\right]. \end{aligned} \]
Here, \(\boldsymbol{x}_\mathrm{new},y_\mathrm{new}\) are an independent sample from the same distribution that gave rise to the \(N\) datapoints \(\mathcal{D}\), on which \(\hat{f}(\cdot)\) depends.
Let’s decompose this target. Define: \[ \begin{aligned} f^{\star}(\cdot) = \underset{f}{\mathrm{argmin}}\, \mathbb{E}_{\,}\left[(f(\boldsymbol{x}) - y)^2\right] \quad\textrm{and}\quad \overline{f}(\cdot) = \mathbb{E}_{\mathcal{D}}\left[\hat{f}(\cdot)\right]. \end{aligned} \]
We can then decompose our target into three terms:
\[ \begin{aligned} &\mathbb{E}_{\,}\left[(\hat{f}(\boldsymbol{x}_\mathrm{new}) - y_\mathrm{new})^2\right] \\&\quad={} \mathbb{E}_{\,}\left[(\hat{f}(\boldsymbol{x}_\mathrm{new}) - \overline{f}(\boldsymbol{x}_\mathrm{new}) + \overline{f}(\boldsymbol{x}_\mathrm{new}) - f^{\star}(\boldsymbol{x}_\mathrm{new}) + f^{\star}(\boldsymbol{x}_\mathrm{new}) - y_\mathrm{new})^2\right] \\&\quad={} \underset{\textrm{"variance"}}{ \mathbb{E}_{\,}\left[(\hat{f}(\boldsymbol{x}_\mathrm{new}) - \overline{f}(\boldsymbol{x}_\mathrm{new}))^2\right] } + \underset{\textrm{"bias"}}{ \mathbb{E}_{\,}\left[(\overline{f}(\boldsymbol{x}_\mathrm{new}) - f^{\star}(\boldsymbol{x}_\mathrm{new}))^2\right] } + \underset{\textrm{"irreducible error"}}{ \mathbb{E}_{\,}\left[(f^{\star}(\boldsymbol{x}_\mathrm{new}) - y_\mathrm{new})^2\right] }. \end{aligned} \tag{1}\]
As an exercise, show that the cross terms are zero in this decomposition.
Such an explicit decomposition is only available for squared error loss, but it provides a useful heuristic for thinking about model complexity in general.
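To make the decomposition concrete, here is a minimal NumPy sketch (not part of these notes) that estimates each term in Equation 1 by Monte Carlo for an assumed toy data-generating process: \(y = \sin(2\pi x)\) plus Gaussian noise, fit by cubic polynomial least squares. The sample sizes, noise level, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):                 # true conditional mean E[y | x] (toy assumption)
    return np.sin(2 * np.pi * x)

def features(x, degree=3):     # cubic polynomial feature expansion
    return np.vander(x, degree + 1, increasing=True)

def draw(n):                   # one dataset from the assumed DGP
    x = rng.uniform(0, 1, n)
    return x, f_star(x) + rng.normal(0, 0.3, n)

N, n_sims = 30, 2000                     # training size; number of training sets
x_new, y_new = draw(5000)                # "new" draws for the outer expectation
X_new = features(x_new)

# Compute fhat(x_new) for many independent training sets D.
preds = np.empty((n_sims, x_new.size))
for s in range(n_sims):
    x_tr, y_tr = draw(N)
    beta_hat = np.linalg.lstsq(features(x_tr), y_tr, rcond=None)[0]
    preds[s] = X_new @ beta_hat

f_bar = preds.mean(axis=0)               # estimate of fbar(.) = E_D[fhat(.)]
variance = np.mean((preds - f_bar) ** 2)
bias = np.mean((f_bar - f_star(x_new)) ** 2)
irreducible = np.mean((f_star(x_new) - y_new) ** 2)
total = np.mean((preds - y_new) ** 2)    # left-hand side of Equation 1
print(variance + bias + irreducible, total)   # the two numbers agree closely
```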
Increasing model complexity can only decrease the bias term, by allowing \(\hat{f}\) to fit a more expressive class of functions. The only place in this decomposition where expressivity can hurt the risk is the variance term. In some settings — including a version of simple linear regression here — we can show explicitly that increasing model complexity increases variance, though, as we will see, the story is not necessarily so simple, even for linear regression.
In any case, this decomposition shows that the sensitivity of \(\hat{f}\) to variability in the training set must be the source of any cost of complexity. It is for this reason that it can be useful to limit the complexity of \(\hat{f}\): we give up the ability to capture the true \(f^{\star}\) exactly, but we may be rewarded with a lower overall prediction error.
Misspecified linear regression
Let’s assume that we have regressors \(\boldsymbol{x}_n \in \mathbb{R}^{P}\), possibly created by feature expansions, and we would like to decide which regressors to include in our regression. Note that the larger \(P\) is, the more expressive the regression can be, and the closer we can get to the true \(\mathbb{E}_{\,}\left[y| \boldsymbol{x}\right]\). However, we will see that including more regressors typically comes at a cost.
As we have shown, \(f^{\star}(\boldsymbol{x}) = \mathbb{E}_{\,}\left[y\vert \boldsymbol{x}\right]\) minimizes the population loss \(\mathbb{E}_{\,}\left[(y- f(\boldsymbol{x}))^2\right]\), and the regression coefficient \(\boldsymbol{\beta}^{*}= M_{xx}^{-1} M_{xy}\) minimizes the linear regression loss \(\mathbb{E}_{\,}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta})^2\right]\), where
\[ M_{xx} := \mathbb{E}_{\,}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] \quad\textrm{and}\quad M_{xy} := \mathbb{E}_{\,}\left[\boldsymbol{x}y\right]. \]
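Applied to sample moments, the same formula is just the ordinary least squares estimator. Here is a minimal NumPy sketch (with an assumed toy data-generating process, coefficients, and seed) verifying this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 500, 3
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)   # toy DGP (assumption)

Mxx_hat = X.T @ X / N        # sample analogue of Mxx = E[x x^T]
Mxy_hat = X.T @ y / N        # sample analogue of Mxy = E[x y]

beta_moments = np.linalg.solve(Mxx_hat, Mxy_hat)          # plug-in Mxx^{-1} Mxy
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]         # ordinary least squares
print(np.allclose(beta_moments, beta_lstsq))              # True: same estimator
```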
The expectation is not tractable in misspecified models
We have shown that, for a fixed set of regressors, \(\hat{\boldsymbol{\beta}}\rightarrow \boldsymbol{\beta}^{*}\). But what is \(\mathbb{E}_{\,}\left[\hat{\boldsymbol{\beta}}\right]\)? We can write
\[ \begin{aligned} \mathbb{E}_{\,}\left[\hat{\boldsymbol{\beta}}\right] ={}& \mathbb{E}_{\,}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\right] \\={}& \mathbb{E}_{\,}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}_{\,}\left[\boldsymbol{Y}\vert \boldsymbol{X}\right]\right]. \end{aligned} \]
Note that \(\boldsymbol{X}\) is random in the preceding expression, since our pairs \((x_n, y_n)\) are IID — otherwise, the relationship between population and sample loss doesn’t make sense. In an introductory linear models class, you might assume that \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}^{*}+ \boldsymbol{\varepsilon}\) with \(\mathbb{E}_{\,}\left[\boldsymbol{\varepsilon}| \boldsymbol{X}\right] = \boldsymbol{0}\), so that \(\mathbb{E}_{\,}\left[\boldsymbol{Y}\vert \boldsymbol{X}\right] = \boldsymbol{X}\boldsymbol{\beta}^{*}\). If that were the case, then
\[ \begin{aligned} \mathbb{E}_{\,}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\right] ={}& \mathbb{E}_{\,}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{\beta}^{*}+ \boldsymbol{\varepsilon})\right] \\={}& \boldsymbol{\beta}^{*}+ \mathbb{E}_{\,}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}_{\,}\left[\boldsymbol{\varepsilon}| \boldsymbol{X}\right]\right] \\={}& \boldsymbol{\beta}^{*}. & \textrm{(Under correct specification)} \end{aligned} \]
However, under misspecification, there is no such simple result — in particular, there is no reason to think there is any \(\boldsymbol{\beta}\) such that \(\mathbb{E}_{\,}\left[y- \boldsymbol{\beta}^\intercal\boldsymbol{x}| \boldsymbol{x}\right] = 0\).
By imposing stronger–than–standard conditions on the distribution of \(\boldsymbol{x}\), one can show that \(\sqrt{N} (\mathbb{E}_{\,}\left[\hat{\boldsymbol{\beta}}\right] - \boldsymbol{\beta}^{*}) \rightarrow 0\). We will assume that this result holds without further elaboration, so that we can use \(\boldsymbol{\beta}^{*}\) in place of the intractable \(\mathbb{E}_{\,}\left[\hat{\boldsymbol{\beta}}\right]\) in the bias–variance decomposition.
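To see the finite-sample gap between \(\mathbb{E}_{\,}\left[\hat{\boldsymbol{\beta}}\right]\) and \(\boldsymbol{\beta}^{*}\) concretely, here is a minimal simulation sketch with an assumed misspecified model (true conditional mean \(x^2\), fit with an intercept and a linear term); the sample sizes, noise level, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def features(x):
    return np.column_stack([np.ones_like(x), x])          # misspecified: fit only (1, x)

def draw(n):
    x = rng.uniform(0, 1, n)
    return features(x), x ** 2 + rng.normal(0, 0.1, n)    # true E[y|x] = x^2

# beta* from (approximate) population moments, via one very large sample
X_big, y_big = draw(1_000_000)
beta_star = np.linalg.solve(X_big.T @ X_big, X_big.T @ y_big)

# Monte Carlo estimate of E_D[beta_hat] at a small training size
N, n_sims = 20, 20_000
betas = np.empty((n_sims, 2))
for s in range(n_sims):
    X, y = draw(N)
    betas[s] = np.linalg.lstsq(X, y, rcond=None)[0]

print("E_D[beta_hat] ~", betas.mean(axis=0))
print("beta*         ~", beta_star)   # close, but in general not exactly equal at small N
```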
Bias–variance decomposition for misspecified linear models
The population loss evaluated at \(\hat{\boldsymbol{\beta}}\) can be decomposed as above into
\[ \begin{aligned} \mathscr{L}(\hat{\boldsymbol{\beta}}) :={}& \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[(y- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] \\={}& \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[(y- f^{\star}(\boldsymbol{x}) + f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}+ {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] \\={}& \underset{\textrm{"irreducible error"}}{ \mathbb{E}_{\mathrm{new}}\left[(y- f^{\star}(\boldsymbol{x}))^2\right] } + \underset{\textrm{"bias"}}{ \mathbb{E}_{\mathrm{new}}\left[(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})^2\right] } + \underset{\textrm{"variance"}}{ \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] } + \\ & \quad 2 \mathbb{E}_{\mathrm{new}}\left[(y- f^{\star}(\boldsymbol{x}))(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})\right] + \\ & \quad 2 \mathbb{E}_{\mathrm{new}}\left[(y- f^{\star}(\boldsymbol{x})) ({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}\right]^\intercal\boldsymbol{x})\right] + \\ & \quad 2 \mathbb{E}_{\mathrm{new}}\left[({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}\right]^\intercal\boldsymbol{x}) (f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})\right]. \end{aligned} \]
By definition,
\[ \mathbb{E}_{\,}\left[y- f^{\star}(\boldsymbol{x}) | \boldsymbol{x}\right] = \mathbb{E}_{\,}\left[y- \mathbb{E}_{\,}\left[y| \boldsymbol{x}\right] | \boldsymbol{x}\right] = 0, \]
so we have by iterated expectations
\[ \mathbb{E}_{\,}\left[(y- f^{\star}(\boldsymbol{x}))(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})\right] = 0 \quad\textrm{and}\quad \mathbb{E}_{\,}\left[(y- f^{\star}(\boldsymbol{x}))({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})\right] = 0. \]
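Explicitly, for any function \(g\) that depends only on \(\boldsymbol{x}\) and the training data \(\mathcal{D}\) (which is independent of the new draw), iterated expectations gives
\[ \mathbb{E}_{\,}\left[(y- f^{\star}(\boldsymbol{x}))\, g(\boldsymbol{x}, \mathcal{D})\right] = \mathbb{E}_{\,}\left[g(\boldsymbol{x}, \mathcal{D})\, \mathbb{E}_{\,}\left[y- f^{\star}(\boldsymbol{x}) \vert \boldsymbol{x}, \mathcal{D}\right]\right] = 0; \]
taking \(g = f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\) and \(g = ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\) gives the two identities above.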
Now,
\[ \begin{aligned} \mathbb{E}_{\,}\left[({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}\right]^\intercal\boldsymbol{x}) (f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})\right] ={}& ({\boldsymbol{\beta}^{*}} - \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}\right])^\intercal \mathbb{E}_{\,}\left[\boldsymbol{x}(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})\right] \\={}& ({\boldsymbol{\beta}^{*}} - \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}\right])^\intercal \left( M_{xy} - M_{xx} \boldsymbol{\beta}^{*}\right) = 0, \end{aligned} \]
since \(M_{xy} - M_{xx} \boldsymbol{\beta}^{*}= \boldsymbol{0}\) by the definition \(\boldsymbol{\beta}^{*}= M_{xx}^{-1} M_{xy}\). So all the cross terms vanish, and only the irreducible error, bias, and variance terms contribute.
Bias term and irreducible error
The first term, \(\mathbb{E}_{\mathrm{new}}\left[(y- f^{\star}(\boldsymbol{x}))^2\right]\), has nothing to do with our model, and accounts for any unmodeled variation that might exist in the data. For example, there is no guarantee that, if you observe the same \(\boldsymbol{x}\) repeatedly, you will observe the same \(y\). Since you are only using \(\boldsymbol{x}\) to predict, there is no way to predict this variation in \(y\) without getting some new data.
The second term is the “bias” term. Note that the bias term is actually the expected square of the bias at a particular \(\boldsymbol{x}\):
\[ \textrm{Bias term: }\quad \mathbb{E}_{\mathrm{new}}\left[(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})^2\right] \]
This contrasts somewhat with the common definition of statistical bias
\[ \textrm{Statistical Bias at $\boldsymbol{x}$: }\quad f^{\star}(\boldsymbol{x}) - \mathbb{E}_{\mathcal{D}}\left[\hat{f}(\boldsymbol{x})\right] = f^{\star}(\boldsymbol{x}) - \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}^\intercal\right]\boldsymbol{x}= f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}+ \textrm{ A vanishing term}. \]
The statistical bias can average to zero if it is positive for some \(\boldsymbol{x}\) and negative for others, but since we square the bias before averaging, there can be no such cancellation in the bias term.
We can write the bias term explicitly using the fact that \(\mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})\boldsymbol{x}^\intercal\right] = M_{xy}^\intercal=: M_{yx}\) (verify as an exercise):
\[ \begin{aligned} \mathbb{E}_{\mathrm{new}}\left[(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})^2\right] ={}& \mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})^2\right] + {\boldsymbol{\beta}^{*}}^\intercal\mathbb{E}_{\mathrm{new}}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] \boldsymbol{\beta}^{*}- 2 \mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})\boldsymbol{x}^\intercal\right]\boldsymbol{\beta}^{*} \\={}& \mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})^2\right] + {\boldsymbol{\beta}^{*}}^\intercal M_{xx} \boldsymbol{\beta}^{*}- 2 M_{yx} \boldsymbol{\beta}^{*} \\={}& \mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})^2\right] + M_{yx} M_{xx}^{-1} M_{xx} M_{xx}^{-1} M_{xy} - 2 M_{yx} M_{xx}^{-1} M_{xy} \\={}& \mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})^2\right] - M_{yx} M_{xx}^{-1} M_{xy}. \end{aligned} \]
This can only decrease as \(P\) increases, as long as the smaller set of regressors is nested within the larger one. Perhaps the easiest way to see this is to note that
\[ \mathbb{E}_{\mathrm{new}}\left[(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] = \mathscr{L}(\boldsymbol{\beta}) - \mathbb{E}_{\mathrm{new}}\left[(y- f^{\star}(\boldsymbol{x}))^2\right]. \]
By making \(\boldsymbol{\beta}\) more expressive, we can only decrease the population loss \(\mathscr{L}(\boldsymbol{\beta})\); since we cannot change the irreducible loss by doing so, we must be decreasing the bias term.
As expected, we thus see that increasing model complexity decreases the bias term.
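This monotonicity is easy to illustrate numerically. The following minimal sketch (not part of the notes) estimates the bias term \(\mathbb{E}_{\mathrm{new}}\left[f^{\star}(\boldsymbol{x})^2\right] - M_{yx} M_{xx}^{-1} M_{xy}\) for polynomial features of increasing degree, approximating the population moments with one large Monte Carlo sample; the choices \(f^{\star}(x) = \sin(2\pi x)\) and \(x \sim \mathrm{Uniform}(0, 1)\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 1_000_000)        # large sample to approximate population moments
fstar = np.sin(2 * np.pi * x)           # assumed true conditional mean

for degree in range(7):
    X = np.vander(x, degree + 1, increasing=True)   # features 1, x, ..., x^degree
    Mxx = X.T @ X / x.size                          # ~ E[x x^T]
    Mxy = X.T @ fstar / x.size                      # ~ E[x f*(x)] = E[x y]
    bias_term = np.mean(fstar ** 2) - Mxy @ np.linalg.solve(Mxx, Mxy)
    print(f"P = {degree + 1}: bias term ~ {bias_term:.3e}")   # shrinks as P grows
```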
Variance term
The variance term is the only term that depends on the training data. We can expand it as follows:
\[ \begin{aligned} \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] ={}& \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[ ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\boldsymbol{x}^\intercal({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}}) \right] \\={}& \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[ \mathrm{trace}\left( \boldsymbol{x}\boldsymbol{x}^\intercal({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}}) ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}})^\intercal \right) \right] \\={}& \mathrm{trace}\left( \mathbb{E}_{\mathrm{new}}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] \mathbb{E}_{\mathcal{D}}\left[ ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}}) ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}})^\intercal \right] \right) \\={}& \mathrm{trace}\left( M_{xx} \mathbb{E}_{\mathcal{D}}\left[ ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}}) ({\boldsymbol{\beta}^{*}} - \hat{\boldsymbol{\beta}})^\intercal \right] \right). \end{aligned} \]
Again, in an introductory linear models course, you might have seen an application of the CLT to \(\hat{\boldsymbol{\beta}}\):
\[ \sqrt{N} (\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}) \rightarrow \mathcal{N}\left(\boldsymbol{0}, \sigma^2 M_{xx}^{-1}\right) \quad\textrm{(under correct specification and homoskedasticity)}\quad, \]
where \(\sigma^2\) is the variance of the residual. It turns out that a CLT does hold under misspecification, but with a different variance. We can write
\[ \begin{aligned} \hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}} ={}& \left(\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{\beta}^{*} \\={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n y_n - \boldsymbol{\beta}^{*} \\={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}) + \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right) \boldsymbol{\beta}^{*} - \boldsymbol{\beta}^{*} \\={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}). \end{aligned} \]
By the LLN and CMT,
\[ \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \rightarrow M_{xx}^{-1}. \]
Noting that
\[ \mathbb{E}_{\,}\left[ \boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*})\right] = M_{xy} - M_{xx} M_{xx}^{-1}M_{xy} = 0, \]
we can apply the CLT to get
\[ \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}) \rightarrow \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{V}\right) \quad\textrm{where}\quad \varepsilon_n := y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*} \quad\textrm{and}\quad \boldsymbol{V}:= \mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right). \]
Note that, unlike in standard correctly specified regression, in general
\[ \mathbb{E}_{\,}\left[\varepsilon_n \vert \boldsymbol{x}_n\right] = \mathbb{E}_{\,}\left[y_n \vert \boldsymbol{x}_n\right] - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}\ne 0. \]
That is, the residual \(\varepsilon_n\) and the regressor \(\boldsymbol{x}_n\) have some nontrivial joint distribution. Putting this together,
\[ \sqrt{N}\left(\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}\right) \rightarrow \mathcal{N}\left(\boldsymbol{0}, M_{xx}^{-1} V M_{xx}^{-1}\right). \]
(As an exercise, show that you recover the correct specification covariance when \(\varepsilon_n\) and \(\boldsymbol{x}_n\) are independent.)
Note that the CLT — which is convergence in distribution — is insufficient to conclude that \(\mathrm{Var}\left(\sqrt{N} \hat{\boldsymbol{\beta}}\right) \rightarrow M_{xx}^{-1} V M_{xx}^{-1}\). However, if \(\mathrm{Var}\left(\sqrt{N} \hat{\boldsymbol{\beta}}\right)\) does converge to something, which is true under stronger assumptions that we do not discuss here, it must converge to \(M_{xx}^{-1} V M_{xx}^{-1}\).
Combining this with the variance term, we finally get \[ \begin{aligned} N \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] \rightarrow {}& \mathrm{trace}\left( M_{xx} M_{xx}^{-1} V M_{xx}^{-1} \right) \\={}& \mathrm{trace}\left( M_{xx}^{-1/2} \mathrm{Cov}\left(\boldsymbol{x}\varepsilon\right) M_{xx}^{-1/2} \right). \end{aligned} \]
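As a numerical check of this limit, the following minimal sketch compares \(\mathrm{trace}\left(M_{xx}^{-1} V\right)\) with a direct Monte Carlo estimate of \(N\) times the variance term, for an assumed misspecified toy model (true conditional mean \(x^2\), fit with an intercept and a linear term); the constants and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def features(x):
    return np.column_stack([np.ones_like(x), x])          # linear fit (misspecified)

def draw(n):
    x = rng.uniform(0, 1, n)
    return features(x), x ** 2 + rng.normal(0, 0.2, n)    # true E[y|x] = x^2

# Population quantities, approximated from one very large sample
Xb, yb = draw(2_000_000)
Mxx = Xb.T @ Xb / len(yb)
beta_star = np.linalg.solve(Xb.T @ Xb, Xb.T @ yb)
eps = yb - Xb @ beta_star                                 # population residual
V = np.cov((Xb * eps[:, None]).T)                         # Cov(x * eps)
limit = np.trace(np.linalg.solve(Mxx, V))                 # trace(Mxx^{-1} V)

# Direct Monte Carlo estimate of N * E[(beta*' x - betahat' x)^2]
N, n_sims = 200, 5_000
vals = np.empty(n_sims)
for s in range(n_sims):
    X, y = draw(N)
    d = np.linalg.lstsq(X, y, rcond=None)[0] - beta_star
    vals[s] = N * d @ Mxx @ d                             # E_new[(d' x)^2 | D] = d' Mxx d
print(f"limit: {limit:.4f}   Monte Carlo at N={N}: {vals.mean():.4f}")
```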
Analyzing the variance term
Ideally, we would like to know how the variance term depends on \(P\). However, this behavior can be complicated in general. Intuition can be gained from two special cases:
- Where \(y_n\) is mean zero and independent of \(\boldsymbol{x}_n\), and
- Where \(y_n = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n + z_n + \varepsilon_n\), for some mean zero, finite-variance \(z_n\) and \(\varepsilon_n\) that are independent of \(\boldsymbol{x}_n\) and of each other.
In the former case, we will see that the variance term can only increase as you add regressors. In the latter case, including \(z_n\) in the regression can either increase or decrease the variance term, depending on the variance of \(z_n\).
No signal
First, suppose that \(\mathbb{E}_{\,}\left[y\right] = 0\), \(\mathrm{Var}\left(y\right) = \sigma^2\), and \(y\) is independent of \(\boldsymbol{x}\). This is an extreme case, where there is no signal whatsoever. In this case, we recover the classical result, since \(M_{xy} = \boldsymbol{0}\), \(\boldsymbol{\beta}^{*}= \boldsymbol{0}\), and
\[ \mathrm{Cov}\left(\boldsymbol{x}\varepsilon_n\right) = M_{xx} \sigma^2 \quad\textrm{$\Rightarrow$}\quad N \cdot \textrm{Variance term } \rightarrow \mathrm{trace}\left( M_{xx}^{-1/2} M_{xx} M_{xx}^{-1/2} \sigma^2 \right) = \mathrm{trace}\left(\boldsymbol{I}_P\right) \sigma^2 = P \sigma^2. \]
Effectively, the training variance in \(\boldsymbol{x}_n\) and the predictive variance in \(\boldsymbol{x}_\mathrm{new}\) cancel out, and we are left with
\[ \textrm{Variance term} \approx \frac{P \sigma^2}{N}. \]
We thus see that
- The variance term grows linearly in \(P\) for fixed \(N\),
- The variance term shrinks as \(1/N\) for fixed \(P\), and
- The variance term increases with the variability \(\sigma^2\) of \(y_n\).
As an exercise you can show that this is what you get for correctly specified homoskedastic linear regression in which you might include some extra, unneeded regressors.
This example is somewhat unsatisfactory, since the bias is not decreasing as \(P\) grows — the bias is optimal with no regressors at all.
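Even so, the \(P \sigma^2 / N\) scaling is easy to check by simulation. A minimal sketch, assuming standard normal regressors (so \(M_{xx} = \boldsymbol{I}_P\)); the constants and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N, P, sigma, n_sims = 400, 5, 2.0, 5_000

vals = np.empty(n_sims)
for s in range(n_sims):
    X = rng.normal(size=(N, P))          # Mxx = I_P for standard normal regressors
    y = rng.normal(0, sigma, N)          # no signal: y independent of x, so beta* = 0
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    vals[s] = beta_hat @ beta_hat        # (beta_hat - beta*)' Mxx (beta_hat - beta*)

print(f"N * variance term ~ {N * vals.mean():.2f}   vs   P * sigma^2 = {P * sigma**2:.2f}")
```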
Omitted regressors
Now suppose that we have an omitted regressor, \(z_n\). Without loss of generality we can take \(y_n = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n + z_n + \varepsilon_n\), so the omitted coefficient is \(1\) (otherwise we could just rescale the standard deviation of \(z\)). Assume that
- \(\mathbb{E}_{\,}\left[z\right] = 0\), \(\mathrm{Var}\left(z\right) = \sigma_z^2\),
- \(\mathbb{E}_{\,}\left[\varepsilon\right] = 0\), \(\mathrm{Var}\left(\varepsilon\right) = \sigma^2\),
- \(\boldsymbol{x}\), \(z\), \(\varepsilon\) are all independent.
Note that \(M_{xy} = M_{xx} \boldsymbol{\beta}^{*}\), so \(\boldsymbol{\beta}^{*}\) is indeed the population coefficient of the regression that omits \(z\). The residual of that regression is
\(y- {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}= z+ \varepsilon\),
so
\[ \mathrm{Cov}\left(\boldsymbol{x}(z+ \varepsilon)\right) = M_{xx} (\sigma_z^2 + \sigma^2) \]
and
\[ N \cdot \textrm{Variance term without $z$: } \rightarrow \mathrm{trace}\left( M_{xx}^{-1/2} M_{xx} M_{xx}^{-1/2} (\sigma_z^2 + \sigma^2) \right) = P (\sigma_z^2 + \sigma^2). \]
However, by the same reasoning, if we had included \(z\) in the regression, the model would be correctly specified with homoskedastic residual \(\varepsilon\), and we would have gotten
\[ N \cdot \textrm{Variance term with $z$: } \rightarrow \mathrm{trace}\left( \tilde{M}_{xx}^{-1/2} \tilde{M}_{xx} \tilde{M}_{xx}^{-1/2} \sigma^2 \right) = (P + 1) \sigma^2, \]
where \(\tilde{M}_{xx}\) is the second moment matrix of the augmented regressor vector \((\boldsymbol{x}, z)\).
Thus, by adding the regressor \(z\), the variance increases if and only if
\[ (P + 1) \sigma^2 \ge P (\sigma_z^2 + \sigma^2) \quad\textrm{$\Leftrightarrow$}\quad 1 + \frac{1}{P} \ge 1 + \frac{\sigma_z^2}{\sigma^2} \quad\textrm{$\Leftrightarrow$}\quad \sigma_z^2 \le \frac{\sigma^2}{P}. \]
Thus, if \(\sigma_z^2 > \sigma^2 / P\), then including \(z\) decreases both the bias term (which adding a relevant regressor always does) and the variance term.
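This comparison can be checked by simulation. Below is a minimal sketch under the assumptions of this example (standard normal \(\boldsymbol{x}_n\), so \(M_{xx} = \boldsymbol{I}_P\)), estimating \(N\) times the variance term with and without \(z\) for a \(\sigma_z^2\) below and above the threshold \(\sigma^2 / P\); the constants and seed are illustrative, and the "limit" values are the asymptotic formulas above.

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, sigma, n_sims = 400, 5, 1.0, 4_000
beta_star = np.ones(P)

def variance_terms(sigma_z):
    """Monte Carlo estimates of N * variance term, without and with z in the model."""
    without_z, with_z = [], []
    Mxx_x = np.eye(P)                                    # second moments of x
    Mxx_xz = np.diag(np.r_[np.ones(P), sigma_z ** 2])    # second moments of (x, z)
    for _ in range(n_sims):
        X = rng.normal(size=(N, P))
        z = rng.normal(0.0, sigma_z, N)
        y = X @ beta_star + z + rng.normal(0.0, sigma, N)

        d = np.linalg.lstsq(X, y, rcond=None)[0] - beta_star
        without_z.append(N * d @ Mxx_x @ d)

        Xz = np.column_stack([X, z])
        d = np.linalg.lstsq(Xz, y, rcond=None)[0] - np.r_[beta_star, 1.0]
        with_z.append(N * d @ Mxx_xz @ d)
    return np.mean(without_z), np.mean(with_z)

for sigma_z in (0.1, 1.0):    # sigma_z^2 below and above sigma^2 / P = 0.2
    wo, wi = variance_terms(sigma_z)
    print(f"sigma_z^2 = {sigma_z ** 2:.2f}: without z ~ {wo:.1f} "
          f"(limit {P * (sigma_z ** 2 + sigma ** 2):.1f}), "
          f"with z ~ {wi:.1f} (limit {(P + 1) * sigma ** 2:.1f})")
```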