Bias and variance tradeoff in prediction
Goals
- Study the dependence of regression mean squared error on model complexity
- Bias-variance tradeoff and mean squared error
- Simple expressions for bias and variance under simple assumptions
- More general interpretations
Reading
These lecture notes can be supplemented with
Consequences of model complexity
For this lecture, we return to squared loss and linear regression. We have discussed how we can generate many features for linear regression. In principle, we can generate an arbitrarily large number of features — bounded only by the number of distinct values the regressors can take. Which features should we include in the model?
We will be discussing this as a practical matter throughout the course. But here, we will show that in linear regression the choice of which features to include involves an explicit tradeoff between two quantities, “bias” and “variance,” which may (or may not) move in different directions as more regressors are included.
Recall that when using the squared loss, the optimal linear predictor is given by \(\boldsymbol{\beta}^{*}= M_{xx}^{-1} M_{xy}\), where \(M_{xx} = \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right]\) and \(M_{xy} = \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}y\right]\). The optimal prediction function among all functions is given by \(f^{\star}(\boldsymbol{x}) = \mathbb{E}_{\vphantom{}}\left[y\vert \boldsymbol{x}\right]\).
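To make these population quantities concrete, here is a minimal Python sketch (using a made-up simulated data-generating process, not anything from the notes) that approximates \(M_{xx}\), \(M_{xy}\), and hence \(\boldsymbol{\beta}^{*}\) with sample moments from a large sample.

```python
import numpy as np

# A minimal sketch with a hypothetical data-generating process: approximate the
# optimal linear coefficients beta* = Mxx^{-1} Mxy by replacing the population
# moments with sample moments from a large simulated sample.
rng = np.random.default_rng(0)
N, P = 100_000, 3
x = rng.normal(size=(N, P))
y = 0.5 * x[:, 0] - 1.0 * x[:, 1] + np.sin(x[:, 2]) + rng.normal(size=N)

M_xx = x.T @ x / N            # approximates E[x x^T]
M_xy = x.T @ y / N            # approximates E[x y]
beta_star = np.linalg.solve(M_xx, M_xy)
print(beta_star)              # coefficients of the best linear predictor
```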
The bias–variance tradeoff under correct specification
The classical bias–variance tradeoff occurs under (a) correct specification and (b) conditional on the training regressors, \(\boldsymbol{X}\). The assumption of correct specification is particularly problematic, and, as we see below, in its absence the bias–variance tradeoff becomes more complicated. However, it is good to be familiar with the classical version.
Suppose we define \(\overline{\boldsymbol{\beta}}:= \mathbb{E}_{p(\boldsymbol{Y}\vert \boldsymbol{X})}\left[\hat{\boldsymbol{\beta}}\right]\), and assume that \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}+ \epsilon\), for some \(\epsilon\) satisfying \(\mathbb{E}_{\vphantom{}}\left[\epsilon | \boldsymbol{x}\right] = 0\) and \(\mathbb{E}_{\vphantom{}}\left[\epsilon^2 \vert \boldsymbol{x}\right] = \sigma^2\). For now, let’s not assume anything about \(\hat{\boldsymbol{\beta}}\) — it’s just some estimator that depends on the training data, \(\boldsymbol{X}\) and \(\boldsymbol{Y}\), not necessarily the OLS estimator.
We can then write \[ \begin{aligned} \mathscr{R}(\hat{\boldsymbol{\beta}}) ={}& \mathbb{E}_{p(\boldsymbol{x}, y)}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right] \\={}& \mathbb{E}_{p(\boldsymbol{x}, y)}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}+ \boldsymbol{x}^\intercal(\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) + \boldsymbol{x}^\intercal(\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}))^2\right] \\={}& \mathbb{E}_{p(\boldsymbol{x}, y)}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})^2\right] + \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\boldsymbol{x}^\intercal(\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] + \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\boldsymbol{x}^\intercal(\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}) (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] + \\{}& 2 \mathbb{E}_{p(\boldsymbol{x}, y)}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] + 2 \mathbb{E}_{p(\boldsymbol{x}, y)}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}) (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] + \\{}& 2 \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\boldsymbol{x}^\intercal(\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] \\{}& \\={}& \sigma^2 + \mathrm{trace}\left(\mathbb{E}_{p(\boldsymbol{x}, y)}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\right) + \mathrm{trace}\left(\mathbb{E}_{p(\boldsymbol{x}, y)}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}) (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\right) + \\{}& 2 \mathbb{E}_{p(\boldsymbol{x})}\left[\mathbb{E}_{p(y\vert \boldsymbol{x})}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})\right] (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] + \\{}& 2 \mathbb{E}_{p(\boldsymbol{x})}\left[\mathbb{E}_{p(y\vert \boldsymbol{x})}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})\right] (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}\right] + \\{}& 2 \mathrm{trace}\left(\mathbb{E}_{p(\boldsymbol{x}, y)}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\right) \\={}& \sigma^2 + \mathrm{trace}\left(M_{xx} (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\right) + \mathrm{trace}\left(M_{xx} (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}) (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\right) + \\{}& 2 \mathrm{trace}\left(M_{xx} (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\right) \end{aligned} \] because, by correct specification, \(\mathbb{E}_{p(y\vert \boldsymbol{x})}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})\right] = 0\).
Naturally, \(\mathscr{R}(\hat{\boldsymbol{\beta}})\) is random, since \(\hat{\boldsymbol{\beta}}\) is a function of the training data, which is random. If we take the expectation of the preceding expression with respect to the training \(\boldsymbol{Y}\), but conditional on \(\boldsymbol{X}\), we get
\[ \begin{aligned} \mathbb{E}_{p(\boldsymbol{Y}\vert \boldsymbol{X})}\left[\mathscr{R}(\hat{\boldsymbol{\beta}})\right] ={}& \underbrace{\sigma^2}_{\textrm{Irreducible error}} + \underbrace{\mathrm{trace}\left(M_{xx} (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\right)}_{\textrm{Bias term}} + \underbrace{\mathrm{trace}\left(M_{xx} \mathrm{Cov}_{p(\boldsymbol{Y}\vert \boldsymbol{X})}\left(\hat{\boldsymbol{\beta}}\right)\right)}_{\textrm{Variance term}} \end{aligned} \] because \(\mathbb{E}_{p(\boldsymbol{Y}\vert \boldsymbol{X})}\left[\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}\right] = 0\). We thus see that the expected regret is given by two terms — one involving the covariance of \(\hat{\boldsymbol{\beta}}\) and one involving the bias \(\overline{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}\).
The decomposition only looks this nice when we have correct specification, and define bias by conditioning on \(\boldsymbol{X}\). And at this point, it’s not obvious that there’s a “tradeoff.” However, in at least some cases, there is, in the sense that increasing complexity tends to increase variance and decrease bias, and vice-versa.
The bias–variance tradeoff for ridge estimation
One such case is ridge regression under correct specification. Recall the ridge estimator, \[ \hat{\boldsymbol{\beta}}_\lambda := \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda {\boldsymbol{I}_{\,}}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}, \] which always exists for \(\lambda > 0\). Let us see how the \(\lambda\) parameter effects a bias–variance tradeoff under the (unrealistic) assumption of correct specification. We will use the SVD of \(\boldsymbol{X}\),
\[ \boldsymbol{X}= \boldsymbol{U}\Lambda \boldsymbol{V}^\intercal, \]
where \(\Lambda\) is diagonal with entries \(\lambda_p\) for \(p\in 1,\ldots, P\), \(\boldsymbol{V}\) is \(P \times P\), \(\boldsymbol{U}\) is \(N \times P\), \(\boldsymbol{V}\) is orthogonal, and \(\boldsymbol{U}\) has orthonormal columns. In this form, we can write
\[ \boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{V}\Lambda \boldsymbol{U}^\intercal\boldsymbol{U}\Lambda \boldsymbol{V}^\intercal= \boldsymbol{V}\Lambda^2 \boldsymbol{V}^\intercal. \] Note that \[ \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{V}\frac{1}{N} \Lambda^2 \boldsymbol{V}^\intercal \rightarrow \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] = M_{xx}, \] so we expect \(\lambda_p^2\) to be of order \(N\) if \(\mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right]\) is non–degenerate.
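These identities are easy to check numerically. Here is a small Python sketch (with a hypothetical simulated \(\boldsymbol{X}\)) verifying that \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{V}\Lambda^2 \boldsymbol{V}^\intercal\) and that \(\lambda_p^2 / N\) approximates the eigenvalues of \(M_{xx}\).

```python
import numpy as np

# A small numerical check (hypothetical simulated X): the thin SVD X = U Lam V^T
# gives X^T X = V Lam^2 V^T, and Lam^2 / N approximates the eigenvalues of M_xx.
rng = np.random.default_rng(1)
N = 5_000
X = rng.normal(size=(N, 4)) * np.array([1.0, 2.0, 3.0, 4.0])   # E[x x^T] = diag(1, 4, 9, 16)

U, lam, Vt = np.linalg.svd(X, full_matrices=False)             # lam holds the singular values
assert np.allclose(X.T @ X, Vt.T @ np.diag(lam**2) @ Vt)
print(np.sort(lam**2) / N)    # roughly [1, 4, 9, 16], so lam_p^2 is of order N
```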
Letting \(\mathbf{L}= \Lambda^2 + \lambda {\boldsymbol{I}_{\,}}\), the diagonal matrix with \(\lambda_p^2 + \lambda\) on the diagonal, the ridge estimator is
\[ \begin{aligned} \hat{\boldsymbol{\beta}}_\lambda ={}& \boldsymbol{V}\mathbf{L}^{-1} \Lambda \boldsymbol{U}^\intercal\boldsymbol{Y}\\={}& \boldsymbol{V}\mathbf{L}^{-1} \Lambda \boldsymbol{U}^\intercal(\boldsymbol{X}\boldsymbol{\beta}^{*}+ \epsilon) \\={}& \boldsymbol{V}\mathbf{L}^{-1} \Lambda^2 \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*}+ \boldsymbol{V}\mathbf{L}^{-1} \Lambda \boldsymbol{U}^\intercal\epsilon. \end{aligned} \]
We thus see that
\[ \overline{\boldsymbol{\beta}}= \mathbb{E}_{\vphantom{}}\left[\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right] = \boldsymbol{V}\mathbf{L}^{-1} \Lambda^2 \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*} \]
and
\[ \begin{aligned} \mathrm{Cov}_{\,}\left(\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right) ={}& \boldsymbol{V}\mathbf{L}^{-1} \Lambda \boldsymbol{U}^\intercal\mathbb{E}_{\vphantom{}}\left[\epsilon \epsilon^\intercal\right] \boldsymbol{U}\Lambda \mathbf{L}^{-1} \boldsymbol{V}^\intercal\\={}& \sigma^2 \boldsymbol{V}\mathbf{L}^{-1} \Lambda^2 \mathbf{L}^{-1} \boldsymbol{V}^\intercal. \end{aligned} \]
(Recalling that \(\lambda_p^2 \propto N\), so these expressions have the right scale.)
If we plug these values into our bias–variance decomposition, and use the fact that \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}= \frac{1}{N} \boldsymbol{V}\Lambda^2 \boldsymbol{V}^\intercal\approx M_{xx}\), we get that
\[ \begin{aligned} \overline{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}={}& \boldsymbol{V}\left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*}\quad \Rightarrow \\ \textrm{Bias term}={}& \mathrm{trace}\left(M_{xx} (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}}) (\boldsymbol{\beta}^{*}- \overline{\boldsymbol{\beta}})^\intercal\right) \\\approx{}& \frac{1}{N} \mathrm{trace}\left(\boldsymbol{V}\Lambda^2 \boldsymbol{V}^\intercal\boldsymbol{V}\left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*} {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{V}\left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \boldsymbol{V}^\intercal\right) \\={}& \frac{1}{N} \mathrm{trace}\left(\Lambda^2 \left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*} {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{V}\left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \right) \\={}& \frac{1}{N} \mathrm{trace}\left( \left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \Lambda^2 \left(\mathbf{L}^{-1} \Lambda^2 - {\boldsymbol{I}_{\,}}\right) \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*}{\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{V}\right) \\={}& \frac{1}{N} \mathrm{trace}\left( \mathrm{Diag}\left( \left( \frac{\lambda_p^2}{\lambda_p^2 + \lambda} - 1\right)^2 \lambda_p^2 \right) \boldsymbol{V}^\intercal\boldsymbol{\beta}^{*}{\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{V}\right) \end{aligned} \]
As you can see, as \(\lambda \rightarrow 0\), the bias goes to zero, and as \(\lambda \rightarrow \infty\), the bias approaches its maximal value.
For the variance, we get
\[ \begin{aligned} \textrm{Variance term}={}& \mathrm{trace}\left(M_{xx} \sigma^2 \boldsymbol{V}\mathbf{L}^{-1} \Lambda^2 \mathbf{L}^{-1} \boldsymbol{V}^\intercal\right) \\\approx{}& \frac{\sigma^2}{N} \mathrm{trace}\left(\boldsymbol{V}\Lambda^2 \boldsymbol{V}^\intercal\boldsymbol{V}\mathbf{L}^{-1} \Lambda^2 \mathbf{L}^{-1} \boldsymbol{V}^\intercal\right) \\=& \frac{\sigma^2}{N} \mathrm{trace}\left(\Lambda^2 \mathbf{L}^{-1} \Lambda^2 \mathbf{L}^{-1}\right) \\=& \frac{\sigma^2}{N} \mathrm{trace}\left(\mathrm{Diag}\left(\frac{\lambda_p^4}{(\lambda_p^2 + \lambda)^2}\right)\right). \end{aligned} \]
The variance term moves in the opposite direction with \(\lambda\): as \(\lambda \rightarrow 0\), it attains its maximal value, and as \(\lambda \rightarrow \infty\), it goes to zero.
When using a ridge estimator under correct specification, we would then expect the risk \(\mathscr{R}(\hat{\boldsymbol{\beta}}_\lambda)\), as a function of \(\lambda\), to be the irreducible error plus the sum of two curves: a bias curve, which is increasing in \(\lambda\), and a variance curve, which is decreasing in \(\lambda\). At some intermediate value of \(\lambda\), the risk is minimized.
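The following Python sketch evaluates the two expressions derived above as \(\lambda\) varies, under an assumed correctly specified simulation (the sample size, dimension, and noise level are made-up illustrations). The bias term increases and the variance term decreases, as described.

```python
import numpy as np

# A sketch of the ridge bias-variance tradeoff under an assumed correctly
# specified simulation, using the SVD expressions derived above.
rng = np.random.default_rng(2)
N, P, sigma = 500, 10, 1.0
beta_star = rng.normal(size=P)
X = rng.normal(size=(N, P))

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
vb = Vt @ beta_star                                   # V^T beta*

for ridge_lambda in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    shrink = lam**2 / (lam**2 + ridge_lambda)         # diagonal of L^{-1} Lambda^2
    bias_term = np.sum((shrink - 1.0)**2 * lam**2 * vb**2) / N
    variance_term = (sigma**2 / N) * np.sum(lam**4 / (lam**2 + ridge_lambda)**2)
    print(f"lambda={ridge_lambda:7.1f}  bias={bias_term:.4f}  variance={variance_term:.4f}")
```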
The “bias–variance” tradeoff more generally
Recall that, in general, we can decompose the regret into two terms that behave in opposite directions with respect to model complexity:
\[ \mathscr{R}(\hat{\boldsymbol{\beta}}) - \mathscr{R}(f^{\star}) = \mathscr{R}(\hat{\boldsymbol{\beta}}) - \mathscr{R}(\boldsymbol{\beta}^{*}) + \mathscr{R}(\boldsymbol{\beta}^{*}) - \mathscr{R}(f^{\star}). \]
We discussed how, intuitively, if there is a price to be paid for model complexity, it should be found in the first term, \(\mathscr{R}(\hat{\boldsymbol{\beta}}) - \mathscr{R}(\boldsymbol{\beta}^{*})\). For the special case of squared error, we can make this decomposition more precise. In particular, we can write
\[ \begin{aligned} \mathscr{R}(\hat{\boldsymbol{\beta}}) ={}& \mathbb{E}_{\vphantom{}}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}+ \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})^2\right] + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right] + 2 \mathbb{E}_{\vphantom{}}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})\boldsymbol{x}^\intercal(\boldsymbol{\beta}^{*}- \hat{\boldsymbol{\beta}})\right]. \\={}& \mathscr{R}(\boldsymbol{\beta}^{*}) + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right] + 2 \mathbb{E}_{\vphantom{}}\left[(y\boldsymbol{x}^\intercal- {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\boldsymbol{x}^\intercal)\right] (\boldsymbol{\beta}^{*}- \hat{\boldsymbol{\beta}}) \\={}& \mathscr{R}(\boldsymbol{\beta}^{*}) + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right], \end{aligned} \] where in the final line we have plugged in the formula for \(\boldsymbol{\beta}^{*}\) to show that the cross term is zero. We can similarly decompose
\[ \begin{aligned} \mathscr{R}(\boldsymbol{\beta}^{*}) ={}& \mathbb{E}_{\vphantom{}}\left[(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[(y- f^{\star}(\boldsymbol{x}) + f^{\star}(\boldsymbol{x}) - \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[(y- f^{\star}(\boldsymbol{x}))^2\right] + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- f^{\star}(\boldsymbol{x}))^2\right] + 2 \mathbb{E}_{\vphantom{}}\left[(y- f^{\star}(\boldsymbol{x})) (f^{\star}(\boldsymbol{x}) - \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*})\right] \\={}& \mathscr{R}(f^{\star}) + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- f^{\star}(\boldsymbol{x}))^2\right], \end{aligned} \] where in the final line we used the fact that \(\mathbb{E}_{p(y\vert \boldsymbol{x})}\left[y- f^{\star}(\boldsymbol{x})\right] = 0\). Putting this together and slightly rewriting the first term, we see that the regret is given by
\[ \begin{aligned} \mathscr{R}(\hat{\boldsymbol{\beta}}) - \mathscr{R}(f^{\star}) ={}& \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2\right] + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- f^{\star}(\boldsymbol{x}))^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[\mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal(\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}) (\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*})^\intercal\right)\right] + \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- f^{\star}(\boldsymbol{x}))^2\right] \\={}& \underset{\textrm{"variance"}}{\mathrm{trace}\left(M_{xx} (\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}) (\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*})^\intercal\right)} + \underset{\textrm{"bias"}}{\mathbb{E}_{\vphantom{}}\left[(\boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}- f^{\star}(\boldsymbol{x}))^2\right]}. \end{aligned} \]
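As a concrete illustration, here is a small Monte Carlo sketch in Python that evaluates both terms of this decomposition for a single OLS fit. The nonlinear truth \(f^{\star}\), the sample sizes, and the use of a huge sample to stand in for the population quantities are all assumptions made for illustration.

```python
import numpy as np

# A Monte Carlo sketch under an assumed nonlinear truth f*(x) = sin(x_1) + x_2^2:
# evaluate the "variance" and "bias" terms in the decomposition of the regret.
rng = np.random.default_rng(3)

def f_star(x):
    return np.sin(x[:, 0]) + x[:, 1]**2

def draw(n):
    x = rng.normal(size=(n, 2))
    return x, f_star(x) + rng.normal(size=n)

# Approximate the population quantities M_xx and beta* with a very large sample.
x_pop, y_pop = draw(1_000_000)
M_xx = x_pop.T @ x_pop / len(y_pop)
beta_star = np.linalg.solve(M_xx, x_pop.T @ y_pop / len(y_pop))
bias_term = np.mean((x_pop @ beta_star - f_star(x_pop))**2)

# Fit OLS on a small training set; its "variance" term is trace(M_xx d d^T) = d^T M_xx d.
x_tr, y_tr = draw(50)
beta_hat = np.linalg.lstsq(x_tr, y_tr, rcond=None)[0]
d = beta_hat - beta_star
print(f"'variance' term ~ {d @ M_xx @ d:.3f},  'bias' term ~ {bias_term:.3f}")
```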
Only asymptotic variance and bias
Recall that the variance of \(\hat{\boldsymbol{\beta}}\) would be given by
\[ \mathbb{E}_{\mathcal{D}}\left[(\hat{\boldsymbol{\beta}}- \overline{\boldsymbol{\beta}})(\hat{\boldsymbol{\beta}}- \overline{\boldsymbol{\beta}})^\intercal\right] \quad\textrm{where}\quad \overline{\boldsymbol{\beta}}:= \mathbb{E}_{\mathcal{D}}\left[\hat{\boldsymbol{\beta}}\right]. \]
However, exact bias and variance are actually difficult to describe if \(\boldsymbol{x}\) is random and the model is misspecified. For example, \[ \begin{aligned} \mathbb{E}_{\vphantom{}}\left[\hat{\boldsymbol{\beta}}\right] ={}& \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\right] \\={}& \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}_{\vphantom{}}\left[\boldsymbol{Y}\vert \boldsymbol{X}\right]\right]. \end{aligned} \]
Note that \(\boldsymbol{X}\) is random in the preceding expression, since our pairs \((x_n, y_n)\) are IID — otherwise, the relationship between population and sample risk doesn’t make sense. In an introductory linear models class, you might assume that \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}^{*}+ \boldsymbol{\varepsilon}\) for \(\mathbb{E}_{\vphantom{}}\left[\boldsymbol{\varepsilon}| \boldsymbol{X}\right] = \boldsymbol{0}\), so that \(\mathbb{E}_{\vphantom{}}\left[\boldsymbol{Y}\vert \boldsymbol{X}\right] = \boldsymbol{X}\boldsymbol{\beta}^{*}\). If that were the case, then
\[ \begin{aligned} \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\right] ={}& \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{\beta}^{*}+ \boldsymbol{\varepsilon})\right] \\={}& \boldsymbol{\beta}^{*}+ \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}_{\vphantom{}}\left[\boldsymbol{\varepsilon}| \boldsymbol{X}\right]\right] \\={}& \boldsymbol{\beta}^{*}. & \textrm{(Under correct specification)} \end{aligned} \]
However, under misspecification, there is no such simple result — in particular, there is no reason to think there is any \(\boldsymbol{\beta}\) such that \(\mathbb{E}_{\vphantom{}}\left[y- \boldsymbol{\beta}^\intercal\boldsymbol{x}| \boldsymbol{x}\right] = 0\). Nevertheless, the variance term in fact corresponds to an asymptotic variance, and \(\boldsymbol{\beta}^{*}\) to an asymptotic bias.
Bias term
The second term is the “bias” term. Note that the bias term is actually the expected square of the bias at a particular \(\boldsymbol{x}\):
\[ \textrm{Bias term: }\quad \mathbb{E}_{\mathrm{new}}\left[(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})^2\right] \]
This contrasts somewhat with the common definition of statistical bias, which would be
\[ \textrm{Statistical Bias at $\boldsymbol{x}$ }= \mathbb{E}_{\vphantom{}}\left[f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\right]. \]
The average statistical bias can cancel if it is high for some \(\boldsymbol{x}\) and low for others, but since we square the bias before averaging, there can be no such cancelation in the bias term.
Note that we can only decrease the bias term by adding more regressors. This can be seen by undoing the derivation above and re-writing \[ \mathbb{E}_{\mathrm{new}}\left[(f^{\star}(\boldsymbol{x}) - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x})^2\right] = \mathscr{R}(\boldsymbol{\beta}^{*}) - \mathbb{E}_{\mathrm{new}}\left[(y- f^{\star}(\boldsymbol{x}))^2\right]. \]
By making the set of regressors more expressive, we can only decrease the population risk \(\mathscr{R}(\boldsymbol{\beta}^{*})\); since we cannot change the irreducible risk by doing so, we must be decreasing the bias term.
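The following Python sketch illustrates this point with a made-up one-dimensional example: as nested polynomial features are added, the population bias term shrinks monotonically.

```python
import numpy as np

# A sketch (hypothetical one-dimensional example): with nested polynomial
# features, the bias term E[(f*(x) - beta*^T x)^2] can only decrease as the
# feature set grows.
rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=500_000)
f_star = np.sin(2 * x)                    # the optimal prediction E[y | x]

for degree in [1, 2, 3, 5, 9]:
    feats = np.column_stack([x**k for k in range(degree + 1)])  # 1, x, ..., x^degree
    # Projecting f* onto the features gives the same beta* as projecting y,
    # since the residual noise is orthogonal to any function of x.
    beta_star = np.linalg.lstsq(feats, f_star, rcond=None)[0]
    bias_term = np.mean((f_star - feats @ beta_star)**2)
    print(f"degree {degree}: bias term ~ {bias_term:.4f}")
```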
Variance term
In an introductory linear models course, you might have seen an application of the CLT to \(\hat{\boldsymbol{\beta}}\):
\[ \sqrt{N} (\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}) \rightarrow \mathcal{N}\left(\boldsymbol{0}, \sigma^2 M_{xx}^{-1}\right) \quad\textrm{(under correct specification and homoskedasticity)}\quad, \]
where \(\sigma^2\) is the variance of the residual. It turns out that a CLT does hold under misspecification, but with a different variance. We can write
\[ \begin{aligned} \hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}} ={}& \left(\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{\beta}^{*} \\={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n y_n - \boldsymbol{\beta}^{*} \\={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}) + \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right) \boldsymbol{\beta}^{*} - \boldsymbol{\beta}^{*} \\={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}). \end{aligned} \]
By the LLN and CMT,
\[ \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \rightarrow M_{xx}^{-1}. \]
Noting that
\[ \mathbb{E}_{\vphantom{}}\left[ \boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*})\right] = M_{xy} - M_{xx} M_{xx}^{-1}M_{xy} = 0, \]
we can apply the CLT to get
\[ \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}) \rightarrow \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{V}\right) \quad\textrm{where}\quad \varepsilon_n := y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*} \quad\textrm{and}\quad \boldsymbol{V}:= \mathrm{Cov}_{\,}\left(\boldsymbol{x}_n \varepsilon_n\right). \]
Note that, unlike in standard correctly specified regression,
\[ \mathbb{E}_{\vphantom{}}\left[\varepsilon_n \vert \boldsymbol{x}\right] = \mathbb{E}_{\vphantom{}}\left[y_n \vert \boldsymbol{x}\right] - \boldsymbol{x}^\intercal\boldsymbol{\beta}^{*}\ne 0! \]
In general, \(\varepsilon_n\) and \(\boldsymbol{x}\) have some joint distribution. Putting this together,
\[ \sqrt{N}\left(\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}\right) \rightarrow \mathcal{N}\left(\boldsymbol{0}, M_{xx}^{-1} V M_{xx}^{-1}\right). \]
(As an exercise, show that you recover the correct specification covariance when \(\varepsilon_n\) and \(\boldsymbol{x}_n\) are independent.)
Note that the CLT — which is convergence in distribution — is insufficient to conclude that \(\mathrm{Var}_{\,}\left(\sqrt{N} \hat{\boldsymbol{\beta}}\right) \rightarrow M_{xx}^{-1} V M_{xx}^{-1}\). However, if \(\mathrm{Var}_{\,}\left(\sqrt{N} \hat{\boldsymbol{\beta}}\right)\) does converge to something, which is true under stronger assumptions that we do not discuss here, it must converge to \(M_{xx}^{-1} V M_{xx}^{-1}\).
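The sandwich form can be checked numerically. Below is a Python sketch with an assumed misspecified scalar model (the quadratic truth, noise level, and sample sizes are purely illustrative), comparing the Monte Carlo variance of \(\sqrt{N} \hat{\boldsymbol{\beta}}\) with \(M_{xx}^{-1} V M_{xx}^{-1}\).

```python
import numpy as np

# A sketch with an assumed misspecified model (truth y = x^2 + noise, linear fit
# in x with no intercept): compare the Monte Carlo variance of sqrt(N) beta_hat
# with the sandwich formula M_xx^{-1} V M_xx^{-1}, V = Cov(x eps), eps = y - x beta*.
rng = np.random.default_rng(5)
N, reps = 200, 5_000

def draw(n):
    x = rng.uniform(0.0, 2.0, size=n)
    return x, x**2 + rng.normal(scale=0.5, size=n)

# Approximate the population quantities with a large sample (a single scalar regressor).
x_pop, y_pop = draw(2_000_000)
m_xx = np.mean(x_pop**2)
beta_star = np.mean(x_pop * y_pop) / m_xx
V = np.var(x_pop * (y_pop - x_pop * beta_star))
sandwich = V / m_xx**2

beta_hats = np.empty(reps)
for r in range(reps):
    x, y = draw(N)
    beta_hats[r] = np.sum(x * y) / np.sum(x**2)        # OLS with one regressor
print(f"Monte Carlo variance of sqrt(N) beta_hat: {N * np.var(beta_hats):.4f}")
print(f"Sandwich formula:                         {sandwich:.4f}")
```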
Similarly, if \(\overline{\boldsymbol{\beta}}= \mathbb{E}_{\vphantom{}}\left[\hat{\boldsymbol{\beta}}\right]\) converges to something, then we must have \(\sqrt{N}\left(\overline{\boldsymbol{\beta}}- \boldsymbol{\beta}^{*}\right) \rightarrow 0\), since the limiting normal distribution has mean zero. It is in this sense that we are justified in using \(\boldsymbol{\beta}^{*}\) in place of \(\overline{\boldsymbol{\beta}}\) in what we’re calling a “variance” term.
Combining this limit with the form of the variance term, we finally get that \[ \begin{aligned} N \mathbb{E}_{\mathrm{new},\mathcal{D}}\left[({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x})^2\right] \rightarrow {}& \mathrm{trace}\left( M_{xx} M_{xx}^{-1} V M_{xx}^{-1} \right) \\={}& \mathrm{trace}\left( M_{xx}^{-1/2} \mathrm{Cov}_{\,}\left(\boldsymbol{x}\varepsilon\right) M_{xx}^{-1/2} \right) \end{aligned} \]
Analyzing the variance term
Ideally, we would like to know how the variance term depends on \(P\). However, this behavior can be complicated in general. Intuition can be gained from two special cases:
- Where \(y_n\) is mean zero and independent of \(\boldsymbol{x}_n\), and
- Where \(y_n = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n + z_n + \epsilon_n\), for some mean zero, finite–variance \(z_n\) and \(\epsilon_n\) that are independent of \(\boldsymbol{x}_n\) and of each other.
In the former case, we will see that the variance can only increase as you add regressors. In the latter case, including \(z_n\) can either increase or decrease the variance, depending on its own variance.
No signal
First, suppose that \(\mathbb{E}_{\vphantom{}}\left[y\right] = 0\), \(\mathrm{Var}_{\,}\left(y\right) = \sigma^2\), and \(y\) is independent of \(\boldsymbol{x}\). This is an extreme case, where there is no signal whatsoever. In this case, we recover the classical result, since \(M_{xy} = \boldsymbol{0}\), \(\boldsymbol{\beta}^{*}= \boldsymbol{0}\), and
\[ \mathrm{Cov}_{\,}\left(\boldsymbol{x}\varepsilon_n\right) = M_{xx} \sigma^2 \quad\textrm{$\Rightarrow$}\quad N \cdot \textrm{Variance term } \rightarrow \mathrm{trace}\left( M_{xx}^{-1/2} M_{xx} M_{xx}^{-1/2} \sigma^2 \right) = \mathrm{trace}\left({\boldsymbol{I}_{P}}\right) \sigma^2 = P \sigma^2. \]
Effectively, the training variance in \(\boldsymbol{x}_n\) and the predictive variance in \(\boldsymbol{x}_\mathrm{new}\) cancel out, and we are left with
\[ \textrm{Variance term} \approx \frac{P \sigma^2}{N}. \]
We thus see that
- The variance term grows linearly in \(P\) for fixed \(N\),
- The variance term shrinks as \(1/N\) for fixed \(P\), and
- The variance term is larger as the variability of \(y_n\) is larger.
As an exercise you can show that this is what you get for correctly specified homoskedastic linear regression in which you might include some extra, unneeded regressors.
This example is somewhat unsatisfactory, since the bias is not decreasing as \(P\) grows — the bias is optimal with no regressors at all.
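Here is a Python sketch of the no-signal case (the standard normal design, so that \(M_{xx} = {\boldsymbol{I}_{P}}\), and the particular \(N\), \(P\), and \(\sigma\) are assumptions for illustration); the average of the variance term over repeated training sets comes out close to \(P \sigma^2 / N\).

```python
import numpy as np

# A sketch of the no-signal case (assumed x ~ N(0, I), so M_xx = I): the variance
# term trace(M_xx beta_hat beta_hat^T) should average to roughly P * sigma^2 / N.
rng = np.random.default_rng(6)
N, P, sigma, reps = 400, 8, 2.0, 2_000
M_xx = np.eye(P)

vals = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(N, P))
    y = rng.normal(scale=sigma, size=N)               # y independent of x: no signal
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    vals[r] = beta_hat @ M_xx @ beta_hat              # trace(M_xx b b^T) = b^T M_xx b
print(f"average variance term: {vals.mean():.4f}")
print(f"P * sigma^2 / N:       {P * sigma**2 / N:.4f}")
```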
Omitted regressors
Now suppose that we have an omitted regressor, \(z_n\). Without loss of generality we can take \(y_n = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n + z_n + \epsilon_n\), so the omitted coefficient is \(1\) (otherwise we could just rescale the standard deviation of \(z\)). Assume that
- \(\mathbb{E}_{\vphantom{}}\left[z\right] = 0\), \(\mathrm{Var}_{\,}\left(z\right) = \sigma_z^2\),
- \(\mathbb{E}_{\vphantom{}}\left[\epsilon\right] = 0\), \(\mathrm{Var}_{\,}\left(\epsilon\right) = \sigma^2\),
- \(\boldsymbol{x}\), \(z\), \(\epsilon\) are all independent.
Note that \(M_{xy} = M_{xx} \boldsymbol{\beta}^{*}\), so there is no inconsistency in assuming that \(\boldsymbol{\beta}^{*}\) is the truth. We have
\(\varepsilon= y- {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}= z+ \epsilon\),
so
\[ \mathrm{Cov}_{\,}\left(\boldsymbol{x}\varepsilon\right) = M_{xx} (\sigma_z^2 + \sigma^2) \]
and
\[ N \cdot \textrm{Variance term without $z$: } \rightarrow \mathrm{trace}\left( M_{xx}^{-1/2} M_{xx} M_{xx}^{-1/2} (\sigma_z^2 + \sigma^2) \right) = P (\sigma_z^2 + \sigma^2). \]
However, by the same reasoning, if we had included \(z\) in the regression, the augmented regressor \(\tilde{\boldsymbol{x}} := (\boldsymbol{x}^\intercal, z)^\intercal\) would have second moment matrix \(M_{\tilde{x}\tilde{x}}\) and residual \(\epsilon\), so we would have gotten
\[ N \cdot \textrm{Variance term with $z$: } \rightarrow \mathrm{trace}\left( M_{\tilde{x}\tilde{x}}^{-1/2} M_{\tilde{x}\tilde{x}} M_{\tilde{x}\tilde{x}}^{-1/2} \sigma^2 \right) = (P + 1) \sigma^2. \]
Thus, by adding the regressor \(z\), the variance increases if and only if
\[ (P + 1) \sigma^2 \ge P (\sigma_z^2 + \sigma^2) \quad\textrm{$\Leftrightarrow$}\quad 1 + \frac{1}{P} \ge 1 + \frac{\sigma_z^2}{\sigma^2} \quad\textrm{$\Leftrightarrow$}\quad \sigma_z^2 \le \frac{\sigma^2}{P}. \]
Thus, if \(\sigma_z^2\) is large enough, then including \(z\) both decreases the bias (definitionally) and decreases the variance.
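As a quick worked check of this threshold, the following Python snippet plugs a few values of \(\sigma_z^2\) into the two asymptotic variance expressions (the particular \(P\), \(\sigma^2\), and \(\sigma_z^2\) values are arbitrary illustrations).

```python
# A quick numeric check of the omitted-regressor comparison: the asymptotic
# N * (variance term) is P * (sigma_z^2 + sigma^2) without z and (P + 1) * sigma^2
# with z, so the threshold is sigma_z^2 = sigma^2 / P (here 0.2).
P, sigma2 = 5, 1.0
for sigma_z2 in [0.1, 0.5, 1.0]:
    without_z = P * (sigma_z2 + sigma2)
    with_z = (P + 1) * sigma2
    verdict = "increases" if with_z > without_z else "decreases"
    print(f"sigma_z^2 = {sigma_z2:.2f}: without z = {without_z:.2f}, "
          f"with z = {with_z:.2f} -> including z {verdict} the variance")
```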