Homework 3

Stat 154/254: Statistical Machine Learning

1 Infinitely fine step function ULLN fail

In this homework problem, we carefully work through an example of poor generalization in a specific problem.

Let \(x\sim \textrm{Unif}(0,1)\) and \(y\sim\mathcal{N}\left(0,1\right)\) independently of \(x\), and let \((x_n, y_n)\), \(n = 1, \ldots, N\), be IID draws from this joint distribution. For each \(K\), define the step functions

\[ z_{jK}(x) = \mathrm{I}\left((j - 1)/K \le x\le j/K\right) \quad\textrm{for}\quad j=1,\ldots,K. \]

Define the function class

\[ \mathcal{F}= \left\{ f(x; \boldsymbol{\beta}) = \sum_{K=1}^\infty \sum_{j=1}^K \beta_{jK} z_{jK}(x) \textrm{ for some set of }\beta_{jK} \right\}. \]

Note that the vector \(\boldsymbol{\beta}\) may be infinitely long. Finally, define

\[ \hat{\mathscr{R}}(f) := \frac{1}{N} \sum_{n=1}^N(f(x_n) - y_n)^2 \quad\textrm{and}\quad \mathscr{R}(f) = \mathbb{E}_{\vphantom{}}\left[(f(x) - y)^2\right] \]

(a)

Show that \(\mathbb{E}_{\vphantom{}}\left[z_{jK}\right] = \mathbb{E}_{\vphantom{}}\left[z_{jK}^2\right] = 1/K\) and so

\[ \mathbb{E}_{\vphantom{}}\left[f(x; \boldsymbol{\beta})\right] = \sum_{K=1}^\infty \frac{1}{K} \sum_{j=1}^K \beta_{jK}. \]

(b)

Define \(\mathcal{F}_K := \left\{ f(x; \boldsymbol{\gamma}) = \sum_{j=1}^K \gamma_{j} z_{jK}(x) \textrm{ for some set of }\gamma_{j} \right\}\). Show that \(\mathcal{F}_K \subset \mathcal{F}\).

(c)

With probability one, all \(x_n\) are distinct in any dataset of size \(N\). Show that you can take \(K\) large enough that there exists \(f\in \mathcal{F}_K\) such that

\[ \begin{aligned} \hat{\mathscr{R}}(\hat{\boldsymbol{\gamma}}) ={}& \frac{1}{N} \sum_{n=1}^N(f(x_n; \hat{\boldsymbol{\gamma}}) - y_n)^2 ={} 0, \end{aligned} \]

where \(\hat{\boldsymbol{\gamma}}\) is the OLS estimator of \(\boldsymbol{\gamma}\).

(d)

Show that \(\mathscr{R}(f) \ge 1\) for any \(f\).

(e)

Combine the above results to show that \(\mathcal{F}\) does not satisfy a ULLN in the sense that

\[ \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{R}}(f) - \mathscr{R}(f)\right| \ge 1. \]

Solutions

(a)

Since \(x\) is uniform, \(\mathbb{E}_{\vphantom{}}\left[z_{jK}\right] = \mathbb{E}_{\vphantom{}}\left[\mathrm{I}\left((j - 1)/K \le x \le j / K\right)\right] = \frac{j}{K} - \frac{j - 1}{K} = 1/K\). Since the indicator squared equals the indicator, \(\mathbb{E}_{\vphantom{}}\left[z_{jK}^2\right] = \mathbb{E}_{\vphantom{}}\left[z_{jK}\right] = 1/K\) as well. Then

\[ \mathbb{E}_{\vphantom{}}\left[f(x; \boldsymbol{\beta})\right] = \sum_{K=1}^\infty \sum_{j=1}^K \beta_{jK} \mathbb{E}_{\vphantom{}}\left[z_{jK}\right] = \sum_{K=1}^\infty \frac{1}{K} \sum_{j=1}^K \beta_{jK}. \]

(b)

Any element of \(\mathcal{F}_K\) can be formed by setting \(\beta_{jK'} = 0\) for \(K' \ne K\).

(c)

Let \(\Delta = \min_{n \ne m} |x_n - x_m|\). Since the \(x_n\) are distinct, \(\Delta > 0\). Take \(K > 1 / \Delta\); then the \(x_n\) all fall in distinct bins of width \(1/K\), and so \(z_{jK}(x_n)\) is either \(0\) for all \(n\) (if no \(x_n\) falls in bin \(j\)) or \(1\) for exactly one \(n\) (if a single \(x_n\) falls in bin \(j\)).

Writing \(\boldsymbol{Z}_j = (z_{jK}(x_1), \ldots, z_{jK}(x_N))^\intercal\) for the vector of values taken by \(z_{jK}\) on the sample, we thus have \(\boldsymbol{Z}_j^\intercal\boldsymbol{Z}_\ell = 0\) for \(j \ne \ell\), and there are exactly \(N\) nonzero \(\boldsymbol{Z}_j\), each with \(\boldsymbol{Z}_j^\intercal\boldsymbol{Z}_j = 1\). This means the \(N \times K\) matrix \(\boldsymbol{Z} = (\boldsymbol{Z}_1 \cdots \boldsymbol{Z}_{K})\) has rank \(N\), so the projection of \(\boldsymbol{Y}\) onto the column span of \(\boldsymbol{Z}\) is \(\boldsymbol{Y}\) itself. The OLS fitted values therefore equal \(\boldsymbol{Y}\) exactly, and \(\hat{\mathscr{R}}(\hat{\boldsymbol{\gamma}}) = 0\).

(d)

Since \(y\) and \(x\) are in fact independent, \[ \begin{aligned} \mathscr{R}(f) ={}& \mathbb{E}_{\vphantom{}}\left[(y- f(x))^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[y^2\right] - 2 \mathbb{E}_{\vphantom{}}\left[y\right] \mathbb{E}_{\vphantom{}}\left[f(x)\right] + \mathbb{E}_{\vphantom{}}\left[f(x)^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[y^2\right] + \mathbb{E}_{\vphantom{}}\left[f(x)^2\right] \\\ge{}& \mathbb{E}_{\vphantom{}}\left[y^2\right] \\={}& 1. \end{aligned} \]

(e)

\[ \begin{aligned} \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{R}}(f) - \mathscr{R}(f)\right| \ge{}& \left|\hat{\mathscr{R}}(\hat{f}) - \mathscr{R}(\hat{f})\right| & \textrm{(by (b))} \\={}& \left|\mathscr{R}(\hat{f})\right| & \textrm{(by (c))} \\\ge{}& 1 & \textrm{(by (d))}. \end{aligned} \]
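To make the failure concrete, here is a minimal simulation sketch (not part of the assignment; the constants and seed are arbitrary choices): with \(K\) much larger than \(1/\Delta\), the per-bin OLS fit interpolates the training data, while a Monte Carlo estimate of the population risk stays near \(\mathbb{E}_{\vphantom{}}\left[y^2\right] = 1\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 10**8                      # K far exceeds 1 / (minimum gap) with high probability
x, y = rng.uniform(size=N), rng.normal(size=N)

# OLS on the indicator design is the per-bin mean of y; here each bin almost
# surely contains a single point, so the fit interpolates the training data.
bins = np.floor(x * K).astype(int)    # bin index j with j/K <= x_n < (j + 1)/K
gamma_hat = {}
for b, yn in zip(bins, y):
    gamma_hat.setdefault(b, []).append(yn)
gamma_hat = {b: np.mean(vals) for b, vals in gamma_hat.items()}

def f_hat(x_new):
    return np.array([gamma_hat.get(int(np.floor(xi * K)), 0.0) for xi in x_new])

print("empirical risk:", np.mean((f_hat(x) - y) ** 2))            # essentially 0
x_new, y_new = rng.uniform(size=200_000), rng.normal(size=200_000)
print("Monte Carlo risk:", np.mean((f_hat(x_new) - y_new) ** 2))  # close to 1
```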

2 Low estimation error with poor generalization error

In this problem, we emphasize how low generalization error is sufficient but not necessary for low estimation error.

Recall our decomposition of estimation error,

\[ \underbrace{\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f})}_{\textrm{Estimation error}} ={} \underbrace{\mathscr{R}(\hat{f}) - \hat{\mathscr{R}}(\hat{f})}_{\textrm{"Generalization error"}} + \underbrace{\hat{\mathscr{R}}(\hat{f}) - \hat{\mathscr{R}}(\overline{f})}_{\textrm{$\le 0$}} + \underbrace{\hat{\mathscr{R}}(\overline{f}) - \mathscr{R}(\overline{f})}_{\textrm{LLN term}}. \]

Let us define a strange–looking estimator that has poor generalization error, but still provides low estimation error. For a fixed \(\delta > 0\), define

\[ \hat{f}(\boldsymbol{x}; \delta) = \overline{f}(\boldsymbol{x}) + \sum_{n=1}^N\mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right) (y_n - \overline{f}(\boldsymbol{x}_n)). \]

Assume that \(\hat{\mathscr{R}}(f) = 0\) if \(f(\boldsymbol{x}_n) = y_n\) for every \(n\) in the training set. That is, the empirical risk is zero if \(f\) is able to interpolate the training data.

Assume that \(\boldsymbol{x}\) has a continuous distribution, so that \(\mathbb{P}_{\,}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_0\right\Vert \le C\right) \rightarrow 0\) as \(C \rightarrow 0\) for any \(\boldsymbol{x}_0\).

(a) Show that there exists a \(\delta > 0\) sufficiently small so that \(\hat{\mathscr{R}}(\hat{f}(\cdot; \delta)) = 0\).

(b) Show that \(\mathbb{E}_{\boldsymbol{x}}\left[\mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right)\right] \rightarrow 0\) as \(\delta \rightarrow 0\) for every \(n\).

(c) Show that \(\mathscr{R}(\hat{f}(\boldsymbol{x}; \delta)) \rightarrow \mathscr{R}(\overline{f})\) as \(\delta \rightarrow 0\).

(d) As \(\delta \rightarrow 0\), what happens to the generalization error, \(\mathscr{R}(\hat{f}) - \hat{\mathscr{R}}(\hat{f})\)?

(e) As \(\delta \rightarrow 0\), what happens to the “negative term,” \(\hat{\mathscr{R}}(\hat{f}) - \hat{\mathscr{R}}(\overline{f})\)?

(f) As \(\delta \rightarrow 0\), what happens to the estimation error, \(\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f})\)?

(g) Describe in your own words how it is that \(\hat{f}(\cdot; \delta)\) fails to have low generalization error, and yet has low estimation error.

(h) Describe a sense in which \(\hat{f}(\cdot; \delta)\) “overfits” the training data for small \(\delta\), and a sense in which it does not.

Solutions

(a) Since the \(\boldsymbol{x}_n\) are all distinct with probability one, we can set \(\delta\) small enough that \(\left\Vert\boldsymbol{x}_n - \boldsymbol{x}_m\right\Vert > \delta\) for each \(n \ne m\), so that at each training point only the \(m = n\) term of the sum is nonzero. Then

\[ \begin{aligned} \hat{\mathscr{R}}(\hat{f}(\cdot; \delta)) ={}& \frac{1}{N} \sum_{n=1}^N\left(y_n - \overline{f}(\boldsymbol{x}_n) - \sum_{m=1}^N \mathrm{I}\left(\left\Vert\boldsymbol{x}_n - \boldsymbol{x}_m\right\Vert \le \delta\right) (y_m - \overline{f}(\boldsymbol{x}_m))\right)^2 \\={}& \frac{1}{N} \sum_{n=1}^N\left(y_n - \overline{f}(\boldsymbol{x}_n) - \left(y_n - \overline{f}(\boldsymbol{x}_n)\right)\right)^2 \\={}& \frac{1}{N} \sum_{n=1}^N0^2 \\={}& 0. \end{aligned} \]

(b)

\[ \begin{aligned} \mathbb{E}_{\boldsymbol{x}}\left[\mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right)\right] ={}& \mathbb{P}_{\,}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right) \rightarrow 0 \end{aligned} \] since we assumed that \(\boldsymbol{x}\) has a continuous distribution (apply the assumption with \(C = \delta\) and \(\boldsymbol{x}_0 = \boldsymbol{x}_n\)).

(c)

Write

\[ \begin{aligned} \left(y- \hat{f}(\boldsymbol{x}; \delta)\right)^2 ={}& \left(y- \left( \overline{f}(\boldsymbol{x}) + \sum_{n=1}^N \mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right) (y_n - \overline{f}(\boldsymbol{x}_n)) \right)\right)^2 \\={}& \left(y- \overline{f}(\boldsymbol{x}) - \sum_{n=1}^N \mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right) (y_n - \overline{f}(\boldsymbol{x}_n)) \right)^2 \\={}& \left(y- \overline{f}(\boldsymbol{x})\right)^2 + \left( \sum_{n=1}^N \mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right) (y_n - \overline{f}(\boldsymbol{x}_n)) \right)^2 + \\{}& 2 \left(y- \overline{f}(\boldsymbol{x})\right) \left( \sum_{n=1}^N \mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right) (y_n - \overline{f}(\boldsymbol{x}_n)) \right) \end{aligned} \]

Using (b), as \(\delta \rightarrow 0\),

\[ \begin{aligned} \mathbb{E}_{\boldsymbol{x}}\left[\mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right)\right] \rightarrow 0 \quad\textrm{and}\quad \mathbb{E}_{\boldsymbol{x}}\left[\mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_n\right\Vert \le \delta\right)\mathrm{I}\left(\left\Vert\boldsymbol{x}- \boldsymbol{x}_m\right\Vert \le \delta\right)\right] \rightarrow 0. \end{aligned} \] So the expected values of the sums go to zero as \(\delta \rightarrow 0\), and

\[ \mathscr{R}(\hat{f}(\cdot; \delta)) = \mathbb{E}_{y,\boldsymbol{x}}\left[\left(y- \hat{f}(\boldsymbol{x}; \delta)\right)^2\right] \rightarrow \mathbb{E}_{y,\boldsymbol{x}}\left[\left(y- \overline{f}(\boldsymbol{x})\right)^2\right] = \mathscr{R}(\overline{f}(\cdot)) \quad\textrm{ as }\delta \rightarrow 0. \]

(d) By (a) and (c), \(\mathscr{R}(\hat{f}) - \hat{\mathscr{R}}(\hat{f}) \rightarrow \mathscr{R}(\overline{f})\). In other words, the generalization error does not vanish, since the estimator overfits the data for small \(\delta\).

(e) By (a), \(\hat{\mathscr{R}}(\hat{f}) - \hat{\mathscr{R}}(\overline{f}) \rightarrow - \hat{\mathscr{R}}(\overline{f})\). In other words, the “negative term” converges to minus the empirical risk of \(\overline{f}\), which by the LLN approximately cancels the generalization error from (d).

(f) By (c), \(\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f}) \rightarrow \mathscr{R}(\overline{f}) - \mathscr{R}(\overline{f}) = 0\). In other words, \(\hat{f}\) achieves optimal risk as \(\delta \rightarrow 0\).

(g) The function \(\hat{f}(\cdot; \delta)\) is very “spiky” for small \(\delta\) — it overfits the data, but only precisely at the observed datapoints.

(h) It overfits the data in the sense that it does not have vanishing generalization error, but does not overfit the data in the sense that it achieves nearly optimal risk for very small \(\delta\).
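The following small simulation sketch illustrates (a)–(f) in an assumed toy setup (not specified in the problem): \(x \sim \textrm{Unif}(0,1)\), \(y = \sin(2\pi x) + \mathcal{N}(0, 0.25)\), and we pretend \(\overline{f}(x) = \sin(2\pi x)\). As \(\delta\) shrinks, the empirical risk of the spiky estimator drops to zero while its Monte Carlo risk approaches \(\mathscr{R}(\overline{f}) = 0.25\).

```python
import numpy as np

rng = np.random.default_rng(1)

def fbar(x):
    # Assumed best-in-class function for this toy setup.
    return np.sin(2 * np.pi * x)

N = 20
x_tr = rng.uniform(size=N)
y_tr = fbar(x_tr) + 0.5 * rng.normal(size=N)

def f_hat(x, delta):
    # fbar(x) plus a bump of height (y_n - fbar(x_n)) within delta of each training point.
    out = fbar(x).copy()
    for xn, yn in zip(x_tr, y_tr):
        out = out + (np.abs(x - xn) <= delta) * (yn - fbar(xn))
    return out

x_te = rng.uniform(size=200_000)
y_te = fbar(x_te) + 0.5 * rng.normal(size=200_000)
for delta in [1e-1, 1e-2, 1e-3, 1e-5]:
    emp_risk = np.mean((y_tr - f_hat(x_tr, delta)) ** 2)
    mc_risk = np.mean((y_te - f_hat(x_te, delta)) ** 2)
    print(f"delta={delta:.0e}  empirical risk={emp_risk:.3f}  risk={mc_risk:.3f}")
# The empirical risk drops to (essentially) 0 while the risk approaches
# R(fbar) = 0.25, so the estimation error vanishes even though the
# generalization gap stays near R(fbar).
```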

3 Leave–one–out with an estimator that depends strongly on N

This example is designed to strengthen the intuition about why leave–K–out estimators are biased.

Consider \(y_n \overset{\mathrm{IID}}{\sim}\mathcal{N}\left(5, 1\right)\) and squared loss: \(\mathscr{L}(\hat{y}, y) = (\hat{y}- y)^2\), and \(\mathscr{R}(f) = \mathbb{E}_{\vphantom{}}\left[(f- y)^2\right]\). We will have no regressors, so our estimating functions are just constants.

We will take \(N = 100\) for this problem, and let \(\boldsymbol{Y}_{-n}\) denote the leave–one–out vector, \(\boldsymbol{Y}_{-n} = (y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N)\).

An estimation procedure \(F\) takes in a dataset \(\boldsymbol{Y}= (y_1, \ldots, y_N)\) and returns a constant estimate \(\hat{y}= F(\boldsymbol{Y}) \in \mathbb{R}^{}\). For this, we consider the (silly) estimator

\[ \hat{y}= F(\boldsymbol{Y}) = \begin{cases} 5 & \textrm{if the length of }\boldsymbol{Y}\ge 100 \\ 0 & \textrm{if the length of }\boldsymbol{Y}< 100. \end{cases} \]

(a)

What is the optimal estimate, \(f^{\star}= \underset{f}{\mathrm{argmin}}\, \mathscr{R}(f)\)?

(b)

What is \(\mathscr{R}(F(\boldsymbol{Y}))\)? What is \(\mathscr{R}(F(\boldsymbol{Y}_{-n}))\)?

(c)

Recall the leave–one–out cross–validation (CV) estimator

\[ \hat{\mathscr{R}}_{CV}(\boldsymbol{Y}) := \frac{1}{N} \sum_{n=1}^N\mathscr{L}(F(\boldsymbol{Y}_{-n}), y_n). \]

What is \(\mathbb{E}_{\boldsymbol{Y}}\left[\hat{\mathscr{R}}_{CV}(\boldsymbol{Y})\right]\)? Is the leave–one–out CV estimator an unbiased estimate of \(\mathscr{R}(F(\boldsymbol{Y}))\)?

(d)

Connect this toy example with the intuition that leave–\(K\)–out CV tends to be more biased for larger \(K\).

Solutions

(a)

\(f^{\star}= \mathbb{E}_{\vphantom{}}\left[y\right] = 5\).

(b)

Here, \(F(\boldsymbol{Y}) = 5\), so \(\mathscr{R}(F(\boldsymbol{Y})) = \mathbb{E}_{\vphantom{}}\left[(y- 5)^2\right] = 1\).

In contrast, \(F(\boldsymbol{Y}_{-n}) = 0\), so \(\mathscr{R}(F(\boldsymbol{Y}_{-n})) = \mathbb{E}_{\vphantom{}}\left[y^2\right] = 1 + 5^2 = 26\).

(c)

Using (b),

\[ \hat{\mathscr{R}}_{CV}(\boldsymbol{Y}) := \frac{1}{N} \sum_{n=1}^N\mathscr{L}(F(\boldsymbol{Y}_{-n}), y_n) = \frac{1}{N} \sum_{n=1}^N(0 - y_n)^2 = \frac{1}{N} \sum_{n=1}^Ny_n^2, \]

and

\(\mathbb{E}_{\boldsymbol{Y}}\left[\hat{\mathscr{R}}_{CV}(\boldsymbol{Y})\right] = \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\vphantom{}}\left[y_n^2\right] = 26\). So CV is not an unbiased estimate of \(\mathscr{R}(F(\boldsymbol{Y})) = 1\).
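A quick numerical sketch of (b) and (c) (using numpy; the number of replications is arbitrary): the leave–one–out CV estimate concentrates near \(26\) even though the full-data estimator has risk \(1\).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100

def F(Y):
    # The "silly" estimator: returns 5 on full-size datasets, 0 otherwise.
    return 5.0 if len(Y) >= 100 else 0.0

cv_estimates = []
for _ in range(2000):
    Y = rng.normal(loc=5.0, scale=1.0, size=N)
    loo_losses = [(F(np.delete(Y, n)) - Y[n]) ** 2 for n in range(N)]
    cv_estimates.append(np.mean(loo_losses))
print(np.mean(cv_estimates))   # approximately 26, far from R(F(Y)) = 1
```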

(d)

CV estimates the loss of an estimation procedure that uses fewer datapoints than the original estimator. If the estimator depends strongly on the number of datapoints you use, as it does artificially in this case, then leaving data out can severely bias your CV estimate of the loss.

4 Leave–one–out CV for linear regression

This homework problem derives a closed-form expression for the effect of leave–one–out CV for linear regression.

We will use the following result, known as the Woodbury formula (but also by many other names, including the Sherman-Morrison-Woodbury formula). Let \(A\) denote an invertible square matrix, and let \(\boldsymbol{u}\) and \(\boldsymbol{v}\) be vectors whose length matches the dimension of \(A\). Then

\[ (A + \boldsymbol{u}\boldsymbol{v}^\intercal)^{-1} ={} A^{-1} - \frac{A^{-1} \boldsymbol{u}\boldsymbol{v}^\intercal A^{-1}}{1 + \boldsymbol{v}^\intercal A^{-1} \boldsymbol{u}}. \]

We will also use the definition of a “leverage score” from lecture \(h_n := \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n\). Note that \(h_n = (\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal)_{nn}\) is the \(n\)–th diagonal entry of the projection matrix \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\).

Let \(\hat{\boldsymbol{\beta}}_{-n}\) denote the OLS estimate of \(\boldsymbol{\beta}\) with datapoint \(n\) left out. Similarly, let \(\boldsymbol{X}_{-n}\) denote the \(\boldsymbol{X}\) matrix with row \(n\) left out, and \(\boldsymbol{Y}_{-n}\) denote the \(\boldsymbol{Y}\) vector with entry \(n\) left out.

(a)

Prove that

\[ \hat{\boldsymbol{\beta}}_{-n} = (\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n})^{-1} \boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n} = (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} (\boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n) \]

(b)

Using the Woodbury formula, derive the following expression: \[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - h_n} \]

(c)

Combine (a) and (b) to derive the following explicit expression for \(\hat{\boldsymbol{\beta}}_{-n}\):

\[ \hat{\boldsymbol{\beta}}_{-n} = \hat{\boldsymbol{\beta}}- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1}{1 - h_n} \hat{\varepsilon}_n, \]

where \(\hat{\varepsilon}_n := y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}\) is the residual from the full regression.

(d)

Using (c), derive the following explicit expression for the leave-one-out error on the \(n\)–th observation:

\[ y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n} = \frac{\hat{\varepsilon}_n}{1 - h_n}. \]

(e)

Using (d), prove that the leave–one–out CV estimator of \(\mathscr{R}(\hat{\boldsymbol{\beta}})\) is given by

\[ \frac{1}{N} \sum_{n=1}^N(y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n})^2 = \frac{1}{N} \sum_{n=1}^N\frac{\hat{\varepsilon}_n^2}{(1 - h_n)^2}, \]

where \(\hat{\varepsilon}_n = y_n - \hat{y}_n\) is the residual from the full regression without leaving any data out. Using this formula, leave–one–out CV can be computed using only the original regression and \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) rather than doing \(N\) separate matrix inversions.
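As a numerical sanity check (a sketch, not required by the problem; the data-generating choices and seed below are arbitrary), brute-force leave–one–out CV agrees with the residual-and-leverage formula:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 40, 3
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat                       # full-data residuals eps_n
h = np.einsum("np,pq,nq->n", X, XtX_inv, X)    # leverage scores h_n

# Brute force: refit the regression with each row left out.
cv_brute = np.mean([
    (Y[n] - X[n] @ np.linalg.lstsq(np.delete(X, n, axis=0),
                                   np.delete(Y, n), rcond=None)[0]) ** 2
    for n in range(N)
])
cv_formula = np.mean((resid / (1 - h)) ** 2)   # closed form from (e)
print(cv_brute, cv_formula)                    # agree up to floating point error
```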

Solutions

(a)

This follows from \(\boldsymbol{X}^\intercal\boldsymbol{X}= \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \sum_{n=1}^N\boldsymbol{x}_n y_n\): removing datapoint \(n\) simply removes the \(n\)–th term from each sum, so \(\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n} = \boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) and \(\boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n} = \boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n\).

(b)

Direct application of the formula with \(\boldsymbol{u}= \boldsymbol{x}_n\) and \(\boldsymbol{v}= -\boldsymbol{x}_n\) gives \[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n}. \]

Then recognize the leverage score.

(c)

We have

\[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \hat{\beta}+ \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\hat{\beta}}{1 - h_n}. \]

and

\[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \boldsymbol{x}_n y_n = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n y_n + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n h_n }{1 - h_n} y_n. \]

Combining,

\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{-n} ={}& \hat{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{\boldsymbol{x}_n^\intercal\hat{\beta}}{1 - h_n} - y_n - \frac{h_n }{1 - h_n} y_n \right) \\={}& \hat{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{1 }{1 - h_n} \hat{y}_n - \left(1 + \frac{h_n }{1 - h_n} \right)y_n \right) \\={}& \hat{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{1 }{1 - h_n} \hat{y}_n - \frac{1 }{1 - h_n} y_n \right) \\={}& \hat{\beta}- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1 }{1 - h_n} \hat{\varepsilon}_n \end{aligned} \]

(d)

\[ \begin{aligned} y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n} ={}& y_n - \boldsymbol{x}_n^\intercal\hat{\beta}+ \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1 }{1 - h_n} \hat{\varepsilon}_n \\={}& y_n - \hat{y}_n + \frac{h_n }{1 - h_n} \hat{\varepsilon}_n \\={}& \left( 1 + \frac{h_n }{1 - h_n} \right)\hat{\varepsilon}_n \\={}& \frac{\hat{\varepsilon}_n }{1 - h_n} \end{aligned} \]

(e)

Plug the expression from (d) into the definition of the leave–one–out CV estimator and average over \(n\).

5 VC dimension of linear classifiers

Consider the class of \(P\)–dimensional linear binary classifiers, \(\mathcal{F}= \{ \mathrm{I}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}\ge 0\right) \textrm{ for any }\boldsymbol{\beta}\in \mathbb{R}^{P}\}\). In this problem, we will show that the VC dimension of \(\mathcal{F}\) is \(P\). This in turn yields a generalization bound for logistic regression with zero–one loss.

For \(N\) points, let \(\boldsymbol{X}\) denote the \(N \times P\) matrix with \(\boldsymbol{x}_n^\intercal\) in row \(n\). In this problem, the \(\boldsymbol{x}_n\) are not given to us — we will be choosing them to find sets that can or cannot be shattered.

First, in (a)–(c) we show that we can shatter some set of size \(P\), so that the VC dimension is at least \(P\). Then, in (d)–(f) we show that we cannot shatter any set of size \(P + 1\), so that the VC dimension is at most \(P\). It follows that the VC dimension is exactly \(P\).

(a)

Suppose we are given \(N\) points, \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\). For any \(y_1, \ldots, y_N\), define \(z_n = 2 y_n - 1\). Show that \(y_n\) is correctly classified if we can find \(\hat{\boldsymbol{\beta}}\) such that \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n = z_n\).

(b)

When \(N=P\), identify an \(\boldsymbol{X}\) so that, for any \(\boldsymbol{Z}\in \mathbb{R}^{P}\), you can find a \(\hat{\boldsymbol{\beta}}\) such that \(\boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{Z}\). Hint: There are actually an infinite number of \(\boldsymbol{X}\) that will work, but also some that won’t.

(c)

Use (a) and the \(\boldsymbol{X}\) you identified in (b) to conclude that there exists a set of size \(P\) points, \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_P\), that you can shatter with classifiers in \(\mathcal{F}\).

(d)

Now, suppose that \(N = P + 1\), and you are given \(N\) points \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\). Let the rank of \(\boldsymbol{X}\) be \(Q\), and note that \(Q \le P\). Assume without loss of generality that the first \(Q\) rows of \(\boldsymbol{X}\) are linearly independent (if they’re not, just re-order the rows of \(\boldsymbol{X}\)). Show that you can find \(\boldsymbol{\gamma}\) such that \(\boldsymbol{x}_{Q+1} = \sum_{n=1}^Q \gamma_n \boldsymbol{x}_{n}\). That is, show that the \(Q + 1\)–th row is a linear combination of the first \(Q\) rows.

(e)

Continuing (d), suppose we set \(y_n = \mathrm{I}\left(\gamma_n > 0\right)\) for \(n=1,\ldots,Q\), and suppose that \(\hat{\boldsymbol{\beta}}\) correctly classifies these first \(Q\) points. Using (d), show that \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_{Q+1} \ge 0\).

(f)

Using (e), define a set of \(P+1\) points that cannot be shattered.

(g)

Putting your results together, what is the VC dimension of \(\mathcal{F}\)?

(h)

What can you conclude about the generalization error of linear classifiers with zero–one loss? Specifically, what happens to \[ \sup_{f \in \mathcal{F}} \left|\frac{1}{N} \sum_{n=1}^N\mathrm{I}\left(f(\boldsymbol{x}_n) \ne y_n\right) - \mathbb{P}_{\,}\left(f(\boldsymbol{x}) \ne y\right)\right| \]

as \(N \rightarrow \infty\)?

Solutions

(a)

As defined, \(z_n = 1\) when \(y_n = 1\) and \(z_n = -1\) when \(y_n = 0\). So if \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n = z_n\), then \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n\) has the correct sign to classify \(y_n\). Of course this is a sufficient, not a necessary, condition: it may be that \(y_n\) is correctly classified even though \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \ne z_n\).

(b)

Any full–rank \(\boldsymbol{X}\) will do, since we can then take \(\hat{\boldsymbol{\beta}}= \boldsymbol{X}^{-1} \boldsymbol{Z}\). This is of course impossible if \(N > P\).

(c)

Take a full–rank \(\boldsymbol{X}\), take any \(\boldsymbol{Y}\in \{0,1\}^P\), take \(\boldsymbol{Z}= 2 \boldsymbol{Y}- 1\), and \(\hat{\boldsymbol{\beta}}= \boldsymbol{X}^{-1}\boldsymbol{Z}\). Then \(\hat{\boldsymbol{\beta}}\) correctly classifies \(\boldsymbol{Y}\). Since this is possible for any of the \(2^P\) possible values of \(\boldsymbol{Y}\), the set given by \(\boldsymbol{X}\) can be shattered.
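As a small illustrative sketch (not required for the proof), taking \(\boldsymbol{X}\) to be the identity, which is one of the invertible choices that works, every label pattern in \(\{0,1\}^P\) is realized by some linear classifier:

```python
import itertools
import numpy as np

P = 4
X = np.eye(P)                            # any invertible P x P matrix works here
for Y in itertools.product([0, 1], repeat=P):
    Z = 2 * np.array(Y) - 1              # map labels {0, 1} to {-1, +1}
    beta_hat = np.linalg.solve(X, Z)     # solve X beta = Z
    preds = (X @ beta_hat >= 0).astype(int)
    assert np.array_equal(preds, np.array(Y))
print("all", 2 ** P, "label patterns realized: these P points are shattered")
```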

(d)

Since the first \(Q\) rows of \(\boldsymbol{X}\) are linearly independent and \(\boldsymbol{X}\) has rank \(Q\), the first \(Q\) rows span the row space of \(\boldsymbol{X}\). So any row, including \(\boldsymbol{x}_{Q+1}\), can be written as a linear combination of the first \(Q\) rows.

Note: The problem is a little ambiguous as originally stated. The whole point of this proof is to show that no set of \(P + 1\) points can be shattered, and the only thing that matters is that one regressor is a linear combination of the others. However, that vector need not necessarily be the last row. Give full credit for any variants that conclude that at least one regressor needs to be a linear combination of the others.

(e)

Plugging in,

\[ \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_{Q+1} = \sum_{n=1}^Q \gamma_n \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_{n}. \]

Each term in the sum is nonnegative: if \(\gamma_n > 0\), then \(y_n = 1\) and correct classification requires \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \ge 0\); if \(\gamma_n \le 0\), then \(y_n = 0\) and correct classification requires \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n < 0\). Either way, \(\gamma_n \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \ge 0\), and so \(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_{Q+1} \ge 0\).

(f)

Take \(y_n = \mathrm{I}\left(\gamma_n > 0\right)\) for \(n =1,\ldots,Q\), set \(y_{Q+1} = 0\), and choose the remaining labels arbitrarily. By (e), any classifier in \(\mathcal{F}\) that correctly classifies the first \(Q\) points must assign \(\boldsymbol{x}_{Q+1}\) the label \(1\), so this label pattern cannot be produced, and no set of size \(P+1\) can be shattered.

Note: Again regarding the ambiguity of (d), the only part of this that matters is that one of the regressors is a linear combination of the others, i.e., that \(\boldsymbol{X}\) is rank deficient.

(g)

By (c), there is a set of \(P\) points that can be shattered, so the VC dimension is at least \(P\). By (f), no set of \(P + 1\) points can be shattered, so the VC dimension is at most \(P\). The VC dimension of \(\mathcal{F}\) is therefore exactly \(P\).

(h)

Since the VC dimension is finite, a ULLN applies, and the supremum goes to zero as \(N \rightarrow \infty\).

Note: Give full credit for noticing that the VC dimension is finite and that a ULLN applies. Formally, the lecture notes show only that \[ \sup_{f \in \mathcal{F}} \left|\frac{1}{N} \sum_{n=1}^N\mathrm{I}\left(f(\boldsymbol{x}_n) > 0\right) - \mathbb{P}_{\,}\left(f(\boldsymbol{x}) > 0\right)\right| \rightarrow 0. \]

(See the Ed discussion on this point.) There is no need for students to prove the additional step that a ULLN holds for the zero–one loss, though it does; see the next problem.

6 VC dimension of zero–one loss

Suppose we have a function class \(\mathcal{F}\) of classifiers that has VC dimension \(P < \infty\). In this problem, we extend the ULLN results from class to zero–one loss. Let \(\hat{f}\in \mathcal{F}\) be a classifier estimated from the observed data, \((\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)\).

(a)

Given that the VC dimension is finite, what can you conclude about \(\left|\frac{1}{N} \sum_{n=1}^N\hat{f}(\boldsymbol{x}_n) - \mathbb{E}_{\vphantom{}}\left[\hat{f}(\boldsymbol{x})\right]\right|\) as \(N \rightarrow \infty\)?

Explain in intuitive terms what this expression means. (Note that \(y_n\) occurs in this expression only through \(\hat{f}\), so this expression is not estimating the classification error.)

(b)

Note that \(\mathcal{F}\) is a class of functions on \(\boldsymbol{x}\). Define a new function class \(\mathcal{F}'\) consisting of functions defined on \((\boldsymbol{x}, y)\) as follows:

\[ \mathcal{F}' = \{ f'(\boldsymbol{x}, y) = \mathrm{I}\left( f(\boldsymbol{x}) \ne y\right) \textrm{ for } f\in \mathcal{F}\}. \]

Show carefully that if \(\mathcal{F}\) is able to shatter a set of size \(P\) of values \(\{ \boldsymbol{x}_1, \ldots, \boldsymbol{x}_P \}\) then \(\mathcal{F}'\) is able to shatter a set of size \(P\) of values \(\{ (\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_P, y_P) \}\).

(c)

In the setting of (b), show carefully that if \(\mathcal{F}'\) is able to shatter a set of size \(P\) of values \(\{ (\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_P, y_P) \}\), then \(\mathcal{F}\) is able to shatter a set of size \(P\) of values \(\{ \boldsymbol{x}_1, \ldots, \boldsymbol{x}_P \}\).

(d)

Using (b) and (c), what can you conclude about the VC dimension of \(\mathcal{F}'\)?

(e)

What can you conclude about the generalization error of linear classifiers with zero–one loss? Specifically, what happens to \[ \sup_{f \in \mathcal{F}} \left|\frac{1}{N} \sum_{n=1}^N\mathrm{I}\left(f(\boldsymbol{x}_n) \ne y_n\right) - \mathbb{P}_{\,}\left(f(\boldsymbol{x}) \ne y\right)\right| \]

as \(N \rightarrow \infty\)?

(f)

Let \(\hat{\boldsymbol{\beta}}\) denote the estimate from logistic regression (which was trained with logistic loss, not the zero–one loss), and consider the binary classifier \(f(\boldsymbol{x}; \boldsymbol{\beta}) = \mathrm{I}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}\ge 0\right)\). What can you conclude about

\[ \left|\frac{1}{N} \sum_{n=1}^N\mathrm{I}\left(f(\boldsymbol{x}_n; \hat{\boldsymbol{\beta}}) \ne y_n\right) - \mathbb{P}_{\,}\left(f(\boldsymbol{x}; \hat{\boldsymbol{\beta}}) \ne y\right)\right| \]

as \(N \rightarrow \infty\)?

Solutions

(a)

This means that

\[ \left|\frac{1}{N} \sum_{n=1}^N\hat{f}(\boldsymbol{x}_n) - \mathbb{E}_{\vphantom{}}\left[\hat{f}(\boldsymbol{x})\right]\right| \le \sup_{f\in \mathcal{F}} \left|\frac{1}{N} \sum_{n=1}^Nf(\boldsymbol{x}_n) - \mathbb{E}_{\vphantom{}}\left[f(\boldsymbol{x})\right]\right| \rightarrow 0, \] where the supremum going to zero follows from the ULLN for finite VC dimension classes as stated in lecture.

This means that the sample average prediction converges to the expected prediction. But it says nothing about the errors, since the \(y_n\) do not occur in the expression.

(b)

We need to show that we can shatter some set of \(P\) points \((x_n, y_n)\). That is, we need to show that there exists a set of points such that we can find a function \(f' \in \mathcal{F}'\) that produces a given length–\(P\) binary vector \(v' \in \{0, 1\}^P\), no matter what \(v'\) is.

Take \(x_1, \ldots, x_P\) to be a set of \(P\) points that can be shattered by \(\mathcal{F}\); such a set exists because \(\mathcal{F}\) has VC dimension \(P\). Take the labels \(y_n = 0\) for all \(n\). Then

\[ f'(x_n, y_n) = f'(x_n, 0) = \mathrm{I}\left(f(x_n) \ne 0\right) = \mathrm{I}\left(f(x_n) = 1\right). \]

That is, \(f'(x_n, y_n) = f(x_n)\). Since for any target vector \(v\) there exists an \(f \in \mathcal{F}\) producing \(v\) on the \(x_n\), the corresponding \(f'\) produces \(v\) on the \((x_n, y_n)\), so \(\mathcal{F}'\) shatters this set.

(c)

Suppose \(\mathcal{F}'\) can shatter some set \(\{(x_1, y_1), \ldots, (x_P, y_P)\}\). Recall that

\[ f'(x_n, y_n) = \mathrm{I}\left(f(x_n) \ne y_n\right). \]

We claim that \(\mathcal{F}\) can then shatter the set \(\{x_1, \ldots, x_P\}\). Suppose we need to produce \(f\) such that \((f(x_1) , \ldots, f(x_P)) = v\). Set \(v'_n = v_n\) if \(y_n = 0\) and \(v'_n = 1 - v_n\) if \(y_n = 1\). Since \(\mathcal{F}'\) shatters the set, there is an \(f'\) that produces \(v'\), and the corresponding \(f\) produces \(v\).
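For binary labels, \(\mathrm{I}\left(f(x) \ne y\right)\) is just the XOR of \(f(x)\) and \(y\), so the constructions in (b) and (c) amount to flipping the target pattern at the coordinates where \(y_n = 1\). A tiny sketch (with arbitrary illustrative labels):

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1])          # arbitrary fixed labels at the P points
v_target = np.array([1, 1, 0, 0, 1])   # pattern we want f' to produce

v = v_target ^ y         # pattern f must produce on the x's, as in (c)
v_prime = v ^ y          # pattern f' then produces on the (x, y)'s, as in (b)
assert np.array_equal(v_prime, v_target)
print("f pattern:", v, "-> f' pattern:", v_prime)
```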

(d)

We conclude that \(\mathcal{F}'\) has VC dimension \(P\).

(e)

Since we showed that linear classifiers have finite VC dimension, the corresponding zero–one loss has finite VC dimension by (d), and so \[ \sup_{f \in \mathcal{F}} \left|\frac{1}{N} \sum_{n=1}^N\mathrm{I}\left(f(\boldsymbol{x}_n) \ne y_n\right) - \mathbb{P}_{\,}\left(f(\boldsymbol{x}) \ne y\right)\right| \rightarrow 0 \] as \(N \rightarrow \infty\).

(f)

This is just a special case of (e): since \(f(\cdot; \hat{\boldsymbol{\beta}}) \in \mathcal{F}\), the difference is bounded by the supremum in (e), which goes to zero.

7 Use the ULLN to prove logistic regression consistency (254 only \(\star\) \(\star\) \(\star\))

Consider the logistic regression loss for binary \(y\) and bounded regression coefficients:

\[ \mathcal{F}:= \{ f(\boldsymbol{x}; \boldsymbol{\beta}) = \exp(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / (1 + \exp(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \textrm{ such that }\left\Vert\boldsymbol{\beta}\right\Vert_2^2 \le K\} \]

\[ \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y) := - y\boldsymbol{\beta}^\intercal\boldsymbol{x}+ \log(1 + \exp(\boldsymbol{\beta}^\intercal\boldsymbol{x})). \]

Assuming that \(\mathbb{E}_{\vphantom{}}\left[\left\Vert\boldsymbol{x}\right\Vert\right] < \infty\), show that \(\mathcal{F}\) satisfies the integrable Lipschitz condition and so has vanishing generalization error:

\[ \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{R}}(f) - \mathscr{R}(f)\right| \rightarrow 0. \]

Hint: you can establish that a function is Lipschitz by bounding the derivative.

Solutions

Letting \(\pi(\boldsymbol{x}; \boldsymbol{\beta}) = \exp(\boldsymbol{\beta}^\intercal\boldsymbol{x}) / (1 + \exp(\boldsymbol{\beta}^\intercal\boldsymbol{x}))\), the derivative of the loss is

\[ \frac{\partial \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y)}{\partial \boldsymbol{\beta}} = - (y- \pi(\boldsymbol{x}; \boldsymbol{\beta})) \boldsymbol{x}. \]

Since \(\left|y- \pi(\boldsymbol{x}; \boldsymbol{\beta})\right| \le 1\),

\[ \sup_{\boldsymbol{\beta}} \left\Vert\frac{\partial \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y)}{\partial \boldsymbol{\beta}}\right\Vert \le \sup_{\boldsymbol{\beta}} \left|y- \pi(\boldsymbol{x}; \boldsymbol{\beta})\right| \left\Vert\boldsymbol{x}\right\Vert \le \left\Vert\boldsymbol{x}\right\Vert. \]

As we showed in class, this means that

\[ \left| \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y) - \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}'), y) \right| \le \left\Vert\boldsymbol{x}\right\Vert \left\Vert\boldsymbol{\beta}- \boldsymbol{\beta}'\right\Vert. \]

Assuming that \(\mathbb{E}_{\vphantom{}}\left[\left\Vert\boldsymbol{x}\right\Vert\right] < \infty\), the class satisfies the integrable Lipschitz condition, and so has asymptotically vanishing generalization error.
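A quick numerical sketch of the Lipschitz bound (random draws of arbitrary dimension, not part of the proof): the change in the logistic loss never exceeds \(\left\Vert\boldsymbol{x}\right\Vert \left\Vert\boldsymbol{\beta}- \boldsymbol{\beta}'\right\Vert\).

```python
import numpy as np

rng = np.random.default_rng(4)

def logistic_loss(beta, x, y):
    # -y * beta'x + log(1 + exp(beta'x))
    return -y * (beta @ x) + np.log1p(np.exp(beta @ x))

P = 5
for _ in range(1000):
    x = rng.normal(size=P)
    y = rng.integers(0, 2)
    beta, beta2 = rng.normal(size=P), rng.normal(size=P)
    lhs = abs(logistic_loss(beta, x, y) - logistic_loss(beta2, x, y))
    rhs = np.linalg.norm(x) * np.linalg.norm(beta - beta2)
    assert lhs <= rhs + 1e-12
print("Lipschitz bound held on all random draws")
```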

8 An Information Criterion partial derivation (254 only \(\star\) \(\star\) \(\star\))

In this problem, we will informally derive part of the “AIC,” often known as the “Akaike information criterion.” (Though Akaike himself modestly claimed that the acronym was “an information criterion.”) AIC was an early and important attempt at correcting for the fact that empirical risk minimization tends to be optimistic.

The purpose of this derivation is to make it clear that the AIC depends on assumptions of correct specification that are untenable in machine learning, and so to recommend against its use when cross–validation is possible.

For this problem, assume that

  • \(z_n \overset{\mathrm{IID}}{\sim}p(z_n \vert \theta_0) \textrm{ for some }\theta_0 \in \mathbb{R}^{P}\)
  • \(\mathscr{L}(z_n; \theta) = - \log p(z_n \vert\theta)\)
  • \(\hat{\theta}:= \underset{\theta}{\mathrm{argmin}}\, \hat{\mathscr{R}}(\theta)\) exists and \(\hat{\theta}\rightarrow \theta_0\) as \(N \rightarrow \infty\).

We will also assume that \(\hat{\mathscr{R}}(\theta)\) is three times continuously differentiable and so, for any \(\theta\) sufficiently close to \(\theta'\), we can form the second–order Taylor series evaluated at \(\theta'\), centered at \(\theta\):

\[ \hat{\mathscr{R}}(\theta') \approx \hat{\mathscr{R}}(\theta) + \nabla_\theta \hat{\mathscr{R}}(\theta)(\theta' - \theta) + \frac{1}{2} (\theta' - \theta)^\intercal \nabla^2_{\theta} \hat{\mathscr{R}}(\theta)(\theta' - \theta). \]

Here, the \(\nabla_\theta\) notation means

\[ \begin{aligned} \nabla_\theta \hat{\mathscr{R}}(\theta) ={}& \frac{\partial \hat{\mathscr{R}}(\theta)}{\partial \theta} & \nabla^2_{\theta} \hat{\mathscr{R}}(\theta) ={}& \frac{\partial^2 \hat{\mathscr{R}}(\theta)}{\partial \theta \partial \theta^\intercal}. \end{aligned} \]

We will use \(\mathbb{E}_{z}\left[\cdot\right]\) to denote expectation with respect to a new datapoint \(z\), and \(\mathbb{E}_{\mathcal{D}}\left[\cdot\right]\) to denote expectation with respect to the data, \(z_1,\ldots,z_N\).

(a)

Show that \(\nabla_\theta \hat{\mathscr{R}}(\hat{\theta}) = \boldsymbol{0}\), and so

\[ \hat{\mathscr{R}}(\theta_0) \approx \hat{\mathscr{R}}(\hat{\theta}) + \frac{1}{2} (\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{R}}(\hat{\theta})(\theta_0 - \hat{\theta}), \]

which is reasonable because \(\theta_0 - \hat{\theta}\) is small as \(N\rightarrow \infty\).

(b)

Show that \(\mathbb{E}_{\mathcal{D}}\left[\hat{\mathscr{R}}(\theta_0)\right] = \mathbb{E}_{z}\left[\mathscr{L}(z; \theta_0)\right] = \mathscr{R}(\theta_0)\).

(c)

Show that, if \(\nabla^2_{\theta} \hat{\mathscr{R}}(\hat{\theta}) \approx \nabla^2_{\theta} \mathscr{R}(\theta_0)\), which is reasonable because \(\theta_0 - \hat{\theta}\) is small as \(N\rightarrow \infty\), \[ \mathscr{R}(\theta_0) \approx \mathbb{E}_{\vphantom{}}\left[\hat{\mathscr{R}}(\hat{\theta})\right] + \frac{1}{2} \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{R}(\theta_0) \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})(\theta_0 - \hat{\theta})^\intercal\right]\right). \]

(d)

Recall from introductory statistics that, if the model is correctly specified, then

\[ \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})(\theta_0 - \hat{\theta})^\intercal\right] \approx \frac{1}{N} \nabla^2_{\theta} \mathscr{R}(\theta_0)^{-1}. \]

Using this, conclude that

\[ \mathscr{R}(\theta_0) \approx \mathbb{E}_{\vphantom{}}\left[\hat{\mathscr{R}}(\hat{\theta})\right] + \frac{1}{2} \frac{P}{N}, \]

where \(P\) is the dimension of \(\theta\).

(e)

The full AIC uses a similar argument to control the difference \(\mathscr{R}(\hat{\theta}) - \mathscr{R}(\theta_0)\), giving another correction of \(P / (2N)\) (again under correct specification) for a total of \(P/N\), and

\[ 2 \mathscr{R}(\hat{\theta}) \approx 2 \mathbb{E}_{\vphantom{}}\left[\hat{\mathscr{R}}(\hat{\theta})\right] + 2 \frac{P}{N} := AIC. \]

Comment on the relative strengths and weakness of cross validation relative to the AIC approximation to estimate \(\mathscr{R}(\hat{\theta})\).

Solutions

(a)

Since \(\hat{\mathscr{R}}\) is continuously differentiable, its gradient is zero at its minimizer, so \(\nabla_\theta \hat{\mathscr{R}}(\hat{\theta}) = \boldsymbol{0}\). The rest uses this identity and the Taylor series centered at \(\hat{\theta}\) evaluated at \(\theta_0\).

(b)

\[ \begin{aligned} \mathbb{E}_{\mathcal{D}}\left[\hat{\mathscr{R}}(\theta_0)\right] ={}& \mathbb{E}_{\mathcal{D}}\left[\frac{1}{N} \sum_{n=1}^N\mathscr{L}(z_n; \theta_0)\right] \\={}& \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\mathcal{D}}\left[\mathscr{L}(z_n; \theta_0)\right] \\={}& \frac{1}{N} \sum_{n=1}^N\mathscr{R}(\theta_0) \\={}& \mathscr{R}(\theta_0). \end{aligned} \]

(c)

Take expectations of (a), plug in (b), and use the trace trick: \[ \begin{aligned} \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{R}}(\hat{\theta})(\theta_0 - \hat{\theta}) \right] ={}& \mathbb{E}_{\mathcal{D}}\left[ \mathrm{trace}\left((\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{R}}(\hat{\theta}) (\theta_0 - \hat{\theta})\right) \right] \\={}& \mathbb{E}_{\mathcal{D}}\left[ \mathrm{trace}\left( \nabla^2_{\theta} \hat{\mathscr{R}}(\hat{\theta}) (\theta_0 - \hat{\theta}) (\theta_0 - \hat{\theta})^\intercal \right) \right] \\\approx{}& \mathbb{E}_{\mathcal{D}}\left[ \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{R}(\theta_0) (\theta_0 - \hat{\theta}) (\theta_0 - \hat{\theta})^\intercal \right) \right] \\={}& \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{R}(\theta_0) \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta}) (\theta_0 - \hat{\theta})^\intercal \right] \right) \end{aligned} \]

(d)

Plugging the inverse Fisher covariance approximation into (c) gives

\[ \begin{aligned} \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{R}}(\hat{\theta})(\theta_0 - \hat{\theta}) \right] \approx{}& \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{R}(\theta_0) \frac{1}{N} \nabla^2_{\theta} \mathscr{R}(\theta_0)^{-1} \right) \\={}& \frac{1}{N} \mathrm{trace}\left({\boldsymbol{I}_{P}}\right) \\={}& \frac{P}{N}. \end{aligned} \]

Plug this into the Taylor series to get the desired result.
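A Monte Carlo sketch of the correction in an assumed toy model (not from the problem) where everything can be checked by hand: \(z \sim \mathcal{N}(\theta_0, 1)\) with known unit variance, \(\theta\) the mean, \(P = 1\), and \(\mathscr{L}(z; \theta) = -\log p(z \vert \theta)\). The gap \(\mathscr{R}(\theta_0) - \mathbb{E}_{\mathcal{D}}[\hat{\mathscr{R}}(\hat{\theta})]\) should come out near \(P / (2N)\).

```python
import numpy as np

rng = np.random.default_rng(5)
theta_0, N, reps = 5.0, 20, 100_000
const = 0.5 * np.log(2 * np.pi)

risk_theta0 = const + 0.5                  # R(theta_0) = E[-log p(z | theta_0)]
train_risks = []
for _ in range(reps):
    z = rng.normal(theta_0, 1.0, size=N)
    theta_hat = z.mean()                   # minimizes the empirical risk
    train_risks.append(const + 0.5 * np.mean((z - theta_hat) ** 2))

gap = risk_theta0 - np.mean(train_risks)
print(gap, "vs P / (2N) =", 1 / (2 * N))   # these should nearly match
```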

(e)

AIC is computationally much easier than CV, as it requires only a single model fit. But it requires the assumptions that the model is correctly specified, that you are doing maximum likelihood estimation, and that the asymptotic approximations hold. Cross–validation requires many model refits, but it makes none of these assumptions.
