Homework 2 (due March 1st 9pm)
Stat 154/254: Statistical Machine Learning
1 Infinitely fine step function ULLN fail
In this homework problem, we carefully analyze the first simple example of ULLN failure given in lecture.
Let \(x\sim \textrm{Unif}(0,1)\) and \(y\sim\mathcal{N}\left(0,1\right)\) be independent, and let \((x_n, y_n)\) be IID draws from this joint distribution. For each \(K\), define the step functions
\[ z_{jK}(x) = \mathrm{I}\left((j - 1)/K \le x\le j/K\right) \quad\textrm{for}\quad j=1,\ldots,K. \]
Define the function class
\[ \mathcal{F}= \left\{ f(x; \boldsymbol{\beta}) = \sum_{K=1}^\infty \sum_{j=1}^K \beta_{jK} z_{jK}(x) \textrm{ for some set of }\beta_{jK} \right\}. \]
Note that the vector \(\boldsymbol{\beta}\) may be infinitely long. Finally, define
\[ \hat{\mathscr{L}}(f) := \frac{1}{N} \sum_{n=1}^N(f(x_n) - y_n)^2 \quad\textrm{and}\quad \mathscr{L}(f) = \mathbb{E}_{\,}\left[(f(x) - y)^2\right] \]
(a)
Show that \(\mathbb{E}_{\,}\left[z_{jK}\right] = \mathbb{E}_{\,}\left[z_{jK}^2\right] = 1/K\) and so
\[ \mathbb{E}_{\,}\left[f(x; \boldsymbol{\beta})\right] = \sum_{K=1}^\infty \frac{1}{K} \sum_{j=1}^K \beta_{jK}. \]
(b)
Define \(\mathcal{F}_K := \left\{ f(x; \boldsymbol{\gamma}) = \sum_{j=1}^K \gamma_{j} z_{jK}(x) \textrm{ for some set of }\gamma_{j} \right\}\). Show that \(\mathcal{F}_K \subset \mathcal{F}\).
(c)
With probability one, all \(x_n\) are distinct in any dataset of size \(N\). Show that you can take \(K\) large enough that there exists \(f\in \mathcal{F}_K\) such that
\[ \begin{aligned} \hat{\mathscr{L}}(f) ={}& \frac{1}{N} \sum_{n=1}^N(f(x_n; \hat{\boldsymbol{\gamma}}) - y_n)^2 ={} 0, \end{aligned} \]
where \(\hat{\boldsymbol{\gamma}}\) is the OLS estimator of \(\boldsymbol{\gamma}\).
(d)
Show that \(\mathscr{L}(f) \ge 1\) for any \(f\).
(e)
Combine the above results to show that \(\mathcal{F}\) does not satisfy a ULLN in the sense that
\[ \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \ge 1. \]
(a)
Since \(x\) is uniform, \(\mathbb{E}_{\,}\left[z_{jK}\right] = \mathbb{E}_{\,}\left[\mathrm{I}\left((j - 1)/K \le x \le j / K\right)\right] = \frac{j}{K} - \frac{j - 1}{K} = 1/K\). And the indicator squared is the same as the indicator. Then
\[ \mathbb{E}_{\,}\left[f(x; \boldsymbol{\beta})\right] = \sum_{K=1}^\infty \sum_{j=1}^K \beta_{jK} \mathbb{E}_{\,}\left[z_{jK}\right] = \sum_{K=1}^\infty \frac{1}{K} \sum_{j=1}^K \beta_{jK}. \]
(b)
Any element of \(\mathcal{F}_K\) can be formed by setting \(\beta_{jK'} = 0\) for \(K' \ne K\).
(c)
Let \(\Delta = \min_{n \ne m} |x_n - x_m|\). Since the \(x_n\) are distinct, \(\Delta > 0\). Take \(K > 1 / \Delta\); then the \(x_n\) all fall in distinct bins of width \(1/K\), and so each \(z_{jK}(x_n)\) is either \(0\) for all \(n\) (if no \(x_n\) falls in its interval) or \(1\) for exactly one \(n\) (if a single \(x_n\) falls in its interval).
Writing \(\boldsymbol{Z}_j = (z_{jK}(x_1), \ldots, z_{jK}(x_N))^\intercal\) as the vector of values taken by \(z_{jK}\), we thus have \(\boldsymbol{Z}_j^\intercal\boldsymbol{Z}_\ell = 0\) for \(j \ne \ell\), and there are exactly \(N\) distinct nonzero \(\boldsymbol{Z}_j\), each with \(\boldsymbol{Z}_j^\intercal\boldsymbol{Z}_j = 1\). This means the \(N \times K\) matrix \((\boldsymbol{Z}_1 \ldots \boldsymbol{Z}_{K})\) is rank \(N\), and so the projection of \(\boldsymbol{Y}\) onto the column span of \(\boldsymbol{Z}\) is just \(\boldsymbol{Y}\); that is, the OLS fit interpolates the data and the training loss is zero.
(d)
Since \(y\) and \(x\) are in fact independent, \[ \begin{aligned} \mathscr{L}(f) ={}& \mathbb{E}_{\,}\left[(y- f(x))^2\right] \\={}& \mathbb{E}_{\,}\left[y^2\right] - 2 \mathbb{E}_{\,}\left[y\right] \mathbb{E}_{\,}\left[f(x)\right] + \mathbb{E}_{\,}\left[f(x)^2\right] \\={}& \mathbb{E}_{\,}\left[y^2\right] + \mathbb{E}_{\,}\left[f(x)^2\right] \\\ge{}& \mathbb{E}_{\,}\left[y^2\right] \\={}& 1. \end{aligned} \]
(e)
Let \(\hat{f}\in \mathcal{F}_K \subset \mathcal{F}\) denote the interpolating function from (c). Then \[ \begin{aligned} \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \ge{}& \left|\hat{\mathscr{L}}(\hat{f}) - \mathscr{L}(\hat{f})\right| & \textrm{(by (b) and (c))} \\={}& \left|\mathscr{L}(\hat{f})\right| & \textrm{(by (c))} \\\ge{}& 1 & \textrm{(by (d))}. \end{aligned} \]
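As an optional numerical illustration (ours, not part of the assignment), the following Python sketch constructs the interpolating step–function fit from (c) and checks (c)–(e): the training loss is essentially zero, while the loss on fresh data is at least one, so the generalization gap is at least one. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
x = rng.uniform(size=N)
y = rng.normal(size=N)

# (c): choose K larger than 1 / (minimum spacing) so each x_n falls in its own bin.
delta = np.min(np.abs(x[:, None] - x[None, :]) + np.eye(N))  # min over n != m
K = int(np.ceil(1 / delta)) + 1

def bin_index(x_vals):
    """Index j - 1 of the bin [(j - 1)/K, j/K] containing each point."""
    return np.minimum((np.asarray(x_vals) * K).astype(int), K - 1)

# OLS on the N x K indicator design interpolates the data.
Z = np.zeros((N, K))
Z[np.arange(N), bin_index(x)] = 1.0
gamma_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)

def f_hat(x_vals):
    """Fitted step function: the coefficient of the bin containing each point."""
    return gamma_hat[bin_index(x_vals)]

train_loss = np.mean((f_hat(x) - y) ** 2)

# (d)-(e): on fresh data the loss is at least E[y^2] = 1, so the ULLN gap is >= 1.
x_new, y_new = rng.uniform(size=100_000), rng.normal(size=100_000)
test_loss = np.mean((f_hat(x_new) - y_new) ** 2)
print(f"K = {K}, train loss = {train_loss:.2e}, test loss = {test_loss:.3f}")
```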
2 An Information Criterion partial derivation
In this problem, we will informally derive part of the “AIC,” often known as the “Akaike information criterion.” (Though Akaike himself modestly claimed that the acronym was “an information criterion.”) AIC was an early and important attempt at correcting for the fact that empirical risk minimization tends to be optimistic.
The purpose of this derivation is to make it clear that the AIC depends on assumptions of correct specification that are untenable in machine learning, and so to recommend against its use when cross–validation is possible.
For this problem, assume that
- \(z_n \overset{\mathrm{IID}}{\sim}p(z_n \vert \theta_0) \textrm{ for some }\theta_0 \in \mathbb{R}^{P}\)
- \(\mathscr{L}(z_n; \theta) = - \log p(z_n \vert\theta)\)
- \(\hat{\theta}:= \underset{\theta}{\mathrm{argmin}}\, \hat{\mathscr{L}}(\theta)\) exists and \(\hat{\theta}\rightarrow \theta_0\) as \(N \rightarrow \infty\).
We will also assume that \(\hat{\mathscr{L}}(\theta)\) is three times continuously differentiable and so, for any \(\theta\) sufficiently close to \(\theta'\), we can form the second–order Taylor series evaluated at \(\theta'\), centered at \(\theta\):
\[ \hat{\mathscr{L}}(\theta') \approx \hat{\mathscr{L}}(\theta) + \nabla_\theta \hat{\mathscr{L}}(\theta)^\intercal(\theta' - \theta) + \frac{1}{2} (\theta' - \theta)^\intercal \nabla^2_{\theta} \hat{\mathscr{L}}(\theta)(\theta' - \theta). \]
Here, the \(\nabla_\theta\) notation means
\[ \begin{aligned} \nabla_\theta \hat{\mathscr{L}}(\theta) ={}& \frac{\partial \hat{\mathscr{L}}(\theta)}{\partial \theta} & \nabla^2_{\theta} \hat{\mathscr{L}}(\theta) ={}& \frac{\partial^2 \hat{\mathscr{L}}(\theta)}{\partial \theta \partial \theta^\intercal}. \end{aligned} \]
We will use \(\mathbb{E}_{z}\left[\cdot\right]\) to denote expectation with respect to a new datapoint \(z\), and \(\mathbb{E}_{\mathcal{D}}\left[\cdot\right]\) to denote expectation with respect to the data, \(z_1,\ldots,z_N\).
(a)
Show that \(\nabla_\theta \hat{\mathscr{L}}(\hat{\theta}) = \boldsymbol{0}\), and so
\[ \hat{\mathscr{L}}(\theta_0) \approx \hat{\mathscr{L}}(\hat{\theta}) + \frac{1}{2} (\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{L}}(\hat{\theta})(\theta_0 - \hat{\theta}), \]
which is reasonable because \(\theta_0 - \hat{\theta}\) is small as \(N\rightarrow \infty\).
(b)
Show that \(\mathbb{E}_{\mathcal{D}}\left[\hat{\mathscr{L}}(\theta_0)\right] = \mathbb{E}_{z}\left[\mathscr{L}(z; \theta_0)\right] = \mathscr{L}(\theta_0)\).
(c)
Show that, if \(\nabla^2_{\theta} \hat{\mathscr{L}}(\hat{\theta}) \approx \nabla^2_{\theta} \mathscr{L}(\theta_0)\) (which is reasonable because \(\theta_0 - \hat{\theta}\) is small as \(N\rightarrow \infty\)), then \[ \mathscr{L}(\theta_0) \approx \mathbb{E}_{\mathcal{D}}\left[\hat{\mathscr{L}}(\hat{\theta})\right] + \frac{1}{2} \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{L}(\theta_0) \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})(\theta_0 - \hat{\theta})^\intercal\right]\right). \]
(d)
Recall from introductory statistics that, if the model is correctly specified, then
\[ \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})(\theta_0 - \hat{\theta})^\intercal\right] \approx \frac{1}{N} \nabla^2_{\theta} \mathscr{L}(\theta_0)^{-1}. \]
Using this, conclude that
\[ \mathscr{L}(\theta_0) \approx \mathbb{E}_{\mathcal{D}}\left[\hat{\mathscr{L}}(\hat{\theta})\right] + \frac{1}{2} \frac{P}{N}, \]
where \(P\) is the dimension of \(\theta\).
(e)
The full AIC uses a similar argument to control the difference \(\mathscr{L}(\hat{\theta}) - \mathscr{L}(\theta_0)\), giving another \(\frac{1}{2}\frac{P}{N}\) (again under correct specification), and
\[ 2 \mathscr{L}(\hat{\theta}) \approx 2 \mathbb{E}_{\,}\left[\hat{\mathscr{L}}(\hat{\theta})\right] + 2 \frac{P}{N} := AIC. \]
Comment on the relative strengths and weaknesses of cross–validation and the AIC approximation as estimates of \(\mathscr{L}(\hat{\theta})\).
(a)
Since \(\hat{\mathscr{L}}\) is continuously differentiable and \(\hat{\theta}\) is its minimizer, the gradient vanishes there, so \(\nabla_\theta \hat{\mathscr{L}}(\hat{\theta}) = \boldsymbol{0}\). Substituting this into the Taylor series centered at \(\hat{\theta}\) and evaluated at \(\theta_0\) gives the displayed approximation.
(b)
\[ \begin{aligned} \mathbb{E}_{\mathcal{D}}\left[\hat{\mathscr{L}}(\theta_0)\right] ={}& \mathbb{E}_{\mathcal{D}}\left[\frac{1}{N} \sum_{n=1}^N\mathscr{L}(z_n; \theta_0)\right] \\={}& \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\mathcal{D}}\left[\mathscr{L}(z_n; \theta_0)\right] \\={}& \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{z}\left[\mathscr{L}(z; \theta_0)\right] \\={}& \mathscr{L}(\theta_0). \end{aligned} \]
(c)
Take expectations of (a), plug in (b), and use the trace trick: \[ \begin{aligned} \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{L}}(\hat{\theta})(\theta_0 - \hat{\theta}) \right] ={}& \mathbb{E}_{\mathcal{D}}\left[ \mathrm{trace}\left((\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{L}}(\hat{\theta}) (\theta_0 - \hat{\theta})\right) \right] \\={}& \mathbb{E}_{\mathcal{D}}\left[ \mathrm{trace}\left( \nabla^2_{\theta} \hat{\mathscr{L}}(\hat{\theta}) (\theta_0 - \hat{\theta}) (\theta_0 - \hat{\theta})^\intercal \right) \right] \\\approx{}& \mathbb{E}_{\mathcal{D}}\left[ \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{L}(\theta_0) (\theta_0 - \hat{\theta}) (\theta_0 - \hat{\theta})^\intercal \right) \right] \\={}& \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{L}(\theta_0) \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta}) (\theta_0 - \hat{\theta})^\intercal \right] \right). \end{aligned} \]
(d)
Plugging the inverse Fisher covariance approximation into (c) gives
\[ \begin{aligned} \mathbb{E}_{\mathcal{D}}\left[ (\theta_0 - \hat{\theta})^\intercal \nabla^2_{\theta} \hat{\mathscr{L}}(\hat{\theta})(\theta_0 - \hat{\theta}) \right] \approx{}& \mathrm{trace}\left( \nabla^2_{\theta} \mathscr{L}(\theta_0) \frac{1}{N} \nabla^2_{\theta} \mathscr{L}(\theta_0)^{-1} \right) \\={}& \frac{1}{N} \mathrm{trace}\left(\boldsymbol{I}_P\right) \\={}& \frac{P}{N}. \end{aligned} \]
Plug this into the Taylor series to get the desired result.
(e)
AIC is computationally much easier than CV, as it requires only a single model fit. But it requires the assumption that the model is correctly specified, that you are doing maximum likelihood estimation, and that the asymptotics hold.
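As a hedged sanity check (ours, not part of the assignment), the sketch below simulates a correctly specified Gaussian location model, where \(P = 1\), the MLE is the sample mean, and both losses are available in closed form, and compares the average optimism \(\mathbb{E}_{\mathcal{D}}[\mathscr{L}(\hat{\theta}) - \hat{\mathscr{L}}(\hat{\theta})]\) to the combined \(P/N\) correction from (d) and (e).

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, N, n_sims = 2.0, 50, 20000
const = 0.5 * np.log(2 * np.pi)  # constant term of the Gaussian negative log likelihood

gaps = []
for _ in range(n_sims):
    z = rng.normal(theta0, 1.0, size=N)
    theta_hat = z.mean()                                         # MLE of the mean
    train_loss = const + 0.5 * np.mean((z - theta_hat) ** 2)     # \hat{L}(theta_hat)
    true_loss = const + 0.5 * (1.0 + (theta_hat - theta0) ** 2)  # L(theta_hat), closed form
    gaps.append(true_loss - train_loss)

print(f"average optimism = {np.mean(gaps):.4f}")
print(f"P / N prediction = {1 / N:.4f}")
```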
3 Leave–one–out with an estimator that depends strongly on N
This example is designed to strengthen the intuition about why leave–K–out estimators are biased.
Consider \(y_n \overset{\mathrm{IID}}{\sim}\mathcal{N}\left(5, 1\right)\) and squared loss: \(\mathscr{L}(\hat{y}, y) = (\hat{y}- y)^2\), and \(\mathscr{L}(f) = \mathbb{E}_{\,}\left[(f- y)^2\right]\). We will have no regressors, so our estimating functions are just constants.
We will take \(N = 100\) for this problem, and let \(\boldsymbol{Y}_{-n}\) denote the leave–one–out vector, \(\boldsymbol{Y}_{-n} = (y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N)\).
An estimation procedure \(F\) takes in a dataset \(\boldsymbol{Y}= (y_1, \ldots, y_N)\) and returns a constant estimate \(\hat{y}= F(\boldsymbol{Y}) \in \mathbb{R}^{}\). For this, we consider the (silly) estimator
\[ \hat{y}= F(\boldsymbol{Y}) = \begin{cases} 5 & \textrm{if the length of }\boldsymbol{Y}\ge 100 \\ 0 & \textrm{if the length of }\boldsymbol{Y}< 100. \end{cases} \]
(a)
What is the optimal estimate, \(f^{\star}= \underset{f}{\mathrm{argmin}}\, \mathscr{L}(f)\)?
(b)
What is \(\mathscr{L}(F(\boldsymbol{Y}))\)? What is \(\mathscr{L}(F(\boldsymbol{Y}_{-n}))\)?
(c)
Recall the leave–one–out cross–validation (CV) estimator
\[ \hat{\mathscr{L}}_{CV}(\boldsymbol{Y}) := \frac{1}{N} \sum_{n=1}^N\mathscr{L}(F(\boldsymbol{Y}_{-n}), y_n). \]
What is \(\mathbb{E}_{\boldsymbol{Y}}\left[\hat{\mathscr{L}}_{CV}(\boldsymbol{Y})\right]\)? Is the leave–one–out CV estimator an unbiased estimate of \(\mathscr{L}(F(\boldsymbol{Y}))\)?
(d)
Connect this toy example with the intuition that leave–\(K\)–out CV tends to be more biased for larger \(K\).
(a)
\(f^{\star}= \mathbb{E}_{\,}\left[y\right] = 5\).
(b)
Here, \(F(\boldsymbol{Y}) = 5\), so \(\mathscr{L}(F(\boldsymbol{Y})) = \mathbb{E}_{\,}\left[(y- 5)^2\right] = 1\).
In contrast, \(F(\boldsymbol{Y}_{-n}) = 0\), so \(\mathscr{L}(F(\boldsymbol{Y}_{-n})) = \mathbb{E}_{\,}\left[y^2\right] = 1 + 5^2 = 26\).
(c)
Using (b),
\[ \hat{\mathscr{L}}_{CV}(\boldsymbol{Y}) = \frac{1}{N} \sum_{n=1}^N\mathscr{L}(F(\boldsymbol{Y}_{-n}), y_n) = \frac{1}{N} \sum_{n=1}^N(0 - y_n)^2 = \frac{1}{N} \sum_{n=1}^Ny_n^2, \]
and
\(\mathbb{E}_{\boldsymbol{Y}}\left[\hat{\mathscr{L}}_{CV}(\boldsymbol{Y})\right] = \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\,}\left[y_n^2\right] = 26\). So CV is not an unbiased estimate of \(\mathscr{L}(F(\boldsymbol{Y})) = 1\).
(d)
CV estimates the loss of an estimation procedure that uses fewer datapoints than the original estimator. If the estimator depends strongly on the number of datapoints used, as it does artificially in this case, then leaving data out can severely bias the CV estimate of the loss.
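A tiny simulation (ours, not part of the assignment) of the silly estimator makes the size of this bias concrete: the leave–one–out CV estimate lands near \(26\), even though the loss of the full–data fit \(F(\boldsymbol{Y}) = 5\) is \(1\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
y = rng.normal(5.0, 1.0, size=N)

def F(y_vec):
    """The 'silly' estimator: 5 if at least 100 points, 0 otherwise."""
    return 5.0 if len(y_vec) >= 100 else 0.0

# Leave-one-out CV estimate of the loss.
cv_losses = [(F(np.delete(y, n)) - y[n]) ** 2 for n in range(N)]
print(f"LOO CV estimate: {np.mean(cv_losses):.2f}")  # roughly E[y^2] = 26
print("true loss of F(Y): 1.00")                     # since F(Y) = 5 and Var(y) = 1
```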
4 Leave–one–out CV for linear regression
This homework problem derives a closed-form expression for the leave–one–out CV estimator for linear regression.
We will use the following result, known as the Woodbury formula (but also by many other names, including the Sherman-Morrison-Woodbury formula). Let \(A\) denote an invertible matrix, and let \(\boldsymbol{u}\) and \(\boldsymbol{v}\) be column vectors whose length equals the dimension of \(A\). Then
\[ (A + \boldsymbol{u}\boldsymbol{v}^\intercal)^{-1} ={} A^{-1} - \frac{A^{-1} \boldsymbol{u}\boldsymbol{v}^\intercal A^{-1}}{1 + \boldsymbol{v}^\intercal A^{-1} \boldsymbol{u}}. \]
We will also use the definition of a “leverage score” from lecture: \(h_n := \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n\). Note that \(h_n = (\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal)_{nn}\) is the \(n\)–th diagonal entry of the projection matrix \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\).
Let \(\hat{\boldsymbol{\beta}}_{-n}\) denote the estimate of \(\boldsymbol{\beta}\) with datapoint \(n\) left out. Similarly, let \(\boldsymbol{X}_{-n}\) denote the \(\boldsymbol{X}\) matrix with row \(n\) left out, and \(\boldsymbol{Y}_{-n}\) denote the \(\boldsymbol{Y}\) vector with entry \(n\) left out.
(a)
Prove that
\[ \hat{\boldsymbol{\beta}}_{-n} = (\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n})^{-1} \boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n} = (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} (\boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n) \]
(b)
Using the Woodbury formula, derive the following expression: \[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - h_n} \]
(c)
Combine (a) and (b) to derive the following explicit expression for \(\hat{\boldsymbol{\beta}}_{-n}\):
\[ \hat{\boldsymbol{\beta}}_{-n} = \hat{\boldsymbol{\beta}}- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1}{1 - h_n} \hat{\varepsilon}_n. \]
(d)
Using (c), derive the following explicit expression for the leave–one–out error on the \(n\)–th observation:
\[ y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n} = \frac{\hat{\varepsilon}_n}{1 - h_n}. \]
(e)
Using (d), prove that the leave–one–out CV estimator of \(\mathscr{L}(\hat{\boldsymbol{\beta}})\) is given by
\[ \frac{1}{N} \sum_{n=1}^N(y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n})^2 = \frac{1}{N} \sum_{n=1}^N\frac{\hat{\varepsilon}_n^2}{(1 - h_n)^2}, \]
where \(\hat{\varepsilon}_n = y_n - \hat{y}_n\) is the residual from the full regression without leaving any data out. Using this formula, leave–one–out CV can be computed using only the original regression and \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) rather than doing \(N\) separate matrix inversions.
(a)
This follows from \(\boldsymbol{X}^\intercal\boldsymbol{X}= \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \sum_{n=1}^N\boldsymbol{x}_n y_n\).
(b)
Direct application of the formula with \(\boldsymbol{u}= \boldsymbol{x}_n\) and \(\boldsymbol{v}= -\boldsymbol{x}_n\) gives \[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n}. \]
Then recognize \(\boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n\) in the denominator as the leverage score \(h_n\).
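As an optional numerical sanity check of this rank–one downdate (ours, with made-up dimensions and variable names), the Woodbury expression can be compared against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
x_n = X[0]

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
h_n = x_n @ XtX_inv @ x_n  # leverage score of observation n = 0

# Woodbury / Sherman-Morrison downdate of (X^T X - x_n x_n^T)^{-1}.
downdate = XtX_inv + np.outer(XtX_inv @ x_n, x_n @ XtX_inv) / (1 - h_n)
direct = np.linalg.inv(XtX - np.outer(x_n, x_n))
print(np.max(np.abs(downdate - direct)))  # ~ 1e-15
```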
(c)
We have
\[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \hat{\boldsymbol{\beta}}+ \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}}{1 - h_n} \]
and
\[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \boldsymbol{x}_n y_n = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n y_n + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n h_n }{1 - h_n} y_n. \]
Combining,
\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{-n} ={}& \hat{\boldsymbol{\beta}}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{\boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}}{1 - h_n} - y_n - \frac{h_n }{1 - h_n} y_n \right) \\={}& \hat{\boldsymbol{\beta}}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{1 }{1 - h_n} \hat{y}_n - \left(1 + \frac{h_n }{1 - h_n} \right)y_n \right) \\={}& \hat{\boldsymbol{\beta}}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{1 }{1 - h_n} \hat{y}_n - \frac{1 }{1 - h_n} y_n \right) \\={}& \hat{\boldsymbol{\beta}}- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1 }{1 - h_n} \hat{\varepsilon}_n. \end{aligned} \]
(d)
\[ \begin{aligned} y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n} ={}& y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}+ \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1 }{1 - h_n} \hat{\varepsilon}_n \\={}& y_n - \hat{y}_n + \frac{h_n }{1 - h_n} \hat{\varepsilon}_n \\={}& \left( 1 + \frac{h_n }{1 - h_n} \right)\hat{\varepsilon}_n \\={}& \frac{\hat{\varepsilon}_n }{1 - h_n}. \end{aligned} \]
(e)
Substituting the result of (d) into the definition of the leave–one–out CV estimator gives the expression immediately.
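The following sketch (ours, with assumed variable names, not part of the assignment) verifies the resulting shortcut numerically by comparing it against \(N\) brute-force refits:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=N)

# Full-data fit, residuals, and leverage scores.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = np.einsum("np,pq,nq->n", X, XtX_inv, X)  # h_n = x_n^T (X^T X)^{-1} x_n

shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: refit N times, each time leaving out one observation.
brute = np.mean([
    (y[n] - X[n] @ np.linalg.lstsq(np.delete(X, n, 0), np.delete(y, n), rcond=None)[0]) ** 2
    for n in range(N)
])
print(f"shortcut = {shortcut:.6f}, brute force = {brute:.6f}")
```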
5 Use the ULLN to prove logistic regression consistency (254 only \(\star\) \(\star\) \(\star\))
Consider the logistic regression loss for binary \(y\) and bounded regression coefficients:
\[ \mathcal{F}:= \{ f(\boldsymbol{x}; \boldsymbol{\beta}) = \exp(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / (1 + \exp(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \textrm{ such that }\left\Vert\boldsymbol{\beta}\right\Vert_2^2 \le K\} \]
\[ \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y) := - y\boldsymbol{\beta}^\intercal\boldsymbol{x}+ \log(1 + \exp(\boldsymbol{\beta}^\intercal\boldsymbol{x})). \]
Assuming that \(\mathbb{E}_{\,}\left[\left\Vert\boldsymbol{x}\right\Vert\right] < \infty\), show that \(\mathcal{F}\) satisfies the integrable Lipschitz condition and so has vanishing generalization error:
\[ \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \rightarrow 0. \]
Hint: you can establish that a function is Lipschitz by bounding the derivative.
Letting \(\pi(\boldsymbol{x}; \boldsymbol{\beta}) = \exp(\boldsymbol{\beta}^\intercal\boldsymbol{x}) / (1 + \exp(\boldsymbol{\beta}^\intercal\boldsymbol{x}))\), the derivative of the loss is
\[ \frac{\partial \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y)}{\partial \boldsymbol{\beta}} = - (y- \pi(\boldsymbol{x}; \boldsymbol{\beta})) \boldsymbol{x}. \]
Since \(\left|y- \pi(\boldsymbol{x}; \boldsymbol{\beta})\right| \le 1\),
\[ \sup_{\boldsymbol{\beta}} \left\Vert\frac{\partial \mathscr{L}(f(\boldsymbol{x}), y)}{\partial \boldsymbol{\beta}}\right\Vert \le \sup_{\boldsymbol{\beta}} \left|y- \pi(\boldsymbol{x}; \boldsymbol{\beta})\right| \left\Vert\boldsymbol{x}\right\Vert \le \left\Vert\boldsymbol{x}\right\Vert. \]
As we showed in class, this means that
\[ \left| \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}), y) - \mathscr{L}(f(\boldsymbol{x}; \boldsymbol{\beta}'), y) \right| \le \left\Vert\boldsymbol{x}\right\Vert \left\Vert\boldsymbol{\beta}- \boldsymbol{\beta}'\right\Vert. \]
Assuming that \(\mathbb{E}_{\,}\left[\left\Vert\boldsymbol{x}\right\Vert\right] < \infty\), the class satisfies the integrable Lipschitz condition, and so has asymptotically vanishing generalization error.
Note: the original problem erroneously did not state the necessary assumption that \(\mathbb{E}_{\,}\left[\left\Vert\boldsymbol{x}\right\Vert\right] < \infty\). Students should not lose points for assuming it on their own.
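As an optional numerical sanity check (ours, not part of the assignment), the sketch below verifies the Lipschitz bound \(|\mathscr{L}(f(\boldsymbol{x};\boldsymbol{\beta}), y) - \mathscr{L}(f(\boldsymbol{x};\boldsymbol{\beta}'), y)| \le \left\Vert\boldsymbol{x}\right\Vert \left\Vert\boldsymbol{\beta} - \boldsymbol{\beta}'\right\Vert\) on random draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(beta, x, y):
    """Negative log likelihood of a single (x, y) pair under logistic regression."""
    eta = x @ beta
    return -y * eta + np.log1p(np.exp(eta))

P = 4
for _ in range(1000):
    x = rng.normal(size=P)
    y = float(rng.integers(0, 2))
    beta, beta_prime = rng.normal(size=P), rng.normal(size=P)
    lhs = abs(logistic_loss(beta, x, y) - logistic_loss(beta_prime, x, y))
    rhs = np.linalg.norm(x) * np.linalg.norm(beta - beta_prime)
    assert lhs <= rhs + 1e-12
print("Lipschitz bound |L(beta) - L(beta')| <= ||x|| ||beta - beta'|| held in all draws.")
```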
6 A one–parameter expressive function (254 only \(\star\) \(\star\) \(\star\))
Note: this solution path does not seem correct as written. This question is now extra credit (for a correct version)!
In this exercise, we dig deeper into the example given in Figure 7.5 of Hastie, Tibshirani, and Friedman (2009). One reason this is an interesting example is that it is a smooth one–dimensional family of functions that is nevertheless extremely expressive, giving the lie to the notion of “counting parameters” to measure complexity (see, e.g., the AIC above).
Let \(x\in [0,1]\), \(y\in \{0, 1\}\) and consider the function class
\[ \mathcal{F}:= \{ f(x; \theta) = \mathrm{I}\left(\cos(\pi \theta x) \ge 0\right) \textrm{ for }\theta \ge 0 \}. \]
This problem will use the misclassification loss \(\mathscr{L}(\hat{y}, y) = \mathrm{I}\left(\hat{y}\ne y\right)\). The goal is to show carefully that, for any finite dataset,
\[ \inf_{f\in \mathcal{F}} \frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(x_n), y_n) = 0. \]
Since, in general, \(\mathbb{E}_{\,}\left[\mathscr{L}(f(x), y)\right] > 0\), this means \(\mathcal{F}\) does not obey a ULLN.
Note: I will sketch one particular way to prove this, which might not be the best. If you have a more elegant proof, you may submit it in place of this entire problem, and we will grade generously!
(a)
Draw a picture of the functions \(f(x; 1)\), \(f(x; 2)\), and \(f(x; 10)\).
(b)
Show that \(x_n\) is correctly classified if there exists an integer \(K_n\) such that \(\theta x_n = K_n + \varepsilon_n\), \(\left|\varepsilon_n\right| < 0.5\), and \(K_n\) is odd when \(y_n = 0\) or even when \(y_n = 1\).
(c)
For any \(x_n\) and any \(\delta\), show that there exist integers \(A_n\) and \(B_n\) such that
- \(x'_n = \frac{A_n}{B_n}\)
- \(B_n\) is odd
- \(A_n\) is odd if and only if \(y_n = 0\), and
- \(|x_n - x'_n| < \delta\).
Hint: use a property of real numbers to show that you can always write \(x_n = A / B + \delta / 2\) where \(A\) and \(B\) are integers with no common factors. Then, for any even \(M\), write
\[ x'_n = \frac{MA + 1 - y_n}{MB + 1}. \]
Show these numbers satisfy the conditions, and that \(M\) can be made sufficiently large to make \(\delta\) as small as necessary.
(d)
Given the construction in (c) with \(\delta < 0.5\), and the conclusion of (b), show that any set of points \((x_n, y_n)\) is perfectly classified by
\[ \hat{\theta}= \prod_{n=1}^N B_n. \]
(e)
Show that, for any dataset,
\[ \inf_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(x_n), y_n) = 0. \]
Argue that the function class \(\mathcal{F}\) does not obey a ULLN.
(f)
The function class \(\mathcal{F}\) is a one-parameter class of continuous functions, and yet fails the ULLN. Recall that we proved that a function class does satisfy a ULLN if it satisfies the “integrable Lipschitz” condition:
\[ \left|f(x; \theta) - f(x; \theta')\right| \le M(z) \left\Vert\theta - \theta'\right\Vert, \]
for \(\theta \in \Omega\), compact \(\Omega\), and \(\mathbb{E}_{\,}\left[M(z)\right] < \infty\). Find at least one assumption of the integrable Lipschitz that \(\mathcal{F}\) does not satisfy in this case.
(g)
Propose a simple modification of \(\mathcal{F}\) that would satisfy the integrable Lipschitz condition.
Unknown!