Question 1 (3 points each) (154 and 254)
For this problem, let \(z_n \in \{0, 1\}\) be a binary regressor and let \(y_n \in \mathbb{R}\) be a bounded, continuous random variable. You may assume that the pairs \((z_n, y_n)\) are independent and identically distributed for \(n=1,\ldots,N\).
Let \(\boldsymbol{\beta}= (\beta_0, \beta_1)^\intercal\in \mathbb{R}^{2}\). For a fixed \(\lambda > 0\), let \(\hat{\mathscr{R}}(\boldsymbol{\beta}) = \frac{1}{N} \sum_{n=1}^N(y_n - \beta_0 - \beta_1 z_n)^2 + \lambda \beta_1^2\).
(a) (3 pts) Let \(\boldsymbol{X}\) denote an \(N \times 2\) matrix, whose first column is all ones, and whose second column contains \(z_1, \ldots, z_N\):
\[ \boldsymbol{X}= \begin{pmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_N \end{pmatrix}. \]
Let \(\boldsymbol{Y}\) denote the \(N \times 1\) vector containing \(y_1, \ldots, y_N\). Write the squared error, \(\frac{1}{N} \sum_{n=1}^N(y_n - \beta_0 - \beta_1 z_n)^2\), in terms of \(\boldsymbol{X}\), \(\boldsymbol{Y}\), \(\boldsymbol{\beta}\), and \(N\) only.
(b) (3 pts) Write an explicit expression for the gradient \(\nabla \hat{\mathscr{R}}(\boldsymbol{\beta}) := \partial \hat{\mathscr{R}}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\).
(c) (3 pts) Write an expression for \(\hat{\boldsymbol{\beta}}(\lambda)\), the unique minimizer of \(\hat{\mathscr{R}}(\boldsymbol{\beta})\), assuming it exists.
(a) If we let \(\boldsymbol{x}_n = (1, z_n)^\intercal\), then \(\beta_0 + \beta_1 z_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), so \(\beta_0 + \beta_1 z_n\) is the \(n\)-th entry of \(\boldsymbol{X}\boldsymbol{\beta}\), and \(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta}\) is the vector of residuals. The sum of the squared residuals is then \((\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})\), and their mean is
\[ \textrm{Squared error} = \frac{1}{N} (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta}). \]
Grading: Up to two points partial credit for:
- Noting that the prediction is \(\boldsymbol{X}\boldsymbol{\beta}\)
- Recognizing the sum of squares as a quadratic form
Three points only for a complete, correct answer.
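As a quick numerical sanity check of the matrix form (not part of the expected answer), here is a minimal numpy sketch on simulated data; the variable names and distributions are illustrative only.

```python
import numpy as np

# Simulated data; in the problem y_n is bounded, but any numeric y works for this check.
rng = np.random.default_rng(0)
N = 100
z = rng.integers(0, 2, size=N)            # binary regressor z_n in {0, 1}
y = rng.normal(size=N)
X = np.column_stack([np.ones(N), z])      # N x 2 design matrix from part (a)
beta = np.array([0.5, -1.0])              # arbitrary (beta_0, beta_1)

# Sum form of the squared error
sum_form = np.mean((y - beta[0] - beta[1] * z) ** 2)

# Matrix form: (Y - X beta)^T (Y - X beta) / N
resid = y - X @ beta
matrix_form = resid @ resid / N

assert np.isclose(sum_form, matrix_form)
```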
(b) This is easiest to do with the result from (a).
\[ \begin{aligned} \nabla \hat{\mathscr{R}}(\boldsymbol{\beta}) ={}& \nabla \frac{1}{N} (\boldsymbol{Y}^\intercal\boldsymbol{Y}- 2 \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\beta}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}) + \nabla \lambda \beta_1^2 \\={}& -\frac{2}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}+ \frac{2}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ 2 \lambda \begin{pmatrix} 0 \\ \beta_1 \end{pmatrix}, \end{aligned} \] since \(\partial \beta_1^2 / \partial \beta_0 = 0\).
If you compute the gradient directly, you should get the sum-notation versions of the same expressions:
\[ \begin{aligned} \nabla \hat{\mathscr{R}}(\boldsymbol{\beta}) ={}& \begin{pmatrix} \frac{\partial \hat{\mathscr{R}}(\boldsymbol{\beta})}{\partial \beta_0}\\ \frac{\partial \hat{\mathscr{R}}(\boldsymbol{\beta})}{\partial \beta_1} \end{pmatrix} \quad\textrm{where} \\ \frac{\partial \hat{\mathscr{R}}(\boldsymbol{\beta})}{\partial \beta_0} ={}& -2 \frac{1}{N} \sum_{n=1}^Ny_n + 2 \beta_0 + 2 \frac{1}{N} \sum_{n=1}^Nz_n \beta_1 \\ \frac{\partial \hat{\mathscr{R}}(\boldsymbol{\beta})}{\partial \beta_1} ={}& -2 \frac{1}{N} \sum_{n=1}^Nz_n y_n + 2 \frac{1}{N} \sum_{n=1}^Nz_n \beta_0 + 2 \frac{1}{N} \sum_{n=1}^Nz_n^2 \beta_1 + 2 \lambda \beta_1. \end{aligned} \]
Grading: Up to two points partial credit for:
- Getting the gradient of the sum of squared errors correct
- Writing no expressions with non-conformable matrices
- Getting the gradient of the regularization term correct
Three points only for a complete, correct answer.
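The analytic gradient from (b) can likewise be checked against finite differences; this is a small illustrative sketch (numpy, simulated data), not part of the expected answer.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 100, 0.3
z = rng.integers(0, 2, size=N)
y = rng.normal(size=N)
X = np.column_stack([np.ones(N), z])

def risk(beta):
    resid = y - X @ beta
    return resid @ resid / N + lam * beta[1] ** 2

def grad(beta):
    # -(2/N) X^T Y + (2/N) X^T X beta + 2 lambda (0, beta_1)^T
    return (-2 / N) * (X.T @ y) + (2 / N) * (X.T @ X @ beta) + 2 * lam * np.array([0.0, beta[1]])

beta = np.array([0.2, -0.7])
eps = 1e-6
numeric = np.array([(risk(beta + eps * e) - risk(beta - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad(beta), numeric, atol=1e-5)
```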
(c) Again, this is much easier to do in matrix form, since you don’t need to solve the linear system explicitly. Slightly rewriting (b), and setting it equal to \(\boldsymbol{0}\) at \(\hat{\boldsymbol{\beta}}(\lambda)\),
\[ \begin{aligned} \nabla \hat{\mathscr{R}}(\hat{\boldsymbol{\beta}}) ={}& -2 \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}+ 2 \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}+ 2 \begin{pmatrix} 0 & 0 \\ 0 & \lambda \end{pmatrix} \hat{\boldsymbol{\beta}} \\={}& -2 \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}+ 2 \left( \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}+ \begin{pmatrix} 0 & 0 \\ 0 & \lambda \end{pmatrix} \right) \hat{\boldsymbol{\beta}} = \boldsymbol{0}\Rightarrow\\ \hat{\boldsymbol{\beta}}={}& \left( \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}+ \begin{pmatrix} 0 & 0 \\ 0 & \lambda \end{pmatrix} \right)^{-1} \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}. \end{aligned} \]
Grading: Up to two points partial credit for:
- Writing no expressions with non-conformable matrices
- Writing the correct solution but missing the \(1/N\)
- Writing the correct solution but with \(\lambda \boldsymbol{I}\) in place of the term that regularizes only \(\beta_1\)
Three points only for a complete, correct answer. (Full credit if you solve the system correctly by hand rather than in matrix form, but no partial credit is available if you made algebra mistakes along the way.)
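For reference, the closed form from (c) amounts to a single linear solve and can be checked against the first-order condition; a minimal sketch (numpy, simulated data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
N, lam = 200, 0.5
z = rng.integers(0, 2, size=N)
y = rng.normal(size=N)
X = np.column_stack([np.ones(N), z])

D = np.diag([0.0, lam])                      # penalizes beta_1 only
S = X.T @ X / N + D
beta_hat = np.linalg.solve(S, X.T @ y / N)   # (X^T X / N + D)^{-1} (X^T Y / N)

# The gradient from (b) should vanish at the minimizer.
grad = (-2 / N) * (X.T @ y) + (2 / N) * (X.T @ X @ beta_hat) + 2 * D @ beta_hat
assert np.allclose(grad, 0.0, atol=1e-8)
```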
Question 2 (254 only)
This question continues question 1.
(a) (3 pts) Under what conditions is \(\hat{\boldsymbol{\beta}}(\lambda)\), defined in 1(c), unique?
(b) (3 pts) Fix a dataset \(\boldsymbol{X}\) and \(\boldsymbol{Y}\). For this dataset, what is \(\lim_{\lambda \rightarrow \infty} \hat{\boldsymbol{\beta}}(\lambda)\)? Note: For this question, the limit is with respect to \(\lambda\), with \(N\) and the data fixed.
(c) (3 pts) Fix \(\lambda > 0\). Write an expression for \(\lim_{N\rightarrow \infty} \hat{\boldsymbol{\beta}}(\lambda)\) in terms of expectations over the random variables \(z, y\). Note: For this question, the limit is with respect to \(N\), with \(\lambda\) fixed.
(a) It is unique as long as the matrix \[ S := \left( \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}+ \begin{pmatrix} 0 & 0 \\ 0 & \lambda \end{pmatrix} \right) \] is invertible. Let \(M_{z} := \frac{1}{N} \sum_{n=1}^Nz_n\) and \(M_{zz} := \frac{1}{N} \sum_{n=1}^Nz_n^2\). Then \[ S = \begin{pmatrix} 1 & M_z \\ M_z & M_{zz} + \lambda \end{pmatrix}, \] and the determinant is given by \[ |S| = (M_{zz} + \lambda) - M_z^2. \] Since \(M_{zz} - M_z^2 \ge 0\) (this is the sample variance of the \(z_n\)), and \(\lambda > 0\), \(S\) is always invertible.
Grading: Up to two points partial credit for:
- Observing that the key is that the \(S\) matrix is invertible
- Claiming that \(S\) is invertible precisely because \(\lambda > 0\)
- Correctly claiming that \(S\) is invertible when \(\lambda > 0\), but based on an incorrect expression for \(S\) that uses the incorrect \(\lambda \boldsymbol{I}\) as the regularization gradient term.
Three points only for proving carefully that \(\lambda > 0\) suffices for \(S\) to be invertible with the correct expression for \(S\).
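The role of \(\lambda\) is easiest to see in the degenerate case where all \(z_n\) are equal, so the sample variance of \(z\) is zero and \(\frac{1}{N}\boldsymbol{X}^\intercal\boldsymbol{X}\) is singular; a small illustrative numpy sketch:

```python
import numpy as np

N, lam = 50, 0.1
z = np.zeros(N)                          # degenerate regressor: M_zz - M_z^2 = 0
X = np.column_stack([np.ones(N), z])

XtX_N = X.T @ X / N                      # determinant M_zz - M_z^2 = 0, so singular
S = XtX_N + np.diag([0.0, lam])          # determinant (M_zz + lam) - M_z^2 = lam > 0

print(np.linalg.det(XtX_N))              # 0.0 (up to floating point)
print(np.linalg.det(S))                  # 0.1
```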
(b) Noting that
\[ \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \frac{1}{N} \sum_{n=1}^Ny_n \\ \frac{1}{N} \sum_{n=1}^Nz_n y_n \end{pmatrix}, \] define \(M_y := \frac{1}{N} \sum_{n=1}^Ny_n\) and \(M_{zy} := \frac{1}{N} \sum_{n=1}^Nz_n y_n\). Then
\[ \begin{aligned} \hat{\boldsymbol{\beta}}(\lambda) ={}& \begin{pmatrix} 1 & M_z \\ M_z & M_{zz} + \lambda \end{pmatrix}^{-1} \begin{pmatrix} M_y \\ M_{zy} \end{pmatrix} \\={}& \frac{1}{\lambda + M_{zz} - M_z^2} \begin{pmatrix} M_{zz} + \lambda & -M_z \\ -M_z & 1 \end{pmatrix} \begin{pmatrix} M_y \\ M_{zy} \end{pmatrix} \\={}& \frac{1}{\lambda + M_{zz} - M_z^2} \begin{pmatrix} (M_{zz} + \lambda) M_y - M_z M_{zy}\\ -M_y M_z + M_{zy} \end{pmatrix}. \end{aligned} \]
As \(\lambda \rightarrow \infty\),
\[ \lim_{\lambda \rightarrow \infty} \frac{-M_y M_z + M_{zy}}{\lambda + M_{zz} - M_z^2} = 0, \]
so \(\hat{\boldsymbol{\beta}}_1 \rightarrow 0\). And
\[ \lim_{\lambda \rightarrow \infty} \frac{(M_{zz} + \lambda) M_y - M_z M_{zy}}{\lambda + M_{zz} - M_z^2} = M_y, \]
so \(\hat{\boldsymbol{\beta}}_0 \rightarrow M_y\), and
\[ \lim_{\lambda \rightarrow \infty} \hat{\boldsymbol{\beta}}(\lambda) = \begin{pmatrix} M_y \\ 0 \end{pmatrix}. \]
Grading: Up to two points partial credit for:
- Asserting that \(\hat{\boldsymbol{\beta}}_0 \rightarrow M_y\), possibly without proof.
- Asserting that \(\hat{\boldsymbol{\beta}}_1 \rightarrow 0\), possibly without proof.
- Incorrectly asserting that \(\hat{\boldsymbol{\beta}}\rightarrow \boldsymbol{0}\) based on the mistaken assumption that the regularization is based on the whole vector, not just \(\beta_1\).
Three points only for proving the limits carefully, and getting the correct answers.
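Numerically, the limit in (b) can be seen by evaluating \(\hat{\boldsymbol{\beta}}(\lambda)\) at a very large \(\lambda\); a minimal sketch on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
z = rng.integers(0, 2, size=N)
y = rng.normal(size=N)
X = np.column_stack([np.ones(N), z])

def beta_hat(lam):
    S = X.T @ X / N + np.diag([0.0, lam])
    return np.linalg.solve(S, X.T @ y / N)

# For very large lambda, beta_hat should be close to (mean(y), 0).
print(beta_hat(1e8))
print(y.mean())
```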
(c) Using the same expression from (b) and the law of large numbers, we have that, as \(N\rightarrow \infty\),
\[ \begin{aligned} M_{zz} \rightarrow{}& \mathbb{E}\left[z^2\right] \\ M_{zy} \rightarrow{}& \mathbb{E}\left[zy\right] \\ M_{y} \rightarrow{}& \mathbb{E}\left[y\right] \\ M_{z} \rightarrow{}& \mathbb{E}\left[z\right]. \end{aligned} \]
Recognizing \(\mathrm{Var}\left(z\right) = \mathbb{E}\left[z^2\right] - \mathbb{E}\left[z\right]^2\) and \(\mathrm{Cov}\left(z, y\right) = \mathbb{E}\left[zy\right] - \mathbb{E}\left[y\right] \mathbb{E}\left[z\right]\), by the continuous mapping theorem,
\[ \lim_{N \rightarrow \infty} \hat{\boldsymbol{\beta}}(\lambda) = \frac{1}{\lambda + \mathrm{Var}\left(z\right)} \begin{pmatrix} (\mathbb{E}\left[z^2\right] + \lambda) \mathbb{E}\left[y\right] - \mathbb{E}\left[z\right] \mathbb{E}\left[zy\right]\\ \mathrm{Cov}\left(z, y\right) \end{pmatrix}. \]
Grading: Up to two points partial credit for:
- Applying the law of large numbers to each sum
- Plugging the results into the expression for \(\hat{\boldsymbol{\beta}}\)
- Getting the same limit but with \(\lambda = 0\), which would be the limit if there were not a \(1/N\) in front of the squared error.
Three points only for proving the limits carefully, and getting the correct answers. To receive full credit, students must mention the law of large numbers, but can still receive full credit if they forget to mention the continuous mapping theorem.
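The \(N \rightarrow \infty\) limit in (c) can also be illustrated by simulation; the specific data-generating process below (\(z \sim \mathrm{Bernoulli}(p)\), \(y = 1 + 2z + \text{noise}\)) is an assumption made only for this sketch, chosen so the population moments are easy to write down.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, p = 0.5, 0.4                        # illustrative lambda and P(z = 1)
N = 1_000_000
z = rng.binomial(1, p, size=N).astype(float)
y = 1.0 + 2.0 * z + rng.normal(scale=0.5, size=N)
X = np.column_stack([np.ones(N), z])

S = X.T @ X / N + np.diag([0.0, lam])
beta_hat = np.linalg.solve(S, X.T @ y / N)

# Population moments for this simulation: E[z] = E[z^2] = p, E[y] = 1 + 2p,
# E[zy] = 3p, Var(z) = p(1 - p), Cov(z, y) = 2p(1 - p).
Ez, Ezz, Ey, Ezy = p, p, 1 + 2 * p, 3 * p
limit = np.array([(Ezz + lam) * Ey - Ez * Ezy, Ezy - Ey * Ez]) / (lam + p * (1 - p))

print(beta_hat)                          # close to `limit` for large N
print(limit)
```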
Quiz 0 answer sheets
Turn in the following answer sheets.
Please write your full name and email address here:
\[\\[2in]\]
Also, please put your initials on each page in case the pages get separated.
\[\\[1in]\]
You have 30 minutes for this quiz.
Please write your solutions on the following pages.
Indicate clearly which problem your solution is for.
Multiple Choice (3 points each) (154 and 254)
For each question, select only one answer. If you select multiple answers, the question will be marked as incorrect.
There may be multiple answers that are plausibly correct, but one will be clearly preferable to the others. For each question, credit is given only for the best answer.
Reference (A91440711)
MCQ1: For a function \(f(x)\) defined on a domain \(D\), select the statement that is necessarily true.
- [*] Both \(\inf_{x\in D} f(x)\) and \(\sup_{x\in D} f(x)\) exist, but may be \(-\infty\) or \(\infty\).
MCQ2: Suppose that the scalars \(x\) and \(y\) are jointly (bivariate) normal. Select the one statement that is not necessarily true.
- [*] The variance of the sum is given by \(\mathrm{Var}\left(x + y\right) = \mathrm{Var}\left(x\right) + \mathrm{Var}\left(y\right)\).
MCQ3: Let \(x_n\) denote IID random variables with \(\mathbb{E}\left[x_n^2\right] < \infty\). The law of large numbers states that, as \(N \rightarrow \infty\),
- [*] \(\frac{1}{N} \sum_{n=1}^Nx_n \rightarrow \mathbb{E}\left[x\right]\).