Homework 0 (Review)
Stat 154/254: Statistical Machine Learning
Multivariate normal exercises
Let \(\boldsymbol{x}\sim \mathcal{N}\left(0, \Sigma\right)\) be a \(P\)–dimensional multivariate normal random vector with invertible covariance matrix \(\Sigma\). Let \(\varepsilon\sim \mathcal{N}\left(0, \sigma^2\right)\) denote a univariate normal random variable which is independent of \(\boldsymbol{x}\). Finally, for a given \(\boldsymbol{\beta}\), let \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon\).
(a)
Give explicit expressions for the distributions of the following quantities in terms of the fixed quantities \(\sigma\), \(\Sigma\), and \(\boldsymbol{\beta}\). You do not need to specify the density explicitly, just fully characterize the distribution. For example, it is an acceptable answer to write \(\mathbb{P}_{\,}\left(\boldsymbol{x}\right) = \mathcal{N}\left(0, \Sigma\right)\).
Note that \((y, \boldsymbol{x}^\intercal)^\intercal\) is a \((P + 1)\)–dimensional vector consisting of \(y\) stacked on top of \(\boldsymbol{x}\).
- \(\mathbb{P}_{\,}\left((y, \boldsymbol{x}^\intercal)^\intercal\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}\right)\)
- \(\mathbb{P}_{\,}\left(y| \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(y\right)\)
- \(\mathbb{P}_{\,}\left(\varepsilon\vert \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\varepsilon\vert y, \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{x}| y\right)\)
(b)
When \(\boldsymbol{\beta}= \boldsymbol{0}\), how is \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right)\) different from \(\mathbb{P}_{\,}\left(\boldsymbol{x}\right)\)? Explain this result in intuitive terms.
(c)
When \(\sigma = 0\), what is the dimension of the support of \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right)\)? (Recall that the support is, informally, the region of nonzero probability.) Explain this result in intuitive terms.
(d)
As the diagonal entries of \(\Sigma\) go to infinity, how is \(\mathrm{Var}_{\,}\left(y\vert \boldsymbol{x}\right)\) different from \(\mathrm{Var}_{\,}\left(y\right)\)? Explain your answer in intuitive terms.
(e) (254 only \(\star\) \(\star\) \(\star\))
Now, additionally assume that \(\boldsymbol{\beta}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{V}\right)\) independently of \(\boldsymbol{x}\), and that \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon\) holds conditionally on \(\boldsymbol{x}\), \(\boldsymbol{\beta}\), and \(\varepsilon\). Give explicit expressions for the following distributions in terms of \(\Sigma\), \(\sigma\), and \(\boldsymbol{V}\).
- \(\mathbb{P}_{\,}\left((y, \boldsymbol{\beta}^\intercal)^\intercal\vert \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}\vert \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}\vert y, \boldsymbol{x}\right)\)
(a)
Note that
\[ \begin{aligned} \mathbb{E}_{\,}\left[y\boldsymbol{x}^\intercal\right] ={}& \mathbb{E}_{\,}\left[(\boldsymbol{\beta}^\intercal\boldsymbol{x}+ \varepsilon) \boldsymbol{x}^\intercal\right] \\={}& \boldsymbol{\beta}^\intercal\mathbb{E}_{\,}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] + \mathbb{E}_{\,}\left[\varepsilon\boldsymbol{x}^\intercal\right] \\={}& \boldsymbol{\beta}^\intercal\Sigma \end{aligned} \]
and
\[ \mathbb{E}_{\,}\left[y^2\right] = \mathbb{E}_{\,}\left[(\boldsymbol{\beta}^\intercal\boldsymbol{x}+ \varepsilon) (\boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon)\right] = \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}+ \sigma^2. \]
\[ \mathbb{P}_{\,}\left((y, \boldsymbol{x}^\intercal)^\intercal\right) = \mathcal{N}\left(0, \begin{pmatrix} \sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}& \boldsymbol{\beta}^\intercal\Sigma \\ \Sigma \boldsymbol{\beta}& \Sigma \end{pmatrix} \right) \]
\(\mathbb{P}_{\,}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}\right) = \mathcal{N}\left(0, \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)\)
Using the conditional property of Gaussians,
\[ \begin{aligned} \mathbb{E}_{\,}\left[y| \boldsymbol{x}\right] ={}& \mathbb{E}_{\,}\left[y\right] + \mathrm{Cov}_{\,}\left(y, \boldsymbol{x}\right) \mathrm{Cov}_{\,}\left(\boldsymbol{x}\right)^{-1} (\boldsymbol{x}- \mathbb{E}_{\,}\left[\boldsymbol{x}\right]) \\={}& 0 + \boldsymbol{\beta}^\intercal\Sigma \Sigma^{-1} \boldsymbol{x} \\={}& \boldsymbol{\beta}^\intercal\boldsymbol{x}. \end{aligned} \]
Also,
\[ \begin{aligned} \mathrm{Cov}_{\,}\left(y| \boldsymbol{x}\right) ={}& \mathrm{Cov}_{\,}\left(y\right) - \mathrm{Cov}_{\,}\left(y, \boldsymbol{x}\right) \mathrm{Cov}_{\,}\left(\boldsymbol{x}\right)^{-1} \mathrm{Cov}_{\,}\left(\boldsymbol{x}, y\right) \\={}& \sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}- \boldsymbol{\beta}^\intercal\Sigma \Sigma^{-1} \Sigma \boldsymbol{\beta} \\={}& \sigma^2. \end{aligned} \]
You can more easily just do these same calculations directly on \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon,\) but it’s nice to see the conditional formulas give the same answer.
It follows that
\[ \mathbb{P}_{\,}\left(y\vert \boldsymbol{x}\right) = \mathcal{N}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}, \sigma^2\right). \]
\(\mathbb{P}_{\,}\left(y\right) = \mathcal{N}\left(0, \sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)\). Note that marginally over \(\boldsymbol{x}\), the distribution of \(y\) has greater variance than conditional on \(\boldsymbol{x}\).
\(\mathbb{P}_{\,}\left(\varepsilon\vert \boldsymbol{x}\right) = \mathcal{N}\left(0, \sigma^2\right)\), since \(\varepsilon\) and \(\boldsymbol{x}\) are independent.
\(\mathbb{P}_{\,}\left(\varepsilon\vert y, \boldsymbol{x}\right)\) is a point mass at \(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}\); it can take no other value.
\(\mathbb{P}_{\,}\left(\boldsymbol{x}| y\right)\)
This one is more complicated than \(\mathbb{P}_{\,}\left(y\vert \boldsymbol{x}\right)\), and it is easiest to use the conditional formulas directly. We know that it is Gaussian. The mean is given by
\[ \begin{aligned} \mathbb{E}_{\,}\left[\boldsymbol{x}| y\right] ={}& \mathbb{E}_{\,}\left[\boldsymbol{x}\right] + \mathrm{Cov}_{\,}\left(\boldsymbol{x}, y\right) \mathrm{Cov}_{\,}\left(y\right)^{-1} (y- \mathbb{E}_{\,}\left[y\right]) \\={}& 0 + \Sigma \boldsymbol{\beta}\left(\sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)^{-1} y \\={}& 0 + \frac{y}{\sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}} \Sigma \boldsymbol{\beta}. \end{aligned} \]
The covariance is given by
\[ \begin{aligned} \mathrm{Cov}_{\,}\left(\boldsymbol{x}| y\right) ={}& \mathrm{Cov}_{\,}\left(\boldsymbol{x}\right) - \mathrm{Cov}_{\,}\left(\boldsymbol{x}, y\right) \mathrm{Cov}_{\,}\left(y\right)^{-1} \mathrm{Cov}_{\,}\left(y, \boldsymbol{x}\right) \\={}& \Sigma - \Sigma \boldsymbol{\beta}\left(\sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)^{-1} \boldsymbol{\beta}^\intercal\Sigma. \end{aligned} \]
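These formulas are easy to sanity-check by simulation. Below is a minimal sketch, assuming numpy, that compares the empirical covariance of \((y, \boldsymbol{x}^\intercal)^\intercal\) to the block covariance from the first bullet; the particular \(P\), \(\Sigma\), \(\boldsymbol{\beta}\), and \(\sigma\) are arbitrary illustrative choices.

```python
# A minimal simulation sanity check, assuming numpy is available; the
# particular P, Sigma, beta, and sigma below are arbitrary illustrative
# choices, not part of the problem.
import numpy as np

rng = np.random.default_rng(0)
P, sigma = 3, 0.5
A = rng.normal(size=(P, P))
Sigma = A @ A.T + np.eye(P)          # an arbitrary invertible covariance
beta = rng.normal(size=P)

n_draws = 200_000
x = rng.multivariate_normal(np.zeros(P), Sigma, size=n_draws)
eps = rng.normal(scale=sigma, size=n_draws)
y = x @ beta + eps

# Empirical covariance of (y, x) stacked as a (P + 1)-vector.
emp_cov = np.cov(np.column_stack([y, x]), rowvar=False)

# Theoretical block covariance from part (a).
theory = np.block([
    [np.array([[sigma**2 + beta @ Sigma @ beta]]), (Sigma @ beta)[None, :]],
    [(Sigma @ beta)[:, None], Sigma],
])
print(np.max(np.abs(emp_cov - theory)))  # small, up to Monte Carlo error
```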
(b)
When \(\boldsymbol{\beta}= \boldsymbol{0}\), how is \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right)\) different from \(\mathbb{P}_{\,}\left(\boldsymbol{x}\right)\)? Explain this result in intuitive terms.

It is not different. Setting \(\boldsymbol{\beta}= \boldsymbol{0}\) in the formulas from (a) gives \(\mathbb{E}_{\,}\left[\boldsymbol{x}| y\right] = \boldsymbol{0}\) and \(\mathrm{Cov}_{\,}\left(\boldsymbol{x}| y\right) = \Sigma\), so \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right) = \mathcal{N}\left(0, \Sigma\right) = \mathbb{P}_{\,}\left(\boldsymbol{x}\right)\). Intuitively, when \(\boldsymbol{\beta}= \boldsymbol{0}\) we have \(y= \varepsilon\), which is independent of \(\boldsymbol{x}\), so observing \(y\) carries no information about \(\boldsymbol{x}\).
(c)
When \(\sigma = 0\), conditioning on \(y\) fixes the value \(\boldsymbol{\beta}^\intercal\boldsymbol{x}= y\) exactly, so \(\mathrm{Var}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}| y\right)\) should equal zero and \(\boldsymbol{x}\) is confined to the affine subspace \(\{\boldsymbol{x}: \boldsymbol{\beta}^\intercal\boldsymbol{x}= y\}\). That subspace is a translate of the \((P-1)\)–dimensional space of vectors \(\boldsymbol{v}\) with \(\boldsymbol{v}^\intercal\boldsymbol{\beta}= 0\), so the support of \(\mathbb{P}_{\,}\left(\boldsymbol{x}| y\right)\) should be \((P-1)\)–dimensional.
We can in fact use the expression for \(\mathrm{Cov}_{\,}\left(\boldsymbol{x}| y\right)\) from part (a) to confirm that, when \(\sigma^2 = 0\):
\[ \begin{aligned} \mathrm{Var}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}| y\right) ={}& \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}- \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\left(\boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)^{-1} \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta} \\={}& \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\left(1 - \frac{\boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}}{\boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}} \right) \\={}& 0. \end{aligned} \]
(d)
No matter what the variance of \(\boldsymbol{x}\) is, \(\mathrm{Var}_{\,}\left(y\vert \boldsymbol{x}\right)\) is \(\sigma^2\). However, the marginal variance \(\mathrm{Var}_{\,}\left(y\right)\) goes to infinity, since it also includes the variability from \(\boldsymbol{x}\). This simple fact is at the core of some interesting phenomena in Bayesian statistics such as “Lindley’s paradox.”
(e) (254 only \(\star\) \(\star\) \(\star\))
Now, additionally assume that \(\boldsymbol{\beta}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{V}\right)\) independently of \(\boldsymbol{x}\), and that \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon\) holds conditionally on \(\boldsymbol{x}\), \(\boldsymbol{\beta}\), and \(\varepsilon\). Give explicit expressions for the following distributions in terms of \(\Sigma\), \(\sigma\), and \(\boldsymbol{V}\).
Everything is jointly normal given \(\boldsymbol{x}\), so \(y\) and \(\boldsymbol{\beta}\) are conditionally jointly normal as well. We have \(\mathbb{E}_{\,}\left[y\vert \boldsymbol{x}\right] = \mathbb{E}_{\,}\left[\boldsymbol{\beta}\right]^\intercal\boldsymbol{x}= 0\), \(\mathrm{Cov}_{\,}\left(y, \boldsymbol{\beta}| \boldsymbol{x}\right) = \mathbb{E}_{\,}\left[\boldsymbol{x}^\intercal\boldsymbol{\beta}\boldsymbol{\beta}^\intercal| \boldsymbol{x}\right] = \boldsymbol{x}^\intercal\boldsymbol{V}\), and \(\mathrm{Var}_{\,}\left(y| \boldsymbol{x}\right) = \mathbb{E}_{\,}\left[y^2 | \boldsymbol{x}\right] = \sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}\). Finally, \(\boldsymbol{\beta}\) is independent of \(\boldsymbol{x}\), so conditioning on \(\boldsymbol{x}\) does not change its distribution.
\[ \mathbb{P}_{\,}\left((y, \boldsymbol{\beta}^\intercal)^\intercal\vert \boldsymbol{x}\right) = \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}& \boldsymbol{x}^\intercal\boldsymbol{V}\\ \boldsymbol{V}\boldsymbol{x}& \boldsymbol{V} \end{pmatrix} \right). \]
\(\boldsymbol{\beta}\) is independent of \(\boldsymbol{x}\), so \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}\vert \boldsymbol{x}\right) = \mathcal{N}\left(0, \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}\right)\).
Applying the conditioning rule to \(\mathbb{P}_{\,}\left((y, \boldsymbol{\beta}^\intercal)^\intercal\vert \boldsymbol{x}\right)\), the distribution \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}\vert y, \boldsymbol{x}\right)\) is multivariate normal, with
\[ \mathbb{E}_{\,}\left[\boldsymbol{\beta}| y, \boldsymbol{x}\right] = 0 + \frac{y}{\sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}} \boldsymbol{V}\boldsymbol{x} \]
and
\[ \mathrm{Cov}_{\,}\left(\boldsymbol{\beta}| y, \boldsymbol{x}\right) = \boldsymbol{V}- \boldsymbol{V}\boldsymbol{x}\boldsymbol{x}^\intercal\boldsymbol{V}\left(\sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}\right)^{-1}. \]
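As a numerical sanity check, the mean and covariance above can be compared to the algebraically equivalent precision form \(\mathrm{Cov}_{\,}\left(\boldsymbol{\beta}| y, \boldsymbol{x}\right) = \left(\boldsymbol{V}^{-1} + \boldsymbol{x}\boldsymbol{x}^\intercal/ \sigma^2\right)^{-1}\) and \(\mathbb{E}_{\,}\left[\boldsymbol{\beta}| y, \boldsymbol{x}\right] = \mathrm{Cov}_{\,}\left(\boldsymbol{\beta}| y, \boldsymbol{x}\right) \boldsymbol{x}y/ \sigma^2\), which follow from the Sherman–Morrison identity. A minimal sketch, assuming numpy, with arbitrary illustrative values:

```python
# A minimal numerical check, assuming numpy; V, x, y, and sigma below are
# arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(1)
P, sigma, y = 4, 0.7, 1.3
B = rng.normal(size=(P, P))
V = B @ B.T + np.eye(P)              # an arbitrary prior covariance
x = rng.normal(size=P)

# Formulas from the solution above.
denom = sigma**2 + x @ V @ x
mean_formula = (y / denom) * (V @ x)
cov_formula = V - np.outer(V @ x, V @ x) / denom

# Equivalent precision form (Sherman--Morrison).
cov_precision = np.linalg.inv(np.linalg.inv(V) + np.outer(x, x) / sigma**2)
mean_precision = cov_precision @ x * y / sigma**2

print(np.max(np.abs(cov_formula - cov_precision)))    # ~ machine precision
print(np.max(np.abs(mean_formula - mean_precision)))  # ~ machine precision
```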
LLNs and CLTs
For this problem, you may use without proof the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) for IID random variables with finite variance. (See, e.g., Pitman (2012) section 3.3.)
For \(n = 1, 2, \ldots\), let \(x_n\) denote IID random variables uniformly distributed on the interval \([0,1]\). Let \(a_n\) denote a deterministic sequence satisfying \(\frac{1}{N} \sum_{n=1}^Na_n \rightarrow 0\) and \(\frac{1}{N} \sum_{n=1}^Na_n^2 \rightarrow 1\), each as \(N \rightarrow \infty\).
(a)
Compute \(\mathbb{E}_{\,}\left[\frac{1}{N} \sum_{n=1}^Nx_n\right]\) and \(\mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Nx_n\right)\).
(b)
What is \(\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^Nx_n\)? Explain your answer in terms of your result from (a).
(c)
Let \(z_n = x_n - 0.5\). What is \(\mathbb{E}_{\,}\left[\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right]\) and \(\mathrm{Var}_{\,}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right)\)?
(d)
What is \(\lim_{N \rightarrow \infty} \frac{1}{\sqrt{N}} \sum_{n=1}^N(x_n - 0.5)\)? Explain how your answer relates to your result from (c).
(e)
What is \(\lim_{N \rightarrow \infty} \frac{1}{\sqrt{N}} \sum_{n=1}^Nx_n\)? Explain how your answer relates to your result from (d).
(f) (254 only \(\star\) \(\star\) \(\star\))
Compute \(\mathbb{E}_{\,}\left[\frac{1}{N} \sum_{n=1}^Na_n x_n\right]\) and \(\mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Na_n x_n\right)\).
(g) (254 only \(\star\) \(\star\) \(\star\))
What is \(\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^Na_n x_n\)? Explain your answer in terms of your result from (f).
(h) (254 only \(\star\) \(\star\) \(\star\))
Let \(s_N := \frac{1}{\sqrt{N}} \sum_{n=1}^N(a_n x_n - 0.5 \frac{1}{N} \sum_{n=1}^Na_n)\). Compute \(\mathbb{E}_{\,}\left[s_N\right]\) and \(\mathrm{Var}_{\,}\left(s_N\right)\).
Advanced note: As you might guess from these computations, a result known as the Lindeberg–Feller central limit theorem (e.g., Van der Vaart (2000) 2.27) gives conditions under which
\[ s_N \rightarrow \mathcal{N}\left(0, \lim_{N\rightarrow \infty} \mathrm{Var}_{\,}\left(s_N\right)\right) \quad \text{in distribution as } N \rightarrow \infty. \]
(a)
\(\mathbb{E}_{\,}\left[\frac{1}{N} \sum_{n=1}^Nx_n\right] = \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\,}\left[x_n\right] = \frac{1}{N} \sum_{n=1}^N0.5 = 0.5\).
\[ \begin{aligned} \mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Nx_n\right) ={}& \mathbb{E}_{\,}\left[(\frac{1}{N} \sum_{n=1}^Nx_n - 0.5)^2\right] \\={}& \mathbb{E}_{\,}\left[\frac{1}{N^2} \sum_{n=1}^N \sum_{m=1}^N (x_n-0.5)(x_m - 0.5)\right] \\={}& \frac{1}{N^2} \mathbb{E}_{\,}\left[\sum_{n=1}^N (x_n-0.5)^2\right] \\={}& \frac{1}{N} \frac{1}{N} \sum_{n=1}^N\mathrm{Var}_{\,}\left(x_n\right) \\={}& \frac{1}{12 N}. \end{aligned} \]
(b)
By the LLN, it is \(\mathbb{E}_{\,}\left[x_n\right] = 0.5\).
(c)
Note that \(\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n = \sqrt{N}(\frac{1}{N} \sum_{n=1}^Nx_n - 0.5)\). So \(\mathbb{E}_{\,}\left[\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right] = 0\) and \(\mathrm{Var}_{\,}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right) = N \mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Nx_n\right) = \frac{1}{12}\).
(d)
\(\frac{1}{\sqrt{N}} \sum_{n=1}^N(x_n - 0.5) \rightarrow \mathcal{N}\left(0, 1/12\right)\) in distribution as \(N \rightarrow \infty\), by the CLT. The \(\sqrt{N}\) scaling is just right so that the variance converges to a non–zero constant.
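A short simulation, assuming numpy, illustrates (c) and (d): the scaled sum of centered uniforms has mean near \(0\), variance near \(1/12\), and an approximately normal distribution; the sample sizes below are arbitrary.

```python
# A short simulation, assuming numpy; N and the number of replications
# are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
N, n_reps = 1_000, 5_000
x = rng.uniform(size=(n_reps, N))
s = (x - 0.5).sum(axis=1) / np.sqrt(N)

print(s.mean(), s.var())   # approximately 0 and 1/12 = 0.0833...
```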
(e)
\[ \frac{1}{\sqrt{N}} \sum_{n=1}^Nx_n = \frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n + \frac{1}{\sqrt{N}} \sum_{n=1}^N0.5 = \frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n + 0.5\sqrt{N} \rightarrow \infty. \]
You are adding \(\sqrt{N}\) times a positive constant; the mean diverges, and the variance goes to a constant.
(f) (254 only \(\star\) \(\star\) \(\star\))
\(\mathbb{E}_{\,}\left[\frac{1}{N} \sum_{n=1}^Na_n x_n\right] = \frac{1}{N} \sum_{n=1}^Na_n 0.5\) and \(\mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Na_n x_n\right) = \frac{1}{N} \frac{1}{N} \sum_{n=1}^Na_n^2 / 12\), by the same reasoning as above.
(g) (254 only \(\star\) \(\star\) \(\star\))
The variance of \(\frac{1}{N} \sum_{n=1}^Na_n x_n\) goes to zero and the mean goes to \(\lim_{N \rightarrow \infty} 0.5 \frac{1}{N} \sum_{n=1}^Na_n =: 0.5 \overline{a}\), so the limit is \(0.5 \overline{a}\).
(h) (254 only \(\star\) \(\star\) \(\star\))
\(\mathbb{E}_{\,}\left[s_N\right] = 0\) and \(\mathrm{Var}_{\,}\left(s_N\right) = \frac{1}{N} \sum_{n=1}^Na_n^2 / 12\).
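For a concrete illustration, assuming numpy, take \(a_n = (-1)^n\), which satisfies \(\frac{1}{N} \sum_{n=1}^Na_n \rightarrow 0\) and \(\frac{1}{N} \sum_{n=1}^Na_n^2 = 1\); the simulated \(s_N\) then has mean near \(0\) and variance near \(1/12\), matching the computation above.

```python
# A concrete check, assuming numpy, with a_n = (-1)^n, which satisfies
# (1/N) sum a_n -> 0 and (1/N) sum a_n^2 = 1.
import numpy as np

rng = np.random.default_rng(3)
N, n_reps = 1_000, 5_000
a = (-1.0) ** np.arange(1, N + 1)
x = rng.uniform(size=(n_reps, N))
s_N = (x * a - 0.5 * a.mean()).sum(axis=1) / np.sqrt(N)

print(s_N.mean(), s_N.var())   # approximately 0 and 1/12
```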
Multivariate calculus
Consider the logistic loss function \(\mathcal{L}(\boldsymbol{\beta})\) given by
\[ \begin{aligned} \phi(\zeta) :={}& \frac{\exp(\zeta)}{1 + \exp(\zeta)} \\ \ell(y| p) :={}& -y\log p- (1 - y) \log(1 - p) = - y\log \frac{p}{1-p} - \log(1 - p)\\ \mathcal{L}(\boldsymbol{\beta}) ={}& \sum_{n=1}^N\ell(y_n | \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n)). \end{aligned} \]
Let \(\boldsymbol{x}_n\) and \(\boldsymbol{\beta}\) be \(P\)–dimensional vectors.
(a)
Compute \(\partial \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n) / \partial \boldsymbol{\beta}\) and \(\partial^2 \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal\).
(b)
Using (a), compute \(\partial \mathcal{L}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\). (It helps to observe that \(\log \frac{\phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})}{1-\phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})} = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).)
(c)
Using (a), compute \(\partial^2 \mathcal{L}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal\).
(d)
Let \(\boldsymbol{X}\) denote the \(N \times P\) matrix whose \(n\)–th row is \(\boldsymbol{x}_n^\intercal\), and let \(\boldsymbol{\varepsilon}\) denote the \(N \times 1\) column vector whose \(n\)–th entry is \(y_n - \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})\). Write \(\partial \mathcal{L}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\) in terms of \(\boldsymbol{X}\) and \(\boldsymbol{\varepsilon}\) using only matrix multiplication (i.e., without explicit summation of the form \(\sum_{n=1}^N(\cdot)\)).
(e) (254 only \(\star\) \(\star\) \(\star\))
Sketch pseudocode for an iterative computer program to approximately find a \(\hat{\boldsymbol{\beta}}\) satisfying \(\left. \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right|_{\hat{\boldsymbol{\beta}}} = \boldsymbol{0}\).
(f) (254 only \(\star\) \(\star\) \(\star\))
Using your result from (c), argue that \(\hat{\boldsymbol{\beta}}\) from (e) is (up to numerical error) a minimizer of \(\mathcal{L}(\beta)\).
(a)
Take \(\zeta = \boldsymbol{\beta}^\intercal\boldsymbol{x}\) and use the chain rule.
\[
\begin{aligned}
\partial \phi(\zeta) / \partial \zeta ={}& \frac{\exp(\zeta)}{1 + \exp(\zeta)} - \frac{\exp(\zeta)^2}{(1 + \exp(\zeta))^2}
\\={}&
\phi(\zeta) - \phi(\zeta)^2
\\={}&
\phi(\zeta) (1 - \phi(\zeta))
\Rightarrow \\
\partial \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / \partial \boldsymbol{\beta}={}&
\phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) (1 - \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \boldsymbol{x}\\
\end{aligned}
\]
Noting that \(\partial^2 \zeta / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal= 0\), the chain rule gives \(\partial^2 \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal= \left(\partial^2 \phi(\zeta) / \partial \zeta^2\right) \boldsymbol{x}\boldsymbol{x}^\intercal\). Differentiating the first derivative gives \[ \begin{aligned} \partial^2 \phi(\zeta) / \partial \zeta^2 ={}& \partial \phi(\zeta) / \partial \zeta - 2 \phi(\zeta) \partial \phi(\zeta) / \partial \zeta \\={}& \phi(\zeta) (1 - \phi(\zeta)) - 2 \phi(\zeta)^2 (1 - \phi(\zeta)) \\={}& \phi(\zeta) (1 - \phi(\zeta))(1 - 2 \phi(\zeta)). \end{aligned} \]
(b)
We have \(\ell(y| \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x})) = -y\boldsymbol{x}^\intercal\boldsymbol{\beta}- \log(1 - \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}))\). Now,
\[ \begin{aligned} \frac{\partial}{\partial \zeta} \log(1 - \phi(\zeta)) ={}& - \frac{\partial \phi(\zeta) / \partial \zeta}{1 - \phi(\zeta)} \\={}& - \frac{\phi(\zeta)(1 - \phi(\zeta))}{1 - \phi(\zeta)} \\={}& - \phi(\zeta). \end{aligned} \]
So, by the chain rule,
\[ \begin{aligned} \frac{\partial \ell(y| \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}))}{\partial \boldsymbol{\beta}} ={}& -y\boldsymbol{x}+ \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) \boldsymbol{x} \\={}& -(y- \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \boldsymbol{x}. \end{aligned} \]
Combining,
\[ \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = - \sum_{n=1}^N(y_n - \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})) \boldsymbol{x}_n. \]
(c)
From (a) and (b), letting \(\zeta_n = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\), we have
\[ \begin{aligned} \frac{\partial^2 \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal} ={}& \sum_{n=1}^N\frac{\partial \phi(\zeta_n)}{\partial \zeta_n} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal \\={}& \sum_{n=1}^N(1 - \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})) \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})\boldsymbol{x}_n \boldsymbol{x}_n^\intercal. \end{aligned} \]
(d)
\[ \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \]
(Which is a pretty expression.)
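The formula can be checked against finite differences. Here is a minimal sketch, assuming numpy; the simulated \(\boldsymbol{X}\), \(\boldsymbol{Y}\), and \(\boldsymbol{\beta}\) are arbitrary.

```python
# A finite-difference check, assuming numpy; the simulated X, Y, and beta
# are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
N, P = 50, 3
X = rng.normal(size=(N, P))
Y = rng.integers(0, 2, size=N).astype(float)
beta = rng.normal(size=P)

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(b):
    p = phi(X @ b)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

grad = -X.T @ (Y - phi(X @ beta))    # the formula from (d)

# Central finite differences, one coordinate at a time.
h = 1e-6
fd = np.array([(loss(beta + h * e) - loss(beta - h * e)) / (2 * h)
               for e in np.eye(P)])
print(np.max(np.abs(grad - fd)))     # agreement up to finite-difference error
```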
(e) (254 only \(\star\) \(\star\) \(\star\))
One option is gradient descent. I’ll be very brief (you should be more detailed). Starting at \(\boldsymbol{\beta}_t\), compute the residuals \(\boldsymbol{\varepsilon}_t = \boldsymbol{Y}- \phi(\boldsymbol{X}\boldsymbol{\beta}_t)\) (with \(\phi\) applied componentwise) and the gradient \(\boldsymbol{g}_t = -\boldsymbol{X}^\intercal\boldsymbol{\varepsilon}_t\). Then take \(\boldsymbol{\beta}_{t+1} = \boldsymbol{\beta}_{t} - \delta \boldsymbol{g}_t\) for some small step size \(\delta\). Terminate when \(\left\Vert\boldsymbol{g}_t\right\Vert\) is small. A concrete sketch follows.
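Here is a minimal, concrete version of that sketch, assuming numpy; the simulated data, step size \(\delta\), and tolerance are arbitrary illustrative choices rather than tuned recommendations.

```python
# A minimal, concrete version of the gradient-descent sketch above,
# assuming numpy; the simulated data, step size delta, and tolerance are
# arbitrary choices, not tuned recommendations.
import numpy as np

rng = np.random.default_rng(5)
N, P = 200, 3
X = rng.normal(size=(N, P))
beta_true = rng.normal(size=P)
Y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

beta_hat = np.zeros(P)
delta, tol = 0.01, 1e-8
for _ in range(100_000):
    eps = Y - phi(X @ beta_hat)         # residuals
    grad = -X.T @ eps                   # gradient from (d)
    if np.linalg.norm(grad) < tol:      # terminate when the gradient is small
        break
    beta_hat = beta_hat - delta * grad  # step along the negative gradient

print(beta_hat)   # should be close to beta_true for moderately large N
```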
(f) (254 only \(\star\) \(\star\) \(\star\))
Since \(\phi(\zeta_n) (1 - \phi(\zeta_n)) > 0\), and since \(\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) is a positive semi–definite matrix for each \(n\), the second derivative of \(\mathcal{L}(\boldsymbol{\beta})\) from (c) is positive semi–definite. Hence \(\mathcal{L}(\boldsymbol{\beta})\) is convex, and so any stationary point is a minimizer.
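A finite-difference check of the Hessian from (c), assuming numpy and arbitrary simulated data, supports both the formula and its positive semi-definiteness.

```python
# A finite-difference check of the Hessian, assuming numpy, with arbitrary
# simulated data.
import numpy as np

rng = np.random.default_rng(6)
N, P = 50, 3
X = rng.normal(size=(N, P))
Y = rng.integers(0, 2, size=N).astype(float)
beta = rng.normal(size=P)

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(b):
    return -X.T @ (Y - phi(X @ b))

p = phi(X @ beta)
hess = (X * (p * (1 - p))[:, None]).T @ X   # sum_n phi(1 - phi) x_n x_n^T

h = 1e-6
fd = np.column_stack([(grad(beta + h * e) - grad(beta - h * e)) / (2 * h)
                      for e in np.eye(P)])
print(np.max(np.abs(hess - fd)))              # small
print(np.min(np.linalg.eigvalsh(hess)) >= 0)  # True: positive semi-definite
```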
Projections (Yu (2022) HW 1.3)
(a)
For any vector \(\boldsymbol{x}=\left(x_{1}, x_{2}, x_{3}\right)^{\intercal} \in \mathbb{R}^{3}\), what is its projection on the vector \((1,0,0)^{\intercal}\) ?
(b)
What is the orthogonal projection of \(\boldsymbol{x}=\left[x_{1}, x_{2}, x_{3}\right]^{\intercal} \in \mathbb{R}^{3}\) on the subspace spanned by the vectors \([1,0,0]^{\intercal}\) and \([0,1,0]^{\intercal}\) ?
(c)
Write the projection matrices in the previous two cases.
(d)
Write the expression for the orthogonal projection of a vector \(\boldsymbol{x}\in \mathbb{R}^{d}\) along any given vector \(\boldsymbol{a}\in \mathbb{R}^{d}\). What is the projection matrix in this case?
(e)
Given two orthogonal vectors \(\boldsymbol{a}_{1}\) and \(\boldsymbol{a}_{2}\) in \(\mathbb{R}^{d}\), what is the orthogonal projection of a generic vector \(\boldsymbol{x}\) onto the subspace spanned by these two vectors?
(f)
Suppose that the two vectors \(\boldsymbol{a}_{1}\) and \(\boldsymbol{a}_{2}\) are not orthogonal. How will you compute the orthogonal projection of \(\boldsymbol{x}\) in this case? It may be useful to review Gram–Schmidt orthogonalization.
(g)
Generalize the answer from (f) to compute the orthogonal projection onto a \(k\)-dimensional subspace spanned by the vectors \(\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{k}\), which need not be orthogonal.
(h)
Define the matrix
\[ \boldsymbol{A}=\left[\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{k}\right] \in \mathbb{R}^{d \times k} \]
such that the columns are linearly independent. Prove that the orthogonal projection of any vector \(\boldsymbol{x}\in \mathbb{R}^{d}\) onto the \(k\)-dimensional subspace spanned by the vectors \(\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{k}\) is given by
\[ \begin{equation*} \mathbf{A}\left(\mathbf{A}^{\intercal} \mathbf{A}\right)^{-1} \mathbf{A}^{\intercal} \boldsymbol{x}. \tag{1} \end{equation*} \]
Hint: Consider the least squares problem \(\min_{\boldsymbol{\beta}}\|\mathbf{A \boldsymbol{\beta}}-\boldsymbol{x}\|_{2}^{2}\).
(i)
How does the expression \(\mathbf{A}\left(\mathbf{A}^{\intercal} \mathbf{A}\right)^{-1} \mathbf{A}^{\intercal}\) simplify if the vectors \(\mathbf{a}_{1}, \ldots, \mathbf{a}_{k}\) are orthogonal?
(a)
It is \((x_1, 0, 0)^\intercal\).
(b)
It is \((x_1, x_2, 0)^\intercal\).
(c)
\[ P_a = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad P_b = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \]
(d)
It is \(\frac{\boldsymbol{x}^\intercal\boldsymbol{a}}{\boldsymbol{a}^\intercal\boldsymbol{a}} \boldsymbol{a}\), with projection matrix \(\boldsymbol{a}(\boldsymbol{a}^\intercal\boldsymbol{a})^{-1} \boldsymbol{a}^\intercal\).
(e)
It is \(\boldsymbol{a}_1 (\boldsymbol{a}_1^\intercal\boldsymbol{x}) / (\boldsymbol{a}_1^\intercal\boldsymbol{a}_1) + \boldsymbol{a}_2 (\boldsymbol{a}_2^\intercal\boldsymbol{x}) / (\boldsymbol{a}_2^\intercal\boldsymbol{a}_2)\).
(f)
You can make \(\boldsymbol{a}_2\) orthogonal to \(\boldsymbol{a}_1\) by subtracting its projection onto \(\boldsymbol{a}_1\), i.e., by replacing it with \(\tilde{\boldsymbol{a}}_2 = \boldsymbol{a}_2 - \frac{\boldsymbol{a}_1^\intercal\boldsymbol{a}_2}{\boldsymbol{a}_1^\intercal\boldsymbol{a}_1} \boldsymbol{a}_1\), and then apply (e) to \(\boldsymbol{a}_1\) and \(\tilde{\boldsymbol{a}}_2\).
(g)
You can construct an orthogonal basis for the span of \(\boldsymbol{a}_1, \ldots, \boldsymbol{a}_k\) by successively orthogonalizing each vector against the preceding ones (Gram–Schmidt), and then sum the resulting one–dimensional projections as in (e).
(h)
The least squares solution is \(\hat{\boldsymbol{\beta}} = (\boldsymbol{A}^\intercal\boldsymbol{A})^{-1} \boldsymbol{A}^\intercal\boldsymbol{x}\), so \(\boldsymbol{A}\hat{\boldsymbol{\beta}} = \boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1} \boldsymbol{A}^\intercal\boldsymbol{x}\) is the closest point to \(\boldsymbol{x}\) in the column span of \(\boldsymbol{A}\), which is precisely the definition of the orthogonal projection onto that subspace.
(i)
How does the expression \(\mathbf{A}\left(\mathbf{A}^{\intercal} \mathbf{A}\right)^{-1} \mathbf{A}^{\intercal}\) simplify if the vectors \(\mathbf{a}_{1}, \ldots, \mathbf{a}_{k}\) are orthogonal?
In that case, \(\boldsymbol{A}^\intercal\boldsymbol{A}\) is diagonal with entries \(\boldsymbol{a}_j^\intercal\boldsymbol{a}_j\), so its inverse is also diagonal, and the expression reduces to the sum of one–dimensional projection matrices \(\sum_{j=1}^k \boldsymbol{a}_j \boldsymbol{a}_j^\intercal/ (\boldsymbol{a}_j^\intercal\boldsymbol{a}_j)\).
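The general formula from (h) and the Gram–Schmidt construction from (f) can be compared numerically. A minimal sketch, assuming numpy, with arbitrary vectors \(\boldsymbol{a}_1\), \(\boldsymbol{a}_2\), and \(\boldsymbol{x}\):

```python
# A small numerical check, assuming numpy; a_1, a_2, and x are arbitrary.
import numpy as np

rng = np.random.default_rng(7)
d = 5
a1, a2, x = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

# General formula from (h).
A = np.column_stack([a1, a2])
proj_formula = A @ np.linalg.solve(A.T @ A, A.T @ x)

# Gram-Schmidt as in (f): orthogonalize a_2 against a_1, then add the
# one-dimensional projections as in (e).
a2_orth = a2 - (a1 @ a2) / (a1 @ a1) * a1
proj_gs = (a1 @ x) / (a1 @ a1) * a1 + (a2_orth @ x) / (a2_orth @ a2_orth) * a2_orth

print(np.max(np.abs(proj_formula - proj_gs)))   # ~ machine precision
```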
RSS, F-tests, and variable selection
Consider a full–rank matrix \(\boldsymbol{X}\in \mathbb{R}^{N \times P}\) with regressor vector \(\boldsymbol{x}_n \in \mathbb{R}^{P}\) in the \(n\)–th row, and a vector \(\boldsymbol{Y}\in \mathbb{R}^{N}\) of responses. Let \(\boldsymbol{X}_K \in \mathbb{R}^{N \times K}\) denote a matrix containing the first \(K\) columns of \(\boldsymbol{X}\), and, for a given \(K\), define \(\hat{\boldsymbol{\beta}}_K\) as the solution to the least squares problem
\[ \hat{\boldsymbol{\beta}}_K := \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \left\Vert\boldsymbol{Y}- \boldsymbol{X}_K \boldsymbol{\beta}\right\Vert_2^2. \]
Let \(\hat{\boldsymbol{Y}}_K := \boldsymbol{X}_K \hat{\boldsymbol{\beta}}_K\), and define \(ESS_K := \hat{\boldsymbol{Y}}_K^\intercal\hat{\boldsymbol{Y}}_K\), \(TSS := \boldsymbol{Y}^\intercal\boldsymbol{Y}\), and \(R^2_K := ESS_K / TSS\).
You may assume, for simplicity, that the columns of \(\boldsymbol{X}\) and the vector \(\boldsymbol{Y}\) are already centered, i.e., \(\boldsymbol{X}^\intercal\boldsymbol{1}= \boldsymbol{0}\) and \(\boldsymbol{Y}^\intercal\boldsymbol{1}= 0\).
(a)
Using the standard formula for \(\hat{\boldsymbol{\beta}}_K\), show that \(\hat{\boldsymbol{Y}}_K\) is the projection of \(\boldsymbol{Y}\) onto the column span of \(\boldsymbol{X}_K\). (Hint: you can use a result from the projections problem above.)
(b)
Using (a), show that, for \(K' > K\), \(R^2_K \le R^2_{K'}\).
(c)
Under what conditions does \(R^2_{K} = R^2_{K'}\) for \(K' > K\)?
(d)
Assume that \(\boldsymbol{Y}\) is not orthogonal to any columns of \(\boldsymbol{X}\). Suppose you choose \(\hat{K}\) as a value that maximizes \(ESS_K\), i.e., \(\hat{K} := \underset{K}{\mathrm{argmax}}\, ESS_K\). Write an expression for \(\hat{K}\). Do you think \(\hat{K}\) is a good guide for choosing which regressors to include in a regression problem? Why or why not?
(e) (254 only \(\star\) \(\star\) \(\star\))
Let \(\beta_K\) denote the restriction of \(\beta\) to its first \(K\) components. For a given \(K\), write an expression for \(\phi_K\), the F–statistic testing the null hypothesis \(H_0: \beta_K = \boldsymbol{0}\) using the regression \(\hat{\boldsymbol{\beta}}_K\). (See, e.g., Gareth et al. (2021) section 3.2.2.)
(f) (254 only \(\star\) \(\star\) \(\star\))
We showed in (b) that \(K \mapsto R^2_{K}\) is non–decreasing as a function of \(K\). Is the same true of the F–statistics from (e), i.e., for \(K \mapsto \phi_K\)?
(a)
\[ \hat{\boldsymbol{Y}}_K = \boldsymbol{X}_K \hat{\boldsymbol{\beta}}_K =\boldsymbol{X}_K (\boldsymbol{X}_K^\intercal\boldsymbol{X}_K)^{-1}\boldsymbol{X}_K^\intercal\boldsymbol{Y}, \]
which is the projection formula from above.
(b)
If \(K' > K\), then \(\left\Vert\hat{\boldsymbol{Y}}_{K'}\right\Vert \ge \left\Vert\hat{\boldsymbol{Y}}_K\right\Vert\), since \(\hat{\boldsymbol{Y}}_{K'}\) is the projection of \(\boldsymbol{Y}\) onto a larger (or equal) subspace: the column span of \(\boldsymbol{X}_{K'}\) contains that of \(\boldsymbol{X}_K\). The \(TSS\) does not depend on the regression, so \(R^2_K \le R^2_{K'}\).
(c)
This occurs when \(\boldsymbol{Y}\) is orthogonal to the additional directions spanned by the columns of \(\boldsymbol{X}_{K'}\) beyond the column span of \(\boldsymbol{X}_K\), so that the extra columns do not change the projection.
(d)
\(\hat{K} = P\), since \(ESS_K\) is increasing in \(K\). This is not a good guide for choosing regressors: it always selects every regressor, essentially no matter what \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) actually are, so it says nothing about which regressors are useful.
(e) (254 only \(\star\) \(\star\) \(\star\))
Let \(\beta_K\) denote the restriction of \(\beta\) to its first \(K\) components. For a given \(K\), write an expression for \(\phi_K\), the F–statistic testing the null hypothesis \(H_0: \beta_K = \boldsymbol{0}\) using the regression \(\hat{\boldsymbol{\beta}}_K\). (See, e.g., Gareth et al. (2021) section 3.2.2.)
\[ \phi_K = \frac{N-K}{K} \frac{ESS_K}{RSS_K}, \quad \text{where } RSS_K := \left\Vert\boldsymbol{Y}- \hat{\boldsymbol{Y}}_K\right\Vert_2^2 = TSS - ESS_K. \]
(f) (254 only \(\star\) \(\star\) \(\star\))
It is not true. As \(K\) increases, \(ESS_K\) can only grow and \(RSS_K\) can only shrink, but the factor \((N-K)/K\) decreases; when the added regressors explain little, \(\phi_K\) decreases.
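A quick simulation, assuming numpy and an arbitrary design in which only the first regressor matters, illustrates the contrast: \(R^2_K\) never decreases in \(K\), while \(\phi_K\) can.

```python
# A quick simulation, assuming numpy; the design, response, and noise
# level are arbitrary.
import numpy as np

rng = np.random.default_rng(8)
N, P = 100, 10
X = rng.normal(size=(N, P))
X = X - X.mean(axis=0)                  # center the columns
Y = X[:, 0] + 0.1 * rng.normal(size=N)  # only the first regressor matters
Y = Y - Y.mean()

TSS = Y @ Y
for K in range(1, P + 1):
    XK = X[:, :K]
    Yhat = XK @ np.linalg.solve(XK.T @ XK, XK.T @ Y)
    ESS, RSS = Yhat @ Yhat, (Y - Yhat) @ (Y - Yhat)
    R2 = ESS / TSS
    F = (N - K) / K * ESS / RSS
    print(K, round(R2, 4), round(F, 2))  # R2 never decreases; F does here
```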