Homework 0 (Notation and review)
Stat 154/254: Statistical Machine Learning
\[ \def\gv{\boldsymbol{g}} \]
Due Sunday September 7th (9pm)
1 Post on ED
There is a thread on ED about learning objectives. Please post three (or more) learning goals for yourself. Ideally, these learning objectives will be specific, achievable, and evaluable.
You can find the official course objectives on the Course Policies webpage. Feel free to select a few that are most important to you, or write your own.
2 Extreme values of functions
Let \(C\) be some set, and \(f(x)\) a function defined on \(x \in C\). The expression \(\inf_{x \in C} f(x)\) means “the largest lower bound for \(f(x)\) over \(x \in C\)”. The symbol “inf” is pronounced “infimum.” Analogously, \(\sup_{x \in C} f(x)\) means “the smallest upper bound for \(f(x)\) over \(x \in C\)” and is pronounced “supremum.”
The infimum is like the minimum and the supremum is like the maximum, with the important difference that the infimum and supremum always exist, whereas the minimum and maximum might not. (For this to be true, we must allow for the supremum and infimum to be infinite, which we will in this class.)
The notation \(\underset{x \in C}{\mathrm{argmin}}\, f(x)\) means “the set of \(x\) at which \(f(x)\) achieves its minimum in \(C\)”, with an analogous meaning for \(\underset{x \in C}{\mathrm{argmax}}\, f(x)\). If there are no such \(x\), the \(\underset{}{\mathrm{argmin}}\,\) (or \(\underset{}{\mathrm{argmax}}\,\)) returns the empty set, \(\emptyset\).
(a) Suppose \(f(x)\) achieves its minimum in \(C\), meaning there exists some \(x_0 \in C\) such that \(f(x_0) \le f(x)\) for all \(x \in C\). When this happens, show that the infimum is the same as the minimum, i.e. \(\inf_{x \in C} f(x) = \min_{x \in C} f(x)\).
(b) Suppose that \(\underset{x \in C}{\mathrm{argmax}}\, f(x)\) is non-empty. When this happens, show that the supremum is the same as the maximum, i.e. \(\sup_{x \in C} f(x) = \max_{x \in C} f(x)\).
(c) Let \(f(x) = \exp(x)\), and let \(C = \mathbb{R}^{}\) be the whole real line. Find \(\inf_{x} \exp(x)\). Show that there is no \(x_0\) such that \(f(x_0) = \inf_{x} \exp(x)\), so \(\min_{x} \exp(x)\) does not exist.
(d) Let \(f(x) = x\) and let \(C = (0,1)\) denote the open unit interval. Find \(\sup_{x} f(x)\), and show that \(\max_{x} f(x)\) does not exist.
(e) Let \(f(x) = 1\) be a constant, and find \(\underset{x \in (0,1)}{\mathrm{argmin}}\, f(x)\).
(f) (254 only \(\star\) \(\star\) \(\star\)) Identify a function defined on all of \([0,1]\) for which \(\sup_{x\in [0,1]} f(x) = 1\) but for which the maximum does not exist.
(g) (254 only \(\star\) \(\star\) \(\star\)) Let \(A = \begin{pmatrix}1 & 0 \\ 0 & 0\end{pmatrix}\) and find \(\underset{x \in \mathbb{R}^{2}}{\mathrm{argmin}}\, x^\intercal A x\).
(a) Since \(f(x_0)\) is a lower bound of \(f(x)\) in \(C\), it must be the case that \(\inf_{x \in C} f(x)\), which is the largest lower bound, is greater than or equal to \(f(x_0)\). And any number \(f_1 > f(x_0)\) cannot be a lower bound, since \(f(x_0)\) would be lower. So \(f(x_0) \le \inf_{x \in C} f(x) \le f(x_0)\), from which the result follows.
(b) This is in fact a restatement of (a), with \(-f\) replacing \(f\).
(c) As \(x \rightarrow -\infty\), \(\exp(x)\) gets arbitrarily close to \(0\), but never takes any negative values, so \(\inf_{x} \exp(x) = 0\). However, there is no \(x\) such that \(\exp(x) = 0\), so \(\exp(x)\) does not achieve its minimum.
(d) As \(x \rightarrow 1\) from the left, \(f(x)\) gets arbitrarily close to \(1\), so \(\sup_{x} f(x) \ge 1\). But \(f(x) < 1\) for all \(x \in (0,1)\), so \(\sup_{x} f(x) = 1\), but \(f(x)\) never equals \(1\), so the maximum does not exist.
(e) The solution is the whole interval, \(\underset{x \in (0,1)}{\mathrm{argmin}}\, f(x) = (0,1)\), since \(f(x)\) achieves its minimum (and its maximum!) at every point.
(f) You can take the example from (d), extended to \([0,1]\) by setting \(f(x) = x\) for \(x \in [0,1)\) and \(f(1) = 0\). The function is discontinuous, but the reasoning from (d) still applies.
(g) For any \(\boldsymbol{x}= (x_1, x_2)^\intercal\), \(\boldsymbol{x}^\intercal A \boldsymbol{x}= x_1^2 \ge 0\), and the value \(0\) is achieved exactly at vectors of the form \(\boldsymbol{x}= (0, x_2)^\intercal\). So \(\underset{\boldsymbol{x}\in \mathbb{R}^{2}}{\mathrm{argmin}}\, \boldsymbol{x}^\intercal A \boldsymbol{x}= \{(0, x_2)^\intercal : x_2 \in \mathbb{R}\}\).
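As an optional numerical illustration of (g), here is a quick check; the particular vectors below are arbitrary.

```python
import numpy as np

# Quick check of (g): x^T A x = x_1^2, minimized (value 0) by any x of the form (0, x_2).
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])

rng = np.random.default_rng(0)
x = rng.normal(size=2)
print(np.isclose(x @ A @ x, x[0] ** 2))   # True

x_min = np.array([0.0, 3.7])              # an arbitrary point on the x_2-axis
print(x_min @ A @ x_min)                  # 0.0
```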
3 Multivariate normal exercises
Let \(\boldsymbol{x}\sim \mathcal{N}\left(0, \Sigma\right)\) be a \(P\)–dimensional multivariate normal random vector with invertible covariance matrix \(\Sigma\). Let \(\varepsilon\sim \mathcal{N}\left(0, \sigma^2\right)\) denote a univariate normal random variable which is independent of \(\boldsymbol{x}\). Finally, for a given \(\boldsymbol{\beta}\), let \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon\).
(a) Give explicit expressions for the distributions of the following quantities in terms of the fixed quantities \(\sigma\), \(\Sigma\), and \(\boldsymbol{\beta}\). You do not need to specify the density explicitly, just fully characterize the distribution. For example, it is an acceptable answer to write \(\mathbb{P}_{\,}\left(\boldsymbol{x}\right) = \mathcal{N}\left(0, \Sigma\right)\).
Note that \((y, \boldsymbol{x}^\intercal)^\intercal\) is a \(P + 1\)–dimensional vector consisting of \(y\) stacked on top of \(\boldsymbol{x}\).
- \(\mathbb{P}_{\,}\left((y, \boldsymbol{x}^\intercal)^\intercal\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}\right)\)
- \(\mathbb{P}_{\,}\left(y| \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(y\right)\)
- \(\mathbb{P}_{\,}\left(\varepsilon\vert \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\varepsilon\vert y, \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{x}| y\right)\)
(b) When \(\boldsymbol{\beta}= \boldsymbol{0}\), how is \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right)\) different from \(\mathbb{P}_{\,}\left(\boldsymbol{x}\right)\)? Explain this result in intuitive terms.
(c) When \(\sigma = 0\), what is the dimension of the support of \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right)\)? (Recall that the support is, informally, the region of nonzero probability.) Explain this result in intuitive terms.
(d) As the diagonal entries of \(\Sigma\) go to infinity, how is \(\mathrm{Var}_{\,}\left(y\vert \boldsymbol{x}\right)\) different from \(\mathrm{Var}_{\,}\left(y\right)\)? Explain your answer in intuitive terms.
(e) (254 only \(\star\) \(\star\) \(\star\))
Now, additionally assume that \(\boldsymbol{\beta}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{V}\right)\) independently of \(\boldsymbol{x}\), and that \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon\) holds conditionally on \(\boldsymbol{x}\), \(\boldsymbol{\beta}\), and \(\varepsilon\). Give explicit expressions for the following distributions in terms of \(\Sigma\), \(\sigma\), and \(\boldsymbol{V}\).
- \(\mathbb{P}_{\,}\left((y, \boldsymbol{\beta}^\intercal)^\intercal\vert \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}\vert \boldsymbol{x}\right)\)
- \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}\vert y, \boldsymbol{x}\right)\)
(a)
Note that
\[ \begin{aligned} \mathbb{E}_{\vphantom{}}\left[y\boldsymbol{x}^\intercal\right] ={}& \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{\beta}^\intercal\boldsymbol{x}+ \varepsilon) \boldsymbol{x}^\intercal\right] \\={}& \boldsymbol{\beta}^\intercal\mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] + \mathbb{E}_{\vphantom{}}\left[\varepsilon\boldsymbol{x}^\intercal\right] \\={}& \boldsymbol{\beta}^\intercal\Sigma \end{aligned} \]
and
\[ \mathbb{E}_{\vphantom{}}\left[y^2\right] = \mathbb{E}_{\vphantom{}}\left[(\boldsymbol{\beta}^\intercal\boldsymbol{x}+ \varepsilon) (\boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon)\right] = \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}+ \sigma^2. \]
\[ \mathbb{P}_{\,}\left((y, \boldsymbol{x}^\intercal)^\intercal\right) = \mathcal{N}\left(0, \begin{pmatrix} \sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}& \boldsymbol{\beta}^\intercal\Sigma \\ \Sigma \boldsymbol{\beta}& \Sigma \end{pmatrix} \right) \]
\(\mathbb{P}_{\,}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}\right) = \mathcal{N}\left(0, \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)\)
Using the Gaussian conditioning rule,
\[ \begin{aligned} \mathbb{E}_{\vphantom{}}\left[y| \boldsymbol{x}\right] ={}& \mathbb{E}_{\vphantom{}}\left[y\right] + \mathrm{Cov}_{\,}\left(y, \boldsymbol{x}\right) \mathrm{Cov}_{\,}\left(\boldsymbol{x}\right)^{-1} (\boldsymbol{x}- \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}\right]) \\={}& 0 + \boldsymbol{\beta}^\intercal\Sigma \Sigma^{-1} \boldsymbol{x} \\={}& \boldsymbol{\beta}^\intercal\boldsymbol{x}. \end{aligned} \]
Also,
\[ \begin{aligned} \mathrm{Cov}_{\,}\left(y| \boldsymbol{x}\right) ={}& \mathrm{Cov}_{\,}\left(y\right) - \mathrm{Cov}_{\,}\left(y, \boldsymbol{x}\right) \mathrm{Cov}_{\,}\left(\boldsymbol{x}\right)^{-1} \mathrm{Cov}_{\,}\left(\boldsymbol{x}, y\right) \\={}& \sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}- \boldsymbol{\beta}^\intercal\Sigma \Sigma^{-1} \Sigma \boldsymbol{\beta} \\={}& \sigma^2. \end{aligned} \]
You can more easily just do these same calculations directly on \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon,\) but it’s nice to see the conditional formulas give the same answer.
It follows that
\[ \mathbb{P}_{\,}\left(y\vert \boldsymbol{x}\right) = \mathcal{N}\left(\boldsymbol{x}^\intercal\boldsymbol{\beta}, \sigma^2\right). \]
\(\mathbb{P}_{\,}\left(y\right) = \mathcal{N}\left(0, \sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)\). Note that marginally over \(\boldsymbol{x}\), the distribution of \(y\) has greater variance than conditional on \(\boldsymbol{x}\).
\(\mathbb{P}_{\,}\left(\varepsilon\vert \boldsymbol{x}\right) = \mathcal{N}\left(0, \sigma^2\right)\), since \(\varepsilon\) and \(\boldsymbol{x}\) are independent.
\(\mathbb{P}_{\,}\left(\varepsilon\vert y, \boldsymbol{x}\right)\) is a point mass at \(y- \boldsymbol{x}^\intercal\boldsymbol{\beta}\); it can take no other value.
\(\mathbb{P}_{\,}\left(\boldsymbol{x}| y\right)\)
This one is more complicated than \(\mathbb{P}_{\,}\left(y\vert \boldsymbol{x}\right)\), and it is easiest to use the conditional formulas directly. We know that it is Gaussian. The mean is given by
\[ \begin{aligned} \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}| y\right] ={}& \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}\right] + \mathrm{Cov}_{\,}\left(\boldsymbol{x}, y\right) \mathrm{Cov}_{\,}\left(y\right)^{-1} (y- \mathbb{E}_{\vphantom{}}\left[y\right]) \\={}& 0 + \Sigma \boldsymbol{\beta}\left(\sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)^{-1} y \\={}& 0 + \frac{y}{\sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}} \Sigma \boldsymbol{\beta}. \end{aligned} \]
The covariance is given by
\[ \begin{aligned} \mathrm{Cov}_{\,}\left(\boldsymbol{x}| y\right) ={}& \mathrm{Cov}_{\,}\left(\boldsymbol{x}\right) - \mathrm{Cov}_{\,}\left(\boldsymbol{x}, y\right) \mathrm{Cov}_{\,}\left(y\right)^{-1} \mathrm{Cov}_{\,}\left(y, \boldsymbol{x}\right) \\={}& \Sigma - \Sigma \boldsymbol{\beta}\left(\sigma^2 + \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)^{-1} \boldsymbol{\beta}^\intercal\Sigma. \end{aligned} \]
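As an optional sanity check, the following simulation sketch approximates conditioning on \(y\) by keeping draws with \(y\) near a fixed value; the dimension, \(\Sigma\), \(\boldsymbol{\beta}\), and \(\sigma\) below are arbitrary choices.

```python
import numpy as np

# Check the formulas for E[x | y] and Cov(x | y) by Monte Carlo.
rng = np.random.default_rng(154)
P, sigma = 3, 0.5
beta = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])  # an arbitrary positive definite covariance

n_draws = 500_000
x = rng.multivariate_normal(np.zeros(P), Sigma, size=n_draws)
y = x @ beta + sigma * rng.normal(size=n_draws)

# Approximate conditioning on y = y0 by keeping draws with y close to y0.
y0 = 1.0
x_cond = x[np.abs(y - y0) < 0.1]

marg_var_y = sigma**2 + beta @ Sigma @ beta
mean_formula = (y0 / marg_var_y) * (Sigma @ beta)
cov_formula = Sigma - np.outer(Sigma @ beta, beta @ Sigma) / marg_var_y

# The empirical moments should roughly match the formulas above.
print(x_cond.mean(axis=0), mean_formula)
print(np.cov(x_cond.T))
print(cov_formula)
```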
(b)
When \(\boldsymbol{\beta}= \boldsymbol{0}\), \(y= \varepsilon\), which is independent of \(\boldsymbol{x}\), so \(\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y\right) = \mathbb{P}_{\,}\left(\boldsymbol{x}\right) = \mathcal{N}\left(0, \Sigma\right)\): observing \(y\) carries no information about \(\boldsymbol{x}\). The formulas from (a) agree, since every term involving \(\boldsymbol{\beta}\) vanishes, giving \(\mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}| y\right] = \boldsymbol{0}\) and \(\mathrm{Cov}_{\,}\left(\boldsymbol{x}| y\right) = \Sigma\).
(c)
Since \(\sigma = 0\) means \(y= \boldsymbol{\beta}^\intercal\boldsymbol{x}\) exactly, conditioning on \(y\) fixes the value of \(\boldsymbol{\beta}^\intercal\boldsymbol{x}\), so \(\mathrm{Var}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}| y\right)\) should equal zero. The conditional distribution of \(\boldsymbol{x}\) is then supported on the hyperplane \(\{\boldsymbol{x}: \boldsymbol{\beta}^\intercal\boldsymbol{x}= y\}\), within which \(\boldsymbol{x}\) can still vary along the \(P-1\) linearly independent directions \(\boldsymbol{v}\) with \(\boldsymbol{v}^\intercal\boldsymbol{\beta}= 0\), so the support of \(\mathbb{P}_{\,}\left(\boldsymbol{x}| y\right)\) should be \(P-1\) dimensional.
We can in fact use (a.g) to confirm that, when \(\sigma^2 = 0\):
\[ \begin{aligned} \mathrm{Var}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}| y\right) ={}& \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}- \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\left(\boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\right)^{-1} \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta} \\={}& \boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}\left(1 - \frac{\boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}}{\boldsymbol{\beta}^\intercal\Sigma \boldsymbol{\beta}} \right) \\={}& 0. \end{aligned} \]
(d)
No matter what the variance of \(\boldsymbol{x}\) is, \(\mathrm{Var}_{\,}\left(y\vert \boldsymbol{x}\right)\) is \(\sigma^2\). However, the marginal variance \(\mathrm{Var}_{\,}\left(y\right)\) goes to infinity, since it also includes the variability from \(\boldsymbol{x}\). This simple fact is at the core of some interesting phenomena in Bayesian statistics such as “Lindley’s paradox.”
(e) (254 only \(\star\) \(\star\) \(\star\))
Now, additionally assume that \(\boldsymbol{\beta}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{V}\right)\) independently of \(\boldsymbol{x}\), and that \(y= \boldsymbol{x}^\intercal\boldsymbol{\beta}+ \varepsilon\) holds conditionally on \(\boldsymbol{x}\), \(\boldsymbol{\beta}\), and \(\varepsilon\). Give explicit expressions for the following distributions in terms of \(\Sigma\), \(\sigma\), and \(\boldsymbol{V}\).
Conditional on \(\boldsymbol{x}\), everything is jointly normal, so \(y\) and \(\boldsymbol{\beta}\) are conditionally normal as well. We have \(\mathbb{E}_{\vphantom{}}\left[y\vert \boldsymbol{x}\right] = \mathbb{E}_{\vphantom{}}\left[\boldsymbol{\beta}\right]^\intercal\boldsymbol{x}= 0\), \(\mathrm{Cov}_{\,}\left(y, \boldsymbol{\beta}| \boldsymbol{x}\right) = \mathbb{E}_{\vphantom{}}\left[\boldsymbol{x}^\intercal\boldsymbol{\beta}\boldsymbol{\beta}^\intercal| \boldsymbol{x}\right] = \boldsymbol{x}^\intercal\boldsymbol{V}\), and \(\mathrm{Var}_{\,}\left(y| \boldsymbol{x}\right) = \mathbb{E}_{\vphantom{}}\left[y^2 | \boldsymbol{x}\right] = \sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}\). Finally, \(\boldsymbol{\beta}\) is independent of \(\boldsymbol{x}\), so conditioning on \(\boldsymbol{x}\) does not change its distribution.
\[ \mathbb{P}_{\,}\left((y, \boldsymbol{\beta}^\intercal)^\intercal\vert \boldsymbol{x}\right) = \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}& \boldsymbol{x}^\intercal\boldsymbol{V}\\ \boldsymbol{V}\boldsymbol{x}& \boldsymbol{V} \end{pmatrix} \right). \]
\(\boldsymbol{\beta}\) is independent of \(\boldsymbol{x}\), so \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}\vert \boldsymbol{x}\right) = \mathcal{N}\left(0, \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}\right)\).
Applying the conditioning rule to \(\mathbb{P}_{\,}\left((y, \boldsymbol{\beta}^\intercal)^\intercal\vert \boldsymbol{x}\right)\), we find that \(\mathbb{P}_{\,}\left(\boldsymbol{\beta}\vert y, \boldsymbol{x}\right)\) is multivariate normal, with
\[ \mathbb{E}_{\vphantom{}}\left[\boldsymbol{\beta}| y, \boldsymbol{x}\right] = 0 + \frac{y}{\sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}} \boldsymbol{V}\boldsymbol{x} \]
and
\[ \mathrm{Cov}_{\,}\left(\boldsymbol{\beta}| y, \boldsymbol{x}\right) = \boldsymbol{V}- \boldsymbol{V}\boldsymbol{x}\boldsymbol{x}^\intercal\boldsymbol{V}\left(\sigma^2 + \boldsymbol{x}^\intercal\boldsymbol{V}\boldsymbol{x}\right)^{-1}. \]
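Optionally, one can check numerically that these formulas match the familiar Bayesian linear–regression posterior with precision \(\boldsymbol{V}^{-1} + \boldsymbol{x}\boldsymbol{x}^\intercal/\sigma^2\) (they agree by the Sherman–Morrison identity); a sketch with arbitrary values of \(\boldsymbol{V}\), \(\boldsymbol{x}\), \(\sigma\), and \(y\):

```python
import numpy as np

# Check that the conditional mean and covariance above match the posterior
# implied by precision V^{-1} + x x^T / sigma^2 (Sherman-Morrison identity).
rng = np.random.default_rng(254)
P, sigma, y = 4, 0.7, 1.3
x = rng.normal(size=P)
B = rng.normal(size=(P, P))
V = B @ B.T + np.eye(P)  # an arbitrary positive definite prior covariance

denom = sigma**2 + x @ V @ x
mean_cond = (y / denom) * (V @ x)              # E[beta | y, x] from above
cov_cond = V - np.outer(V @ x, x @ V) / denom  # Cov(beta | y, x) from above

prec_post = np.linalg.inv(V) + np.outer(x, x) / sigma**2
cov_post = np.linalg.inv(prec_post)
mean_post = (cov_post @ x) * (y / sigma**2)

print(np.allclose(cov_cond, cov_post), np.allclose(mean_cond, mean_post))  # True True
```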
4 LLNs and CLTs
For this problem, you may use without proof the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) for IID random variables with finite variance. (See, e.g., Pitman (2012) section 3.3.)
For \(n = 1, 2, \ldots\), let \(x_n\) denote IID random variables uniformly distributed on the interval \([0,1]\). Let \(a_n\) denote a deterministic sequence satisfying \(\frac{1}{N} \sum_{n=1}^Na_n \rightarrow 0\) and \(\frac{1}{N} \sum_{n=1}^Na_n^2 \rightarrow 1\), each as \(N \rightarrow \infty\).
(a)
Compute \(\mathbb{E}_{\vphantom{}}\left[\frac{1}{N} \sum_{n=1}^Nx_n\right]\) and \(\mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Nx_n\right)\).
(b)
What is \(\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^Nx_n\)? Explain your answer in terms of your result from (a).
(c)
Let \(z_n = x_n - 0.5\). What is \(\mathbb{E}_{\vphantom{}}\left[\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right]\) and \(\mathrm{Var}_{\,}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right)\)?
(d)
What is \(\lim_{N \rightarrow \infty} \frac{1}{\sqrt{N}} \sum_{n=1}^N(x_n - 0.5)\)? Explain how your answer relates to your result from (c).
(e)
What is \(\lim_{N \rightarrow \infty} \frac{1}{\sqrt{N}} \sum_{n=1}^Nx_n\)? Explain how your answer relates to your result from (d).
(f) (254 only \(\star\) \(\star\) \(\star\))
Compute \(\mathbb{E}_{\vphantom{}}\left[\frac{1}{N} \sum_{n=1}^Na_n x_n\right]\) and \(\mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Na_n x_n\right)\).
(g) (254 only \(\star\) \(\star\) \(\star\))
What is \(\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^Na_n x_n\)? Explain your answer in terms of your result from (f).
(h) (254 only \(\star\) \(\star\) \(\star\))
Let \(s_N := \frac{1}{\sqrt{N}} \sum_{n=1}^N(a_n x_n - 0.5 \frac{1}{N} \sum_{n=1}^Na_n)\). Compute \(\mathbb{E}_{\vphantom{}}\left[s_N\right]\) and \(\mathrm{Var}_{\,}\left(s_N\right)\).
Advanced note: As you might guess from these computations, a result known as the Lindeberg–Feller central limit theorem (e.g., Van der Vaart (2000) 2.27) gives conditions under which
\[ s_N \rightarrow \mathcal{N}\left(0, \lim_{N\rightarrow \infty} \mathrm{Var}_{\,}\left(s_N\right)\right) \quad \textrm{in distribution as } N \rightarrow \infty. \]
(a)
\(\mathbb{E}_{\vphantom{}}\left[\frac{1}{N} \sum_{n=1}^Nx_n\right] = \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\vphantom{}}\left[x_n\right] = \frac{1}{N} \sum_{n=1}^N0.5 = 0.5\).
\[ \begin{aligned} \mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Nx_n\right) ={}& \mathbb{E}_{\vphantom{}}\left[(\frac{1}{N} \sum_{n=1}^Nx_n - 0.5)^2\right] \\={}& \mathbb{E}_{\vphantom{}}\left[\frac{1}{N^2} \sum_{n=1}^N \sum_{m=1}^N (x_n-0.5)(x_m - 0.5)\right] \\={}& \frac{1}{N^2} \mathbb{E}_{\vphantom{}}\left[\sum_{n=1}^N (x_n-0.5)^2\right] \\={}& \frac{1}{N} \frac{1}{N} \sum_{n=1}^N\mathrm{Var}_{\,}\left(x_n\right) \\={}& \frac{1}{12 N}. \end{aligned} \]
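A short simulation (purely illustrative, with arbitrary \(N\) and replication count) confirming these two values:

```python
import numpy as np

# Simulation check: the sample mean of N Uniform[0,1] draws has mean 0.5 and
# variance 1 / (12 N).
rng = np.random.default_rng(0)
N, n_reps = 50, 100_000
xbar = rng.uniform(size=(n_reps, N)).mean(axis=1)

print(xbar.mean())       # close to 0.5
print(xbar.var())        # close to 1 / (12 * 50)
print(1 / (12 * N))      # 0.001666...
```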
(b)
By the LLN, the limit is \(\mathbb{E}_{\vphantom{}}\left[x_n\right] = 0.5\): from (a), the mean of \(\frac{1}{N} \sum_{n=1}^Nx_n\) is \(0.5\) for every \(N\) and its variance goes to zero.
(c)
Note that \(\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n = \sqrt{N}(\frac{1}{N} \sum_{n=1}^Nx_n - 0.5)\). So \(\mathbb{E}_{\vphantom{}}\left[\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right] = 0\) and \(\mathrm{Var}_{\,}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n\right) = N \mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Nx_n\right) = \frac{1}{12}\).
(d)
By the CLT, \(\frac{1}{\sqrt{N}} \sum_{n=1}^N(x_n - 0.5) \rightarrow \mathcal{N}\left(0, 1/12\right)\) in distribution. The \(\sqrt{N}\) scaling is just right so that the variance from (c) converges to a nonzero constant.
(e)
\[ \frac{1}{\sqrt{N}} \sum_{n=1}^Nx_n = \frac{1}{\sqrt{N}} \sum_{n=1}^Nz_n + \sqrt{N} \cdot 0.5 \rightarrow \infty. \]
You are adding \(\sqrt{N}\) times a positive constant to the quantity from (d), so the mean diverges while the variance stays bounded, and the sum diverges to infinity.
(f) (254 only \(\star\) \(\star\) \(\star\))
\(\mathbb{E}_{\vphantom{}}\left[\frac{1}{N} \sum_{n=1}^Na_n x_n\right] = 0.5 \cdot \frac{1}{N} \sum_{n=1}^Na_n\) and \(\mathrm{Var}_{\,}\left(\frac{1}{N} \sum_{n=1}^Na_n x_n\right) = \frac{1}{N} \cdot \frac{1}{N} \sum_{n=1}^Na_n^2 / 12\), by the same reasoning as above.
(g) (254 only \(\star\) \(\star\) \(\star\))
The variance of \(\frac{1}{N} \sum_{n=1}^Na_n x_n\) goes to zero and the mean goes to \(\lim_{N \rightarrow \infty} 0.5 \frac{1}{N} \sum_{n=1}^Na_n =: 0.5 \overline{a}\), so the limit is \(0.5 \overline{a}\).
(h) (254 only \(\star\) \(\star\) \(\star\))
\(\mathbb{E}_{\vphantom{}}\left[s_N\right] = 0\) and \(\mathrm{Var}_{\,}\left(s_N\right) = \frac{1}{N} \sum_{n=1}^Na_n^2 / 12\).
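To illustrate (h) and the advanced note, here is a small simulation sketch; the choice \(a_n = (-1)^n\) is one arbitrary sequence satisfying the stated conditions.

```python
import numpy as np

# Simulation with a_n = (-1)^n, which satisfies (1/N) sum a_n -> 0 and
# (1/N) sum a_n^2 = 1. The mean and variance of s_N should be close to 0 and 1/12,
# consistent with the Lindeberg-Feller CLT noted above.
rng = np.random.default_rng(1)
N, n_reps = 400, 25_000
a = (-1.0) ** np.arange(1, N + 1)
x = rng.uniform(size=(n_reps, N))
s = (x * a - 0.5 * a.mean()).sum(axis=1) / np.sqrt(N)

print(s.mean())          # close to 0
print(s.var(), 1 / 12)   # both close to 0.0833
```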
5 Multivariate calculus
Consider the logistic loss function \(\mathcal{L}(\boldsymbol{\beta})\) given by
\[ \begin{aligned} \phi(\zeta) :={}& \frac{\exp(\zeta)}{1 + \exp(\zeta)} \\ \ell(y| p) :={}& -y\log p- (1 - y) \log(1 - p) = - y\log \frac{p}{1-p} - \log(1 - p)\\ \mathcal{L}(\boldsymbol{\beta}) ={}& \sum_{n=1}^N\ell(y_n | \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n)). \end{aligned} \]
Let \(\boldsymbol{x}_n\) and \(\boldsymbol{\beta}\) be \(P\)–dimensional vectors.
(a)
Compute \(\partial \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n) / \partial \boldsymbol{\beta}\) and \(\partial^2 \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal\).
(b)
Using (a), compute \(\partial \mathcal{L}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\). (It helps to observe that \(\log \frac{\phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})}{1-\phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})} = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).)
(c)
Using (a), compute \(\partial^2 \mathcal{L}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal\).
(d)
Let \(\boldsymbol{X}\) denote the \(N \times P\) matrix whose \(n\)–th row is \(\boldsymbol{x}_n^\intercal\), and let \(\boldsymbol{\varepsilon}\) denote the \(N \times 1\) column vector whose \(n\)–th entry is \(y_n - \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})\). Write \(\partial \mathcal{L}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\) using \(\boldsymbol{X}\) and \(\boldsymbol{\varepsilon}\) using only matrix multiplication (i.e., without explicit summation of the form \(\sum_{n=1}^N(\cdot)\)).
(e) (254 only \(\star\) \(\star\) \(\star\))
Sketch pseudocode for an iterative computer program to approximately find a \(\hat{\boldsymbol{\beta}}\) satisfying \(\left. \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right|_{\hat{\boldsymbol{\beta}}} = \boldsymbol{0}\).
(f) (254 only \(\star\) \(\star\) \(\star\))
Using your result from (c), argue that \(\hat{\boldsymbol{\beta}}\) from (e) is (up to numerical error) a minimizer of \(\mathcal{L}(\beta)\).
(a)
Take \(\zeta = \boldsymbol{\beta}^\intercal\boldsymbol{x}\) and use the chain rule.
\[
\begin{aligned}
\partial \phi(\zeta) / \partial \zeta ={}& \frac{\exp(\zeta)}{1 + \exp(\zeta)} - \frac{\exp(\zeta)^2}{(1 + \exp(\zeta))^2}
\\={}&
\phi(\zeta) - \phi(\zeta)^2
\\={}&
\phi(\zeta) (1 - \phi(\zeta))
\Rightarrow \\
\partial \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / \partial \boldsymbol{\beta}={}&
\phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) (1 - \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \boldsymbol{x}\\
\end{aligned}
\]
Noting that \(\partial^2 \zeta / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal= 0\), the chain rule gives \(\partial^2 \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal= \left(\partial^2 \phi(\zeta) / \partial \zeta^2\right) \boldsymbol{x}\boldsymbol{x}^\intercal\). Differentiating the first derivative gives \[ \begin{aligned} \partial^2 \phi(\zeta) / \partial \zeta^2 ={}& \partial \phi(\zeta) / \partial \zeta - 2 \phi(\zeta) \partial \phi(\zeta) / \partial \zeta \\={}& \phi(\zeta) (1 - \phi(\zeta)) - 2 \phi(\zeta)^2 (1 - \phi(\zeta)) \\={}& \phi(\zeta) (1 - \phi(\zeta))(1 - 2 \phi(\zeta)), \end{aligned} \] so \(\partial^2 \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) / \partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal= \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) (1 - \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}))(1 - 2 \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \boldsymbol{x}\boldsymbol{x}^\intercal\).
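A quick finite-difference check (optional; the evaluation point is arbitrary) of these two scalar derivative formulas:

```python
import numpy as np

# Check phi'(z) = phi(z)(1 - phi(z)) and phi''(z) = phi(z)(1 - phi(z))(1 - 2 phi(z))
# against central finite differences at an arbitrary point z.
phi = lambda z: np.exp(z) / (1.0 + np.exp(z))

z, eps = 0.7, 1e-4
d1_numeric = (phi(z + eps) - phi(z - eps)) / (2 * eps)
d2_numeric = (phi(z + eps) - 2 * phi(z) + phi(z - eps)) / eps**2

d1_formula = phi(z) * (1 - phi(z))
d2_formula = phi(z) * (1 - phi(z)) * (1 - 2 * phi(z))

print(np.isclose(d1_numeric, d1_formula))  # True
print(np.isclose(d2_numeric, d2_formula))  # True
```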
(b)
We have \(\ell(y| \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x})) = -y\boldsymbol{x}^\intercal\boldsymbol{\beta}- \log(1 - \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}))\). Now,
\[ \begin{aligned} \frac{\partial}{\partial \zeta} \log(1 - \phi(\zeta)) ={}& - \frac{\partial \phi(\zeta) / \partial \zeta}{1 - \phi(\zeta)} \\={}& - \frac{\phi(\zeta)(1 - \phi(\zeta))}{1 - \phi(\zeta)} \\={}& - \phi(\zeta). \end{aligned} \]
So, by the chain rule,
\[ \begin{aligned} \frac{\partial \ell(y| \phi(\boldsymbol{\beta}^\intercal\boldsymbol{x}))}{\partial \boldsymbol{\beta}} ={}& -y\boldsymbol{x}+ \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta}) \boldsymbol{x} \\={}& -(y- \phi(\boldsymbol{x}^\intercal\boldsymbol{\beta})) \boldsymbol{x}. \end{aligned} \]
Combining,
\[ \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = - \sum_{n=1}^N(y_n - \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})) \boldsymbol{x}_n. \]
(c)
From (a) and (b), letting \(\zeta_n = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\), we have
\[ \begin{aligned} \frac{\partial^2 \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal} ={}& \sum_{n=1}^N\frac{\partial \phi(\zeta_n)}{\partial \zeta_n} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal \\={}& \sum_{n=1}^N(1 - \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})) \phi(\boldsymbol{x}_n^\intercal\boldsymbol{\beta})\boldsymbol{x}_n \boldsymbol{x}_n^\intercal. \end{aligned} \]
(The sign is positive: differentiating the term \(-(y_n - \phi(\zeta_n))\) in the gradient with respect to \(\boldsymbol{\beta}\) gives \(+\frac{\partial \phi(\zeta_n)}{\partial \zeta_n} \boldsymbol{x}_n\).)
(d)
\[ \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \]
(Which is a pretty expression.)
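As an optional sanity check (not part of the solution), here is a finite-difference comparison against this matrix-form gradient, using synthetic data and an arbitrary evaluation point:

```python
import numpy as np

# Check that dL/dbeta = -X^T (y - phi(X beta)) matches a central finite difference.
rng = np.random.default_rng(2)
N, P = 40, 3
X = rng.normal(size=(N, P))
phi = lambda z: 1.0 / (1.0 + np.exp(-z))
y = rng.binomial(1, phi(X @ np.array([0.5, -1.0, 2.0])))

def loss(beta):
    p = phi(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = rng.normal(size=P)
grad_formula = -X.T @ (y - phi(X @ beta))

eps = 1e-6
grad_numeric = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                         for e in np.eye(P)])
print(np.allclose(grad_formula, grad_numeric))  # True
```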
(e) (254 only \(\star\) \(\star\) \(\star\))
One option is gradient descent. I'll be very brief (you should be more detailed). Starting at \(\boldsymbol{\beta}_t\), compute \(\boldsymbol{\varepsilon}_t = \boldsymbol{Y}- \phi(\boldsymbol{X}\boldsymbol{\beta}_t)\) and the gradient \(\gv_t = -\boldsymbol{X}^\intercal\boldsymbol{\varepsilon}_t\) from (d). Then take \(\boldsymbol{\beta}_{t+1} = \boldsymbol{\beta}_{t} - \delta \gv_t\) for some small step size \(\delta > 0\). Terminate when \(\left\Vert\gv_t\right\Vert\) is small; a minimal concrete version is sketched below.
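A minimal runnable sketch of this procedure, assuming a fixed step size and a gradient-norm stopping rule (both arbitrary choices):

```python
import numpy as np

def fit_logistic(X, y, step=1e-3, tol=1e-6, max_iter=100_000):
    """Gradient descent on the logistic loss, using the gradient from (d)."""
    phi = lambda z: 1.0 / (1.0 + np.exp(-z))
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        resid = y - phi(X @ beta)       # epsilon_t
        grad = -X.T @ resid             # dL/dbeta at beta_t
        if np.linalg.norm(grad) < tol:  # terminate when the gradient is small
            break
        beta = beta - step * grad       # gradient descent step
    return beta
```

Newton's method, which uses the Hessian from (c) in place of the fixed step size, would typically converge in far fewer iterations.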
(f) (254 only \(\star\) \(\star\) \(\star\))
Since \(\phi(\zeta_n) (1 - \phi(\zeta_n)) > 0\), and since \(\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) is a positive semi–definite matrix for each \(\boldsymbol{x}_n\), it follows that the second derivative of \(\mathcal{L}(\boldsymbol{\beta})\) is positive semi–definite, so \(\mathcal{L}(\boldsymbol{\beta})\) is convex, and so any stationary point is a minimizer.
6 Projections (Yu (2022) HW 1.3)
(a)
For any vector \(\boldsymbol{x}=\left(x_{1}, x_{2}, x_{3}\right)^{\intercal} \in \mathbb{R}^{3}\), what is its projection on the vector \((1,0,0)^{\intercal}\) ?
(b)
What is the orthogonal projection of \(\boldsymbol{x}=\left[x_{1}, x_{2}, x_{3}\right]^{\intercal} \in \mathbb{R}^{3}\) on the subspace spanned by the vectors \([1,0,0]^{\intercal}\) and \([0,1,0]^{\intercal}\) ?
(c)
Write the projection matrices in the previous two cases.
(d)
Write the expression for the orthogonal projection of a vector \(\boldsymbol{x}\in \mathbb{R}^{d}\) along any given vector \(\boldsymbol{a}\in \mathbb{R}^{d}\). What is the projection matrix in this case?
(e)
Given two orthogonal vectors \(\boldsymbol{a}_{1}\) and \(\boldsymbol{a}_{2}\) in \(\mathbb{R}^{d}\), what is the orthognal projection of a generic vector \(\boldsymbol{x}\) on to the subspace spanned by these two vectors?
(f)
Suppose that the two vectors \(\boldsymbol{a}_{1}\) and \(\boldsymbol{a}_{2}\) are not orthogonal. How will you compute the orthogonal projection of \(\boldsymbol{x}\) in this case? It may be useful to revise Gram Schmidt Orthogonalization.
(g)
Generalize the answer from (f) to the case to compute the orthogonal projection along a k-dimensional subspace spanned by the vectors \(\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{k}\) which need not be orthogonal.
(h)
Define the matrix
\[ \boldsymbol{A}=\left[\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{k}\right] \in \mathbb{R}^{D \times K} \]
such that the columns are linearly independent. Prove that the orthogonal projection of any vector \(\boldsymbol{x}\in \mathbb{R}^{d}\) onto the k -dimensional subspace spanned by the vectors \(\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{k}\) is given by
\[ \begin{equation*} \mathbf{A}\left(\mathbf{A}^{\intercal} \mathbf{A}\right)^{-1} \mathbf{A}^{\intercal} \boldsymbol{x}. \tag{1} \end{equation*} \]
Hint: Consider the least squares problem \(\min_{\boldsymbol{\beta}}\|\mathbf{A \boldsymbol{\beta}}-\boldsymbol{x}\|_{2}^{2}\).
(i)
How does the expression \(\mathbf{A}\left(\mathbf{A}^{\intercal} \mathbf{A}\right)^{-1} \mathbf{A}^{\intercal}\) simplify if the vectors \(\mathbf{a}_{1}, \ldots, \mathbf{a}_{k}\) are orthogonal?
(a)
It is \((x_1, 0, 0)\).
(b)
It is \((x_1, x_2, 0)\).
(c)
\[ P_a = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{pmatrix} \quad\quad P_b = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ \end{pmatrix} \]
(d)
It is \(\frac{\boldsymbol{x}^\intercal\boldsymbol{a}}{\boldsymbol{a}^\intercal\boldsymbol{a}} \boldsymbol{a}\) with projection matrix \(\boldsymbol{a}(\boldsymbol{a}^\intercal\boldsymbol{a})^{-1} \boldsymbol{a}^\intercal\).
(e)
It is \(\boldsymbol{a}_1 (\boldsymbol{a}_1^\intercal\boldsymbol{x}) / (\boldsymbol{a}_1^\intercal\boldsymbol{a}_1) + \boldsymbol{a}_2 (\boldsymbol{a}_2^\intercal\boldsymbol{x}) / (\boldsymbol{a}_2^\intercal\boldsymbol{a}_2)\).
(f)
You can replace \(\boldsymbol{a}_2\) with \(\tilde{\boldsymbol{a}}_2 = \boldsymbol{a}_2 - \frac{\boldsymbol{a}_1^\intercal\boldsymbol{a}_2}{\boldsymbol{a}_1^\intercal\boldsymbol{a}_1} \boldsymbol{a}_1\), which is orthogonal to \(\boldsymbol{a}_1\) and spans the same subspace together with \(\boldsymbol{a}_1\), and then apply the formula from (e) to \(\boldsymbol{a}_1\) and \(\tilde{\boldsymbol{a}}_2\).
(g)
You can construct an orthogonal basis \(\tilde{\boldsymbol{a}}_1, \ldots, \tilde{\boldsymbol{a}}_k\) by successively orthogonalizing (Gram–Schmidt): subtract from each \(\boldsymbol{a}_j\) its projection onto the span of the previously orthogonalized vectors, and then apply the formula from (e) term by term.
(h)
Minimizing \(\|\mathbf{A}\boldsymbol{\beta}-\boldsymbol{x}\|_{2}^{2}\) over \(\boldsymbol{\beta}\) gives the normal equations \(\mathbf{A}^{\intercal}\mathbf{A}\boldsymbol{\beta}= \mathbf{A}^{\intercal}\boldsymbol{x}\), so \(\hat{\boldsymbol{\beta}} = (\mathbf{A}^{\intercal}\mathbf{A})^{-1}\mathbf{A}^{\intercal}\boldsymbol{x}\), where the inverse exists because the columns of \(\mathbf{A}\) are linearly independent. The fitted vector \(\mathbf{A}\hat{\boldsymbol{\beta}} = \mathbf{A}(\mathbf{A}^{\intercal}\mathbf{A})^{-1}\mathbf{A}^{\intercal}\boldsymbol{x}\) is the point in the column span of \(\mathbf{A}\) closest to \(\boldsymbol{x}\), which is exactly the orthogonal projection; see the numerical check below.
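A small numerical check (optional; \(\mathbf{A}\) and \(\boldsymbol{x}\) below are arbitrary) that the least-squares fit equals the projection formula (1), and that the residual is orthogonal to every column of \(\mathbf{A}\):

```python
import numpy as np

# Compare A (A^T A)^{-1} A^T x with the least-squares fit A beta_hat.
rng = np.random.default_rng(3)
D, K = 6, 3
A = rng.normal(size=(D, K))
x = rng.normal(size=D)

P_proj = A @ np.linalg.inv(A.T @ A) @ A.T
proj = P_proj @ x

beta_hat, *_ = np.linalg.lstsq(A, x, rcond=None)
print(np.allclose(proj, A @ beta_hat))   # True
print(np.allclose(A.T @ (x - proj), 0))  # residual is orthogonal to the columns of A
```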
(i)
How does the expression \(\mathbf{A}\left(\mathbf{A}^{\intercal} \mathbf{A}\right)^{-1} \mathbf{A}^{\intercal}\) simplify if the vectors \(\mathbf{a}_{1}, \ldots, \mathbf{a}_{k}\) are orthogonal?
In that case, \(\boldsymbol{A}^\intercal\boldsymbol{A}\) is diagonal with entries \(\boldsymbol{a}_j^\intercal\boldsymbol{a}_j\), so the projection matrix simplifies to the sum of one–dimensional projections, \(\sum_{j=1}^k \boldsymbol{a}_j \boldsymbol{a}_j^\intercal / (\boldsymbol{a}_j^\intercal\boldsymbol{a}_j)\). If the vectors are additionally orthonormal, it is simply \(\boldsymbol{A}\boldsymbol{A}^\intercal\).
7 Constrained optimization
The purpose of this exercise is to justify the method of Lagrange multipliers in a special case that will come up a lot in this class. You should have seen Lagrange multipliers for constrained optimization in multivariate calculus before, though in this problem we will be particularly careful to justify the method.
Consider the objective function \[ \mathscr{L}(\theta) = \frac{1}{2}\boldsymbol{\theta}^\intercal\boldsymbol{A}\boldsymbol{\theta}+ \boldsymbol{b}^\intercal\boldsymbol{\theta}+ c \] for \(\boldsymbol{\theta}\in \mathbb{R}^{P}\), a symmetric, positive definite matrix \(\boldsymbol{A}\), a given vector \(\boldsymbol{b}\in \mathbb{R}^{P}\), and a constant \(c\). We want to find the minimum of \(\mathscr{L}(\theta)\) among vectors with squared norm less than some fixed, known \(t\). Let \(D = \{\theta: \left\Vert\boldsymbol{\theta}\right\Vert_2^2 \le t\}\) where \(\left\Vert\boldsymbol{\theta}\right\Vert_2^2 := \boldsymbol{\theta}^\intercal\boldsymbol{\theta}\). We want to find the solution to the “primal problem”
\[ M_p := \inf_{\theta \in D} \mathscr{L}(\theta). \]
We will also consider the “dual problem” \[ \begin{aligned} \mathscr{L}(\boldsymbol{\theta}, \lambda) :={}& \mathscr{L}(\boldsymbol{\theta}) + \lambda(\left\Vert\boldsymbol{\theta}\right\Vert_2^2 - t) \\ M_d :={}& \sup_{\lambda \ge 0} \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda). \end{aligned} \]
The dual problem is easier than the primal problem because it involves an infimum over all \(\boldsymbol{\theta}\in \mathbb{R}^{P}\) rather than over a constrained set. The term \(\lambda\) is called the “Lagrange multiplier.” Note that, in the dual problem, \(\lambda\) is constrained to be nonnegative.
The goal of this exercise is to show carefully that \(M_p = M_d\) in this case, so we can safely solve the dual problem instead of the primal problem.
(a) Find a closed–form expression for the unconstrained optimization problem \(\hat{\boldsymbol{\theta}}^{UC} := \underset{\boldsymbol{\theta}\in \mathbb{R}^{P}}{\mathrm{argmin}}\, \mathscr{L}(\boldsymbol{\theta})\) in terms of \(\boldsymbol{A}\), \(\boldsymbol{b}\), and \(c\).
(b) Assuming that \(\lambda > 0\), find a closed–form expression for \(\hat{\boldsymbol{\theta}}(\lambda) := \underset{\boldsymbol{\theta}\in \mathbb{R}^{P}}{\mathrm{argmin}}\, \mathscr{L}(\boldsymbol{\theta}, \lambda)\). In particular, show carefully that we can safely write \(\hat{\boldsymbol{\theta}}(\lambda)\) instead of \(\hat{\boldsymbol{\theta}}(\lambda, t)\) since the optimum does not depend on \(t\).
(c) Show that the following two problems are equivalent:
\[ \inf_{\boldsymbol{\theta}\in D} \mathscr{L}(\theta) = \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda). \]
Hint: Find an explicit expression for \(\sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda)\) in the case where when the constraint \(\left\Vert\boldsymbol{\theta}\right\Vert_2^2 \le t\) is violated and in the case when it is satisfied. You should find that, in the latter problem, you pay “infinite cost” for violating the constraint. Then consider what happens when \(\boldsymbol{\theta}\) does not violate the constraint, and show that at least one such \(\boldsymbol{\theta}\) exists.
(d) Using (c), show that
\[ M_d = \sup_{\lambda \ge 0} \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda) \le \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda) = \inf_{\boldsymbol{\theta}\in D} \mathscr{L}(\theta) = M_p. \]
The difference between \(\sup_{\lambda \ge 0} \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda)\) and \(\inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda)\) is called the “duality gap.” In order to show that \(M_d = M_p\), we need to show that the duality gap is zero. In other words, we need to show that we can safely exchange the order of the supremum and infimum in this problem.
(e) To show that the duality gap is zero, we will use part of an optimization result called the “saddle point theorem.” We now reproduce part of the proof of the saddle point theorem. Suppose we have \(\hat{\boldsymbol{\theta}}\) and \(\hat{\lambda}\) such that \[ \mathscr{L}(\hat{\boldsymbol{\theta}}, \lambda) \le \mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda}) \le \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda}) \] for all \(\boldsymbol{\theta}\in \mathbb{R}^{P}\) and all \(\lambda \ge 0\). Such a pair \(\hat{\boldsymbol{\theta}}, \hat{\lambda}\) is called a “saddle point.” When a saddle point exists, the duality gap is zero, since
\[ M_p = \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda) \le \sup_{\lambda \ge 0} \mathscr{L}(\hat{\boldsymbol{\theta}}, \lambda) \le \mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda}) \le \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda}) \le \sup_{\lambda \ge 0} \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda) = M_d. \]
For this exercise, verify each inequality in the preceding expression.
The remainder of the problem is showing that there exists a saddle point for this problem, and so \(M_p = M_d\).
(f) (254 only \(\star\) \(\star\) \(\star\)) By (e) it suffices to find a saddle point. We will split the problem into two cases. First, show that if \(\hat{\boldsymbol{\theta}}^{UC}\) from (a) has squared norm less than \(t\), then \(\hat{\theta}= \hat{\boldsymbol{\theta}}^{UC}\) and \(\hat{\lambda}= 0\) is a saddle point.
(g) (254 only \(\star\) \(\star\) \(\star\)) Now, suppose that \(\hat{\boldsymbol{\theta}}^{UC}\) from (a) has squared norm greater than \(t\). Assume that there exists a \(\hat{\lambda}\) such that \(\hat{\boldsymbol{\theta}}(\hat{\lambda})\) from (b) satisfies \(\left\Vert\hat{\boldsymbol{\theta}}(\hat{\lambda})\right\Vert_2^2 = t\). Under this assumption, show that \(\hat{\boldsymbol{\theta}}= \hat{\boldsymbol{\theta}}(\hat{\lambda})\) and \(\hat{\lambda}\) are a saddle point.
(h) (254 only \(\star\) \(\star\) \(\star\)) Show carefully that there exists a \(\hat{\lambda}\) such that \(\hat{\boldsymbol{\theta}}(\hat{\lambda})\) from (b) satisfies \(\left\Vert\hat{\boldsymbol{\theta}}(\hat{\lambda})\right\Vert_2^2 = t\). Hint: You may argue by continuity of the matrix inverse, or you may get a more precise result by considering the eigenvalue decomposition of \(A\).
(i) Conclude that, whether we are in case (g) or (h), that \(\hat{\boldsymbol{\theta}}(\hat{\lambda}) \in \underset{\boldsymbol{\theta}\in D}{\mathrm{argmin}}\, \mathscr{L}(\boldsymbol{\theta})\), so \(\hat{\boldsymbol{\theta}}(\hat{\lambda})\) is a solution to the primal problem. (For 154 students, you may take the conclusions of the 254 questions for granted.)
(j) (254 only \(\star\) \(\star\) \(\star\)) (Extra) Find an explicit expression for \(\hat{\lambda}\) in case (h). Hint: Use the first–order condition for \(\underset{\boldsymbol{\theta}}{\mathrm{argmin}}\, \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda})\), together with the fact that \(\left\Vert\hat{\boldsymbol{\theta}}(\hat{\lambda})\right\Vert_2^2 = t\). In light of this expression, interpret the constraint \(\hat{\lambda}\ge 0\).
(a) The derivative is given by
\[ \frac{\partial \mathscr{L}(\theta)}{\partial \theta} = \boldsymbol{A}\boldsymbol{\theta}+ \boldsymbol{b}. \]
Setting this equal to zero and solving gives \(\hat{\boldsymbol{\theta}}^{UC} = -\boldsymbol{A}^{-1} \boldsymbol{b}\), which exists by invertibility of \(\boldsymbol{A}\), and is the unique minimizer since \(\boldsymbol{A}\) is positive definite.
(b)
The derivative is given by
\[ \frac{\partial \mathscr{L}(\boldsymbol{\theta}, \lambda)}{\partial \theta} = \boldsymbol{A}\boldsymbol{\theta}+ \boldsymbol{b}+ 2 \lambda \boldsymbol{\theta}. \]
Again setting this to zero and solving gives
\[ \hat{\boldsymbol{\theta}}(\lambda) = -(\boldsymbol{A}+ 2 \lambda {\boldsymbol{I}_{\,}})^{-1} \boldsymbol{b}, \]
where the inverse exists since \(\boldsymbol{A}\) is positive definite and \(\lambda \ge 0\), so that \(\boldsymbol{A}+ 2 \lambda {\boldsymbol{I}_{\,}}\) is positive definite; for the same reason, this stationary point is the unique global minimum. Since the value of \(t\) does not affect the derivative, the solution does not depend on \(t\).
(c) Suppose that \(\theta \notin D\), so that \(\left\Vert\theta\right\Vert_2^2 > t\). In that case, \(\sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda) = \infty\), since \(\left\Vert\theta\right\Vert_2^2 - t> 0\) and so making \(\lambda\) larger makes \(\mathscr{L}(\boldsymbol{\theta}, \lambda)\) grow without bound. On the other hand, if \(\theta \in D\), so that \(\left\Vert\theta\right\Vert_2^2 \le t\), then \(\sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda) = \mathscr{L}(\boldsymbol{\theta})\), which is achieved at \(\lambda = 0\) and cannot be exceeded since the quantity multiplying \(\lambda\) is \(\left\Vert\theta\right\Vert_2^2 - t\le 0\).
Since the loss is infinite for having \(\theta \notin D\), and (by (a)) finite for \(\theta \in D\), the outer infimum is the same as taking the infimum over \(\theta \in D\).
(d) From (c), we have that
\[ \inf_{\boldsymbol{\theta}\in \mathbb{R}^{P}} \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}, \lambda) = \inf_{\boldsymbol{\theta}\in D} \mathscr{L}(\theta) = M_p. \]
Then, by definition of the infimum and supremum, we have
\[ \begin{aligned} \inf_{\theta \in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda) \le{}& \mathscr{L}(\boldsymbol{\theta}', \lambda) & \textrm{for any }\lambda, \boldsymbol{\theta}' \Rightarrow\\ \sup_{\lambda \ge 0} \inf_{\theta \in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda) \le{}& \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}', \lambda) & \textrm{for any }\boldsymbol{\theta}' \Rightarrow\\ \sup_{\lambda \ge 0} \inf_{\theta \in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda) \le{}& \inf_{\theta \in \mathbb{R}^{P}} \sup_{\lambda \ge 0} \mathscr{L}(\boldsymbol{\theta}', \lambda). \end{aligned} \]
The result follows.
(e) We start from the center and go outwards.
\[ \begin{aligned} \mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda}) \le{}& \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda}) \textrm{ for all }\boldsymbol{\theta} & \textrm{(saddle point property) } \Rightarrow \\ \mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda}) \le{}& \inf_{\theta \in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda}) & \textrm{(definition of inf) } \Rightarrow \\ \inf_{\theta \in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda}) \le{}& \sup_{\lambda \ge 0} \inf_{\theta \in \mathbb{R}^{P}} \mathscr{L}(\boldsymbol{\theta}, \lambda) & \textrm{(definition of sup) }. \end{aligned} \]
The left hand inequalities follow similarly.
(f) We already argued in (c) that if \(\hat{\boldsymbol{\theta}}\in D\), then over \(\lambda \ge 0\), \(\mathscr{L}(\hat{\boldsymbol{\theta}}, \lambda)\) is maximized at \(\lambda = 0\). It follows that \(\mathscr{L}(\hat{\boldsymbol{\theta}}^{UC}, \lambda) \le \mathscr{L}(\hat{\boldsymbol{\theta}}^{UC}, \hat{\lambda})\) where \(\hat{\lambda}= 0\). And when \(\hat{\lambda}= 0\), \(\mathscr{L}(\boldsymbol{\theta}, \hat{\lambda}) = \mathscr{L}(\boldsymbol{\theta})\), which is minimized at \(\hat{\boldsymbol{\theta}}^{UC}\), from which we have \(\mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda}) \le \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda})\).
(g) The saddle point property follows almost immediately from our key assumption that there exists \(\hat{\lambda}\) and \(\hat{\boldsymbol{\theta}}= \hat{\boldsymbol{\theta}}(\hat{\lambda})\) with \(\left\Vert\hat{\boldsymbol{\theta}}\right\Vert_2^2 = t\). It follows that \(\mathscr{L}(\hat{\boldsymbol{\theta}}, \lambda) = \mathscr{L}(\hat{\boldsymbol{\theta}}) + \lambda (\left\Vert\hat{\boldsymbol{\theta}}\right\Vert_2^2 - t) = \mathscr{L}(\hat{\boldsymbol{\theta}})\), which does not depend on \(\lambda\), so \(\mathscr{L}(\hat{\boldsymbol{\theta}}, \lambda) \le \mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda})\) trivially. We have also already argued that when \(\lambda = \hat{\lambda}\), \(\mathscr{L}(\boldsymbol{\theta}, \hat{\lambda})\) is minimized over \(\boldsymbol{\theta}\) at \(\hat{\boldsymbol{\theta}}\), from which \(\mathscr{L}(\hat{\boldsymbol{\theta}}, \hat{\lambda}) \le \mathscr{L}(\boldsymbol{\theta}, \hat{\lambda})\).
(h) We showed in (b) that \(\hat{\boldsymbol{\theta}}(\lambda) = -(\boldsymbol{A}+ 2 \lambda {\boldsymbol{I}_{\,}}{})^{-1} \boldsymbol{b}\). We have three facts:
- By assumption, \(\left\Vert\boldsymbol{A}^{-1} \boldsymbol{b}\right\Vert_2^2 > t\),
- \(\left\Vert(\boldsymbol{A}+ 2 \lambda {\boldsymbol{I}_{\,}}{})^{-1} \boldsymbol{b}\right\Vert_2^2 \rightarrow 0\) as \(\lambda \rightarrow \infty\), and
- \(\lambda \mapsto \left\Vert(\boldsymbol{A}+ 2 \lambda {\boldsymbol{I}_{\,}}{})^{-1} \boldsymbol{b}\right\Vert_2^2\) is continuous.
It follows from the intermediate value theorem that there must be some \(\hat{\lambda}> 0\) with \(\left\Vert(\boldsymbol{A}+ 2 \hat{\lambda}{\boldsymbol{I}_{\,}}{})^{-1} \boldsymbol{b}\right\Vert_2^2 = t\); a numerical illustration is sketched below.
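For concreteness, here is a bisection sketch that finds such a \(\hat{\lambda}\) numerically; the particular \(\boldsymbol{A}\), \(\boldsymbol{b}\), and \(t\) below are arbitrary choices with \(\left\Vert\boldsymbol{A}^{-1}\boldsymbol{b}\right\Vert_2^2 > t\).

```python
import numpy as np

# Bisect on lambda until || (A + 2 lambda I)^{-1} b ||^2 = t, using the fact that
# this squared norm is continuous and decreasing in lambda.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite
b = np.array([2.0, -1.0])
t = 0.1

def norm2(lam):
    theta = -np.linalg.solve(A + 2 * lam * np.eye(2), b)  # theta_hat(lambda) from (b)
    return theta @ theta

assert norm2(0.0) > t          # the unconstrained optimum lies outside D

lo, hi = 0.0, 1.0
while norm2(hi) > t:           # find an upper bracket
    hi *= 2
for _ in range(100):           # bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if norm2(mid) > t else (lo, mid)

lam_hat = 0.5 * (lo + hi)
print(lam_hat, norm2(lam_hat))  # norm2(lam_hat) ~= t
```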
(i) We have shown that there exists a saddle point no matter what, and the result follows from the arguments above.
(j) We know that \(\hat{\boldsymbol{\theta}}\) satisfies the first–order condition
\[ \nabla \mathscr{L}(\hat{\theta}) + 2 \hat{\lambda}\hat{\boldsymbol{\theta}}= \boldsymbol{0}, \]
where \(\nabla \mathscr{L}(\theta)\) is the gradient of \(\mathscr{L}(\theta)\), and is given in (a). Since \(\left\Vert\hat{\boldsymbol{\theta}}\right\Vert_2^2 = \hat{\boldsymbol{\theta}}^\intercal\hat{\boldsymbol{\theta}}= t\),
\[ \hat{\boldsymbol{\theta}}^\intercal\nabla \mathscr{L}(\hat{\theta}) + 2 \hat{\lambda}t= 0 \quad \Rightarrow \quad \hat{\lambda}= -\frac{\hat{\boldsymbol{\theta}}^\intercal\nabla \mathscr{L}(\hat{\theta})}{2 t} \ge 0. \]
Since \(t> 0\), this means that \(\hat{\boldsymbol{\theta}}^\intercal\nabla \mathscr{L}(\hat{\theta}) \le 0\). One possibility is that \(\nabla \mathscr{L}(\hat{\theta}) = \boldsymbol{0}\), meaning \(\hat{\boldsymbol{\theta}}\) is just an ordinary minimum and \(\hat{\lambda}= 0\). Alternatively, if \(\hat{\lambda}> 0\) strictly, then the loss function must be decreasing in the direction of \(\hat{\boldsymbol{\theta}}\), meaning the loss could be decreased further if we allowed \(\hat{\boldsymbol{\theta}}\) to have a larger norm. This is consistent with our conclusion above that \(\hat{\lambda}> 0\) only when the unconstrained minimum occurs outside the set \(D\).