
Homework 2
Stat 154/254: Statistical Machine Learning
Due Sunday October 5th (9pm)
1 Risk minimizers for general zero–one loss
Take \(y\in \{0, 1\}\), and consider the general misclassification loss
\[ \mathscr{L}(\hat{y}, y) = c_{01} \mathrm{I}\left(\hat{y}=0, y=1\right) + c_{10} \mathrm{I}\left(\hat{y}=1, y=0\right) - c_{00} \mathrm{I}\left(\hat{y}=0, y=0\right) - c_{11} \mathrm{I}\left(\hat{y}=1, y=1\right), \]
for positive \(c_{00},c_{11},c_{01},c_{10}\). In terms of these constants and \(\mathbb{P}_{\,}\left(y| \boldsymbol{x}\right)\), find a formula for the optimal classifier, \(f^{\star}(\boldsymbol{x}) = \underset{f}{\mathrm{argmin}}\, \mathbb{E}_{\mathbb{P}_{\,}\left(\boldsymbol{x}, y\right)}\left[\mathscr{L}(f(\boldsymbol{x}), y) \right]\).
The risk is \(\mathbb{E}_{\mathbb{P}_{\,}\left(\boldsymbol{x}, y\right)}\left[\mathscr{L}(f(\boldsymbol{x}), y)\right] = \mathbb{E}_{\mathbb{P}_{\,}\left(\boldsymbol{x}\right)}\left[\mathbb{E}_{\mathbb{P}_{\,}\left(y| \boldsymbol{x}\right)}\left[\mathscr{L}(f(\boldsymbol{x}), y)\right]\right]\). Also, we can write \(\mathrm{I}\left(f(\boldsymbol{x})=0, y=1\right) = \mathrm{I}\left(f(\boldsymbol{x}) = 0\right) \mathrm{I}\left(y= 1\right)\), and so on. Expanding only the inner expectation, and writing \(\pi(\boldsymbol{x}) = \mathbb{P}_{\,}\left(y= 1 \vert \boldsymbol{x}\right)\) for compactness, then gathering terms of the form \(\mathrm{I}\left(f(\boldsymbol{x}) = c\right)\),
\[ \begin{aligned} \mathbb{E}_{\mathbb{P}_{\,}\left(y| \boldsymbol{x}\right)}\left[\mathscr{L}(f(\boldsymbol{x}), y)\right] ={}& c_{01} \mathrm{I}\left(f(\boldsymbol{x})=0\right) \pi(\boldsymbol{x}) + c_{10} \mathrm{I}\left(f(\boldsymbol{x})=1\right) (1 - \pi(\boldsymbol{x})) - \\& c_{00} \mathrm{I}\left(f(\boldsymbol{x})=0\right) (1 - \pi(\boldsymbol{x})) - c_{11} \mathrm{I}\left(f(\boldsymbol{x})=1\right) \pi(\boldsymbol{x}) \\={}& \mathrm{I}\left(f(\boldsymbol{x})=0\right) \left(c_{01} \pi(\boldsymbol{x}) - c_{00} (1 - \pi(\boldsymbol{x})) \right) + \\& \mathrm{I}\left(f(\boldsymbol{x})=1\right) \left(c_{10} (1 - \pi(\boldsymbol{x})) - c_{11} \pi(\boldsymbol{x}) \right). \end{aligned} \]
For any particular \(\boldsymbol{x}\), \(f^{\star}\) chooses the term with the lower loss. That is,
\[ \begin{aligned} f^{\star}(\boldsymbol{x}) ={}& \mathrm{I}\left(c_{10} (1 - \pi(\boldsymbol{x})) - c_{11} \pi(\boldsymbol{x}) \le c_{01} \pi(\boldsymbol{x}) - c_{00} (1 - \pi(\boldsymbol{x}))\right). \end{aligned} \]
In this expression, the \(\le\) may be replaced with \(<\), since when the two losses are equal the optimal classifier is not uniquely defined.
This expression can be solved for \(\pi(\boldsymbol{x})\), giving a threshold in terms of the various costs.
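Explicitly: \(f^{\star}(\boldsymbol{x}) = 1\) exactly when \(c_{10} (1 - \pi(\boldsymbol{x})) - c_{11} \pi(\boldsymbol{x}) \le c_{01} \pi(\boldsymbol{x}) - c_{00} (1 - \pi(\boldsymbol{x}))\), i.e., when \((c_{00} + c_{10})(1 - \pi(\boldsymbol{x})) \le (c_{01} + c_{11}) \pi(\boldsymbol{x})\), so
\[ f^{\star}(\boldsymbol{x}) = \mathrm{I}\left(\pi(\boldsymbol{x}) \ge \frac{c_{00} + c_{10}}{c_{00} + c_{01} + c_{10} + c_{11}}\right). \]
For example, when \(c_{01} = c_{10}\) and \(c_{00} = c_{11}\), the threshold is \(1/2\), recovering the usual Bayes classifier.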
2 ROC curves of bad classifiers
(a)
Let \(x\sim \mathcal{N}\left(0, 1\right)\) be drawn independently of \(y\in \{0, 1\}\). Plot the ROC curve for the family of classifiers \(\hat{f}(x| t) = \mathrm{I}\left(x> t\right)\) indexed by \(t\in \mathbb{R}^{}\).
(b)
Suppose you have two not–so–great classifiers, A and B, with the following ROC curves. How can you combine them to get a better one?
(c)
Suppose you have a terrible classifier with the following ROC curve. How can you use it to make a really good classifier?

(a) The plot should be a forty–five degree line — at any threshold \(t\), both the true positive rate and the false positive rate equal \(p(x> t)\), since \(x\) is independent of \(y\).
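A quick simulation sketch in R confirms this (the sample size and threshold grid here are arbitrary choices):

```r
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rbinom(n, 1, 0.5)  # independent of x
ts <- seq(-3, 3, length.out = 101)
tpr <- sapply(ts, function(t) mean(x[y == 1] > t))  # estimates P(x > t | y = 1)
fpr <- sapply(ts, function(t) mean(x[y == 0] > t))  # estimates P(x > t | y = 0)
plot(fpr, tpr, type = "l", asp = 1)  # lies on the 45-degree line, up to noise
abline(0, 1, lty = 2)
```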
(b) I’ll give two possible answers. Full credit for the homework can be given for either solution or for variants — the main idea is to combine classifiers to get a better one.
One solution is to find the point where classifier A’s and classifier B’s ROC curves intersect, corresponding to thresholds \(t_A\) and \(t_B\), respectively. Then you could make a family of classifiers that uses classifier \(A\) when \(t < t_A\) and classifier \(B\) when \(t > t_B\) — that is, whichever classifier is better at each operating point.
A better solution is to use randomized classifiers to form the convex hull of the ROC curves. For any thresholds \(t_A\) and \(t_B\) (possibly distinct from the preceding paragraph), you can classify according to \(A\) with probability \(p\) and \(B\) with probability \(1-p\). The TPR and FPR of the randomized classifier lie a proportion \(p\) along the line between the two points on the ROC curves. You can thus form a feasible classifier that connects any pair of points on the two ROC curves; applied to all pairs of points, it follows that there exist randomized classifiers achieving any point in the convex hull of the two ROC curves. An example is shown in purple between the two “humps” of the ROC curves:

(c)
Just flip the classifier — take \(\hat{y}' = \mathrm{I}\left(\hat{y}= 0\right)\). This reflects each point \((\mathrm{FPR}, \mathrm{TPR})\) of the ROC curve to \((1 - \mathrm{FPR}, 1 - \mathrm{TPR})\), producing a classifier that is much better than chance, unlike the original, which is worse than chance.
3 The behavior of logistic regression with a separating hyperplane
Consider the logistic regression loss \(\hat{\mathscr{R}}(\boldsymbol{\beta})\) with regressors \(\boldsymbol{x}_n\) and binary responses \(y_n\). Suppose that the data is separable, meaning there exists a \(\boldsymbol{\beta}^{*}\) that classifies the data perfectly:
\[ y_n = 1 \textrm{ if and only if }{\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n > 0. \]
(a)
Show that the data is also perfectly classified by \(C \boldsymbol{\beta}^{*}\) for any constant \(C > 0\).
(b)
Write an expression for the loss \(\hat{\mathscr{R}}(C \boldsymbol{\beta}^{*})\). What happens to the loss as \(C \rightarrow \infty\)?
(c)
Can there be any solution to the equation \(\partial \hat{\mathscr{R}}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}= \boldsymbol{0}\)? (Recall that \(\hat{\mathscr{R}}(\boldsymbol{\beta})\) is convex, and so any such solution would be a global minimum.)
(d)
What will happen if you try to numerically optimize \(\hat{\mathscr{R}}(\boldsymbol{\beta})\), for example by gradient descent?
(a)
\[ {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n > 0 \quad\Leftrightarrow\quad C {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n > 0 \]
because \(C > 0\).
(b)
Let \(\pi^*_n = \exp(C \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}) / (1 + \exp(C \boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}))\). As \(C \rightarrow \infty\), \(\pi^*_n \rightarrow 1\) if \(\boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}> 0\) and \(\pi^*_n \rightarrow 0\) if \(\boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}< 0\). \[ \begin{aligned} \hat{\mathscr{R}}(C \boldsymbol{\beta}^{*}) ={}& - \frac{1}{N} \sum_{n=1}^N\left(y_n \log \pi^*_n + (1 - y_n) \log(1 - \pi^*_n) \right) \\={}& - \frac{1}{N}\sum_{n:y_n = 1} \log \pi^*_n - \frac{1}{N}\sum_{n:y_n = 0} \log(1 - \pi^*_n) \\={}& - \frac{1}{N}\sum_{n:\boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}> 0} \log \pi^*_n - \frac{1}{N}\sum_{n:\boldsymbol{x}_n^\intercal\boldsymbol{\beta}^{*}< 0} \log(1 - \pi^*_n), \end{aligned} \]
where the final line follows from the fact that \(y_n\) is perfectly classified by \(C \boldsymbol{\beta}^{*}\). As a consequence, \(\lim_{C \rightarrow \infty} \hat{\mathscr{R}}(C \boldsymbol{\beta}^{*}) = 0\), since each term tends to \(\log 1 = 0\), but \(\hat{\mathscr{R}}(C \boldsymbol{\beta}^{*}) > 0\) for any finite \(C\).
(c)
By the argument in (b), there cannot be any such point: the loss strictly decreases along the ray \(C\boldsymbol{\beta}^{*}\) and approaches its infimum of zero only as \(C \rightarrow \infty\). The infimum is never attained, so no finite \(\boldsymbol{\beta}\) satisfies the first–order condition.
(d)
If the optimization algorithm is able to find a separating hyperplane, it will keep making the vector \(\boldsymbol{\beta}\) larger and larger in magnitude until numerical problems occur — for example, the fitted probabilities round to exactly zero or one in floating point.
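A minimal sketch of this behavior in R, on simulated one–dimensional separable data (the data–generating settings and step size are arbitrary choices):

```r
set.seed(1)
x <- c(runif(20, -2, -0.5), runif(20, 0.5, 2))  # perfectly separated at zero
y <- rep(c(0, 1), each = 20)
beta <- 0
for (iter in 1:50000) {
  pi_hat <- 1 / (1 + exp(-beta * x))
  beta <- beta + mean((y - pi_hat) * x)  # gradient step on the average log likelihood
}
beta  # keeps growing as the number of iterations increases
```

Fitting the same data with `glm(y ~ x - 1, family = binomial)` produces a warning that fitted probabilities numerically zero or one occurred — the same phenomenon.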
4 Logistic regression as Gaussian generative modeling
In this problem, we will derive a classification technique known as “linear discriminant analysis” (LDA). (Not to be confused with “latent Dirichlet allocation,” which is a completely different ML technique for topic modeling.)
Suppose that classification data follows a generative model
\[ p(\boldsymbol{x}\vert y= 0) = \mathcal{N}\left(\boldsymbol{x}| \boldsymbol{\mu}_0, \boldsymbol{\Sigma}\right) \quad\textrm{and}\quad p(\boldsymbol{x}\vert y= 1) = \mathcal{N}\left(\boldsymbol{x}| \boldsymbol{\mu}_1, \boldsymbol{\Sigma}\right) \]
for some covariance matrix \(\boldsymbol{\Sigma}\) and two vectors \(\boldsymbol{\mu}_0\) and \(\boldsymbol{\mu}_1\). Suppose also that \(\mathbb{P}_{\,}\left(y= 1\right) = \mathbb{P}_{\,}\left(y= 0\right) = 0.5\), marginally.
Recall that the multivariate normal density is given by
\[ \mathcal{N}\left(\boldsymbol{x}| \boldsymbol{\mu}, \boldsymbol{\Sigma}\right) = \exp\left( -\frac{1}{2}(\boldsymbol{x}- \boldsymbol{\mu})^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{x}- \boldsymbol{\mu}) - \frac{1}{2} \log |\boldsymbol{\Sigma}| + C \right), \]
where the normalizing constant \(C\) does not depend on \(\boldsymbol{x}\), \(\boldsymbol{\mu}\), or \(\boldsymbol{\Sigma}\). You may assume that \(\boldsymbol{\Sigma}\) is invertible and that the \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_0\) are distinct.
In this problem, we will compute the optimal classifier in closed form for this setting.
(a)
Show that \[ \frac{\mathbb{P}_{\,}\left(y=1\vert\boldsymbol{x}\right)}{\mathbb{P}_{\,}\left(y=0\vert\boldsymbol{x}\right)} = \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=0\right)}. \] Meaning, the posterior probability ratio equals the likelihood ratio in the case that the priors are equal.
Hint: Use Bayes’ formula.
(b)
Express the log likelihood ratio \[ \log \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=0\right)} \]
in terms of \(\boldsymbol{x},\boldsymbol{\mu}_0,\boldsymbol{\mu}_1,\boldsymbol{\Sigma}\), and show that it is affine in \(\boldsymbol{x}\).
(c)
Let \(\hat{f}(\boldsymbol{x})\) be the classifier that predicts the class with the higher posterior probability given \(\boldsymbol{x}\). Use (a) and (b) to show that \[ \hat{f}(\boldsymbol{x}) = 1 \iff \boldsymbol{x}^\intercal\boldsymbol{\beta}^* > c^*, \] where \(\boldsymbol{\beta}^* =\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)\), and \(c^*\) is a constant that depends on \(\boldsymbol{\mu}_0,\boldsymbol{\mu}_1,\boldsymbol{\Sigma}\) but not on \(\boldsymbol{x}\). Connect this conclusion to logistic regression.
(a)
Using Bayes’ rule twice, \[ \frac{\mathbb{P}_{\,}\left(y=1\vert\boldsymbol{x}\right)}{\mathbb{P}_{\,}\left(y=0\vert\boldsymbol{x}\right)} = \frac{\mathbb{P}_{\,}\left(y=1, \boldsymbol{x}\right) / \mathbb{P}_{\,}\left(\boldsymbol{x}\right)}{\mathbb{P}_{\,}\left(y=0, \boldsymbol{x}\right) / \mathbb{P}_{\,}\left(\boldsymbol{x}\right)} = \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right) \mathbb{P}_{\,}\left(y= 1\right) / \mathbb{P}_{\,}\left(\boldsymbol{x}\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y= 0\right) \mathbb{P}_{\,}\left(y= 0\right) / \mathbb{P}_{\,}\left(\boldsymbol{x}\right)} = \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y= 0\right)}, \]
where the final equality uses the fact that \(\mathbb{P}_{\,}\left(y= 1\right) = \mathbb{P}_{\,}\left(y= 0\right) = 1/2\).
(b)
\[ \begin{aligned} \log \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=0\right)} ={}& \log \mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right) - \log \mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=0\right) \\={}& -\frac{1}{2}(\boldsymbol{x}- \boldsymbol{\mu}_1)^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{x}- \boldsymbol{\mu}_1) - \frac{1}{2} \log |\boldsymbol{\Sigma}| + C + \\{}& \frac{1}{2}(\boldsymbol{x}- \boldsymbol{\mu}_0)^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{x}- \boldsymbol{\mu}_0) + \frac{1}{2} \log |\boldsymbol{\Sigma}| - C \\={}& -\frac{1}{2}(\boldsymbol{x}- \boldsymbol{\mu}_1)^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{x}- \boldsymbol{\mu}_1) + \frac{1}{2}(\boldsymbol{x}- \boldsymbol{\mu}_0)^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{x}- \boldsymbol{\mu}_0) \\={}& -\frac{1}{2} \boldsymbol{\mu}_1^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_0^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0 + \boldsymbol{x}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{x}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0 \\={}& -\frac{1}{2} \boldsymbol{\mu}_1^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_0^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0 + \boldsymbol{x}^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0). \end{aligned} \]
This takes the form \(b+ \boldsymbol{a}^\intercal\boldsymbol{x}\) where \(b= -\frac{1}{2} \boldsymbol{\mu}_1^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_0^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0\) and \(\boldsymbol{a}= \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)\).
(c)
Combining (a) and (b), we set \(\hat{f}(\boldsymbol{x}) = 1\) when
\[ \begin{aligned} \frac{\mathbb{P}_{\,}\left(y= 1\vert \boldsymbol{x}\right)}{\mathbb{P}_{\,}\left(y=0 \vert \boldsymbol{x}\right)} \ge{}& 1 \quad\Leftrightarrow \\ \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=0\right)} \ge{}& 1 \quad\Leftrightarrow \\ \log \frac{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=1\right)}{\mathbb{P}_{\,}\left(\boldsymbol{x}\vert y=0\right)} \ge{}& 0 \quad\Leftrightarrow\\ \boldsymbol{x}^\intercal\boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) \ge{}& \frac{1}{2} \boldsymbol{\mu}_1^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \frac{1}{2} \boldsymbol{\mu}_0^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0. \end{aligned} \]
Logistic regression (with an intercept in the regression) also produces an affine decision boundary, but finds it in a fundamentally different way — by minimizing the cross–entropy loss rather than by modeling \(p(\boldsymbol{x}\vert y)\). Both LDA and logistic regression thus result in affine decision boundaries, but in general different ones.
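A small simulation sketch in R comparing the two fitted directions (a case where the LDA model is correctly specified; all of the settings below are arbitrary choices):

```r
set.seed(1)
n <- 500
x0 <- cbind(rnorm(n, -1), rnorm(n))  # class 0: N((-1, 0), I)
x1 <- cbind(rnorm(n, 1), rnorm(n))   # class 1: N((1, 0), I)
x <- rbind(x0, x1)
y <- rep(c(0, 1), each = n)
sigma_hat <- (cov(x0) + cov(x1)) / 2  # pooled covariance estimate
beta_lda <- solve(sigma_hat, colMeans(x1) - colMeans(x0))  # Sigma^{-1}(mu1 - mu0)
beta_glm <- coef(glm(y ~ x, family = binomial))[2:3]  # drop the intercept
beta_lda / sqrt(sum(beta_lda^2))  # the two unit vectors approximately agree here,
beta_glm / sqrt(sum(beta_glm^2))  # but they are estimated by different criteria
```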
5 Domain mismatch
A common problem in machine learning arises when the distribution of your training data does not match the distribution of the population in which you intend to use your predictor. When the model is misspecified, this mismatch can cause problems in general.
For this problem, we’ll explore a very simple example of domain mismatch in the context of classification. Dealing with this problem is the area of “domain adaptation.”
Suppose that both \(y\) and \(x\) are binary. We have two populations: the sample population \(p_s(x, y)\), which we observe and on which we learn our prediction function, and a target population \(p_t(x, y)\), on which we wish to make predictions. For this problem, we’ll compare the exact risk minimizers for simplicity.
The data is distributed as follows. Let \(0 < \varepsilon < 1/4\). In both the sample and target, the distribution \(p(y\vert x)\) is the same:
\[ p(y= 1 \vert x) = \begin{cases} x= 0: & \varepsilon \\ x= 1: & 1 - \varepsilon. \end{cases} \]
In the sample, \(p_s(x= 1) = 1 - \varepsilon\), and in the target, \(p_t(x= 1) = \varepsilon\).
(a) Suppose we allow our prediction function, \(f\), to depend on \(x\), so \(f: \{0,1\}\mapsto \{0,1\}\). Since the distribution of \(y\) actually does depend only on \(x\), this model is correctly specified. Identify the sample population optimal classifier \(f_s^* = \underset{f}{\mathrm{argmin}}\, \mathscr{R}_s(f)\), where \(\mathscr{R}_s(f) = \mathbb{E}_{p_s(x, y)}\left[\mathrm{I}\left(f(x) \ne y\right)\right]\).
(b) Now, as in (a), identify the target population optimal classifier, \(f_t^* = \underset{f}{\mathrm{argmin}}\, \mathscr{R}_t(f)\), where \(\mathscr{R}_t(f) = \mathbb{E}_{p_t(x, y)}\left[\mathrm{I}\left(f(x) \ne y\right)\right]\).
(c) We would ideally like to use \(f_t^*\), but we only observe samples from \(p_s(\cdot)\), and so learn \(f_s^*\). What is the regret on the target population, \(\mathscr{R}_t(f_s^*) - \mathscr{R}_t(f_t^*)\)?
(d) Now, suppose we do not use \(x\) to form our predictions, but instead predict the same value, \(g\in \{0,1\}\), for every observation. Since the true distribution of \(y\) depends on \(x\), this model is misspecified. Under misspecification, what is the sample optimal classifier, \(g_s^* = \underset{g}{\mathrm{argmin}}\, \mathscr{R}_s(g)\), where \(\mathscr{R}_s(g) = \mathbb{E}_{p_s(x, y)}\left[\mathrm{I}\left(g\ne y\right)\right]\)?
(e) Now, as in (d), identify the target population optimal constant classifier, \(g_t^* = \underset{g}{\mathrm{argmin}}\, \mathscr{R}_t(g)\), where \(\mathscr{R}_t(g) = \mathbb{E}_{p_t(x, y)}\left[\mathrm{I}\left(g\ne y\right)\right]\).
(f) What is the regret on the target population, \(\mathscr{R}_t(g_s^*) - \mathscr{R}_t(g_t^*)\), for the misspecified classifier?
(g) You should have gotten qualitatively different answers for (c) and for (f). Describe in ordinary language what went wrong.
(h) Imagine you are trying to design a classifier to detect infrequent events, such as very rare disease conditions. Connect the result above to how such a classifier might fail to be useful due to domain mismatch (recalling that essentially all ML models are misspecified).
(a)
We write \(\mathscr{R}_s(f) = \mathbb{E}_{p_s(x)}\left[\mathbb{E}_{p(y\vert x)}\left[\mathrm{I}\left(f(x) \ne y\right)\right]\right]\), and expand the inner expectation as
\[ \begin{aligned} \mathbb{E}_{p(y\vert x)}\left[\mathrm{I}\left(f(x) \ne y\right)\right] ={}& \mathrm{I}\left(f(x) = 1\right) (1 - p(y= 1 \vert x)) + \mathrm{I}\left(f(x) = 0\right) p(y= 1 \vert x). \end{aligned} \]
So the optimal classifier is
\[ f^{\star}(x) = \mathrm{I}\left(1 - p(y= 1 \vert x) \le p(y= 1 \vert x)\right) = \mathrm{I}\left(1/2 \le p(y= 1 \vert x)\right), \]
as usual. Since \(\varepsilon < 1/4 < 1/2\), when \(x = 1\) we have \(p(y= 1 \vert x= 1) = 1 - \varepsilon > 1/2\), and so \(f^{\star}(1) = 1\). Similarly, \(f^{\star}(0) = 0\). The distribution of \(x\) does not matter.
(b) In (a) we never used the distribution of \(x\), and \(p(y\vert x)\) is the same in both populations, so the answer is the same: \(f^{\star}_t = f^{\star}_s\).
(c) Since \(f^{\star}_t = f^{\star}_s\) the regret is zero.
(d) Now, since \(g\) does not depend on \(x\), we cannot make separate predictions for each \(x\). Instead, we make optimal predictions based on
\[ \begin{aligned} \mathscr{R}_s(g) ={}& \mathbb{E}_{p_s(x)}\left[\mathbb{E}_{p(y\vert x)}\left[\mathrm{I}\left(g = 1\right) \mathrm{I}\left(y= 0\right) + \mathrm{I}\left(g = 0\right) \mathrm{I}\left(y= 1\right)\right]\right] \\={}& \mathrm{I}\left(g= 1\right) p_s(y= 0) + \mathrm{I}\left(g= 0\right) p_s(y= 1). \end{aligned} \]
Marginally,
\[ \begin{aligned} p_s(y= 1) ={}& \mathbb{E}_{p_s(x)}\left[p(y= 1 \vert x)\right] \\={}& p_s(x= 1) p(y= 1 \vert x= 1) + p_s(x= 0) p(y= 1 \vert x= 0) \\={}& (1 - \varepsilon) (1 - \varepsilon) + \varepsilon \varepsilon \\={}& (1 - \varepsilon)^2 + \varepsilon^2. \end{aligned} \]
This expression takes a minimum when \(\varepsilon = 1/2\), at which point \(p_s(y= 1) = 1/2\), so the fact that \(\varepsilon < 1/4\) means that \(p_s(y= 1) > 1/2\), and the optimal prediction rule is \(g^{\star}_s = 1\).
(e) Repeating the calculation from (d) but with \(p_t(x)\) gives
\[ \begin{aligned} p_t(y= 1) ={}& p_t(x= 1) p(y= 1 \vert x= 1) + p_t(x= 0) p(y= 1 \vert x= 0) \\={}& \varepsilon(1 - \varepsilon) + (1 - \varepsilon) \varepsilon \\={}& 2 \varepsilon - 2 \varepsilon^2. \end{aligned} \]
This expression takes a maximum when \(\varepsilon = 1/2\), and so \(\varepsilon < 1/4\) implies that \(p_t(y= 1) < 1/2\); the optimal prediction rule is therefore \(g^{\star}_t = 0\).
(f) The risks are simply given by \(\mathscr{R}_t(g^{\star}_t) = p_t(y= 1)\) and \(\mathscr{R}_t(g^{\star}_s) = p_t(y= 0)\), and so the regret is \(p_t(y= 0) - p_t(y= 1) = 1 - 2 p_t(y= 1) = 1 - 4 \varepsilon + 4 \varepsilon^2 = (1 - 2\varepsilon)^2 > 0\). Note that as \(\varepsilon \rightarrow 0\), the regret goes to the worst possible value of one.
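A quick numerical check of this formula in R (the value of \(\varepsilon\) is arbitrary):

```r
eps <- 0.1
p_t1 <- 2 * eps - 2 * eps^2  # p_t(y = 1)
(1 - p_t1) - p_t1            # regret R_t(g_s) - R_t(g_t): 0.64
(1 - 2 * eps)^2              # matches: 0.64
```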
(g) If we are able to make the optimal prediction for each \(x\), then our classifier will be the same no matter what the distribution of \(x\) is. Different populations will incur different risks if some \(x\) values are inherently harder to predict than others (this was not the case in the present example), but there will be no regret.
In contrast, if we don’t know that it’s differences in \(x\) that are driving the differences in \(y\) between the two populations, then learning algorithms may not generalize. This is the case in general under misspecification. Here, the sample and target had very different marginal proportions of \(y= 1\) due to large differences in the \(x\) distribution.
(h) If you are trying to detect a disease that is rare in your training data, classifiers trained on average classification loss (or some proxy) may not pay much of a cost in empirical risk by underestimating how often the disease occurs. For example, in (d), it was optimal to simply say that \(y= 0\) never occurred, because there was not enough data to distinguish the instances of \(y=0\) from chance variation.
6 Proper scoring rules (254 only \(\star\) \(\star\) \(\star\))
In lectures, we showed that both squared loss and cross–entropy loss can be used as valid proxy loss functions for learning \(\mathbb{P}_{\,}\left(y\vert x\right)\) in classification problems. Gneiting and Raftery (2007) provide a complete characterization of such loss functions, which they call (up to a sign change) “proper scoring rules.” (Their paper is very general, and they note that their results in the case of classification actually appeared in earlier work.) Note that Gneiting and Raftery (2007) define “scoring rules” as things that are better when they are higher, so we are looking for losses that are the negatives of proper scoring rules.
Specifically, let \(y\in \mathcal{Y}:= \{c_1, \ldots, c_K\}\) be a categorical variable, and let \(S_K\) denote the \(K\)–simplex, i.e., the set of all vectors \(\pi \in \mathbb{R}^{K}\) that satisfy \(\pi_k \in [0,1]\) and \(\sum_{k=1}^K \pi_k = 1\). A proxy loss function is of the form \(\mathscr{L}(\pi, y)\), that is, a function \(\mathscr{L}: S_K\times \mathcal{Y}\mapsto \mathbb{R}^{}\).
Suppose there is no \(x\) for the moment, and write \(\mathscr{R}_\mathscr{L}(\pi) := \mathbb{E}_{p(y)}\left[\mathscr{L}(\pi, y)\right]\). Then \(\mathscr{L}\) is the negative of a “strictly proper scoring rule” if \(\underset{\pi \in S_K}{\mathrm{argmin}}\, \mathscr{R}_\mathscr{L}(\pi) = p(y)\) — that is, if the risk minimizer is the true probability.
Theorem 2 of Gneiting and Raftery (2007) states that the negative of \(\mathscr{L}(\cdot, \cdot)\) is a strictly proper scoring rule if and only if there exists a strictly convex function \(g: S_K\mapsto \mathbb{R}^{}\) with subgradient \(g'(\cdot)\) such that
\[ -\mathscr{L}(\pi, c) = g(\pi) - g'(\pi)^\intercal\pi + \frac{\partial g(\pi)}{\partial \pi_c}. \]
Note the leading negative! (And if you don’t know what a subgradient is, don’t worry — when \(g\) is differentiable it’s just the gradient.)
Throughout, let \(\boldsymbol{z}\in \{0,1\}^K\) denote the one–hot encoding of \(y\) with \(\boldsymbol{z}_c = \mathrm{I}\left(y= c\right)\).
(a) Directly prove that \(\mathscr{L}_{sq}(\pi, y) = \sum_{c \in \mathcal{Y}} (\pi_c - \boldsymbol{z}_c)^2\) is the negative of a proper scoring rule by minimizing the risk.
(b) Directly prove that \(\mathscr{L}_{ce}(\pi, y) = -\sum_{c \in \mathcal{Y}} \mathrm{I}\left(y= c\right) \log \pi_c\) is the negative of a proper scoring rule by minimizing the risk.
(c) Using Theorem 2 of Gneiting and Raftery (2007), prove that \(\mathscr{L}_{sq}\) from (a) is the negative of a proper scoring rule. Hint: consider \(g(\pi) = \sum_{c \in \mathcal{Y}} \pi_c^2 - 1\).
(d) Using Theorem 2 of Gneiting and Raftery (2007), prove that \(\mathscr{L}_{ce}\) from (b) is the negative of a proper scoring rule. Hint: consider \(g(\pi) = \sum_{c \in \mathcal{Y}} \pi_c \log \pi_c\).
(e) Design a new loss function that is the negative of a proper scoring rule that we have not defined in class.
(a) Let \(p_c = p(y= c)\) for short. There are many ways to do this, and in general it is important to constrain \(\pi_c\) to be a valid probability distribution. One easy way, which avoids Lagrange multipliers, is to observe that
\[ \mathbb{E}_{\vphantom{}}\left[(\pi_c - \boldsymbol{z}_c)^2\right] = \pi_c^2 - 2 \pi_c \mathbb{E}_{\vphantom{}}\left[\boldsymbol{z}_c\right] + \mathbb{E}_{\vphantom{}}\left[\boldsymbol{z}_c^2\right] = \pi_c^2 - 2 \pi_c p_c + p_c \]
so that
\[ \mathbb{E}_{\vphantom{}}\left[\sum_{c \in \mathcal{Y}} (\pi_c - \boldsymbol{z}_c)^2\right] = \sum_{c \in \mathcal{Y}} (\pi_c^2 - 2 \pi_c p_c) + \sum_{c \in \mathcal{Y}} p_c = \sum_{c \in \mathcal{Y}} \pi_c (\pi_c - 2 p_c) + 1. \]
This objective function is quadratic in \(\pi_c\) with a global minimum at \(\hat{\pi}_c = p_c\). Since this \(\hat{\pi}_c\) also satisfies the constraint that the \(\pi_c\) be a valid probability distribution, it is the constrained minimizer as well.
(b) We minimize \(-\sum_c p_c \log \pi_c\) — equivalently, maximize \(\sum_c p_c \log \pi_c\) — subject to \(\sum_c \pi_c = 1\). Let \(\lambda\) be the Lagrange multiplier for the constraint. Then the Lagrangian is
\[ \sum_c p_c \log \pi_c + \lambda (1 - \sum_c \pi_c), \] and the first–order conditions at the optimal values \(\hat{\pi}_c\) and \(\hat{\lambda}\) give, for each \(c\), \(\frac{p_c}{\hat{\pi}_c} - \hat{\lambda}= 0 \Rightarrow p_c = \hat{\lambda}\hat{\pi}_c\). Summing over \(c\) gives \(\hat{\lambda}= 1 / \sum_{c'} \hat{\pi}_{c'}\), so that
\[ p_c = \frac{\hat{\pi}_c}{\sum_{c'} \hat{\pi}_{c'}}. \]
Imposing the constraint \(\sum_{c'} \hat{\pi}_{c'} = 1\) gives \(\hat{\pi}_c = p_c\).
(c) We need \(\frac{\partial g(\pi)}{\partial \pi_c} = 2\pi_c\), from which
\[ \begin{aligned} g(\pi) - g'(\pi)^\intercal\pi + \frac{\partial g(\pi)}{\partial \pi_k} ={}& \sum_c \pi_c^2 - 1 - \sum_c 2 \pi_c \pi_c + 2 \pi_k \\={}& - \sum_c \pi_c^2 - 1 + 2 \pi_k. \end{aligned} \]
And note that when \(z_k = 1\)
\[ \sum_c (\pi_c - z_c)^2 = \sum_c \pi_c^2 - 2 \sum_c \pi_c z_c + \sum_c z_c^2 = \sum_c \pi_c^2 - 2 \pi_k + 1, \]
which is exactly the negative of the expression obtained from the theorem, as required. (Note that this was mechanical and easy, in contrast to the direct constrained optimization in (a) and (b), which required a trick.)
(d) We need \(\frac{\partial g(\pi)}{\partial \pi_c} = \log \pi_c + 1\), from which
\[ \begin{aligned} g(\pi) - g'(\pi)^\intercal\pi + \frac{\partial g(\pi)}{\partial \pi_k} ={}& \sum_c \pi_c \log \pi_c - \sum_c (\log \pi_c + 1)\pi_c + \log \pi_k + 1 \\={}& - \sum_c \pi_c + \log \pi_k + 1 \\={}& \log \pi_k. \end{aligned} \]
And when \(z_k = 1\), \(\mathscr{L}_{ce}(\pi, y) = -\sum_c z_c \log \pi_c = -\log \pi_k\), again the negative of the expression above.
(e) Take any strictly convex function not used above, e.g. \(g(\pi) = \sum_c \pi_c^4\). Then \(\partial g(\pi) / \partial \pi_c = 4 \pi_c^3\), and the theorem gives
\[ -\mathscr{L}(\pi, k) = \sum_c \pi_c^4 - \sum_c 4 \pi_c^4 + 4 \pi_k^3 = -3 \sum_c \pi_c^4 + 4 \pi_k^3, \]
so \(\mathscr{L}(\pi, y) = 3 \sum_c \pi_c^4 - 4 \pi_y^3\) is the negative of a strictly proper scoring rule.
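A numerical sanity check of this new loss in R (the true distribution and the softmax parameterization below are arbitrary choices):

```r
p <- c(0.5, 0.3, 0.2)  # true class probabilities
risk <- function(theta) {  # parameterize pi by a softmax to stay on the simplex
  pi <- exp(theta) / sum(exp(theta))
  3 * sum(pi^4) - 4 * sum(p * pi^3)  # E_p[ L(pi, y) ]
}
fit <- optim(c(0, 0, 0), risk)
exp(fit$par) / sum(exp(fit$par))  # approximately (0.5, 0.3, 0.2)
```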
7 Exponential families and logistic regression (254 only \(\star\) \(\star\) \(\star\))
In this problem, we will show that logistic regression — and some of its nice properties — follow from general properties of a class of distributions known as exponential families.
Suppose that we have a probability distribution \(p(y\vert \eta)\) given by
\[ p(y\vert \eta) = \exp\left(y\eta - A(\eta) + H(y) \right). \]
For simplicity, let \(y\in \mathcal{Y}\) where \(\mathcal{Y}\) is discrete. (This is not essential to any of the results but allows us to write the problem without using any measure theoretic ideas.)
(a)
Show that, necessarily, \(A(\eta) = \log \sum_{y\in \mathcal{Y}} \exp(y\eta + H(y))\). The function \(A(\eta)\) is known as the “log partition function,” and \(\eta\) is known as the “natural parameter.”
(b)
Define \(\mu(\eta) := \mathbb{E}_{p(y\vert \eta)}\left[y\right]\) and \(v(\eta) := \mathrm{Var}_{p(y\vert \eta)}\left(y\right)\). Using (a), show that
\[ \mu(\eta) = \frac{\partial A(\eta)}{\partial \eta} \quad\textrm{and}\quad v(\eta) = \frac{\partial^2 A(\eta)}{\partial \eta^2}. \]
(c)
Using (b), show that the first and second derivatives of the log likelihood are
\[ \frac{\partial \log p(y\vert \eta)}{\partial \eta} = (y- \mu(\eta)) \quad\textrm{and}\quad \frac{\partial^2 \log p(y\vert \eta)}{\partial \eta^2} = -v(\eta). \]
(d)
Suppose that we observe pairs \((\boldsymbol{x}_n, y_n)\), and model \(\eta_n := \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\), so that the regressors linearly determine the natural parameter for the \(n\)–th observation. Show that the first and second derivatives of the observed log likelihood \(\hat{\mathscr{R}}(\boldsymbol{\beta}) := \frac{1}{N} \sum_{n=1}^N\log p(y_n \vert \boldsymbol{x}_n, \boldsymbol{\beta})\) are
\[ \frac{\partial \hat{\mathscr{R}}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{1}{N} \sum_{n=1}^N(y_n - \mu(\eta_n)) \boldsymbol{x}_n \quad\textrm{and}\quad \frac{\partial^2 \hat{\mathscr{R}}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\intercal} = -\frac{1}{N} \sum_{n=1}^Nv(\eta_n) \boldsymbol{x}_n \boldsymbol{x}_n^\intercal. \]
Conclude that the negative observed log likelihood is convex as a function of \(\boldsymbol{\beta}\).
(e)
Show that the binary model \(p(y\vert \pi) = \pi^{y} (1 - \pi)^{1-y}\) is an exponential family of the above form. Identify \(\mathcal{Y}\), \(H(y)\), \(\eta\), and \(A(\eta)\). Show that the convexity of logistic regression loss minimization follows as a special case of (d).
(f)
Show that the Poisson model \(p(y\vert \lambda) = \exp(-\lambda) \lambda^y/ y!\) is an exponential family of the above form. Identify \(\mathcal{Y}\), \(H(y)\), \(\eta\), and \(A(\eta)\). Propose a method for “Poisson regression”, and show that the corresponding empirical risk is convex.
(a)
The probability distribution sums to one, so
\[ 1 = \sum_{y\in \mathcal{Y}} p(y\vert \eta) = \exp(- A(\eta) ) \sum_{y\in \mathcal{Y}} \exp\left(y\eta + H(y) \right). \]
Solving for \(A(\eta)\) gives the result.
(b)
Differentiating both sides of (a),
\[ \begin{aligned} \frac{\partial A(\eta)}{\partial \eta} ={}& \frac{\sum_{y\in \mathcal{Y}} y\exp\left(y\eta + H(y) \right)}{\sum_{y\in \mathcal{Y}} \exp\left(y\eta + H(y) \right)} \\={}& \sum_{y\in \mathcal{Y}} y\frac{ \exp\left(y\eta + H(y) \right)}{ \sum_{y' \in \mathcal{Y}} \exp\left(y' \eta + H(y') \right) } \\={}& \sum_{y\in \mathcal{Y}} y\exp\left(y\eta + H(y) - A(\eta)\right) \\={}& \sum_{y\in \mathcal{Y}} yp(y\vert \eta) \\={}& \mu(\eta) . \end{aligned} \]
Differentiating the third line again,
\[ \begin{aligned} \frac{\partial^2 A(\eta)}{\partial \eta^2} ={}& \frac{\partial}{\partial \eta} \sum_{y\in \mathcal{Y}} y\exp\left(y\eta + H(y) - A(\eta)\right) \\={}& \sum_{y\in \mathcal{Y}} y^2 \exp\left(y\eta + H(y) - A(\eta)\right) - \frac{\partial A(\eta)}{\partial \eta} \sum_{y\in \mathcal{Y}} y\exp\left(y\eta + H(y) - A(\eta)\right) \\={}& \mathbb{E}_{\vphantom{}}\left[y^2 | \eta\right] - \mathbb{E}_{\vphantom{}}\left[y| \eta\right]^2 \\={}& v(\eta). \end{aligned} \]
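As a concrete check, for the Bernoulli family of part (e) below, \(A(\eta) = \log(1 + \exp(\eta))\), so \(A'(\eta) = \exp(\eta) / (1 + \exp(\eta)) = \pi\) and \(A''(\eta) = \pi (1 - \pi)\) — exactly the Bernoulli mean and variance.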
(c)
Since \(\log p(y\vert \eta) = y\eta - A(\eta) + H(y)\), these follow directly from (b).
(d)
These follow from the chain rule — since \(\eta_n = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\), \(\partial \eta_n / \partial \boldsymbol{\beta}= \boldsymbol{x}_n\) — together with (c). Each \(v(\eta_n) \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) is positive semi–definite since \(v(\eta_n) \ge 0\), so the Hessian of the negative observed log likelihood is positive semi–definite, and the negative observed log likelihood is convex.
(e)
In this case, \(\mathcal{Y}= \{0,1\}\), \(H(y) = 0\), and
\[ p(y\vert \pi) = \exp\left(y\log \pi + (1 - y) \log(1-\pi) \right) = \exp\left(y\log \frac{\pi}{1-\pi} + \log(1 - \pi)\right), \]
so \(\eta = \log (\pi/(1-\pi))\), \(\pi = \exp(\eta) / (1 + \exp(\eta))\), and \(A(\eta) = -\log(1 - \exp(\eta) / (1 + \exp(\eta))) = \log(1 + \exp(\eta))\). The convexity of logistic regression loss minimization then follows as a special case of (d).
(f)
In this case, \(\mathcal{Y}\) is the nonnegative integers \(\{0, 1, 2, \ldots\}\), and
\[ p(y\vert \lambda) = \lambda^y\exp(-\lambda) \frac{1}{y!} = \exp(y\log \lambda - \lambda - \log y!), \]
so \(H(y) = -\log y!\), \(\eta = \log \lambda\), and \(A(\eta) = \exp(\eta)\). For Poisson regression, you might take \(\log \lambda_n = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\), and convexity of the corresponding empirical risk again follows from the exponential family form and (d).
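A sketch of this proposal in R on simulated data (the coefficient and sample size are arbitrary choices):

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 * x))     # log(lambda_n) = beta * x_n with beta = 0.5
fit <- glm(y ~ x - 1, family = poisson)  # Poisson regression: convex MLE from (d)
coef(fit)                                # approximately 0.5
```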