Homework 1 (due Feb 14th 9pm)

Stat 154/254: Statistical Machine Learning

1 Regressing on linear transformations

Consider two linear regressions,

\[ \boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta} \quad\textrm{and}\quad \boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{\gamma}, \]

where \(\boldsymbol{z}_n = \boldsymbol{A}\boldsymbol{x}_n\) for an invertible linear transformation \(\boldsymbol{A}\). Show that these two models produce the same predictions, i.e. that, for any \(\boldsymbol{x}_\mathrm{new}\) and \(\boldsymbol{z}_\mathrm{new}= \boldsymbol{A}\boldsymbol{x}_\mathrm{new}\),

\[ \hat{y}_\mathrm{new}= \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_\mathrm{new}= \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{z}_\mathrm{new}. \]

Solutions

Note that \(\boldsymbol{z}_n = \boldsymbol{A}\boldsymbol{x}_n\) means \(\boldsymbol{z}_n^\intercal= \boldsymbol{x}_n^\intercal\boldsymbol{A}^\intercal\), and so \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{A}^\intercal\). Then, writing out the estimators explicitly,

\[ \begin{aligned} \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{z}_\mathrm{new}={}& \boldsymbol{Y}^\intercal\boldsymbol{Z}(\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{z}_\mathrm{new} \\={}& \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{A}^\intercal((\boldsymbol{X}\boldsymbol{A}^\intercal)^\intercal(\boldsymbol{X}\boldsymbol{A}^\intercal))^{-1} \boldsymbol{A}\boldsymbol{x}_\mathrm{new} \\={}& \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{A}^\intercal(\boldsymbol{A}\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{A}^\intercal)^{-1} \boldsymbol{A}\boldsymbol{x}_\mathrm{new} \\={}& \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{A}^\intercal( \boldsymbol{A}^\intercal)^{-1}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{A}^{-1} \boldsymbol{A}\boldsymbol{x}_\mathrm{new} \\={}& \boldsymbol{Y}^\intercal\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_\mathrm{new} \\={}& \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_\mathrm{new}. \end{aligned} \]
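
A quick numerical check of this identity (a sketch with synthetic data; a generic random \(\boldsymbol{A}\) is invertible with probability one):

```python
# Verify numerically that regressing on X and on Z = X A^T gives the same predictions.
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)
A = rng.normal(size=(P, P))          # generically invertible
Z = X @ A.T                          # rows are z_n = A x_n

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
gamma_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)

x_new = rng.normal(size=P)
z_new = A @ x_new
print(np.allclose(beta_hat @ x_new, gamma_hat @ z_new))   # True
```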

2 One-hot encodings

Consider a one–hot encoding of a variable \(z_n\) that takes three distinct values, “a”, “b”, and “c”. That is, let

\[ \boldsymbol{x}_n = \begin{cases} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} & \textrm{ when }z_n = a \\ \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} & \textrm{ when }z_n = b \\ \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} & \textrm{ when }z_n = c \\ \end{cases} \]

Let \(\boldsymbol{X}\) be the regressor matrix with \(\boldsymbol{x}_n^\intercal\) in row \(n\).

(a)

Let \(N_a\) be the number of observations with \(z_n\) = a, and let \(\sum_{n:z_n = a}\) denote a sum over rows with \(z_n\) = a, with analogous definitions for b and c. In terms of these quantities, write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).

(b)

When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Explain intuitively why the regression problem cannot be solved when \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible. Write an explicit expression for \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) when it is invertible.

(c)

Using your previous answer, show that the least squares vector \(\hat{\boldsymbol{\beta}}\) is the mean of \(y_n\) within distinct values of \(z_n\).

(d)

Suppose now you include a constant in the regression, so that

\[ y_n = \alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \varepsilon_n, \]

and let \(\boldsymbol{X}'\) denote the regressor matrix for this regression with coefficient vector \((\alpha, \boldsymbol{\beta}^\intercal)^\intercal\). Write an expression for \(\boldsymbol{X}'^\intercal\boldsymbol{X}'\) and show that it is not invertible.

(e)

Find three distinct values of \((\alpha, \boldsymbol{\beta}^\intercal)\) that all give the exact same fit \(\alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).

Solutions

(a)

\[ \boldsymbol{X}^\intercal\boldsymbol{X}= \begin{pmatrix} N_a & 0 & 0 \\ 0 & N_b & 0 \\ 0 & 0 & N_c \\ \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \sum_{n:z_n = a} y_n \\ \sum_{n:z_n = b} y_n \\ \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]

(b)

It is invertible as long as each of \(N_a\), \(N_b\), and \(N_c\) is nonzero. If there are no observations for a particular level, you of course cannot estimate its relationship with \(y_n\). When \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible, then

\[ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = \begin{pmatrix} 1/N_a & 0 & 0 \\ 0 & 1/N_b & 0 \\ 0 & 0 & 1/N_c \\ \end{pmatrix}. \]

(c)

By direct multiplication,

\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \frac{1}{N_a} \sum_{n:z_n = a} y_n \\ \frac{1}{N_b} \sum_{n:z_n = b} y_n \\ \frac{1}{N_c} \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
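
A short numerical check that the one–hot coefficients are the within–level means of \(y_n\) (a sketch with simulated data):

```python
# One-hot OLS coefficients equal the group means of y.
import numpy as np

rng = np.random.default_rng(0)
z = rng.choice(["a", "b", "c"], size=200)
y = rng.normal(loc=(z == "a") * 1.0 + (z == "b") * 2.0 + (z == "c") * 3.0)

# Regressor matrix with the one-hot encoding of z in each row.
X = np.column_stack([(z == level).astype(float) for level in ["a", "b", "c"]])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
group_means = np.array([y[z == level].mean() for level in ["a", "b", "c"]])
print(np.allclose(beta_hat, group_means))   # True
```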

(d)

\[ (\boldsymbol{X}')^\intercal\boldsymbol{X}' = \begin{pmatrix} N & N_a & N_b & N_c \\ N_a & N_a & 0 & 0 \\ N_b & 0 & N_b & 0 \\ N_c & 0 & 0 & N_c \\ \end{pmatrix}. \]

This is not invertible because the first column is the sum of the other three. Equivalently, \((\boldsymbol{X}')^\intercal\boldsymbol{X}' \boldsymbol{v}= \boldsymbol{0}\) where \(\boldsymbol{v}= (1, -1, -1, -1)^\intercal\).

(e)

Any coefficients with the same values of \(\alpha + \beta_1\), \(\alpha + \beta_2\), and \(\alpha + \beta_3\) give the same fit, since exactly one component of \(\boldsymbol{x}_n\) equals one, so

\[ \alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n = (\alpha + \beta_1) x_{n1} + (\alpha + \beta_2) x_{n2} + (\alpha + \beta_3) x_{n3}. \]

Three distinct choices that also happen to solve the least squares problem are

\[ \begin{aligned} (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + 0 \cdot \boldsymbol{v}\\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + 1 \cdot \boldsymbol{v}\\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + 2 \cdot \boldsymbol{v}, \end{aligned} \]

where \(\hat{\boldsymbol{\beta}}\) is the least squares solution from (c). These are all of the form \((0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + C \boldsymbol{v}\) for \(C = 0\), \(C = 1\), and \(C = 2\), as they must be, where \(\boldsymbol{v}= (1, -1, -1, -1)^\intercal\) is the null vector from (d).

3 Ridge regression

Recall the ridge regression estimator

\[ \hat{\boldsymbol{\beta}}(\lambda) := \left( \boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}, \]

for \(\lambda > 0\). We will not assume that \(\boldsymbol{X}\) is full–column rank. By writing this expression, we are implicitly assuming that \(\left( \boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}\right)\) is invertible. In this problem, we will show that it is always invertible, and so \(\hat{\boldsymbol{\beta}}(\lambda)\) always exists, even if \(\boldsymbol{X}\) is not full–column rank.

For each problem, carefully justify your answer.

(a)

What is the smallest possible eigenvalue for \(\boldsymbol{X}^\intercal\boldsymbol{X}\)?

(b)

What is the smallest possible eigenvalue for \(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}\)?

(c)

Using (b), conclude that \(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}\) is always invertible.

(d)

For any given \(\boldsymbol{X}\) and \(\boldsymbol{Y}\), can you find a \(\lambda_0 > 0\) such that \(\hat{\boldsymbol{\beta}}(\lambda_0) = \boldsymbol{0}\)?

Solutions

(a)

Since \(\boldsymbol{v}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{v}= \left\Vert\boldsymbol{X}\boldsymbol{v}\right\Vert_2^2 \ge 0\), every eigenvalue must be nonnegative, so the smallest possible eigenvalue is zero. A zero eigenvalue occurs exactly when \(\boldsymbol{X}\) is not full column rank.

(b)

By the same reasoning as in (a), the smallest an eigenvalue can be is \(\lambda\): every eigenvalue of \(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}\) is an eigenvalue of \(\boldsymbol{X}^\intercal\boldsymbol{X}\) plus \(\lambda\).

(c)

If \(\lambda > 0\), then the determinant, which is the product of the eigenvalues, must be greater than zero, so the matrix is invertible. (There are many different ways to argue this.)

(d)

No, in general you cannot. For any finite \(\lambda\), the matrix \(\left( \boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}\right)^{-1}\) is full rank, so \(\hat{\boldsymbol{\beta}}(\lambda) = \boldsymbol{0}\) only if \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \boldsymbol{0}\). The ridge coefficients shrink toward zero as \(\lambda\) grows, but they never reach it exactly unless \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \boldsymbol{0}\).
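
A small numerical illustration of (c) and (d): even a rank-deficient \(\boldsymbol{X}\) gives an invertible ridge system, and the coefficients shrink toward zero without reaching it (a sketch with synthetic data):

```python
# Ridge works even when X^T X is singular; the coefficients shrink but stay nonzero.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
X = np.column_stack([x1, 2 * x1])            # rank 1, so X^T X is singular
Y = x1 + rng.normal(size=N)

print(np.linalg.matrix_rank(X.T @ X))        # 1 < 2: not invertible on its own

for lam in [0.1, 10.0, 1e6]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
    print(lam, beta_ridge)                   # nonzero, shrinking toward zero as lam grows
```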

4 Lasso regularization

(Based on Buchweitz (2025))

Recall the Lasso estimator,

\[ \begin{aligned} \hat{\boldsymbol{\beta}}(\lambda) :={}& \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \hat{\mathscr{L}}(\boldsymbol{\beta}; \lambda) \\ \quad\textrm{where}\quad \hat{\mathscr{L}}(\boldsymbol{\beta}; \lambda) :={}& \sum_{n=1}^N(y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta})^2 + \lambda \sum_{k=1}^K \left|\beta_k\right|. \end{aligned} \]

(a)

Where is \(\hat{\mathscr{L}}(\boldsymbol{\beta}; \lambda)\) differentiable as a function of \(\boldsymbol{\beta}\)? At the points of differentiability, evaluate \(\partial \hat{\mathscr{L}}(\boldsymbol{\beta}; \lambda) / \partial \boldsymbol{\beta}\).

(b)

For any given \(\boldsymbol{X}\) and \(\boldsymbol{Y}\), show that there is \(\lambda_0 \geq 0\) such that for all \(\lambda{\geq}\lambda_0\), it holds that \(\hat{\boldsymbol{\beta}}(\lambda) = \boldsymbol{0}\).

(c)

Show that, as \(\lambda\) decreases from the \(\lambda_0\) defined in (b), eventually a regressor will have a nonzero coefficient when \(\lambda\) crosses some threshold, \(\lambda_1\).

(d)

For simplicity, assume that the entries of \(\boldsymbol{X}^\intercal\boldsymbol{Y}\) are all distinct, so there are no “ties” in the sample covariance between the regressors and the responses.

Let

\[ k^* = \underset{k}{\mathrm{argmax}}\, \left|\boldsymbol{X}^T_{\cdot k} \boldsymbol{Y}\right| = \underset{k}{\mathrm{argmax}}\, \left|\sum_{n=1}^Nx_{nk} y_n\right| \]

denote the index of the regressor with the largest sample second moment with \(\boldsymbol{Y}\).

Show that:

  • The first entry of \(\hat{\boldsymbol{\beta}}(\lambda)\) to be nonzero will be \(k^*\),
  • The \(\lambda_1\) given in (c) is \(\lambda_1 = 2 \left|\sum_{n=1}^Nx_{nk^*} y_n\right|\).

(e)

Suppose you define a linear transformation of regressor \(k\):

\[ z_{nk} := a x_{nk}, \]

and then perform Lasso using \(z_{nk}\) in place of \(x_{nk}\). Recall that the unregularized linear regression predictions are invariant to linear transformations of this sort.

Assume that \(x_{nk} y_n \ne 0\) for at least one \(n\). Using (d), show that you can always choose \(a\) large enough that the regressor \(z_{nk}\) is selected as the first non–zero coefficient.

(f)

Now suppose that \(\frac{1}{N} \sum_{n=1}^Ny_n \ne 0\), and define a re–centering transformation of regressor \(k\):

\[ z_{nk} := x_{nk} + b, \]

and then perform Lasso using \(z_{nk}\) in place of \(x_{nk}\). Recall that the unregularized linear regression predictions are invariant to linear transformations of this sort if a constant is included in the regression.

Assume that \(y_n \ne 0\) for at least one \(n\). Show that you can always choose \(b\) so that \(z_{nk}\) is selected as the first non–zero coefficient.

(g)

Using (e) and (f), argue why it is standard practice to center and standardize regressors so that \(\frac{1}{N} \sum_{n=1}^Nx_{nk}^2 = 1\) and \(\frac{1}{N} \sum_{n=1}^Nx_{nk} = 0\) when using the lasso.

Solutions

(a)

It is non–differentiable when any component of \(\boldsymbol{\beta}\) is zero, and otherwise differentiable. At points where every component is nonzero,

\[ \frac{\partial \hat{\mathscr{L}}(\boldsymbol{\beta}; \lambda)}{\partial \boldsymbol{\beta}} = 2 \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}- 2 \boldsymbol{X}^\intercal\boldsymbol{Y}+ \lambda \, \mathrm{sign}(\boldsymbol{\beta}), \]

where \(\mathrm{sign}(\cdot)\) is applied componentwise.

(b)

The gradient of the square error loss is given by

\[ \left. \frac{\partial \hat{\mathscr{L}}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right|_{\boldsymbol{\beta}} = 2 \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}- 2 \boldsymbol{X}^\intercal\boldsymbol{Y}, \]

where \(\hat{\mathscr{L}}(\boldsymbol{\beta}) = \sum_{n=1}^N(y_n - \boldsymbol{x}^\intercal\boldsymbol{\beta})^2\) without the regularization.

At \(\boldsymbol{\beta}= \boldsymbol{0}\), this gradient is \(- 2 \boldsymbol{X}^\intercal\boldsymbol{Y}\). If \(\lambda > 2 \left|\boldsymbol{X}_{\cdot k}^\intercal\boldsymbol{Y}\right|\) for each \(k\), then the lasso loss increases in every direction as you move away from \(\boldsymbol{0}\), and so \(\boldsymbol{\beta}= \boldsymbol{0}\) is a local minimum. Further, the lasso loss is convex, so a local minimum is a global minimum. So \(\lambda_0 = 2 \left\Vert\boldsymbol{X}^\intercal\boldsymbol{Y}\right\Vert_\infty\) guarantees a zero solution.

(c)

By the reasoning from (b), as \(\lambda\) decreases it eventually drops below the largest entry of \(2 \left|\boldsymbol{X}^\intercal\boldsymbol{Y}\right|\), where the absolute value is applied componentwise. At that point the lasso loss decreases as you move away from \(\boldsymbol{0}\) in the corresponding coordinate, so that coefficient becomes nonzero.

(d)

This follows by observing that \(\left|\sum_{n=1}^Nx_{nk} y_n\right|\) is the \(k\)th entry of \(\left|\boldsymbol{X}^\intercal\boldsymbol{Y}\right|\). By (b) and (c), the first coefficient to become nonzero as \(\lambda\) decreases is the one with the largest such entry, namely \(k^*\), and it enters at \(\lambda_1 = 2 \left|\sum_{n=1}^Nx_{nk^*} y_n\right|\).

(e)

Plugging in, \(\left|\sum_{n=1}^Nz_{nk} y_n\right| = \left|a\right| \left|\sum_{n=1}^Nx_{nk} y_n\right| > 0\). By making \(a\) large enough, you can ensure that \(z\) is the first regressor selected.

(f)

Plugging in and using the triangle inequality, \(\left|\sum_{n=1}^Nz_{nk} y_n\right| = \left|\sum_{n=1}^N(x_{nk} + b) y_n\right| \ge \left|b\right| \left|\sum_{n=1}^Ny_n\right| - \left|\sum_{n=1}^Nx_{nk} y_n\right|\). Since \(\sum_{n=1}^Ny_n \ne 0\), you can again make \(z_{nk}\) be selected first by simply making \(\left|b\right|\) large enough.

(g)

You usually don’t want a regressor to be selected simply because it has a different scale or center than another; you want it to be selected because of its capacity to predict the response \(y\).
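
The scale dependence from (e) is easy to see numerically. The sketch below uses scikit-learn's `Lasso` with synthetic data; if I recall its parameterization correctly, its penalty \(\alpha\) corresponds to \(\lambda / (2N)\) in this problem's notation, so the entry threshold from (d) is \(\alpha_{\max} = \max_k \left|\boldsymbol{X}_{\cdot k}^\intercal\boldsymbol{Y}\right| / N\):

```python
# Rescaling a regressor changes which coefficient enters the lasso path first.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = 1.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=N)

def first_selected(X, y):
    # Penalty just below the entry threshold from (d): alpha_max = max_k |X_k^T y| / N.
    alpha_max = np.max(np.abs(X.T @ y)) / N
    fit = Lasso(alpha=0.99 * alpha_max, fit_intercept=False).fit(X, y)
    return np.flatnonzero(fit.coef_)

print(first_selected(X, y))            # [0]: regressor 0 has the largest |X_k^T y|
X_scaled = X.copy()
X_scaled[:, 1] *= 100.0                # inflate regressor 1's scale
print(first_selected(X_scaled, y))     # [1]: now regressor 1 enters first
```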

5 Regularization and equivalent parameterizations

Let \(\zeta_0 < \ldots < \zeta_K\) denote \(K+1\) partition boundaries in \(\mathbb{R}\), let \(x\in [\zeta_0, \zeta_K]\), and define two different sets of regressors:

\[ \begin{aligned} s_{k}(x) :={}& \mathrm{I}\left(\zeta_{k - 1} < x\right) \quad\textrm{for}\quad k=1,\ldots,K \\ r_{k}(x) :={}& \mathrm{I}\left(\zeta_{k - 1} < x\le \zeta_{k} \right) \quad\textrm{for}\quad k=1,\ldots,K. \end{aligned} \]

Let \(\boldsymbol{s}_n = (s_{1}(x_n), \ldots, s_{K}(x_n))^\intercal\), and \(\boldsymbol{r}_n = (r_{1}(x_n), \ldots, r_{K}(x_n))^\intercal\). (You might say “s” is for “step” and “r” is for “region.”)

(a)

Show that \(\boldsymbol{s}_n\) is an invertible linear transformation of \(\boldsymbol{r}_n\). As a consequence, OLS predictions using \(\boldsymbol{s}_n\) and \(\boldsymbol{r}_n\) are equivalent if there is no regularization.

(b)

Consider running Lasso regression with \(\boldsymbol{s}_n\). Sketch an example of what the approximating function \(\sum_{k=1}^K \beta_k s_{k}(x)\) would look like if \(\boldsymbol{\beta}\) is sparse.

(c)

Consider running Lasso regression with \(\boldsymbol{r}_n\). Sketch an example of what the approximating function \(\sum_{k=1}^K \beta_k r_{k}(x)\) would look like if \(\boldsymbol{\beta}\) is sparse.

(d)

Between (b) and (c), which would you expect to be more useful in a typical regression situation?

(e)

Is regularized regression invariant under invertible linear transformations of the regressors?

Solutions

(a)

You can convince yourself by picture or otherwise that \(s_k(x) = \sum_{j=k}^K r_j(x)\). Conversely, \(r_k(x) = s_k(x) - s_{k+1}(x)\) for \(k < K\) and \(r_K(x) = s_K(x)\), so the relation is invertible. (Note: an early version of the problem did not assume \(x\le \zeta_K\), so if students assume this implicitly, please give them full credit.)
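
A quick numerical check of this relationship (a sketch; the boundaries and data below are synthetic):

```python
# The "step" regressors are an invertible (upper-triangular) linear transformation
# of the "region" regressors.
import numpy as np

rng = np.random.default_rng(0)
K = 5
zeta = np.linspace(0.0, 1.0, K + 1)          # partition boundaries zeta_0, ..., zeta_K
x = rng.uniform(zeta[0], zeta[-1], size=10)

# r_k(x) = I(zeta_{k-1} < x <= zeta_k)  and  s_k(x) = I(zeta_{k-1} < x)
R = np.column_stack([(zeta[k - 1] < x) & (x <= zeta[k]) for k in range(1, K + 1)]).astype(float)
S = np.column_stack([zeta[k - 1] < x for k in range(1, K + 1)]).astype(float)

A = np.triu(np.ones((K, K)))                 # A[k, j] = 1 for j >= k
print(np.allclose(S, R @ A.T))               # True: s_n = A r_n for every row
print(np.linalg.matrix_rank(A) == K)         # True: A is invertible
```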

(b)

This should look like a step function, with wide steps, typically nonzero, with few points at which the step changes.

(c)

This should look like a function that is mostly zero with a few spikes.

(d)

Very often (b) is a more reasonable approximation.

(e)

It is not — the space in which you regularize has a big effect on what your sparse fit looks like, even if the transformations are invertible linear transformations.

6 RSS and cubic regressions

James et al. (2021), Section 3.7, Exercise 4.

Solutions

(a)

The cubic RSS will be lower, because you are projecting onto a higher–dimensional subspace.

(b)

We expect the linear model test set error to be lower, because the cubic model has added “noise variables” and so increased variance without decreasing bias.

(c)

No matter what, the training RSS for the cubic regression will be lower.

(d)

There is not enough information to tell which test set error will be lower.

7 An unhelpful colleague

For this problem, we will consider the problem of predicting human bodyfat percentage from easily measured body dimensions such as height, weight, and abdomen circumference (the regressors Height, Weight, and Abdomen), and so on. Since measuring bodyfat directly requires expensive equipment (such as immersion in a water tank to measure body volume), it would be useful to be able to approximate bodyfat using more easily obtained measures.

Assume that we have a dataset consisting of \(N\) observations, where \(\boldsymbol{x}_n \in \mathbb{R}^{P}\) consists of easily–obtained body measurements, such as Height, Weight, and Abdomen. The corresponding \(y_n \in [0,1]\) is an expensive–to–obtain but precise measure of bodyfat. You and a colleague would like to use \(\boldsymbol{x}\) to accurately predict an unseen \(y\) on a held–out dataset.

Unfortunately, your colleague has not taken a linear modeling class.

(a)

Your colleague notes that the regressors are on different scales, since they are measured in different units. For example, Height is measured in inches, and Weight in pounds. They suggest standardizing using US averages, replacing \(x_{np}\) with

\[ x'_{np} := \frac{x_{np} - \mu_p}{\sigma_p} \]

where \(\mu_p\) and \(\sigma_p\) are the US average and standard deviation of regressor \(p\), and then regressing on \(\boldsymbol{x}'_n\) instead.

When they do so, they find that the test error does not change at all. Can you explain why (in detail)?

(b)

Chastened, your colleague suggests that maybe it’s the difference between normalized height and weight that helps us predict. After all, it makes sense that height should only matter relative to weight, and vice versa. So they run the regression on the pairwise differences:

\[ x''_{nr} := x'_{np} - x'_{nq} \textrm{ for all pairs }p,q\textrm{ with }p > q. \]

That is, \(\boldsymbol{x}''_n \in \mathbb{R}^{P (P - 1) / 2}\), because there are \(P (P - 1) / 2\) distinct pairs of normalized covariates. Unfortunately, when they regress \(y\) on \(\boldsymbol{x}''\) using standard statistical software, they find that the fitted values do not change, and many of the coefficients are estimated as NaN. Can you explain why (in detail)?

(c)

Increasingly frustrated, your colleague suggests a research project where you improve your fit by regressing \(y_n \sim \boldsymbol{z}_n\) for new regressors \(\boldsymbol{z}_n\) of the form \(\boldsymbol{z}_n = \boldsymbol{A}\boldsymbol{x}_n\), where the matrix \(\boldsymbol{A}\) is optimized in an outer loop. That is, they suggest finding

\[ \begin{aligned} \boldsymbol{z}_n(A) :={}& \boldsymbol{A}\boldsymbol{x}_n\\ \hat{\mathscr{L}}(\boldsymbol{\beta}) :={}& \sum_{n=1}^N(y_n - \boldsymbol{z}_n(\boldsymbol{A})^\intercal\boldsymbol{\beta})^2 \\ \hat{\boldsymbol{\beta}}(A) :={}& \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \hat{\mathscr{L}}(\boldsymbol{\beta}) \\ \hat{A} :={}& \underset{A}{\mathrm{argmin}}\, \hat{\mathscr{L}}(\hat{\boldsymbol{\beta}}(A)). \end{aligned} \]

They would then use the prediction \(\hat{y}_\mathrm{new}= \hat{\boldsymbol{\beta}}(\hat{\boldsymbol{A}})^\intercal\boldsymbol{z}_\mathrm{new}(\hat{\boldsymbol{A}})\), where \(\boldsymbol{z}_\mathrm{new}(\hat{\boldsymbol{A}}) = \hat{\boldsymbol{A}}\boldsymbol{x}_\mathrm{new}\). Will this (complicated) procedure produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?

Solutions

(a)

The new regressors are an invertible affine transformation of the original regressors: a rescaling, plus a shift that is absorbed by the intercept when a constant is included in the regression. This means the fitted values \(\hat{\boldsymbol{Y}}\), and hence the test error, do not change; only the coefficients change.

(b)

The new regressors are all linear combinations of the original (normalized) regressors, so they contain no new information about \(y\). Moreover, the pairwise differences are linearly dependent on one another: for example, \((x'_{np} - x'_{nq}) + (x'_{nq} - x'_{nr}) = x'_{np} - x'_{nr}\). Hence \((\boldsymbol{X}'')^\intercal\boldsymbol{X}''\) is not invertible, \(\hat{\boldsymbol{\beta}}\) is not uniquely defined, and the software reports NaN for the redundant coefficients.
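
A small numerical check of this rank deficiency (a sketch; the synthetic Gaussian matrix below stands in for the normalized body measurements):

```python
# With P = 4 regressors, the P(P-1)/2 = 6 pairwise differences span at most a
# 3-dimensional space, so (X'')^T X'' is singular.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, P = 100, 4
X_std = rng.normal(size=(N, P))                    # stand-in for the normalized regressors

X_diff = np.column_stack([X_std[:, p] - X_std[:, q] for p, q in combinations(range(P), 2)])
print(X_diff.shape)                                # (100, 6)
print(np.linalg.matrix_rank(X_diff.T @ X_diff))    # 3, not 6: singular
```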

(c)

Since every component of \(\boldsymbol{A}\boldsymbol{x}_n\) is a linear combination of the components of \(\boldsymbol{x}_n\), regressing on \(\boldsymbol{z}_n(\boldsymbol{A})\) can never give a better in–sample fit than regressing on \(\boldsymbol{x}_n\) alone. (It can give a worse fit if \(\boldsymbol{A}\) is not full rank.) Searching over all possible \(\boldsymbol{A}\) will therefore never give you a better fit than \(y_n \sim \boldsymbol{x}_n\). This is not a good idea for a research project.

8 IID observations and Spotify data

A colleague has been working with the Kaggle Spotify dataset, which contains a large number of songs and the following (ML–derived) scalar–valued features, which are designed to capture human–recognizable aspects of a song: danceability, energy, mode, speechiness, acousticness, instrumentalness, liveness, and valence. The dataset also records the following more objective attributes of a song: key, tempo, loudness, and duration_ms.

Your colleague wants to make a music recommendation system which, given these attributes for a song randomly chosen from the entire Spotify corpus, will predict whether a user will like the song well enough to create custom playlists. They express this as a classification task, where a song is either liked or disliked. Note that the original Kaggle dataset does not have “liked” ratings, so you either need to get other people’s likes or generate them yourself.

(a)

Consider a different classification task, which is to identify the genre of the song (metal, classical, disco, rap, etc.) Do you think a genre classification task is easier or harder than detecting whether a user will like a song?

(b)

Your colleague begins by training a machine learning model on this dataset. Here is how the public data was collected:

Public Spotify ratings data

I collected 100 liked songs and 95 disliked songs

For those I like , I made a playlist of my favorite 100 songs. It is mainly French Rap , sometimes American rap , rock or electro music.

For those I dislike , I collected songs from various kind of music so the model will have a broader view of what I don’t like

There is :

  • 25 metal songs ( Cannibal Corps )
  • 20 ” I don’t like ” rap songs ( PNL )
  • 25 classical songs
  • 25 Disco songs
  • I didn’t include any Pop song because I’m kinda neutral about it

Your colleague produces a 90% out–of–sample test set accuracy on the public dataset (where the test set is a random sample taken from the public dataset, not from new data). Encouraged by this, they spend many hours producing their own dataset by randomly selecting 200 songs from the Spotify dataset, listening to and rating each, and training their own model to predict which songs they will like in the exact same way as above. Unfortunately, when trained on their own dataset, the out–of–sample performance of their model is hardly better than random chance.

How can you explain why their approach worked so well on the Kaggle training dataset, but failed so badly on their own dataset?

(c)

Express the difference between the Kaggle dataset and your colleague’s dataset in terms of IID sampling from a target population and expected loss minimization.

Solutions

(a)

Certainly genre classification is easier than identifying “liked” songs.

(b)

Since the “public dataset” had almost entirely different genres for “liked” versus “not liked” songs, in order to differentiate the two you needed only to learn a genre classifier. In fact, you needed only to identify rap — if you classify all rap songs as “liked”, then you’ll achieve a nearly 90% accuracy rate, in and out of sample.

Typically, of course, people’s taste does not cleave so neatly into genre boundaries. If you randomly choose songs from the whole corpus and say which ones you liked and which you didn’t, then a genre classifier will typically not do very well. In fact, if the original playlist had simply contained all rap songs, some of which the rater liked and some of which they didn’t, then it would have appeared to be a much harder problem.

(c)

The “public dataset” was not sampled in a way that is representative of how you’d actually want to use a song recommendation system — specifically the genre of the song was highly correlated with whether it was “liked”, in a way that is very different from the distribution over genre and “liking” in the whole population of songs.

9 Incorrectly specified linear regression

Assume that \(y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\alpha}^{*}+ \boldsymbol{z}_n^\intercal\boldsymbol{\gamma}^{*}+ \varepsilon_n\), for some \(\boldsymbol{\alpha}^{*}\) and \(\boldsymbol{\gamma}^{*}\), where \(\varepsilon_n\) is mean zero, has finite variance \(\sigma^2\), and is independent of \(\boldsymbol{x}_n\) and \(\boldsymbol{z}_n\). It follows that the optimal regression function is

\[ f^{\star}(\boldsymbol{x}, \boldsymbol{z}) = \mathbb{E}_{\,}\left[y\vert \boldsymbol{x}, \boldsymbol{z}\right] = \boldsymbol{x}^\intercal\boldsymbol{\alpha}^{*}+ \boldsymbol{z}^\intercal\boldsymbol{\gamma}^{*} \].

Assume further that \(\boldsymbol{x}_n\) and \(\boldsymbol{z}_n\) are IID across \(n\), but possibly correlated for a particular \(n\), and have finite covariances, and zero mean. Let \(\boldsymbol{X}\) and \(\boldsymbol{Z}\) denote the matrices with \(\boldsymbol{x}_n^\intercal\) and \(\boldsymbol{z}_n^\intercal\) in row \(n\), respectively.

Denote

\[ \begin{aligned} M_{xx} :={}& \mathbb{E}_{\,}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] \\ M_{zz} :={}& \mathbb{E}_{\,}\left[\boldsymbol{z}_n \boldsymbol{z}_n^\intercal\right] \\ M_{xz} :={}& \mathbb{E}_{\,}\left[\boldsymbol{x}_n \boldsymbol{z}_n^\intercal\right] =: M_{zx}^\intercal, \end{aligned} \]

and assume that \(M_{xx}\) and \(M_{zz}\) are invertible. Note that, since \(\mathbb{E}_{\,}\left[\boldsymbol{x}_n\right] = \boldsymbol{0}\) and \(\mathbb{E}_{\,}\left[\boldsymbol{z}_n\right] = \boldsymbol{0}\), the matrices \(M\) are covariances.

We will consider the misspecified regression \(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), letting \(\boldsymbol{\beta}^{*}:= \lim_{N \rightarrow \infty} \hat{\boldsymbol{\beta}}\) (in the sense that \(\hat{\boldsymbol{\beta}}\rightarrow \boldsymbol{\beta}^{*}\) in probability).

(a)

Write an expression for \(\boldsymbol{\beta}^{*}\) in terms of the quantities defined above.

(b)

Suppose we are using our linear regression for causal inference, and want to know \(\boldsymbol{\alpha}^{*}\), which we take as the causal effect of \(\boldsymbol{x}\) on \(y\).

Suppose \(M_{xz} \ne \boldsymbol{0}\). Does \(\boldsymbol{\beta}^{*}= \boldsymbol{\alpha}^{*}\)? Interpret this result in terms of unobserved confounders.

(c)

Suppose we are using linear regression for prediction, so we do not care about estimating \(\boldsymbol{\alpha}^{*}\), but rather care about the asymptotic quality of our estimates \(\hat{y}^*_\mathrm{new}= {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_\mathrm{new}\). Write an expression for the expected squared limiting bias,

\[ \mathbb{E}_{\,}\left[ \left(f^{\star}(\boldsymbol{x}_\mathrm{new}, \boldsymbol{z}_\mathrm{new}) - \hat{y}^{*}_\mathrm{new}\right)^2 \right]. \]

(d)

Compare (c) with the bias of the “correct” prediction based on \(\boldsymbol{x}\) alone: \[ \mathbb{E}_{\,}\left[ \left(f^{\star}(\boldsymbol{x}_\mathrm{new}, \boldsymbol{z}_\mathrm{new}) - {\boldsymbol{\alpha}^{*}}^\intercal\boldsymbol{x}_\mathrm{new}\right)^2 \right]. \]

In particular, show that the expected squared limiting bias of the “incorrect” regression is always smaller than the expected squared bias of the prediction \({\boldsymbol{\alpha}^{*}}^\intercal\boldsymbol{x}\).

(e)

Unobserved confounders are severely problematic for causal inference. Is the same true for prediction problems? Justify your answer in terms of the results from this question.

Solutions

(a)

Since \(\hat{\boldsymbol{\beta}}\rightarrow M_{xx}^{-1} \mathbb{E}_{\,}\left[\boldsymbol{x}_n y_n\right]\) and \(\mathbb{E}_{\,}\left[\boldsymbol{x}_n y_n\right] = M_{xx} \boldsymbol{\alpha}^{*}+ M_{xz} \boldsymbol{\gamma}^{*}\), we have \(\boldsymbol{\beta}^{*}= M_{xx}^{-1} (M_{xx} \boldsymbol{\alpha}^{*}+ M_{xz} \boldsymbol{\gamma}^{*}) = \boldsymbol{\alpha}^{*}+ M_{xx}^{-1} M_{xz} \boldsymbol{\gamma}^{*}\).

(b)

In this case, \(\boldsymbol{\beta}^{*}\ne \boldsymbol{\alpha}^{*}\) in general (whenever \(M_{xz}\boldsymbol{\gamma}^{*}\ne \boldsymbol{0}\)), because the unobserved confounder biases the estimates. We cannot safely do inference in the presence of unobserved confounders.

(c)

Plugging in (and dropping \(\mathrm{new}\) for compactness),

\[ \begin{aligned} &\mathbb{E}_{\,}\left[ \left((\boldsymbol{x}^\intercal\boldsymbol{\alpha}^{*}+ \boldsymbol{z}^\intercal\boldsymbol{\gamma}^{*}) - \boldsymbol{x}^\intercal(\boldsymbol{\alpha}^{*}+ M_{xx}^{-1}M_{xz} \boldsymbol{\gamma}^{*}) \right)^2 \right] \\&={} \mathbb{E}_{\,}\left[ \left(\boldsymbol{x}^\intercal M_{xx}^{-1}M_{xz} \boldsymbol{\gamma}^{*}- \boldsymbol{z}^\intercal\boldsymbol{\gamma}^{*}\right)^2 \right] \\&={} \mathbb{E}_{\,}\left[ {\boldsymbol{\gamma}^{*}}^\intercal M_{zx} M_{xx}^{-1} \boldsymbol{x}\boldsymbol{x}^\intercal M_{xx}^{-1} M_{xz} \boldsymbol{\gamma}^{*}+ {\boldsymbol{\gamma}^{*}}^\intercal\boldsymbol{z}\boldsymbol{z}^\intercal\boldsymbol{\gamma}^{*}- 2 {\boldsymbol{\gamma}^{*}}^\intercal\boldsymbol{z}\boldsymbol{x}^\intercal M_{xx}^{-1}M_{xz} \boldsymbol{\gamma}^{*} \right] \\&={} {\boldsymbol{\gamma}^{*}}^\intercal M_{zx} M_{xx}^{-1} M_{xx} M_{xx}^{-1} M_{xz} \boldsymbol{\gamma}^{*}+ {\boldsymbol{\gamma}^{*}}^\intercal M_{zz} \boldsymbol{\gamma}^{*}- 2 {\boldsymbol{\gamma}^{*}}^\intercal M_{zx} M_{xx}^{-1}M_{xz} \boldsymbol{\gamma}^{*} \\&={} {\boldsymbol{\gamma}^{*}}^\intercal(M_{zz} - M_{zx} M_{xx}^{-1}M_{xz}) \boldsymbol{\gamma}^{*}. \end{aligned} \]

(d)

\[ \begin{aligned} \mathbb{E}_{\,}\left[ \left((\boldsymbol{x}^\intercal\boldsymbol{\alpha}^{*}+ \boldsymbol{z}^\intercal\boldsymbol{\gamma}^{*}) - \boldsymbol{x}^\intercal\boldsymbol{\alpha}^{*}\right)^2 \right] ={} {\boldsymbol{\gamma}^{*}}^\intercal M_{zz} \boldsymbol{\gamma}^{*}. \end{aligned} \]

The difference is \(-{\boldsymbol{\gamma}^{*}}^\intercal M_{zx} M_{xx}^{-1}M_{xz} \boldsymbol{\gamma}^{*}\), which must be nonpositive because \(M_{xx}^{-1}\), and therefore \(M_{zx} M_{xx}^{-1}M_{xz}\), is positive semi-definite. You predict better using \(\boldsymbol{\beta}^{*}\), which absorbs the confounding, than using \(\boldsymbol{\alpha}^{*}\) alone.

(e)

The unobserved confounder helps prediction by being correlated with \(\boldsymbol{x}\), but it harms inference.
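
A simulation sketch of this tradeoff (the specific coefficients and the correlation between \(x\) and \(z\) below are arbitrary choices):

```python
# An unobserved confounder z biases the coefficient on x, but the misspecified
# regression still has smaller limiting prediction bias than using alpha* alone.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
alpha_star, gamma_star = 1.0, 2.0
z = rng.normal(size=N)
x = 0.8 * z + 0.6 * rng.normal(size=N)            # x correlated with z; Var(x) = 1
y = alpha_star * x + gamma_star * z + rng.normal(size=N)

beta_hat = (x @ y) / (x @ x)                      # regress y on x alone
print(beta_hat)                                   # approx alpha* + (M_xz / M_xx) gamma* = 2.6

f_star = alpha_star * x + gamma_star * z
print(np.mean((f_star - beta_hat * x) ** 2))      # approx gamma*^2 (M_zz - M_zx M_xx^{-1} M_xz) = 1.44
print(np.mean((f_star - alpha_star * x) ** 2))    # approx gamma*^2 M_zz = 4.0, which is larger
```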

10 Maximum likelihood estimator (254 only \(\star\) \(\star\) \(\star\))

(a)

Show that the OLS estimator is the maximum likelihood estimator (MLE) of the model

\[ \begin{aligned} \mathbb{P}_{\,}\left(y_n | \boldsymbol{x}_n\right) ={}& \mathcal{N}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n, \sigma^2\right) \\ \mathbb{P}_{\,}\left(\boldsymbol{x}_n\right) ={}& \textrm{Some unspecified distribution} \\ (\boldsymbol{x}_n, y_n) \overset{\mathrm{IID}}{\sim}{}& \mathbb{P}_{\,}\left(\boldsymbol{x}_n\right) \mathbb{P}_{\,}\left(y_n \vert \boldsymbol{x}_n\right). \end{aligned} \]

(b)

For two different densities, \(p(\boldsymbol{x}, y)\) and \(q(\boldsymbol{x}, y)\), defined with respect to the same dominating measure, the KL divergence is defined as

\[ \mathrm{KL}(p(\boldsymbol{x}, y) || q(\boldsymbol{x}, y)) := \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\log \frac{p(\boldsymbol{x}, y)}{q(\boldsymbol{x}, y)}\right], \]

where \(\mathbb{E}_{p(\boldsymbol{x}, y)}\left[\cdot\right]\) means the expectation is with respect to draws from \(p(\boldsymbol{x}, y)\).

Suppose that we have a parameterized family of conditional distributions, \(q(y| \boldsymbol{x}, \theta)\), some guess for the regressor distribution \(q(\boldsymbol{x})\) and define the risk function

\[ \mathscr{L}(\theta) := \mathrm{KL}(p(\boldsymbol{x}, y) || q(y| \boldsymbol{x}, \theta) q(\boldsymbol{x})). \]

Show that the classical maximum likelihood estimator

\[ \underset{\theta}{\mathrm{argmax}}\, \sum_{n=1}^N\log q(y_n \vert \boldsymbol{x}_n, \theta) \]

is the empirical risk minimizer for \(\mathscr{L}(\theta)\).

This means that the MLE can also be written as an ML-style risk minimization, where the loss is measured by KL divergence between the true and modeled distributions.

Solutions

(a)

For the Gaussian case, we can write

\[ \begin{aligned} \log \mathbb{P}_{\,}\left(\boldsymbol{x}_n\right) \mathbb{P}_{\,}\left(y_n \vert \boldsymbol{x}_n\right) ={}& -\frac{\sigma^{-2}}{2}(y_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n)^2 - \frac{1}{2}\log \sigma^2 - \frac{1}{2}\log 2\pi + \log \mathbb{P}_{\,}\left(\boldsymbol{x}_n\right) . \end{aligned} \]

Only the quadratic term depends on \(\boldsymbol{\beta}\), and it is a negative multiple of the OLS loss, so maximizing the log likelihood over \(\boldsymbol{\beta}\) is the same as minimizing the OLS loss. Note that \(\mathbb{P}_{\,}\left(\boldsymbol{x}_n\right)\) does not depend on \(\boldsymbol{\beta}\), and so the MLE does not depend on whether \(\boldsymbol{x}_n\) is treated as random or fixed.
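
A quick numerical check (a sketch with synthetic data; \(\sigma^2\) is held fixed at one since it does not affect the maximizer over \(\boldsymbol{\beta}\)):

```python
# Numerically maximizing the Gaussian log-likelihood over beta recovers the OLS solution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

def negative_log_likelihood(beta, sigma2=1.0):
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2) / sigma2 + 0.5 * N * np.log(2 * np.pi * sigma2)

beta_mle = minimize(negative_log_likelihood, x0=np.zeros(P)).x
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_mle, beta_ols, atol=1e-4))   # True
```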

(b)

We can write \(q(\boldsymbol{x}, y) = q(\boldsymbol{x}) q(y| \boldsymbol{x}; \theta)\), and then

\[ \begin{aligned} \mathrm{KL}(p(\boldsymbol{x}, y) || q(\boldsymbol{x}) q(y| \boldsymbol{x}; \theta)) ={}& \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\log \frac{p(\boldsymbol{x}, y)}{q(\boldsymbol{x}) q(y| \boldsymbol{x}; \theta)}\right] \\={}& \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\log p(\boldsymbol{x}, y)\right] - \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\log q(y\vert \boldsymbol{x}; \theta)\right] - \mathbb{E}_{p(\boldsymbol{x}, y)}\left[\log q(\boldsymbol{x})\right]. \end{aligned} \]

We see that the only part of this that depends on \(\theta\) is \(-\mathbb{E}_{p(\boldsymbol{x}, y)}\left[\log q(y\vert \boldsymbol{x}; \theta)\right]\), and the MLE objective is the negative of the empirical version of this loss.

11 Taylor series invariance (254 only \(\star\) \(\star\) \(\star\))

Suppose we have a scalar \(x_n \in \mathbb{R}\) and scalar responses \(y_n \in \mathbb{R}\), and that we are approximating the conditional expectation with a Taylor series centered at some \(x_0\):

\[ \mathbb{E}_{\,}\left[y| x\right] \approx \sum_{k=0}^K \beta_k (x- x_0)^k. \]

Let \(\hat{\boldsymbol{\beta}}\) denote the OLS estimate of \(\boldsymbol{\beta}\) for the preceding expansion.

(a)

Show that, for any \(x_\mathrm{new}\), the prediction \(\hat{y}_\mathrm{new}= \sum_{k=0}^K \hat{\beta}_k (x_\mathrm{new}- x_0)^k\) is invariant to the choice of \(x_0\). In this sense, it does not matter for linear prediction where you center your Taylor series.

(b)

Show that the conclusion (a) is not true in general if you do not include a constant term in the regression (i.e., if you start the Taylor series sum at \(k=1\) rather than \(k=0\)).

Solutions

(a)

For any \(x_0\) we can use the binomial expansion to write

\[ \begin{aligned} \sum_{k=0}^K \beta_k (x- x_0)^k ={}& \sum_{k=0}^K \beta_k \sum_{j=0}^k {k \choose j} x^{j} (x_0)^{k-j} (-1)^{k-j}. \end{aligned} \]

Ultimately, this is a linear combination of the powers \(1, x, \ldots, x^K\) irrespective of \(x_0\), so the regressors for any two centerings are invertible linear transformations of one another, and by Problem 1 the predictions coincide.
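
A numerical check of the invariance (a sketch with synthetic data; the two centers are arbitrary):

```python
# Polynomial regression predictions (with a constant term) do not depend on the center x_0.
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3
x = rng.uniform(-1, 1, size=N)
y = np.sin(2 * x) + 0.1 * rng.normal(size=N)
x_new = 0.3

def poly_prediction(x0):
    # Regressors (x - x0)^k for k = 0, ..., K, including the constant (k = 0) term.
    X = np.column_stack([(x - x0) ** k for k in range(K + 1)])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat @ np.array([(x_new - x0) ** k for k in range(K + 1)])

print(poly_prediction(0.0), poly_prediction(5.0))   # equal up to numerical error
```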

(b)

Each term with \(j=0\) is constant in \(x\). If the constant regressor (\(k=0\)) is not included, then the regressors \((x- x_0)^1, \ldots, (x- x_0)^K\) span the constant function when \(x_0 \ne 0\) but not when \(x_0 = 0\), so the spans, and hence the predictions, can differ across choices of \(x_0\).

12 Uniform laws for linear regression (254 only \(\star\) \(\star\) \(\star\))

Consider the special case of the linear regression loss for \(\boldsymbol{\beta}\in \mathbb{R}^{P}\),

\[ \begin{aligned} \mathscr{L}(\boldsymbol{\beta}) :={}& \mathbb{E}_{\,}\left[(y_\mathrm{new}- \boldsymbol{\beta}^\intercal\boldsymbol{x}_\mathrm{new})^2\right] \\ \hat{\mathscr{L}}(\boldsymbol{\beta}) :={}& \frac{1}{N} \sum_{n=1}^N(y_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n)^2 \\ \end{aligned} \]

and let \(K\) be a bounded subset of \(\mathbb{R}^{P}\). Show directly that the regression loss obeys a uniform law of large numbers, in the sense that

\[ \sup_{\boldsymbol{\beta}\in K} \left|\hat{\mathscr{L}}(\boldsymbol{\beta}) - \mathscr{L}(\boldsymbol{\beta})\right| \rightarrow 0, \]

where the limit is in probability. You may assume that \(\mathbb{E}_{\,}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right]\) and \(\mathbb{E}_{\,}\left[y^2\right]\) are finite and the observations are IID.

Solutions

Using our usual notation \(M_{yy} = \mathbb{E}_{\,}\left[y^2\right]\), \(M_{xx} = \mathbb{E}_{\,}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right]\) and \(M_{xy} = \mathbb{E}_{\,}\left[\boldsymbol{x}y\right]\), with hats on their empirical counterparts, we can write

\[ \begin{aligned} \hat{\mathscr{L}}(\boldsymbol{\beta}) - \mathscr{L}(\boldsymbol{\beta}) ={}& \frac{1}{N} \sum_{n=1}^Ny_n^2 - 2 \boldsymbol{\beta}^\intercal\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n y_n + \frac{1}{N} \sum_{n=1}^N\mathrm{trace}\left(\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\boldsymbol{\beta}^\intercal\right) - \\{}& \left(M_{yy} - 2 \boldsymbol{\beta}^\intercal M_{xy} + \mathrm{trace}\left(M_{xx} \boldsymbol{\beta}\boldsymbol{\beta}^\intercal\right) \right) \\={}& \hat{M}_{yy} - M_{yy} - 2 \boldsymbol{\beta}^\intercal(\hat{M}_{xy} - M_{xy}) + \mathrm{trace}\left((\hat{M}_{xx} - M_{xx})\boldsymbol{\beta}\boldsymbol{\beta}^\intercal\right). \end{aligned} \]

Since \(\boldsymbol{\beta}\) ranges over the compact set \(K\), \(B := \sup_{\boldsymbol{\beta}\in K} \left\Vert\boldsymbol{\beta}\right\Vert_2 < \infty\), so the supremum of the difference is bounded by \(|\hat{M}_{yy} - M_{yy}| + 2 B \Vert\hat{M}_{xy} - M_{xy}\Vert_2 + B^2 \Vert\hat{M}_{xx} - M_{xx}\Vert_F\), and each empirical moment converges to its population counterpart by the LLN.

This is a nice example because squared loss and linear models exhibit exactly the kind of separation between data and parameters that we had to assume in the integrable Lipschitz assumption.
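
A simulation sketch of the uniform law (a Gaussian design is assumed so that \(\mathscr{L}(\boldsymbol{\beta})\) has a closed form; the bounded set and grid are arbitrary choices):

```python
# The sup over a grid in a bounded set of |L_hat(beta) - L(beta)| shrinks as N grows.
import numpy as np

rng = np.random.default_rng(0)
P = 2
beta_star = np.array([1.0, -1.0])
sigma2 = 1.0

# Grid over the bounded set K = [-2, 2]^2.
grid = np.array([[b1, b2] for b1 in np.linspace(-2, 2, 21) for b2 in np.linspace(-2, 2, 21)])

def sup_gap(N):
    X = rng.normal(size=(N, P))                                  # so M_xx = I
    y = X @ beta_star + np.sqrt(sigma2) * rng.normal(size=N)
    gaps = []
    for beta in grid:
        L_hat = np.mean((y - X @ beta) ** 2)
        L_true = sigma2 + (beta - beta_star) @ (beta - beta_star)  # closed form since M_xx = I
        gaps.append(abs(L_hat - L_true))
    return max(gaps)

print([round(sup_gap(N), 3) for N in [100, 1000, 10000, 100000]])  # shrinks toward 0 (up to randomness)
```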

13 Bibliography

Buchweitz, E. 2025. “A Concise Course in Statistical Learning Theory.”
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.