Cross-validation
Goals
- Introduce cross–validation as an estimator of population loss, and some best practices:
  - Costs and benefits of leave–one–out versus k–fold (a bias / variance tradeoff)
  - The importance of choosing the right independent unit
  - The importance of repeating the full procedure
Reading
The primary reading for this section is Gareth et al. (2021) Chapter 5.1 and Hastie, Tibshirani, and Friedman (2009) chapter 7.10, for which these notes are just a small supplement.
Test sets
\[ \def\train{\textrm{train}} \def\test{\textrm{test}} \def\meanm{\frac{1}{M} \sum_{m=1}^M} \]
We spend some time developing theory to control the generalization error, \(\hat{\mathscr{L}}(\hat{f}) - \mathscr{L}(\hat{f})\). In practice, however, this theory is of limited use, because we typically cannot compute the relevant complexity measures. More importantly, the worst–case bounds we derived appear in practice to be quite loose and can suggest bad advice (see, e.g., Belkin et al. (2019)). In practice, we most often estimate the risk \(\mathscr{L}(\hat{f})\) using held–out data.
Recall that the estimator \(\hat{\mathscr{L}}(\hat{f}) = \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}(\boldsymbol{x}_n), y_n)\) is not a reliable estimator of \(\mathscr{L}(\hat{f})\) because \(\hat{f}\) is not independent of \((\boldsymbol{x}_n, y_n)\). But suppose we have an actual new dataset, \(\mathcal{D}_\mathrm{new}= ((\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_M, y_M))\). Then we could just estimate:
\[ \hat{\mathscr{L}}_\mathrm{new}(\hat{f}) \approx \frac{1}{M} \sum_{m=1}^M \mathscr{L}(\hat{f}(\boldsymbol{x}_m),y_m). \]
This is a consistent estimate of \(\mathscr{L}(\hat{f})\) since we did not use \(\mathcal{D}_\mathrm{new}\) to train \(\hat{f}\), and so \(\hat{f}\) is independent of \(\boldsymbol{x}_m, y_m\).
One might ask: if you had \(\mathcal{D}_\mathrm{new}\), why didn’t you just use it to train \(\hat{f}\)? Instead, you might imagine taking the data you do have and splitting it (randomly) into a “training set” \(\mathcal{D}_\train\) and a “test set” \(\mathcal{D}_\test\), using only \(\mathcal{D}_\train\) to compute \(\hat{f}\) and only \(\mathcal{D}_\test\) to estimate \(\hat{\mathscr{L}}_\test(\hat{f})\) using the above formula. There are costs and benefits to doing this:
- You have less data to train \(\hat{f}\)
- You have a good estimate of \(\mathscr{L}(\hat{f})\)
Putting more data in \(\mathcal{D}_\train\) ameliorates the first cost, at the risk of making your estimate of \(\mathscr{L}(\hat{f})\) worse.
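To make the split concrete, here is a minimal sketch using scikit–learn; the simulated data and the ridge model standing in for \(\hat{f}\) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data standing in for D; any (X, y) and any estimator would do.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# Randomly split D into D_train and D_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit f_hat using only D_train...
f_hat = Ridge(alpha=1.0).fit(X_train, y_train)

# ...and estimate its risk on D_test, which f_hat never saw.
test_loss = mean_squared_error(y_test, f_hat.predict(X_test))
print(test_loss)
```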
Cross validation
A clever way to use nearly all the data to compute \(\hat{f}\), and still get a reasonable estimate of the test error, is to repeat the above procedure many times, each time with a different \(\test\) set.
As an extreme case, let \(\hat{f}_{-n}\) denote the estimate computed with the \(n\)–th datapoint removed. Since we have removed only one datapoint, we might hope that \(\hat{f}_{-n} \approx \hat{f}\), so we have paid little estimation cost. Of course, the estimate
\[ \hat{\mathscr{L}}_\test(\hat{f}_{-n}) = \mathscr{L}(\hat{f}_{-n}(\boldsymbol{x}_n), y_n) \]
uses only a single datapoint, and so is highly variable, even if nearly unbiased. However, we can do this for each datapoint and average the results to reduce the variance of the test error estimate. This is called the “leave–one–out CV estimator,” or LOO-CV:
\[ \hat{\mathscr{L}}_{LOO} := \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}_{-n}(\boldsymbol{x}_n), y_n). \]
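A direct transcription of \(\hat{\mathscr{L}}_{LOO}\) might look like the following sketch; the squared–error loss and the ridge fit are placeholders for whichever loss and estimator you are actually using.

```python
import numpy as np
from sklearn.linear_model import Ridge

def loo_cv_loss(X, y):
    """LOO-CV: refit with each datapoint held out in turn, and average the losses."""
    N = len(y)
    losses = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n                      # every index except n
        f_minus_n = Ridge(alpha=1.0).fit(X[keep], y[keep])
        losses[n] = (f_minus_n.predict(X[[n]])[0] - y[n]) ** 2
    return losses.mean()
```

Note that this refits the model \(N\) times; for some estimators (e.g. OLS and ridge) there are closed–form shortcuts, but the brute–force loop is the general recipe.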
Of course, you can do the same thing leaving out two points, or three, and so on, giving “leave–k–out” CV estimators. If you instead partition the data into \(K\) folds of \(N/K\) points each and leave out one fold at a time, you get the “K–fold” CV estimator.
How many should you leave out? There seems to be a striking lack of useful theory. The rules of thumb are:
- By leaving out more datapoints, you “bias” your estimate of \(\mathscr{L}(\hat{f})\) if \(\hat{f}\) depends strongly on how many datapoints it is trained on.
- By leaving out more datapoints, you reduce the “variance” of your estimate of \(\mathscr{L}(\hat{f})\) by making the test set larger.
So there appears to be a sort of meta–bias–variance tradeoff in the estimation of \(\mathscr{L}(\hat{f})\). Common practice is five– or ten–fold cross–validation.
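In practice you rarely write the folds by hand. A sketch with scikit–learn’s KFold and cross_val_score (again with a ridge model and simulated data as placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# Five-fold CV: each datapoint is held out exactly once, in folds of roughly N/5 points.
scores = cross_val_score(
    Ridge(alpha=1.0), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",   # scikit-learn maximizes scores, hence the sign
)
print(-scores.mean())                   # the 5-fold CV estimate of the loss
```

Passing LeaveOneOut() (or KFold(n_splits=len(y))) as cv gives the LOO–CV estimate from the same call.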
Using CV for selection
Typically we actually want to use CV not just to estimate \(\mathscr{L}(\hat{f})\) for a particular procedure, but to choose among a variety of procedures. For example, we might have a family of ridge regressions:
\[ \hat{\boldsymbol{\beta}}(\lambda) := \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n - y_n)^2 + \lambda \left\Vert\boldsymbol{\beta}\right\Vert_2^2, \]
and we would like to choose \(\lambda\) to minimize
\[ \lambda^* := \underset{\lambda}{\mathrm{argmin}}\, \mathscr{L}(\hat{\boldsymbol{\beta}}(\lambda)). \]
Of course we do not know \(\mathscr{L}(\hat{\boldsymbol{\beta}}(\lambda))\), but we can use our standard trick, and estimate
\[ \hat{\lambda}:= \underset{\lambda}{\mathrm{argmin}}\, \hat{\mathscr{L}}_{CV}(\hat{\boldsymbol{\beta}}(\lambda)). \]
Why not simply minimize jointly, \(\hat{\boldsymbol{\beta}}, \hat{\lambda}:= \underset{\boldsymbol{\beta}, \lambda}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n - y_n)^2 + \lambda \left\Vert\boldsymbol{\beta}\right\Vert_2^2?\) Because the penalty can only increase the training objective, the joint minimizer (over \(\lambda \ge 0\)) simply sets \(\lambda = 0\), defeating the purpose of regularization.
Of course, once we have an estimate \(\hat{\lambda}\), we can then refit \(\hat{\boldsymbol{\beta}}(\hat{\lambda})\) on the whole dataset, treating all of it as training data.
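Here is a sketch of the full recipe (estimate \(\hat{\mathscr{L}}_{CV}(\hat{\boldsymbol{\beta}}(\lambda))\) over a grid of \(\lambda\) values, pick \(\hat{\lambda}\), then refit on everything); the grid and simulated data are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

lambdas = np.logspace(-3, 3, 13)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Estimate the CV loss of beta_hat(lambda) for each candidate lambda.
cv_loss = [
    -cross_val_score(Ridge(alpha=lam), X, y, cv=kf,
                     scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]
lambda_hat = lambdas[np.argmin(cv_loss)]

# Then refit on the whole dataset with the selected lambda_hat.
beta_hat = Ridge(alpha=lambda_hat).fit(X, y)
```

(A small caveat: scikit–learn’s Ridge penalizes an unnormalized sum of squares, so its alpha corresponds to \(N\lambda\) in the notation above; the selection logic is unchanged.)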
Note that we have divided the problem into “hyperparameters” (\(\lambda\)) which are estimated with cross–validation, and “parameters” (\(\boldsymbol{\beta}\)) which have been estimated with the training set. What is a parameter and what is a hyperparameter? In one sense, “hyperparameters are just vibes” (“In Defense of Typing Monkeys,” Recht (2025)); often our hyperparameters are extremely expressive. However, typical practice treats parameters as high–dimensional and capable of overfitting, and hyperparameters as a low to moderate–dimensional set of levers for controlling overfitting.
Putting it together
The scikit–learn documentation on cross–validation has this picture illustrating the workflow:
Note that there are two different places where the map \(\mathcal{D}\mapsto \hat{f}\) happens (during and after hyperparameter selection), and two different uses of test sets (hyperparameter evaluation and “final” evaluation).
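One way to realize this two–level structure in code is nested cross–validation: an inner CV that selects the hyperparameters (part of the map \(\mathcal{D}_\train \mapsto \hat{f}\)) and an outer CV whose held–out folds are used only for the “final” evaluation. A sketch, not necessarily the exact workflow in the figure, with illustrative grids and data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# Inner loop: the map D_train -> f_hat, including the CV choice of lambda.
inner = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 13)},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)

# Outer loop: held-out folds that the inner selection never touches,
# giving an honest estimate of the whole procedure's loss.
outer_scores = cross_val_score(
    inner, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error",
)
print(-outer_scores.mean())
```

The final model you actually deploy would then come from fitting the inner selection procedure once on all of the data.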
Choosing the right exchangeable unit
Suppose you have an image segmentation task in which you have to identify whether each pixel is part of a tumor, and if so, what kind of tumor it is. Suppose the images come from multiple different hospitals. What should you leave out when you run cross–validation? There are many choices:
- Randomly chosen individual pixels
- Randomly chosen contiguous sub–images
- Randomly chosen images
- All images from a randomly chosen hospital
Which choice is correct? The answer depends on what you’re going to do with your ML procedure. Usually the coarsest answer is the best, but often that limits the number of “independent” units you have.
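If the hospital is the right unit, the folds should never split one hospital’s images across train and test. One way to enforce this is scikit–learn’s GroupKFold; the hospital_id array below is a hypothetical label recording each image’s hospital.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: one feature row per image, plus a label recording each
# image's hospital (in reality these would come from your segmentation pipeline).
rng = np.random.default_rng(0)
n_images, n_hospitals = 300, 6
X = rng.normal(size=(n_images, 20))
y = rng.integers(0, 2, n_images)
hospital_id = rng.integers(0, n_hospitals, n_images)

# GroupKFold never splits a group (here, a hospital) across train and test.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=hospital_id):
    # Every hospital's images land entirely in train or entirely in test.
    assert set(hospital_id[train_idx]).isdisjoint(hospital_id[test_idx])
```

The same groups argument can be passed to cross_val_score if you want the one–line CV estimators above at the hospital level.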
Repeating the whole procedure
The test set is only a real test set if \(\hat{f}\) is truly independent of \(\mathcal{D}_\test\). It’s easy to accidentally induce dependence: be on guard against using the whole dataset to make any modeling decisions, since doing so leads to overly optimistic estimates of test performance.
Here is a stylized version of an example from Hastie, Tibshirani, and Friedman (2009) Chapter 7.10.3, which is easier to understand mathematically, but less realistic. Suppose we have binary \(y_n\), with 50/50 probability of being \(0\) or \(1\), and binary regressors \(\boldsymbol{x}_n \in \{0,1\}^P\) whose entries are independent of one another and totally independent of \(y_n\). Let’s use squared error for simplicity. Since the regressors and the response are independent, the loss of any estimator satisfies \(\mathscr{L}(\hat{f}) \ge \mathrm{Var}\left(y\right) = 0.25\).
For any given regressor \(p\), the probability that \(x_{pn} = y_n\) for all \(n\) is \(0.5^N\), which is a very small number. However, if \(\boldsymbol{x}_n\) is very high dimensional, so that \(P \gg 2^N\), then with high probability there will be at least one \(p\) with \(x_{pn} = y_n\) for every \(n\).
Suppose we first comb through the whole dataset and choose the regressor \(p^*\) that best matches \(y_n\). We then use cross–validation to estimate \(\mathscr{L}(\hat{\beta})\), where \(\hat{\beta}\) is the OLS fit of \(y_n\) on \(x_{p^* n}\). Our CV loss is (essentially) zero, since \(x_{p^* n} = y_n\) on all the datapoints, even the left–out folds!
What went wrong? The problem is that we used the whole dataset to choose the regressor first, and then pretended we didn’t. To avoid this kind of problem, always rerun the whole analysis, from the very first step, within every fresh train / test split.
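A small simulation makes the failure concrete. The sketch below (with arbitrary choices of \(N\), \(P\), and a five–fold split) compares the wrong procedure, which selects the regressor on the full dataset before cross–validating, with the right one, which redoes the selection inside every fold.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
N, P = 15, 200_000                            # few datapoints, a huge pile of useless regressors
y = rng.integers(0, 2, N).astype(float)
X = rng.integers(0, 2, (N, P)).astype(float)  # independent of y by construction

def ols_predict(x_tr, y_tr, x_te):
    """OLS of y on a single regressor plus an intercept."""
    A = np.column_stack([x_tr, np.ones_like(x_tr)])
    beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return beta[0] * x_te + beta[1]

def best_match(X_sub, y_sub):
    """Index of the regressor that agrees with y most often."""
    return int(np.argmax((X_sub == y_sub[:, None]).mean(axis=0)))

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Wrong: choose the regressor on the FULL dataset, then cross-validate only the OLS fit.
p_full = best_match(X, y)
losses_wrong = []
for tr, te in kf.split(X):
    pred = ols_predict(X[tr, p_full], y[tr], X[te, p_full])
    losses_wrong.append(np.mean((pred - y[te]) ** 2))

# Right: redo the whole analysis, selection included, inside every fold.
losses_right = []
for tr, te in kf.split(X):
    p_fold = best_match(X[tr], y[tr])
    pred = ols_predict(X[tr, p_fold], y[tr], X[te, p_fold])
    losses_right.append(np.mean((pred - y[te]) ** 2))

print(np.mean(losses_wrong))   # typically close to zero: wildly optimistic
print(np.mean(losses_right))   # typically around 0.25 or worse, as it must be
```

The first number looks spectacular only because the selection step already saw the held–out data; the second reflects the fact that the regressors carry no information about \(y\).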