# Cross-validation cautionary notes
\[ \def\train{\textrm{train}} \def\test{\textrm{test}} \def\meanm{\frac{1}{M} \sum_{m=1}^M} \]
## Goals
- Point out some ways that CV can fail
## Three examples of CV failure
Please discuss precisely which assumption behind CV goes wrong in each of the following scenarios.
### Images example
Suppose you have an image segmentation task where you have to identify whether each pixel is part of a tumor. You cast this problem as a classification problem, in which each pixel is classified separately. Let \(x_{ip}\) denote the information in pixel \(p\) of image \(i\), and \(y_{ip}\) the tumor classification. Your empirical risk function thus looks like
\[ \hat{\mathscr{R}}(f) = \sum_{\textrm{image }i} \sum_{\textrm{pixel }p} \mathscr{L}(f(x_{ip}), y_{ip}). \]
Suppose you perform CV leaving out individual pixels, and find that you have excellent out-of-sample risk, but when you then try genuinely new images, you find that the classifier performs very badly.
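A small simulation can make this concrete (entirely made-up data, not the medical example itself; the variable names below are hypothetical). Pixels within an image are nearly identical, so leave-one-pixel-out CV lets the model effectively "see" the test image during training, while leave-one-image-out does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_pixels = 20, 25

# Each image has a latent "appearance" shared by all of its pixels, and a
# label that is independent of appearance -- so no classifier can truly
# beat 50% accuracy on genuinely new images.
appearance = rng.normal(size=n_images)
labels = rng.integers(0, 2, size=n_images)
x = (appearance[:, None] + 0.001 * rng.normal(size=(n_images, n_pixels))).ravel()
y = np.repeat(labels, n_pixels)
img = np.repeat(np.arange(n_images), n_pixels)

def one_nn_cv_accuracy(held_out_with):
    """1-nearest-neighbor CV accuracy, excluding held_out_with(i) from training."""
    correct = 0
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[held_out_with(i)] = np.inf  # remove the held-out fold from training
        correct += y[d.argmin()] == y[i]
    return correct / len(x)

# Leave-one-pixel-out: the nearest training pixel is from the same image,
# so the CV accuracy looks excellent.
pixel_cv = one_nn_cv_accuracy(lambda i: np.arange(len(x)) == i)

# Leave-one-image-out: all same-image pixels are held out together, and the
# accuracy collapses to roughly chance.
image_cv = one_nn_cv_accuracy(lambda i: img == img[i])
```

The point is that CV estimates out-of-sample risk for *exchangeable* units; here the exchangeable unit is the image, not the pixel.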
### Super useful feature example
This is a popular (but probably apocryphal) [example](https://gwern.net/tank):
“The Army trained a program to differentiate American tanks from Russian tanks with 100% [CV] accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness.”
Whether or not this actually happened, how would you characterize the failure of CV to give a good estimate of the out-of-sample risk?
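One way to see this failure mode in miniature (with entirely made-up numbers and hypothetical variable names): if a nuisance variable like brightness is confounded with the class in the collected dataset, then every CV fold inherits the confound, and no amount of reshuffling within that dataset can detect it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical training collection: class 0 photographed on sunny days
# (bright), class 1 on cloudy days (dark). Brightness is a perfect proxy
# for the class *within this dataset*, purely because of the confound.
y = rng.integers(0, 2, size=n)
brightness = np.where(y == 0,
                      rng.normal(0.8, 0.05, size=n),
                      rng.normal(0.3, 0.05, size=n))

# A trivial learned rule: predict class 1 when the photo is dark.
# (Leave-one-out CV barely moves this threshold, so the CV accuracy is
# essentially the same number.)
threshold = brightness.mean()
cv_accuracy = np.mean((brightness < threshold) == y)

# Genuinely new photos: weather is no longer tied to the class, so the
# learned rule is no better than a coin flip.
y_new = rng.integers(0, 2, size=n)
brightness_new = rng.normal(0.55, 0.25, size=n)
new_accuracy = np.mean((brightness_new < threshold) == y_new)
```

CV assumes the held-out data is drawn from the same distribution as the data you will face at deployment; here the training collection and deployment distributions differ.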
### Lots of regressors example
Here is a stylized version of an example from Hastie, Tibshirani, and Friedman (2009) Chapter 7.10.3, which is easier to understand mathematically, but less realistic. Suppose we have binary \(y_n\), with 50/50 probability of being \(0\) or \(1\), and binary regressors \(\boldsymbol{x}_n \in \{0,1\}^P\), each entry an independent fair coin flip, totally independent of \(y_n\). Let’s use squared error for simplicity. Since the regressors and response are independent, the risk of any predictor satisfies \(\mathscr{R}(\hat{f}) \ge \mathrm{Var}\left(y\right) = 0.25\).
For any given regressor \(p\), the probability that \(x_{pn} = y_n\) for all \(n\) is \(0.5^N\), which is a very small number. However, if \(\boldsymbol{x}_n\) is very high dimensional, so that \(P \gg 2^N\), then with high probability there will be at least one \(p\) with \(x_{pn} = y_n\) for every \(n\).
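This "at least one perfect match" claim is easy to check numerically (the sizes below are hypothetical, chosen only to make the arithmetic vivid):

```python
import numpy as np

N, P = 20, 10_000_000   # hypothetical sizes: N datapoints, P regressors

# One fixed regressor matches all N labels with probability 0.5^N ...
p_one = 0.5 ** N
# ... but among P independent regressors, at least one perfect match is
# nearly certain once P >> 2^N:
p_any = 1.0 - (1.0 - p_one) ** P

# Small direct simulation with N = 10, P = 100_000: we expect roughly
# P * 0.5^10, i.e. around a hundred, perfect matchers.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10)
X = rng.integers(0, 2, size=(100_000, 10))
n_perfect = int((X == y).all(axis=1).sum())
```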
Suppose we first comb through the dataset to choose the regressor that best matches \(y_n\), i.e. the \(p\) with \(x_{pn} = y_n\) for all \(n\). We then use cross-validation to estimate \(\mathscr{R}(\hat{\beta})\), where \(\hat{\beta}\) is chosen using OLS. Our CV loss is zero, since \(x_{pn} = y_n\) on all the datapoints, even the left-out set, despite the fact that we know the actual risk is at least \(0.25\).
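The whole story fits in a few lines of simulation (a sketch with hypothetical sizes): because the regressor is selected on the *full* dataset before any splitting, every CV fold reuses the same lucky coincidence, and leave-one-out CV reports a loss near zero even though the true risk is at least \(0.25\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 12, 1_000_000   # hypothetical sizes, with P >> 2^N

y = rng.integers(0, 2, size=N)
X = rng.integers(0, 2, size=(P, N), dtype=np.int8)

# Step 1 (done on the FULL dataset, before any splitting): comb through all
# P regressors for the one that best matches y.
best = (X != y).sum(axis=1).argmin()
x = X[best].astype(float)   # with high probability, x equals y exactly

# Step 2: leave-one-out CV for OLS on the selected regressor. Every fold
# inherits the coincidence found in step 1, so the CV loss is near zero.
cv_losses = []
for i in range(N):
    tr = np.arange(N) != i
    A = np.column_stack([np.ones(tr.sum()), x[tr]])     # intercept + slope
    beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)    # OLS on the fold
    pred = beta[0] + beta[1] * x[i]
    cv_losses.append((y[i] - pred) ** 2)

cv_loss = float(np.mean(cv_losses))   # ~ 0, but the true risk is >= 0.25
```

The fix, as HTF emphasize, is that *all* data-dependent selection steps must happen inside each CV fold, not before the split.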