Lab 5 Presentation - Cross Validation
1 Outstanding debt from linear regression
2 Estimating the expected test error
We have training data \(\boldsymbol{X}\in\mathbb{R}^{N \times P}\) and \(\boldsymbol{y}\in\mathbb{R}^N\) comprising independent and identically distributed observations, and we fit a prediction model \(\hat f\) over \((\boldsymbol{X}, \boldsymbol{y})\). We are interested in finding a model that minimizes the expected test loss: \[ \text{err}(\hat f) = \mathbb{E}[\mathscr{L}(\hat f(\boldsymbol{x}), y)] \] where \(\mathscr{L}\) is some loss function, and \((\boldsymbol{x},y)\) is a new observation drawn from the same distribution as, and independently of, the observations in \((\boldsymbol{X},\boldsymbol{y})\).
We may want to estimate \(\text{err}(\hat f)\) for a number of reasons:
- We want to have a sense of how good \(\hat f\) is before we deploy it to production.
- We want to compare two candidate models \(\hat f_1\) and \(\hat f_2\) and choose the best among them.
- We want to optimize hyperparameters for \(\hat f\).
2.1 Train error
We estimate the test error \(\text{err}(\hat f)\) by averaging loss over the training set \((\boldsymbol{X},\boldsymbol{y})\): \[ \hat{\text{err}}(\hat f) = \frac{1}{N}\sum_{n=1}^N \mathscr{L}(\hat f(\boldsymbol{X}_{n,:}), \boldsymbol{y}_n), \] where \(\boldsymbol{X}_{n,:}\) is the \(n\)’th row of \(\boldsymbol{X}\). This is a biased estimator, typically far too optimistic. Consider the horrid case of \(N=1,000\) observations and \(P=1,000,000\) features that are all independent of the target, where \(\hat f\) fits OLS on the single feature most correlated with the target in the training set. The test predictions would be essentially random, so the test error would be high, but the train error would be low: out of a million pure-noise features, at least one is likely to appear highly correlated with the target by chance.
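A scaled-down numerical sketch of this effect (much smaller \(N\) and \(P\) than above so it runs instantly; the code, library choice, and numbers are illustrative, not part of the lab):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 20_000            # scaled-down version of the "horrid" setup above
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)   # the target is independent of every feature

# Select the single feature most correlated with y in the training data.
corr = (X - X.mean(0)).T @ (y - y.mean()) / (N * X.std(0) * y.std())
j = int(np.argmax(np.abs(corr)))

# OLS on that one feature (slope and intercept).
beta, alpha = np.polyfit(X[:, j], y, deg=1)
train_mse = np.mean((y - (alpha + beta * X[:, j])) ** 2)

# Fresh observations from the same distribution: the selected feature carries no signal.
x_new, y_new = rng.standard_normal(N), rng.standard_normal(N)
test_mse = np.mean((y_new - (alpha + beta * x_new)) ** 2)

print(f"train MSE: {train_mse:.2f}   test MSE: {test_mse:.2f}")
# The train MSE comes out well below Var(y) ≈ 1; the test MSE does not.
```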
2.2 Validation error
We split the training data into two non-overlapping sets \((\boldsymbol{X}_{\text{train}},\boldsymbol{y}_{\text{train}})\) and \((\boldsymbol{X}_{\text{valid}}, \boldsymbol{y}_{\text{valid}})\) comprising \(N_{\text{train}}\) and \(N_{\text{valid}}\) observations, respectively. We fit \(\hat f\) over the \((\boldsymbol{X}_{\text{train}}, \boldsymbol{y}_{\text{train}})\) and we estimate the test error \(\text{err}(\hat f)\) by averaging loss over the validation set \((\boldsymbol{X}_{\text{valid}}, \boldsymbol{y}_{\text{valid}})\): \[ \hat{\text{err}}(\hat f) = \frac{1}{N_{\text{valid}}}\sum_{n=1}^{N_{\text{valid}}} \mathscr{L}\left(\hat f((\boldsymbol{X}_{\text{valid}})_{n,:}), (\boldsymbol{y}_{\text{valid}})_{n}\right). \]
This is an unbiased estimator of the test error of a model trained on \(N_{\text{train}}\) observations. Because we eventually train the final model with all \(N\) (\(>N_{\text{train}}\)) observations, it is a biased estimator of that model's test error, usually too pessimistic: models trained on more data tend to perform better, so the validation error tends to overestimate the final model's error. If \(N_{\text{train}}\) is increased:
- The bias of \(\hat{\text{err}}(\hat f)\) decreases.
- The variance of \(\hat{\text{err}}(\hat f)\) increases, because the validation set has fewer observations.
Note that, under the IID assumption, it makes no difference which observations end up in the train set and which in the validation set; a uniformly random split is fine (this changes for non-IID data, see Section 2.4).
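A minimal sketch of this procedure, assuming scikit-learn is available (the estimator, loss, and synthetic data below are illustrative placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + rng.standard_normal(200)

# Random split into N_train and N_valid observations (fine under the IID assumption).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

f_hat = LinearRegression().fit(X_train, y_train)                # fit on the train part only
err_hat = mean_squared_error(y_valid, f_hat.predict(X_valid))   # average loss on the validation part
print(f"estimated test error (validation MSE): {err_hat:.3f}")
```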
2.3 Cross validation error
We split the training data into \(K\) non-overlapping equal-sized folds: \((\boldsymbol{X}^1,\boldsymbol{y}^1), \dots,(\boldsymbol{X}^K, \boldsymbol{y}^K)\). For each \(k=1,\dots,K\), we do:
- Fit \(\hat f\) on the concatenation of all folds but the \(k\)’th.
- Estimate the test error \(\text{err}(\hat f)\) by averaging loss over the \(k\)’th fold: \[ \hat{\text{err}}_k(\hat f) = \frac{K}{N}\sum_{n=1}^{N/K} \mathscr{L}\left(\hat f((\boldsymbol{X}^k)_{n,:}), (\boldsymbol{y}^k)_{n}\right). \]
Then, the final estimate of the test error \(\text{err}(\hat f)\) is given by averaging the estimates across all folds: \[ \hat{\text{err}}(\hat f) = \frac{1}{K} \sum_{k=1}^K \hat{\text{err}}_k(\hat f). \]
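The procedure above translates almost line by line into code. A minimal sketch with \(K=5\), squared loss, and an illustrative linear model (none of these choices is prescribed by the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, K = 200, 5
X = rng.standard_normal((N, 5))
y = X @ rng.standard_normal(5) + rng.standard_normal(N)

# Split shuffled indices into K non-overlapping, (roughly) equal-sized folds.
folds = np.array_split(rng.permutation(N), K)

fold_errs = []
for k in range(K):
    valid_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])

    f_hat = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on all folds but the k'th
    resid = y[valid_idx] - f_hat.predict(X[valid_idx])
    fold_errs.append(np.mean(resid ** 2))                        # err_hat_k: average loss over fold k

err_hat = np.mean(fold_errs)                                     # average over the K fold estimates
print(f"{K}-fold CV estimate of the test error: {err_hat:.3f}")
```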
This is an unbiased estimator for the test error if \(\hat f\) is trained over \(N - N/K\) observations, so it is still biased in the same way that the validation error is. However, the variance of the estimator \(\hat{\text{err}}(\hat f)\) is reduced compared to the previous method (a single train-validation split with \(N_{\text{train}}=N-N/K\)) because the error is averaged over all \(N\) observations rather than only \(N_{\text{valid}}\) of them (there are some subtleties here).
If \(K\) is increased:
- The bias of \(\hat{\text{err}}(\hat f)\) decreases.
- The computational burden is increased because you need to train \(K\) different models.
- The variance of \(\hat{\text{err}}(\hat f)\) may increase: the training sets of the \(K\) fits overlap more and more, so the per-fold estimates become highly correlated (this point is subtle and debated).
It is common to use \(K=5\) or \(K=10\). As with a single split, under the IID assumption it makes no difference how the observations are assigned to folds; a random assignment is fine.
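In practice one would usually let a library handle the fold bookkeeping; for example, the estimate above can be reproduced with scikit-learn (again an illustrative sketch, not the lab's prescribed tooling):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + rng.standard_normal(200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # random fold assignment: fine under IID
scores = cross_val_score(LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=cv)
err_hat = -scores.mean()   # sklearn negates losses ("higher is better"), so flip the sign back
print(f"5-fold CV estimate of the test error: {err_hat:.3f}")
```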
2.4 Non-IID data
On real data that involve an element of time or space, the assumption of independence and identical distribution is often violated in two common ways:
- Temporal or spatial correlation - adjacent observations (in time or space) might be heavily dependent on each other, while observations that are far apart are only weakly dependent.
- Non-stationarity - the underlying relation between \(\boldsymbol{X}\) and \(\boldsymbol{y}\) may change with time or space.
Common cross-validation practice in these cases:
- Folds are made contiguous in time and in space. Note that the folds are not independent - observations near the boundary of a fold (in time or space) will be more dependent on adjacent folds, while observations in the interior will be less dependent.
- Folds should be large enough (i.e. \(K\) small enough) so that most of the observations in a fold are in the interior, far from the boundary.
- For each iteration, the folds that participate in training are restricted to dates earlier than the held-out fold. This prevents leakage of information from the future (see the sketch at the end of this section).
Note that with non-stationary data it becomes even more important to use cross validation rather than a single train-validation split: cross validation tests the model on many different dates and geographical locations, not just a few particular ones, so the error estimate is more robust.
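A minimal sketch of time-ordered folds where training is restricted to the past, using scikit-learn's TimeSeriesSplit (this assumes the rows of \(\boldsymbol{X}\) and \(\boldsymbol{y}\) are sorted by time; the mildly non-stationary synthetic data is only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
N = 300
t = np.arange(N)
X = rng.standard_normal((N, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + 0.01 * t + rng.standard_normal(N)  # drifting mean: mildly non-stationary

fold_errs = []
# Each split holds out a contiguous block of later observations for validation
# and trains only on strictly earlier observations (no shuffling, no future leakage).
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    f_hat = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[valid_idx] - f_hat.predict(X[valid_idx])
    fold_errs.append(np.mean(resid ** 2))

print(f"time-ordered CV estimate of the test error: {np.mean(fold_errs):.3f}")
```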