Lab 6 Presentation - Cross Validation for Time Series Data

Author

Josh Davis

1 Today

  • Why cross validation fails for time series data
  • Different approaches to fold selection
  • Expanding window vs. sliding window
  • Applying time-aware CV to the West Nile Virus challenge

2 Recap: Cross Validation

Standard CV: split data into \(K\) non-overlapping folds and for each fold \(k\):

  1. Train \(\hat f\) on all folds except \(k\)
  2. Evaluate on fold \(k\)
  3. Average the loss across all \(k\) folds

The key assumption is that the data are independent and identically distributed (IID). Each fold is an unbiased sample of the population, so it does not matter how we assign observations to folds.
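As a concrete sketch, here is standard K-fold CV with scikit-learn on a toy synthetic regression problem (the dataset, the linear model, and the choice \(K = 5\) are all illustrative assumptions, not part of the lab):

```python
# Standard K-fold CV sketch (assumes IID data); toy synthetic dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_losses = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on all folds except k
    fold_losses.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

cv_error = np.mean(fold_losses)  # average loss across the K folds
```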

3 What goes wrong with time series data?

Suppose we have observations indexed by time \(t = 1, 2, \dots, N\). Three things may be true:

  • Temporal correlation: \((\mathbf{x}_t, y_t)\) and \((\mathbf{x}_{t+1}, y_{t+1})\) may be correlated (observations close in time are more alike than observations far apart).
  • Non-stationarity: The relationship between \(\mathbf{x}\) and \(y\) may change over time.
  • Leakage: In deployment, a model never gets to train on future data to predict the past. We want to avoid doing this during model selection as well.

If we ignore the time structure and assign observations to folds at random, each of these issues causes problems.

3.1 An example of leakage

Suppose mosquito trap counts from August 15th are in the validation fold, and counts from August 14th and August 16th are in the training fold. The model can effectively “see” the future when predicting on the validation fold, because the training observations from August 16th are temporally adjacent and highly correlated.

This can lead to overly optimistic CV error estimates: the model looks much better on the CV folds than it will on true future data.
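This optimism is easy to reproduce in a toy simulation (a hypothetical setup, not the West Nile data): on a random walk with time as the only feature, a 1-nearest-neighbor model looks excellent under random CV, because every validation point has a temporally adjacent training neighbor, and much worse under a past-only split:

```python
# Toy leakage demo: random CV vs. past-only CV on autocorrelated data.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
n = 200
y = np.cumsum(rng.normal(size=n))    # random walk: highly autocorrelated
X = np.arange(n).reshape(-1, 1)      # the only feature is the time index

model = KNeighborsRegressor(n_neighbors=1)
mse_random = -cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error").mean()
mse_past_only = -cross_val_score(
    model, X, y, cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error").mean()
# mse_random comes out far smaller than mse_past_only: that gap is the leakage
```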

4 Three approaches to fold selection

We will compare three strategies, from worst to best.

4.1 Approach 1: Plain (Random) CV

Assign observations to folds uniformly at random, with no regard for time ordering.

shuffle all observations randomly
split into K equal folds
for each fold k:
    train on all folds except k
    validate on fold k

Problem: Training folds contain observations from the future relative to the validation fold. The model leaks information from the future, and CV error is too optimistic.

4.2 Approach 2: Block CV

Sort observations by time and split into \(K\) contiguous blocks. Use each block as a validation fold in turn, training on all remaining blocks.

sort observations by time
split into K contiguous blocks: B1, B2, ..., BK
for each block Bk:
    train on all blocks except Bk
    validate on Bk

Improvement: Folds are contiguous in time, so temporally adjacent observations are less often split between train and validation.

Remaining problem: When block \(B_k\) is the validation fold, the training set includes blocks from the future relative to \(B_k\). This still allows future leakage.
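In scikit-learn terms, Block CV is just KFold with shuffle=False applied to time-sorted rows (a minimal sketch; the 12-point toy index stands in for real timestamps):

```python
# Block CV sketch: KFold without shuffling on time-sorted data gives
# K contiguous validation blocks.
import numpy as np
from sklearn.model_selection import KFold

times = np.arange(12)  # stand-in for rows already sorted by time
kf = KFold(n_splits=4, shuffle=False)
for train_idx, val_idx in kf.split(times):
    # each validation block is contiguous, but the training set still
    # includes blocks from the future relative to early validation blocks
    print("validate on", val_idx, "| train on", train_idx)
```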

4.3 Approach 3: Block + Past-Only CV

Like Block CV, split into \(K\) contiguous blocks. But when evaluating on block \(B_k\), train only on blocks \(B_1, \dots, B_{k-1}\) (the past). Never use future data for training.

sort observations by time
split into K contiguous blocks: B1, B2, ..., BK
for k = 2, 3, ..., K:
    train on B1, B2, ..., B(k-1)
    validate on Bk

Why this is correct: Every validation fold simulates predicting the future using only the past. This is the closest CV analogue to the actual deployment scenario. This strategy is implemented in scikit-learn as TimeSeriesSplit.
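A minimal TimeSeriesSplit sketch (same 12-point toy index) that checks the past-only property directly:

```python
# Past-only block CV via TimeSeriesSplit: every training index
# precedes every validation index, so no future leakage is possible.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

times = np.arange(12)  # stand-in for rows already sorted by time
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(times):
    assert train_idx.max() < val_idx.min()  # strictly past-only
    print("train", train_idx, "→ validate", val_idx)
```

Note how the training set grows with each split; that is the expanding-window behavior discussed in Section 6.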

5 Comparing the three approaches

Strategy          Folds contiguous?   Train on past only?   Leakage risk
Plain CV          No                  No                    High
Block CV          Yes                 No                    Moderate
Block + Past CV   Yes                 Yes                   Low

Note that Block + Past CV uses fewer training observations per fold (only the past), so its CV error estimate is slightly more pessimistic than the truth (since the final model trains on all data). This is a small price to pay to avoid the much larger problem of leakage.

6 Expanding window vs. sliding window

Once we commit to Block + Past CV, there is still a choice: when validating on block \(B_k\), how much past data should we train on?

6.1 Expanding window

Train on all available past data up to block \(B_k\).

for k = 2, 3, ..., K:
    train on B1 ∪ B2 ∪ ... ∪ B(k-1)   ← grows with each fold
    validate on Bk

The training set expands with each fold. This is the default behavior of TimeSeriesSplit.

Pros:

  • Uses all available past data for training.
  • A reasonable choice when older data is likely still relevant.

Cons:

  • Early folds train on much less data than later folds, so fold-level errors are not directly comparable to one another.
  • Training time grows with each fold.

6.2 Sliding window

Train on a fixed-size window of the most recent \(W\) blocks before \(B_k\).

for k = 2, 3, ..., K:
    train on B(k-W), ..., B(k-1)        ← fixed size, slides forward
    validate on Bk

The training window slides forward, always keeping the same number of blocks.

Pros:

  • Can be a better fit when the data are non-stationary and older observations are less relevant to the future.
  • Training set size is constant across folds, making fold-level errors more directly comparable.

Cons:

  • Discards historical data that may still be informative.
  • Introduces an additional hyperparameter \(W\) (the window size) that must itself be chosen.
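In scikit-learn, the sliding window corresponds to TimeSeriesSplit's max_train_size argument (the window size \(W = 3\) here is an arbitrary illustrative choice):

```python
# Sliding-window CV sketch: cap the training window at the W most
# recent observations instead of letting it expand.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

times = np.arange(12)  # stand-in for rows already sorted by time
W = 3                  # hypothetical window size
sliding = TimeSeriesSplit(n_splits=3, max_train_size=W)
for train_idx, val_idx in sliding.split(times):
    print("train window", train_idx, "→ validate", val_idx)
# Training windows: [0 1 2], [3 4 5], [6 7 8]: constant size, sliding forward
```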

6.3 Which should you use?

Scenario                 Expanding window            Sliding window
Data is stationary       Preferred                   Discards data unnecessarily
Data is non-stationary   Old data may hurt           Preferred
Small dataset            Preferred (uses all data)   May leave too little for training
Large dataset            Can be slow                 More computationally efficient

When in doubt, the expanding window is a reasonable default. Use the sliding window when you have reason to believe the relationship between features and target has shifted over time.