/tmp/ipykernel_9191/2582740908.py:25: UserWarning: No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
ax.legend()
2.1.1 Some quick EDA to check structure of the data
Let us keep in mind we have to predict from June to September of 2013. This seems where most of our available data lies, which is good. One thing we may first do is check for trends across this windows among the three periods
We see that multiple months have no observations, but these are usually the months we are not evaluating on. This is something to have in mind when evaluating. For example,it makes no sense to have a fold with months 5-6 only, regardless of the year. A reasonable structure would be to have a structure in which we fit on all data available from the past, but predict only on the window we will be evaluated.
The prediction blocks we will use, can either contain all 4 of the relevant months, or pairs of contiguous ones. To decide how to pick the pairs of contiguous ones, let us check which choice of pairs seem to better capture the statistical behavior of the whole set of months 6-8:
The pairwise split 6-8 and 7-9 seems slightly more faithful to the whole block 6-9.
2.2 Sum up fold split structure
first folding strategy: - Fold 1: Train on 2007 5-10, predict on 2009 6-9 - Fold 2: Train on 2007 and 2009 5-10, predict on 2011 6-9
second folding strategy: - Fold 1: Train on 2007 5-10 + 2009 5, predict on 2009 6-8 - Fold 2: Train on 2007 5-10 + 2009 5-6, predict on 2009 7-9 - Fold 3: Train on 2007 5-10 + 2009 5-10 + 2011 5, predict on 2011 6-8 - Fold 4: Train on 2007 5-10 + 2009 5-10 + 2011 5-6, predict on 2011 7-9
a third simpler strategy based on expanding window with blocks with the same size on the month aspect: - Fold 1: Aug-Oct 2007 - Fold 2: May-Jul 2009 - Fold 3: Aug-Oct 2009 - Fold 4: May-Jul 2011 - Fold 5: Aug-Oct 2011 And for each fold we train on all past folds and predict on the current fold.
One thing to be mindful of here is that Folds 1-2 and 3-4 are pairwise highly correlated regarding the training algorithm they give, The main difference is the set they are being evaluated on.
Important idea: compute sds of error evaluations using stratified bootstrap. We could also use the sd of the invdividual folds we are averaging. The problem with this is that we have too few folds.
note that the folds are more or less balanced in terms of numbers of observations, the only difference being the dataset sizes. This is a good sign. It means our date splits are comparable in terms of numbers of samples.
3 Write workflow
Let’s write a function that runs our workflow:
Gets hyperparameters **params we want to test
For each of the folds and a folding strategy:
Trains model on the train set
Computes loss on the validation set
Averages the loss across the train-valid pairs
weighted average for evaluation (add option for non-weighted one) + add in weight to take into account size of training data (as option too)
import lightgbm as lgbdef logloss(y, pred, loss_weight=(1.0, 1.0)): w0, w1 =map(float, loss_weight) # class 0 weight, class 1 weight total = w0 + w1if total <=0:raiseValueError("Sum of class weights must be positive.") w0, w1 =2* w0 / total, 2* w1 / total # normalize to sum to 1 y = np.asarray(y) pred = np.clip(np.asarray(pred), 1e-15, 1-1e-15)return np.mean(-(w1 * y * np.log(pred) + w0 * (1- y) * np.log(1- pred)))def compute_loss_cv_single(cv_pairs=None, average=True, weighted_average=False, loss_weight=(1.0,1.0), **params):# Default to strategy 2 if no splits are provided explicitlyif cv_pairs isNone: _, cv_pairs = build_cv_pairs(X, y) folds = cv_pairs.values() ifisinstance(cv_pairs, dict) else cv_pairs losses = [] train_sizes = []for entry in folds: X_train, y_train = entry["X_train"], entry["y_train"] X_valid, y_valid = entry["X_valid"], entry["y_valid"] model = lgb.LGBMClassifier(**params, verbosity=-1, n_jobs=-1) model.fit(X_train, y_train, categorical_feature="Species") pred = model.predict_proba(X_valid)[:, 1] losses.append(logloss(y_valid, pred, loss_weight=loss_weight)) train_sizes.append(len(X_train))ifnot average:return losses, train_sizesif weighted_average:return np.average(losses, weights=train_sizes)return np.mean(losses)def _stratified_bootstrap_indices(X, y, no_date=False, bs_sample_size=None, rng=None):if rng isNone: rng = np.random.default_rng()ifnot no_date: dates = pd.to_datetime(X["Date"]) years = dates.dt.year.to_numpy() months = dates.dt.month.to_numpy() y_arr = np.asarray(y) n =len(X)if no_date: strata = pd.DataFrame({"y": y_arr})else: strata = pd.DataFrame({"year": years, "month": months, "y": y_arr}) stratum_counts = strata.value_counts(sort=False)if bs_sample_size isNone: target_counts = stratum_counts.astype(int)else:if bs_sample_size <=0:raiseValueError("bs_sample_size must be a positive integer.") raw = stratum_counts * (bs_sample_size / n) target_counts = np.floor(raw).astype(int) remainder =int(bs_sample_size - target_counts.sum())if remainder >0: frac = (raw - target_counts).sort_values(ascending=False)for key in frac.index[:remainder]: target_counts.loc[key] +=1elif remainder <0: frac = (raw - target_counts).sort_values(ascending=True)for key in frac.index:if remainder ==0:breakif target_counts.loc[key] >0: target_counts.loc[key] -=1 remainder +=1 sampled_idx = []ifnot no_date:for (yy_year, mm, yy), k in target_counts.items():if k <=0:continue pool = np.where((years == yy_year) & (months == mm) & (y_arr == yy))[0] sampled_idx.append(rng.choice(pool, size=int(k), replace=True))else:for (yy), k in target_counts.items():if k <=0:continue pool = np.where(y_arr == yy)[0] sampled_idx.append(rng.choice(pool, size=int(k), replace=True)) sampled_idx = np.concatenate(sampled_idx) rng.shuffle(sampled_idx)return sampled_idxdef compute_cv_loss_multiple( split_strategy=2, average=True, weighted_average=False, loss_weight=(1.0, 1.0), bootstrap_sd=False, bs_sample_size=None, n_bs_samples=100, random_state=42,**params,):if split_strategy notin (1, 2, 3):raiseValueError("split_strategy must be either 1, 2, or 3.")ifnot bootstrap_sd: cv_pairs = build_cv_pairs(X=X, y=y)[split_strategy -1]return compute_loss_cv_single( cv_pairs=cv_pairs, average=average, weighted_average=weighted_average, loss_weight=loss_weight,**params, )if n_bs_samples <=1:raiseValueError("n_bs_samples must be > 1 when bootstrap_sd=True.") rng = np.random.default_rng(random_state) bs_losses = []for _ inrange(n_bs_samples): idx = _stratified_bootstrap_indices(X, y, bs_sample_size=bs_sample_size, rng=rng) X_bs = X.iloc[idx].reset_index(drop=True) y_bs = y.iloc[idx].reset_index(drop=True) cv_pairs = build_cv_pairs(X_bs, y_bs)[split_strategy -1] cur_loss = compute_loss_cv_single( cv_pairs=cv_pairs, average=average, weighted_average=weighted_average,**params, ) bs_losses.append(np.mean(cur_loss) ifnot average else cur_loss)returnfloat(np.std(bs_losses, ddof=1))compute_loss_cv = compute_cv_loss_multiple
Q: why people don’t use the bootstrap for ML models in practice?
3.1 Dry run
Let’s make a dry run with LightGBM’s default hyperparmeters:
print("Let us compare all strategies with the default parameters from lightgbm")loss_1 = compute_loss_cv(split_strategy=1)loss_2 = compute_loss_cv(split_strategy=2)loss_3 = compute_loss_cv(split_strategy=3)loss_1_bs_sd = compute_cv_loss_multiple(split_strategy=1, bootstrap_sd=True, n_bs_samples=50)loss_2_bs_sd = compute_cv_loss_multiple(split_strategy=2, bootstrap_sd=True, n_bs_samples=50)loss_3_bs_sd = compute_cv_loss_multiple(split_strategy=3, bootstrap_sd=True, n_bs_samples=50)print(f"Strategy 1 CV Loss: {loss_1:.6f} (bootstrap SD: {loss_1_bs_sd:.6f})")print(f"Strategy 2 CV Loss: {loss_2:.6f} (bootstrap SD: {loss_2_bs_sd:.6f})")print(f"Strategy 3 CV Loss: {loss_3:.6f} (bootstrap SD: {loss_3_bs_sd:.6f})")
Let us compare all strategies with the default parameters from lightgbm
Strategy 1 CV Loss: 0.112123 (bootstrap SD: 0.007160)
Strategy 2 CV Loss: 0.118035 (bootstrap SD: 0.007720)
Strategy 3 CV Loss: 0.230665 (bootstrap SD: 0.015199)
It seems like the first two are more stable. They give similar results among each other, and also have a much lower bootstrap SD. The first thing suggests that our intuition for building the smaller splits was on spot and the errors on those two cases are statistically indistinguishable.
print("Let us compare with the results if we weight the averages")loss_1 = compute_loss_cv(split_strategy=1, weighted_average=True)loss_2 = compute_loss_cv(split_strategy=2, weighted_average=True)loss_3 = compute_loss_cv(split_strategy=3, weighted_average=True)print(f"Strategy 1 CV Loss: {loss_1:.6f}")print(f"Strategy 2 CV Loss: {loss_2:.6f}")print(f"Strategy 3 CV Loss: {loss_3:.6f}")
Let us compare with the results if we weight the averages
Strategy 1 CV Loss: 0.120357
Strategy 2 CV Loss: 0.126712
Strategy 3 CV Loss: 0.141250
Given the unbalanced nature of our data, we should check how influential the weight on the loss is. That is another way of accessing the stability of the different losses. One way you think about we are doing here is this: if we set the weights to (1.0,2.0), what we are effectively doing is evaluating on an imagined dataset in which the proportion of y=1 is twice what we actually have in our dataset. Since we imagine that the proportion of y’s in the holdout dataset is different from the one we have, it seems natural to consider these small changes in the imagined proportions once we evaluate. A good CV loss should be not too sensitive to this.
print("Let us compare with the results if we weight the averages")loss_1 = compute_loss_cv(split_strategy=1, weighted_average=False, loss_weight=(2.0, 1.0))loss_2 = compute_loss_cv(split_strategy=2, weighted_average=False, loss_weight=(2.0, 1.0))loss_3 = compute_loss_cv(split_strategy=3, weighted_average=False, loss_weight=(2.0, 1.0))print(f"Strategy 1 CV Loss: {loss_1:.6f}")print(f"Strategy 2 CV Loss: {loss_2:.6f}")print(f"Strategy 3 CV Loss: {loss_3:.6f}")
Let us compare with the results if we weight the averages
Strategy 1 CV Loss: 0.095951
Strategy 2 CV Loss: 0.102045
Strategy 3 CV Loss: 0.173395
print("Let us compare with the results if we weight the averages")loss_1 = compute_loss_cv(split_strategy=1, weighted_average=False, loss_weight=(1.0, 2.0))loss_2 = compute_loss_cv(split_strategy=2, weighted_average=False, loss_weight=(1.0, 2.0))loss_3 = compute_loss_cv(split_strategy=3, weighted_average=False, loss_weight=(1.0, 2.0))print(f"Strategy 1 CV Loss: {loss_1:.6f}")print(f"Strategy 2 CV Loss: {loss_2:.6f}")print(f"Strategy 3 CV Loss: {loss_3:.6f}")
Let us compare with the results if we weight the averages
Strategy 1 CV Loss: 0.128295
Strategy 2 CV Loss: 0.134025
Strategy 3 CV Loss: 0.287935
Here all loss functions change their values considerably in both reweights. The instability of Strategies 1 and 2 look similar: in the first reweight 1 changes more more and 2 changes less and likewise in the second reweight. In light of this it seems slightly better to pick the second loss, because the first loss seems to tend to be too lenient with models that predict less 1’s than should be predicted. But we will double check this weighting stability again in the other tests.
Here we also see that our first two strategies have better robustness properties: in moving from weighted to unweighted averages the results do not change that much. This is just a consequence of the split being more even as we discussed ealier.
print("Let us compare the raw values of the CV losses on each data fold")loss_1 = compute_loss_cv(split_strategy=1, average=False)loss_2 = compute_loss_cv(split_strategy=2, average=False)loss_3 = compute_loss_cv(split_strategy=3, average=False)print(f"Strategy 1 CV Loss: {[f'{x:.6f}'for x in loss_1[0]]} (train sizes: {loss_1[1]})")print(f"Strategy 2 CV Loss: {[f'{x:.6f}'for x in loss_2[0]]} (train sizes: {loss_2[1]})")print(f"Strategy 3 CV Loss: {[f'{x:.6f}'for x in loss_3[0]]} (train sizes: {loss_3[1]})")
Let us compare the raw values of the CV losses on each data fold
Strategy 1 CV Loss: ['0.075982', '0.148263'] (train sizes: [3811, 6060])
Strategy 2 CV Loss: ['0.062852', '0.097306', '0.124682', '0.187300'] (train sizes: [3870, 4448, 6060, 6441])
Strategy 3 CV Loss: ['0.714737', '0.030938', '0.129826', '0.074739', '0.203086'] (train sizes: [776, 3811, 5203, 6060, 7081])
As is the previous examples, we see that the Strategy 1 and 2 have a much more well-behaved trend. Firstly there is less variability among the estimates. But also the error seems to change across the folds in an ordered way. A weird behavior seems to be going on though. It seems that as we increase the training size, the algorithm performs worse.
Q: Can you guess what might be going on?
3.2 If there’s distribution shift, who guarantees that minimizing CV error will result in the lowest test error?
There is no guarantee!
Since the CV validation folds span multiple different periods, we can hope that the resulting hyperparameters will be robust to the different possibilities of how 2013 will look like
Of course, 2013 can be something we’ve never seen before, and than there really is no guarantee..
Of course, we can also overfit the hyperparameters, but that’s a different problem that exists even with IID data
4 Write grid workflow
Let’s write a function that gets multiple values for each hyperparameter, and runs over all possible combinations.
import itertoolsfrom tqdm import tqdmdef compute_loss_cv_grid(split_strategy=2, bootstrap_sd=False, n_bs_samples=20, loss_weight=(1.0,1.0), **params): keys = params.keys() out = [] total_iters = np.prod([len(v) for v in params.values()])for vals in tqdm(itertools.product(*params.values()), total=total_iters): cur_params =dict(zip(keys, vals))#print(f"Evaluating params: {cur_params}...") updated_params = cur_params.copy() updated_params["loss"] = compute_loss_cv(split_strategy=split_strategy, loss_weight=loss_weight, **cur_params)if bootstrap_sd: updated_params["sd"] = compute_loss_cv(split_strategy=split_strategy, bootstrap_sd=True, n_bs_samples=n_bs_samples, loss_weight=loss_weight, **cur_params) out.append(updated_params) out = pd.DataFrame(out)return out
4.1 How many hyperparameters/values to try at once?
Since we are fixing the model of interest, the only limit to how many hparams to run is our computational resources and how we set things up.
It is always good to take a manual look at the top performing models, which will usually be all tied within the bootstrap sd. This allows us to access trade-offs beyond the raw loss function value (e.g. complexity or training/deployment cost). Other than that, it is okay to pick the model with the lowest score, if that’s your only criteria. In ML the goal is beating the benchmark, not being entirely sure you have found the best model possible.
For this problem we can run things quickly in a notebook, so to we advise large hparams sweeps. Bear in mind that some machine learning models require hparam sweeps like we are doing to be able to perform well.
Knowing what the hparams do is important in order to access which ones you should prioritize optimizing by building finer grids, and which ones you can pass.
We can see that strategy 2 in CV gives a more or less stable result relative to reweightings of the loss. The third strategy on the other hand, gives a significantly different result. Our final assessment will have to be on the holdout data.
Note that there is a big distribution shift making the proportions of positive cases much higher. We have roughly a proportion which is twice as much. This made methods who skewed toward penalizing more false negatives fare better. But overall everyone did equally well. From the little data that was given to you, there was no way you could have predicted this distribution shift. The winners to a great extent got lucky.
compute_loss_test(X, y, X_holdout, y_holdout, **strat_3_much_more_pos)
0.289565876412703
As one can see, simply knowing that the proportion of positive cases increased, would have allowed us to get a substantial improvement by simply evaluating with a weighted loss on our same data.
However we did not have this information. For all we know, the opposite could have been true. Hence the best we could have done was accessing the sensitivity of our conclusions and at least hoping to improve on the baseline, which we did.
5.2 Why did winner do well (qwerty)?
Possible reasons:
They optimized a wider range of parameters, including handling unbalanced data which is crucial for this problem
They attempted more than one CV estimation method
Distribution shift happened to work in their favor