Lab 7 Presentation - Second Competition Round-Up

Author

Erez Buchweitz/Josh Davis/Lucas Schwengber

1 Today

  • How you did
  • How I would solve the competition
  • How the winner solved the competition
  • Other methods for hyperparameter optimization

1.1 Competition Leaderboard (lower loss is better)

Rank Group Loss
1 bbbaaaccc 0.303954
2 qwerty 0.307559
3 vivreplm 0.311442
4 yay 0.312888
5 group131416 0.313726
6 ardak 0.313726
7 T 0.313907
8 all_the_params 0.315844
9 o2 0.316239
10 green_onions 0.316997
11 ? 0.317477
12 trin 0.318070
13 Opti-bro 0.318457
14 ynwa 0.319269
15 pearl 0.319687
16 pg 0.319731
17 Reda 0.319798
18 touse 0.321013
19 08241001 0.321109
20 hihi 0.321613
21 Will 0.324020
22 S154_L5 0.324721
23 my_secret_name 0.324747
24 dp 0.324863
25 Radon-Nikodym 0.324877
26 JiJi 0.325430
27 billrichardteam 0.325523
28 pipis 0.326672
29 jj 0.326717
30 keylime_lacroix 0.326744
31 Drakeee 0.326952
32 08241001 0.327313
33 LanaDelRey 0.327571
34 kdot 0.327712
35 MJ 0.327797
36 RandomizedSearchCV 0.328165
37 pollen 0.328293
38 j4 0.329286
39 ethankungcode 0.329380
40 superman 0.329411
41 grid_search_results 0.329868
42 firefly_uwogh 0.330401
43 mvp (1) 0.331394
44 wphPaul 0.332215
45 rackhoe 0.333413
46 lovestat 0.333751
47 Dragon Warrior 0.334740
48 silly_lab06 0.337925
49 Bear 0.338308
50 LetsTry 0.338771
51 Tortas 0.338867
52 sgo 0.339244
53 msstumble 0.339443
54 group_AT 0.340554
55 aryadeshmukh 0.343958
56 hyperparams_Desai 0.344294
57 zhej 0.345240
58 cpv2-2 0.345976
59 epicfire77 0.346441
60 igy1 0.348081
61 concurrentenrollment 0.348272
62 hamster 0.356576
63 tent 0.356718
64 OJP 0.362371
65 Tony 0.362603
66 taeyoungkim 0.369162
67 submission 0.375743
68 一点节目效果 0.380171
69 IKUN_FOREVER 0.412901
70 han2159 0.419548
71 dragon 0.419548
72 Dogs 0.419548
73 lemon 0.419548
74 baseline - default 0.419548
75 VGS 0.420391
76 chiikawa 0.425958
77 4951_aai 0.428385
78 purpleapple 0.437154
79 apeman 0.507353
80 bsc1 0.601167
81 lab6_cv_competition.py 0.622568
import matplotlib.pyplot as plt
import numpy as np
scores = [float(x) for x in [
    0.303954, 0.307559, 0.311442, 0.312888, 0.313726, 0.313726, 0.313907,
    0.315844, 0.316239, 0.316997, 0.317477, 0.318070, 0.318457, 0.319269,
    0.319687, 0.319731, 0.319798, 0.321013, 0.321109, 0.321613, 0.324020,
    0.324721, 0.324747, 0.324863, 0.324877, 0.325430, 0.325523, 0.326672,
    0.326717, 0.326744, 0.326952, 0.327313, 0.327571, 0.327712, 0.327797,
    0.328165, 0.328293, 0.329286, 0.329380, 0.329411, 0.329868, 0.330401,
    0.331394, 0.332215, 0.333413, 0.333751, 0.334740, 0.337925, 0.338308,
    0.338771, 0.338867, 0.339244, 0.339443, 0.340554, 0.343958, 0.344294,
    0.345240, 0.345976, 0.346441, 0.348081, 0.348272, 0.356576, 0.356718,
    0.362371, 0.362603, 0.369162, 0.375743, 0.380171, 0.412901, 0.419548,
    0.419548, 0.419548, 0.419548, 0.419548, 0.420391, 0.425958, 0.428385,
    0.437154, 0.507353, 0.601167, 0.622568
]]

fig, ax = plt.subplots(figsize=(10, 5))
scores = np.array(scores)
#ax.hist(scores[scores < 0.37], bins=10, edgecolor="black", color="steelblue")
ax.hist(scores, bins=15, edgecolor="black", color="steelblue")
ax.set_xlabel("Log Loss")
ax.set_ylabel("Count")
ax.set_title("Distribution of Competition Scores (lower is better)")
ax.legend()
plt.tight_layout()
plt.show()
/tmp/ipykernel_9191/2582740908.py:25: UserWarning: No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
  ax.legend()

2 Approaches to solving the competition

2.1 Load data

import pandas as pd

path = "../../datasets/west_nile_virus"
X = pd.read_csv(f"{path}/X_cv_competition.csv")
X.Species = X.Species.astype("category")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()

print(f"{X.shape=}")
X.shape=(8114, 10)
X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv")
X_holdout.Species = X_holdout.Species.astype("category")
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()
X_holdout["Date"]
0       2013-06-07
1       2013-06-07
2       2013-06-07
3       2013-06-07
4       2013-06-07
           ...    
2387    2013-09-26
2388    2013-09-26
2389    2013-09-26
2390    2013-09-26
2391    2013-09-26
Name: Date, Length: 2392, dtype: object

2.1.1 Some quick EDA to check structure of the data

Let us keep in mind we have to predict from June to September of 2013. This seems where most of our available data lies, which is good. One thing we may first do is check for trends across this windows among the three periods

## Check the predictor variables
X.columns
Index(['Date', 'DayOfYear', 'Species', 'Latitude', 'Longitude',
       'AddressAccuracy', 'Prev_WnvPresent', 'EMA_WnvPresent',
       'Prev_NumMosquitos', 'PrevInArea_WnvPresent'],
      dtype='object')

Let us check how the dates are distributed

X.Date = pd.to_datetime(X.Date)
table_obs_counts = pd.crosstab(X.Date.dt.year, X.Date.dt.month)
from IPython.display import display, HTML

table_obs_counts = pd.crosstab(X.Date.dt.year, X.Date.dt.month)

table_y_sum = pd.crosstab(
    index=X.Date.dt.year,
    columns=X.Date.dt.month,
    values=y,
    aggfunc="sum"
).fillna(0).astype(int)


table_y_mean = pd.crosstab(
    index=X.Date.dt.year,
    columns=X.Date.dt.month,
    values=y,
    aggfunc="mean"
).fillna(0).astype(float)
display(HTML(f"""
<div style="display:flex; gap:24px; align-items:flex-start;">
    <div>{table_obs_counts.style.set_caption("table_obs_counts").to_html()}</div>
    <div>{table_y_sum.style.set_caption("table_y_sum").to_html()}</div>
    <div>{table_y_mean.style.set_caption("table_y_mean").format("{:.3f}").to_html()}</div>
</div>
"""))
Table 1: table_obs_counts
Date 5 6 7 8 9 10
Date            
2007 25 176 575 2050 774 211
2009 59 578 755 374 418 65
2011 0 381 640 493 540 0
Table 2: table_y_sum
Date 5 6 7 8 9 10
Date            
2007 0 0 6 200 28 2
2009 0 0 5 9 5 0
2011 0 0 11 24 22 0
Table 3: table_y_mean
Date 5 6 7 8 9 10
Date            
2007 0.000 0.000 0.010 0.098 0.036 0.009
2009 0.000 0.000 0.007 0.024 0.012 0.000
2011 0.000 0.000 0.017 0.049 0.041 0.000

We see that multiple months have no observations, but these are usually the months we are not evaluating on. This is something to have in mind when evaluating. For example,it makes no sense to have a fold with months 5-6 only, regardless of the year. A reasonable structure would be to have a structure in which we fit on all data available from the past, but predict only on the window we will be evaluated.

The prediction blocks we will use, can either contain all 4 of the relevant months, or pairs of contiguous ones. To decide how to pick the pairs of contiguous ones, let us check which choice of pairs seem to better capture the statistical behavior of the whole set of months 6-8:

# Aggregate months into two windows: 6-7 and 8-9 (drop 5 and 10)
table_obs_counts_2 = pd.DataFrame({
    "6-7": table_obs_counts[6] + table_obs_counts[7],
    "8-9": table_obs_counts[8] + table_obs_counts[9],
})

table_y_sum_2 = pd.DataFrame({
    "6-7": table_y_sum[6] + table_y_sum[7],
    "8-9": table_y_sum[8] + table_y_sum[9],
})

# Use weighted mean via sums/counts
table_y_mean_2 = (table_y_sum_2 / table_obs_counts_2).fillna(0.0)
# Build alternative aggregation: (6,8) and (7,9)
table_obs_counts_2_alt = pd.DataFrame({
    "6-8": table_obs_counts[6] + table_obs_counts[8],
    "7-9": table_obs_counts[7] + table_obs_counts[9],
})

table_y_sum_2_alt = pd.DataFrame({
    "6-8": table_y_sum[6] + table_y_sum[8],
    "7-9": table_y_sum[7] + table_y_sum[9],
})

table_y_mean_2_alt = (table_y_sum_2_alt / table_obs_counts_2_alt).fillna(0.0)

table_obs_counts_agg = pd.DataFrame({
    "6-9": table_obs_counts[[6, 7, 8, 9]].sum(axis=1)
})

table_y_sum_agg = pd.DataFrame({
    "6-9": table_y_sum[[6, 7, 8, 9]].sum(axis=1)
})

table_y_mean_agg = (table_y_sum_agg / table_obs_counts_agg).fillna(0.0)

display(HTML(f"""
<div style="display:flex; flex-direction:column; gap:16px;">
    <div style="display:flex; gap:24px; align-items:flex-start;">
        <div>{table_obs_counts_2.style.set_caption("table_obs_counts (6-7, 8-9)").to_html()}</div>
        <div>{table_y_sum_2.style.set_caption("table_y_sum (6-7, 8-9)").to_html()}</div>
        <div>{table_y_mean_2.style.set_caption("table_y_mean (6-7, 8-9)").format("{:.3f}").to_html()}</div>
    </div>
    <div style="display:flex; gap:24px; align-items:flex-start;">
        <div>{table_obs_counts_2_alt.style.set_caption("table_obs_counts (6-8, 7-9)").to_html()}</div>
        <div>{table_y_sum_2_alt.style.set_caption("table_y_sum (6-8, 7-9)").to_html()}</div>
        <div>{table_y_mean_2_alt.style.set_caption("table_y_mean (6-8, 7-9)").format("{:.3f}").to_html()}</div>
    </div>
    <div style="display:flex; gap:24px; align-items:flex-start;">
        <div>{table_obs_counts_agg.style.set_caption("table_obs_counts (6-9)").to_html()}</div>
        <div>{table_y_sum_agg.style.set_caption("table_y_sum (6-9)").to_html()}</div>
        <div>{table_y_mean_agg.style.set_caption("table_y_mean (6-9)").format("{:.3f}").to_html()}</div>
    </div>
</div>
"""))
Table 4: table_obs_counts (6-7, 8-9)
  6-7 8-9
Date    
2007 751 2824
2009 1333 792
2011 1021 1033
Table 5: table_y_sum (6-7, 8-9)
  6-7 8-9
Date    
2007 6 228
2009 5 14
2011 11 46
Table 6: table_y_mean (6-7, 8-9)
  6-7 8-9
Date    
2007 0.008 0.081
2009 0.004 0.018
2011 0.011 0.045
Table 7: table_obs_counts (6-8, 7-9)
  6-8 7-9
Date    
2007 2226 1349
2009 952 1173
2011 874 1180
Table 8: table_y_sum (6-8, 7-9)
  6-8 7-9
Date    
2007 200 34
2009 9 10
2011 24 33
Table 9: table_y_mean (6-8, 7-9)
  6-8 7-9
Date    
2007 0.090 0.025
2009 0.009 0.009
2011 0.027 0.028
Table 10: table_obs_counts (6-9)
  6-9
Date  
2007 3575
2009 2125
2011 2054
Table 11: table_y_sum (6-9)
  6-9
Date  
2007 234
2009 19
2011 57
Table 12: table_y_mean (6-9)
  6-9
Date  
2007 0.065
2009 0.009
2011 0.028

Check for balance of DayOfYear counts

years = [2007, 2009, 2011]
all_days = list(range(1, 32))

def ma5(s):
    return s.reindex(all_days, fill_value=0).rolling(window=5, min_periods=1, center=True).mean()

def day_proportion(df):
    return df["Date"].dt.day.value_counts(normalize=True).sort_index()

fig, axes = plt.subplots(3, 2, figsize=(16, 14), sharex=True, sharey=True)

for r, year in enumerate(years):
    x_year = X[X["Date"].dt.year == year]

    x_6_9 = x_year[x_year["Date"].dt.month.isin([6, 7, 8, 9])]
    x_6_7 = x_year[x_year["Date"].dt.month.isin([6, 7])]
    x_8_9 = x_year[x_year["Date"].dt.month.isin([8, 9])]
    x_6_8 = x_year[x_year["Date"].dt.month.isin([6, 8])]
    x_7_9 = x_year[x_year["Date"].dt.month.isin([7, 9])]

    prop_6_9 = day_proportion(x_6_9)
    prop_6_7 = day_proportion(x_6_7)
    prop_8_9 = day_proportion(x_8_9)
    prop_6_8 = day_proportion(x_6_8)
    prop_7_9 = day_proportion(x_7_9)

    # Left column: (6-7) and (8-9)
    axes[r, 0].plot(all_days, ma5(prop_6_9), label=f"{year} months 6-9 (MA5)", linewidth=2)
    axes[r, 0].plot(all_days, ma5(prop_6_7), label=f"{year} months 6-7 (MA5)", linewidth=2)
    axes[r, 0].plot(all_days, ma5(prop_8_9), label=f"{year} months 8-9 (MA5)", linewidth=2)
    axes[r, 0].set_title(f"{year} | Pairing: (6-7) and (8-9)")
    axes[r, 0].set_ylabel("Proportion (5-day moving avg)")
    axes[r, 0].grid(alpha=0.3)
    axes[r, 0].legend()

    # Right column: (6-8) and (7-9)
    axes[r, 1].plot(all_days, ma5(prop_6_9), label=f"{year} months 6-9 (MA5)", linewidth=2)
    axes[r, 1].plot(all_days, ma5(prop_6_8), label=f"{year} months 6-8 (MA5)", linewidth=2)
    axes[r, 1].plot(all_days, ma5(prop_7_9), label=f"{year} months 7-9 (MA5)", linewidth=2)
    axes[r, 1].set_title(f"{year} | Pairing: (6-8) and (7-9)")
    axes[r, 1].grid(alpha=0.3)
    axes[r, 1].legend()

axes[-1, 0].set_xlabel("Day of Month")
axes[-1, 1].set_xlabel("Day of Month")

plt.tight_layout()
plt.show()

The pairwise split 6-8 and 7-9 seems slightly more faithful to the whole block 6-9.

2.2 Sum up fold split structure

first folding strategy: - Fold 1: Train on 2007 5-10, predict on 2009 6-9 - Fold 2: Train on 2007 and 2009 5-10, predict on 2011 6-9

second folding strategy: - Fold 1: Train on 2007 5-10 + 2009 5, predict on 2009 6-8 - Fold 2: Train on 2007 5-10 + 2009 5-6, predict on 2009 7-9 - Fold 3: Train on 2007 5-10 + 2009 5-10 + 2011 5, predict on 2011 6-8 - Fold 4: Train on 2007 5-10 + 2009 5-10 + 2011 5-6, predict on 2011 7-9

a third simpler strategy based on expanding window with blocks with the same size on the month aspect: - Fold 1: Aug-Oct 2007 - Fold 2: May-Jul 2009 - Fold 3: Aug-Oct 2009 - Fold 4: May-Jul 2011 - Fold 5: Aug-Oct 2011 And for each fold we train on all past folds and predict on the current fold.

One thing to be mindful of here is that Folds 1-2 and 3-4 are pairwise highly correlated regarding the training algorithm they give, The main difference is the set they are being evaluated on.

Important idea: compute sds of error evaluations using stratified bootstrap. We could also use the sd of the invdividual folds we are averaging. The problem with this is that we have too few folds.

def build_cv_pairs(X, y):
    X_ = X.copy()

    # Ensure Date is datetime
    if not pd.api.types.is_datetime64_any_dtype(X_["Date"]):
        X_["Date"] = pd.to_datetime(X_["Date"])

    def ym_mask(years, months):
        return X_["Date"].dt.year.isin(years) & X_["Date"].dt.month.isin(months)

    def make_fold(train_mask, valid_mask):
        return {
            "X_train": X_.loc[train_mask].drop("Date", axis=1),
            "y_train": y.loc[train_mask],
            "X_valid": X_.loc[valid_mask].drop("Date", axis=1),
            "y_valid": y.loc[valid_mask],
        }

    # Strategy 1
    cv_pairs_strategy1 = {
        "fold_1": make_fold(
            train_mask=ym_mask([2007], [5, 6, 7, 8, 9, 10]),
            valid_mask=ym_mask([2009], [6, 7, 8, 9]),
        ),
        "fold_2": make_fold(
            train_mask=ym_mask([2007, 2009], [5, 6, 7, 8, 9, 10]),
            valid_mask=ym_mask([2011], [6, 7, 8, 9]),
        ),
    }

    # Strategy 2
    cv_pairs_strategy2 = {
        "fold_1": make_fold(
            train_mask=ym_mask([2007], [5, 6, 7, 8, 9, 10]) | ym_mask([2009], [5]),
            valid_mask=ym_mask([2009], [6, 7, 8]),
        ),
        "fold_2": make_fold(
            train_mask=ym_mask([2007], [5, 6, 7, 8, 9, 10]) | ym_mask([2009], [5, 6]),
            valid_mask=ym_mask([2009], [7, 8, 9]),
        ),
        "fold_3": make_fold(
            train_mask=ym_mask([2007, 2009], [5, 6, 7, 8, 9, 10]) | ym_mask([2011], [5]),
            valid_mask=ym_mask([2011], [6, 7, 8]),
        ),
        "fold_4": make_fold(
            train_mask=ym_mask([2007, 2009], [5, 6, 7, 8, 9, 10]) | ym_mask([2011], [5, 6]),
            valid_mask=ym_mask([2011], [7, 8, 9]),
        ),
    }

    # Strategy 3
    cv_pairs_strategy3 = {
        "fold_1": make_fold(
            train_mask=ym_mask([2007], [5, 6, 7]),
            valid_mask=ym_mask([2007], [8, 9, 10]),
        ),
        "fold_2": make_fold(
            train_mask=ym_mask([2007], [5, 6, 7, 8, 9, 10]),
            valid_mask=ym_mask([2009], [5, 6, 7]),
        ),
        "fold_3": make_fold(
            train_mask=ym_mask([2007], [5, 6, 7, 8, 9, 10]) | ym_mask([2009], [5, 6, 7]),
            valid_mask=ym_mask([2009], [8, 9, 10]),
        ),
        "fold_4": make_fold(
            train_mask=ym_mask([2007, 2009], [5, 6, 7, 8, 9, 10]),
            valid_mask=ym_mask([2011], [5, 6, 7]),
        ),
        "fold_5": make_fold(
            train_mask=ym_mask([2007, 2009], [5, 6, 7, 8, 9, 10]) | ym_mask([2011], [5, 6, 7]),
            valid_mask=ym_mask([2011], [8, 9, 10]),
        ),
    }

    return cv_pairs_strategy1, cv_pairs_strategy2, cv_pairs_strategy3


cv_pairs_strategy1, cv_pairs_strategy2, cv_pairs_strategy3 = build_cv_pairs(X, y)

print(f"{len(cv_pairs_strategy1)=}, {len(cv_pairs_strategy2)=}, {len(cv_pairs_strategy3)=}")
print("strategy1 fold sizes:", [(len(f["X_train"]), len(f["X_valid"])) for f in cv_pairs_strategy1.values()])
print("strategy2 fold sizes:", [(len(f["X_train"]), len(f["X_valid"])) for f in cv_pairs_strategy2.values()])
print("strategy3 fold sizes:", [(len(f["X_train"]), len(f["X_valid"])) for f in cv_pairs_strategy3.values()])
len(cv_pairs_strategy1)=2, len(cv_pairs_strategy2)=4, len(cv_pairs_strategy3)=5
strategy1 fold sizes: [(3811, 2125), (6060, 2054)]
strategy2 fold sizes: [(3870, 1707), (4448, 1547), (6060, 1514), (6441, 1673)]
strategy3 fold sizes: [(776, 3035), (3811, 1392), (5203, 857), (6060, 1021), (7081, 1033)]

note that the folds are more or less balanced in terms of numbers of observations, the only difference being the dataset sizes. This is a good sign. It means our date splits are comparable in terms of numbers of samples.

3 Write workflow

Let’s write a function that runs our workflow:

  1. Gets hyperparameters **params we want to test
  2. For each of the folds and a folding strategy:
    • Trains model on the train set
    • Computes loss on the validation set
  3. Averages the loss across the train-valid pairs

weighted average for evaluation (add option for non-weighted one) + add in weight to take into account size of training data (as option too)

import lightgbm as lgb

def logloss(y, pred, loss_weight=(1.0, 1.0)):
    w0, w1 = map(float, loss_weight)  # class 0 weight, class 1 weight
    total = w0 + w1
    if total <= 0:
        raise ValueError("Sum of class weights must be positive.")
    w0, w1 = 2 * w0 / total, 2 * w1 / total  # normalize to sum to 1

    y = np.asarray(y)
    pred = np.clip(np.asarray(pred), 1e-15, 1 - 1e-15)
    return np.mean(-(w1 * y * np.log(pred) + w0 * (1 - y) * np.log(1 - pred)))

def compute_loss_cv_single(cv_pairs=None, average=True, weighted_average=False, loss_weight=(1.0,1.0), **params):
    # Default to strategy 2 if no splits are provided explicitly
    if cv_pairs is None:
        _, cv_pairs = build_cv_pairs(X, y)

    folds = cv_pairs.values() if isinstance(cv_pairs, dict) else cv_pairs

    losses = []
    train_sizes = []
    for entry in folds:
        X_train, y_train = entry["X_train"], entry["y_train"]
        X_valid, y_valid = entry["X_valid"], entry["y_valid"]

        model = lgb.LGBMClassifier(**params, verbosity=-1, n_jobs=-1)
        model.fit(X_train, y_train, categorical_feature="Species")
        pred = model.predict_proba(X_valid)[:, 1]
        losses.append(logloss(y_valid, pred, loss_weight=loss_weight))
        train_sizes.append(len(X_train))

    if not average:
        return losses, train_sizes

    if weighted_average:
        return np.average(losses, weights=train_sizes)

    return np.mean(losses)


def _stratified_bootstrap_indices(X, y, no_date=False, bs_sample_size=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()

    if not no_date:
        dates = pd.to_datetime(X["Date"])
        years = dates.dt.year.to_numpy()
        months = dates.dt.month.to_numpy()
    y_arr = np.asarray(y)
    n = len(X)
    
    if no_date:
        strata = pd.DataFrame({"y": y_arr})
    else:
        strata = pd.DataFrame({"year": years, "month": months, "y": y_arr})

    stratum_counts = strata.value_counts(sort=False)

    if bs_sample_size is None:
        target_counts = stratum_counts.astype(int)
    else:
        if bs_sample_size <= 0:
            raise ValueError("bs_sample_size must be a positive integer.")
        raw = stratum_counts * (bs_sample_size / n)
        target_counts = np.floor(raw).astype(int)
        remainder = int(bs_sample_size - target_counts.sum())

        if remainder > 0:
            frac = (raw - target_counts).sort_values(ascending=False)
            for key in frac.index[:remainder]:
                target_counts.loc[key] += 1
        elif remainder < 0:
            frac = (raw - target_counts).sort_values(ascending=True)
            for key in frac.index:
                if remainder == 0:
                    break
                if target_counts.loc[key] > 0:
                    target_counts.loc[key] -= 1
                    remainder += 1

    sampled_idx = []

    if not no_date:
        for (yy_year, mm, yy), k in target_counts.items():
            if k <= 0:
                continue
            pool = np.where((years == yy_year) & (months == mm) & (y_arr == yy))[0]
            sampled_idx.append(rng.choice(pool, size=int(k), replace=True))

    else:
        for (yy), k in target_counts.items():
            if k <= 0:
                continue
            pool = np.where(y_arr == yy)[0]
            sampled_idx.append(rng.choice(pool, size=int(k), replace=True))
            
    sampled_idx = np.concatenate(sampled_idx)
    rng.shuffle(sampled_idx)
    return sampled_idx


def compute_cv_loss_multiple(
    split_strategy=2,
    average=True,
    weighted_average=False,
    loss_weight=(1.0, 1.0),
    bootstrap_sd=False,
    bs_sample_size=None,
    n_bs_samples=100,
    random_state=42,
    **params,
):
    if split_strategy not in (1, 2, 3):
        raise ValueError("split_strategy must be either 1, 2, or 3.")

    if not bootstrap_sd:
        cv_pairs = build_cv_pairs(X=X, y=y)[split_strategy - 1]
        return compute_loss_cv_single(
            cv_pairs=cv_pairs,
            average=average,
            weighted_average=weighted_average,
            loss_weight=loss_weight,
            **params,
        )

    if n_bs_samples <= 1:
        raise ValueError("n_bs_samples must be > 1 when bootstrap_sd=True.")

    rng = np.random.default_rng(random_state)
    bs_losses = []

    for _ in range(n_bs_samples):
        idx = _stratified_bootstrap_indices(X, y, bs_sample_size=bs_sample_size, rng=rng)
        X_bs = X.iloc[idx].reset_index(drop=True)
        y_bs = y.iloc[idx].reset_index(drop=True)

        cv_pairs = build_cv_pairs(X_bs, y_bs)[split_strategy - 1]

        cur_loss = compute_loss_cv_single(
            cv_pairs=cv_pairs,
            average=average,
            weighted_average=weighted_average,
            **params,
        )
        bs_losses.append(np.mean(cur_loss) if not average else cur_loss)

    return float(np.std(bs_losses, ddof=1))


compute_loss_cv = compute_cv_loss_multiple

Q: why people don’t use the bootstrap for ML models in practice?

3.1 Dry run

Let’s make a dry run with LightGBM’s default hyperparmeters:

print("Let us compare all strategies with the default parameters from lightgbm")

loss_1 = compute_loss_cv(split_strategy=1)
loss_2 = compute_loss_cv(split_strategy=2)
loss_3 = compute_loss_cv(split_strategy=3)

loss_1_bs_sd = compute_cv_loss_multiple(split_strategy=1, bootstrap_sd=True, n_bs_samples=50)
loss_2_bs_sd = compute_cv_loss_multiple(split_strategy=2, bootstrap_sd=True, n_bs_samples=50)
loss_3_bs_sd = compute_cv_loss_multiple(split_strategy=3, bootstrap_sd=True, n_bs_samples=50)

print(f"Strategy 1 CV Loss: {loss_1:.6f} (bootstrap SD: {loss_1_bs_sd:.6f})")
print(f"Strategy 2 CV Loss: {loss_2:.6f} (bootstrap SD: {loss_2_bs_sd:.6f})")
print(f"Strategy 3 CV Loss: {loss_3:.6f} (bootstrap SD: {loss_3_bs_sd:.6f})")
Let us compare all strategies with the default parameters from lightgbm
Strategy 1 CV Loss: 0.112123 (bootstrap SD: 0.007160)
Strategy 2 CV Loss: 0.118035 (bootstrap SD: 0.007720)
Strategy 3 CV Loss: 0.230665 (bootstrap SD: 0.015199)

It seems like the first two are more stable. They give similar results among each other, and also have a much lower bootstrap SD. The first thing suggests that our intuition for building the smaller splits was on spot and the errors on those two cases are statistically indistinguishable.

print("Let us compare with the results if we weight the averages")

loss_1 = compute_loss_cv(split_strategy=1, weighted_average=True)
loss_2 = compute_loss_cv(split_strategy=2, weighted_average=True)
loss_3 = compute_loss_cv(split_strategy=3, weighted_average=True)

print(f"Strategy 1 CV Loss: {loss_1:.6f}")
print(f"Strategy 2 CV Loss: {loss_2:.6f}")
print(f"Strategy 3 CV Loss: {loss_3:.6f}")
Let us compare with the results if we weight the averages
Strategy 1 CV Loss: 0.120357
Strategy 2 CV Loss: 0.126712
Strategy 3 CV Loss: 0.141250

Given the unbalanced nature of our data, we should check how influential the weight on the loss is. That is another way of accessing the stability of the different losses. One way you think about we are doing here is this: if we set the weights to (1.0,2.0), what we are effectively doing is evaluating on an imagined dataset in which the proportion of y=1 is twice what we actually have in our dataset. Since we imagine that the proportion of y’s in the holdout dataset is different from the one we have, it seems natural to consider these small changes in the imagined proportions once we evaluate. A good CV loss should be not too sensitive to this.

print("Let us compare with the results if we weight the averages")

loss_1 = compute_loss_cv(split_strategy=1, weighted_average=False, loss_weight=(2.0, 1.0))
loss_2 = compute_loss_cv(split_strategy=2, weighted_average=False, loss_weight=(2.0, 1.0))
loss_3 = compute_loss_cv(split_strategy=3, weighted_average=False, loss_weight=(2.0, 1.0))

print(f"Strategy 1 CV Loss: {loss_1:.6f}")
print(f"Strategy 2 CV Loss: {loss_2:.6f}")
print(f"Strategy 3 CV Loss: {loss_3:.6f}")
Let us compare with the results if we weight the averages
Strategy 1 CV Loss: 0.095951
Strategy 2 CV Loss: 0.102045
Strategy 3 CV Loss: 0.173395
print("Let us compare with the results if we weight the averages")

loss_1 = compute_loss_cv(split_strategy=1, weighted_average=False, loss_weight=(1.0, 2.0))
loss_2 = compute_loss_cv(split_strategy=2, weighted_average=False, loss_weight=(1.0, 2.0))
loss_3 = compute_loss_cv(split_strategy=3, weighted_average=False, loss_weight=(1.0, 2.0))

print(f"Strategy 1 CV Loss: {loss_1:.6f}")
print(f"Strategy 2 CV Loss: {loss_2:.6f}")
print(f"Strategy 3 CV Loss: {loss_3:.6f}")
Let us compare with the results if we weight the averages
Strategy 1 CV Loss: 0.128295
Strategy 2 CV Loss: 0.134025
Strategy 3 CV Loss: 0.287935

Here all loss functions change their values considerably in both reweights. The instability of Strategies 1 and 2 look similar: in the first reweight 1 changes more more and 2 changes less and likewise in the second reweight. In light of this it seems slightly better to pick the second loss, because the first loss seems to tend to be too lenient with models that predict less 1’s than should be predicted. But we will double check this weighting stability again in the other tests.

Here we also see that our first two strategies have better robustness properties: in moving from weighted to unweighted averages the results do not change that much. This is just a consequence of the split being more even as we discussed ealier.

print("Let us compare the raw values of the CV losses on each data fold")

loss_1 = compute_loss_cv(split_strategy=1, average=False)
loss_2 = compute_loss_cv(split_strategy=2, average=False)
loss_3 = compute_loss_cv(split_strategy=3, average=False)
print(f"Strategy 1 CV Loss: {[f'{x:.6f}' for x in loss_1[0]]} (train sizes: {loss_1[1]})")
print(f"Strategy 2 CV Loss: {[f'{x:.6f}' for x in loss_2[0]]} (train sizes: {loss_2[1]})")
print(f"Strategy 3 CV Loss: {[f'{x:.6f}' for x in loss_3[0]]} (train sizes: {loss_3[1]})")
Let us compare the raw values of the CV losses on each data fold
Strategy 1 CV Loss: ['0.075982', '0.148263'] (train sizes: [3811, 6060])
Strategy 2 CV Loss: ['0.062852', '0.097306', '0.124682', '0.187300'] (train sizes: [3870, 4448, 6060, 6441])
Strategy 3 CV Loss: ['0.714737', '0.030938', '0.129826', '0.074739', '0.203086'] (train sizes: [776, 3811, 5203, 6060, 7081])

As is the previous examples, we see that the Strategy 1 and 2 have a much more well-behaved trend. Firstly there is less variability among the estimates. But also the error seems to change across the folds in an ordered way. A weird behavior seems to be going on though. It seems that as we increase the training size, the algorithm performs worse.

Q: Can you guess what might be going on?

3.2 If there’s distribution shift, who guarantees that minimizing CV error will result in the lowest test error?

  • There is no guarantee!
  • Since the CV validation folds span multiple different periods, we can hope that the resulting hyperparameters will be robust to the different possibilities of how 2013 will look like
  • Of course, 2013 can be something we’ve never seen before, and than there really is no guarantee..
  • Of course, we can also overfit the hyperparameters, but that’s a different problem that exists even with IID data

4 Write grid workflow

Let’s write a function that gets multiple values for each hyperparameter, and runs over all possible combinations.

Example:

  • Input - a=[1,2,3], b=[4,5], c=[6,7]
  • All possible combinations:
a=1, b=4, c=6
a=1, b=4, c=7
a=1, b=5, c=6
a=1, b=6, c=7
a=2, b=4, c=6
a=2, b=4, c=7
a=2, b=5, c=6
a=2, b=6, c=7
a=3, b=4, c=6
a=3, b=4, c=7
a=3, b=5, c=6
a=3, b=6, c=7
import itertools
from tqdm import tqdm

def compute_loss_cv_grid(split_strategy=2, bootstrap_sd=False, n_bs_samples=20, loss_weight=(1.0,1.0), **params):
    keys = params.keys()
    out = []
    total_iters = np.prod([len(v) for v in params.values()])
    for vals in tqdm(itertools.product(*params.values()), total=total_iters):
        cur_params = dict(zip(keys, vals))
        #print(f"Evaluating params: {cur_params}...")
        updated_params = cur_params.copy()
        updated_params["loss"] = compute_loss_cv(split_strategy=split_strategy, loss_weight=loss_weight, **cur_params)
        if bootstrap_sd:
            updated_params["sd"] = compute_loss_cv(split_strategy=split_strategy, bootstrap_sd=True, n_bs_samples=n_bs_samples, loss_weight=loss_weight, **cur_params)
        out.append(updated_params)
    out = pd.DataFrame(out)
    return out

4.1 How many hyperparameters/values to try at once?

Since we are fixing the model of interest, the only limit to how many hparams to run is our computational resources and how we set things up.

It is always good to take a manual look at the top performing models, which will usually be all tied within the bootstrap sd. This allows us to access trade-offs beyond the raw loss function value (e.g. complexity or training/deployment cost). Other than that, it is okay to pick the model with the lowest score, if that’s your only criteria. In ML the goal is beating the benchmark, not being entirely sure you have found the best model possible.

For this problem we can run things quickly in a notebook, so to we advise large hparams sweeps. Bear in mind that some machine learning models require hparam sweeps like we are doing to be able to perform well.

Knowing what the hparams do is important in order to access which ones you should prioritize optimizing by building finer grids, and which ones you can pass.

4.2 Hparam sweeps

res_strat_2_base = compute_loss_cv_grid(split_strategy=2,
    max_depth=[2, 10],
    learning_rate=[0.001, 0.1], 
    n_estimators=[50, 500], 
    min_child_samples=[5, 100], 
    num_leaves=[8, 32],
    subsample_freq=[1, 5],
    subsample=[0.5, 1.0],
    is_unbalance=[False,True]
)
100%|██████████| 256/256 [03:55<00:00,  1.09it/s]
res_strat_2_base = res_strat_2_base.sort_values("loss", ascending=True)
print(res_strat_2_base.head(5))
    max_depth  learning_rate  n_estimators  min_child_samples  num_leaves  \
78          2            0.1            50                  5          32   
74          2            0.1            50                  5          32   
66          2            0.1            50                  5           8   
70          2            0.1            50                  5           8   
94          2            0.1            50                100          32   

    subsample_freq  subsample  is_unbalance      loss  
78               5        1.0         False  0.089749  
74               1        1.0         False  0.089749  
66               1        1.0         False  0.089749  
70               5        1.0         False  0.089749  
94               5        1.0         False  0.090053  
res_strat_2_more_neg = compute_loss_cv_grid(split_strategy=2,
    loss_weight=(2.0, 1.0),
    max_depth=[2, 10],
    learning_rate=[0.001, 0.1], 
    n_estimators=[50, 500], 
    min_child_samples=[5, 100], 
    num_leaves=[8, 32],
    subsample_freq=[1, 5],
    subsample=[0.5, 1.0],
    is_unbalance=[False, True]
)
100%|██████████| 256/256 [03:52<00:00,  1.10it/s]
res_strat_2_more_pos = compute_loss_cv_grid(split_strategy=2,
    loss_weight=(1.0, 2.0),
    max_depth=[2, 10],
    learning_rate=[0.001, 0.1], 
    n_estimators=[50, 500], 
    min_child_samples=[5, 100], 
    num_leaves=[8, 32],
    subsample_freq=[1, 5],
    subsample=[0.5, 1.0],
    is_unbalance=[False, True]
)
100%|██████████| 256/256 [03:51<00:00,  1.11it/s]
res_strat_3_base = compute_loss_cv_grid(split_strategy=3,
    max_depth=[2, 10],
    learning_rate=[0.001, 0.1], 
    n_estimators=[50, 500], 
    min_child_samples=[5, 100], 
    num_leaves=[8, 32],
    subsample_freq=[1, 5],
    subsample=[0.5, 1.0],
    is_unbalance=[False, True]
)
100%|██████████| 256/256 [03:52<00:00,  1.10it/s]
strat_2_base_params = res_strat_2_base.loc[[res_strat_2_base["loss"].idxmin()]].to_dict(orient="records")[0]
strat_2_base_params.pop("loss", None)

strat_2_more_pos_params = res_strat_2_more_pos.loc[[res_strat_2_more_pos["loss"].idxmin()]].to_dict(orient="records")[0]
strat_2_more_pos_params.pop("loss", None)

strat_2_more_neg_params = res_strat_2_more_neg.loc[[res_strat_2_more_neg["loss"].idxmin()]].to_dict(orient="records")[0]
strat_2_more_neg_params.pop("loss", None)

strat_3_base_params = res_strat_3_base.loc[[res_strat_3_base["loss"].idxmin()]].to_dict(orient="records")[0]
strat_3_base_params.pop("loss", None)
0.15006635332466522
strat_2_base_params
{'max_depth': 2,
 'learning_rate': 0.1,
 'n_estimators': 50,
 'min_child_samples': 5,
 'num_leaves': 32,
 'subsample_freq': 5,
 'subsample': 1.0,
 'is_unbalance': False}
strat_2_more_pos_params
{'max_depth': 2,
 'learning_rate': 0.1,
 'n_estimators': 50,
 'min_child_samples': 5,
 'num_leaves': 8,
 'subsample_freq': 1,
 'subsample': 1.0,
 'is_unbalance': False}
strat_2_more_neg_params
{'max_depth': 2,
 'learning_rate': 0.1,
 'n_estimators': 50,
 'min_child_samples': 5,
 'num_leaves': 8,
 'subsample_freq': 1,
 'subsample': 1.0,
 'is_unbalance': False}
strat_3_base_params
{'max_depth': 2,
 'learning_rate': 0.001,
 'n_estimators': 500,
 'min_child_samples': 5,
 'num_leaves': 8,
 'subsample_freq': 5,
 'subsample': 0.5,
 'is_unbalance': False}

We can see that strategy 2 in CV gives a more or less stable result relative to reweightings of the loss. The third strategy on the other hand, gives a significantly different result. Our final assessment will have to be on the holdout data.

5 Final CV loss

5.1 Test set losses

X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv").drop("Date", axis=1)
X_holdout.Species = X_holdout.Species.astype("category")
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()
josh = {
  "max_depth": 1,
  "n_estimators": 1000,
  "learning_rate": 0.008362510309503735,
  "reg_lambda": 10.0,
  "min_child_samples": 200,
  "subsample": 0.9,
  "colsample_bytree": 1.0,
  "subsample_freq": 1
}

erez = {
    "learning_rate": 0.01,
    "n_estimators": 1000,
    "min_child_samples": 50,
    "num_leaves": 2,
    "subsample_freq": 1,
    "subsample": 0.5,
}

lucas = strat_2_base_params

bbbaaaccc = {"colsample_bytree": 0.8, 
             "learning_rate": 0.01, 
             "min_child_samples": 20, 
             "n_estimators": 100, 
             "num_leaves": 63, 
             "reg_lambda": 1.0, 
             "scale_pos_weight": 25.006410256410255, 
             "subsample": 0.8, 
             "subsample_freq": 5}

qwerty = {
    "learning_rate": 0.001,
    "max_depth": 2,
    "num_leaves": 10,
    "min_child_samples": 10,
    "n_estimators": 500,
    "objective": "binary",
    "is_unbalance": True,
    "random_state": 2027
}
### Copmute test errors with BS variability checks

n_bs_samples = 100

X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv").drop("Date", axis=1)
X_holdout.Species = X_holdout.Species.astype("category")
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()

def compute_loss_test(X_train,y_train, X_holdout, y_holdout, **params):
    model = lgb.LGBMClassifier(**params, verbosity=-1)
    model.fit(X_train.drop("Date", axis=1), y_train, categorical_feature="Species")
    pred = model.predict_proba(X_holdout)[:,1]
    return logloss(y_holdout, pred)

losses_erez = []
losses_josh = []
losses_lucas = []
losses_bbbaaaccc = []
losses_qwerty = []

for _ in tqdm(range(n_bs_samples)):
    idx = _stratified_bootstrap_indices(X, y, no_date=True, bs_sample_size=len(X))
    X_bs = X.iloc[idx].reset_index(drop=True)
    y_bs = y.iloc[idx].reset_index(drop=True)

    idx_holdout = _stratified_bootstrap_indices(X_holdout, y_holdout, no_date=True, bs_sample_size=len(X_holdout))
    X_holdout_bs = X_holdout.iloc[idx_holdout].reset_index(drop=True)
    y_holdout_bs = y_holdout.iloc[idx_holdout].reset_index(drop=True)

    losses_erez.append(compute_loss_test(X_bs, y_bs, X_holdout, y_holdout, **erez))
    losses_josh.append(compute_loss_test(X_bs, y_bs, X_holdout, y_holdout, **josh))
    losses_lucas.append(compute_loss_test(X_bs, y_bs, X_holdout, y_holdout, **lucas))
    losses_bbbaaaccc.append(compute_loss_test(X_bs, y_bs, X_holdout, y_holdout, **bbbaaaccc))
    losses_qwerty.append(compute_loss_test(X_bs, y_bs, X_holdout, y_holdout, **qwerty))
  0%|          | 0/100 [00:00<?, ?it/s]100%|██████████| 100/100 [02:06<00:00,  1.26s/it]
losses_erez = np.array(losses_erez)
losses_lucas = np.array(losses_lucas)
losses_josh = np.array(losses_josh)
losses_bbbaaaccc = np.array(losses_bbbaaaccc)
losses_qwerty = np.array(losses_qwerty)
plt.hist(losses_erez, bins=10, label="Erez")
plt.hist(losses_lucas, bins=10, alpha=0.75, label="Lucas")
plt.hist(losses_josh, bins=10, alpha=0.5, label="Josh")
plt.hist(losses_bbbaaaccc, bins=10, alpha=0.3, label="bbbaaaccc")
plt.hist(losses_qwerty, bins=10, alpha=0.3, label="qwerty")
plt.legend()

Let us try to understand what is going on by looking at the statistics of the holdout data

X_holdout["Date"] = pd.to_datetime(
    "2013" + X_holdout["DayOfYear"].astype(int).astype(str).str.zfill(3),
    format="%Y%j"
)


#X_holdout.Date = pd.to_datetime(X_holdout.Date)
table_obs_counts = pd.crosstab(X_holdout.Date.dt.year, X_holdout.Date.dt.month)

table_y_sum = pd.crosstab(
    index=X_holdout.Date.dt.year,
    columns=X_holdout.Date.dt.month,
    values=y_holdout,
    aggfunc="sum"
).fillna(0).astype(int)


table_y_mean = pd.crosstab(
    index=X_holdout.Date.dt.year,
    columns=X_holdout.Date.dt.month,
    values=y_holdout,
    aggfunc="mean"
).fillna(0).astype(float)

display(HTML(f"""
<div style="display:flex; gap:24px; align-items:flex-start;">
    <div>{table_obs_counts.style.set_caption("table_obs_counts").to_html()}</div>
    <div>{table_y_sum.style.set_caption("table_y_sum").to_html()}</div>
    <div>{table_y_mean.style.set_caption("table_y_mean").format("{:.3f}").to_html()}</div>
</div>
"""))
Table 13: table_obs_counts
Date 6 7 8 9
Date        
2013 436 636 834 486
Table 14: table_y_sum
Date 6 7 8 9
Date        
2013 1 24 144 70
Table 15: table_y_mean
Date 6 7 8 9
Date        
2013 0.002 0.038 0.173 0.144

Note that there is a big distribution shift making the proportions of positive cases much higher. We have roughly a proportion which is twice as much. This made methods who skewed toward penalizing more false negatives fare better. But overall everyone did equally well. From the little data that was given to you, there was no way you could have predicted this distribution shift. The winners to a great extent got lucky.

res_strat_2_much_more_pos = compute_loss_cv_grid(
    split_strategy=2,
    loss_weight=(1.0, 10.0),
    max_depth=[2, 10],
    learning_rate=[0.001, 0.1], 
    n_estimators=[50, 500], 
    min_child_samples=[5, 100], 
    num_leaves=[8, 32],
    subsample_freq=[1, 5],
    subsample=[0.5, 1.0],
    is_unbalance=[False, True]
)
100%|██████████| 256/256 [04:04<00:00,  1.05it/s]
res_strat_3_much_more_pos = compute_loss_cv_grid(
    split_strategy=3,
    loss_weight=(1.0, 10.0),
    max_depth=[2, 10],
    learning_rate=[0.001, 0.1], 
    n_estimators=[50, 500], 
    min_child_samples=[5, 100], 
    num_leaves=[8, 32],
    subsample_freq=[1, 5],
    subsample=[0.5, 1.0],
    is_unbalance=[False, True]
)
100%|██████████| 256/256 [04:33<00:00,  1.07s/it]
strat_3_much_more_pos = res_strat_3_much_more_pos.loc[[res_strat_3_much_more_pos["loss"].idxmin()]].to_dict(orient="records")[0]
strat_3_much_more_pos.pop("loss", None)

strat_2_much_more_pos = res_strat_2_much_more_pos.loc[[res_strat_2_much_more_pos["loss"].idxmin()]].to_dict(orient="records")[0]
strat_2_much_more_pos.pop("loss", None)
0.08113101940624967

X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv").drop("Date", axis=1)
X_holdout.Species = X_holdout.Species.astype("category")
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()

compute_loss_test(X, y, X_holdout, y_holdout, **strat_2_much_more_pos)
0.28884527456945
compute_loss_test(X, y, X_holdout, y_holdout, **strat_3_much_more_pos)
0.289565876412703

As one can see, simply knowing that the proportion of positive cases increased, would have allowed us to get a substantial improvement by simply evaluating with a weighted loss on our same data.

However we did not have this information. For all we know, the opposite could have been true. Hence the best we could have done was accessing the sensitivity of our conclusions and at least hoping to improve on the baseline, which we did.

5.2 Why did winner do well (qwerty)?

Possible reasons:

  • They optimized a wider range of parameters, including handling unbalanced data which is crucial for this problem
  • They attempted more than one CV estimation method
  • Distribution shift happened to work in their favor