Lab 5 Exercise - Cross Validation

Author

Erez Buchweitz

1 Download data and run a gradient boosting machine

The data may be found here:

These data were gathered in order to predict the future presence of mosquitoes carrying West Nile Virus at various sampling stations located around Chicago. Information about the time of year and the location of the sampling station, as well as results from past samples at the same location, was collected and engineered into the dataset that you have.

Your exercise today is to tune the hyperparameters of a gradient boosting machine using cross validation. The following code snippet shows how to train a model and compute its loss.

import lightgbm as lgb
import numpy as np
import pandas as pd

def logloss(y, pred):
    # clip predictions away from 0 and 1 so that log() never returns -inf
    eps = 1e-15
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))

# load the train and test splits; Species is a categorical feature
X_train = pd.read_csv("X_cv_train.csv")
X_train.Species = X_train.Species.astype("category")
X_test = pd.read_csv("X_cv_test.csv")
X_test.Species = X_test.Species.astype("category")
y_train = pd.read_csv("y_cv_train.csv").squeeze()
y_test = pd.read_csv("y_cv_test.csv").squeeze()

model = lgb.LGBMClassifier(verbosity=-1)
model.fit(X_train, y_train, categorical_feature=["Species"])
pred = model.predict_proba(X_test)[:, 1]

# test loss of the model vs. the constant baseline that always
# predicts the training base rate
logloss(y_test, pred), logloss(y_test, y_train.mean())

2 Optimize a single parameter

Find optimal values of the hyperparameter min_child_samples in three ways:

  • By minimizing train error.
  • By minimizing validation error, after making a single train-validation split.
  • By minimizing cross validation error, with different numbers of folds.

Make sure to follow the instructions for non-IID data! A sketch of all three approaches follows.
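For concreteness, here is one possible sketch of the three approaches, not the prescribed implementation. Everything in it is an assumption to adapt: the candidate grid, the 25% holdout used for the single split, and the contiguous (non-shuffled) folds that stand in for whatever splitting scheme the non-IID instructions prescribe.

from sklearn.model_selection import KFold

candidates = [5, 10, 20, 50, 100, 200, 500]  # assumed grid of values to try

def fit_predict(params, X_tr, y_tr, X_va):
    # train a model with the given hyperparameters, return predicted probabilities
    m = lgb.LGBMClassifier(verbosity=-1, **params)
    m.fit(X_tr, y_tr, categorical_feature=["Species"])
    return m.predict_proba(X_va)[:, 1]

# (a) train error: fit and evaluate on the same data
def train_error(mcs):
    return logloss(y_train, fit_predict({"min_child_samples": mcs}, X_train, y_train, X_train))

# (b) single train-validation split; a contiguous tail holdout is used
# here as a stand-in for the non-IID splitting instructions
n_val = len(X_train) // 4
X_tr, X_va = X_train.iloc[:-n_val], X_train.iloc[-n_val:]
y_tr, y_va = y_train.iloc[:-n_val], y_train.iloc[-n_val:]

def val_error(mcs):
    return logloss(y_va, fit_predict({"min_child_samples": mcs}, X_tr, y_tr, X_va))

# (c) k-fold cross validation with contiguous, non-shuffled folds
def cv_error(mcs, k=5):
    losses = []
    for tr_idx, va_idx in KFold(n_splits=k, shuffle=False).split(X_train):
        pred = fit_predict({"min_child_samples": mcs},
                           X_train.iloc[tr_idx], y_train.iloc[tr_idx],
                           X_train.iloc[va_idx])
        losses.append(logloss(y_train.iloc[va_idx], pred))
    return np.mean(losses)

best_by_train = min(candidates, key=train_error)
best_by_val = min(candidates, key=val_error)
best_by_cv = min(candidates, key=cv_error)

Rerun cv_error with several values of k to see how the number of folds affects the chosen value.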

Then:

  • Compare the test error of each method (see the sketch after this list).
  • Compare the optimized values of min_child_samples obtained by each method. Explain the differences in test error in terms of the differences in min_child_samples.
  • Check what happens if the instructions for non-IID data are not followed.
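A possible sketch of the test-error comparison, reusing best_by_train, best_by_val and best_by_cv from the sketch above:

for name, mcs in [("train", best_by_train),
                  ("validation", best_by_val),
                  ("cross validation", best_by_cv)]:
    m = lgb.LGBMClassifier(min_child_samples=mcs, verbosity=-1)
    m.fit(X_train, y_train, categorical_feature=["Species"])
    print(name, mcs, logloss(y_test, m.predict_proba(X_test)[:, 1]))

Minimizing train error usually favors very flexible settings that overfit; check whether the test errors and chosen values bear this out.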

3 Optimize multiple parameters

Repeat the process above, this time optimizing three parameters together: min_child_samples, n_estimators, and learning_rate. One possible grid-search sketch follows.
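The sketch below runs an exhaustive grid search scored by the same non-shuffled cross validation as above. The grid itself is an assumption; with three values per parameter it already costs 27 x k model fits, so keep it coarse or switch to a randomized search.

from itertools import product

grid = {  # assumed candidate values
    "min_child_samples": [10, 50, 200],
    "n_estimators": [50, 100, 300],
    "learning_rate": [0.02, 0.1, 0.3],
}

def cv_error_params(params, k=5):
    # mean validation logloss across non-shuffled folds, for a parameter dict
    losses = []
    for tr_idx, va_idx in KFold(n_splits=k, shuffle=False).split(X_train):
        m = lgb.LGBMClassifier(verbosity=-1, **params)
        m.fit(X_train.iloc[tr_idx], y_train.iloc[tr_idx], categorical_feature=["Species"])
        losses.append(logloss(y_train.iloc[va_idx], m.predict_proba(X_train.iloc[va_idx])[:, 1]))
    return np.mean(losses)

best_params = min((dict(zip(grid, vals)) for vals in product(*grid.values())),
                  key=cv_error_params)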

4 Further improvements

Try to improve the model further by implementing greedy feature selection based on cross validation (sketched below), or by tuning additional hyperparameters.
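As a starting point, here is a sketch of greedy forward selection: begin with no features, repeatedly add the single column that most reduces the cross validation loss, and stop when no column improves it. The non-shuffled folds are again a stand-in for the non-IID splitting scheme.

def cv_error_features(cols, k=5):
    # cross validation loss using only the given feature columns
    losses = []
    for tr_idx, va_idx in KFold(n_splits=k, shuffle=False).split(X_train):
        m = lgb.LGBMClassifier(verbosity=-1)
        m.fit(X_train.iloc[tr_idx][cols], y_train.iloc[tr_idx],
              categorical_feature=["Species"] if "Species" in cols else "auto")
        losses.append(logloss(y_train.iloc[va_idx],
                              m.predict_proba(X_train.iloc[va_idx][cols])[:, 1]))
    return np.mean(losses)

selected, remaining = [], list(X_train.columns)
best_loss = np.inf
while remaining:
    scores = {c: cv_error_features(selected + [c]) for c in remaining}
    best_col = min(scores, key=scores.get)
    if scores[best_col] >= best_loss:
        break  # no remaining column improves the cross validation loss
    selected.append(best_col)
    remaining.remove(best_col)
    best_loss = scores[best_col]

Backward elimination (start from all columns and greedily drop) is an equally valid variant.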