Lab 5 Exercise - Cross Validation
1 Download data and run a gradient boosting machine
The data may be found here:
These data were gathered in order to predict the future presence of mosquitoes carrying West Nile Virus at various sampling stations located around Chicago. Information about the time of year and the location of the sampling station, as well as results from past samples at the same location, was collected and engineered into the dataset that you have.
The exercise for you today is to tune hyperparameters of a gradient boosting machine using cross validation. Here is a code snippet showing how you can train a model and compute loss.
    import lightgbm as lgb
    import numpy as np
    import pandas as pd

    def logloss(y, pred):
        return -np.mean(y * np.log(pred) + (1-y) * np.log(1-pred))

    X_train = pd.read_csv("X_cv_train.csv")
    X_train.Species = X_train.Species.astype("category")
    X_test = pd.read_csv("X_cv_test.csv")
    X_test.Species = X_test.Species.astype("category")
    y_train = pd.read_csv("y_cv_train.csv").squeeze()
    y_test = pd.read_csv("y_cv_test.csv").squeeze()

    model = lgb.LGBMClassifier(verbosity=-1)
    model.fit(X_train, y_train, categorical_feature="Species")
    pred = model.predict_proba(X_test)[:, 1]
    logloss(y_test, pred), logloss(y_test, y_train.mean())
2 Optimize a single parameter
Find optimal values of the hyperparameter min_child_samples:
- By minimizing train error.
- By minimizing validation error, after making a single train-validation split.
- By minimizing cross validation error, with different numbers of folds.
Make sure to follow the instructions for non-IID data!
Then:
- Compare the test error of each method.
- Compare the optimized values of min_child_samples obtained by each method. Explain the differences in test error in terms of the differences in min_child_samples.
- Check what happens if the instructions for non-IID data are not followed.
3 Optimizing multiple parameters
Repeat the process above, this time optimizing three parameters together: min_child_samples, n_estimators and learning_rate.
4 Further improvements
Try to improve the model further by implementing greedy feature selection based on cross validation, or by optimizing further hyperparameters.