Lab 6 Exercise - Cross Validation Competition
You are given two files: X_cv_competition.csv and y_cv_competition.csv.
This is the same West Nile Virus dataset from the previous lab, except that the train and test sets have been combined into a single dataset, and a Date column has been added to help partition the data into folds. The data come from the years 2007, 2009 and 2011.
You may download and read them like this:
```
import pandas as pd

X = pd.read_csv(f"{path}/X_cv_competition.csv")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()
```
Your goal is to find optimal hyperparameters for lightgbm using cross validation that accounts for the fact that there is a time element in the data.
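One possible scheme (a sketch, not the only valid approach) is leave-one-year-out cross validation: the Date column defines the fold groups and is dropped before fitting, mirroring how the holdout evaluation works. The candidate hyperparameters below are placeholders, and `path` is assumed to be defined as in the snippet above.

```
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import log_loss

X = pd.read_csv(f"{path}/X_cv_competition.csv")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()

years = pd.to_datetime(X.Date).dt.year      # 2007, 2009, 2011 -> fold groups
X = X.drop("Date", axis=1)                  # Date is used only for splitting
X.Species = X.Species.astype("category")

candidate_params = {"learning_rate": 0.1, "n_estimators": 100}  # placeholder values

losses = []
for train_idx, val_idx in LeaveOneGroupOut().split(X, y, groups=years):
    model = lgb.LGBMClassifier(**candidate_params)
    model.fit(X.iloc[train_idx], y.iloc[train_idx], categorical_feature="Species")
    pred = model.predict_proba(X.iloc[val_idx])[:, 1]
    losses.append(log_loss(y.iloc[val_idx], pred))

print(np.mean(losses))   # average validation log loss for these hyperparameters
```

Each fold then holds out an entire year, which resembles the real task of generalizing to an unseen year.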
1 How to submit hyperparameters
You will submit a JSON file containing a dictionary with the hyperparameters you optimized. You can save the JSON file like this:
```
import json

params = {
    "min_child_samples": 20,
    "learning_rate": 0.1,
    "n_estimators": 100,
}

output_fn = f"params_{codename}.json"
with open(f"{path}/{output_fn}", "w") as f:
    f.write(json.dumps(params))
```
The name of the JSON file should be "params_{codename}.json", where {codename} is a secret codename of your group’s choosing, which you will use later to identify your submission on the leaderboard. The file, alongside your code, should be submitted to GradeScope.
2 How your hyperparameters will be scored
I will train a lightgbm model on the data (X, y) using the hyperparameters you submitted, then evaluate the model on a held-out set comprising data from the year 2013. I will run this code:
```
import lightgbm as lgb
import numpy as np
import pandas as pd
import json

def logloss(y, pred):
    return -np.mean(y * np.log(pred) + (1-y) * np.log(1-pred))

# Load the submitted hyperparameters.
params_fn = f"params_{codename}.json"
with open(f"{path}/{params_fn}") as f:
    params = json.loads(f.read())

# Date is dropped everywhere; it is not available as a feature.
X = pd.read_csv(f"{path}/X_cv_competition.csv").drop("Date", axis=1)
X.Species = X.Species.astype("category")
X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv").drop("Date", axis=1)
X_holdout.Species = X_holdout.Species.astype("category")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()

# Train on the competition data, evaluate on the 2013 holdout.
model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature="Species")
pred = model.predict_proba(X_holdout)[:, 1]
logloss(y_holdout, pred)
```
Your model should not use Date as a feature. As you can see, I’m passing your hyperparameters to LGBMClassifier’s constructor. You are free to optimize any and all of the hyperparameters that the constructor accepts.
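One way you might search over those constructor arguments is a randomized search with year-based folds, so that validation respects the same time structure as the holdout evaluation. The sketch below is illustrative only: the parameter ranges are guesses, not recommendations, and `path` is again assumed to be defined.

```
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV, GroupKFold

X = pd.read_csv(f"{path}/X_cv_competition.csv")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()
years = pd.to_datetime(X.Date).dt.year
X = X.drop("Date", axis=1)          # Date is only used to define the folds
X.Species = X.Species.astype("category")

# Illustrative ranges; any constructor argument of LGBMClassifier is fair game.
param_distributions = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 1000],
    "min_child_samples": [10, 20, 50],
}

search = RandomizedSearchCV(
    lgb.LGBMClassifier(),
    param_distributions,
    n_iter=20,
    scoring="neg_log_loss",          # matches the loss used on the holdout set
    cv=GroupKFold(n_splits=3),       # three folds = three years
    random_state=0,
)
search.fit(X, y, groups=years, categorical_feature="Species")
print(search.best_params_, -search.best_score_)
```

The best parameters found this way can then be written to the JSON file as described in section 1.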
A leaderboard will be published with every group’s codename alongside its loss on the holdout set, ranked from low to high. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the lowest test loss is the winner.
3 How you will be graded
A submission reflecting honest effort will receive a full grade. Your rank on the leaderboard will not factor into your grade in any way, except that the students in the two top-performing groups after the competition has ended (that is, the groups with the lowest test loss) will receive a bonus point to their final grade (out of 100).
4 Timeline
The competition will end at the end of this lab session.
5 Some helpful information about lightgbm
Here are some significant hyperparameters to look at:
| Hyperparameter | Relates to | Meaning | Complexity when increased |
|---|---|---|---|
| num_leaves | Tree | Number of leaves in each tree | Increases |
| max_depth | Tree | Depth of each tree | Increases |
| learning_rate | Boosting | Step size | Increases |
| n_estimators | Boosting | Number of steps | Increases |
| min_child_samples | Tree | Minimum samples in a leaf | Decreases |
| subsample, subsample_freq | Boosting | Subsample rows at each tree, for ensembling | |
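For reference, a candidate dictionary combining these knobs might look like the sketch below. The values are placeholders for illustration, not recommendations; note that in lightgbm, subsample only takes effect when subsample_freq is set to a positive value.

```
# Placeholder values for illustration only.
params = {
    "num_leaves": 31,
    "max_depth": -1,            # -1 means no depth limit
    "learning_rate": 0.05,
    "n_estimators": 300,
    "min_child_samples": 20,
    "subsample": 0.8,           # row subsampling per tree...
    "subsample_freq": 1,        # ...only active when subsample_freq > 0
}
```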