Lab 6 Exercise - Cross Validation Competition

Author

Erez Buchweitz

You are given two files: X_cv_competition.csv and y_cv_competition.csv.

This is the same West Nile Virus dataset from the previous lab, except that the train and test sets have been combined into a single dataset, and a Date column has been added to help partition the data into folds. The data are from the years 2007, 2009, and 2011.

You may download and read them like this:

import pandas as pd

# path is the directory containing the downloaded files
X = pd.read_csv(f"{path}/X_cv_competition.csv")
# squeeze() converts the single-column DataFrame into a Series
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()

Your goal is to find optimal hyperparameters for lightgbm using cross validation that accounts for the time element in the data.
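
One natural way to respect the time element, sketched below, is leave-one-year-out cross validation: each year serves once as the validation fold while the remaining years are used for training. This assumes the Date column parses with pd.to_datetime; the hyperparameter values shown are placeholders, not recommendations.

import lightgbm as lgb
import numpy as np
import pandas as pd

X = pd.read_csv(f"{path}/X_cv_competition.csv")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()

years = pd.to_datetime(X["Date"]).dt.year   # fold labels: 2007, 2009, 2011
X_feat = X.drop("Date", axis=1)             # Date itself must not be a feature
X_feat["Species"] = X_feat["Species"].astype("category")

def logloss(y_true, pred):
    return -np.mean(y_true * np.log(pred) + (1 - y_true) * np.log(1 - pred))

params = {"learning_rate": 0.1, "n_estimators": 100}  # candidate to evaluate
losses = []
for year in sorted(years.unique()):
    train, val = years != year, years == year
    model = lgb.LGBMClassifier(**params)
    model.fit(X_feat[train], y[train], categorical_feature="Species")
    pred = model.predict_proba(X_feat[val])[:, 1]
    losses.append(logloss(y[val], pred))
print(np.mean(losses))                      # average validation logloss across years

Since the holdout year (2013) lies after all of the training years, you could also try training only on earlier years and validating on the latest one, which mimics the holdout setup more closely.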

1 How to submit hyperparameters

You will submit a JSON file containing a dictionary with the hyperparameters you optimized. You can save the JSON file like this:

import json

params = {
    "min_child_samples": 20,
    "learning_rate": 0.1,
    "n_estimators": 100,
}

output_fn = f"params_{codename}.json"
with open(f"{path}/{output_fn}", "w") as f:
    f.write(json.dumps(params))

The name of the JSON file should be "params_{codename}.json", where {codename} is a secret codename of your group's choosing; you will use it later to identify your submission on the leaderboard. The file, alongside your code, should be submitted to Gradescope.

2 How your hyperparameters will be scored

I will train a lightgbm model on the data (X,y) using the hyperparameters you submitted, then evaluate the model on a held-out set comprising data from the year 2013. I will run this code:

import lightgbm as lgb
import numpy as np
import pandas as pd
import json

def logloss(y, pred):
    # binary cross-entropy; pred holds predicted probabilities of class 1
    return -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))

params_fn = f"params_{codename}.json"
with open(f"{path}/{params_fn}") as f:
    params = json.loads(f.read())

# Date is dropped: it must not be used as a feature
X = pd.read_csv(f"{path}/X_cv_competition.csv").drop("Date", axis=1)
X.Species = X.Species.astype("category")
X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv").drop("Date", axis=1)
X_holdout.Species = X_holdout.Species.astype("category")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()

model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature="Species")
pred = model.predict_proba(X_holdout)[:,1]
logloss(y_holdout, pred)

Your model should not use Date as a feature. As you can see, I’m passing your hyperparameters to LGBMClassifier’s constructor. You are free to optimize any and all of the hyperparameters that the constructor accepts.
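
For instance, a small random search over a few of the constructor's hyperparameters might look like the sketch below. Here cv_logloss is a hypothetical helper returning the mean validation logloss of the year-based loop shown earlier, and the parameter ranges are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
best_params, best_loss = None, np.inf
for _ in range(20):                          # number of random candidates
    candidate = {
        "num_leaves": int(rng.integers(8, 128)),
        "max_depth": int(rng.integers(2, 12)),
        "learning_rate": float(10 ** rng.uniform(-2, -0.5)),
        "n_estimators": int(rng.integers(50, 500)),
        "min_child_samples": int(rng.integers(5, 100)),
    }
    loss = cv_logloss(candidate)             # hypothetical helper (see above)
    if loss < best_loss:
        best_params, best_loss = candidate, loss

best_params can then be saved as the JSON submission described in Section 1.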

A leaderboard will be published with every group's codename alongside its loss on the holdout set, ranked from lowest to highest. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the lowest test loss is the winner.

3 How you will be graded

A submission reflecting honest effort will receive a full grade. Your rank on the leaderboard will not factor into your grade in any way, except that the students in the top two performing groups after the competition has ended (that is, the groups with the lowest test loss) will receive a bonus point to their final grade (out of 100).

4 Timeline

The competition will end at the end of this lab session.

5 Some helpful information about lightgbm

Here are some significant hyperparameters to look at:

Hyperparameter             Relates to   Meaning                                        Complexity when increased
num_leaves                 Tree         Number of leaves in each tree                  Increases
max_depth                  Tree         Depth of each tree                             Increases
learning_rate              Boosting     Step size                                      Increases
n_estimators               Boosting     Number of steps                                Increases
min_child_samples          Tree         Minimum samples in leaf                        Decreases
subsample, subsample_freq  Boosting     Subsample rows at each tree, for ensembling    —
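
One detail worth knowing: in lightgbm, subsample only takes effect when subsample_freq is positive; with the default subsample_freq=0, row subsampling is disabled regardless of the subsample value. A minimal example:

import lightgbm as lgb

# subsample < 1.0 is only applied when subsample_freq > 0
model = lgb.LGBMClassifier(subsample=0.8, subsample_freq=1)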

6 GOOD LUCK AND HAVE FUN!