Lab 6 Exercise - Cross Validation Competition

Author

Erez Buchweitz

You are given two files: X_cv_competition.csv and y_cv_competition.csv.

This is the same West Nile Virus dataset from the previous lab, except that the train and test sets have been combined into a single dataset, and a Date column has been added to help partition the data into folds. The data are from the years 2007, 2009, and 2011.

You may download and read them like this:

import pandas as pd

# path is the directory containing the downloaded files
X = pd.read_csv(f"{path}/X_cv_competition.csv")
# squeeze() converts the single-column DataFrame into a Series
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()

Your goal is to find optimal hyperparameters for lightgbm using cross validation that accounts for the time element in the data.
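
One natural way to respect the time element, sketched below, is leave-one-year-out cross validation: each year serves once as the validation fold while the remaining years are used for training. This assumes the Date column parses with pd.to_datetime; the hyperparameter values shown are placeholders, not recommendations.

import lightgbm as lgb
import numpy as np
import pandas as pd

X = pd.read_csv(f"{path}/X_cv_competition.csv")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()

years = pd.to_datetime(X["Date"]).dt.year   # fold labels: 2007, 2009, 2011
X_feat = X.drop("Date", axis=1)             # Date itself must not be a feature
X_feat["Species"] = X_feat["Species"].astype("category")

def logloss(y_true, pred):
    return -np.mean(y_true * np.log(pred) + (1 - y_true) * np.log(1 - pred))

params = {"learning_rate": 0.1, "n_estimators": 100}  # candidate to evaluate
losses = []
for year in sorted(years.unique()):
    train, val = years != year, years == year
    model = lgb.LGBMClassifier(**params)
    model.fit(X_feat[train], y[train], categorical_feature="Species")
    pred = model.predict_proba(X_feat[val])[:, 1]
    losses.append(logloss(y[val], pred))
print(np.mean(losses))                      # average validation logloss across years

Since the holdout year (2013) lies after all of the training years, you could also try training only on earlier years and validating on the latest one, which mimics the holdout setup more closely.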

1 How to submit hyperparameters

You will submit a JSON file containing a dictionary with the hyperparameters you optimized. You can save the JSON file like this:

import json

params = {
    "min_child_samples": 20,
    "learning_rate": 0.1,
    "n_estimators": 100,
}

output_fn = f"params_{codename}.json"
with open(f"{path}/{output_fn}", "w") as f:
    f.write(json.dumps(params))

The name of the JSON file should be "params_{codename}.json", where {codename} is a secret codename of your group's choosing; you will use it later to identify your submission on the leaderboard. The file, alongside your code, should be submitted to Gradescope.

2 How your hyperparameters will be scored

I will train a lightgbm model on the data (X,y) using the hyperparameters you submitted, then evaluate the model on a held-out set comprising data from the year 2013. I will run this code:

import lightgbm as lgb
import numpy as np
import pandas as pd
import json

def logloss(y, pred):
    # binary cross-entropy; pred holds predicted probabilities of class 1
    return -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))

params_fn = f"params_{codename}.json"
with open(f"{path}/{params_fn}") as f:
    params = json.loads(f.read())

# Date is dropped: it must not be used as a feature
X = pd.read_csv(f"{path}/X_cv_competition.csv").drop("Date", axis=1)
X.Species = X.Species.astype("category")
X_holdout = pd.read_csv(f"{path}/X_cv_holdout.csv").drop("Date", axis=1)
X_holdout.Species = X_holdout.Species.astype("category")
y = pd.read_csv(f"{path}/y_cv_competition.csv").squeeze()
y_holdout = pd.read_csv(f"{path}/y_cv_holdout.csv").squeeze()

model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature="Species")
pred = model.predict_proba(X_holdout)[:,1]
logloss(y_holdout, pred)

Your model should not use Date as a feature. As you can see, I’m passing your hyperparameters to LGBMClassifier’s constructor. You are free to optimize any and all of the hyperparameters that the constructor accepts.
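
For instance, a small random search over a few of the constructor's hyperparameters might look like the sketch below. Here cv_logloss is a hypothetical helper returning the mean validation logloss of the year-based loop shown earlier, and the parameter ranges are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
best_params, best_loss = None, np.inf
for _ in range(20):                          # number of random candidates
    candidate = {
        "num_leaves": int(rng.integers(8, 128)),
        "max_depth": int(rng.integers(2, 12)),
        "learning_rate": float(10 ** rng.uniform(-2, -0.5)),
        "n_estimators": int(rng.integers(50, 500)),
        "min_child_samples": int(rng.integers(5, 100)),
    }
    loss = cv_logloss(candidate)             # hypothetical helper (see above)
    if loss < best_loss:
        best_params, best_loss = candidate, loss

best_params can then be saved as the JSON submission described in Section 1.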

A leaderboard will be published with every group's codename alongside its loss on the holdout set, ranked from lowest to highest. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the lowest test loss is the winner.

3 How you will be graded

A submission reflecting honest effort will receive a full grade. Your rank on the leaderboard will not factor into your grade in any way, except that the students in the top two performing groups after the competition has ended (that is, the groups with the lowest test loss) will receive a bonus point to their final grade (out of 100).

4 Timeline

The competition will end at the end of this lab session.

5 Some helpful information about lightgbm

Here are some significant hyperparameters to look at:

Hyperparameter             Relates to   Meaning                                        Complexity when increased
num_leaves                 Tree         Number of leaves in each tree                  Increases
max_depth                  Tree         Depth of each tree                             Increases
learning_rate              Boosting     Step size                                      Increases
n_estimators               Boosting     Number of steps                                Increases
min_child_samples          Tree         Minimum samples in leaf                        Decreases
subsample, subsample_freq  Boosting     Subsample rows at each tree, for ensembling    —
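
One detail worth knowing: in lightgbm, subsample only takes effect when subsample_freq is positive; with the default subsample_freq=0, row subsampling is disabled regardless of the subsample value. A minimal example:

import lightgbm as lgb

# subsample < 1.0 is only applied when subsample_freq > 0
model = lgb.LGBMClassifier(subsample=0.8, subsample_freq=1)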

6 GOOD LUCK AND HAVE FUN!