Lab 11 Exercise - Bagging and Boosting Competition

Author

Erez Buchweitz

Your task is to predict how long a taxi ride will take. You are given three files:

You may download and read them like this:

import pandas as pd

X_train = pd.read_parquet("X_train.parquet")
y_train = pd.read_parquet("y_train.parquet")

There are two data sets, (X_train, y_train) and (X_test, y_test). However, I have hidden y_test from you. Your task is to make the best predictions of y_test possible based on the training data and on X_test.
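Because y_test is hidden, one way to gauge progress between submissions is to hold out part of the training data and score your predictions on it. The sketch below is a minimal illustration on synthetic arrays, not a prescribed workflow: the helper name `holdout_split`, the 80/20 split, and the stand-in data are all assumptions made for the example.

```python
import numpy as np

def holdout_split(n_rows, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx), a random disjoint partition of row positions.

    Hypothetical helper for illustration; not part of the lab materials.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_valid = int(round(n_rows * valid_fraction))
    return perm[n_valid:], perm[:n_valid]

# Synthetic stand-in for the real data (assumption: the real files differ).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = np.exp(rng.normal(loc=6.0, scale=0.5, size=1000))  # positive "durations"

train_idx, valid_idx = holdout_split(len(X))
X_tr, y_tr = X[train_idx], y[train_idx]  # fit models on this part
X_va, y_va = X[valid_idx], y[valid_idx]  # score predictions on this part
```

With the pandas objects loaded above, the same positions can be applied with `X_train.iloc[train_idx]` and `y_train.iloc[train_idx]`.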

1 How to submit predictions

You will submit an .npy file containing a one-dimensional numpy array of your predictions of y_test, whose length equals the number of rows in X_test. You can save an array to a file like this:

import numpy as np

np.save(f"pred_{codename}.npy", pred)

The name of the file should be "pred_{codename}.npy", where {codename} is a secret codename of your group’s choosing that you can use later to identify your submission on the leaderboard. Submit the file, together with your code, to Gradescope.
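It may help to check the array before saving it, since a wrong shape or length is easy to miss. The following is a sketch under stated assumptions: `save_predictions`, its parameters, and the codename "demo" are hypothetical names chosen for illustration, and the demo writes to a temporary directory rather than to your working folder.

```python
import tempfile
from pathlib import Path

import numpy as np

def save_predictions(pred, n_test_rows, codename, out_dir="."):
    """Validate predictions and save them as pred_{codename}.npy (hypothetical helper)."""
    arr = np.asarray(pred, dtype=float).ravel()  # flatten, e.g. an (n, 1) column
    if arr.shape[0] != n_test_rows:
        raise ValueError(f"expected {n_test_rows} predictions, got {arr.shape[0]}")
    if not np.all(np.isfinite(arr)):
        raise ValueError("predictions contain NaN or infinite values")
    path = Path(out_dir) / f"pred_{codename}.npy"
    np.save(path, arr)
    return path

# Demo with made-up values: save, then load back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_predictions([[610.0], [1250.5], [300.0]], 3, "demo", out_dir=tmp_dir)
    saved_name = saved_path.name
    loaded = np.load(saved_path)
```

In practice you would pass `len(X_test)` as `n_test_rows` and your group's own codename.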

2 How your predictions will be scored

Your predictions will be compared against the real y_test using the root mean squared logarithmic error (RMSLE): \[ \text{RMSLE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2} \] A lower RMSLE indicates a better prediction.
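The formula above can be computed directly with numpy. The function below is a minimal sketch and not the official scoring code; the name `rmsle` and the input checks are assumptions. It takes the log in the formula to be the natural logarithm and uses `np.log1p`, which evaluates log(1 + x).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    if np.any(y_true <= -1) or np.any(y_pred <= -1):
        raise ValueError("log(1 + x) is undefined for x <= -1")
    diff = np.log1p(y_true) - np.log1p(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Small made-up example: one exact prediction, one that is off.
example_score = rmsle([0.0, np.e - 1.0], [0.0, 0.0])
```

Note that RMSLE is the ordinary root mean squared error between log(1 + y) and log(1 + ŷ), so it penalizes relative errors rather than absolute ones.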

A leaderboard will be published with every group’s codename alongside its loss on the test set, ranked from low to high. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the lowest test loss is the winner.

3 Allowed methods to use

You are permitted to use any and all methods.
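Since the theme of this lab is bagging and boosting, here is one way to try both. This is a sketch under assumptions, not a recommended solution: it uses scikit-learn (the handout does not name a library), synthetic data in place of the real files, and arbitrary hyperparameters. Each model is fit on log(1 + y), so that squared-error fitting lines up with the RMSLE metric, and predictions are mapped back with `np.expm1`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in data (assumption): one informative feature, one noise feature.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(600, 2))
y = 60.0 * np.exp(0.3 * X[:, 0] + rng.normal(scale=0.2, size=600))

X_tr, X_va = X[:500], X[500:]
y_tr, y_va = y[:500], y[500:]

models = {
    # Random forest: a bagged ensemble of decision trees.
    "bagging": RandomForestRegressor(n_estimators=50, random_state=0),
    # Gradient boosting: trees fit sequentially to the current residuals.
    "boosting": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, np.log1p(y_tr))                 # train on the log scale
    pred = np.expm1(model.predict(X_va))            # back to the original scale
    pred = np.clip(pred, 0.0, None)                 # durations cannot be negative
    log_diff = np.log1p(y_va) - np.log1p(pred)
    scores[name] = float(np.sqrt(np.mean(log_diff ** 2)))  # held-out RMSLE
```

Comparing the entries of `scores` on held-out data gives a rough basis for choosing between models before submitting.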

4 How you will be graded

A submission reflecting honest effort will receive a full grade. Your rank on the leaderboard will not factor into your grade in any way, except that the students in the two top-performing groups after the competition has ended (that is, those with the lowest test loss) will receive a bonus point to their final grade (out of 100).

5 Timeline

The competition will last two consecutive lab sessions, starting from the lab session on March 14th and ending at the end of the lab session on March 21st. You will submit predictions at two times:

  • At the end of the lab session on April 11th.
  • At the end of the lab session on April 18th.

A leaderboard will be published after each submission deadline, but bonus points to the final grade will be awarded only for the final leaderboard after March 21st.

You are not required to work on the competition outside lab sessions, but it is encouraged. This is for your benefit! The more time you spend working with data, the more you will get out of this course.

6 GOOD LUCK AND HAVE FUN!