Lab 8 Exercise - Classification Competition

Author

Erez Buchweitz

Your task is to predict whether a patient is diabetic (yes or no) based on various recorded metrics, which are given as features. You are given three files:

You may download and read them like this:

import pandas as pd

X_train = pd.read_parquet("X_train.parquet")
y_train = pd.read_parquet("y_train.parquet")

There are two data sets, (X_train, y_train) and (X_test, y_test), which are exchangeable. However, I have hidden y_test from you. Your task is to make the best predictions of y_test possible based on the training data and on X_test.

1 How to submit predictions

You will submit a .npy file containing a one-dimensional numpy array whose length equals the number of rows in X_test and whose entries are your binary predictions of y_test. Each prediction should be 1 if you think the patient is diabetic, or 0 otherwise. You can save an array to file like this:

import numpy as np

np.save(f"pred_{codename}.npy", pred)

The name of the file should be "pred_{codename}.npy", where {codename} is a secret codename of your group’s choosing, which you can use later to identify your submission on the leaderboard. The file, alongside your code, should be submitted to GradeScope.
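Before saving, it may be worth checking that the array has the required form. The following is a minimal sketch, not an official submission script: `check_predictions` is a hypothetical helper, and `codename`, `n_test` and the all-zeros `pred` are placeholder values that you would replace with your own.

```python
import os
import tempfile

import numpy as np


def check_predictions(pred, n_rows):
    """Return pred as a 1-D integer array after basic sanity checks (hypothetical helper)."""
    pred = np.asarray(pred)
    if pred.ndim != 1:
        raise ValueError(f"expected a one-dimensional array, got shape {pred.shape}")
    if len(pred) != n_rows:
        raise ValueError(f"expected {n_rows} predictions, got {len(pred)}")
    if not np.isin(pred, [0, 1]).all():
        raise ValueError("predictions must be 0 or 1")
    return pred.astype(int)


codename = "my_codename"   # placeholder: use your group's secret codename
n_test = 5                 # placeholder: in practice, len(X_test)
pred = np.zeros(n_test)    # placeholder predictions

pred = check_predictions(pred, n_test)

# Saved to a temporary directory here; in practice, save to your working directory.
path = os.path.join(tempfile.mkdtemp(), f"pred_{codename}.npy")
np.save(path, pred)

# Reload the file to confirm that it round-trips as intended.
reloaded = np.load(path)
```

Reloading the saved file is a cheap way to catch a wrong shape or data type before uploading.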

2 How your predictions will be scored

Your predictions will be compared against the real y_test using the \(F_1\) score:

\[ F_1(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}, \qquad \text{recall} = \frac{|\{n: \boldsymbol{y}_n=\hat{\boldsymbol{y}}_n=1\}|}{|\{n: \boldsymbol{y}_n=1\}|}, \qquad \text{precision} = \frac{|\{n: \boldsymbol{y}_n=\hat{\boldsymbol{y}}_n=1\}|}{|\{n: \hat{\boldsymbol{y}}_n=1\}|} \]

Recall equals 1 when there are no false negatives, whereas precision equals 1 when there are no false positives. \(F_1\) is the harmonic mean of precision and recall, which weights the two equally; a higher \(F_1\) score is better.
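To make the definition concrete, here is a minimal sketch that computes precision, recall and \(F_1\) directly from the counts in the formula. The function name `f1_score_binary`, the toy labels, and the convention of returning 0 when a denominator is zero are assumptions for illustration; the official scoring code may handle edge cases differently.

```python
def f1_score_binary(y_true, y_pred):
    """Compute precision, recall and F1 for 0/1 labels from raw counts."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    actual_pos = sum(1 for y, _ in pairs if y == 1)
    predicted_pos = sum(1 for _, p in pairs if p == 1)

    # Assumed convention: an undefined ratio (zero denominator) is treated as 0.
    recall = true_pos / actual_pos if actual_pos else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0

    # Harmonic mean of precision and recall.
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)
    return precision, recall, f1


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
```

In this toy case there are two true positives, three actual positives and three predicted positives, so precision, recall and \(F_1\) all equal 2/3.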

A leaderboard will be published with every group’s codename alongside its \(F_1\) score on the test set, ranked from high to low. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the highest test \(F_1\) score is the winner.

3 Allowed methods to use

You are permitted to use only methods that you have learned in the lectures or labs, or that are adjacent to them, except tree-based methods (e.g. GBM), as they will feature in a later competition. Permitted methods include, for example, linear regression, logistic regression, linear/quadratic discriminant analysis, and linear/kernel SVM. Any use of significantly more advanced learning algorithms might result in disqualification and a grade of zero.
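As one possible starting point among the permitted methods, the sketch below fits a logistic regression and estimates its \(F_1\) score on a held-out part of the training data. It assumes scikit-learn is available, and it uses synthetic stand-in data from `make_classification` rather than the competition files; the split size, the threshold grid and all variable names are illustrative choices, not a recommended recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); replace with X_train and y_train.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=0
)

# Hold out part of the training data to estimate test performance.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_fit, y_fit)

# F1 depends on the decision threshold, so compare a few candidates
# instead of relying only on the default of 0.5.
prob_val = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
scores = [f1_score(y_val, (prob_val >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
best_f1 = max(scores)
```

Because the threshold is chosen on the same held-out data used to report `best_f1`, that number is likely to be somewhat optimistic. Note also that `pd.read_parquet` returns a DataFrame, so `y_train` would probably need to be reduced to a one-dimensional array before fitting.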

4 How you will be graded

A submission reflecting honest effort will receive a full grade. Your rank on the leaderboard will not factor into your grade in any way, except that the students in the two top-performing groups after the competition has ended (that is, those with the highest test \(F_1\) scores) will receive a bonus point toward their final grade (out of 100).

5 Timeline

The competition will last two consecutive lab sessions, starting from the lab session on March 14th and ending at the end of the lab session on March 21st. You will submit predictions at two times:

  • At the end of the lab session on March 14th.
  • At the end of the lab session on March 21st.

A leaderboard will be published after each submission deadline, but bonus points to the final grade will be awarded only for the final leaderboard after March 21st.

You are not required to work on the competition outside lab sessions; however, it is encouraged. This is for your benefit! The more time you spend working with data, the more you will get out of this course.

6 GOOD LUCK AND HAVE FUN!