Lab 8 Exercise - Classification Competition

Author

Erez Buchweitz/Josh Davis/Lucas Schwengber

Your task is to predict whether a patient is diabetic (yes or no) based on various recorded metrics which are given as features. You are given three files: X_train.parquet, y_train.parquet, and X_test.parquet.

You may download and read them like this:

import pandas as pd

X_train = pd.read_parquet("X_train.parquet")
y_train = pd.read_parquet("y_train.parquet")
X_test = pd.read_parquet("X_test.parquet")

There are two data sets, (X_train, y_train) and (X_test, y_test), which are exchangeable. However, I have hidden y_test from you. Your task is to make the best predictions of y_test possible based on the training data and on X_test.

1 How to submit predictions

You will submit an .npy file containing a one-dimensional numpy array, whose length is the same as the number of rows in X_test, that will comprise your binary predictions of y_test. Each prediction should be 1 if you think the patient is diabetic, and 0 otherwise. You can save an array to file like this:

import numpy as np

np.save(f"pred_{codename}.npy", pred)

The name of the file should be "pred_{codename}.npy" where {codename} is a secret codename of your group’s choosing, that you will be able to use later to identify your submission in the leaderboard. The file, alongside your code, should be submitted to GradeScope.
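Before saving, it is worth checking that your array has the right shape and contains only binary values, since a malformed file cannot be scored. A minimal sketch (the codename `team_rocket`, the test-set size of 500, and the all-zeros placeholder array are hypothetical stand-ins for your own values):

```python
import numpy as np

# Hypothetical values -- replace with your group's codename and
# your model's actual predictions on X_test.
codename = "team_rocket"
n_test = 500
pred = np.zeros(n_test, dtype=int)  # placeholder prediction array

# Sanity checks: one-dimensional, correct length, binary values only.
assert pred.ndim == 1
assert len(pred) == n_test
assert set(np.unique(pred)) <= {0, 1}

np.save(f"pred_{codename}.npy", pred)
```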

2 How your predictions will be scored

Your predictions will be compared against the real y_test using the \(F_1\) score:

\[ F_1(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}, \; \; \; \; \text{recall} = \frac{|\{n: \boldsymbol{y}_n=\hat{\boldsymbol{y}}_n=1\}|}{|\{n: \boldsymbol{y}_n=1\}|}, \; \; \; \; \text{precision} = \frac{|\{n: \boldsymbol{y}_n=\hat{\boldsymbol{y}}_n=1\}|}{|\{n: \hat{\boldsymbol{y}}_n=1\}|} \] Recall equals 1 when there are no false negatives, whereas precision equals 1 when there are no false positives. \(F_1\) is the harmonic average assigning equal weight to each, and a higher \(F_1\) score is better.
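The definition above translates directly into code. The following sketch computes \(F_1\) from the set counts in the formula; the example labels are made up for illustration:

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 score for binary labels, following the definition above."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # |{n : y_n = yhat_n = 1}|, the number of true positives
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    if true_pos == 0:
        return 0.0  # precision or recall is 0 (or undefined)
    precision = true_pos / np.sum(y_pred == 1)
    recall = true_pos / np.sum(y_true == 1)
    return 2 / (1 / precision + 1 / recall)

# Toy example: 2 true positives, 1 false positive, 1 false negative,
# so precision = recall = 2/3 and F1 = 2/3.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0])
print(f1_score_binary(y_true, y_pred))
```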

A leaderboard will be published with every group’s codename alongside its \(F_1\) score on the test set, ranked from high to low. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the highest test score is the winner.

2.1 Why \(F_1\) and not accuracy?

In a classification problem, the simplest metric is accuracy (the fraction of predictions that are correct). However, accuracy can be misleading when classes are imbalanced. For example, if only 10% of patients in the dataset are diabetic, a model that predicts “not diabetic” for everyone achieves 90% accuracy while being completely trivial (some might say useless).

The \(F_1\) score addresses this by focusing specifically on the positive class (diabetic = 1). It combines two measures of accuracy for the positive class:

  • Precision answers: of all the patients I flagged as diabetic, how many actually were? A model with low precision produces many false alarms.
  • Recall answers: of all the patients who are truly diabetic, how many did I catch? A model with low recall misses many true cases.

Neither alone is sufficient. A model that flags everyone as diabetic achieves perfect recall but typically has terrible precision. A model that flags only one patient (the one it is most certain about) may achieve perfect precision but terrible recall. \(F_1\) is the harmonic mean of the two, which means both must be high to achieve a high \(F_1\):

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

The harmonic mean penalizes imbalance between precision and recall more harshly than the arithmetic mean would. For instance, precision \(= 1.0\) and recall \(= 0.1\) gives \(F_1 = 0.18\), whereas the arithmetic mean would give \(0.55\).
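The numbers in the example above can be reproduced in a couple of lines:

```python
precision, recall = 1.0, 0.1

# Harmonic mean (F1) versus arithmetic mean of the same two values.
harmonic = 2 * precision * recall / (precision + recall)
arithmetic = (precision + recall) / 2

print(round(harmonic, 2))    # 0.18
print(round(arithmetic, 2))  # 0.55
```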

Implication for your predictions: Simply predicting all zeros (no one is diabetic) will score \(F_1 = 0\), since recall will be 0. You need to find a threshold or decision rule that captures enough true positives without generating too many false positives.
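One common way to pick such a decision rule is to fit a probabilistic classifier, hold out part of the training data, and scan candidate probability thresholds for the one with the best validation \(F_1\). The sketch below assumes scikit-learn is available and uses a synthetic dataset as a stand-in for the real X_train / y_train:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data -- replace with X_train / y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 1.2).astype(int)

# Hold out part of the training data to tune the threshold.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_fit, y_fit)
proba = model.predict_proba(X_val)[:, 1]

# Scan thresholds and keep the one with the best validation F1.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold = {best_t:.2f}, F1 = {max(scores):.3f}")
```

Note that the default threshold of 0.5 is rarely optimal for \(F_1\) when classes are imbalanced; a lower threshold often trades a little precision for a lot of recall.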

A point of caution: \(F_1\) implicitly assumes the negative class is large and uninteresting (the baseline prediction would be to assume every new observation is negative). Relatedly, two models with different true negative counts but the same precision and recall will have the same \(F_1\) score, which is a limitation of this metric. In some applications, you might want to consider other metrics that take the negative class into account as well.

3 Allowed methods to use

You are permitted to use only methods which you have learned in the lectures or labs. This includes, for example, linear regression, logistic regression, linear/quadratic discriminant analysis, linear/kernel SVM and tree-based algorithms. Any use of significantly more advanced learning algorithms might result in disqualification and zero grade.
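As a starting point within the allowed methods, a logistic regression with standardized features evaluated by cross-validated \(F_1\) makes a reasonable baseline. This sketch assumes scikit-learn is available; the synthetic dataset stands in for the real X_train / y_train loaded earlier:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data -- replace with X_train / y_train.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 4))
y_train = (X_train[:, 0] - X_train[:, 2] + rng.normal(size=300) > 0.5).astype(int)

# Standardize features, then fit logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validated F1 gives an honest estimate of test performance.
cv_f1 = cross_val_score(baseline, X_train, y_train, cv=5, scoring="f1")
print(f"cross-validated F1: {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")
```

Comparing any fancier model against a simple baseline like this tells you whether the added complexity is actually buying you anything.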

4 How you will be graded

A submission reflecting honest effort will receive full grade. Your rank on the leaderboard will not factor into your grade in any way, except that the students in the top two performing groups after the competition has ended (that is, with the highest test scores) will receive a bonus point to their final grade (out of 100). Do not forget to submit both your predictions and your code.

5 Timeline

The competition will last two consecutive lab sessions, starting from the lab session on March 30th and ending at the end of the lab session on April 6th. You will submit predictions at two times:

  • 1 hour after the end of the lab session on March 30th.
  • 1 hour after the end of the lab session on April 6th.

A leaderboard will be published after each submission deadline, but bonus points to the final grade will be awarded only for the final leaderboard after April 6th.

You are not required to work on the competition outside lab sessions, but it is encouraged. This is for your benefit! The more time you spend working with the data, the more you will get out of this course.

6 GOOD LUCK AND HAVE FUN!