Trading off false positives and negatives

Author

Erez Buchweitz

Types of classification errors

We assume a binary classification setting, in which we have a binary target \(y\in \{0,1\}\) and a binary prediction \(\hat{y}\in \{0,1\}\). There are two types of errors we can make, false positives and false negatives:

  • False positives: \(\hat{y}=1\) and \(y=0\).
  • False negatives: \(\hat{y}=0\) and \(y=1\).
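Both error counts are a single boolean reduction each. A minimal NumPy sketch, with made-up labels and predictions for illustration:

```python
import numpy as np

# Made-up labels and predictions, for illustration only.
y     = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 1, 0, 1, 0, 0, 0, 0])

false_positives = np.sum((y_hat == 1) & (y == 0))  # predicted 1, actually 0
false_negatives = np.sum((y_hat == 0) & (y == 1))  # predicted 0, actually 1
print(false_positives, false_negatives)  # 1 2
```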

False positives could be:

  • Decided patient ill but actually healthy
  • Berkeley PD evacuates entire neighborhood in fear of tsunami that never comes
  • Chatbot hallucinates answers
  • Hired employee who’s not a good fit for the company

False negatives could be:

  • Decided patient healthy but actually ill
  • Entire neighborhood never given early warning of impending tsunami
  • Chatbot says “I don’t know” too much
  • Missed out on hiring superstar candidate

In the first two examples we arguably prefer false positives to false negatives; the last two are not so clear cut. Because we sometimes care about one kind of error more than the other, practitioners often consider the pair of metrics sensitivity and specificity:

\[ \text{sensitivity} = \mathbb{P}(\hat y=1|y=1) \;\;\;\; ; \;\;\;\; \text{specificity} = \mathbb{P}(\hat y=0|y=0). \]

In words:

  • Sensitivity: proportion of \(y=1\) we labeled correctly. Sensitivity=\(1\) means there are no false negatives.
  • Specificity: proportion of \(y=0\) we labeled correctly. Specificity=\(1\) means there are no false positives.

Real data sets are often imbalanced, meaning that they contain many \(y=0\) and few \(y=1\), so practitioners often prefer to use the similar (but not identical) pair of metrics, recall and precision:

\[ \text{recall} = \text{sensitivity} \;\;\;\; ; \;\;\;\; \text{precision} = \mathbb{P}(y=1|\hat{y}=1). \]

In words, precision is the proportion of observations we labeled positive that are actually positive. Precision=\(1\) means that there are no false positives. We’ll stick with sensitivity and specificity.
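All four metrics are plug-in estimates of conditional probabilities. A minimal NumPy sketch (the function names here are ours, not a standard API):

```python
import numpy as np

def sensitivity(y, y_hat):
    # P(y_hat = 1 | y = 1); also called recall
    return np.mean(y_hat[y == 1] == 1)

def specificity(y, y_hat):
    # P(y_hat = 0 | y = 0)
    return np.mean(y_hat[y == 0] == 0)

def precision(y, y_hat):
    # P(y = 1 | y_hat = 1)
    return np.mean(y[y_hat == 1] == 1)
```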

Can we trade off sensitivity and specificity?

Assume you trained a classifier and got sensitivity \(0.5\) and specificity \(0.5\). If you care more about false negatives (sensitivity) than false positives (specificity), can you trade off one for the other?

We can achieve this by tuning the threshold. Later, we will show two other methods.

Tuning the threshold

Suppose our prediction is of the form \(\hat{y}= I(\hat f(\boldsymbol{x})>t)\), where \(t\) is some threshold. For example, we’ve seen that in logistic regression, \(\hat{y}= I(\boldsymbol{x}^T\hat{\boldsymbol{\beta}}> 0)\), so in this case \(\hat f(\boldsymbol{x})=\boldsymbol{x}^T\hat{\boldsymbol{\beta}}\) and \(t=0\).

What happens if we change the threshold \(t\)?

When the threshold \(t\) is increased:

  • Fewer observations are classified as \(\hat{y}=1\), and more are classified as \(\hat{y}=0\)
  • Sensitivity will decrease (or at least, not increase)
  • Specificity will increase (or at least, not decrease)

This way we can trade off sensitivity and specificity.
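A small simulation makes the trade-off concrete. The labels, scores, and thresholds below are synthetic, made up for illustration:

```python
import numpy as np

# Synthetic labels and scores: scores tend to be higher when y = 1.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = y + rng.normal(scale=1.0, size=1000)

for t in [-0.5, 0.5, 1.5]:
    y_hat = (scores > t).astype(int)  # y_hat = I(f_hat(x) > t)
    sens = np.mean(y_hat[y == 1] == 1)
    spec = np.mean(y_hat[y == 0] == 0)
    print(f"t = {t:+.1f}   sensitivity = {sens:.2f}   specificity = {spec:.2f}")
# As t increases, sensitivity falls and specificity rises.
```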

ROC curve

We plot [\(1\)-specificity] on the x-axis and sensitivity on the y-axis, for all possible thresholds \(t\). The resulting curve is called the ROC (receiver operating characteristic) curve.

Properties:

  • For \(t=\infty\), all observations are classified \(\hat{y}=0\), so sensitivity=\(0\) and specificity=\(1\). So the point \((0,0)\) is always on the ROC curve
  • The ROC curve is monotonically non-decreasing, because when the threshold \(t\) is decreased, sensitivity goes up and specificity goes down
  • For \(t=-\infty\), all observations are classified \(\hat{y}=1\), so sensitivity=\(1\) and specificity=\(0\). So the point \((1,1)\) is always on the ROC curve

We generally want the curve to be as “up” as possible, since that would give better sensitivity for any level of specificity.
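Here is a from-scratch sketch that computes and plots the curve on synthetic data (scikit-learn’s sklearn.metrics.roc_curve performs the same computation):

```python
import numpy as np
import matplotlib.pyplot as plt

def roc_points(y, scores):
    # One (1 - specificity, sensitivity) point per threshold, swept from +inf to -inf.
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1], [-np.inf]))
    fpr = np.array([np.mean(scores[y == 0] > t) for t in thresholds])  # 1 - specificity
    tpr = np.array([np.mean(scores[y == 1] > t) for t in thresholds])  # sensitivity
    return fpr, tpr

# Synthetic labels and scores, made up for illustration.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(scale=1.0, size=500)

fpr, tpr = roc_points(y, scores)
plt.plot(fpr, tpr)
plt.xlabel("1 - specificity")
plt.ylabel("sensitivity")
plt.show()
```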

How does a perfect classifier look?

Suppose there is a threshold \(t_0\) for which the data are separated, i.e. \(y=1\) whenever \(\hat f(\boldsymbol{x})>t_0\) and \(y=0\) whenever \(\hat f(\boldsymbol{x})<t_0\):

  • For \(t_0\), sensitivity and specificity are both \(1\), so the point \((0,1)\) lies on the ROC curve
  • Increasing the threshold from \(t_0\), sensitivity drops but specificity remains \(1\). The curve drops vertically to \((0,0)\)
  • Decreasing the threshold from \(t_0\), specificity drops but sensitivity remains \(1\). The curve moves horizontally to \((1,1)\)
  • So the ROC curve goes along the left and upper edges of the box \([0,1]^2\)

No ROC curve can be any more “up” than this.

How does a random classifier look?

For a classifier \(\hat f\) that is independent of \(y\), you will compute the ROC curve in the homework.

Which classifier is better?

Consider now two ROC curves depicting two classifiers, \(\hat f_1\) and \(\hat f_2\). If the ROC curve for \(\hat f_1\) lies entirely above the ROC curve for \(\hat f_2\):

  • For any level of specificity (x-axis), \(\hat f_1\) offers better sensitivity than \(\hat f_2\)
  • Clear-cut that \(\hat f_1\) is preferable

In economic terms, we may say that \(\hat f_1\) is Pareto-superior to \(\hat f_2\). A perfect classifier is Pareto-superior to all others.

Can a classifier be Pareto-inferior to random?

The answer is in the homework.

Intersecting ROC curves

If the ROC curve for \(\hat f_1\) starts higher than the ROC curve for \(\hat f_2\), but then becomes lower:

  • \(\hat f_1\) is better for high specificity (offers higher sensitivity)
  • \(\hat f_2\) is better for high sensitivity (offers higher specificity)
  • No clear-cut answer, which is preferable depends on the use case

Can you combine them to form a classifier that is Pareto-superior to both?

The answer is in the homework.

Area under ROC curve

In general, unless one is always better than the other (Pareto-superior), there is no clear-cut answer. We could ask which is better on average: this is the area under the ROC curve (AUC).

  • Benefit - no need to think, pick the classifier with larger AUC
  • Caveat - assigns equal weight to all levels of specificity (equivalently, sensitivity), even though some are likely irrelevant to the use case
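Given the \((1-\text{specificity},\text{sensitivity})\) points from a threshold sweep, the AUC is the integral of the curve, e.g. by the trapezoidal rule. A minimal sketch (sklearn.metrics.roc_auc_score computes it directly from \(y\) and the scores):

```python
import numpy as np

def auc(fpr, tpr):
    # Trapezoidal rule; assumes the points are sorted by fpr (ascending),
    # as produced by sweeping the threshold from +inf down to -inf.
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

# The perfect classifier's ROC goes (0,0) -> (0,1) -> (1,1): AUC = 1.
print(auc(np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])))  # 1.0
```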

Further methods of trading off false negatives and positives

Tuning the threshold trades off sensitivity and specificity after training. Here are two further methods that instead intervene during training to emphasize one kind of error over the other.

Asymmetric loss

Here are two losses that we encountered:

  • Classification error: \[ \mathscr{L}(y,\hat{y})=I(y\ne\hat{y}) = \begin{cases} 1 & \text{if } y=1 \text{ and } \hat{y}=0 \\ 1 & \text{if } y=0 \text{ and } \hat{y}=1 \\ 0 & \text{if } y=\hat{y} \end{cases} \]
  • Log loss: (here, \(\hat{p}\) estimates \(\mathbb{P}(y=1)\)) \[ \mathscr{L}(y,\hat{p}) = -y\log(\hat{p}) - (1-y)\log(1-\hat{p}) = \begin{cases} -\log(\hat{p}) & \text{if } y=1 \\ -\log(1-\hat{p}) & \text{if } y=0 \end{cases} \]

Here are their asymmetric variants, using a constant \(c>0\):

  • Classification error: \[ \mathscr{L}_c(y,\hat{y})= \begin{cases} c & \text{if } y=1 \text{ and } \hat{y}=0 \\ 1 & \text{if } y=0 \text{ and } \hat{y}=1 \\ 0 & \text{if } y=\hat{y} \end{cases} \]
  • Log loss: \[ \mathscr{L}_c(y,\hat{p}) = \begin{cases} -c\log(\hat{p}) & \text{if } y=1 \\ -\log(1-\hat{p}) & \text{if } y=0 \end{cases} \]

We then train a classifier by minimizing asymmetric loss: \[ \hat f_c = \arg\min_f \frac{1}{N}\sum_{n=1}^N \mathscr{L}_c(y_n, f(\boldsymbol{x}_n)). \] If \(c>1\), we can expect that \(\hat f_c\) will have higher sensitivity (i.e. fewer false negatives) and lower specificity (i.e. more false positives) than \(\hat f_1\) (given by minimizing \(\mathscr{L}=\mathscr{L}_1\)), and vice versa.
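A minimal sketch of this minimization for a logistic model, using scipy.optimize.minimize; the data and the value \(c=5\) are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def asymmetric_log_loss(beta, X, y, c):
    # L_c from above: the y = 1 term is scaled by c, penalizing false negatives more.
    z = np.clip(X @ beta, -30, 30)  # guard against overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))
    eps = 1e-12                     # guard against log(0)
    return -np.mean(c * y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=500) > 0).astype(float)

beta_c = minimize(asymmetric_log_loss, x0=np.zeros(3), args=(X, y, 5.0)).x
# With c = 5 > 1, the classifier I(X @ beta_c > 0) should have higher sensitivity
# (and lower specificity) than the symmetric fit with c = 1.
```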

Observation weighting

We can instead use the original symmetric loss \(\mathscr{L}\), but apply a different weight to observations with \(y=1\):

\[ \hat f_c = \arg\min_f \frac{1}{N}\sum_{n=1}^N w_n\mathscr{L}(y_n, f(\boldsymbol{x}_n)) \;\;\;\; ; \;\;\;\; w_n=\begin{cases} c & \text{if } y_n=1 \\ 1 & \text{if } y_n=0. \end{cases} \]

If \(c>1\), more weight is put on observations with \(y=1\), making it more important to correctly classify them.
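In scikit-learn, this weighting is the sample_weight argument to fit. A sketch on the same kind of synthetic data (note that LogisticRegression regularizes by default; in recent versions, passing penalty=None gives the unregularized minimizer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=500) > 0).astype(int)

c = 5.0
w = np.where(y == 1, c, 1.0)  # w_n = c if y_n = 1, else 1

model = LogisticRegression(penalty=None)  # penalty=None requires scikit-learn >= 1.2
model.fit(X, y, sample_weight=w)          # minimizes the weighted log loss
```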

Claim: For the losses mentioned above, \(\hat f_c\) obtained using asymmetric loss is identical to \(\hat f_c\) obtained using observation weighting.

Proof: Easy to check that \[ \mathscr{L}_c(y_n, f(\boldsymbol{x}_n)) = w_n\mathscr{L}(y_n, f(\boldsymbol{x}_n)). \]