19  Classification

Abstract
This chapter introduces the second fundamental type of supervised machine learning, classification.

19.1 Types of Classification

Binary Classification

Binary classification: Predicting one of two classes. For example, rat burrow or no rat burrow, lead paint or no lead paint, or insured or uninsured. Classes are often recoded to 1 and 0 as in logistic regression.

Multiclass Classification

Multiclass classification: Predicting one of three or more classes. For example, single filer, joint filer, or head of household; or on-time, delinquent, or defaulted. Classes can be recoded to integers for models like multinomial logistic regression, but many of the best models can handle factors.
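
Either encoding can be created directly in R. Below is a minimal sketch using a hypothetical tibble named outcomes; the column names are made up for illustration. It shows both the 1/0 recoding used by logistic regression and the factor recoding that most tidymodels classifiers expect.

library(tidyverse)

# hypothetical outcomes for a binary and a multiclass problem
outcomes <- tibble(
  insured = c("yes", "no", "yes", "no"),
  filing_status = c("single", "joint", "head_of_household", "joint")
)

outcomes <- outcomes %>%
  mutate(
    # 1/0 recoding, as in logistic regression
    insured_numeric = if_else(insured == "yes", 1, 0),
    # factor recoding, which most classification models in R expect
    insured = factor(insured),
    filing_status = factor(filing_status)
  )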

19.2 Metrics

Classification problems require a different set of error metrics and diagnostics than regression problems. Assume a binary classifier for the following definitions. Let an event be the outcome 1 in a binary classification problem and a non-event be the outcome 0.

True positive

True positive: Correctly predicting an event. Predicting \(\hat{y_i} = 1\) when \(y_i = 1\)

True negative

True negative: Correctly predicting a non-event. Predicting \(\hat{y_i} = 0\) when \(y_i = 0\)

False positive

False positive: Incorrectly predicting an event for a non-event. Predicting \(\hat{y_i} = 1\) when \(y_i = 0\)

False negative

False negative: Incorrectly predicting a non-event for an event. Predicting \(\hat{y_i} = 0\) when \(y_i = 1\)

Confusion Matrix

Confusion matrix: A simple matrix that compares predicted outcomes with actual outcomes.

As these notes will show, there are many ways to combine true positives, false positives, true negatives, and false negatives into a single metric. It is important to understand the use case in which a predictive algorithm will be deployed; that use case should inform the specific metric selected.

Example Confusion Matrix

                                        True Value
                                 \(y=1\)                  \(y=0\)
Predicted Value  \(\hat{y}=1\)   True Positive (TP)       False Positive (FP)
                 \(\hat{y}=0\)   False Negative (FN)      True Negative (TN)
Accuracy

Accuracy: The sum of the values on the main diagonal of the confusion matrix divided by the total number of predictions (\(\frac{TP + TN}{total}\)).

19.2.1 Example 1

Consider the following set of true values and predicted values from a binary classification problem:

true_value predicted_value
0 0
0 0
0 0
1 0
1 1
1 1
0 1
0 1
1 1
1 1

The confusion matrix for this data:

                                  True Value
                                 \(y=1\)   \(y=0\)
Predicted Value  \(\hat{y}=1\)      4         2
                 \(\hat{y}=0\)      1         3
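
This confusion matrix can be reproduced with yardstick. The sketch below is one way to do it; note that conf_mat() expects factors, and the levels are ordered so that "1" (the event) comes first.

library(tidyverse)
library(tidymodels)

# the ten observations from Example 1
example1 <- tibble(
  true_value      = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1),
  predicted_value = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
) %>%
  # yardstick expects factors; list "1" first so the event is the first level
  mutate(
    true_value = factor(true_value, levels = c("1", "0")),
    predicted_value = factor(predicted_value, levels = c("1", "0"))
  )

conf_mat(data = example1, truth = true_value, estimate = predicted_value)

accuracy(data = example1, truth = true_value, estimate = predicted_value)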

19.2.2 Example 2

A test for breast cancer is 99.1% accurate across 1,000 tests. Is it a good test?

                                  True Value
                                 \(y=1\)   \(y=0\)
Predicted Value  \(\hat{y}=1\)      1         4
                 \(\hat{y}=0\)      5       990

This test only accurately predicted one cancer case. In fact, a person was more likely not to have cancer given a positive test than to have cancer. This example demonstrates the base rate fallacy and the accuracy paradox. Both result from high class imbalance. Clearly, we need a more sophisticated way of evaluating classifiers than accuracy alone.
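
The arithmetic behind this example can be checked directly. The short sketch below recreates the four cell counts and shows that a trivial rule that never predicts cancer would have even higher accuracy, and that only 20 percent of positive tests are true cancer cases.

# counts from the breast cancer example
tp <- 1; fp <- 4; fn <- 5; tn <- 990

# accuracy looks excellent
(tp + tn) / (tp + fp + fn + tn)   # 0.991

# but a rule that never predicts cancer is correct on every non-event
(fp + tn) / (tp + fp + fn + tn)   # 0.994

# and most positive tests are wrong: P(cancer | positive test)
tp / (tp + fp)                    # 0.2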

19.3 More metrics

19.3.1 Most Important Metrics

Accuracy

Accuracy: How often the classifier is correct. \(\frac{TP +TN}{total}\). All else equal, we want to maximize accuracy.

Precision

Precision: How often the classifier is correct when it predicts events. \(\frac{TP}{TP+FP}\). All else equal, we want to maximize precision.

Recall

Recall/Sensitivity/True Positive Rate: How often the classifier is correct when there is an event. \(\frac{TP}{TP+FN}\). All else equal, we want to maximize recall/sensitivity.

19.3.2 Other Metrics

Specificity

Specificity: How often the classifier is correct when there is a non-event. \(\frac{TN}{TN+FP}\). All else equal, we want to maximize specificity.

False Positive Rate

False Positive Rate: How often the classifier incorrectly predicts an event when there is a non-event. \(\frac{FP}{FP + TN} = 1 - Specificity\). All else equal, we want to minimize the false positive rate.

Note that False Positive Rate is sometimes referred to as FPR.
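
The sketch below collects these definitions into a small helper function (hypothetical, not part of yardstick) and applies it to the Example 1 counts.

library(tidyverse)

# compute the metrics above from the four cells of a binary confusion matrix
classification_metrics <- function(tp, fp, fn, tn) {
  tibble(
    accuracy    = (tp + tn) / (tp + fp + fn + tn),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),
    specificity = tn / (tn + fp),
    fpr         = fp / (fp + tn)
  )
}

# Example 1 counts
classification_metrics(tp = 4, fp = 2, fn = 1, tn = 3)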

19.3.3 Example 2 continued

                                  True Value
                                 \(y=1\)   \(y=0\)
Predicted Value  \(\hat{y}=1\)      1         4
                 \(\hat{y}=0\)      5       990



Precision: \(\frac{TP}{TP + FP} = \frac{1}{1 + 4} = \frac{1}{5}\)

Recall/Sensitivity: \(\frac{TP}{TP + FN} = \frac{1}{1 + 5} = \frac{1}{6}\)

The breast cancer test has poor precision and recall.

Specificity: \(\frac{TN}{FP + TN} = \frac{990}{4 + 990} = \frac{990}{994}\)

False Positive Rate (FPR): \(1 - Specificity = \frac{4}{994}\)

The breast cancer test has high specificity and a low false positive rate, but only because non-events vastly outnumber events; these metrics say little about how well the test detects cancer.

19.3.4 Using Thresholds to Predict Classes

Most classification algorithms generate both predicted classes and predicted probabilities of those classes. A predicted probability of 0.99 for an event is very different from a predicted probability of 0.51.

To generate class predictions, a threshold is usually applied. For example, if \(\hat{P}(event) > 0.5\), then predict an event. It is common to adjust the threshold to values other than 0.5. As the threshold decreases, marginal cases shift from \(\hat{y} = 0\) to \(\hat{y} = 1\) because less predicted probability is required to predict an event.

As the threshold decreases (we are predicting more events and fewer non-events):

  • the true positive rate increases and the false positive rate increases
  • precision decreases and sensitivity/recall increases
  • sensitivity increases and specificity decreases

In general, the goal is to create a model that has a high true positive rate and a low false positive rate, high precision and high recall, and high sensitivity and high specificity. Changing the threshold is a movement along these tradeoffs. Estimating “better” models is often a way to improve these tradeoffs. Of course, there is some amount of irreducible error.
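
A minimal sketch of this tradeoff, using made-up predicted probabilities (the tibble pred_probs and its columns are hypothetical): lowering the threshold from 0.5 to 0.4 raises sensitivity and lowers specificity.

library(tidyverse)
library(tidymodels)

# hypothetical predicted probabilities of an event from some classifier
pred_probs <- tibble(
  truth   = factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c("1", "0")),
  .pred_1 = c(0.92, 0.61, 0.45, 0.48, 0.30, 0.05, 0.70, 0.55)
)

# turn probabilities into class predictions at two different thresholds
pred_probs <- pred_probs %>%
  mutate(
    class_50 = factor(if_else(.pred_1 > 0.5, 1, 0), levels = c("1", "0")),
    class_40 = factor(if_else(.pred_1 > 0.4, 1, 0), levels = c("1", "0"))
  )

# lowering the threshold increases sensitivity and decreases specificity
sensitivity(pred_probs, truth = truth, estimate = class_50)
sensitivity(pred_probs, truth = truth, estimate = class_40)
specificity(pred_probs, truth = truth, estimate = class_50)
specificity(pred_probs, truth = truth, estimate = class_40)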

Receiver Operating Characteristic (ROC) Curve

Receiver Operating Characteristic (ROC) curve: A curve that shows the trade-off between the false positive rate and true positive rate as different threshold probabilities are used for classification.

Area Under the Curve

Area Under the Curve (AUC): A one-number summary of an ROC curve where 0.5 is random guessing and the rules of thumb are 0.7 is acceptable, 0.8 is good, and 0.9 is excellent.

AUC is also sometimes referred to as AUROC (Area Under the Receiver Operating Characteristic curve).

# A tibble: 3 × 4
  Quality    .metric .estimator .estimate
  <chr>      <chr>   <chr>          <dbl>
1 Acceptable roc_auc binary         0.733
2 Good       roc_auc binary         0.841
3 Excellent  roc_auc binary         0.925
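
The output above reports AUC values for three hypothetical classifiers, one in each rule-of-thumb range. The sketch below shows how an ROC curve and AUC can be computed with yardstick, assuming a tibble of made-up predicted probabilities in which the event is the first factor level.

library(tidyverse)
library(tidymodels)

# hypothetical out-of-sample predictions with the probability of the event
model_preds <- tibble(
  truth   = factor(c(1, 1, 1, 1, 0, 0, 0, 0), levels = c("1", "0")),
  .pred_1 = c(0.95, 0.80, 0.60, 0.35, 0.55, 0.40, 0.20, 0.10)
)

# the ROC curve traces the true positive rate against the false positive
# rate across all possible thresholds
roc_curve(data = model_preds, truth = truth, .pred_1) %>%
  autoplot()

# AUC summarizes the curve in one number
roc_auc(data = model_preds, truth = truth, .pred_1)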

19.3.5 Relative costs

True positives, true negatives, false positives, and false negatives can carry different costs, and it is important to consider the relative costs when creating models and interventions.

A false positive for a cancer test could result in more diagnostic tests. A false negative for a cancer test could lead to untreated cancer and severe health consequences. The relative differences in these outcomes should be considered.

A false positive for a rat burrow is a wasted trip for an exterminator. A false negative for a rat burrow is an untreated rat burrow. The difference in these outcomes is small, especially compared to the alternative of the exterminator guessing which alleys to visit.
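
One way to make these tradeoffs concrete is to attach a cost to each type of error. The sketch below uses entirely hypothetical costs and counts to show that a threshold that creates many more false positives can still be preferable when false negatives are far more costly.

# hypothetical per-outcome costs for the cancer test
cost_fp <- 500     # follow-up diagnostic testing
cost_fn <- 50000   # untreated cancer

# error counts at the original threshold (from Example 2)
fp <- 4
fn <- 5

# hypothetical error counts at a lower threshold: many more false
# positives, but fewer false negatives
fp_low <- 40
fn_low <- 2

fp * cost_fp + fn * cost_fn          # cost at the original threshold
fp_low * cost_fp + fn_low * cost_fn  # cost at the lower threshold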

19.3.6 Multiclass metrics

Consider a multiclass classification problem with three unique levels (“a”, “b”, “c”).

true_value predicted_value
a a
a a
a a
a a
b b
b a
b b
b c
c c
c b
c a
c c


Create a confusion matrix:

                                  True Value
                                 \(y=a\)   \(y=b\)   \(y=c\)
Predicted Value  \(\hat{y}=a\)      4         1         1
                 \(\hat{y}=b\)      0         2         1
                 \(\hat{y}=c\)      0         1         2


Accuracy still measures how often the classifier is correct. In a multiclass classification problem, the correct predictions are on the main diagonal.

Accuracy: \(\frac{4 + 2 + 2}{12} = \frac{8}{12} = \frac{2}{3}\)

There are multiclass extensions of precision, recall/sensitivity, and specificity. They are beyond the scope of this class.
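
yardstick computes the confusion matrix and accuracy for multiclass problems without any extra work. A sketch using the data above:

library(tidyverse)
library(tidymodels)

# the twelve observations from the multiclass example
multiclass <- tibble(
  true_value      = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"),
  predicted_value = c("a", "a", "a", "a", "b", "a", "b", "c", "c", "b", "a", "c")
) %>%
  mutate(across(everything(), factor))

conf_mat(data = multiclass, truth = true_value, estimate = predicted_value)

# accuracy works the same way as in the binary case
accuracy(data = multiclass, truth = true_value, estimate = predicted_value)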

19.4 R Code

19.4.1 Example 1

Example 1 uses data about penguins from the Palmer Archipelago in Antarctica. The data include measurements about three different species of penguins. This example only considers two classes and does not use resampling methods because only one model is estimated.

library(tidyverse)
library(tidymodels)
library(palmerpenguins)

# drop to two species
penguins_small <- penguins %>%
  filter(species %in% c("Adelie", "Gentoo")) %>%
  mutate(species = factor(species))

# look at missing data
map_dbl(.x = penguins_small, .f = ~ sum(is.na(.x)))
          species            island    bill_length_mm     bill_depth_mm 
                0                 0                 2                 2 
flipper_length_mm       body_mass_g               sex              year 
                2                 2                11                 0 
# drop missing values
penguins_small <- penguins_small %>%
  drop_na()

Step 1. Split the data into training and testing data

set.seed(20201013)

# create a split object
penguins_small_split <- initial_split(data = penguins_small, prop = 0.8)

# create the training and testing data
penguins_small_train <- training(x = penguins_small_split) 
penguins_small_test <- testing(x = penguins_small_split)

rm(penguins_small)

Step 2. EDA

penguins_small_train %>%
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

Step 3. Estimate a Model

# create a recipe
cart_rec <- 
  recipe(formula = species ~ ., data = penguins_small_train)

# create a cart model object
cart_mod <- 
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "classification")

cart_wf <- workflow() %>%
  add_recipe(cart_rec) %>%
  add_model(cart_mod)
  
# fit the model
cart_fit <- cart_wf %>%
  fit(data = penguins_small_train)

# create a tree
rpart.plot::rpart.plot(x = cart_fit$fit$fit$fit)
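
Digging through cart_fit$fit$fit$fit works, but if a recent version of tidymodels is installed, extract_fit_engine() should return the same underlying rpart object more transparently:

# an equivalent way to pull out the underlying rpart object from the
# fitted workflow (assumes a recent version of tidymodels/workflows)
rpart.plot::rpart.plot(x = extract_fit_engine(cart_fit))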

Step 4. Evaluate a Model

# predict the predicted class and the predicted probability of each class
predictions <- bind_cols(
  penguins_small_test,
  predict(object = cart_fit, new_data = penguins_small_test),
  predict(object = cart_fit, new_data = penguins_small_test, type = "prob")
)

select(predictions, species, starts_with(".pred"))
# A tibble: 53 × 4
   species .pred_class .pred_Adelie .pred_Gentoo
   <fct>   <fct>              <dbl>        <dbl>
 1 Adelie  Adelie                 1            0
 2 Adelie  Adelie                 1            0
 3 Adelie  Adelie                 1            0
 4 Adelie  Adelie                 1            0
 5 Adelie  Adelie                 1            0
 6 Adelie  Adelie                 1            0
 7 Adelie  Adelie                 1            0
 8 Adelie  Adelie                 1            0
 9 Adelie  Adelie                 1            0
10 Adelie  Adelie                 1            0
# ℹ 43 more rows
Create a confusion matrix:
conf_mat(data = predictions,
         truth = species,
         estimate = .pred_class)
          Truth
Prediction Adelie Gentoo
    Adelie     27      1
    Gentoo      0     25
Exercise 1

“Adelie” is the “event”.

Calculate the accuracy.

\[Accuracy = \frac{TP + TN}{total} = \frac{27 + 25}{53} = \frac{52}{53} \approx 0.981\]

accuracy(data = predictions,
         truth = species,
         estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.981
Exercise 2

Calculate the precision.

\[Precision = \frac{TP}{TP + FP} = \frac{27}{27 + 1} = \frac{27}{28} \approx 0.964\]

precision(data = predictions,
          truth = species,
          estimate = .pred_class)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.964
Exercise 3

Calculate the sensitivity.

\[Sensitivity = \frac{27}{27 + 0} = \frac{27}{27} = 1\]

recall(data = predictions,
       truth = species,
       estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary             1
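
Note that yardstick treats the first factor level as the event by default, which is why “Adelie” (alphabetically first) is the event in these exercises. If “Gentoo” were the event of interest instead, the event_level argument should switch it, as in the sketch below.

# treat the second factor level ("Gentoo") as the event instead
recall(data = predictions,
       truth = species,
       estimate = .pred_class,
       event_level = "second")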

Step 5. Make a New Prediction


new_penguins <- tribble(
  ~island,     ~bill_length_mm, ~bill_depth_mm, ~flipper_length_mm, ~body_mass_g, ~sex,   ~year,
  "Torgersen", 40,              19,             190,                4000,         "male", 2008
)

predict(object = cart_fit, new_data = new_penguins)
# A tibble: 1 × 1
  .pred_class
  <fct>      
1 Adelie     
predict(object = cart_fit, new_data = new_penguins, type = "prob")
# A tibble: 1 × 2
  .pred_Adelie .pred_Gentoo
         <dbl>        <dbl>
1            1            0

Note that the decision tree has one split, on flipper length. Below the threshold, all penguins are Adelie. Consequently, the probability associated with .pred_Adelie is 100% or 1.0. You can see this with the code below:

penguins_small_train |> 
  filter(flipper_length_mm < 206) |>
  group_by(species) |> 
  count()
# A tibble: 1 × 2
# Groups:   species [1]
  species     n
  <fct>   <int>
1 Adelie    117