22  Interpretable Models

22.1 Planning Database

  • Motivation: The decennial census is used to allocate hundreds of billions of dollars and to redistrict political power. Ensuring an accurate count is imperative. The Census Bureau uses the planning database (PDB) and a modeled low-response score (LRS) to plan outreach to improve response rates.
  • Implementation data: Demographic and housing information from the American Community Survey at the census tract and census block group levels.
  • Modeling data: Demographic and housing information from the American Community Survey and the low-response score from the 2010 Decennial Census at the census tract and census block group levels.
  • Objective: Predict which areas will have the lowest response rates to the 2020 Decennial Census.
  • Tools: Linear regression based on the top 25 predictors from gradient-boosted trees.
  • Results: Unclear but we’ll see more in these notes!

22.1.1 LRS Set Up

We will work with the tract-level PDB instead of the block-group level PDB to simplify computation. The PDB contains more than 500 variables. For now, we only consider the top 25 variables to further simplify computation.

pdb_small <- read_csv(here::here("data", "pdb_small.csv"))
Rows: 69650 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): State_name, County_name
dbl (24): Low_Response_Score, non_return_rate, Renter_Occp_HU_ACS_13_17, Pop...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(20231114)
pdb_split <- initial_split(pdb_small, prop = 0.8)

pdb_train <- training(pdb_split)

Lastly, we set up \(v\)-fold cross validation. We only use five folds because some of the models take a long time to fit to the data.

pdb_folds <- vfold_cv(data = pdb_train, v = 5)

22.2 Motivation

The models that won the planning database Kaggle competition all used ensembled trees. They had high predictive power but were difficult to interpret and difficult to use for inference.

In the end, the Census Bureau used the ensembled trees to inform a linear regression model that was easier to interpret. In other words, the Census Bureau sacrificed predictive power to gain interpretability.

This section will introduce extensions of linear regression models that aim to improve predictive power without quickly becoming difficult to interpret or “black box.”

Like in earlier sections, we will use the simulated example 1 data to demonstrate some model fits.

Code
library(tidymodels)

set.seed(20201004)

x <- runif(n = 1000, min = 0, max = 10)

data1 <- bind_cols(
  x = x,
  y = 10 * sin(x) + x + 20 + rnorm(n = length(x), mean = 0, sd = 2)
)

set.seed(20201007)

# create a split object
data1_split <- initial_split(data = data1, prop = 0.75)

# create the training and testing data
data1_train <- training(x = data1_split)
data1_test  <- testing(x = data1_split)

22.3 Polynomial Regression

Tip: Polynomial Regression

Polynomial regression includes higher-degree predictors in the design matrix. For example, a regression model will include \(x\) and \(x^2\) as predictors.

Polynomials extend linear regression to include transformations of predictors such that

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i \]

where \(d\) is referred to as the degree of the polynomial.

Polynomial regression is still linear regression because the model is linear in its coefficients, but it can fit non-linear patterns in the data.

Figure 22.1 shows four different linear regression models fit to the training data. The degree is the highest power of \(x\) included in the model.

Code
lin_reg_model1 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

lin_reg_model2 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ poly(x, degree = 2, raw = TRUE), data = data1_train)

lin_reg_model3 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ poly(x, degree = 3, raw = TRUE), data = data1_train)

lin_reg_model4 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ poly(x, degree = 4, raw = TRUE), data = data1_train)

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Degree = 1` = predict(object = lin_reg_model1, new_data = new_data)$.pred,
  `Degree = 2` = predict(object = lin_reg_model2, new_data = new_data)$.pred,
  `Degree = 3` = predict(object = lin_reg_model3, new_data = new_data)$.pred,
  `Degree = 4` = predict(object = lin_reg_model4, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred")

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Linear Regression)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()
Figure 22.1: Simulated data with polynomial regression model predictions

22.4 Interactions

Tip: Interactions

Interactions include the product of predictors in the design matrix for linear regression. For example, a regression model will include \(x_1\), \(x_2\), and \(x_1x_2\) as predictors.

Interactions extend polynomials to consider multiplying predictors by other predictors in addition to multiplying them by themselves. Including pairwise or triplewise interactions complicates interpretation but can improve the fit of the model to the data.

Tip: Hierarchical Principle

The hierarchical principle states that main effects should be included in regression models when interactions are included even if the main effects are not statistically significant. This includes polynomial regression.
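R's formula interface respects the hierarchical principle by default: `x1 * x2` expands to both main effects plus their interaction, while `x1:x2` alone would add only the product. A quick check with made-up data (illustrative, not part of the case study):

```r
# x1 * x2 expands to the main effects plus the interaction term
df <- data.frame(x1 = 1:4, x2 = c(2, 1, 4, 3))

colnames(model.matrix(~ x1 * x2, data = df))
# "(Intercept)" "x1" "x2" "x1:x2"
```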

Outside of applications where a small number of interactions is included because of theory, interactions can explode the number of parameters that require estimation.

“A key problem in the estimation of two-way interaction models is the explosion in the number of unknown parameters.” (Ibrahim et al. 2021)

Simple pairwise interactions for the small planning database example explode the number of predictors from 23 to 276.

recipe(formula =  ~ ., data = pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_interact(terms = ~ all_predictors():all_predictors()) |>
  prep(training = pdb_train) |>
  bake(new_data = NULL)
# A tibble: 55,720 × 279
   State_name     County_name           Low_Response_Score non_return_rate
   <chr>          <chr>                              <dbl>           <dbl>
 1 Arizona        Yavapai County                      15.2            16.5
 2 Georgia        Screven County                      23.5            23  
 3 Iowa           Polk County                         21.4            24.4
 4 Wyoming        Sweetwater County                   26              14.8
 5 Missouri       Jasper County                       22.8            17.6
 6 Virginia       Hanover County                      12.1            10.3
 7 Oregon         Clackamas County                    19.5            21.8
 8 North Carolina Chatham County                      20.1            14.7
 9 North Carolina Mecklenburg County                  17.6            15.3
10 California     San Bernardino County               28              24.9
# ℹ 55,710 more rows
# ℹ 275 more variables: Renter_Occp_HU_ACS_13_17 <dbl>,
#   Pop_18_24_ACS_13_17 <dbl>, Female_No_HB_ACS_13_17 <dbl>,
#   NH_White_alone_ACS_13_17 <dbl>, Pop_65plus_ACS_13_17 <dbl>,
#   Rel_Child_Under_6_ACS_13_17 <dbl>, Males_ACS_13_17 <dbl>,
#   MrdCple_Fmly_HHD_ACS_13_17 <dbl>, Pop_25_44_ACS_13_17 <dbl>,
#   Tot_Vacant_Units_ACS_13_17 <dbl>, College_ACS_13_17 <dbl>, …

Regularization methods like LASSO regression, combined with cross-validation, can help in the presence of many predictors.

22.5 Step Functions

Step functions are not often used directly in predictive modeling but they are an important building block for other methods.

Step functions break the range of a predictor into bins and then fit a different constant (an intercept-only model) in each bin. Figure 22.2 shows a piecewise constant regression fit to the simulated example 1 data. The example uses 9 separate regions (8 knots).

Code
# create a simple linear regression model
lin_reg_model <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") 

# specify a piecewise constant model
step_function_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 0, deg_free = 8)

# fit the model
step_function_mod <- workflow() |>
  add_model(spec = lin_reg_model) |>
  add_recipe(recipe = step_function_rec) |>
  fit(data = data1_train) 

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Piecewise Constant Regression` = predict(object = step_function_mod, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred")

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_line(
    data = predictions_grid,
    mapping = aes(x = x, y = .pred),
    color = "red"
  ) +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Piecewise Constant Regression)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()
Figure 22.2: Simulated data with a piecewise-constant regression model predictions

We can now see the connection between piecewise constant regression and regression trees. Regression trees pick the knots automatically, whereas we must specify the knots for piecewise constant regression.
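To make the connection concrete, here is a sketch (assuming the rpart engine is installed) that fits a shallow regression tree to the example 1 training data and inspects the split points it learned:

```r
library(tidymodels)

# regenerate the example 1 training data so the sketch is self-contained
set.seed(20201004)
x <- runif(n = 1000, min = 0, max = 10)
data1_train <- tibble(x = x, y = 10 * sin(x) + x + 20 + rnorm(n = 1000, mean = 0, sd = 2))

# fit a shallow regression tree
tree_mod <- decision_tree(tree_depth = 3) |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

# the split points are learned from the data;
# they play the same role as the knots we specified by hand
tree_mod$fit$splits
```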

22.6 Basis Functions

Tip: Basis Functions

Basis functions are fixed and known transformations applied to a variable \(x\) before modeling. We can denote the basis functions as \(b_1(), b_2(), ..., b_K()\). Basis functions result in “derived features.”

Consider a few simple basis functions:

  • Linear: \(b_j(x_i) = x_i\)
  • Polynomial: \(b_j(x_i) = x_i^j\)
  • Piecewise constant: \(b_j(x_i) = I(c_j \leq x_i < c_{j + 1})\)

Basis functions are advantageous because if we fit a model like

\[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i \]

then we can use least squares estimation and we have access to all of the tools available to linear regression like standard errors and statistical tests. Next, we will explore a few more basis functions.
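Before moving on, here is a minimal sketch (illustrative data, not part of the case study) that builds derived features by hand and fits them with ordinary least squares:

```r
set.seed(1)
x <- runif(n = 100, min = 0, max = 10)
y <- 10 * sin(x) + x + 20 + rnorm(n = 100, mean = 0, sd = 2)

# hand-built basis functions: linear, polynomial, piecewise constant
derived <- data.frame(
  y  = y,
  b1 = x,                  # linear
  b2 = x^2,                # polynomial
  b3 = as.numeric(x >= 5)  # piecewise constant indicator
)

# least squares on the derived features, with the usual standard
# errors and statistical tests available
fit <- lm(y ~ ., data = derived)
summary(fit)$coefficients
```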

22.7 Regression Splines

Tip: Regression Splines

Regression splines are piecewise polynomial basis functions with a constraint that the fitted curve must be continuous and have continuous first and second derivatives.

Regression splines have a degree, which determines the order of the polynomials (e.g. 1 is linear and 3 is cubic). Regression splines also have knots, which are the break points between regions.

Regression splines combine piecewise step functions and polynomial regression with certain constraints using basis functions. Regression splines are called nonparametric even though they result in estimated regression coefficients.

Knots are where the regions of a regression spline meet. The spline is continuous at the knots and has continuous first and second derivatives at the knots, which ensures the fitted line is smooth.

We won’t show the basis functions for simplicity. Just recall, basis functions transform the variable \(x\) before fitting the regression model.

Figure 22.3 shows several piecewise polynomial regression models fit to the simulated example 1 data. The first panel shows a piecewise linear model. Note the discontinuity. The second panel shows a piecewise square model. Again, note the discontinuity. The bottom two panels show linear and square models with added continuity constraints.

Code
# create a simple linear regression model
lin_reg_model <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") 

# create a workflow with a model but no recipe
base_workflow <- workflow() |>
  add_model(spec = lin_reg_model)

# create a series of custom recipes
piecewise_linear_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_mutate(ntile = factor(ntile(x, n = 4))) |>
  step_dummy(ntile) |>
  step_interact(terms = ~ x:starts_with("ntile"))

piecewise_square_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_mutate(ntile = factor(ntile(x, n = 4))) |>
  step_dummy(ntile) |>
  step_poly(x, options = list(raw = FALSE)) |>
  step_interact(terms = ~ starts_with("x"):starts_with("n"))

continuous_linear_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 1, deg_free = 5)

continuous_square_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 2, deg_free = 6)

continuous_cubic_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 3, deg_free = 7)

# add the custom recipes and fit the models
piecewise_linear_mod <- base_workflow |>
  add_recipe(recipe = piecewise_linear_rec) |>
  fit(data = data1_train) 

piecewise_square_mod <- base_workflow |>
  add_recipe(recipe = piecewise_square_rec) |>
  fit(data = data1_train) 

continuous_linear_mod <- base_workflow |>
  add_recipe(recipe = continuous_linear_rec) |>
  fit(data = data1_train)

continuous_square_mod <- base_workflow |>
  add_recipe(recipe = continuous_square_rec) |>
  fit(data = data1_train)

continuous_cubic_mod <- base_workflow |>
  add_recipe(recipe = continuous_cubic_rec) |>
  fit(data = data1_train)

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Piecewise, Linear` = predict(object = piecewise_linear_mod, new_data = new_data)$.pred,
  `Piecewise, Square` = predict(object = piecewise_square_mod, new_data = new_data)$.pred,
  `Continuous, Linear` = predict(object = continuous_linear_mod, new_data = new_data)$.pred,
  `Continuous, Square` = predict(object = continuous_square_mod, new_data = new_data)$.pred,
  `Continuous, Cubic` = predict(object = continuous_cubic_mod, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred") |>
  mutate(model = factor(model, levels = c("Piecewise, Linear", "Piecewise, Square", "Continuous, Linear", "Continuous, Square", "Continuous, Cubic")))

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_line(
    data = filter(predictions_grid, model != "Continuous, Cubic"),
    mapping = aes(x = x, y = .pred), 
    color = "red"
  ) +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Piecewise Polynomials)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()
Figure 22.3: Simulated data with predictions using different regression splines

Finally, Figure 22.4 shows a cubic spline model fit to the data. The model uses cubic polynomials and four knots. Recall that it took a 4th-degree polynomial to fit the pattern in the data.

Code
ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_line(
    data = filter(predictions_grid, model == "Continuous, Cubic"),
    mapping = aes(x = x, y = .pred), 
    color = "red"
  ) +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Cubic Regression splines)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()
Figure 22.4: Regression splines with cubic polynomials and four knots. Recall that it took a 4th-degree polynomial to fit the pattern in the data.

Higher degrees and more knots will fit the training data better but run the risk of resulting in an overfit model.

Most data scientists don’t use degrees greater than 3 because cubic splines are already smooth enough that the human eye can’t detect the discontinuities that remain at the knots.

This leaves picking the number and location of the knots. Knots are typically spaced evenly across the range of the predictor. The number of knots is limited by the number of observations in the data set. One popular approach is to propose many knots and then use regularization methods like LASSO regression with cross-validation to drop unnecessary knots.
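Spline recipes like step_spline_b() build on B-spline bases such as splines::bs(), which by default places interior knots at evenly spaced quantiles of the predictor when only the degrees of freedom are supplied. A quick way to inspect the placement (a sketch using the base splines package, not part of the case study):

```r
library(splines)

x <- seq(0, 10, length.out = 101)

# df = 5 with degree = 3 implies df - degree = 2 interior knots,
# placed at quantiles of x by default
B <- bs(x, df = 5, degree = 3)
attr(B, "knots")
```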

Tip: Basis Splines

Basis splines are regression splines where every region has the same polynomial.

Tip: Natural Splines

Natural splines are basis splines with the added constraint that the minimum region and maximum region (the area outside the most extreme knots) are constrained to be linear.

High-degree polynomials have high variance in sparse regions of the data and at the extremes of data. This means minor changes in the data can dramatically change the fitted line.1
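A small sketch (illustrative data, not the case study) comparing a B-spline fit and a natural spline fit when we extrapolate beyond the observed range:

```r
library(splines)

set.seed(2)
x <- runif(n = 200, min = 0, max = 10)
y <- 10 * sin(x) + x + 20 + rnorm(n = 200, mean = 0, sd = 2)

bs_fit <- lm(y ~ bs(x, df = 5))  # cubic tails
ns_fit <- lm(y ~ ns(x, df = 5))  # linear tails outside the boundary knots

# extrapolate outside [0, 10]; the natural spline changes linearly,
# so its extrapolations are more stable
new_x <- data.frame(x = c(-2, 12))
predict(bs_fit, newdata = new_x)
predict(ns_fit, newdata = new_x)
```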

22.8 Generalized Additive Models

Tip: Additivity Assumption

The additivity assumption of linear regression states that the contribution of each predictor to the outcome variable is independent of the other predictors. Thus, the combined effect of the predictors is just the sum of their individual effects.

Interactions violate the additivity assumption.

Until now, we only considered basis functions in simple linear regression applications with exactly one predictor. Generalized additive models (GAMs) allow us to use non-linear functions of multiple predictors while maintaining additivity.

Tip: Generalized Additive Models

Generalized additive models are of the form

\[ y_i = \beta_0 + \sum_{j = 1}^pf_j(x_{ij}) + \epsilon_i \]

In other words, the fitted line/conditional mean is the sum of the contributions of each of the predictors. The contributions can be modeled as non-linear functions (basis functions) of each predictor (polynomials, splines, etc.).
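As a sketch of how this looks in tidymodels (assuming parsnip's gen_additive_mod() with the mgcv engine is available), a GAM for the example 1 data can declare a smooth term with s() in the fitting formula:

```r
library(tidymodels)

# regenerate the example 1 training data so the sketch is self-contained
set.seed(20201004)
x <- runif(n = 1000, min = 0, max = 10)
data1_train <- tibble(x = x, y = 10 * sin(x) + x + 20 + rnorm(n = 1000, mean = 0, sd = 2))

# a GAM with one smooth term; s() comes from mgcv
gam_mod <- gen_additive_mod() |>
  set_engine(engine = "mgcv") |>
  set_mode(mode = "regression") |>
  fit(y ~ s(x), data = data1_train)

predict(gam_mod, new_data = tibble(x = seq(0, 10, 0.1)))
```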

22.9 Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines (MARS) are an alternative to GAMs for capturing non-linear patterns in data using linear regression and splines.

Tip: Hinge Function

\[ h(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases} \]

or, more generally, for a knot \(c\), the mirrored pair

\[ h(x) = \max(0, x - c) \quad \text{and} \quad h(x) = \max(0, c - x) \]
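Written directly in R, the hinge pair for a knot is just a clipped difference (a tiny illustrative helper, not code from the earth package):

```r
# hinge functions for a knot: h(x - c) and its mirror h(c - x)
hinge        <- function(x, knot) pmax(0, x - knot)
hinge_mirror <- function(x, knot) pmax(0, knot - x)

hinge(c(1, 3, 6), knot = 4)         # 0 0 2
hinge_mirror(c(1, 3, 6), knot = 4)  # 3 1 0
```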

Tip: Multivariate Adaptive Regression Splines (MARS)

MARS is an additive model of the form

\[ y_i = \sum_{j=1}^{p} c_j B_j(x_i) + \epsilon_i \]

where \(c_j\) is a linear regression coefficient and \(B_j(x_i)\) is

  1. 1
  2. A hinge function
  3. A product of hinge functions

MARS proceeds in three steps:

  1. Find the knot that creates a piecewise regression model of the following form with the lowest SSE.

\[ y = \beta_0 + \beta_1h(x) \]

  2. Repeat this process recursively until some stopping criterion is met. For example, the second step will have the form:

\[ y = \beta_0 + \beta_1h_1(x) + \beta_2h_2(x) \]

  3. Prune with cross-validation.

Figure 22.5 demonstrates MARS with 1, 2, 3, and 4 knots using the simulated example 1 data.

Code
# fit MARS with an increasing number of knots
mars1 <- mars(num_terms = 2) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

mars2 <- mars(num_terms = 3) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

mars3 <- mars(num_terms = 4) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

mars4 <- mars(num_terms = 5) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Knots = 1` = predict(object = mars1, new_data = new_data)$.pred,
  `Knots = 2` = predict(object = mars2, new_data = new_data)$.pred,
  `Knots = 3` = predict(object = mars3, new_data = new_data)$.pred,
  `Knots = 4` = predict(object = mars4, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred")

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (MARS)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()
Figure 22.5: Simulated data with MARS model predictions using 1, 2, 3, and 4 knots

We can print the table of hinges and coefficients for the four models like this:

mars1$fit$coefficients
                    y
(Intercept)  22.82586
h(x-4.50551)  2.25023
mars2$fit$coefficients
                     y
(Intercept)  28.152720
h(x-4.50551)  9.135052
h(x-2.3873)  -5.294739
mars3$fit$coefficients
                      y
(Intercept)   28.813593
h(x-4.50551)  17.161781
h(x-2.3873)   -8.075389
h(x-7.37819) -12.883468
mars4$fit$coefficients
                      y
(Intercept)   37.598528
h(x-4.50551)  19.679189
h(4.50551-x)  -2.964782
h(x-8.12048) -16.167899
h(x-2.3873)  -11.946754
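These tables are just linear models in hinge-transformed features. Using the printed mars1 coefficients, we can reproduce a prediction by hand (values rounded to what is shown above):

```r
# mars1: intercept plus one hinge term
h <- function(x, knot) pmax(0, x - knot)

y_hat <- 22.82586 + 2.25023 * h(5, knot = 4.50551)
y_hat  # about 23.94
```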

22.10 Case Study

We now return to the PDB example with 25 variables and five-fold cross-validation.

22.10.1 Linear Regression

We first consider multiple linear regression. We need a recipe, a model, and a workflow to fit the model five times and evaluate its predictive accuracy.

The data set contains a few variables that shouldn’t be included in the model. We use update_role() to turn them into “ids” to remove them from consideration.

lm1_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id")

lm1_mod <- linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

lm1_wf <- workflow() |>
  add_recipe(lm1_rec) |>
  add_model(lm1_mod)

lm1_resamples <- lm1_wf |>
  fit_resamples(resamples = pdb_folds)

lm1_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   5.33      5 0.0580  Preprocessor1_Model1
2 rsq     standard   0.485     5 0.00906 Preprocessor1_Model1

We can use last_fit() to fit the model to all of the training data and extract_fit_parsnip() to examine the estimated coefficients.

lm1_wf |>
  last_fit(pdb_split) |>
  extract_fit_parsnip()
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                (Intercept)     Renter_Occp_HU_ACS_13_17  
               25.852142128                  0.000527581  
        Pop_18_24_ACS_13_17       Female_No_HB_ACS_13_17  
                0.001415666                  0.000738697  
   NH_White_alone_ACS_13_17         Pop_65plus_ACS_13_17  
               -0.000940139                 -0.001967564  
Rel_Child_Under_6_ACS_13_17              Males_ACS_13_17  
                0.001897582                  0.000884922  
 MrdCple_Fmly_HHD_ACS_13_17          Pop_25_44_ACS_13_17  
               -0.002668457                  0.001216952  
 Tot_Vacant_Units_ACS_13_17            College_ACS_13_17  
                0.002089677                 -0.000344927  
      Med_HHD_Inc_ACS_13_17          Pop_45_64_ACS_13_17  
               -0.000097679                  0.001730143  
     HHD_Moved_in_ACS_13_17           Hispanic_ACS_13_17  
                0.003871744                 -0.000371730  
      Single_Unit_ACS_13_17    Diff_HU_1yr_Ago_ACS_13_17  
               -0.002632994                 -0.000848577  
         Pop_5_17_ACS_13_17       NH_Blk_alone_ACS_13_17  
                0.001696031                  0.000546687  
    Sngl_Prns_HHD_ACS_13_17        Not_HS_Grad_ACS_13_17  
               -0.004322504                 -0.000210310  
  Med_House_Value_ACS_13_17  
                0.000009283  

The following code creates a grid to visualize the marginal effect of Renter_Occp_HU_ACS_13_17 while holding other predictors at 0. Because these models are additive, the value of other predictors only shifts the line vertically and doesn’t change the shape of the relationship.

grid <- expand_grid(
  State_name = "dummy",
  County_name = "dummy",
  Low_Response_Score = 0,
  Renter_Occp_HU_ACS_13_17 = seq(0, 7400, 1),
  Pop_18_24_ACS_13_17 = 0,
  Female_No_HB_ACS_13_17 = 0,
  NH_White_alone_ACS_13_17 = 0,
  Pop_65plus_ACS_13_17 = 0,
  Rel_Child_Under_6_ACS_13_17 = 0,
  Males_ACS_13_17 = 0,
  MrdCple_Fmly_HHD_ACS_13_17 = 0,
  Pop_25_44_ACS_13_17 = 0,
  Tot_Vacant_Units_ACS_13_17 = 0,
  College_ACS_13_17 = 0,
  Med_HHD_Inc_ACS_13_17 = 0,
  Pop_45_64_ACS_13_17 = 0,
  HHD_Moved_in_ACS_13_17 = 0,
  Hispanic_ACS_13_17 = 0,
  Single_Unit_ACS_13_17 = 0,
  Diff_HU_1yr_Ago_ACS_13_17 = 0,
  Pop_5_17_ACS_13_17 = 0,
  NH_Blk_alone_ACS_13_17 = 0,
  Sngl_Prns_HHD_ACS_13_17 = 0,
  Not_HS_Grad_ACS_13_17 = 0,
  Med_House_Value_ACS_13_17 = 0
)

# linear regression
bind_cols(
  grid,
  lm1_wf |>
  last_fit(pdb_split) |>
  extract_workflow() |>
  predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for Linear Regression")

22.10.2 Linear Regression with Transformations

Erdman and Bates (2017) included square root and logit transformations in their final specification. We next update the recipe to include their functional form.

# create a complex recipe with square root, log, and logit transformations
lm2_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_sqrt(
    Pop_18_24_ACS_13_17,
    Female_No_HB_ACS_13_17,
    Pop_65plus_ACS_13_17,
    Rel_Child_Under_6_ACS_13_17,
    Pop_25_44_ACS_13_17,
    Pop_45_64_ACS_13_17,
    HHD_Moved_in_ACS_13_17,
    Diff_HU_1yr_Ago_ACS_13_17,
    Pop_5_17_ACS_13_17,
    Sngl_Prns_HHD_ACS_13_17,
    Not_HS_Grad_ACS_13_17
  ) |>
  step_log(
    Med_HHD_Inc_ACS_13_17,
    Med_House_Value_ACS_13_17,
    offset = 0.001
  ) |>
  step_log(
    Tot_Vacant_Units_ACS_13_17,
    offset = 0.001
  ) |>
  add_role(
    Renter_Occp_HU_ACS_13_17,
    NH_White_alone_ACS_13_17,
    College_ACS_13_17,
    Hispanic_ACS_13_17,
    Single_Unit_ACS_13_17,
    NH_Blk_alone_ACS_13_17,
    new_role = "logit"
  ) |>
  step_range(
    has_role("logit")
  ) |>
  step_logit(
    has_role("logit"),
    offset = 0.01
  )

# create a linear regression model
lm2_mod <- linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

# combine the recipe and model into a workflow
lm2_wf <- workflow() |>
  add_recipe(lm2_rec) |>
  add_model(lm2_mod)

# fit the model
lm2_resamples <- lm2_wf |>
  fit_resamples(resamples = pdb_folds)

Changing the functional form improves the model predictions.

lm2_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.98      5 0.0489  Preprocessor1_Model1
2 rsq     standard   0.548     5 0.00655 Preprocessor1_Model1

We can use last_fit() to fit the model to all of the training data and extract_fit_parsnip() to examine the estimated coefficients.

lm2_wf |>
  last_fit(pdb_split) |>
  extract_fit_parsnip()
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                (Intercept)     Renter_Occp_HU_ACS_13_17  
                 32.0377933                    1.5626842  
        Pop_18_24_ACS_13_17       Female_No_HB_ACS_13_17  
                  0.0585177                    0.1315009  
   NH_White_alone_ACS_13_17         Pop_65plus_ACS_13_17  
                 -0.6853445                   -0.2488444  
Rel_Child_Under_6_ACS_13_17              Males_ACS_13_17  
                 -0.0171017                    0.0003647  
 MrdCple_Fmly_HHD_ACS_13_17          Pop_25_44_ACS_13_17  
                  0.0000302                    0.0725572  
 Tot_Vacant_Units_ACS_13_17            College_ACS_13_17  
                  0.3607191                   -1.0962875  
      Med_HHD_Inc_ACS_13_17          Pop_45_64_ACS_13_17  
                 -3.0933436                    0.0067678  
     HHD_Moved_in_ACS_13_17           Hispanic_ACS_13_17  
                  0.0115401                    0.5113979  
      Single_Unit_ACS_13_17    Diff_HU_1yr_Ago_ACS_13_17  
                 -1.4987864                   -0.0247277  
         Pop_5_17_ACS_13_17       NH_Blk_alone_ACS_13_17  
                 -0.0079106                    0.4762311  
    Sngl_Prns_HHD_ACS_13_17        Not_HS_Grad_ACS_13_17  
                 -0.0727898                    0.0097039  
  Med_House_Value_ACS_13_17  
                  1.7301741  
bind_cols(
  grid,
  lm2_wf |>
    last_fit(pdb_split) |>
    extract_workflow() |>
    predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for Linear Regression")

22.10.3 Elastic Net

Let’s try the (Erdman and Bates 2017) functional form with elastic net regularization.

enet_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_sqrt(
    Pop_18_24_ACS_13_17,
    Female_No_HB_ACS_13_17,
    Pop_65plus_ACS_13_17,
    Rel_Child_Under_6_ACS_13_17,
    Pop_25_44_ACS_13_17,
    Pop_45_64_ACS_13_17,
    HHD_Moved_in_ACS_13_17,
    Diff_HU_1yr_Ago_ACS_13_17,
    Pop_5_17_ACS_13_17,
    Sngl_Prns_HHD_ACS_13_17,
    Not_HS_Grad_ACS_13_17
  ) |>
  step_log(
    Med_HHD_Inc_ACS_13_17,
    Med_House_Value_ACS_13_17,
    offset = 0.001
  ) |>
  step_log(
    Tot_Vacant_Units_ACS_13_17,
    offset = 0.001
  ) |>
  add_role(
    Renter_Occp_HU_ACS_13_17,
    NH_White_alone_ACS_13_17,
    College_ACS_13_17,
    Hispanic_ACS_13_17,
    Single_Unit_ACS_13_17,
    NH_Blk_alone_ACS_13_17,
    new_role = "logit"
  ) |>
  step_range(
    has_role("logit")
  ) |>
  step_logit(
    has_role("logit"),
    offset = 0.01
  ) |>
  step_normalize(all_predictors())

enet_mod <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "glmnet")  

enet_wf <- workflow() |>
  add_recipe(recipe = enet_rec) |>
  add_model(spec = enet_mod)
  
enet_grid <- grid_regular(
  penalty(), 
  mixture(),
  levels = 20
)

enet_tuning <- tune_grid(
  enet_wf,
  resamples = pdb_folds,
  grid = enet_grid
)

show_best(enet_tuning)
Warning in show_best(enet_tuning): No value of `metric` was given; "rmse" will
be used.
# A tibble: 5 × 8
   penalty mixture .metric .estimator  mean     n std_err .config               
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1 7.85e- 3   0.474 rmse    standard    4.98     5  0.0489 Preprocessor1_Model196
2 1   e-10   1     rmse    standard    4.98     5  0.0489 Preprocessor1_Model381
3 3.36e-10   1     rmse    standard    4.98     5  0.0489 Preprocessor1_Model382
4 1.13e- 9   1     rmse    standard    4.98     5  0.0489 Preprocessor1_Model383
5 3.79e- 9   1     rmse    standard    4.98     5  0.0489 Preprocessor1_Model384
autoplot(enet_tuning)

enet_tuned_wf <-
  enet_wf |>
  tune::finalize_workflow(tune::select_best(x = enet_tuning, metric = "rmse"))

enet_resamples <- enet_tuned_wf |>
  fit_resamples(resamples = pdb_folds)

Regularization doesn’t help much in this case. This makes sense because the regression specification is parsimonious and deliberate.

enet_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.98      5 0.0489  Preprocessor1_Model1
2 rsq     standard   0.548     5 0.00656 Preprocessor1_Model1
bind_cols(
  grid,
  enet_tuned_wf |>
    last_fit(pdb_split) |>
    extract_workflow() |>
    predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for Elastic Net")

22.10.4 Polynomial Regression

Next we add a 2nd-degree polynomial for all predictors.

polynomial_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_poly(all_predictors())

lm_mod <- linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

polynomial_wf <- workflow() |>
  add_recipe(polynomial_rec) |>
  add_model(lm_mod)

polynomial_resamples <- polynomial_wf |>
  fit_resamples(resamples = pdb_folds)

The results are terrible. The polynomial terms are poorly behaved and the model suffers from multicollinearity: high-degree polynomials without regularization are numerically unstable, so small changes in the data produce large changes in the coefficients. In particular, note the high std_err of the mean RMSE across folds. This instability is why regularization (shown next) is so important when using polynomials.
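To see the collinearity problem directly, here is a quick standalone sketch (not part of the tidymodels pipeline) comparing raw powers of a predictor with the orthogonal polynomial basis that stats::poly() produces by default, which is what step_poly() builds on:

```r
# standalone sketch: raw polynomial columns are nearly collinear, while
# an orthogonal polynomial basis is not
x <- seq(0.1, 10, length.out = 500)

# correlations among raw powers of x: off-diagonals close to 1,
# i.e. severe multicollinearity
cor(cbind(x, x^2, x^3))

# correlations among orthogonal polynomial columns: essentially zero
round(cor(poly(x, degree = 3)), 10)
```

The orthogonal basis tames collinearity within a single predictor's polynomial terms, but correlations across different predictors remain, which is part of what regularization addresses next.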

polynomial_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   6.22      5  1.15   Preprocessor1_Model1
2 rsq     standard   0.450     5  0.0823 Preprocessor1_Model1
bind_cols(
  grid,
  polynomial_wf |>
    last_fit(pdb_split) |>
    extract_workflow() |>
    predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for Polynomial Regression")

22.10.5 Polynomial Regression with Elastic Net

Let’s try adding 3rd-degree polynomials and interactions to create a huge design matrix and then use elastic net regression to regularize it.

poly_enet_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_sqrt(
    Pop_18_24_ACS_13_17,
    Female_No_HB_ACS_13_17,
    Pop_65plus_ACS_13_17,
    Rel_Child_Under_6_ACS_13_17,
    Pop_25_44_ACS_13_17,
    Pop_45_64_ACS_13_17,
    HHD_Moved_in_ACS_13_17,
    Diff_HU_1yr_Ago_ACS_13_17,
    Pop_5_17_ACS_13_17,
    Sngl_Prns_HHD_ACS_13_17,
    Not_HS_Grad_ACS_13_17
  ) |>
  step_log(
    Med_HHD_Inc_ACS_13_17,
    Med_House_Value_ACS_13_17,
    Tot_Vacant_Units_ACS_13_17,
    offset = 0.001
  ) |>
  add_role(
    Renter_Occp_HU_ACS_13_17,
    NH_White_alone_ACS_13_17,
    College_ACS_13_17,
    Hispanic_ACS_13_17,
    Single_Unit_ACS_13_17,
    NH_Blk_alone_ACS_13_17,
    new_role = "logit"
  ) |>
  step_range(
    has_role("logit")
  ) |>
  step_logit(
    has_role("logit"),
    offset = 0.01
  ) |>
  step_normalize(all_predictors()) |>
  step_poly(all_predictors(), degree = 3) |>
  step_interact(terms = ~starts_with("Renter"):all_predictors())

poly_enet_mod <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "glmnet")  

poly_enet_wf <- workflow() |>
  add_recipe(recipe = poly_enet_rec) |>
  add_model(spec = poly_enet_mod)
  
poly_enet_grid <- grid_regular(
  penalty(), 
  mixture(),
  levels = 20
)

poly_enet_tuning <- tune_grid(
  poly_enet_wf,
  resamples = pdb_folds,
  grid = poly_enet_grid
)

show_best(poly_enet_tuning)
Warning in show_best(poly_enet_tuning): No value of `metric` was given; "rmse"
will be used.
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config               
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1  0.0264   0.632 rmse    standard    4.73     5  0.0491 Preprocessor1_Model257
2  0.0264   0.579 rmse    standard    4.73     5  0.0495 Preprocessor1_Model237
3  0.0886   0.211 rmse    standard    4.73     5  0.0500 Preprocessor1_Model098
4  0.0264   0.684 rmse    standard    4.73     5  0.0492 Preprocessor1_Model277
5  0.0886   0.158 rmse    standard    4.73     5  0.0522 Preprocessor1_Model078
autoplot(poly_enet_tuning)

poly_enet_tuned_wf <-
  poly_enet_wf |>
  tune::finalize_workflow(tune::select_best(x = poly_enet_tuning, metric = "rmse"))

poly_enet_resamples <- poly_enet_tuned_wf |>
  fit_resamples(resamples = pdb_folds)

Here, elastic net regularization substantially improves performance over both the simple elastic net model and the unregularized polynomial model: the expanded design matrix adds flexibility, and the penalty keeps the coefficients stable.

poly_enet_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.73      5 0.0491  Preprocessor1_Model1
2 rsq     standard   0.593     5 0.00632 Preprocessor1_Model1
bind_cols(
  grid,
  poly_enet_tuned_wf |>
    last_fit(pdb_split) |>
    extract_workflow() |>
    predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for Polynomial Regression with Elastic Net")

22.10.6 GAM/Regression Splines

Let’s explore regression splines and a GAM. The syntax here differs from the earlier workflows: the smooth terms are specified in a model formula passed to add_model(), and mgcv estimates the smoothness of each s() term during fitting rather than through grid tuning. The adjust_deg_free parameter we tune below only scales the smoothness penalty.

Also, the model takes a long time to fit. To keep things simple, we apply splines to only some of the predictors. Choosing which predictors deserve splines requires EDA and theory.
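As a standalone illustration of what each s() term does (separate from the workflow above; the simulated data and seed are invented for the example), mgcv estimates the wiggliness of a smooth during fitting:

```r
library(mgcv)

# simulate a nonlinear relationship
set.seed(1)
sim <- data.frame(x = runif(300))
sim$y <- sin(2 * pi * sim$x) + rnorm(300, sd = 0.3)

# the smoothness of s(x) is estimated during fitting (here by REML),
# not selected by grid search
fit <- gam(y ~ s(x), data = sim, method = "REML")

# estimated effective degrees of freedom for the smooth term
summary(fit)$edf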

gam_spec <- gen_additive_mod(adjust_deg_free = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "mgcv")

gam_wf <- workflow() |>
  add_recipe(lm1_rec) |>
  add_model(
    gam_spec,
    formula = non_return_rate ~ 
      s(Renter_Occp_HU_ACS_13_17) +
      s(Pop_18_24_ACS_13_17) +
      s(Female_No_HB_ACS_13_17) +
      s(NH_White_alone_ACS_13_17) +
      s(Pop_65plus_ACS_13_17) +
      s(Rel_Child_Under_6_ACS_13_17) +
      s(Males_ACS_13_17) +
      s(MrdCple_Fmly_HHD_ACS_13_17) +
      s(Pop_25_44_ACS_13_17) +
      s(Tot_Vacant_Units_ACS_13_17) +
      s(College_ACS_13_17) +
      s(Med_HHD_Inc_ACS_13_17) +
      s(Pop_45_64_ACS_13_17) +
      s(HHD_Moved_in_ACS_13_17) +
      Hispanic_ACS_13_17 +
      Single_Unit_ACS_13_17 +
      Diff_HU_1yr_Ago_ACS_13_17 +
      Pop_5_17_ACS_13_17 +
      NH_Blk_alone_ACS_13_17 +
      Sngl_Prns_HHD_ACS_13_17 +
      Not_HS_Grad_ACS_13_17 +
      Med_House_Value_ACS_13_17
  )

gam_grid <-
  grid_regular(adjust_deg_free(range = c(0.25, 5)), levels = 5)

gam_tune <-
  tune_grid(
    gam_wf, 
    resamples = pdb_folds,
    grid = gam_grid
  )

autoplot(gam_tune)

show_best(gam_tune)
Warning in show_best(gam_tune): No value of `metric` was given; "rmse" will be
used.
# A tibble: 5 × 7
  adjust_deg_free .metric .estimator  mean     n std_err .config             
            <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1            2.62 rmse    standard    4.87     5  0.0453 Preprocessor1_Model3
2            3.81 rmse    standard    4.87     5  0.0451 Preprocessor1_Model4
3            1.44 rmse    standard    4.87     5  0.0460 Preprocessor1_Model2
4            5    rmse    standard    4.87     5  0.0451 Preprocessor1_Model5
5            0.25 rmse    standard    4.88     5  0.0484 Preprocessor1_Model1
gam_final_wf <-
  finalize_workflow(gam_wf,
                    select_best(gam_tune, metric = "rmse"))

gam_resamples <- gam_final_wf |>
  fit_resamples(resamples = pdb_folds)

collect_metrics(gam_resamples)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.87      5 0.0453  Preprocessor1_Model1
2 rsq     standard   0.569     5 0.00569 Preprocessor1_Model1
bind_cols(
  grid,
  gam_final_wf |>
    last_fit(pdb_split) |>
    extract_workflow() |>
    predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for GAM/Regression Splines")

22.10.7 MARS

Lastly, let’s explore multivariate adaptive regression splines (MARS) with hyperparameter tuning. Note the simplicity of the recipe.
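MARS builds its fit from pairs of hinge functions, max(0, x − c) and max(0, c − x), joined at data-chosen knots; num_terms caps how many such terms survive pruning. A quick standalone sketch of the basis (the knot at 5 is arbitrary):

```r
# a hinge pair with a knot at c = 5; MARS sums many of these
hinge_up   <- function(x, c) pmax(0, x - c)  # zero below the knot
hinge_down <- function(x, c) pmax(0, c - x)  # zero above the knot

x <- 0:10
hinge_up(x, 5)    # 0 0 0 0 0 0 1 2 3 4 5
hinge_down(x, 5)  # 5 4 3 2 1 0 0 0 0 0 0
```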

mars_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id")

# tune the number of retained model terms; prod_degree = 1 keeps the
# model additive (no interactions between hinge functions)
mars_mod <- mars(
  num_terms = tune(), 
  prod_degree = 1
) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "earth")

mars_wf <- workflow() |>
  add_recipe(recipe = mars_rec) |>
  add_model(spec = mars_mod)

mars_grid <- grid_regular(
  num_terms(range = c(20, 100)),
  levels = 10
)

mars_resamples <- tune_grid(
  mars_wf,
  resamples = pdb_folds,
  grid = mars_grid
)

collect_metrics(mars_resamples)

show_best(mars_resamples)

autoplot(mars_resamples)
mars_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id")

# num_terms = 46 was selected by the hyperparameter tuning above;
# limiting prod_degree to 1 keeps the model additive and helps guard
# against overfitting
mars_mod <- mars(
  num_terms = 46,
  prod_degree = 1
) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "earth")

mars_wf <- workflow() |>
  add_recipe(recipe = mars_rec) |>
  add_model(spec = mars_mod)

mars_resamples <- mars_wf |>
  fit_resamples(resamples = pdb_folds)

collect_metrics(mars_resamples)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.95      5 0.0484  Preprocessor1_Model1
2 rsq     standard   0.555     5 0.00642 Preprocessor1_Model1
bind_cols(
  grid,
  mars_wf |>
    last_fit(pdb_split) |>
    extract_workflow() |>
    predict(new_data = grid)
) |>
  ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
  geom_line() +
  labs(title = "Fitted Function for MARS")

22.10.8 Final Comparison

Finally, let’s compare the RMSE and \(r\)-squared across all of the models fit above (with hyperparameter tuning where applicable).

bind_rows(
  `Linear regression` = collect_metrics(lm1_resamples) |>
    filter(.metric == "rmse"),
  `Linear regression (Prepped)` = collect_metrics(lm2_resamples) |>
    filter(.metric == "rmse"),  
  `Elastic Net` = collect_metrics(enet_resamples) |>
    filter(.metric == "rmse"),  
  `Polynomial ENet regression` = collect_metrics(poly_enet_resamples) |>
    filter(.metric == "rmse"),
  `GAM` = collect_metrics(gam_resamples) |>
    filter(.metric == "rmse"),  
  `MARS` = collect_metrics(mars_resamples) |>
    filter(.metric == "rmse"),
  .id = "model"
)
# A tibble: 6 × 7
  model                       .metric .estimator  mean     n std_err .config    
  <chr>                       <chr>   <chr>      <dbl> <int>   <dbl> <chr>      
1 Linear regression           rmse    standard    5.33     5  0.0580 Preprocess…
2 Linear regression (Prepped) rmse    standard    4.98     5  0.0489 Preprocess…
3 Elastic Net                 rmse    standard    4.98     5  0.0489 Preprocess…
4 Polynomial ENet regression  rmse    standard    4.73     5  0.0491 Preprocess…
5 GAM                         rmse    standard    4.87     5  0.0453 Preprocess…
6 MARS                        rmse    standard    4.95     5  0.0484 Preprocess…
bind_rows(
  `Linear regression` = collect_metrics(lm1_resamples) |>
    filter(.metric == "rsq"),
  `Linear regression (Prepped)` = collect_metrics(lm2_resamples) |>
    filter(.metric == "rsq"),    
  `Elastic Net` = collect_metrics(enet_resamples) |>
    filter(.metric == "rsq"),   
  `Polynomial ENet regression` = collect_metrics(poly_enet_resamples) |>
    filter(.metric == "rsq"), 
  `GAM` = collect_metrics(gam_resamples) |>
    filter(.metric == "rsq"),  
  `MARS` = collect_metrics(mars_resamples) |>
    filter(.metric == "rsq"),
  .id = "model"
)
# A tibble: 6 × 7
  model                       .metric .estimator  mean     n std_err .config    
  <chr>                       <chr>   <chr>      <dbl> <int>   <dbl> <chr>      
1 Linear regression           rsq     standard   0.485     5 0.00906 Preprocess…
2 Linear regression (Prepped) rsq     standard   0.548     5 0.00655 Preprocess…
3 Elastic Net                 rsq     standard   0.548     5 0.00656 Preprocess…
4 Polynomial ENet regression  rsq     standard   0.593     5 0.00632 Preprocess…
5 GAM                         rsq     standard   0.569     5 0.00569 Preprocess…
6 MARS                        rsq     standard   0.555     5 0.00642 Preprocess…

  1. The Council of Economic Advisers’ “cubic model” of Covid-19 cases is one high-profile example of extrapolating cubic models gone awry.↩︎