22 Interpretable Models
22.1 Planning Database
- Motivation: The decennial census is used to allocate hundreds of billions of dollars and to redistrict political power. Ensuring an accurate count is imperative. The Census Bureau uses the planning database (PDB) and a modeled low-response score (LRS) to plan outreach to improve response rates.
- Implementation data: Demographic and housing information for the American Community Survey at the census tract and census block group levels.
- Modeling data: Demographic and housing information for the American Community Survey and the low-response score for the 2010 Decennial Census at the census tract and census block group levels.
- Objective: Predict which areas will have the lowest response rates to the 2020 Decennial Census.
- Tools: Linear regression based on the top 25 predictors from gradient-boosted trees.
- Results: Unclear but we’ll see more in these notes!
22.1.1 LRS Set Up
We will work with the tract-level PDB instead of the block-group level PDB to simplify computation. The PDB contains more than 500 variables. For now, we only consider the top 25 variables to further simplify computation.
Lastly, we set up \(v\)-fold cross validation. We only use five folds because some of the models take a long time to fit to the data.
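The set-up code isn't shown in this excerpt; below is a minimal sketch of what it might look like (the file name, seed, and split proportion are assumptions, but pdb_split, pdb_train, and pdb_folds are the object names used in the case study later in this chapter).
library(tidymodels)
# read the tract-level extract of the planning database (file name is an assumption)
pdb_small <- readr::read_csv("data/pdb_tract_small.csv")
# split into training and testing data (seed and proportion are assumptions)
set.seed(20201025)
pdb_split <- initial_split(data = pdb_small, prop = 0.8)
pdb_train <- training(x = pdb_split)
pdb_test <- testing(x = pdb_split)
# five folds keep the slower models manageable
pdb_folds <- vfold_cv(data = pdb_train, v = 5)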
22.2 Motivation
The models that won the planning database Kaggle competition all used ensembled trees. They had high predictive power but were difficult to interpret and difficult to use for inference.
In the end, the Census Bureau used the ensembled trees to inform a linear regression model that was easier to interpret. In other words, the Census Bureau sacrificed predictive power to gain interpretability.
This section will introduce extensions of linear regression models that aim to improve predictive power without quickly becoming difficult to interpret or “black box.”
Like in earlier sections, we will use the simulated example 1 data to demonstrate some model fits.
Code
library(tidymodels)
set.seed(20201004)
x <- runif(n = 1000, min = 0, max = 10)
data1 <- bind_cols(
x = x,
y = 10 * sin(x) + x + 20 + rnorm(n = length(x), mean = 0, sd = 2)
)
set.seed(20201007)
# create a split object
data1_split <- initial_split(data = data1, prop = 0.75)
# create the training and testing data
data1_train <- training(x = data1_split)
data1_test <- testing(x = data1_split)
22.3 Polynomial Regression
Polynomial regression includes higher-degree predictors in the design matrix. For example, a regression model will include \(x\) and \(x^2\) as predictors.
Polynomials extend linear regression to include transformations of predictors such that
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i \]
where \(d\) is referred to as the degree of the polynomial.
Polynomial regression is still linear regression because the model is linear in its coefficients, but it can fit non-linear patterns in the data.
Figure 22.1 shows four different linear regression models fit to the training data. The degree is the highest power of \(x\) included in the model.
Code
lin_reg_model1 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
lin_reg_model2 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ poly(x, degree = 2, raw = TRUE), data = data1_train)
lin_reg_model3 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ poly(x, degree = 3, raw = TRUE), data = data1_train)
lin_reg_model4 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ poly(x, degree = 4, raw = TRUE), data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`Degree = 1` = predict(object = lin_reg_model1, new_data = new_data)$.pred,
`Degree = 2` = predict(object = lin_reg_model2, new_data = new_data)$.pred,
`Degree = 3` = predict(object = lin_reg_model3, new_data = new_data)$.pred,
`Degree = 4` = predict(object = lin_reg_model4, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred")
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (Linear Regression)",
subtitle = "Prediction in red"
) +
theme_minimal()
22.4 Interactions
Interactions include the product of predictors in the design matrix for linear regression. For example, a regression model will include \(x_1\), \(x_2\), and \(x_1x_2\) as predictors.
Interactions extend polynomials to consider multiplying predictors by other predictors in addition to multiplying them by themselves. Including pairwise or triplewise interactions complicates interpretation but can improve the fit of the model to the data.
The hierarchical principle states that main effects should be included in regression models whenever interactions are included, even if the main effects are not statistically significant. The same principle applies to polynomial regression: lower-order terms are kept when higher-order terms are included.
Outside of applications where a small number of interactions is included because of theory, interactions can explode the number of parameters that require estimation.
“A key problem in the estimation of two-way interaction models is the explosion in the number of unknown parameters.” (Ibrahim et al. 2021)
Adding simple pairwise interactions to the small planning database example explodes the number of predictors from 23 to 276.
# A tibble: 55,720 × 279
State_name County_name Low_Response_Score non_return_rate
<chr> <chr> <dbl> <dbl>
1 Arizona Yavapai County 15.2 16.5
2 Georgia Screven County 23.5 23
3 Iowa Polk County 21.4 24.4
4 Wyoming Sweetwater County 26 14.8
5 Missouri Jasper County 22.8 17.6
6 Virginia Hanover County 12.1 10.3
7 Oregon Clackamas County 19.5 21.8
8 North Carolina Chatham County 20.1 14.7
9 North Carolina Mecklenburg County 17.6 15.3
10 California San Bernardino County 28 24.9
# ℹ 55,710 more rows
# ℹ 275 more variables: Renter_Occp_HU_ACS_13_17 <dbl>,
# Pop_18_24_ACS_13_17 <dbl>, Female_No_HB_ACS_13_17 <dbl>,
# NH_White_alone_ACS_13_17 <dbl>, Pop_65plus_ACS_13_17 <dbl>,
# Rel_Child_Under_6_ACS_13_17 <dbl>, Males_ACS_13_17 <dbl>,
# MrdCple_Fmly_HHD_ACS_13_17 <dbl>, Pop_25_44_ACS_13_17 <dbl>,
# Tot_Vacant_Units_ACS_13_17 <dbl>, College_ACS_13_17 <dbl>, …
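The expansion above could be generated with a recipe along these lines. This is a sketch, not the code used in these notes; in particular, the selector-based step_interact() call is an assumption.
pairwise_rec <- recipe(non_return_rate ~ ., data = pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  # cross every numeric predictor with every other numeric predictor
  step_interact(terms = ~ all_numeric_predictors():all_numeric_predictors())
# prep() and bake() materialize the expanded design matrix
pairwise_rec |>
  prep() |>
  bake(new_data = NULL)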
Regularization with methods like LASSO regression and cross validation can help in the presence of many predictors.
22.5 Step Functions
Step functions are not often used directly in predictive modeling but they are an important building block for other methods.
Step functions break the range of a predictor into bins and then fit a different constant (an intercept-only model) in each bin. Figure 22.2 shows piecewise constant regression fit to the simulated example 1 data. The example uses 9 separate regressions (8 knots).
Code
# create a simple linear regression model
lin_reg_model <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression")
# specify a piecewise constant model
step_function_rec <- recipe(formula = y ~ x, data = data1_train) |>
step_spline_b(all_predictors(), degree = 0, deg_free = 8)
# fit the model
step_function_mod <- workflow() |>
add_model(spec = lin_reg_model) |>
add_recipe(recipe = step_function_rec) |>
fit(data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`Piecewise Constant Regression` = predict(object = step_function_mod, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred")
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_line(
data = predictions_grid,
mapping = aes(x = x, y = .pred),
color = "red"
) +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (Piecewise Constant Regression)",
subtitle = "Prediction in red"
) +
theme_minimal()
We can now see the connection between piecewise constant regression and regression trees. Regression trees pick the knots, whereas we must specify the knots for piecewise constant regression.
22.6 Basis Functions
Basis functions are fixed and known transformations applied to a variable \(x\) before modeling. We can denote the basis functions as \(b_1(), b_2(), ..., b_K()\). Basis functions result in “derived features.”
Consider a few simple basis functions:
- Linear: \(b_j(x_i) = x_i\)
- Polynomial: \(b_j(x_i) = x_i^j\)
- Piecewise constant: \(b_j(x_i) = I(c_j \leq x_i < c_{j + 1})\)
Basis functions are advantageous because if we fit a model like
\[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i \]
then we can use least squares estimation and we have access to all of the tools available to linear regression like standard errors and statistical tests. Next, we will explore a few more basis functions.
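A small base-R sketch of the derived features these bases produce (the toy x values and break points are arbitrary):
x <- c(1, 3, 5, 7, 9)
# polynomial basis: b_j(x) = x^j
poly(x, degree = 3, raw = TRUE)
# piecewise-constant basis: indicator of which bin each x falls in
model.matrix(~ cut(x, breaks = c(0, 2.5, 5, 7.5, 10)) - 1)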
22.7 Regression Splines
Regression splines are piecewise polynomial basis functions with a constraint that the fitted curve must be continuous and have continuous first and second derivatives.
Regression splines have a degree, which determines the order of the polynomials (e.g., 1 is linear and 3 is cubic). Regression splines also have knots, which are the break points between regions.
Regression splines combine piecewise step functions and polynomial regression with certain constraints using basis functions. Regression splines are called nonparametric even though they result in estimated regression coefficients.
Knots are where the regions meet. Regression splines are continuous at the knots and have continuous first and second derivatives there, which ensures the fitted line is smooth.
We won’t show the basis functions for simplicity. Just recall, basis functions transform the variable \(x\) before fitting the regression model.
Figure 22.3 shows several piecewise polynomial regression models fit to the simulated example 1 data. The first panel shows a piecewise linear model. Note the discontinuity. The second panel shows a piecewise square model. Again, note the discontinuity. The bottom two panels show linear and square models with added continuity constraints.
Code
# create a simple linear regression model
lin_reg_model <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression")
# create a workflow with a model but no recipe
base_workflow <- workflow() |>
add_model(spec = lin_reg_model)
# create a series of custom recipes
piecewise_linear_rec <- recipe(formula = y ~ x, data = data1_train) |>
step_mutate(ntile = factor(ntile(x, n = 4))) |>
step_dummy(ntile) |>
step_interact(terms = ~ x:starts_with("ntile"))
piecewise_square_rec <- recipe(formula = y ~ x, data = data1_train) |>
step_mutate(ntile = factor(ntile(x, n = 4))) |>
step_dummy(ntile) |>
step_poly(x, options = list(raw = FALSE)) |>
step_interact(terms = ~ starts_with("x"):starts_with("n"))
continuous_linear_rec <- recipe(formula = y ~ x, data = data1_train) |>
step_spline_b(all_predictors(), degree = 1, deg_free = 5)
continuous_square_rec <- recipe(formula = y ~ x, data = data1_train) |>
step_spline_b(all_predictors(), degree = 2, deg_free = 6)
continuous_cubic_rec <- recipe(formula = y ~ x, data = data1_train) |>
step_spline_b(all_predictors(), degree = 3, deg_free = 7)
# add the custom recipes and fit the models
piecewise_linear_mod <- base_workflow |>
add_recipe(recipe = piecewise_linear_rec) |>
fit(data = data1_train)
piecewise_square_mod <- base_workflow |>
add_recipe(recipe = piecewise_square_rec) |>
fit(data = data1_train)
continuous_linear_mod <- base_workflow |>
add_recipe(recipe = continuous_linear_rec) |>
fit(data = data1_train)
continuous_square_mod <- base_workflow |>
add_recipe(recipe = continuous_square_rec) |>
fit(data = data1_train)
continuous_cubic_mod <- base_workflow |>
add_recipe(recipe = continuous_cubic_rec) |>
fit(data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`Piecewise, Linear` = predict(object = piecewise_linear_mod, new_data = new_data)$.pred,
`Piecewise, Square` = predict(object = piecewise_square_mod, new_data = new_data)$.pred,
`Continuous, Linear` = predict(object = continuous_linear_mod, new_data = new_data)$.pred,
`Continuous, Square` = predict(object = continuous_square_mod, new_data = new_data)$.pred,
`Continuous, Cubic` = predict(object = continuous_cubic_mod, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred") |>
mutate(model = factor(model, levels = c("Piecewise, Linear", "Piecewise, Square", "Continuous, Linear", "Continuous, Square", "Continuous, Cubic")))
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_line(
data = filter(predictions_grid, model != "Continuous, Cubic"),
mapping = aes(x = x, y = .pred),
color = "red"
) +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (Piecewise Polynomials)",
subtitle = "Prediction in red"
) +
theme_minimal()
Finally, Figure 22.4 shows a cubic spline model fit to the data. The model uses cubic polynomials and four knots. Recall that it took a 4th-degree polynomial to fit the pattern in the data.
Code
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_line(
data = filter(predictions_grid, model == "Continuous, Cubic"),
mapping = aes(x = x, y = .pred),
color = "red"
) +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (Cubic Regression splines)",
subtitle = "Prediction in red"
) +
theme_minimal()
Higher degrees and more knots will fit the training data better but run the risk of resulting in an overfit model.
Most data scientists don’t use degree > 3 because, as is often said, the human eye can’t detect the remaining discontinuity (in higher-order derivatives) at the knots of a cubic spline.
This leaves picking the number and location of the knots. The location of knots is typically evenly spaced out. The number of knots is limited by the number of observations in the data set. One popular approach is to propose many knots and then use regularization methods like LASSO regression with cross validation to drop unnecessary knots.
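A sketch of that approach on the example 1 data: propose a generous number of knots and let the LASSO zero out the basis functions it doesn't need. The specific deg_free value and the glmnet set-up are assumptions.
many_knots_rec <- recipe(y ~ x, data = data1_train) |>
  step_spline_b(x, degree = 3, deg_free = 20)
lasso_spline_wf <- workflow() |>
  add_recipe(recipe = many_knots_rec) |>
  add_model(
    spec = linear_reg(penalty = tune(), mixture = 1) |>
      set_mode(mode = "regression") |>
      set_engine(engine = "glmnet")
  )
# the penalty would then be tuned with tune_grid() and cross validation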
Basis splines (B-splines) are regression splines in which every region uses a polynomial of the same degree.
Natural splines are basis splines with the added constraint that the minimum region and maximum region (the area outside the most extreme knots) are constrained to be linear.
High-degree polynomials have high variance in sparse regions of the data and at the extremes of data. This means minor changes in the data can dramatically change the fitted line.1
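Natural splines can be specified much like the basis splines above; a minimal sketch, assuming step_spline_natural() is available in the installed version of recipes:
natural_spline_rec <- recipe(y ~ x, data = data1_train) |>
  step_spline_natural(x, deg_free = 7)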
22.8 Generalized Additive Models
The additivity assumption of linear regression states that the contribution of each predictor to the outcome variable is independent of the other predictors. Thus, the combined effect of the predictors is just the sum of their individual effects.
Interactions violate the additivity assumption.
We only considered basis functions in simple linear regression applications (starting with exactly one predictor) until now. Generalized additive models (GAMs) allow us to use non-linear functions of multiple predictors while maintaining additivity.
Generalized additive models are of the form
\[ y_i = \beta_0 + \sum_{j = 1}^pf_j(x_{ij}) + \epsilon_i \]
In other words, the fitted line/conditional mean is the sum of the contributions of each of the predictors. The contributions can be modeled as non-linear functions (basis functions) of each predictor, such as polynomials or splines.
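As a quick illustration on the simulated example 1 data, a GAM can be fit directly with mgcv (a sketch outside the tidymodels workflow used elsewhere in these notes):
library(mgcv)
# y is modeled as a smooth function of x; mgcv estimates the smoothness
gam_fit <- gam(y ~ s(x), data = data1_train)
summary(gam_fit)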
22.9 Multivariate Adaptive Regression Splines (MARS)
Multivariate Adaptive Regression Splines (MARS) are an alternative to GAMs for capturing non-linear patterns in data using linear regression and splines. MARS builds its features from hinge functions of the form
\[ h(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases} \]
or, shifted to a knot \(c\), the mirrored pair
\[ h(x - c) = \max(0, x - c), \qquad h(c - x) = \max(0, c - x) \]
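In R, a hinge feature is just a clipped linear term (a quick sketch):
# mirrored hinge ("hockey stick") functions at knot c
hinge_right <- function(x, c) pmax(0, x - c)  # h(x - c)
hinge_left <- function(x, c) pmax(0, c - x)   # h(c - x)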
MARS is an additive model of the form
\[ y_i = \sum_{j= 1}^pc_jB_j(x_i) + \epsilon_i \]
where \(c_j\) is a linear regression coefficient and \(B_j(x_i)\) is
- 1
- A hinge function
- A product of hinge functions
MARS proceeds in three steps
- Find the knot that creates a piecewise regression model of the following form with the lowest SSE.
\[ y = \beta_0 + \beta_1h(x) \]
- Repeat this process recursively until a stopping criterion is reached. For example, the second step has the form
\[ y = \beta_0 + \beta_1h_1(x) + \beta_2h_2(x) \]
- Prune with cross-validation.
Figure 22.5 demonstrates MARS with 1, 2, 3, and 4 knots using the simulated example 1 data.
Code
# fit MARS with an increasing number of knots
mars1 <- mars(num_terms = 2) |>
set_engine(engine = "earth") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
mars2 <- mars(num_terms = 3) |>
set_engine(engine = "earth") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
mars3 <- mars(num_terms = 4) |>
set_engine(engine = "earth") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
mars4 <- mars(num_terms = 5) |>
set_engine(engine = "earth") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`Knots = 1` = predict(object = mars1, new_data = new_data)$.pred,
`Knots = 2` = predict(object = mars2, new_data = new_data)$.pred,
`Knots = 3` = predict(object = mars3, new_data = new_data)$.pred,
`Knots = 4` = predict(object = mars4, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred")
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (MARS)",
subtitle = "Prediction in red"
) +
theme_minimal()
We can print the table of hinges and coefficients for the four models.
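One way to do this (a sketch; it assumes the underlying earth fit can be pulled out with extract_fit_engine() and that its coefficients matrix prints as shown below):
# pull the earth object out of each parsnip fit and print its coefficients
extract_fit_engine(mars1)$coefficients
extract_fit_engine(mars2)$coefficients
extract_fit_engine(mars3)$coefficients
extract_fit_engine(mars4)$coefficients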
y
(Intercept) 22.82586
h(x-4.50551) 2.25023
y
(Intercept) 28.152720
h(x-4.50551) 9.135052
h(x-2.3873) -5.294739
y
(Intercept) 28.813593
h(x-4.50551) 17.161781
h(x-2.3873) -8.075389
h(x-7.37819) -12.883468
y
(Intercept) 37.598528
h(x-4.50551) 19.679189
h(4.50551-x) -2.964782
h(x-8.12048) -16.167899
h(x-2.3873) -11.946754
22.10 Case Study
We now return to the PDB example with 25 variables and five-fold cross validation.
22.10.1 Linear Regression
We first consider multiple linear regression. We need a recipe, a model, and a workflow to fit the model five times and evaluate its predictive accuracy.
The data set contains a few variables that shouldn’t be included in the model. We use update_role() to turn them into “ids” to remove them from consideration.
lm1_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id")
lm1_mod <- linear_reg() |>
set_mode(mode = "regression") |>
set_engine(engine = "lm")
lm1_wf <- workflow() |>
add_recipe(lm1_rec) |>
add_model(lm1_mod)
lm1_resamples <- lm1_wf |>
fit_resamples(resamples = pdb_folds)
lm1_resamples |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 5.33 5 0.0580 Preprocessor1_Model1
2 rsq standard 0.485 5 0.00906 Preprocessor1_Model1
We can use last_fit() to fit the model to all of the training data and extract_fit_parsnip() to examine the estimated coefficients.
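A minimal sketch of those two calls (the coefficients print below):
lm1_wf |>
  last_fit(pdb_split) |>
  extract_fit_parsnip()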
parsnip model object
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) Renter_Occp_HU_ACS_13_17
25.852142128 0.000527581
Pop_18_24_ACS_13_17 Female_No_HB_ACS_13_17
0.001415666 0.000738697
NH_White_alone_ACS_13_17 Pop_65plus_ACS_13_17
-0.000940139 -0.001967564
Rel_Child_Under_6_ACS_13_17 Males_ACS_13_17
0.001897582 0.000884922
MrdCple_Fmly_HHD_ACS_13_17 Pop_25_44_ACS_13_17
-0.002668457 0.001216952
Tot_Vacant_Units_ACS_13_17 College_ACS_13_17
0.002089677 -0.000344927
Med_HHD_Inc_ACS_13_17 Pop_45_64_ACS_13_17
-0.000097679 0.001730143
HHD_Moved_in_ACS_13_17 Hispanic_ACS_13_17
0.003871744 -0.000371730
Single_Unit_ACS_13_17 Diff_HU_1yr_Ago_ACS_13_17
-0.002632994 -0.000848577
Pop_5_17_ACS_13_17 NH_Blk_alone_ACS_13_17
0.001696031 0.000546687
Sngl_Prns_HHD_ACS_13_17 Not_HS_Grad_ACS_13_17
-0.004322504 -0.000210310
Med_House_Value_ACS_13_17
0.000009283
The following code creates a grid to visualize the marginal effect of Renter_Occp_HU_ACS_13_17 while holding other predictors at 0. Because these models are additive, the value of other predictors only shifts the line vertically and doesn’t change the shape of the relationship.
grid <- expand_grid(
State_name = "dummy",
County_name = "dummy",
Low_Response_Score = 0,
Renter_Occp_HU_ACS_13_17 = seq(0, 7400, 1),
Pop_18_24_ACS_13_17 = 0,
Female_No_HB_ACS_13_17 = 0,
NH_White_alone_ACS_13_17 = 0,
Pop_65plus_ACS_13_17 = 0,
Rel_Child_Under_6_ACS_13_17 = 0,
Males_ACS_13_17 = 0,
MrdCple_Fmly_HHD_ACS_13_17 = 0,
Pop_25_44_ACS_13_17 = 0,
Tot_Vacant_Units_ACS_13_17 = 0,
College_ACS_13_17 = 0,
Med_HHD_Inc_ACS_13_17 = 0,
Pop_45_64_ACS_13_17 = 0,
HHD_Moved_in_ACS_13_17 = 0,
Hispanic_ACS_13_17 = 0,
Single_Unit_ACS_13_17 = 0,
Diff_HU_1yr_Ago_ACS_13_17 = 0,
Pop_5_17_ACS_13_17 = 0,
NH_Blk_alone_ACS_13_17 = 0,
Sngl_Prns_HHD_ACS_13_17 = 0,
Not_HS_Grad_ACS_13_17 = 0,
Med_House_Value_ACS_13_17 = 0
)
# linear regression
bind_cols(
grid,
lm1_wf |>
last_fit(pdb_split) |>
extract_workflow() |>
predict(new_data = grid)
) |>
ggplot(aes(Renter_Occp_HU_ACS_13_17, .pred)) +
geom_line() +
labs(title = "Fitted Function for Linear Regression")
22.10.2 Linear Regression with Transformations
Erdman and Bates (2017) included square root, log, and logit transformations in their final specification. We next update the recipe to include their functional form.
# create a complex recipe with square root, log, and logit transformations
lm2_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
step_sqrt(
Pop_18_24_ACS_13_17,
Female_No_HB_ACS_13_17,
Pop_65plus_ACS_13_17,
Rel_Child_Under_6_ACS_13_17,
Pop_25_44_ACS_13_17,
Pop_45_64_ACS_13_17,
HHD_Moved_in_ACS_13_17,
Diff_HU_1yr_Ago_ACS_13_17,
Pop_5_17_ACS_13_17,
Sngl_Prns_HHD_ACS_13_17,
Not_HS_Grad_ACS_13_17
) |>
step_log(
Med_HHD_Inc_ACS_13_17,
Med_House_Value_ACS_13_17,
offset = 0.001
) |>
step_log(
Tot_Vacant_Units_ACS_13_17,
offset = 0.001
) |>
add_role(
Renter_Occp_HU_ACS_13_17,
NH_White_alone_ACS_13_17,
College_ACS_13_17,
Hispanic_ACS_13_17,
Single_Unit_ACS_13_17,
NH_Blk_alone_ACS_13_17,
new_role = "logit"
) |>
step_range(
has_role("logit")
) |>
step_logit(
has_role("logit"),
offset = 0.01
)
# create a linear regression model
lm2_mod <- linear_reg() |>
set_mode(mode = "regression") |>
set_engine(engine = "lm")
# combine the recipe and model into a workflow
lm2_wf <- workflow() |>
add_recipe(lm2_rec) |>
add_model(lm2_mod)
# fit the model
lm2_resamples <- lm2_wf |>
fit_resamples(resamples = pdb_folds)
Changing the functional form improves the model predictions.
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 4.98 5 0.0489 Preprocessor1_Model1
2 rsq standard 0.548 5 0.00655 Preprocessor1_Model1
We can use last_fit() to fit the model to all of the training data and extract_fit_parsnip() to examine the estimated coefficients.
parsnip model object
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) Renter_Occp_HU_ACS_13_17
32.0377933 1.5626842
Pop_18_24_ACS_13_17 Female_No_HB_ACS_13_17
0.0585177 0.1315009
NH_White_alone_ACS_13_17 Pop_65plus_ACS_13_17
-0.6853445 -0.2488444
Rel_Child_Under_6_ACS_13_17 Males_ACS_13_17
-0.0171017 0.0003647
MrdCple_Fmly_HHD_ACS_13_17 Pop_25_44_ACS_13_17
0.0000302 0.0725572
Tot_Vacant_Units_ACS_13_17 College_ACS_13_17
0.3607191 -1.0962875
Med_HHD_Inc_ACS_13_17 Pop_45_64_ACS_13_17
-3.0933436 0.0067678
HHD_Moved_in_ACS_13_17 Hispanic_ACS_13_17
0.0115401 0.5113979
Single_Unit_ACS_13_17 Diff_HU_1yr_Ago_ACS_13_17
-1.4987864 -0.0247277
Pop_5_17_ACS_13_17 NH_Blk_alone_ACS_13_17
-0.0079106 0.4762311
Sngl_Prns_HHD_ACS_13_17 Not_HS_Grad_ACS_13_17
-0.0727898 0.0097039
Med_House_Value_ACS_13_17
1.7301741
22.10.3 Elastic Net
Let’s try the Erdman and Bates (2017) functional form with elastic net regularization.
enet_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
step_sqrt(
Pop_18_24_ACS_13_17,
Female_No_HB_ACS_13_17,
Pop_65plus_ACS_13_17,
Rel_Child_Under_6_ACS_13_17,
Pop_25_44_ACS_13_17,
Pop_45_64_ACS_13_17,
HHD_Moved_in_ACS_13_17,
Diff_HU_1yr_Ago_ACS_13_17,
Pop_5_17_ACS_13_17,
Sngl_Prns_HHD_ACS_13_17,
Not_HS_Grad_ACS_13_17
) |>
step_log(
Med_HHD_Inc_ACS_13_17,
Med_House_Value_ACS_13_17,
offset = 0.001
) |>
step_log(
Tot_Vacant_Units_ACS_13_17,
offset = 0.001
) |>
add_role(
Renter_Occp_HU_ACS_13_17,
NH_White_alone_ACS_13_17,
College_ACS_13_17,
Hispanic_ACS_13_17,
Single_Unit_ACS_13_17,
NH_Blk_alone_ACS_13_17,
new_role = "logit"
) |>
step_range(
has_role("logit")
) |>
step_logit(
has_role("logit"),
offset = 0.01
) |>
step_normalize(all_predictors())
enet_mod <- linear_reg(penalty = tune(), mixture = tune()) |>
set_mode(mode = "regression") |>
set_engine(engine = "glmnet")
enet_wf <- workflow() |>
add_recipe(recipe = enet_rec) |>
add_model(spec = enet_mod)
enet_grid <- grid_regular(
penalty(),
mixture(),
levels = 20
)
enet_tuning <- tune_grid(
enet_wf,
resamples = pdb_folds,
grid = enet_grid
)
show_best(enet_tuning)
Warning in show_best(enet_tuning): No value of `metric` was given; "rmse" will
be used.
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 7.85e- 3 0.474 rmse standard 4.98 5 0.0489 Preprocessor1_Model196
2 1 e-10 1 rmse standard 4.98 5 0.0489 Preprocessor1_Model381
3 3.36e-10 1 rmse standard 4.98 5 0.0489 Preprocessor1_Model382
4 1.13e- 9 1 rmse standard 4.98 5 0.0489 Preprocessor1_Model383
5 3.79e- 9 1 rmse standard 4.98 5 0.0489 Preprocessor1_Model384
Regularization doesn’t help much in this case. This makes sense because the regression specification is parsimonious and deliberate.
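The resampled metrics below (and the enet_resamples object used in the final comparison) could be produced by finalizing the workflow at the best penalty and mixture and refitting on the folds. A sketch, with the same pattern assumed for poly_enet_resamples and gam_resamples later:
best_enet <- select_best(enet_tuning, metric = "rmse")
enet_resamples <- enet_wf |>
  finalize_workflow(parameters = best_enet) |>
  fit_resamples(resamples = pdb_folds)
collect_metrics(enet_resamples)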
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 4.98 5 0.0489 Preprocessor1_Model1
2 rsq standard 0.548 5 0.00656 Preprocessor1_Model1
22.10.4 Polynomial Regression
Next we add a 2nd-degree polynomial for all predictors.
polynomial_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
step_poly(all_predictors())
lm_mod <- linear_reg() |>
set_mode(mode = "regression") |>
set_engine(engine = "lm")
polynomial_wf <- workflow() |>
add_recipe(polynomial_rec) |>
add_model(lm_mod)
polynomial_resamples <- polynomial_wf |>
fit_resamples(resamples = pdb_folds)
The results are terrible. The polynomials are poorly behaved and the model suffers from multicollinearity. High-degree polynomials without regularization are numerically unstable: small changes in the data create large changes in coefficients. In particular, note the high std_err of the mean RMSE across folds. This instability is why regularization (shown next) is critical when using polynomials.
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 6.22 5 1.15 Preprocessor1_Model1
2 rsq standard 0.450 5 0.0823 Preprocessor1_Model1
22.10.5 Polynomial Regression with Elastic Net
Let’s try adding 3rd-degree polynomials and interactions to create a huge design matrix and then use elastic net regression for regularization.
poly_enet_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
step_sqrt(
Pop_18_24_ACS_13_17,
Female_No_HB_ACS_13_17,
Pop_65plus_ACS_13_17,
Rel_Child_Under_6_ACS_13_17,
Pop_25_44_ACS_13_17,
Pop_45_64_ACS_13_17,
HHD_Moved_in_ACS_13_17,
Diff_HU_1yr_Ago_ACS_13_17,
Pop_5_17_ACS_13_17,
Sngl_Prns_HHD_ACS_13_17,
Not_HS_Grad_ACS_13_17
) |>
step_log(
Med_HHD_Inc_ACS_13_17,
Med_House_Value_ACS_13_17,
offset = 0.001
) |>
step_log(
Tot_Vacant_Units_ACS_13_17,
offset = 0.001
) |>
add_role(
Renter_Occp_HU_ACS_13_17,
NH_White_alone_ACS_13_17,
College_ACS_13_17,
Hispanic_ACS_13_17,
Single_Unit_ACS_13_17,
NH_Blk_alone_ACS_13_17,
new_role = "logit"
) |>
step_range(
has_role("logit")
) |>
step_logit(
has_role("logit"),
offset = 0.01
) |>
step_normalize(all_predictors()) |>
step_poly(all_predictors(), degree = 3) |>
step_interact(terms = ~starts_with("Renter"):all_predictors())
poly_enet_mod <- linear_reg(penalty = tune(), mixture = tune()) |>
set_mode(mode = "regression") |>
set_engine(engine = "glmnet")
poly_enet_wf <- workflow() |>
add_recipe(recipe = poly_enet_rec) |>
add_model(spec = poly_enet_mod)
poly_enet_grid <- grid_regular(
penalty(),
mixture(),
levels = 20
)
poly_enet_tuning <- tune_grid(
poly_enet_wf,
resamples = pdb_folds,
grid = poly_enet_grid
)
show_best(poly_enet_tuning)
Warning in show_best(poly_enet_tuning): No value of `metric` was given; "rmse"
will be used.
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0264 0.632 rmse standard 4.73 5 0.0491 Preprocessor1_Model257
2 0.0264 0.579 rmse standard 4.73 5 0.0495 Preprocessor1_Model237
3 0.0886 0.211 rmse standard 4.73 5 0.0500 Preprocessor1_Model098
4 0.0264 0.684 rmse standard 4.73 5 0.0492 Preprocessor1_Model277
5 0.0886 0.158 rmse standard 4.73 5 0.0522 Preprocessor1_Model078
Here, elastic net regularization dramatically improves performance over both the simpler linear regression models and the saturated polynomial model without regularization.
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 4.73 5 0.0491 Preprocessor1_Model1
2 rsq standard 0.593 5 0.00632 Preprocessor1_Model1
22.10.6 GAM/Regression Splines
Let’s explore regression splines and a GAM. The syntax is slightly different because the spline terms are specified with s() in the model formula rather than tuned as preprocessing hyperparameters.
Also, the model takes a long time to fit. To keep things simple, we only apply smoothing splines to some of the predictors. Choosing which predictors to spline requires EDA and theory.
gam_spec <- gen_additive_mod(adjust_deg_free = tune()) |>
set_mode(mode = "regression") |>
set_engine(engine = "mgcv")
gam_wf <- workflow() |>
add_recipe(lm1_rec) |>
add_model(
gam_spec,
formula = non_return_rate ~
s(Renter_Occp_HU_ACS_13_17) +
s(Pop_18_24_ACS_13_17) +
s(Female_No_HB_ACS_13_17) +
s(NH_White_alone_ACS_13_17) +
s(Pop_65plus_ACS_13_17) +
s(Rel_Child_Under_6_ACS_13_17) +
s(Males_ACS_13_17) +
s(MrdCple_Fmly_HHD_ACS_13_17) +
s(Pop_25_44_ACS_13_17) +
s(Tot_Vacant_Units_ACS_13_17) +
s(College_ACS_13_17) +
s(Med_HHD_Inc_ACS_13_17) +
s(Pop_45_64_ACS_13_17) +
s(HHD_Moved_in_ACS_13_17) +
Hispanic_ACS_13_17 +
Single_Unit_ACS_13_17 +
Diff_HU_1yr_Ago_ACS_13_17 +
Pop_5_17_ACS_13_17 +
NH_Blk_alone_ACS_13_17 +
Sngl_Prns_HHD_ACS_13_17 +
Not_HS_Grad_ACS_13_17 +
Med_House_Value_ACS_13_17
)
gam_grid <-
grid_regular(adjust_deg_free(range = c(0.25, 5)), levels = 5)
gam_tune <-
tune_grid(
gam_wf,
resamples = pdb_folds,
grid = gam_grid
)
autoplot(gam_tune)
show_best(gam_tune)
Warning in show_best(gam_tune): No value of `metric` was given; "rmse" will be
used.
# A tibble: 5 × 7
adjust_deg_free .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2.62 rmse standard 4.87 5 0.0453 Preprocessor1_Model3
2 3.81 rmse standard 4.87 5 0.0451 Preprocessor1_Model4
3 1.44 rmse standard 4.87 5 0.0460 Preprocessor1_Model2
4 5 rmse standard 4.87 5 0.0451 Preprocessor1_Model5
5 0.25 rmse standard 4.88 5 0.0484 Preprocessor1_Model1
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 4.87 5 0.0453 Preprocessor1_Model1
2 rsq standard 0.569 5 0.00569 Preprocessor1_Model1
22.10.7 MARS
Finally, let’s explore MARS with hyperparameter tuning. Note the simplicity of the recipe.
Code
mars_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id")
# we will tune the number of retained terms (hinge functions); prod_degree is
# fixed at 1 so the model stays additive
mars_mod <- mars(
num_terms = tune(),
prod_degree = 1
) |>
set_mode(mode = "regression") |>
set_engine(engine = "earth")
mars_wf <- workflow() |>
add_recipe(recipe = mars_rec) |>
add_model(spec = mars_mod)
mars_grid <- grid_regular(
num_terms(range = c(20, 100)),
levels = 10
)
mars_resamples <- tune_grid(
mars_wf,
resamples = pdb_folds,
grid = mars_grid
)
collect_metrics(mars_resamples)
show_best(mars_resamples)
autoplot(mars_resamples)
mars_rec <- recipe(non_return_rate ~ ., pdb_train) |>
update_role(State_name, County_name, Low_Response_Score, new_role = "id")
# the best parameters were chosen using hyperparameter tuning
# limiting prod_degree to 1 maintains additivity and prevents overfitting
mars_mod <- mars(
num_terms = 46,
prod_degree = 1
) |>
set_mode(mode = "regression") |>
set_engine(engine = "earth")
mars_wf <- workflow() |>
add_recipe(recipe = mars_rec) |>
add_model(spec = mars_mod)
mars_resamples <- mars_wf |>
fit_resamples(resamples = pdb_folds)
collect_metrics(mars_resamples)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 4.95 5 0.0484 Preprocessor1_Model1
2 rsq standard 0.555 5 0.00642 Preprocessor1_Model1
22.10.8 Final Comparison
Finally, let’s compare the RMSE and \(r\)-squared across all of the models, with hyperparameter tuning where applicable.
bind_rows(
`Linear regression` = collect_metrics(lm1_resamples) |>
filter(.metric == "rmse"),
`Linear regression (Prepped)` = collect_metrics(lm2_resamples) |>
filter(.metric == "rmse"),
`Elastic Net` = collect_metrics(enet_resamples) |>
filter(.metric == "rmse"),
`Polynomial ENet regression` = collect_metrics(poly_enet_resamples) |>
filter(.metric == "rmse"),
`GAM` = collect_metrics(gam_resamples) |>
filter(.metric == "rmse"),
`MARS` = collect_metrics(mars_resamples) |>
filter(.metric == "rmse"),
.id = "model"
)
# A tibble: 6 × 7
model .metric .estimator mean n std_err .config
<chr> <chr> <chr> <dbl> <int> <dbl> <chr>
1 Linear regression rmse standard 5.33 5 0.0580 Preprocess…
2 Linear regression (Prepped) rmse standard 4.98 5 0.0489 Preprocess…
3 Elastic Net rmse standard 4.98 5 0.0489 Preprocess…
4 Polynomial ENet regression rmse standard 4.73 5 0.0491 Preprocess…
5 GAM rmse standard 4.87 5 0.0453 Preprocess…
6 MARS rmse standard 4.95 5 0.0484 Preprocess…
bind_rows(
`Linear regression` = collect_metrics(lm1_resamples) |>
filter(.metric == "rsq"),
`Linear regression (Prepped)` = collect_metrics(lm2_resamples) |>
filter(.metric == "rsq"),
`Elastic Net` = collect_metrics(enet_resamples) |>
filter(.metric == "rsq"),
`Polynomial ENet regression` = collect_metrics(poly_enet_resamples) |>
filter(.metric == "rsq"),
`GAM` = collect_metrics(gam_resamples) |>
filter(.metric == "rsq"),
`MARS` = collect_metrics(mars_resamples) |>
filter(.metric == "rsq"),
.id = "model"
)
# A tibble: 6 × 7
model .metric .estimator mean n std_err .config
<chr> <chr> <chr> <dbl> <int> <dbl> <chr>
1 Linear regression rsq standard 0.485 5 0.00906 Preprocess…
2 Linear regression (Prepped) rsq standard 0.548 5 0.00655 Preprocess…
3 Elastic Net rsq standard 0.548 5 0.00656 Preprocess…
4 Polynomial ENet regression rsq standard 0.593 5 0.00632 Preprocess…
5 GAM rsq standard 0.569 5 0.00569 Preprocess…
6 MARS rsq standard 0.555 5 0.00642 Preprocess…
The Council of Economic Advisers’ “cubic model” of Covid-19 cases is one high-profile example of extrapolating cubic models gone awry.↩︎