9  Assertive Testing

A multiple-choice test

Photo by Ryan McGilchrist

Table 9.1: Opinionated Analysis Development

| Opinionated Approach | Question Addressed | Tool¹ | Section¹ |
|---|---|---|---|
| Modular, tested, code | Can you re-use logic in different parts of the analysis? | Well-tested functions | Programming |
| Modular, tested, code | If you decide to change logic, can you change it in just one place? | Well-tested functions | Programming |
| Modular, tested, code | If your code is not performing as expected, will you know? | Well-tested functions | Programming |
| Assertive testing of data, assumptions, and results | If your data are corrupted, do you notice? | library(assertr) | Programming |

Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.
¹ Added by Aaron R. Williams

9.1 Assertive Testing of Data

While reproducibility drastically reduces the number of errors and opacity of analysis, without assertive testing it runs the risk of applying an analysis to corrupted data, or applying an analysis to data that have drifted too far from assumptions. ~ (Parker, n.d.)

Assertions are useful for verifying the quality of data. Many of the principles from assertions and unit testing for functions apply:

  • Fail fast, fail often
  • Fail loudly
  • Fail clearly
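
As a minimal sketch of these three principles in base R, the check below runs before any analysis (fail fast), stops execution with an error (fail loudly), and uses a named condition in stopifnot() so the message says what went wrong (fail clearly). The data frame `sleep_hours`, its values, and the function `check_sleep()` are made up for illustration. Named conditions in stopifnot() assume R 4.0.0 or later.

```r
# Hypothetical data with one impossible value (more than 24 hours in a day)
sleep_hours <- data.frame(
  name = c("Cow", "Dog", "Typo"),
  sleep_total = c(4, 10.1, 101)
)

# Hypothetical helper that checks the data before any analysis is run
check_sleep <- function(data) {
  stopifnot(
    "sleep_total must be between 0 and 24 hours" =
      all(data$sleep_total >= 0 & data$sleep_total <= 24)
  )
  invisible(data)
}

# tryCatch() is used here only so the example can capture the message;
# in a real analysis, let the error halt the script
result <- tryCatch(
  check_sleep(sleep_hours),
  error = function(e) conditionMessage(e)
)

result
```

Because the condition is named, the error message is the name itself rather than the raw R expression.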

Assertive testing of data and assumptions is often much squishier than the unit testing and assertions from the previous section. We must now rely on subject matter expertise and experience with the data to develop assertions that can catch corruptions of the data or data processing mistakes.

Assertive testing means establishing these quality-control checks – usually based on past knowledge of possible corruptions of the data – and halting an analysis if the quality-control checks are not passed, so the analyst can investigate and hopefully fix (or at least account for) the problem. ~ (Parker, n.d.)

9.1.1 library(assertr)

library(assertr) is a framework for applying assertions to data frames in R. It works well with the pipe (%>% or |>) because the first argument of the five main functions is always a data frame.

Predicate Function

A predicate function is a function that returns a single TRUE or FALSE.
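
For intuition, here is a small base R sketch of a predicate function. The name `is_valid_hours` is hypothetical and is not part of library(assertr); it simply returns a single TRUE or FALSE for a single value.

```r
# Hypothetical predicate: is a single value a plausible number of hours in a day?
is_valid_hours <- function(x) {
  is.numeric(x) && length(x) == 1 && !is.na(x) && x >= 0 && x <= 24
}

is_valid_hours(12.1)      # TRUE
is_valid_hours(30)        # FALSE
is_valid_hours(NA_real_)  # FALSE
```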

verify() takes a logical expression. If all values are TRUE for the logical expression, the code proceeds. If any value is FALSE for the logical expression, the code terminates and returns a diagnostic tibble.

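# msleep is a data set from library(ggplot2), which is loaded by library(tidyverse)
library(tidyverse)
library(assertr)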

msleep %>%
  verify(nrow(.) == 83) |>
  verify(sleep_total < 24) |>
  verify(has_class("sleep_total", class = "numeric"))
# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
msleep %>%
  verify(nrow(.) == 82) |>
  verify(sleep_total < 14) |>
  verify(has_class("sleep_total", class = "character"))
verification [nrow(.) == 82] failed! (1 failure)

    verb redux_fn     predicate column index value
1 verify       NA nrow(.) == 82     NA     1    NA

Error: assertr stopped execution

assert() takes a predicate function and an arbitrary number of variables, so one test can be applied to multiple variables. assert() terminates if any value violates the predicate function.

msleep %>%
  assert(within_bounds(0, 24), c(sleep_total, sleep_rem, sleep_cycle))
# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

insist() is like assert(), but insist() can make assertions based on the observed data (e.g. throw an error if any value is more than three sample standard deviations from the sample mean).

msleep %>%
  insist(within_n_sds(n = 3), sleep_total)
# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

assert_rows() extends assert() so the assertion can rely on values from multiple columns (e.g. row means within a bound or row must have a certain number of non-missing values).

msleep |>
  assert_rows(num_row_NAs, within_bounds(0, 5), everything())
# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

insist_rows() extends insist() so the assertion can rely on values from multiple columns. This is less common but can be used to see if any observation exceeds a certain Mahalanobis distance from other rows.
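
A minimal sketch of this use, adapted from the library(assertr) documentation: maha_dist() computes the Mahalanobis distance of each row, and within_n_mads() builds bounds from the median absolute deviation of those distances. The choice of mtcars and a cutoff of 10 MADs is only illustrative.

```r
library(assertr)

# Stop if any row's Mahalanobis distance is more than 10 median absolute
# deviations from the median distance; otherwise return the data unchanged
mtcars_checked <- mtcars |>
  insist_rows(maha_dist, within_n_mads(10), mpg:carb)

nrow(mtcars_checked)
```

A reasonable cutoff depends on the data, so the number of MADs is something to settle with subject matter expertise rather than a default to copy.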

  • verify() predicate functions
    • has_all_names()
    • has_only_names()
    • has_class()
  • assert() predicate functions
    • not_na()
    • within_bounds()
    • in_set()
    • is_uniq()
  • insist() predicate functions
    • within_n_sds()
    • within_n_mads()
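
Several of these predicate functions can be chained in one pipeline. The sketch below assumes a made-up data frame, `survey`, whose column names and values are hypothetical.

```r
library(assertr)

# Hypothetical survey data used only for illustration
survey <- data.frame(
  id = c(1, 2, 3, 4),
  group = c("a", "b", "a", "b"),
  score = c(10, 12, 9, 11)
)

survey_checked <- survey |>
  verify(has_all_names("id", "group", "score")) |>  # required columns exist
  assert(not_na, id, score) |>                      # no missing values
  assert(in_set("a", "b"), group) |>                # only known categories
  assert(is_uniq, id)                               # no duplicated ids

nrow(survey_checked)
```

Each function returns the data when its check passes, so the result can flow straight into the rest of the analysis.
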
Exercise 1
  1. Add a new code chunk to analysis.qmd.
  2. Run glimpse(trees).
  3. verify() that the variable Girth is numeric.
  4. assert() that all three variables are in the interval \([0, \infty)\).

This vignette demonstrates additional functionality.

library(assertr) is designed to be used early in a workflow. If you want to run the assertions at the end of the workflow and you don’t want to see printed tibble after printed tibble, end the chain of code with the following custom function.

#' Helper function to silence output from testing code
#'
#' @param data A data frame
#'
#' @return The input data frame, invisibly
#'
quiet <- function(data) {
  
  # return the data without printing it
  invisible(data)
  
}

Example: Boosting Upward Mobility from Poverty

9.1.2 Other Assertions

library(tidylog) prints diagnostic information when functions from library(dplyr) and library(tidyr) are used.

library(tidylog)
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

left_join(x = math_scores, y = reading_scores, by = "name")
left_join: added one column (reading_score)
           > rows only in x   0
           > rows only in y  (1)
           > matched rows     3
           >                 ===
           > rows total       3
# A tibble: 3 × 3
  name   math_score reading_score
  <chr>       <dbl>         <dbl>
1 Alec           95            88
2 Bart           97            67
3 Carrie        100           100
full_join(x = math_scores, y = reading_scores, by = "name")
full_join: added one column (reading_score)
           > rows only in x   0
           > rows only in y   1
           > matched rows     3
           >                 ===
           > rows total       4
# A tibble: 4 × 3
  name   math_score reading_score
  <chr>       <dbl>         <dbl>
1 Alec           95            88
2 Bart           97            67
3 Carrie        100           100
4 Zeta           NA           100

We’ll detach tidylog to keep the rest of this document clean.

detach("package:tidylog", unload = TRUE)
Note

library(tidylog) is excellent for interactive development of data analyses.

If you look at library(tidylog) output more than once, then write an assertion to capture the same information.
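
As a sketch of that advice, the join diagnostics that library(tidylog) printed for the test scores can be written as assertions with anti_join() from library(dplyr). The object names `only_in_x` and `only_in_y` are hypothetical, and the expected counts are assumptions taken from the printed output for these two small tables.

```r
library(dplyr)
library(tibble)

math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

# Rows of x with no match in y, and rows of y with no match in x
only_in_x <- anti_join(math_scores, reading_scores, by = "name")
only_in_y <- anti_join(reading_scores, math_scores, by = "name")

# Encode what the diagnostics showed: every math score has a reading score,
# and exactly one reading score has no math score
stopifnot(nrow(only_in_x) == 0)
stopifnot(nrow(only_in_y) == 1)
```

If a later version of either file adds or drops students, these checks fail instead of relying on someone rereading the printed messages.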

Missing Values

The following throws an error if the data set contains any missing values.

missing_values <- map_dbl(.x = trees, ~sum(is.na(.x)))

stopifnot(sum(missing_values) == 0)
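
The check above says only that something is missing. A variation, sketched here in base R with the built-in airquality data set, names the columns that contain missing values so the failure is easier to act on. The object names are hypothetical, and tryCatch() is used only to capture the message for display.

```r
# Count missing values in each column (airquality ships with base R)
missing_values <- vapply(airquality, function(x) sum(is.na(x)), numeric(1))

# Names of the columns that contain at least one missing value
columns_with_missing <- names(missing_values)[missing_values > 0]

# Raise an error that names the offending columns
check <- tryCatch(
  if (length(columns_with_missing) > 0) {
    stop("Missing values in: ", paste(columns_with_missing, collapse = ", "))
  },
  error = function(e) conditionMessage(e)
)

check
```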

Joins

Joins are one of the most dangerous parts of any data analysis. We can think of many different types of joins:

  • “one-to-one”
  • “one-to-many”
  • “many-to-one”
  • “many-to-many”

We can provide an expectation for the type of join using the relationship argument in *_join() functions (available in dplyr 1.1.0 and later). This is an assertion.

Consider the test scores data sets from earlier. This should be a one-to-one join because each row in x matches at most 1 row in y and each row in y matches at most 1 row in x.

math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

left_join(
  x = math_scores,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
# A tibble: 3 × 3
  name   math_score reading_score
  <chr>       <dbl>         <dbl>
1 Alec           95            88
2 Bart           97            67
3 Carrie        100           100

Suppose "Alec" appeared twice in either data set. Then this code would throw a loud error because a row in one data set would match more than one row in the other.
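
A short sketch of that failure, using a hypothetical `reading_scores_dup` table in which "Alec" appears twice. tryCatch() is included only so the example can capture the error; in an analysis the error should simply halt the code. This assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

# Hypothetical corrupted data: "Alec" appears twice
reading_scores_dup <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Alec", 90,
  "Bart", 67,
  "Carrie", 100
)

# The join violates the one-to-one expectation, so left_join() errors
join_result <- tryCatch(
  left_join(
    x = math_scores,
    y = reading_scores_dup,
    by = "name",
    relationship = "one-to-one"
  ),
  error = function(e) e
)

inherits(join_result, "error")  # TRUE
```

Without the relationship argument, the same join would quietly return four rows, with "Alec" duplicated.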

Pivots

Pivots are also one of the most dangerous parts of any data analysis. We can write tests for the number of rows and the class for the output of pivots.

Consider table4a from library(tidyr).

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

We want to pivot this data set to be longer because the data set isn’t tidy. Before writing code to tidy the data, we can probably come up with a few assertions:

  • There should be six rows.
  • year and cases should be numeric.
table4a_tidy <- table4a |> 
  pivot_longer(
    cols = c(`1999`, `2000`), 
    names_to = "year", 
    values_to = "cases"
  ) |>
  mutate(year = as.numeric(year))

stopifnot(nrow(table4a_tidy) == 6)
stopifnot(class(pull(table4a_tidy, year)) == "numeric")
stopifnot(class(pull(table4a_tidy, cases)) == "numeric")

It’s easy to get tired and to cut corners. Assertions never rest.

Understand, that your assertion is out there. It can’t be bargained with. It can’t be reasoned with. It doesn’t feel pity or remorse or fear. It absolutely will not stop ever until your analysis is correct. ~ [Terminator (sort of)](https://www.youtube.com/watch?v=zu0rP2VWLWw)

9.2 Assertive Testing of Assumptions

Assertive testing of assumptions is the squishiest of everything we’ve considered testing. We don’t want to apply an analysis to data that have drifted too far from the assumptions of the analysis. We also don’t want to inappropriately apply a set of binary tests (think mechanical null hypothesis testing with p-values).

At the very least, we should include visualizations and diagnostic tests that systematically explore the assumptions of an analysis in our Quarto documents. Then, we can use version control to track if anything changed unexpectedly.

Beyond that, we need to rely on subject matter expertise to come up with heuristics for assertions.
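
As one illustration of such heuristics, the base R sketch below fits a log-log model to the built-in trees data and asserts a few things an analyst might expect to hold. The specific checks and the 0.8 threshold are assumptions chosen for this example, not rules.

```r
# trees ships with base R: Girth, Height, and Volume of 31 black cherry trees

# Assumption 1: a log-log model needs strictly positive values
stopifnot(all(trees$Girth > 0), all(trees$Volume > 0))

fit <- lm(log(Volume) ~ log(Girth), data = trees)

# Assumption 2 (subject matter heuristic): wider trees contain more timber,
# so the estimated slope should be positive
slope <- unname(coef(fit)["log(Girth)"])
stopifnot(slope > 0)

# Assumption 3 (loose heuristic, threshold chosen for illustration):
# the model should explain most of the variation in log volume
stopifnot(summary(fit)$r.squared > 0.8)
```

These checks don't prove the model is right. They only flag data or results that have drifted far enough from expectations to deserve a second look.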

9.3 Profiling and Benchmarking

We skipped the question “If you are not using efficient code, will you be able to identify it?”

Human time is expensive. Machine time is cheap. All else equal, we shouldn’t worry too much about making our code more efficient.

Sometimes, it is necessary to make our code more efficient. After all, who cares if our analysis is reproducible if it takes two weeks to run?

Profiling

Profiling is the systematic measurement of the run-time of each line of code.

Benchmarking

Benchmarking is the precise measurement of the performance of a small piece of code. Typically, the code is run multiple times to improve the precision of the measurement.

Systematically making code more efficient generally proceeds in three steps:

  • Step 1: Profile the entire set of code to identify bottlenecks.
  • Step 2: Benchmark small pieces of code that are responsible for the bottleneck.
  • Step 3: Try to improve the slow pieces of code. Return to step 2 to evaluate the result.
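
Steps 2 and 3 can be sketched with base R alone. The example below compares a hypothetical slow function, `cumsum_loop()`, with the built-in cumsum(), running each several times with system.time(). This is a rough stand-in: system.time() is much coarser than a dedicated benchmarking package, and the timings will differ from machine to machine.

```r
# Hypothetical slow candidate: grows a vector inside a loop
cumsum_loop <- function(x) {
  out <- c()
  total <- 0
  for (value in x) {
    total <- total + value
    out <- c(out, total)
  }
  out
}

x <- as.numeric(1:5000)

# First confirm both versions agree; a faster wrong answer is useless
stopifnot(isTRUE(all.equal(cumsum_loop(x), cumsum(x))))

# Run a function several times and keep the elapsed time of each run
time_runs <- function(f, times = 10) {
  vapply(
    seq_len(times),
    function(i) system.time(f(x))[["elapsed"]],
    numeric(1)
  )
}

# Summarize each version by its median elapsed time in seconds
timings <- data.frame(
  version = c("loop", "vectorized"),
  median_seconds = c(median(time_runs(cumsum_loop)), median(time_runs(cumsum)))
)

timings
```

Checking that the two versions return the same result before timing them ties the optimization back to the assertions used throughout this chapter.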

RStudio has built-in tools for profiling the run time and memory usage of large chunks of code. See this section of Advanced R to learn more.

library(microbenchmark) has robust tools for benchmarking code. See this section of Advanced R to learn more.