Opinionated Analysis Development

| Opinionated Approach | Question Addressed | Tool¹ | Section¹ |
|---|---|---|---|
| Modular, tested, code | Can you re-use logic in different parts of the analysis? | Well-tested functions | Programming |
| Modular, tested, code | If you decide to change logic, can you change it in just one place? | Well-tested functions | Programming |
| Modular, tested, code | If your code is not performing as expected, will you know? | Well-tested functions | Programming |
| Assertive testing of data, assumptions, and results | If your data are corrupted, do you notice? | library(assertr) | Programming |

Source: Parker, Hilary. n.d. "Opinionated Analysis Development." https://doi.org/10.7287/peerj.preprints.3210v1.

¹ Added by Aaron R. Williams
9 Assertive Testing
Photo by Ryan McGilchrist
9.1 Assertive Testing of Data
While reproducibility drastically reduces the number of errors and opacity of analysis, without assertive testing it runs the risk of applying an analysis to corrupted data, or applying an analysis to data that have drifted too far from assumptions. ~ (Parker, n.d.)
Assertions are useful for verifying the quality of data. Many of the principles from assertions and unit testing for functions apply:
- Fail fast, fail often
- Fail loudly
- Fail clearly
Assertive testing of data and assumptions is often much squishier than the unit testing and assertions from the previous section. We must now rely on subject matter expertise and experience with the data to develop assertions that can catch corruptions of the data or data processing mistakes.
Assertive testing means establishing these quality-control checks – usually based on past knowledge of possible corruptions of the data – and halting an analysis if the quality-control checks are not passed, so the analyst can investigate and hopefully fix (or at least account for) the problem. ~ (Parker, n.d.)
9.1.1 library(assertr)
`library(assertr)` is a framework for applying assertions to data frames in R. It works well with the pipe (`%>%` or `|>`) because the first argument of each of the five main functions is always a data frame.

A predicate function is a function that returns a single `TRUE` or `FALSE`.

`verify()` takes a logical expression. If all values of the logical expression are `TRUE`, the code proceeds. If any value is `FALSE`, the code terminates and returns a diagnostic tibble.
library(assertr)

msleep %>%
  verify(nrow(.) == 83) %>%
  verify(sleep_total < 24) %>%
  verify(has_class("sleep_total", class = "numeric"))
# A tibble: 83 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
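The source does not show the call that produced the failure output below. A `verify()` call asserting the wrong row count, sketched here, would produce it (assuming `msleep` comes from `library(ggplot2)`):

```r
library(assertr)  # verify()
library(dplyr)    # %>%
library(ggplot2)  # msleep data set (assumed source of msleep)

# Asserting an incorrect row count: verify() evaluates the logical
# expression, finds it FALSE, prints a diagnostic tibble, and stops.
msleep %>%
  verify(nrow(.) == 82)
```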
verification [nrow(.) == 82] failed! (1 failure)
verb redux_fn predicate column index value
1 verify NA nrow(.) == 82 NA 1 NA
Error: assertr stopped execution
`assert()` takes a predicate function and an arbitrary number of columns. `assert()` terminates if any value violates the predicate function, and a single call can apply the same test to multiple columns.
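The call that generated the next printout is not shown in the source. A passing `assert()` call with illustrative bounds, such as this sketch, returns the data unchanged:

```r
library(assertr)  # assert(), within_bounds()
library(dplyr)    # %>%
library(ggplot2)  # msleep data set (assumed source of msleep)

# within_bounds(0, 24) is a predicate function. assert() applies it
# element-wise to each listed column; NA values pass by default.
msleep %>%
  assert(within_bounds(0, 24), sleep_total, sleep_rem, awake)
```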
# A tibble: 83 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
`insist()` is like `assert()`, but `insist()` can make assertions based on the observed data (e.g., throw an error if any value exceeds four sample standard deviations from the sample mean).
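As an illustrative sketch (the call itself is not in the source), an `insist()` that halts if any `sleep_total` value lies more than four sample standard deviations from the sample mean:

```r
library(assertr)  # insist(), within_n_sds()
library(dplyr)    # %>%
library(ggplot2)  # msleep data set (assumed source of msleep)

# within_n_sds(4) computes the bounds from the observed column,
# so the assertion adapts to the data it is given.
msleep %>%
  insist(within_n_sds(4), sleep_total)
```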
# A tibble: 83 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
`assert_rows()` extends `assert()` so the assertion can rely on values from multiple columns (e.g., row means must fall within a bound, or each row must have a certain number of non-missing values).
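For example (an illustrative sketch; the bound of seven missing values per row is an assumption, not from the source):

```r
library(assertr)  # assert_rows(), num_row_NAs, within_bounds()
library(dplyr)    # %>%, everything()
library(ggplot2)  # msleep data set (assumed source of msleep)

# num_row_NAs reduces each row to its count of missing values;
# within_bounds(0, 7) then tests that per-row count.
msleep %>%
  assert_rows(num_row_NAs, within_bounds(0, 7), everything())
```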
# A tibble: 83 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
`insist_rows()` extends `insist()` so the assertion can rely on values from multiple columns. This is less common but can be used to check whether any observation exceeds a certain Mahalanobis distance from the other rows.
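A hypothetical sketch (the chosen columns and the 10-MAD cutoff are assumptions, and the call will halt if any row is that extreme):

```r
library(assertr)  # insist_rows(), maha_dist, within_n_mads()
library(dplyr)    # %>%
library(ggplot2)  # msleep data set (assumed source of msleep)

# maha_dist reduces each row to its Mahalanobis distance from the
# other rows; within_n_mads(10) flags rows more than ten median
# absolute deviations beyond the median distance.
msleep %>%
  insist_rows(maha_dist, within_n_mads(10), sleep_total, bodywt)
```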
`verify()` predicate functions:

- `has_all_names()`
- `has_only_names()`
- `has_class()`

`assert()` predicate functions:

- `not_na()`
- `within_bounds()`
- `in_set()`
- `is_uniq()`

`insist()` predicate functions:

- `within_n_sds()`
- `within_n_mads()`
The assertr vignette demonstrates additional functionality.
`library(assertr)` is designed to be used early in a workflow. If you want to run the assertions at the end of the workflow and you don't want to see printed tibble after printed tibble, end the chain of code with a custom function like the following, which returns its input invisibly:

#' Helper function to silence output from testing code
#'
#' @param data A data frame
#'
quiet <- function(data) {
  invisible(data)
}
9.1.2 Other Assertions
`library(tidylog)` prints diagnostic information when functions from `library(dplyr)` and `library(tidyr)` are used.
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)
left_join(x = math_scores, y = reading_scores, by = "name")
left_join: added one column (reading_score)
> rows only in x 0
> rows only in y (1)
> matched rows 3
> ===
> rows total 3
# A tibble: 3 × 3
name math_score reading_score
<chr> <dbl> <dbl>
1 Alec 95 88
2 Bart 97 67
3 Carrie 100 100
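The source prints the tidylog output of a full join without showing the call itself; given the same data frames, a call like the following produces it:

```r
library(dplyr)

math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

# A full join keeps all four names; "Zeta" has no math score, so
# math_score is NA in that row.
full_join(x = math_scores, y = reading_scores, by = "name")
```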
full_join: added one column (reading_score)
> rows only in x 0
> rows only in y 1
> matched rows 3
> ===
> rows total 4
# A tibble: 4 × 3
name math_score reading_score
<chr> <dbl> <dbl>
1 Alec 95 88
2 Bart 97 67
3 Carrie 100 100
4 Zeta NA 100
We’ll detach tidylog to keep the rest of this document clean.
`library(tidylog)` is excellent for interactive development of data analyses. If you find yourself checking the same `library(tidylog)` output more than once, write an assertion that captures the same information.
Missing Values
The following throws an error if the data set contains any missing values.
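A minimal sketch of such a check, using `assert()` with the `not_na()` predicate on every column (the `scores` data frame is illustrative, not from the source):

```r
library(assertr)  # assert(), not_na
library(dplyr)    # tribble(), %>%, everything()

scores <- tribble(
  ~name, ~score,
  "Alec", 95,
  "Bart", NA
)

# not_na fails on Bart's missing score, so execution halts with a
# diagnostic tibble identifying the offending row.
scores %>%
  assert(not_na, everything())
```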
Joins
Joins are one of the most dangerous parts of any data analysis. We can think of many different types of joins:
- “one-to-one”
- “one-to-many”
- “many-to-one”
- “many-to-many”
We can provide an expectation for the type of join using the `relationship` argument in the `*_join()` functions. This is an assertion.

Consider the test scores data sets from earlier. This should be a one-to-one join because each row in `x` matches at most 1 row in `y` and each row in `y` matches at most 1 row in `x`.
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)
left_join(
  x = math_scores,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
# A tibble: 3 × 3
name math_score reading_score
<chr> <dbl> <dbl>
1 Alec 95 88
2 Bart 97 67
3 Carrie 100 100
Suppose "Alec" appeared twice in either data set. Then this code would throw a loud error.
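For instance, duplicating "Alec" in an illustrative copy of the math scores violates the declared relationship:

```r
library(dplyr)

math_scores_dup <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Alec", 99,
  "Bart", 97
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67
)

# "Alec" in y now matches two rows in x, so dplyr throws an error
# instead of silently returning extra rows.
left_join(
  x = math_scores_dup,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
```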
Pivots
Pivots are also one of the most dangerous parts of any data analysis. We can write tests for the number of rows and the class for the output of pivots.
Consider `table4a` from `library(tidyr)`.
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
We want to pivot this data set to be longer because the data set isn’t tidy. Before writing code to tidy the data, we can probably come up with a few assertions:
- There should be six rows.
- `year` and `cases` should be numeric.
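A sketch of the tidying code with both assertions attached (`names_transform` converts `year` from character to numeric; without it, `year` would be character and the second assertion would fail):

```r
library(tidyr)    # pivot_longer(), table4a
library(dplyr)    # %>%
library(assertr)  # verify()

table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases",
    names_transform = list(year = as.numeric)
  ) %>%
  verify(nrow(.) == 6) %>%
  verify(is.numeric(year) & is.numeric(cases))
```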
It’s easy to get tired and to cut corners. Assertions never rest.
Understand, that your assertion is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity or remorse or fear. It absolutely will not stop ever until your analysis is correct. ~ [Terminator (sort of)](https://www.youtube.com/watch?v=zu0rP2VWLWw)
9.2 Assertive Testing of Assumptions
Assertive testing of assumptions is the squishiest of everything we’ve considered testing. We don’t want to apply an analysis to data that have drifted too far from the assumptions of analysis. We also don’t want to inappropriately apply a set of binary tests (think mechanical null hypothesis testing with p-values).
At the very least, we should include visualizations and diagnostic tests that systematically explore the assumptions of an analysis in our Quarto documents. Then, we can use version control to track if anything changed unexpectedly.
Beyond that, we need to rely on subject matter expertise to come up with heuristics for assertions.
9.3 Profiling and Benchmarking
We skipped the question "If you are not using efficient code, will you be able to identify it?"
Human time is expensive. Machine time is cheap. All else equal, we shouldn’t worry too much about making our code more efficient.
Sometimes, it is necessary to make our code more efficient. After all, who cares if our analysis is reproducible if it takes two weeks to run?
Profiling is the systematic measurement of the run-time of each line of code.
Benchmarking is the precise measurement of the performance of a small piece of code. Typically, the code is run multiple times to improve the precision of the measurement.
Systematically making code more efficient generally proceeds in three steps:
- Step 1: Profile the entire set of code to identify bottlenecks.
- Step 2: Benchmark small pieces of code that are responsible for the bottleneck.
- Step 3: Try to improve the slow pieces of code. Return to step 2 to evaluate the result.
RStudio has built-in tools for profiling the run time and memory usage of large chunks of code. See this section of Advanced R to learn more.
`library(microbenchmark)` has robust tools for benchmarking code. See this section of Advanced R to learn more.
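For example, a benchmark comparing two equivalent ways to compute a mean (illustrative code, not from the source):

```r
library(microbenchmark)

x <- runif(1e5)

# Each expression is evaluated 100 times; the printed summary shows
# the distribution of timings (min, median, max) for each.
microbenchmark(
  mean(x),
  sum(x) / length(x),
  times = 100
)
```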