2  Embrace Your Fallibility

Abstract
This section will motivate and outline the workflows and tools we will adopt to promote reproducible research.

Figure 2.1: The Duck-Rabbit Illusion

2.1 Embrace Your Fallibility

2.1.1 Why are we here?

The unifying interest of data scientists and statisticians is that we want to learn about the world using data.

Working with data has always been tough. It has always been difficult to create analyses that are

  1. Accurate
  2. Reproducible and auditable
  3. Collaborative

Working with data has gotten tougher with time. Data sources, methods, and tools have become more sophisticated. This leaves many of us stressed because errors and mistakes feel inevitable and embarrassing.

2.1.2 What are we going to do?

Errors and mistakes are inevitable. It’s time to embrace our fallibility.

In The Field Guide to Understanding Human Error, Dekker (2014) argues that there are two paradigms:

  1. Old-World View: errors are the fault of individuals
  2. New-World View: errors are the fault of flawed systems that fail individuals

Note

Errors and mistakes are inevitable. This is our gestalt moment, like the famous duck-rabbit illusion in Figure 2.1: the facts in front of us stay the same, but the way we see them flips.

Given the demands of modern research and data analysis, we can no longer treat grit as sufficient. Instead, we need to build and use systems that don’t fail us as statisticians and data scientists.

2.1.3 How are we going to do this?

Errors and mistakes are inevitable. We want to adopt evidence-based best practices that minimize the probability of making an error and maximize the probability of catching an inevitable error.

Parker (n.d.) describes a process called opinionated analysis development.

We’re going to adopt the approaches outlined in Opinionated Analysis Development and then actually implement them using modern data science tools.

Through years of applied data analysis, I’ve found these tools to be essential for creating analyses that are

  1. Accurate
  2. Reproducible and auditable
  3. Collaborative

2.2 Why is Modern Data Analysis Difficult to do Well?

Working with data has gotten tougher with time.

  • Data are larger on average. For example, the Billion Prices Project scraped prices from all around the world to build inflation indices (Cavallo and Rigobon 2016).
  • Complex data collection efforts are more common. For example, Chetty and Friedman (2019) gained access to massive administrative data sets and used formal privacy methods to understand intergenerational mobility.
  • Open-source packages provide incredible functionality for free, but they change over time.
  • Papers like “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time” (Gelman and Loken 2013) and “Why Most Published Research Findings Are False” (Ioannidis 2005) have motivated huge increases in transparency, including focuses on pre-registration and computational reproducibility.

There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to include or exclude, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p < .05 result. The short version is that it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with. ~ Gelman and Loken

From this perspective, what we need is a Truman Show for every researcher, where we can watch their decision-making and the nuances of each choice. That’s impractical! But from the standpoint of good science, pulling back the curtain with computational reproducibility is a practical way to mitigate these concerns.

Even for a simple analysis, we can ask ourselves an entire set of questions at the end. Table 2.1 lists a few of these questions.

Table 2.1: Opinionated Analysis Development

  • Can you re-run the analysis and get the same results?
  • If an external library you’re using is updated, can you still reproduce your original results?
  • If you change code, do you know which downstream code needs to be re-executed?
  • If the data or code change but the analysis is not re-executed, will your analysis reflect that it is out-of-date?
  • Can you re-run your analysis with new data and compare it to previous results?
  • Can you surface the code changes that resulted in different analysis results?
  • Can a second analyst easily understand your code?
  • Can you re-use logic in different parts of the analysis?
  • If you decide to change logic, can you change it in just one place?
  • If your code is not performing as expected, will you know?
  • If your data are corrupted, do you notice?
  • If you make a mistake in your code, will someone notice it?
  • If you are not using efficient code, will you be able to identify it?
  • Can a second analyst easily contribute code to the analysis?
  • If two analysts are developing code simultaneously, can they easily combine their work?
  • Can you easily track next steps in your analysis?
  • Can your collaborators make requests outside of meetings or email?

Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.

2.3 What are we going to do?

Replication

Replication is the recreation of findings across repeated studies. It is a cornerstone of science.

Reproducibility

Reproducibility is the ability to access data, source code, tools, and documentation and recreate all calculations, visualizations, and artifacts of an analysis.

Computational reproducibility should be the minimum standard for computational social sciences and statistical programming.

We are going to center reproducibility in the practices we maintain, the tools we use, and the culture we foster. By centering reproducibility, we will be able to create analyses that are

  1. Accurate
  2. Reproducible and auditable
  3. Collaborative

Table 2.2 groups these questions into analysis features and suggests an opinionated approach to each question.

Table 2.2: Opinionated Analysis Development

| Opinionated Approach | Question Addressed |
|---|---|
| **Reproducible and Auditable** | |
| Executable analysis scripts | Can you re-run the analysis and get the same results? |
| Defined dependencies | If an external library you’re using is updated, can you still reproduce your original results? |
| Watchers for changed code and data | If you change code, do you know which downstream code needs to be re-executed? |
| Watchers for changed code and data | If the data or code change but the analysis is not re-executed, will your analysis reflect that it is out-of-date? |
| Version control (individual) | Can you re-run your analysis with new data and compare it to previous results? |
| Version control (individual) | Can you surface the code changes that resulted in different analysis results? |
| Literate programming¹ | Can a second analyst easily understand your code? |
| **Accurate Code** | |
| Modular, tested code | Can you re-use logic in different parts of the analysis? |
| Modular, tested code | If you decide to change logic, can you change it in just one place? |
| Modular, tested code | If your code is not performing as expected, will you know? |
| Assertive testing of data, assumptions, and results | If your data are corrupted, do you notice? |
| Code review | If you make a mistake in your code, will someone notice it? |
| Code review | If you are not using efficient code, will you be able to identify it? |
| **Collaborative** | |
| Version control (collaborative) | Can a second analyst easily contribute code to the analysis? |
| Version control (collaborative) | If two analysts are developing code simultaneously, can they easily combine their work? |
| Version control (collaborative) | Can you easily track next steps in your analysis? |
| Version control (collaborative) | Can your collaborators make requests outside of meetings or email? |

Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.
¹ This was originally ‘Code Review’.
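To make “modular, tested code” concrete before we name specific tools, here is a minimal sketch in R. The function and test are hypothetical examples, and the testthat package is one common choice for testing (an assumption here, not a tool the tables name):

```r
library(testthat)

# logic lives in one small, reusable function...
standardize <- function(x) {
  stopifnot(is.numeric(x))
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# ...and a test makes its expected behavior explicit
test_that("standardize() centers and scales", {
  z <- standardize(c(1, 2, 3))
  expect_equal(mean(z), 0)
  expect_equal(sd(z), 1)
})
```

If someone later edits standardize() and breaks it, the test fails loudly instead of the analysis silently producing wrong numbers.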

2.4 How are we going to do this?

Table 2.3 lists specific tools we can use to adopt each opinionated approach.

Table 2.3: Opinionated Analysis Development

| Opinionated Approach | Question Addressed | Tool¹ |
|---|---|---|
| **Reproducible and Auditable** | | |
| Executable analysis scripts | Can you re-run the analysis and get the same results? | Code first |
| Defined dependencies | If an external library you’re using is updated, can you still reproduce your original results? | library(renv) |
| Watchers for changed code and data | If you change code, do you know which downstream code needs to be re-executed? | library(targets)² |
| Watchers for changed code and data | If the data or code change but the analysis is not re-executed, will your analysis reflect that it is out-of-date? | library(targets)² |
| Version control (individual) | Can you re-run your analysis with new data and compare it to previous results? | Git |
| Version control (individual) | Can you surface the code changes that resulted in different analysis results? | Git |
| Literate programming³ | Can a second analyst easily understand your code? | Quarto |
| **Accurate Code** | | |
| Modular, tested code | Can you re-use logic in different parts of the analysis? | Well-tested functions |
| Modular, tested code | If you decide to change logic, can you change it in just one place? | Well-tested functions |
| Modular, tested code | If your code is not performing as expected, will you know? | Well-tested functions |
| Assertive testing of data, assumptions, and results | If your data are corrupted, do you notice? | library(assertr) |
| Code review | If you make a mistake in your code, will someone notice it? | GitHub |
| Code review | If you are not using efficient code, will you be able to identify it? | library(microbenchmark)² |
| **Collaborative** | | |
| Version control (collaborative) | Can a second analyst easily contribute code to the analysis? | GitHub |
| Version control (collaborative) | If two analysts are developing code simultaneously, can they easily combine their work? | GitHub |
| Version control (collaborative) | Can you easily track next steps in your analysis? | GitHub |
| Version control (collaborative) | Can your collaborators make requests outside of meetings or email? | GitHub |

Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.
¹ Added by Aaron R. Williams.
² We will not spend much time on these topics.
³ This was originally ‘Code Review’.
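The footnote above says we will not spend much time on library(targets), but a minimal sketch conveys the idea. A _targets.R script declares each step of the analysis and its inputs, so the package knows exactly which downstream targets are stale when code or data change. The file, column, and target names here are hypothetical:

```r
# _targets.R: a minimal sketch of a {targets} pipeline
library(targets)

list(
  # track the raw data file itself so edits to it invalidate downstream steps
  tar_target(raw_file, "data/raw.csv", format = "file"),
  # cleaning depends on the file; modeling depends on the cleaned data
  tar_target(clean, subset(read.csv(raw_file), !is.na(price))),
  tar_target(model, lm(price ~ quantity, data = clean))
)
```

Running targets::tar_make() executes the pipeline and re-runs only out-of-date targets, and targets::tar_outdated() lists stale targets without running anything.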

2.4.1 Bonus stuff!

Adopting these opinionated approaches and tools promotes reproducible research. It also provides a bunch of great bonuses.

  • Reproducible analyses are easy to scale. Using tools we will cover, we created almost 4,000 county- and city-level websites.
  • GitHub offers free web hosting for books and web pages like the notes we’re viewing right now.
  • Quarto makes it absurdly easy to build beautiful websites and PDFs. A minimal sketch follows this list.
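As a small taste of why: a complete Quarto document is just narrative text plus executable code chunks. The file below is a minimal, hypothetical example, not one of the course materials; rendering it with quarto render produces a polished HTML page:

````
---
title: "A minimal analysis page"
format: html
---

Narrative text sits next to the code that supports it.

```{r}
# this chunk runs at render time, and its output
# appears directly below it in the finished page
summary(cars)
```
````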

2.5 Roadmap

The day roughly follows the process for setting up a reproducible data analysis.

  • Project organization will cover how to organize all of the files of a data analysis so they are clear and work well with other tools.
  • Literate programming will cover Quarto, which will allow us to combine narrative text, code, and the output of code into clear artifacts of our data analysis.
  • Version control will cover Git and GitHub. These will allow us to organize the process of reviewing and merging code.
  • In programming, we’ll discuss best practices for writing analysis code, like writing modular, well-tested functions and assertively testing data, assumptions, and results (sketched after this list).
  • Environment management will cover library(renv) and the process of managing package dependencies while using open-source code (also sketched after this list).
  • If we have time, we can discuss how a positive climate and ethical practices can improve transparency and strengthen science.
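As a preview of two of these topics, here is a minimal sketch of assertive testing with library(assertr). The data frame and checks are hypothetical; each step either passes the data through unchanged or halts with an informative error:

```r
library(assertr)

# hypothetical example data
sales <- data.frame(price = c(9.99, 12.50, 11.00), quantity = c(2, 1, 5))

checked <- sales |>
  verify(nrow(sales) > 0) |>            # halt if the data are empty
  assert(not_na, price, quantity) |>    # halt if any values are missing
  assert(within_bounds(0, Inf), price)  # halt if any price is negative
```

And here is a minimal sketch of the library(renv) workflow, run from the project root:

```r
renv::init()      # create a project-local library and an renv.lock lockfile
# ...develop the analysis, installing packages as usual...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # later, or on another machine: reinstall those versions
```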

We reorder Table 2.3 into Table 2.4, which outlines the structure of the rest of the day.

Table 2.4: Opinionated Analysis Development

| Opinionated Approach | Question Addressed | Tool¹ | Section¹ |
|---|---|---|---|
| Executable analysis scripts | Can you re-run the analysis and get the same results? | Code first | Project Organization |
| Literate programming | Can a second analyst easily understand your code? | Quarto | Literate Programming |
| Version control (individual) | Can you re-run your analysis with new data and compare it to previous results? | Git | Version Control |
| Version control (individual) | Can you surface the code changes that resulted in different analysis results? | Git | Version Control |
| Code review | If you make a mistake in your code, will someone notice it? | GitHub | Version Control |
| Version control (collaborative) | Can a second analyst easily contribute code to the analysis? | GitHub | Version Control |
| Version control (collaborative)² | If two analysts are developing code simultaneously, can they easily combine their work? | GitHub | Version Control |
| Version control (collaborative) | Can you easily track next steps in your analysis? | GitHub | Version Control |
| Version control (collaborative) | Can your collaborators make requests outside of meetings or email? | GitHub | Version Control |
| Modular, tested code | Can you re-use logic in different parts of the analysis? | Well-tested functions | Programming |
| Modular, tested code | If you decide to change logic, can you change it in just one place? | Well-tested functions | Programming |
| Modular, tested code | If your code is not performing as expected, will you know? | Well-tested functions | Programming |
| Assertive testing of data, assumptions, and results | If your data are corrupted, do you notice? | library(assertr) | Programming |
| Defined dependencies | If an external library you’re using is updated, can you still reproduce your original results? | library(renv) | Environment Management |

Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.
¹ Added by Aaron R. Williams.
² This was originally ‘Code Review’.