Opinionated Analysis Development | |||
Opinionated Approach | Question Addressed | Tool1 | Section1 |
---|---|---|---|
Defined dependencies | If an external library you’re using is updated, can you still reproduce your original results? | library(renv) | Environment Management |
Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1. | |||
1 Added by Aaron R. Williams |
10 library(renv)
library(renv)
.
10.1 Problem Definition
Every time we run a line of R code or Python code, we rely on an entire stack of software and hardware that affects how our line of R code or Python code runs.
The packages, software, and hardware that supports running code for an analysis.
Changing computing environments can lead to many issues with reproducibility.
- Python and R packages can change in ways that change the results of an analysis. The packages may change because the authors
- introduce new functionality
- improve the package interface
- discover and fix bugs
- Beneath packages, Python or R can change in ways that change the results of an analysis.
- Adjacent to programming languages, compilers and linear algebra libraries can change or even the computer operating system can change.
- Hardware can change in ways that change the results of an analysis.
Each example above corresponds to a layer of a computing environment.
- package
- system
- hardware
Ignoring the readiness of the data science environment results in the dreaded it works on my machine phenomenon with a failed attempt to share code with a colleague or deploy an app to production. ~ (Gold 2024)
This is core to reproducibility. Unfortunately, this is where we must temper our expectations.
A perfectly reproducible environment isn’t possible. We quickly hit the point of diminishing returns where system-level factors like machine precision and pseudo random processes (seeds) affect results.
We will focus on the packages layer of our environment. It is the place where we can get the highest return on investment for our work.
- We can avoid situations where our 2018 analysis using 2018 R packages breaks using 2024 R packages.
- We can avoid situations where our 2024 R packages don’t work on our friend’s computer.
- We can intentionally use older versions of R packages.
- We can make it easy to move from our computer to a cloud computer where we have scalable computing power.
10.2 Ideal Solution
What does an ideal solution look like for managing the package layer of a computing environment?
- Isolate: We should be able to isolate the package environment at the project level. That means we can install, update, or remove packages in our current project without affecting any other projects. This means we’ll have project-specific versions of packages.
- Document: We should be able to document our package environment so the package environment is reproducible.
- Share: Our environment should be portable. More precisely, we should be able to share documentation about our environment so someone else (or our future selves) can recreate the environment.
A library is a folder that contains installed packages. Libraries can hold at most one version of a package.
Running the .libPaths()
function shows the location of the library used by an R session.
By default, R’s library will be the system library. We’re interested in creating a project-specific library.
The condition of a computing environment at a point-in-time.
A repository is a source of packages. The most popular repository is the Comprehensive R Archive Network (CRAN), which we typically use when we run install.packages()
.
We’re interested in documenting the repository used to install each package.
A virtual environment is a collection of packages and software that support a project that are isolated from other projects.
We will use library(renv)
to create a virtual environment to track the state of our computing environment with a project-specific library with packages deliberately installed from specific repositories.
10.3 library(renv)
library(renv)
, short for reproducible environment, allows us to create project-specific virtual environments.
Our workflow will have three steps.
10.3.1 1. Isolate
We should be able to isolate the package environment at the project level. That means we can install, update, or remove packages in our current project without affecting any other projects. This means we’ll have project-specific versions of packages.
The init()
function creates a project-specific virtual environment with a project-specific library. This means the R session will use packages from a project library instead of a system library. Running init()
creates three new items in a project:
renv/library/
is the project-specific library.renv.lock
contains metadata that describes the project-specific library..Rprofile
is a hidden system file. It isn’t specific tolibrary(renv)
, but in this case it tells R to use the project library instead of the system library.
At first, this library won’t have anything in it. This is a little extra work! But we can use install()
and update()
to add R packages to the project-specific library.
10.3.2 2. Document
We should be able to document our package environment so the package environment is reproducible.
The snapshot()
function documents the current project environment by updating metadata in the renv.lock
file. snapshot()
will install, update, or uninstall any packages that are in an inconsistent state and update the lockfile to represent the current state of dependencies in the project.
The lockfile (renv.lock
) contains JSON that documents the current state of dependencies for a project. For instance, if we install and use library(palmerpenguins)
, the lockfile will look like:
{
"R": {
"Version": "4.3.1",
"Repositories": [
{
"Name": "CRAN",
"URL": "https://packagemanager.posit.co/cran/latest"
}
]
},
"Packages": {
"palmerpenguins": {
"Package": "palmerpenguins",
"Version": "0.1.1",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
"R"
],
"Hash": "6c6861efbc13c1d543749e9c7be4a592"
},
"renv": {
"Package": "renv",
"Version": "1.0.7",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
"utils"
],
"Hash": "397b7b2a265bc5a7a06852524dabae20"
}
}
}
The status()
gives us a snapshot of the documented project environment.
10.3.4 More about library(renv)
renv
doesn’t install packages in a project directory. Instead, renv makes references to user-level packages, which saves space and install time.
renv
doesn’t acknowledge a dependency until it is used somewhere in the project! dependencies()
will show the .R scripts and Quarto documents where dependencies are created.
update()
updates a package that has already been installed and remove()
removes a package that has been installed.
deactivate()
is like hitting pause on the project environment. It shifts the project to using the system library but doesn’t delete any of the renv files in the directory. reactivate()
is the opposite of deactive()
.
renv::deactivate(clean = TRUE)
is dynamite. It shifts the project to using the system library and deletes all of the renv files. There is no going back. At this point, using renv in the project will require starting from scratch with renv::init()
.
If the repository has a Git history, history()
can explore past versions of the project environment and revert()
can return to an earlier version of the project library. With this in mind, it may make sense to begin using renv
near the beginning of a project instead of at the end.
10.3.5 Going deeper
renv
solves environment management for the package layer of the computing environment but it doesn’t help with the system layer or the hardware layer. We’ll briefly cover some other tools that can help with the system layer and hardware layer.
10.3.6 Conda
Conda can help with the system layer in addition to managing the package layer.
Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.
Interestingly, Gold (2024) is critical of Docker:
Conda allows you to create a virtual environment in user space on your laptop without having admin access. It’s especially useful when your machine is locked down by IT.
That’s not a great fit for a production environment. Conda smashes together the language version, the package management, and, sometimes, the system library management. This is conceptually simple and easy to use, but it often goes awry in production environments. In a production environment (or a shared workbench server), I recommend people manage Python packages with a virtual environment tool like
{venv}
and manage system libraries and versions of Python with tools built for those purposes.
10.3.7 Docker
A container is a self-contained system for running computer software. Typically, containers are designed to be small and fit-for-use with specific analyses.
Containerization is the process of creating a self-contained computer environment to run an analysis. Lars Vilhuber, the Data Editor for the American Economic Association, has advocated for using containers in economics research.
Docker is a popular container tool. Docker can manage the software layer of a computing environment and the package layer of a computing environment. Docker can:
- Specify the computer operating system
- Control system dependencies like the version of Pandoc, BLAS, and compilers
- Control the R or Python version
- Manage R packages and Python packages
- Manage the version of the code that’s run
DockerHub is a popular repository for sharing Docker images.
It’s worth noting that the packages within Docker can be controlled with renv
and the version of the code can be controlled with Git and GitHub.
The foundations of most containers are standard images. Rocker and Posit Images provide useful starting images.
10.3.8 Cloud Computing
The explosion of popularity of cloud computing has expanded options for managing the hardware layer of a computing environment. Amazon Web Services, Google Cloud, and Microsoft Azure provide on-demand cloud computing environments with predictable, consistent, and documentable hardware.
These cloud computing environments have out-of-pocket marginal costs, but the costs are frequently cheaper than maintaining on-premise computing environments like servers. The costs are definitely cheaper than maintaining old, on-premise infrastructure to reproduce old computing environments.
The following is a sensible workflow:
- Spin up a cloud instance (computer) with a specific operating system.
- Pick a Docker image from a Docker repository. This image should be close to fit-for-purpose.
- Set up Docker. Maybe set up renv. Create Dockerfiles and renv files.
- Run the project with good version control.
- Save the Docker files and renv
10.4 Final Thoughts
This is a lot! Managing the software layer and hardware layer of a computing environment can improve the reproducibility of a project, but the rewards quickly diminish and the complexity quickly increases.
Improving project organization and documentation, literate programming, version control, programming best practices, and managing the package layer of a computing environment will almost always yield more benefits than focusing on the software layer or hardware layer of a computing environment.