Opinionated Analysis Development
Opinionated Approach | Question Addressed | Tool¹ | Section¹ |
---|---|---|---|
Version control (individual) | Can you re-run your analysis with new data and compare it to previous results? | Git | Version Control |
Version control (individual) | Can you surface the code changes that resulted in different analysis results? | Git | Version Control |
Code review | If you make a mistake in your code, will someone notice it? | GitHub | Version Control |
Version control (collaborative) | Can a second analyst easily contribute code to the analysis? | GitHub | Version Control |
Version control (collaborative) | If two analysts are developing code simultaneously, can they easily combine them? | GitHub | Version Control |
Version control (collaborative) | Can you easily track next steps in your analysis? | GitHub | Version Control |
Version control (collaborative) | Can your collaborators make requests outside of meetings or email? | GitHub | Version Control |
Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.
¹ Added by Aaron R. Williams
6 Reproducible Research with Git and GitHub
<> are used throughout this chapter to indicate blanks that need to be filled in. Don’t actually submit <>. Instead, replace them with the desired text.
6.1 Command Line
The command line (also known as shell or console) is a way of controlling computers without using a graphical user interface (i.e. pointing-and-clicking). The command line is useful because pointing-and-clicking is tough to reproduce or scale and because lots of useful software is only available through the command line. Furthermore, cloud computing often requires use of the command line.
There are different ways to use the command line.
Macs use the Terminal (Figure 6.1). Open Terminal like any other program on Mac.
Git Bash, which is installed with Git, works well on Windows. If you have Git Bash, you should be able to right-click in a desired directory and select “Git Bash Here” to access Git Bash on Windows.
RStudio contains a terminal in the tab adjacent to the console (Figure 6.2). This will allow us to work at the command line with a common experience on Mac-, Windows-, and Linux-based computers.
6.1.1 Bash
Bash is a shell program and command language that allows us to control our computer at the command line. Fortunately, we only need to know a little Bash for version control with Git.
- pwd - print working directory - prints the file path to the current location.
- ls - list - lists files and folders in the current working directory.
- cd - change directory - moves the current working directory. Specify the relative path to move down in a directory. Use cd .. to move up a directory.
- mkdir - make directory - creates a directory (folder) in the current working directory.
- touch - creates a text file with the provided name.
- mv - move - moves a file from one location to the other.
- cat - concatenate - concatenates and prints a file.
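The commands above can be tried together in a short session. This is a minimal sketch, assuming a Bash-style terminal (Terminal on Mac, Git Bash on Windows, or the RStudio terminal). The folder and file names are made up for illustration, and mktemp -d and echo are not in the list above; they are used only to create a safe scratch folder and to put some text in the file.

```shell
# Work in a throwaway directory so nothing important is touched.
# (mktemp -d is an assumption of this sketch; any empty folder works.)
demo_dir="$(mktemp -d)"
cd "$demo_dir"

pwd                             # print the path of the working directory
mkdir analysis                  # make a directory called analysis
touch notes.txt                 # create an empty text file
ls                              # list the contents: analysis and notes.txt
mv notes.txt analysis/          # move notes.txt into analysis/
cd analysis                     # move down into analysis/
echo "first note" > notes.txt   # write one line into the file (echo is not in the list above)
cat notes.txt                   # print the contents of the file
cd ..                           # move back up to the throwaway directory
```

The scratch folder can be deleted afterwards like any other folder.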
6.1.2 Useful tips
- Tab completion can save a ton of typing. Hitting tab twice shows all of the available options that can complete from the currently typed text.
- Hit the up arrow to cycle through previously submitted commands.
- Use man <command name> to pull up help documentation. Hit q to exit.
- Typing .. refers to the directory above the working directory. Writing cd .. changes to the directory above the working directory.
- Typing just cd changes to the home directory.
6.1.3 Programs
We can run programs from the command line. Commands from programs always start with the name of the program. Git commands, intuitively, start with git.
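As a small illustration of the pattern, the sketch below asks Git for its version. It assumes Git is already installed; the variable name git_version is made up for this example.

```shell
# The first word is the program; everything after it is an instruction to that program.
git --version                      # ask the git program to print its version
git_version="$(git --version)"     # the same command, with its output saved in a variable
echo "$git_version"
```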
6.2 Why version control?
Version control is a system for managing and recording changes to files over time.
Version control is essential to managing code and analyses. Good version control can:
- Create a permanent record of changes to code
- Easily undo mistakes by switching between iterations of code
- Allow multiple paths of development while protecting working versions of code
- Encourage communication between collaborators
- Facilitate multiple code reviews
- Be used for external communication
6.3 Why distributed version control?
Centralized version control stores all files and the log of those files in one centralized location.
Distributed version control stores files and logs in one or many locations and has tools for combining the different versions of files and logs.
Centralized version control systems like Google Drive or Box are good for sharing a Microsoft Word document, but they are terrible for collaborating on code.
Distributed version control allows for the simultaneous editing and running of code. It also allows for code development without sacrificing a working version of the code.
Git and GitHub are difficult to motivate a priori but the value is obvious after adopting the tools. We’ve done our best to motivate the tools. If you are unconvinced, we ask that you just trust us on this one.
6.4 Git vs. GitHub
Git is a distributed version-control system for tracking changes in code. Git is free, open-source software and can be used locally without an internet connection. It’s like a turbo-charged version of Microsoft Word’s track changes for code.
GitHub, which is owned by Microsoft, is an online hosting service for version control using Git. It also contains useful tools for collaboration and project management. It’s like a turbo-charged version of Google Drive or Box for sharing repositories created using Git.
At first, it’s easy to mix up Git and GitHub. Just try to remember that they are separate tools that complement each other well.
6.5 SSH Keys for Authentication
GitHub started requiring token-based or SSH-based authentication in 2021. We will focus on creating SSH keys for authentication. For instructions on creating a personal access token for authentication, see Section 6.10 below.
We will follow the instructions for setting up SSH keys using the console, or terminal window, from Jenny Bryan’s fantastic Happy Git with R.
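For orientation, generating a key pair at the command line generally looks like the sketch below. It is not a substitute for the Happy Git with R instructions. The sketch writes the key to a throwaway folder instead of the usual ~/.ssh so it cannot overwrite an existing key. It also uses an empty passphrase so it runs without prompts. The folder, the file name, and the comment text are all placeholders.

```shell
# Hypothetical location for this sketch; real keys usually live in ~/.ssh.
key_dir="$(mktemp -d)"
key_file="$key_dir/id_ed25519"

# Generate an Ed25519 key pair.
#   -C adds a comment (placeholder text here) that labels the key
#   -f sets the output file
#   -N "" sets an empty passphrase so the sketch runs without prompts;
#         consider a real passphrase for a key you plan to keep
ssh-keygen -t ed25519 -C "<descriptive comment>" -f "$key_file" -N ""

# The .pub file is the public half; its contents are what gets pasted into GitHub.
cat "$key_file.pub"

# After the public key is added to GitHub, this checks the connection.
# It is commented out because it needs network access and a configured key.
# ssh -T git@github.com
```

Only the public key (the .pub file) is shared with GitHub. The private key stays on your computer.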
6.6 Git + GitHub Workflow
Git does not work well with shared drives like Box, Google Drive, and SharePoint. Fortunately, those tools aren’t necessary for a Git + GitHub workflow.
A repository is a collection of files, often a directory, where files are organized and logged by Git.
Git and GitHub organize projects into repositories. Typically, a “repo” will correspond with the place where you started a .Rproj. When working with Git and GitHub, your files will exist in two places: locally on your computer and remotely on GitHub.
When creating a new repository, you can use either of the following alternatives:
- Initialize the repo locally on your computer and later add the repo to GitHub
- Initialize the repo remotely on GitHub and then copy (clone) the repo to your computer.
To create a repository (only needs to be done once per project):
git init initializes a local Git repository.
OR
git clone <link> copies a remote repository from GitHub to the location of the working directory on your computer.
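The two alternatives can be compared side by side. This is a minimal sketch that runs entirely on one computer: a bare repository on disk stands in for GitHub, which is an assumption made so the example needs no account or internet connection. All folder names are made up.

```shell
work_dir="$(mktemp -d)"   # throwaway folder for this sketch
cd "$work_dir"

# Option 1: initialize a new local repository inside a project folder.
mkdir my-analysis
cd my-analysis
git init                  # creates the hidden .git folder that stores the history
cd ..

# Option 2: clone an existing remote repository.
# A bare repository on disk stands in for GitHub here (an assumption of this
# sketch); with GitHub, <link> would be the repository's SSH or HTTPS link.
git init --bare fake-remote.git
git clone fake-remote.git cloned-analysis   # warns that the repository is empty
```

Both folders end up with a .git folder. The cloned copy also remembers where it came from as a remote named origin.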
6.6.1 Basic Approach
- Initialize a repository for a project (we’ve already done this!).
- Tell Git which files to track. Track scripts. Avoid tracking data or binary files like .docx and .xlsx. [1]
- Take a snapshot of tracked files and add a commit message.
- Save the tracked files to the remote GitHub repository.
- Repeat, repeat, repeat
6.6.2 Commands
git status prints out all of the important information about your repo. Use it before and after most commands to understand how code changes your repo.
git add <file-name> adds a file to staging. It says, “hey look at this!”.
git commit -m "<message>" commits changes made to added files to the repository. It says, “hey, take a snapshot of the things you looked at in the previous command.” Don’t forget the -m. [2]
git push origin main pushes your local commits to the remote repository on GitHub. It says, “hey, take a look at these snapshots I just made”. It is possible to push to branches other than main. Simply replace main with the desired branch name.
git log --oneline shows the commit history in the repository.
git diff shows changes made to tracked files since the last commit that have not yet been staged with git add.
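The commands fit together into one cycle, sketched below. To keep the example self-contained, a bare repository on disk plays the role of GitHub. The script name, commit message, and identity are placeholders. The identity is set only inside this throwaway repo because Git refuses to commit without one.

```shell
# Setup for the sketch: a throwaway project plus a bare repository that stands
# in for GitHub. Both paths and the identity below are placeholders.
repo_dir="$(mktemp -d)"
remote_dir="$(mktemp -d)"
git init --bare "$remote_dir"
cd "$repo_dir"
git init
git config user.name "Example Analyst"        # local to this repo only
git config user.email "analyst@example.com"
git remote add origin "$remote_dir"

echo 'x <- 1' > clean-data.R                  # hypothetical script to track

git status                                    # clean-data.R is listed as untracked
git add clean-data.R                          # stage the script
git commit -m "adds data cleaning script"     # snapshot the staged file
git branch -M main                            # make sure the branch is named main
git push origin main                          # send the commit to the remote

echo 'y <- 2' >> clean-data.R                 # edit the tracked file
git diff                                      # shows the added line, which is not yet staged
git log --oneline                             # one line per commit
```

Running git status between any two steps is a cheap way to see what each command changed.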
6.7 GitHub Pages
GitHub Pages are free websites hosted directly from a GitHub repository. With a free GitHub account, a GitHub repo must be public to create a GitHub page with that repo. When you create a GitHub page, you associate it with a specific branch of your repo. GitHub Pages will look for an index.html, index.md, or README.md file as the entry file for your site.
6.8 .gitignore
We don’t want to add every file to Git or GitHub. We want to avoid binary files like Word documents. We can’t add very large files.
By default, every file in a local repository will show up after git status, and those files we don’t want to add will be at risk of being added by accident. This is annoying.
Fortunately, we can use a .gitignore file to ignore files and directories. This cleans up our git status and protects us from accidentally adding a file we don’t want to add. To ignore a file, just add the name of the file or folder to the .gitignore.
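A small experiment shows the effect. In this sketch the file and folder names are hypothetical, and printf is used only as a quick way to write two lines into the .gitignore. Editing the file in a text editor works just as well.

```shell
ignore_dir="$(mktemp -d)"     # throwaway repository for this sketch
cd "$ignore_dir"
git init

# Hypothetical files: one script to track, two files to keep out of Git.
touch analysis.R report.docx
mkdir data
touch data/survey.csv

# Each line of .gitignore is a file name, folder, or pattern to ignore.
printf '%s\n' 'report.docx' 'data/' > .gitignore

git status --short            # lists .gitignore and analysis.R, but not the ignored files
git check-ignore -v report.docx data/survey.csv   # shows which rule ignores each file
```

The .gitignore itself is a plain text file that is usually tracked, so collaborators share the same rules.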
6.9 Conclusion
Git is a distributed version-control system. It is used for tracking changes in code. GitHub is an online hosting service for version control using Git. Key workhorse commands are git status, git add, git commit -m <message>, git push, and git diff. GitHub is also great because it will host websites using GitHub Pages.
6.9.1 Git is Confusing
We promise that it’s worth it.
6.9.2 Resources
6.10 Personal Access Tokens for Authentication
- Starting on your GitHub account navigate through the following:
- Click your icon in the far right
- Select Settings at the bottom of the drop down menu
- Select Developer Settings on the bottom left
- Select Personal access tokens on the bottom left
- Select Generate new token
- Set up your Personal Access Token (PAT)
- Add a note describing the use of your token. This is useful if you intend to generate multiple tokens for different uses.
- Select “No expiration”. You may want tokens to expire if they access sensitive resources.
- Select scopes. You must select at least the “repo” scope. You may want to add other scopes but they are not required for this course.
- Click Generate token
- This is your only chance to view the token. Copy and paste the token and store it somewhere safe. If you lose the token, you can always generate a new token.
- Git will prompt you for your GitHub username and password sometimes while cloning repositories or pushing to private repositories. Use your GitHub username when prompted for username. Use your generated PAT when prompted for password.
6.11 Initialize a Repo Locally and Add to GitHub
This only needs to happen once per repository
- Initialize a local repository with git init as outlined above.
- On GitHub, click the plus sign in the top-right corner and select New Repository.
- Create a repository with the same name as your directory.
- Copy the code under …or push an existing repository from the command line and submit it in the command line that is already open.
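The code GitHub displays in that box typically comes down to three commands: add the remote, name the branch main, and push. The sketch below walks through them with a bare repository on disk standing in for the new, empty GitHub repo. The paths, file name, and identity are placeholders. On GitHub, the argument to git remote add would be the link shown on the repository page.

```shell
# Setup for the sketch: an existing local repository with one commit.
# Paths, file names, and the identity are placeholders.
local_dir="$(mktemp -d)"
github_stand_in="$(mktemp -d)"
git init --bare "$github_stand_in"      # plays the role of the empty GitHub repo
cd "$local_dir"
git init
git config user.name "Example Analyst"
git config user.email "analyst@example.com"
echo "# my-analysis" > README.md
git add README.md
git commit -m "adds README"

# The three commands that connect the local repo to the remote and push.
# With GitHub, the path below would be the link shown on the new repo's page.
git remote add origin "$github_stand_in"
git branch -M main                      # rename the current branch to main
git push -u origin main                 # push and remember origin/main as upstream
```

The -u flag records origin/main as the upstream branch, so later pushes from main can be written as just git push.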
[1] GitHub refuses to store files larger than 100 MiB. This poses a challenge to writing reproducible code. However, many data sources can be downloaded directly from the web or via APIs, allowing code to be reproducible without relying on storing large data sets on GitHub.
[2] The -m stands for message. Writing a brief commit message like “fixes bug in data cleaning script” can help collaborators (including your future self) understand the purpose of your commits.