6  Reproducible Research with Git and GitHub

Abstract
Git and GitHub are powerful software tools used to control different versions of a codebase, track changes, and collaborate with other programmers. This section introduces both tools.

Hansel and Gretel by Arthur Rackham
Table 6.1:

Opinionated Analysis Development

Opinionated Analysis Development
Opinionated Approach Question Addressed Tool1 Section1
Version control (individual) Can you re-run your analysis with new data and compare it to previous results? Git Version Control
Version control (individual) Can you surface the code changes that resulted in a different analysis results? Git Version Control
Code review If you make a mistake in your code, will someone notice it? GitHub Version Control
Version control (collaborative) Can a second analyst easily contribute code to the analysis? GitHub Version Control
Version control (collaborative) If two analysts are developing code simultaneously, can they easily combine them? GitHub Version Control
Version control (collaborative) Can you easily track next steps in your analysis? GitHub Version Control
Version control (collaborative) Can your collaborators make requests outside of meetings or email? GitHub Version Control
Source: Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.
1 Added by Aaron R. Williams
Note

<> are used throughout this chapter to indicate blanks that need to be filled in. Don’t actually submit <>. Instead, replace them with the desired text.

6.1 Command Line

The command line (also known as shell or console) is a way of controlling computers without using a graphical user interface (i.e. pointing-and-clicking). The command line is useful because pointing-and-clicking is tough to reproduce or scale and because lots of useful software is only available through the command line. Furthermore, cloud computing often requires use of the command line.

There are different ways to use the command line.

Macs use the Terminal (Figure 6.1). Open Terminal like any other program on Mac.

Figure 6.1: Mac Terminal

Git Bash, which is installed with Git, works well on Windows. If you have Git Bash, you should be able to right-click in a desired directory and select “Git Bash Here” to access Git Bash on Windows.

RStudio contains a terminal in the tab adjacent to the console (Figure 6.2). This will allow us to work at the common line with a common experience on Mac-, Windows-, and Linux-based computers.

Figure 6.2: RStudio Terminal

6.1.1 Bash

Bash is a shell program and command language that allows us to control our computer at the command line. Fortunately, we only need to know a little Bash for version control with Git.

  • pwd - print working directory - prints the file path to the current location in the
  • ls - list - lists files and folders in the current working directory.
  • cd - change directory - move the current working directory. Specify the relative path to move down in a directory. Use cd .. to move up a directory.
  • mkdir - make directory - creates a directory (folder) in the current working directory.
  • touch - creates a text file with the provided name.
  • mv - move - moves a file from one location to the other.
  • cat - concatenate - concatenate and print a file.

6.1.2 Useful tips

  • Tab completion can save a ton of typing. Hitting tab twice shows all of the available options that can complete from the currently typed text.
  • Hit the up arrow to cycle through previously submitted commands.
  • Use man <command name> to pull up help documentation. Hit q to exit.
  • Typing .. refers to the directory above the working directory. Writing cd .. changes to the directory above the working directory.
  • Typing just cd changes to the home directory.
Exercise 1
  1. Navigate to the example-project directory using cd in the RStudio terminal.
  2. Submit pwd to confirm you are in the correct directory.
Exercise 2

Use the following code to create a new text document called haiku.txt in the working directory.

pwd
ls
touch haiku.txt
ls
Exercise 3

Use following to add the haiku “The Old Pond” by Matsuo Basho to haiku.txt.

cat haiku.txt
echo "An old silent pond" >> haiku.txt
cat haiku.txt
echo "A frog jumps into the pond-" >> haiku.txt
echo "Splash! Silence again." >> haiku.txt
echo "~Matsuo Basho" >> haiku.txt
cat haiku.txt
Exercise 4

Use the following code to move haiku.txt to a subdirectory called poems/.

ls
mkdir poems
ls
mv haiku.txt poems/haiku.txt
ls
cat poems/haiku.txt

6.1.3 Programs

We can run programs from the command line. Commands from programs always start with the name of the program. Running git commands intuitively start with git. For example:

git init
git status

6.2 Why version control?

Version Control

Version control is a system for managing and recording changes to files over time.

Version control is essential to managing code and analyses. Good version control can:

  • Create a permanent record of changes to code
  • Easily undo mistakes by switching between iterations of code
  • Allow multiple paths of development while protecting working versions of code
  • Encourage communication between collaborators
  • Facilitate multiple code reviews
  • Be used for external communication

6.3 Why distributed version control?

Centralized version control

Centralized version control stores all files and the log of those files in one centralized location.

Distributed version control

Distributed version control stores files and logs in one or many locations and has tools for combining the different versions of files and logs.

Centralized version control systems like Google Drive or Box are good for sharing a Microsoft Word document, but they are terrible for collaborating on code.

Distributed version control allows for the simultaneous editing and running of code. It also allows for code development without sacrificing a working version of the code.

Note

Git and GitHub are difficult to motivate a priori but the value is obvious after adopting the tools. We’ve done our best to motivate the tools. If you are unconvinced, we ask that you just trust us on this one.

6.4 Git vs. GitHub

Git

Git is a distributed version-control system for tracking changes in code. Git is free, open-source software and can be used locally without an internet connection. It’s like a turbo-charged version of Microsoft Word’s track changes for code.

GitHub

GitHub, which is owned by Microsoft, is an online hosting service for version control using Git. It also contains useful tools for collaboration and project management. It’s like a turbo-charged version of Google Drive or Box for sharing repositories created using Git.

At first, it’s easy to mix up Git and GitHub. Just try to remember that they are separate tools that complement each other well.

Exercise 5
  1. If you don’t already have an account, sign up for GitHub.

6.5 SSH Keys for Authentication

GitHub started requiring token-based or SSH-based authentication in 2021. We will focus on creating SSH keys for authentication. For instructions on creating a personal access token for authentication, see Section 6.10 below.

We will follow the instructions for setting up SSH keys using the console, or terminal window, from Jenny Bryan’s fantastic Happy Git with R.

Exercise 5
  1. Follow the instructions for setting up SSH keys using the console. We recommend using the default key location and key name. You can choose whether or not to add a password for the key. Note that if you choose to add a password, you will need to enter that password every time you perform operations with GitHub - so make sure you’ll be able to remember it!
  2. When you get to the section of the instructions to provide the public key to GitHub, we recommend obtaining the public key as follows:
  • In a terminal window, run cat ~/.ssh/id_ed25519.pub
  • Highlight the public key that is printed to the console and copy the text.
  • Follow the instructions from Jenny Bryan at 10.5.3 to add the copied key to GitHub.

6.6 Git + GitHub Workflow

Important

Git does not work well with shared drives like Box, Google Drive, and SharePoint. Fortunately, those tools aren’t necessary for a Git + GitHub workflow.

Repository

A repository is a collection of files, often a directory, where files are organized and logged by Git.

Git and GitHub organize projects into repositories. Typically, a “repo” will correspond with the place where you started a .Rproj. When working with Git and GitHub, your files will exist in two places: locally on your computer and remotely on GitHub.

When creating a new repository, you can use either of the following alternatives:

  • Initialize the repo locally on your computer and later add the repo to GitHub
  • Initialize the repo remotely on GitHub and then copy (clone) the repo to your computer.

To create a repository (only needs to be done once per project):

git init initializes a local Git repository.

OR

git clone <link> copies a remote repository from GitHub to the location of the working directory on your computer.

Exercise 5

We’re going to create a repo remotely on GitHub and clone it to our computer.

  1. Go here to see the repository for today’s course materials.
  2. Click the green “Code” button and copy the URL under SSH.
  3. At the command line, navigate to where you want to clone the course notes.
  4. git clone https://github.com/awunderground/2024-08-03_reproducible-research-bootcamp
Exercise 6

We’re going to create a repo remotely on GitHub and clone it to our computer.

  1. Go to GitHub.
  2. Click the big green “New” button.
  3. Call the repo my-first-repo. Select a public repository and check the box that says Add a README file.
  4. Click the “Create Repository” button.
  5. Click the green “Code” button and copy the SSH link.
  6. Using the command line, navigate to a sensible place on your computer to store a project.
  7. Run git clone < link >, where you replace < link > with the SSH link you copied. This will create a folder called my-first-repo with your repo files, including the README.md file you created when you initialized the repo.

6.6.1 Basic Approach

  1. Initialize a repository for a project (we’ve already done this!).
  2. Tell Git which files to track. Track scripts. Avoid tracking data or binary files like .docx and .xlsx. 1
  3. Take a snapshot of tracked files and add a commit message.
  4. Save the tracked files to the remote GitHub repository.
  5. Repeat, repeat, repeat

Basic Git + GitHub workflow

6.6.2 Commands

git status prints out all of the important information about your repo. Use it before and after most commands to understand how code changes your repo.

git add <file-name> adds a file to staging. It says, “hey look at this!”.

git commit -m "<message>" commits changes made to added files to the repository. It says, “hey, take a snapshot of the things you looked at in the previous command.” Don’t forget the -m. 2

git push origin main pushes your local commits to the remote repository on GitHub. It says, “hey, take a look at these snapshots I just made”. It is possible to push to branches other than main. Simply replace main with the desired branch name.

git log --oneline shows the commit history in the repository.

git diff shows changes since the last commit.

Exercise 7
  1. Navigate to your example-project directory on the command line and run git status to confirm that there isn’t a git repo.
  2. Use git init to initialize a new repository.
  3. In the terminal, run git add analysis.qmd analysis.html. Then run git status.
  4. Then run git commit -m "add draft analysis" to commit the two files with a commit message. Then run git status again and observe the change.
  5. Create a repository on GitHub called example-project. Select a public repository and do not check the box that says Add a README file.
  6. Follow the instructions to …or push an existing repository from the command line.
Exercise 8
  1. Make some changes to index.qmd and re-render the document.
  2. Add your files.
  3. Commit your files.
  4. Push your files.
  5. In the GitHub repo, click on index.qmd then click on the History icon in the top right corner. Note that you’ll see your two commits. Click on the most recent commit and notice that you can see the changes that you’ve made to the file.

6.7 GitHub Pages

GitHub Pages are free websites hosted directly from a GitHub repository. With a free GitHub account, a GitHub repo must be public to create a GitHub page with that repo. When you create a GitHub page, you associate it with a specific branch of your repo. GitHub Pages will look for an index.html, index.md, or README.md file as the entry file for your site.

Exercise 9
  1. Go to your GitHub repository and select “Settings” on the top right.
  2. Under the “Code and Automations” menu on the left side, select “Pages”.
  3. Under “Build and Deployment” and “Branch”, use the drop down menu to change the branch to from “None” to “main”. This will trigger the deployment of your GitHub page from the “main” branch of your repository. It will take a bit of time for your GitHub page to be ready.
  4. Refresh the page, and when you see a box that says “your site is live” at the top of your page, click the link to go to your website!

6.8 .gitignore

We don’t want to add every file to Git or GitHub. We want to avoid binary files like Word documents. We can’t add very large files.

By default, every file in a local repository will show up after git status and those files we don’t want to add will be at risk of being added by accident. This is annoying.

Fortunately, we can use a .gitignore to ignore files and directories. This cleans up our git status and protects us from accidentally add a file we don’t want to add. To ignore a file, just add the name of the file or folder to the .gitignore.

Exercise 10
  1. Run git status.
  2. Add a file called .gitignore. In the file, add poems/.
  3. Run git status.

6.9 Conclusion

Git is a distributed version-control system. It is used for tracking changes in the code. GitHub is an online hosting service for version control using git. Key workhorse commands are git status, git add, git commit -m <message> git push and git diff. GitHub is also great because it will host websites using GitHub Pages.

6.9.1 Git is Confusing

We promise that it’s worth it.

6.9.2 Resources

6.10 Personal Access Tokens for Authentication

  1. Starting on your GitHub account navigate through the following:
    1. Click your icon in the far right
    2. Select Settings at the bottom of the drop down menu
    3. Select Developer Settings on the bottom left
    4. Select Personal access tokens on the bottom left
    5. Select Generate new token
  2. Set up your Personal Access Token (PAT)
    1. Add a note describing the use of your token. This is useful if you intend to generate multiple tokens for different uses.
    2. Select “No expiration”. You may want tokens to expire if that access sensitive resources.
    3. Select scopes. You must select at least the “repo” scope. You may want to add other scopes but they are not required for this course.
  3. Click Generate token
  4. This is your only chance to view the token. Copy and paste the token and store it somewhere safe. If you lose the token, you can always generate a new token.
  5. Git will prompt you for your GitHub username and password sometimes while cloning repositories or pushing to private repositories. Use your GitHub username when prompted for username. Use your generated PAT when prompted for password.

6.11 Initialize a Repo Locally and Add to GitHub

This only needs to happen once per repository

  1. Initialize a local repository with git init as outlined above.
  2. On GitHub, click the plus sign in the top, right corner and select New Repository.
  3. Create a repository with the same name as your directory.
  4. Copy the code under …or push an existing repository from the command line and submit it in the command line that is already open.

  1. GitHub refuses to store files larger than 100 MiB. This poses a challenge to writing reproducible code. However, many data sources can be downloaded directly from the web or via APIs, allowing code to be reproducible without relying on storing large data sets on GitHub.↩︎

  2. The -m stands for message. Writing a brief commit message like “fixes bug in data cleaning script” can help collaborators (including your future self) understand the purpose of your commits.↩︎