12  Web Scraping

Abstract
This section contains guidelines and processes for gathering information from the web using web scraping. We will focus on two approaches. First, we will learn to download many files. Second, we will learn to gather information from the bodies of websites.

12.1 Review

We explored pulling data from web APIs in DSPP1. With web APIs, data stewards have often thought carefully about how to share their information. This is frequently not the case with web scraping.

We also explored extracting data from Excel workbooks in Section 02. We will build on some of the ideas in that section.

Recall that if we have a list of elements, we can extract the \(i^{th}\) element with [[]]. For example, we can extract the third data frame from a list of data frames called data with data[[3]].
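
For instance, here is a minimal sketch using three built-in data frames (the list and object names are just for illustration):

# a list of three built-in data frames
data <- list(mtcars, iris, airquality)

# extract the third data frame with double brackets
data[[3]]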

Recall that we can use map() to iterate a function across each element of a vector. Consider the following example:

library(purrr)

times2 <- function(x) x * 2

x <- 1:3

map(.x = x, .f = times2)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

12.2 Introduction and Motivation

The Internet is an immense source of information for research. Sometimes we can easily download data of interest in an ideal format with the click of a download button or a single API call.

But it probably won’t be long until we need data that require many download button clicks. Or worse, we may want data from web pages that don’t have a download button at all.

Consider a few examples.

  • The Urban Institute’s Boosting Upward Mobility from Poverty project programmatically downloaded 51 .xlsx workbooks when building the Upward Mobility Data Tables.
  • We worked with the text of executive orders going back to the Clinton Administration when learning text analysis in DSPP1. Unfortunately, the Federal Register doesn’t publish a massive file of executive orders. So we iterated through websites for each executive order, scraped the text, and cleaned the data.
  • The Urban Institute scraped course descriptions from Florida community colleges to understand opportunities for work-based learning.
  • The Billion Prices Project web scraped millions of prices each day from online retailers. The project used the data to construct real-time price indices that limited political interference and to research concepts like price stickiness.

We will explore two approaches for gathering information from the web.

  1. Iteratively downloading files: Sometimes websites contain useful information across many files that need to be separately downloaded. We will use code to download these files. Ultimately, these files can be combined into one larger data set for research.
  2. Scraping content from the body of websites: Sometimes useful information is stored as tables or lists in the body of websites. We will use code to scrape this information and then parse and clean the result.

Sometimes we download many PDF files using the first approach. A related method that is useful for gathering information from the web, but that we will not cover here, is extracting text data from PDFs.

12.4 Programmatically Downloading Data

The County Health Rankings & Roadmaps is a source of state and local health data.

Suppose we are interested in Injury Deaths at the state level. We can click through the interface and download a .xlsx file for each state.

12.4.1 Downloading a Single File

  1. Start here.
  2. Using the interface at the bottom of the page, we can navigate to the page for “Virginia.”
  3. Next, we can click “View State Data.”
  4. Next, we can click “Download Virginia data sets.”

That’s a lot of clicks to get here.

If we want to download “2023 Virginia Data”, we can typically right-click on the link and select “Copy Link Address”. This should return one of the following two URLs:

https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx
https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx

Spaces are special characters in URLs, so they are sometimes encoded as %20. Both URLs above work in a web browser, but only the URL with %20 will work in code.
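
If we ever need to do this encoding ourselves, base R’s URLencode() is one option. A small sketch using the Virginia URL above:

# percent-encode the spaces in the URL (each space becomes %20)
URLencode("https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx")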

As we’ve seen several times before, we could use read_csv() to read the data directly from the Internet if the file were a .csv.5 Because this is an Excel file, we need to download it first, which we can do with download.file() provided we include a destfile for where the file should be saved.

download.file(
  url = "https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx", 
  destfile = "data/virginia-injury-deaths.xlsx",
  # mode = "wb" avoids corrupting binary files like .xlsx on Windows
  mode = "wb"
)
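
As a quick check that the download worked, we can peek at the workbook with library(readxl). A short sketch; the sheet names and layout depend on the file:

library(readxl)

# list the sheets in the downloaded workbook
excel_sheets("data/virginia-injury-deaths.xlsx")

# read the first sheet into a data frame
read_xlsx("data/virginia-injury-deaths.xlsx")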

12.4.2 Downloading Multiple Files

If we click through and find the links for several states, we see that all of the download links follow a common pattern. For example, the URL for Vermont is

https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Vermont Data - v2.xlsx

The URLs differ only by "Virginia" and "Vermont". If we can create a vector of URLs by swapping in each state name, then it is simple to iterate through downloading the data. We will only download data for two states, but we can imagine downloading data for many states or many counties. Here are three R tips:

  • paste0() and str_glue() from library(stringr) are useful for creating URLs and destination files (see the str_glue() sketch after the code below).
  • walk() and walk2() from library(purrr) can iterate functions. They’re like map() and map2(), but we use them when we are interested in the side effects of a function.6
  • Sometimes data are messy and we want to be polite. Custom functions can help with rate limiting and cleaning data.

download_chr <- function(url, destfile) {

  # download one file; mode = "wb" avoids corrupting binary files like .xlsx on Windows
  download.file(url = url, destfile = destfile, mode = "wb")

  # pause briefly between downloads so we don't overwhelm the server
  Sys.sleep(0.5)

}

states <- c("Virginia", "Vermont")

urls <- paste0(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023%20County%20Health%20Rankings%20",
  states,
  "%20Data%20-%20v2.xlsx"
)

output_files <- paste0("data/", states, ".xlsx")

walk2(.x = urls, .y = output_files, .f = download_chr)
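
As an aside, str_glue() can build the same URLs and file paths with string interpolation. A sketch that should produce the same downloads as the paste0() version above:

library(stringr)

# {states} is replaced with each state name in turn
urls <- str_glue(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023%20County%20Health%20Rankings%20",
  "{states}%20Data%20-%20v2.xlsx"
)

output_files <- str_glue("data/{states}.xlsx")

# the downloads then work exactly as before
walk2(.x = urls, .y = output_files, .f = download_chr)
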
Exercise 1

SOI Tax Stats - Historic Table 2 provides individual income and tax data, by state and size of adjusted gross income. The website contains a bulleted list of URLs and each URL downloads a .xlsx file.

  1. Use download.file() to download the file for Alabama.
  2. Explore the URLs using “Copy Link Address”.
  3. Iterate pulling the data for Alabama, Alaska, and Arizona.

12.5 Web Scraping with rvest

We now pivot to situations where useful information is stored in the body of web pages.

12.5.1 Web Design

It’s simple to build a website with Quarto because it abstracts away most of web development. For example, Markdown is just a shortcut to write HTML. Web scraping requires us to learn more about web development than when we use Quarto.

The user interface of websites can be built with just HTML, but most websites contain HTML, CSS, and JavaScript. Developing the interface of websites with HTML, CSS, and JavaScript is called front-end web development.

Hyper Text Markup Language

Hyper Text Markup Language (HTML) is the standard language for creating web content. HTML is a markup language, which means it has code for creating structure and formatting.

The following HTML generates a bulleted list of names.

<ul>
  <li>Alex</li>
  <li>Aaron</li>
  <li>Alena</li>
</ul>
Cascading Style Sheets

Cascading Style Sheets (CSS) describes how HTML elements should be styled when they are displayed.

For example, the following CSS adds extra space after sections with ## in our class notes.

.level2 {
  margin-bottom: 80px;
}
JavaScript

JavaScript is a programming language that runs in web browsers and is used to build interactivity in web interfaces.

Quarto comes with default CSS and JavaScript. library(leaflet) and Shiny are popular tools for building JavaScript applications with R. We will focus on web scraping using HTML and CSS.

First, we will cover a few important HTML concepts. W3Schools offers a thorough introduction. Consider the following simple website built from HTML:

<html>
<head>
<title>Hello World!</title>
</head>
<body>
<h1 class='important'>Bigger Title!</h1>
<h2 class='important'>Big Title!</h2>
<p>My first paragraph.</p>
<p id='special-paragraph'>My first paragraph.</p>
</body>
</html>

An HTML element is a start tag, some content, and an end tag. Every start tag has a matching end tag. For example, <body> and </body>. <html>, <head>, and <body> are required elements for all web pages. Other HTML elements include <h1>, <h2>, and <p>.

HTML attributes are name/value pairs that provide additional information about elements. HTML attributes are optional and are like function arguments for HTML elements.

Two HTML attributes, classes and ids, are particularly important for web scraping.

  • HTML classes are HTML attributes that label multiple HTML elements. These classes are useful for styling HTML elements using CSS. Multiple elements can have the same class.
  • HTML ids are HTML attributes that label one HTML element. Ids are useful for styling a single HTML element using CSS. Each id can be used only once in an HTML document.

We can view HTML for any website by right clicking in our web browser and selecting “View Page Source.”7

Exercise 2
  1. Inspect the HTML behind this list of “Hello World examples”.
  2. Inspect the HTML behind the Wikipedia page for Jerzy Neyman.

Second, we will explore CSS. CSS relies on HTML elements, HTML classes, and HTML ids to style HTML content. CSS selectors can directly reference HTML elements. For example, the following selectors change the style of paragraphs and titles.

p {
  color: red;
}

h1 {
  font-family: wingdings;
}

CSS selectors can reference HTML classes. For example, the following selector changes the style of HTML elements with class='important'.

.important {
  font-family: wingdings;
}

CSS selectors can also reference HTML ids. For example, the following selector changes the style of the one element with id='special-paragraph'.

#special-paragraph {
  color: pink;
}
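
To foreshadow how these selectors matter for scraping, here is a small sketch that parses an abbreviated version of the toy page from above with rvest’s minimal_html() and pulls out elements by class and by id (html_elements() and html_text2() are introduced more fully in the next section):

library(rvest)

# parse a snippet of the toy page from a string
toy_page <- minimal_html('
  <h1 class="important">Bigger Title!</h1>
  <h2 class="important">Big Title!</h2>
  <p>My first paragraph.</p>
  <p id="special-paragraph">My first paragraph.</p>
')

# "." selects by class: both headings have class="important"
toy_page |>
  html_elements(css = ".important") |>
  html_text2()

# "#" selects by id: only one element has id="special-paragraph"
toy_page |>
  html_element(css = "#special-paragraph") |>
  html_text2()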

We can explore CSS by right-clicking a page and selecting “Inspect.” Most modern websites have a lot of HTML and a lot of CSS. We can find the CSS for specific elements with the element-selection button at the top left of the panel that appears.

Inspecting CSS

12.5.2 Tables

library(rvest) is the main tool for scraping static websites with R. We’ll start with examples that contain information in HTML tables.8

HTML tables store tabular information in websites using the <table>, <tr>, <th>, and <td> elements. If the data of interest are stored in tables, then it can be trivial to scrape the information.
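
To make this concrete, here is a minimal sketch that parses a tiny, made-up table from a string and converts it with html_table():

library(rvest)

# a tiny HTML table parsed from a string
toy_table <- minimal_html('
  <table>
    <tr> <th>name</th> <th>age</th> </tr>
    <tr> <td>Alex</td> <td>30</td> </tr>
    <tr> <td>Alena</td> <td>25</td> </tr>
  </table>
')

# html_table() converts every <table> on the page into a tibble
html_table(toy_table)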

Consider the Wikipedia page for the 2012 Presidential Election. We can scrape all 46 tables from the page with two lines of code. We use the Wayback Machine to ensure the content is stable.

library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
tables <- read_html("https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election") |>
  html_table()

Suppose we are interested in the table summarizing the overall results of the election. We can extract that element from the list of tables.

tables[[18]]
# A tibble: 12 × 9
   `Presidential candidate`     Party `Home state` `Popular vote` `Popular vote`
   <chr>                        <chr> <chr>        <chr>          <chr>         
 1 "Presidential candidate"     Party Home state   Count          Percentage    
 2 "Barack Hussein Obama II"    Demo… Illinois     65,915,795     51.06%        
 3 "Willard Mitt Romney"        Repu… Massachuset… 60,933,504     47.20%        
 4 "Gary Earl Johnson"          Libe… New Mexico   1,275,971      0.99%         
 5 "Jill Ellen Stein"           Green Massachuset… 469,627        0.36%         
 6 "Virgil Hamlin Goode Jr."    Cons… Virginia     122,389        0.11%         
 7 "Roseanne Cherrie Barr"      Peac… Utah         67,326         0.05%         
 8 "Ross Carl \"Rocky\" Anders… Just… Utah         43,018         0.03%         
 9 "Thomas Conrad Hoefling"     Amer… Nebraska     40,628         0.03%         
10 "Other"                      Other Other        217,152        0.17%         
11 "Total"                      Total Total        129,085,410    100%          
12 "Needed to win"              Need… Needed to w… Needed to win  Needed to win 
# ℹ 4 more variables: Electoralvote <chr>, `Running mate` <chr>,
#   `Running mate` <chr>, `Running mate` <chr>

Of course, we want to be polite. library(polite) makes this very simple. “The three pillars of a polite session are seeking permission, taking slowly and never asking twice.”

We’ll use bow() to start a session and declare our user agent, and scrape() instead of read_html().9

library(polite)

session <- bow(
  url = "https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election",
  user_agent = "Georgetown students learning scraping -- arw109@georgetown.edu"
)

session
<polite session> https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election
    User-agent: Georgetown students learning scraping -- arw109@georgetown.edu
    robots.txt: 1 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent
election_page <- session |>
  scrape() 
  
tables <- election_page |>
  html_table()

tables[[18]]
# A tibble: 12 × 9
   `Presidential candidate`     Party `Home state` `Popular vote` `Popular vote`
   <chr>                        <chr> <chr>        <chr>          <chr>         
 1 "Presidential candidate"     Party Home state   Count          Percentage    
 2 "Barack Hussein Obama II"    Demo… Illinois     65,915,795     51.06%        
 3 "Willard Mitt Romney"        Repu… Massachuset… 60,933,504     47.20%        
 4 "Gary Earl Johnson"          Libe… New Mexico   1,275,971      0.99%         
 5 "Jill Ellen Stein"           Green Massachuset… 469,627        0.36%         
 6 "Virgil Hamlin Goode Jr."    Cons… Virginia     122,389        0.11%         
 7 "Roseanne Cherrie Barr"      Peac… Utah         67,326         0.05%         
 8 "Ross Carl \"Rocky\" Anders… Just… Utah         43,018         0.03%         
 9 "Thomas Conrad Hoefling"     Amer… Nebraska     40,628         0.03%         
10 "Other"                      Other Other        217,152        0.17%         
11 "Total"                      Total Total        129,085,410    100%          
12 "Needed to win"              Need… Needed to w… Needed to win  Needed to win 
# ℹ 4 more variables: Electoralvote <chr>, `Running mate` <chr>,
#   `Running mate` <chr>, `Running mate` <chr>
Exercise 3
  1. Install and load library(rvest).
  2. Install and load library(polite).
  3. Scrape the Presidential debates table from the Wikipedia article for the 2008 presidential election.

12.5.3 Other HTML Content

Suppose we want to scrape every URL in the body of the 2012 Presidential Election webpage. html_table() no longer works.

We could manually poke through the source code to find the appropriate CSS selectors. Fortunately, SelectorGadget often eliminates this tedious work by reporting the CSS selector for the elements we click on.

  1. Click the SelectorGadget browser extension. You may need to click the puzzle piece to the right of the address bar first and then click the SelectorGadget browser extension.
  2. Select an element you want to scrape. The elements associated with the CSS selector provided at the bottom will be in green and yellow.
  3. If SelectorGadget selects too few elements, select additional elements. If SelectorGadget selects too many elements, click those elements. They should turn red.

Each click should refine the CSS selector.

After a few clicks, it’s clear we want p a. This selects every a element nested inside a p element. a is the HTML element for hyperlinks.

We’ll need a few more functions to finish this example.

  • html_elements() filters the output of read_html()/scrape() based on the provided CSS selector. html_elements() can return multiple elements while html_element() always returns one element.
  • html_text2() retrieves text from HTML elements.
  • html_attrs() retrieves HTML attributes from HTML elements. html_attrs() can return multiple attributes while html_attr() always returns one attribute.
tibble(
  text = election_page |>
    html_elements(css = "p a") |>
    html_text2(),
  url = election_page |>
    html_elements(css = "p a") |>
    html_attr(name = "href")
)
# A tibble: 355 × 2
   text                  url                                                    
   <chr>                 <chr>                                                  
 1 Barack Obama          /web/20230814004444/https://en.wikipedia.org/wiki/Bara…
 2 Democratic            /web/20230814004444/https://en.wikipedia.org/wiki/Demo…
 3 Barack Obama          /web/20230814004444/https://en.wikipedia.org/wiki/Bara…
 4 Democratic            /web/20230814004444/https://en.wikipedia.org/wiki/Demo…
 5 presidential election /web/20230814004444/https://en.wikipedia.org/wiki/Unit…
 6 Democratic            /web/20230814004444/https://en.wikipedia.org/wiki/Demo…
 7 President             /web/20230814004444/https://en.wikipedia.org/wiki/Pres…
 8 Barack Obama          /web/20230814004444/https://en.wikipedia.org/wiki/Bara…
 9 running mate          /web/20230814004444/https://en.wikipedia.org/wiki/Runn…
10 Vice President        /web/20230814004444/https://en.wikipedia.org/wiki/Vice…
# ℹ 345 more rows
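
Note that the href values above are relative to the Wayback Machine host. If we want complete URLs, url_absolute() from library(xml2) can resolve them against a base URL. A sketch, reusing election_page from above:

library(xml2)

# resolve the relative hrefs against the archive host
election_page |>
  html_elements(css = "p a") |>
  html_attr(name = "href") |>
  url_absolute(base = "https://web.archive.org") |>
  head()
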
Exercise 4

Suppose we are interested in examples of early websites. Wikipedia has a list of URLs from before 1995.

  1. Add the SelectorGadget web extension to your browser.
  2. Use library(polite) and library(rvest) to scrape() the following URL.
https://web.archive.org/web/20230702163608/https://en.wikipedia.org/wiki/List_of_websites_founded_before_1995
  3. We are interested in scraping the names of early websites and their URLs. Use SelectorGadget to determine the CSS selectors associated with these HTML elements.
  4. Create a tibble with a variable called name and a variable called url.
  5. Remove duplicate rows with distinct() or filter().
Exercise 5
  1. Find your own HTML table of interest to scrape.
  2. Use library(rvest) and library(polite) to scrape the table.

12.6 Conclusion

Scraping data from websites is a powerful way of collecting data at scale, or when the data are not organized in easily downloadable files like CSVs. Iteratively downloading files is an elegant alternative to the time-intensive and potentially prohibitive process of visiting websites and repeatedly downloading individual data sets. Scraping content from the body of websites is a more sophisticated approach that involves determining a website’s HTML structure and then using that knowledge to extract key elements of the text. We strongly encourage you to consider the legal and ethical risks of downloading these data.


  1. We are not lawyers. This is not official legal advice. If in doubt, please contact a legal professional.↩︎

  2. This blog and this blog support this statement. Again, we are not lawyers and the HiQ Labs v. LinkedIn decision is complicated because of its long history and conclusion in settlement.↩︎

  3. The scale of crawling is so great that there is concern about models converging once all models use the same massive training data. Common Crawl is one example. This isn’t a major issue for generating images but model homogeneity is a big concern in finance.↩︎

  4. Who deserves privacy is underdiscussed and inconsistent. Every year, newspapers across the country FOIA information about government employees and publish their full names, job titles, and salaries.↩︎

  5. Reading directly from a URL means that code that once worked can break if the file moves, but using read_csv(<file_path>) to access data once it has been downloaded will work consistently.↩︎

  6. The only difference between map() and walk() is their outputs. map() returns the results of a function in a list. walk() returns nothing when used without assignment, and we never use walk() with assignment. walk() is useful when we don’t care about the output of functions and are only interested in their “side-effects”. Common functions to use with walk() are ggsave() and write_csv(). For more information on walk(), see Advanced R.↩︎

  7. We recommend using Google Chrome, which has excellent web development tools.↩︎

  8. If a website is static, that means that the website is not interactive and will remain the same unless the administrator actively makes changes. Hello World examples is an example of a static website.↩︎

  9. The polite documentation describes the bow() function as being used to “introduce the client to the host and ask for permission to scrape (by inquiring against the host’s robots.txt file).”↩︎