12 Web Scraping
12.1 Review
We explored pulling data from web APIs in DSPP1. With web APIs, data stewards have often thought carefully about how to share information. This will not be the case with web scraping.
We also explored extracting data from Excel workbooks in Section 02. We will build on some of the ideas in that section.
Recall that if we have a list of elements, we can extract the \(i^{th}\) element with [[ ]]. For example, we can extract the third data frame from a list of data frames called data with data[[3]].
Recall that we can use map() to iterate a function across each element of a vector. Consider the following example:
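The original example isn't preserved here; a minimal sketch that doubles each element of a short vector produces the list output shown below.

library(purrr)

# map() applies the function to each element and returns a list
map(c(1, 2, 3), \(x) x * 2)

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6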
12.2 Introduction and Motivation
The Internet is an immense source of information for research. Sometimes we can easily download data of interest in an ideal format with the click of a download button or a single API call.
But it probably won’t be long until we need data that require many download button clicks. Or worse, we may want data from web pages that don’t have a download button at all.
Consider a few examples.
- The Urban Institute’s Boosting Upward Mobility from Poverty project programmatically downloaded 51 .xlsx workbooks when building the Upward Mobility Data Tables.
- We worked with the text of executive orders going back to the Clinton Administration when learning text analysis in DSPP1. Unfortunately, the Federal Register doesn’t publish a massive file of executive orders. So we iterated through websites for each executive order, scraped the text, and cleaned the data.
- The Urban Institute scraped course descriptions from Florida community colleges to understand opportunities for work-based learning.
- The Billion Prices Project web scraped millions of prices each day from online retailers. The project used the data to construct real-time price indices that limited political interference and to research concepts like price stickiness.
We will explore two approaches for gathering information from the web.
- Iteratively downloading files: Sometimes websites contain useful information across many files that need to be separately downloaded. We will use code to download these files. Ultimately, these files can be combined into one larger data set for research.
- Scraping content from the body of websites: Sometimes useful information is stored as tables or lists in the body of websites. We will use code to scrape this information and then parse and clean the result.
Sometimes we download many PDF files using the first approach. A related method that is useful for gathering information from the web, but that we will not cover, is extracting text data from PDFs.
12.3 Legal and Ethical Considerations
It is important to consider the legal and ethical implications of any data collection. Collecting data from the web through methods like web scraping raises serious ethical and legal considerations.
12.3.1 Legal1
Different countries have different laws that affect web scraping. The United States has different laws and legal interpretations than countries in Europe, which are largely regulated by the European Union. In general, the United States has more relaxed policies than the European Union when it comes to gathering data from the web.
R for Data Science (2e) contains a clear and approachable rundown of legal considerations for gathering information from the web. We adopt their three-part standard of “public, non-personal, and factual”, whose parts relate to terms of service, personally identifiable information, and copyright.
We will focus solely on laws in the United States.
Terms of Service
The legal environment for web scraping is in flux, but US Courts have created an environment that is legally supportive of gathering public information from the web.
First, we need to understand how many websites bar web scraping. Second, we need to understand when we can ignore these rules.
A terms of service is a list of rules posted by the provider of a website, web service, or software.
Terms of Service for many websites bar web scraping.
For example, LinkedIn’s Terms of Service says users agree to not “Develop, support or use software, devices, scripts, robots or any other means or processes (including crawlers, browser plugins and add-ons or any other technology) to scrape the Services or otherwise copy profiles and other data from the Services;”
This sounds like the end of web scraping, but as Wickham, Çetinkaya-Rundel, and Grolemund (2023) note, Terms of Service end up being a “legal land grab” for companies. It isn’t clear how LinkedIn would legally enforce this. HiQ Labs v. LinkedIn from the United States Court of Appeals for the Ninth Circuit bars Computer Fraud and Abuse Act (CFAA) claims against web scraping public information.2
We follow a simple guideline: it is acceptable to scrape information when we don’t need to create an account.
12.3.2 PII
Personally Identifiable Information (PII) is any information that can be used to directly identify an individual.
Public information on the Internet often contains PII, which raises legal and ethical challenges. We will discuss the ethics of PII later.
The legal considerations are trans-Atlantic. The General Data Protection Regulation (GDPR) is a European Union regulation about information privacy. It contains strict rules about the collection and storage of PII. It applies to almost everyone collecting data inside the EU. The GDPR is also extraterritorial, which means its rules can apply outside of the EU under certain circumstances like when an American company gathers information about EU individuals.
We will avoid gathering PII, so these considerations will not apply to our work.
Copyright
Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. Works of authorship include the following categories:
- literary works;
- musical works, including any accompanying words;
- dramatic works, including any accompanying music;
- pantomimes and choreographic works;
- pictorial, graphic, and sculptural works;
- motion pictures and other audiovisual works;
- sound recordings; and
- architectural works.
In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.
Our final legal concern for gathering information from the Internet is copyright law. We have two main options for avoiding copyright limitations.
- We can avoid copyright protections by not scraping authored content in the protected categories (e.g., literary works and sound recordings). Fortunately, factual data are not typically protected by copyright.
- We can scrape information that is fair use. This is important if we want to use images, films, music, or extended text as data.
We will focus on data that are not copyrighted.
12.3.3 Ethical
We now turn to ethical considerations and some of the best practices for gathering information from the web. In general, we will aim to be polite, give credit, and respect individual information.
Be polite
It is expensive and time-consuming to host data on the web. Hosts bear a small cost every time we access a website, and that cost can grow quickly with repeated queries. Just like with web APIs, we want to pace our access to be polite.
Rate limiting is the intentional slowing of web traffic for a user or users.
We will use Sys.sleep() in custom functions to slow our web scraping and ease its burden on web hosts.
robots.txt tells web crawlers and scrapers which URLs the crawler is allowed to access on a website.
Many websites contain a robots.txt file. Consider examples from the Urban Institute and White House.
We can manually look at the robots.txt. For example, just visit https://www.urban.org/robots.txt or https://www.whitehouse.gov/robots.txt. We can also use library(polite), which will automatically look at the robots.txt.
Give Credit
Academia and the research profession undervalue the collection and curation of data. Generally speaking, no one gets tenure for constructing even the most important data sets. It is important to give credit for data accessed from the web. Ideally, add the citation to Zotero and then easily add it to your manuscript in Quarto.
Be sure to make it easy for others to cite data sets that you create. Include an example citation like IPUMS or create a DOI for your data.
The rise of generative AI models like GPT-3, Stable Diffusion, and DALL-E 2 makes considerations of giving credit urgent. These models consume massive amounts of training data, and it isn’t clear where the training data come from or what the legal and ethical implications of those data are.3
Consider a few current events:
- Sarah Silverman is suing OpenAI because she “never gave permission for OpenAI to ingest the digital version of her 2010 book to train its AI models, and it was likely stolen from a ‘shadow library’ of pirated works.”
- Somepalli et al. (2023) use state-of-the-art image retrieval models to find that generative AI models like the popular Stable Diffusion model “blatantly copy from their training data.” This is a major problem if the training data are copyrighted. The first page of their paper (here) contains some dramatic examples.
- Finally, this Harvard Business Review article discusses the intellectual property problem facing generative AI.
Respect Individual Information
Data science methods should adhere to the same ethical standards as any research method. The social sciences have ethical norms about protecting privacy (discussed later) and informed consent.
Let’s consider an example. In 2016, researchers posted data about 70,000 OkCupid accounts. The data didn’t contain names but did contain usernames. The data also contained many sensitive variables including topics like sexual habits and politics.
The release drew strong reactions from some research ethicists including Michael Zimmer and Os Keyes.4
Fellegi (1972) defines data privacy as the ability “to determine what information about ourselves we will share with others”. Maybe OkCupid users made the decision to forego confidentiality when they published their accounts. Many institutional ethics committees do not require informed consent for public data.
Ravn, Barnwell, and Barbosa Neves (2020) do a good job developing a conceptual framework that bridges the gap between the view that all public data require informed consent and the view that no public data require informed consent.
It’s possible to conceive of a purely observational web scraping research project that adheres to the ethical standards of research and yet contains potentially disclosive information about individuals. Fortunately, researchers can typically turn to Institutional Review Boards and research ethicists to navigate these questions.
As a basic standard, we will avoid collecting PII and use anonymization techniques to limit the risk of re-identification.
We will also focus on applications where the host of information crudely shares the information. There are ample opportunities to create value by gathering information from government sources and converting it into more useful formats. For example, the government too often shares information in .xls and .xlsx files, clunky web interfaces, and PDFs.
12.4 Programmatically Downloading Data
The County Health Rankings & Roadmaps is a source of state and local information.
Suppose we are interested in Injury Deaths at the state level. We can click through the interface and download a .xlsx file for each state.
12.4.1 Downloading a Single File
- Start here.
- Using the interface at the bottom of the page, we can navigate to the page for “Virginia.”
- Next, we can click “View State Data.”
- Next, we can click “Download Virginia data sets.”
That’s a lot of clicks to get here.
If we want to download “2023 Virginia Data”, we can typically right click on the link and select “Copy Link Address”. This should return one of the following two URLs:
https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx
https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx
Spaces are special characters in URLs and they are sometimes encoded as %20. Both URLs above work in the web browser, but only the URL with %20 will work with code.
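If we start from the human-readable URL, base R’s URLencode() will percent-encode the spaces for us. A quick sketch:

# URLencode() replaces the spaces with %20 so the URL can be used in code
URLencode(
  "https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx"
)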
As we’ve seen several times before, we could use read_csv() to directly download the data from the Internet if the file were a .csv.5 Because this file is an Excel workbook, we need to download it first, which we can do with download.file() provided we include a destfile.
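A minimal sketch of that download (the destination path is illustrative, and mode = "wb" keeps the binary .xlsx file intact on Windows):

download.file(
  url = "https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx",
  destfile = "data/Virginia.xlsx", # illustrative destination path
  mode = "wb"                      # write the file as binary
)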
12.4.2 Downloading Multiple Files
If we click through and find the links for several states, we see that all of the download links follow a common pattern. For example, the URL for Vermont is
https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Vermont Data - v2.xlsx
The URLs only differ by "Virginia" and "Vermont". If we can create a vector of URLs by changing the state name, then it is simple to iterate downloading the data. We will only download data for two states, but we can imagine downloading data for many states or many counties. Here are three R tips:
- paste0() and str_glue() from library(stringr) are useful for creating URLs and destination files.
- walk() from library(purrr) can iterate functions. It’s like map(), but we use it when we are interested in the side-effect of a function.6
- Sometimes data are messy and we want to be polite. Custom functions can help with rate limiting and cleaning data.
# download one file and then pause briefly to rate limit our requests
download_chr <- function(url, destfile) {
  download.file(url = url, destfile = destfile)
  Sys.sleep(0.5)
}
states <- c("Virginia", "Vermont")

urls <- paste0(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023%20County%20Health%20Rankings%20",
  states,
  "%20Data%20-%20v2.xlsx"
)

output_files <- paste0("data/", states, ".xlsx")

walk2(.x = urls, .y = output_files, .f = download_chr)
12.5 Web Scraping with rvest
We now pivot to situations where useful information is stored in the body of web pages.
12.5.1 Web Design
It’s simple to build a website with Quarto because it abstracts away most of web development. For example, Markdown is just a shortcut to write HTML. Web scraping requires us to learn more about web development than when we use Quarto.
The user interface of websites can be built with just HTML, but most websites contain HTML, CSS, and JavaScript. The development of website interfaces with HTML, CSS, and JavaScript is called front-end web development.
Hyper Text Markup Language (HTML) is the standard language for creating web content. HTML is a markup language, which means it has code for creating structure and formatting.
The following HTML generates a bulleted list of names.
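The original snippet isn’t reproduced here; a minimal sketch with illustrative names looks like this:

<ul>
  <li>Ada Lovelace</li>
  <li>Grace Hopper</li>
  <li>John Tukey</li>
</ul>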
Cascading Style Sheets (CSS) describes how HTML elements should be styled when they are displayed.
For example, the following CSS adds extra space after sections with ## in our class notes.
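The exact rule from the notes isn’t shown here. A sketch, assuming Quarto wraps each ## section in a section element with class level2, might look like this:

/* add extra space after each second-level (##) section */
section.level2 {
  margin-bottom: 3em;
}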
JavaScript is a programming language that runs in web browsers and is used to build interactivity in web interfaces.
Quarto comes with default CSS and JavaScript. library(leaflet) and Shiny are popular tools for building JavaScript applications with R. We will focus on web scraping using HTML and CSS.
First, we will cover a few important HTML concepts. W3Schools offers a thorough introduction. Consider the following simple website built from HTML:
<html>
<head>
  <title>Hello World!</title>
</head>
<body>
  <h1 class='important'>Bigger Title!</h1>
  <h2 class='important'>Big Title!</h2>
  <p>My first paragraph.</p>
  <p id='special-paragraph'>My first paragraph.</p>
</body>
</html>
An HTML element is a start tag, some content, and an end tag. Every start tag has a matching end tag. For example, <body> and </body>. <html>, <head>, and <body> are required elements for all web pages. Other HTML elements include <h1>, <h2>, and <p>.
HTML attributes are name/value pairs that provide additional information about elements. HTML attributes are optional and are like function arguments for HTML elements.
Two HTML attributes, classes and ids, are particularly important for web scraping.
- HTML classes are HTML attributes that label multiple HTML elements. These classes are useful for styling HTML elements using CSS. Multiple elements can have the same class.
- HTML ids are HTML attributes that label one HTML element. Ids are useful for styling singular HTML elements using CSS. Each ID can be used only one time in an HTML document.
We can view HTML for any website by right clicking in our web browser and selecting “View Page Source.”7
Second, we will explore CSS. CSS relies on HTML elements, HTML classes, and HTML ids to style HTML content. CSS selectors can directly reference HTML elements. For example, the following selectors change the style of paragraphs and titles.
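The original rules aren’t reproduced here; a sketch with illustrative properties:

/* style every paragraph */
p {
  font-size: 16px;
}

/* style every first-level title */
h1 {
  color: navy;
}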
CSS selectors can reference HTML classes. For example, the following selector changes the style of HTML elements with class='important'.
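A sketch with an illustrative property:

/* style every element with class='important' */
.important {
  font-weight: bold;
}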
CSS selectors can also reference HTML ids. For example, the following selector changes the style of the one element with id='special-paragraph'.
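Again, a sketch with an illustrative property:

/* style the single element with id='special-paragraph' */
#special-paragraph {
  color: red;
}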
We can explore CSS by right clicking and selecting Inspect. Most modern websites have a lot of HTML and a lot of CSS. We can find the CSS for specific elements on a website with the element selection button at the top left of the developer tools panel that appears.
12.5.2 Tables
library(rvest) is the main tool for scraping static websites with R. We’ll start with examples that contain information in HTML tables.8
HTML tables store tabular information in websites using the <table>, <tr>, <th>, and <td> elements. If the data of interest are stored in tables, then it can be trivial to scrape the information.
Consider the Wikipedia page for the 2012 Presidential Election. We can scrape all 46 tables from the page with two lines of code. We use the Wayback Machine to ensure the content is stable.
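A sketch of those two lines (the archived URL matches the one used with library(polite) below; the object names are our own):

library(rvest)

# read the archived page once and parse every <table> element into a tibble
election_page <- read_html("https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election")

tables <- html_table(election_page)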
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
Suppose we are interested in the table of overall election results. We can extract that element from the list of tables.
# A tibble: 12 × 9
`Presidential candidate` Party `Home state` `Popular vote` `Popular vote`
<chr> <chr> <chr> <chr> <chr>
1 "Presidential candidate" Party Home state Count Percentage
2 "Barack Hussein Obama II" Demo… Illinois 65,915,795 51.06%
3 "Willard Mitt Romney" Repu… Massachuset… 60,933,504 47.20%
4 "Gary Earl Johnson" Libe… New Mexico 1,275,971 0.99%
5 "Jill Ellen Stein" Green Massachuset… 469,627 0.36%
6 "Virgil Hamlin Goode Jr." Cons… Virginia 122,389 0.11%
7 "Roseanne Cherrie Barr" Peac… Utah 67,326 0.05%
8 "Ross Carl \"Rocky\" Anders… Just… Utah 43,018 0.03%
9 "Thomas Conrad Hoefling" Amer… Nebraska 40,628 0.03%
10 "Other" Other Other 217,152 0.17%
11 "Total" Total Total 129,085,410 100%
12 "Needed to win" Need… Needed to w… Needed to win Needed to win
# ℹ 4 more variables: Electoralvote <chr>, `Running mate` <chr>,
# `Running mate` <chr>, `Running mate` <chr>
Of course, we want to be polite. library(polite) makes this very simple. “The three pillars of a polite session are seeking permission, taking slowly and never asking twice.”
We’ll use bow() to start a session and declare our user agent, and scrape() instead of read_html().9
library(polite)

session <- bow(
  url = "https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election",
  user_agent = "Georgetown students learning scraping -- arw109@georgetown.edu"
)

session
<polite session> https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election
User-agent: Georgetown students learning scraping -- arw109@georgetown.edu
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
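With the session established, a sketch of the scraping step (the table printed below is one element of the resulting list, extracted with [[ ]] as before):

# scrape() politely downloads the page through the session
election_page <- scrape(session)

# html_table() again parses every table into a list of tibbles
tables <- html_table(election_page)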
# A tibble: 12 × 9
`Presidential candidate` Party `Home state` `Popular vote` `Popular vote`
<chr> <chr> <chr> <chr> <chr>
1 "Presidential candidate" Party Home state Count Percentage
2 "Barack Hussein Obama II" Demo… Illinois 65,915,795 51.06%
3 "Willard Mitt Romney" Repu… Massachuset… 60,933,504 47.20%
4 "Gary Earl Johnson" Libe… New Mexico 1,275,971 0.99%
5 "Jill Ellen Stein" Green Massachuset… 469,627 0.36%
6 "Virgil Hamlin Goode Jr." Cons… Virginia 122,389 0.11%
7 "Roseanne Cherrie Barr" Peac… Utah 67,326 0.05%
8 "Ross Carl \"Rocky\" Anders… Just… Utah 43,018 0.03%
9 "Thomas Conrad Hoefling" Amer… Nebraska 40,628 0.03%
10 "Other" Other Other 217,152 0.17%
11 "Total" Total Total 129,085,410 100%
12 "Needed to win" Need… Needed to w… Needed to win Needed to win
# ℹ 4 more variables: Electoralvote <chr>, `Running mate` <chr>,
# `Running mate` <chr>, `Running mate` <chr>
12.5.3 Other HTML Content
Suppose we want to scrape every URL in the body of the 2012 Presidential Election webpage. html_table() no longer works.
We could manually poke through the source code to find the appropriate CSS selectors. Fortunately, SelectorGadget often eliminates this tedious work by telling us the names of the HTML elements we click on.
- Click the SelectorGadget browser extension. You may need to click the puzzle piece to the right of the address bar and then click the SelectorGadget browser extension.
- Select an element you want to scrape. The elements associated with the CSS selector provided at the bottom will be in green and yellow.
- If SelectorGadget selects too few elements, select additional elements. If SelectorGadget selects too many elements, click those elements. They should turn red.
Each click should refine the CSS selector.
After a few clicks, it’s clear we want p a. This should select any element a inside an element p. a is the element for URLs.
We’ll need a few more functions to finish this example.
- html_elements() filters the output of read_html()/scrape() based on the provided CSS selector. html_elements() can return multiple elements while html_element() always returns one element.
- html_text2() retrieves text from HTML elements.
- html_attrs() retrieves HTML attributes from HTML elements. html_attrs() can return multiple attributes while html_attr() always returns one attribute.
tibble(
  text = election_page |>
    html_elements(css = "p a") |>
    html_text2(),
  url = election_page |>
    html_elements(css = "p a") |>
    html_attr(name = "href")
)
# A tibble: 355 × 2
text url
<chr> <chr>
1 Barack Obama /web/20230814004444/https://en.wikipedia.org/wiki/Bara…
2 Democratic /web/20230814004444/https://en.wikipedia.org/wiki/Demo…
3 Barack Obama /web/20230814004444/https://en.wikipedia.org/wiki/Bara…
4 Democratic /web/20230814004444/https://en.wikipedia.org/wiki/Demo…
5 presidential election /web/20230814004444/https://en.wikipedia.org/wiki/Unit…
6 Democratic /web/20230814004444/https://en.wikipedia.org/wiki/Demo…
7 President /web/20230814004444/https://en.wikipedia.org/wiki/Pres…
8 Barack Obama /web/20230814004444/https://en.wikipedia.org/wiki/Bara…
9 running mate /web/20230814004444/https://en.wikipedia.org/wiki/Runn…
10 Vice President /web/20230814004444/https://en.wikipedia.org/wiki/Vice…
# ℹ 345 more rows
12.6 Conclusion
Scraping data from websites is a powerful way to collect data at scale, even when those data are not organized in easily downloadable files like CSVs. Iteratively downloading files is an elegant alternative to the time-intensive and potentially prohibitive process of visiting websites and repeatedly downloading individual data sets. Scraping content from the body of websites is a more sophisticated approach that involves determining a website’s HTML structure and then using that knowledge to extract key elements of the text. We strongly encourage you to consider the legal and ethical risks of gathering these data.
We are not lawyers. This is not official legal advice. If in doubt, please contact a legal professional.↩︎
This blog and this blog support this statement. Again, we are not lawyers and the HiQ Labs v. LinkedIn decision is complicated because of its long history and conclusion in settlement.↩︎
The scale of crawling is so great that there is concern about models converging once all models use the same massive training data. Common Crawl is one example. This isn’t a major issue for generating images but model homogeneity is a big concern in finance.↩︎
Who deserves privacy is underdiscussed and inconsistent. Every year, newspapers across the country FOIA information about government employees and publish their full names, job titles, and salaries.↩︎
Consequently, code that may once have worked can break, but using read_csv(<file_path>) to access data once it has been downloaded will work consistently.↩︎
The only difference between map() and walk() is their outputs. map() returns the results of a function in a list. walk() returns nothing when used without assignment, and we never use walk() with assignment. walk() is useful when we don’t care about the output of functions and are only interested in their “side-effects”. Common functions to use with walk() are ggsave() and write_csv(). For more information on walk(), see Advanced R.↩︎
We recommend using Google Chrome, which has excellent web development tools.↩︎
If a website is static, that means that the website is not interactive and will remain the same unless the administrator actively makes changes. The Hello World example above is an example of a static website.↩︎
The polite documentation describes the bow() function as being used to “introduce the client to the host and ask for permission to scrape (by inquiring against the host’s robots.txt file).”↩︎