12 Web Scraping

Abstract

This section contains guidelines and processes for gathering information from the web using web scraping. We will focus on two approaches. First, we will learn to download many files. Second, we will learn to gather information from the bodies of websites.

12.1 Review

We explored pulling data from web APIs in DSPP1. With web APIs, stewards are often carefully thinking about how to share information. This will not be the case with web scraping.

We also explored extracting data from Excel workbooks in Section 02. We will build on some of the ideas in that section.

Recall that if we have a list of elements, we can extract the \(i^{th}\) element with [[]]. For example, we can extract the third data frame from a list of data frames called data with data[[3]].

Recall that we can use map() to iterate a function across each element of a vector. Consider the following example:

times2 <- function(x) x * 2

x <- 1:3

map(.x = x, .f = times2)

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

12.2 Introduction and Motivation

The Internet is an immense source of information for research. Sometimes we can easily download data of interest in an ideal format with the click of a download button or a single API call.

But it probably won’t be long until we need data that require many download button clicks. Or worse, we may want data from web pages that don’t have a download button at all.

Consider a few examples.

The Urban Institute’s Boosting Upward Mobility from Poverty project programmatically downloaded 51 .xslx workbooks when building the Upward Mobility Data Tables.
We worked with the text of executive orders going back to the Clinton Administration when learning text analysis in DSPP1. Unfortunately, the Federal Register doesn’t publish a massive file of executive orders. So we iterated through websites for each executive order, scraped the text, and cleaned the data.
The Urban Institute scraped course descriptions from Florida community colleges to understand opportunities for work-based learning.
The Billion Prices Project web scraped millions of prices each day from online retailers. The project used the data to construct real-time price indices that limited political interference and to research concepts like price stickiness.

We will explore two approaches for gathering information from the web.

Iteratively downloading files: Sometimes websites contain useful information across many files that need to be separately downloaded. We will use code to download these files. Ultimately, these files can be combined into one larger data set for research.
Scraping content from the body of websites: Sometimes useful information is stored as tables or lists in the body of websites. We will use code to scrape this information and then parse and clean the result.

Sometimes we download many PDF files using the first approach. A related method that we will not cover that is useful for gathering information from the web is extracting text data from PDFs.

12.3 Legal and Ethical Considerations

It is important to consider the legal and ethical implications of any data collection. Collecting data from the web through methods like web scraping raises serious ethical and legal considerations.

12.3.1 Legal¹

Different countries have different laws that affect web scraping. The United States has different laws and legal interpretations than countries in Europe, which are largely regulated by the European Union. In general, the United States has more relaxed policies than the European when it comes to gathering data from the web.

R for Data Science (2e) contains a clear and approachable rundown of legal consideration for gathering information for the web. We adopt their three-part standard of “public, non-personal, and factual”, which relate to terms of service, personally identifiable information, and copyright.

We will focus solely on laws in the United States.

Terms of Service

The legal environment for web scraping is in flux, but US Courts have created an environment that is legally supportive of gathering public information from the web.

First, we need to understand how many websites bar web scraping. Second, we need to understand when we can ignore these rules.

A terms of service is a list of rules posted by the provider of a website, web service, or software.

For example, LinkedIn’s Terms of Service says users agree to not “Develop, support or use software, devices, scripts, robots or any other means or processes (including crawlers, browser plugins and add-ons or any other technology) to scrape the Services or otherwise copy profiles and other data from the Services;”

This sounds like the end of web scraping, but as Wickham, Çetinkaya-Rundel, and Grolemund (2023) note, Terms of Service end up being a “legal land grab” for companies. It isn’t clear how LinkedIn would legally enforce this. HiQ Labs v. LinkedIn from the United States Court of Appeals for the Ninth Circuit bars Computer Fraud and Abuse Act (CFAA) claims against web scraping public information.²

We follow a simple guideline: it is acceptable to scrape information when we don’t need to create an account.

12.3.2 PII

Personal Identifiable Information

Personal Identifiable Information (PII) is any information that can be used to directly identify an individual.

Public information on the Internet often contains PII, which raises legal and ethical challenges. We will discuss the ethics of PII later.

The legal considerations are trans-Atlantic. The General Data Protection Regulation (GDPR) is a European Union regulation about information privacy. It contains strict rules about the collection and storage of PII. It applies to almost everyone collecting data inside the EU. The GDPR is also extraterritorial, which means its rules can apply outside of the EU under certain circumstances like when an American company gathers information about EU individuals.

We will avoid gathering PII, so we don’t need to consider PII.

Copyright

Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. Works of authorship include the following categories:
- 1. literary works;
- 1. musical works, including any accompanying words;
- 1. dramatic works, including any accompanying music;
- 1. pantomimes and choreographic works;
- 1. pictorial, graphic, and sculptural works;
- 1. motion pictures and other audiovisual works;
- 1. sound recordings; and
- 1. architectural works.
In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.

17 U.S.C.

Our final legal concern for gathering information from the Internet is copyright law. We have two main options for avoiding copyright limitations.

We can avoid copyright protections by not scraping authored content in the protected categories (i.e. literary works and sound recordings). Fortunately, factual data are not typically protected by copyright.
We can scrape information that is fair use. This is important if we want to use images, films, music, or extended text as data.

We will focus on data that are not copyrighted.

12.3.3 Ethical

We now turn to ethical considerations and some of the best-practices for gathering information from the web. In general, we will aim to be polite, give credit, and respect individual information.

Be polite

It is expensive and time-consuming to host data on the web. Hosts experience a small burden every time we access a website. This burden is small but can quickly grow with repeated queries. Just like with web APIs, we want to pace the burden of our access to be polite.

Rate Limiting

Rate limiting is the intentional slowing of web traffic for a user or users.

We will use Sys.sleep() in custom functions to slow our web scraping and ease the burden of our web scraping on web hosts.

robots.txt

robots.txt tells web crawlers and scrapers which URLs the crawler is allowed to access on a website.

Many websites contain a robots.txt file. Consider examples from the Urban Institute and White House.

We can manually look at the robots.txt. For example, just visit https://www.urban.org/robots.txt or https://www.whitehouse.gov/robots.txt. We can also use library(polite), which will automatically look at the robots.txt.

Give Credit

Academia and the research profession undervalue the collection and curation of data. Generally speaking, no one gets tenure for constructing even the most important data sets. It is important to give credit for data accessed from the web. Ideally, add the citation to Zotero and then easily add it to your manuscript in Quarto.

Be sure to make it easy for others to cite data sets that you create. Include an example citation like IPUMS or create a DOI for your data.

The rise of generative AI models like GPT-3, Stable Diffusion, DALL-E 2 makes urgent considerations of giving credit. These models consume massive amounts of training data, and it isn’t clear where the training data come from or the legal and ethical implications of the training data.³

Consider a few current events:

Sarah Silverman is suing OpenAI because she “never gave permission for OpenAI to ingest the digital version of her 2010 book to train its AI models, and it was likely stolen from a ‘shadow library’ of pirated works.”
Somepalli et al. (2023) use state-of-the-art image retrieval models to find that generative AI models like the popular the popular Stable Diffusion model “blatantly copy from their training data.” This is a major problem if the training data are copyrighted. The first page of their paper (here) contains some dramatic examples.
Finally, this Harvard Business Review article discusses the intellectual property problem facing generative AI.

Respect Individual Information

Data science methods should adhere to the same ethical standards as any research method. The social sciences have ethical norms about protecting privacy (discussed later) and informed consent.

Discussion

Is it appropriate to collect and share public PII?

Do these norms apply to data that is already public on the Internet?

Let’s consider an example. In 2016, researchers posted data about 70,000 OkCupid accounts. The data didn’t contain names but did contain usernames. The data also contained many sensitive variables including topics like sexual habits and politics.

The release drew strong reactions from some research ethicists including Michael Zimmer and Os Keyes.⁴

Fellegi (1972) defines data privacy as the ability “to determine what information about ourselves we will share with others”. Maybe OkCupid users made the decision to forego confidentiality when they published their accounts. Many institutional ethics committees do not require informed consent for public data.

Ravn, Barnwell, and Barbosa Neves (2020) do a good job developing a conceptual framework that bridges the gap between the view that all public data require informed consent and the view that no public data require informed consent.

It’s possible to conceive of a web scraping research project that is purely observational that adheres to the ethical standards of research and contains potentially disclosive information about individuals. Fortunately, researchers can typically use Institutional Review Boards and research ethicists to navigate these questions.

As a basic standard, we will avoid collecting PII and use anonymization techniques to limit the risk of re-identification.

We will also focus on applications where the host of information crudely shares the information. There are ample opportunities to create value by gathering information from government sources and converting it into more useful formats. For example, the government too often shares information in .xls and .xlsx files, clunky web interfaces, and PDFs.

12.4 Programatically Downloading Data

The County Health Rankings & Roadmaps is a source of state and local information.

Suppose we are interested in Injury Deaths at the state level. We can click through the interface and download a .xlsx file for each state.

12.4.1 Downloading a Single File

Start here.
Using the interface at the bottom of the page, we can navigate to the page for “Virginia.”
Next, we can click “View State Data.”
Next, we can click “Download Virginia data sets.”

That’s a lot of clicks to get here.

If we want to download “2023 Virginia Data”, we can typically right click on the link and select “Copy Link Address”. This should return one of the following two URLS:

https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx

https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx

Spaces are special characters in URLs and they are sometimes encoded as %20. Both URLs above work in the web browser, but only the URL with %20 will work with code.

As we’ve seen several times before, we could use read_csv() to directly download the data from the Internet if the file was a .csv.⁵ We need to download this file because it is an Excel file, which we can do with download.file() provided we include a destfile.

download.file(
  url = "https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx", 
  destfile = "data/virginia-injury-deaths.xlsx"
)

12.4.2 Downloading Multiple Files

If we click through and find the links for several states, we see that all of the download links follow a common pattern. For example, the URL for Vermont is

https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Vermont Data - v2.xlsx

The URLs only differ by "Virginia" and "Vermont". If we can create a vector of URLs by changing state name, then it is simple to iterate downloading the data. We will only download data for two states, but we can imagine downloading data for many states or many counties. Here are three R tips:

paste0() and str_glue() from library(stringr) are useful for creating URLs and destination files.
walk() from library(purrr) can iterate functions. It’s like map(), but we use it when we are interested in the side-effect of a function.⁶
Sometimes data are messy and we want to be polite. Custom functions can help with rate limiting and cleaning data.

download_chr <- function(url, destfile) {

  download.file(url = url, destfile = destfile)

  Sys.sleep(0.5)

}

states <- c("Virginia", "Vermont")

urls <- paste0(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023%20County%20Health%20Rankings%20",
  states,
  "%20Data%20-%20v2.xlsx"
)

output_files <- paste0("data/", states, ".xlsx")

walk2(.x = urls, .y = output_files, .f = download_chr)

Exercise 1

SOI Tax Stats - Historic Table 2 provides individual income and tax data, by state and size of adjusted gross income. The website contains a bulleted list of URLs and each URL downloads a .xlsx file.

Use download.file() to download the file for Alabama.
Explore the URLs using “Copy Link Address”.
Iterate pulling the data for Alabama, Alaska, and Arizona.

12.5 Web Scraping with rvest

We now pivot to situations where useful information is stored in the body of web pages.

12.5.1 Web Design

It’s simple to build a website with Quarto because it abstracts away most of web development. For example, Markdown is just a shortcut to write HTML. Web scraping requires us to learn more about web development than when we use Quarto.

The user interface of websites can be built with just HTML, but most websites contain HTML, CSS, and JavaScript. The development the interface of websites with HTML, CSS, and JavaScript is called front-end web development.

Hyper Text Markup Language

Hyper Text Markup Language (HTML) is the standard language for creating web content. HTML is a markup language, which means it has code for creating structure and and formatting.

The following HTML generates a bulleted list of names.

<ul>
  <li>Alex</li>
  <li>Aaron</li>
  <li>Alena</li>
</ul>

Cascading Style Sheets

Cascading Style Sheets (CSS) describes hot HTML elements should be styled when they are displayed.

For example, the following CSS adds extra space after sections with ## in our class notes.

.level2 {
  margin-bottom: 80px;
}

JavaScript

JavaScript is a programming language that runs in web browsers and is used to build interactivity in web interfaces.

Quarto comes with default CSS and JavaScript. library(leaflet) and Shiny are popular tools for building JavaScript applications with R. We will focus on web scraping using HTML and CSS.

First, we will cover a few important HTML concepts. W3Schools offers a thorough introduction. Consider the following simple website built from HTML:

<html>
<head>
<title>Hello World!</title>
</head>
<body>
<h1 class='important'>Bigger Title!</h1>
<h2 class='important'>Big Title!</h1>
<p>My first paragraph.</p>
<p id='special-paragraph'>My first paragraph.</p>
</body>
</html>

An HTML element is a start tag, some content, and an end tag. Every start tag has a matching end tag. For example, <body and </body>. <html>, <head>, and <body> are required elements for all web pages. Other HTML elements include <h1>, <h2>, and <p>.

HTML attributes are name/value pairs that provide additional information about elements. HTML attributes are optional and are like function arguments for HTML elements.

Two HTML attributes, classes and ids, are particularly important for web scraping.

HTML classes are HTML attributes that label multiple HTML elements. These classes are useful for styling HTML elements using CSS. Multiple elements can have the same class.
HTML ids are HTML attributes that label one HTML element. Ids are useful for styling singular HTML elements using CSS. Each ID can be used only one time in an HTML document.

We can view HTML for any website by right clicking in our web browser and selecting “View Page Source.”⁷

Exercise 2

Inspect the HTML behind this list of “Hello World examples”.
Inspect the HTML behind the Wikipedia page for Jerzy Neyman.

Second, we will explore CSS. CSS relies on HTML elements, HTML classes, and HTML ids to style HTML content. CSS selectors can directly reference HTML elements. For example, the following selectors change the style of paragraphs and titles.

p {
  color: red;
}

h1 {
  font-family: wingdings;
}

CSS selectors can reference HTML classes. For example, the following selector changes the style of HTML elements with class='important'.

.important {
  font-family: wingdings;
}

CSS selectors can reference also reference HTML IDs. For example, the following selector changes the style of the one element with id='special-paragraph'

#special-paragraph {
  color: pink;
}

We can explore CSS by right clicking and selecting Inspect. Most modern websites have a lot of HTML and a lot of CSS. We can find the CSS for specific elements in a website with the button at the top left of the new window that just appeared.

12.5.2 Tables

library(rvest) is the main tool for scraping static websites with R. We’ll start with examples that contain information in HTML tables.⁸

HTML tables store information in tables in websites using the <table>, <tr>, <th>, and <td>. If the data of interest are stored in tables, then it can be trivial to scrape the information.

Consider the Wikipedia page for the 2012 Presidential Election. We can scrape all 46 tables from the page with two lines of code. We use the WayBack Machine to ensure the content is stable.

library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

tables <- read_html("https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election") |>
  html_table()

Suppose we are interested in the table about presidential debates. We can extract that element from the list of tables.

tables[[18]]

# A tibble: 12 × 9
   `Presidential candidate`     Party `Home state` `Popular vote` `Popular vote`
   <chr>                        <chr> <chr>        <chr>          <chr>         
 1 "Presidential candidate"     Party Home state   Count          Percentage    
 2 "Barack Hussein Obama II"    Demo… Illinois     65,915,795     51.06%        
 3 "Willard Mitt Romney"        Repu… Massachuset… 60,933,504     47.20%        
 4 "Gary Earl Johnson"          Libe… New Mexico   1,275,971      0.99%         
 5 "Jill Ellen Stein"           Green Massachuset… 469,627        0.36%         
 6 "Virgil Hamlin Goode Jr."    Cons… Virginia     122,389        0.11%         
 7 "Roseanne Cherrie Barr"      Peac… Utah         67,326         0.05%         
 8 "Ross Carl \"Rocky\" Anders… Just… Utah         43,018         0.03%         
 9 "Thomas Conrad Hoefling"     Amer… Nebraska     40,628         0.03%         
10 "Other"                      Other Other        217,152        0.17%         
11 "Total"                      Total Total        129,085,410    100%          
12 "Needed to win"              Need… Needed to w… Needed to win  Needed to win 
# ℹ 4 more variables: Electoralvote <chr>, `Running mate` <chr>,
#   `Running mate` <chr>, `Running mate` <chr>

Of course, we want to be polite. library(polite) makes this very simple. “The three pillars of a polite session are seeking permission, taking slowly and never asking twice.”

We’ll use bow() to start a session and declare our user agent, and scrape() instead of read_html().⁹

library(polite)

session <- bow(
  url = "https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election",
  user_agent = "Georgetown students learning scraping -- arw109@georgetown.edu"
)

session

<polite session> https://web.archive.org/web/20230814004444/https://en.wikipedia.org/wiki/2012_United_States_presidential_election
    User-agent: Georgetown students learning scraping -- arw109@georgetown.edu
    robots.txt: 1 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

election_page <- session |>
  scrape() 
  
tables <- election_page |>
  html_table()

tables[[18]]

# A tibble: 12 × 9
   `Presidential candidate`     Party `Home state` `Popular vote` `Popular vote`
   <chr>                        <chr> <chr>        <chr>          <chr>         
 1 "Presidential candidate"     Party Home state   Count          Percentage    
 2 "Barack Hussein Obama II"    Demo… Illinois     65,915,795     51.06%        
 3 "Willard Mitt Romney"        Repu… Massachuset… 60,933,504     47.20%        
 4 "Gary Earl Johnson"          Libe… New Mexico   1,275,971      0.99%         
 5 "Jill Ellen Stein"           Green Massachuset… 469,627        0.36%         
 6 "Virgil Hamlin Goode Jr."    Cons… Virginia     122,389        0.11%         
 7 "Roseanne Cherrie Barr"      Peac… Utah         67,326         0.05%         
 8 "Ross Carl \"Rocky\" Anders… Just… Utah         43,018         0.03%         
 9 "Thomas Conrad Hoefling"     Amer… Nebraska     40,628         0.03%         
10 "Other"                      Other Other        217,152        0.17%         
11 "Total"                      Total Total        129,085,410    100%          
12 "Needed to win"              Need… Needed to w… Needed to win  Needed to win 
# ℹ 4 more variables: Electoralvote <chr>, `Running mate` <chr>,
#   `Running mate` <chr>, `Running mate` <chr>

Exercise 3

Install and load library(rvest).
Install and load library(polite).
Scrape the Presidential debates table from the Wikipedia article for the 2008 presidential election.

<polite session> https://en.wikipedia.org/wiki/2008_United_States_presidential_election
    User-agent: Georgetown students learning scraping -- arw109@georgetown.edu
    robots.txt: 464 rules are defined for 33 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

# A tibble: 4 × 8
  No.   Date    Time  Host  City  Moderators Participants `Viewership(millions)`
  <chr> <chr>   <chr> <chr> <chr> <chr>      <chr>        <chr>                 
1 P1    Friday… 9:00… Univ… Oxfo… Jim Lehrer Senator Bar… 52.4[112]             
2 VP    Friday… 9:00… Wash… St. … Gwen Ifill Senator Joe… 69.9[112]             
3 P2    Tuesda… 9:00… Belm… Nash… Tom Brokaw Senator Bar… 63.2[112]             
4 P3    Wednes… 9:00… Hofs… Hemp… Bob Schie… Senator Bar… 56.5[112]

12.5.3 Other HTML Content

Suppose we want to scrape every URL in the body of the 2012 Presidential Election webpage. html_table() no longer works.

We could manually poke through the source code to find the appropriate CSS selectors. Fortunately, SelectorGadget often eliminates this tedious work by telling you the name of the html elements that you click on.

Click the SelectorGadget gadget browser extension. You may need to click the puzzle piece to the right of the address bar and then click the SelectorGadget browser extension.
Select an element you want to scrape. The elements associated with the CSS selector provided at the bottom will be in green and yellow.
If SelectorGadget selects too few elements, select additional elements. If SelectorGadget selects too many elements, click those elements. They should turn red.

Each click should refine the CSS selector.

After a few clicks, it’s clear we want p a. This should select any element a in p. a is the element for URLs.

We’ll need a few more functions to finish this example.

html_elements() filters the output of read_html()/scrape() based on the provided CSS selector. html_elements() can return multiple elements while html_element() always returns one element.
html_text2() retrieves text from HTML elements.
html_attrs() retrieves HTML attributes from HTML elements. html_attrs() can return multiple attributes while html_attr() always returns one attribute.

tibble(
  text = election_page |>
    html_elements(css = "p a") |>
    html_text2(),
  url = election_page |>
    html_elements(css = "p a") |>
    html_attr(name = "href")
)

# A tibble: 639 × 2
   text                   url                                               
   <chr>                  <chr>                                             
 1 George W. Bush         /wiki/George_W._Bush                              
 2 Republican             /wiki/Republican_Party_(United_States)            
 3 Barack Obama           /wiki/Barack_Obama                                
 4 Democratic             /wiki/Democratic_Party_(United_States)            
 5 Presidential elections /wiki/United_States_presidential_election         
 6 United States          /wiki/United_States                               
 7 Democratic             /wiki/Democratic_Party_(United_States)            
 8 Barack Obama           /wiki/Barack_Obama                                
 9 senator                /wiki/List_of_United_States_senators_from_Illinois
10 Illinois               /wiki/Illinois                                    
# ℹ 629 more rows

Exercise 4

Suppose we are interested in examples of early websites. Wikipedia has a list of URLs from before 1995.

Add the SelectorGadget web extension to your browser.
Use library(polite) and library(rvest) to scrape() the following URL.

https://web.archive.org/web/20230702163608/https://en.wikipedia.org/wiki/List_of_websites_founded_before_1995

We are interested in scraping the names of early websites and their URLs. Use SelectorGadget to determine the CSS selectors associated with these HTML elements.
Create a tibble with a variable called name and a variable called url.
Remove duplicate rows with distinct() or filter().

Exercise 5

Find your own HTML table of interest to scrape.
Use library(rvest) and library(polite) to scrape the table.

12.6 Conclusion

Scraping data from websites is a powerful way of collecting data at scale or even when it is not organized in easily downloadable files like CSVs. Iteratively downloading is an elegant alternative to a time-intensive and potentially prohibitive process of going to websites and repeatedly downloading individual data sets. Scraping content from the body of websites is a more sophisticated approach that involves determining website html structure and then using that knowledge to extract key elements of that text. We strongly encourage you to consider the legal and ethnical risks of downloading this data.

We are not lawyers. This is not official legal advise. If in-doubt, please contact a legal professional.↩︎
This blog and this blog support this statement. Again, we are not lawyers and the HiQ Labs v. LinkedIn decision is complicated because of its long history and conclusion in settlement.↩︎
The scale of crawling is so great that there is concern about models converging once all models use the same massive training data. Common Crawl is one example. This isn’t a major issue for generating images but model homogeneity is a big concern in finance.↩︎
Who deserves privacy is underdiscussed and inconsistent. Every year, newspapers across the country FOIA information about government employees and publish their full names, job titles, and salaries.↩︎
Consequently, code that may once have worked can break, but using read_csv(<file_path>) to access data once it has been downloaded will work consistently.↩︎
The only difference between map() and walk() is their outputs. map() returns the results of a function in a list. walk() returns nothing when used without assignment, and we never use walk() with assignment. walk() is useful when we don’t care about the output of functions and are only interested in their “side-effects”. Common functions to use with walk() are ggsave() and write_csv(). For more information on walk(), see Advanced R.↩︎
We recommend using Google Chrome, which has excellent web development tools.↩︎
If a website is static, that means that the website is not interactive and will remain the same unless the administrator actively makes changes. Hello World examples is an example of a static website.↩︎
The polite documentation describes the bow() function as being used to “introduce the client to the host and ask for permission to scrape (by inquiring against the host’s robots.txt file).”↩︎

12.1 Review

12.2 Introduction and Motivation

12.3 Legal and Ethical Considerations

12.3.1 Legal1

Terms of Service

12.3.2 PII

Copyright

12.3.3 Ethical

Be polite

Give Credit

Respect Individual Information

12.4 Programatically Downloading Data

12.4.1 Downloading a Single File

12.4.2 Downloading Multiple Files

12.5 Web Scraping with rvest

12.5.1 Web Design

12.5.2 Tables

12.5.3 Other HTML Content

12.6 Conclusion

12.3.1 Legal¹