Before providing motivation, let’s define two key terms.
Text Corpus
A set of texts. A corpus generally has a theme, such as “The Federalist Papers” or Jane Austen novels.
Text analysis
The process of deriving information from text using algorithms and statistical analysis.
Text analysis has broad applications that extend well beyond policy analysis. We will briefly focus on four applications that are related to policy analysis.
24.1.1 1. Document Summarization
The process of condensing the information in a document into a useful subset. Techniques range from counting words to complex machine learning algorithms.
Example: Frederick Mosteller and David Wallace used Bayesian statistics and the frequency of certain words to identify the authorship of the twelve disputed Federalist Papers. (blog) Before estimating any models, they spent months cutting out each word of the Federalist Papers and counting the frequency of each word.
24.1.2 2. Text Classification (supervised)
The process of labeling documents with a predetermined set of labels.
Example: Researchers in the Justice Policy Center at the Urban Institute classified millions of tweets containing the word “cop” or “police” as “positive”, “negative”, “neutral”, or “not applicable” for sentiment and as “crime or incident information”, “department or event information”, “traffic or weather updates”, “person identification”, or “other” for topic. The researchers created content and metadata features, manually labeled a few thousand tweets as training data, and used gradient boosting for the supervised classification task. (blog)
Example: Frederick Mosteller and David Wallace used Bayesian statistics and the frequency of certain words to identify the authorship of the twelve disputed Federalist Papers. (blog)
24.1.3 3. Document Grouping (unsupervised)
The algorithmic grouping or clustering of documents using features extracted from the documents. This includes unsupervised classification of documents into meaningful groups. Techniques like topic modeling result in lists of important words that can be used to summarize and label the documents, while techniques like K-means clustering result in arbitrary cluster labels.
Example: Pew Research used unsupervised and semi-supervised methods to create topic models of open-ended text responses about where Americans find meaning in their lives. (blog)
24.1.4 4. Text Extraction
Text often contains important unstructured information that would be useful to have in a structured format like a table. Text extraction is the process of searching for and identifying key entities or concepts in unstructured text and placing them in a structured format.
Example: Researchers from Cornell counted sources of misinformation in 38 million articles. (NYTimes, Methodology)
24.1.5 Other
Speech recognition, machine translation, question answering, and text autocompletion are other common forms of text analysis and language processing, but they are typically not implemented with R packages.
24.2 Tools
24.2.1 Frequency
Term Frequency
A count or relative frequency of the most common words in a document or documents.
Term Frequency-Inverse Document Frequency (TF-IDF)
Some words are naturally more common than others. TF-IDF quantifies the importance of a term in one document relative to the other documents in a corpus.
\[\text{TF-IDF}(t, d) = TF(t, d) \cdot IDF(t)\]
where \(TF(t, d)\) is the relative frequency of term \(t\) in document \(d\) and \(IDF(t)\) is the (logged) ratio of the total number of documents in the corpus to the number of documents in which the term appears.
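As a quick illustration of the formula, here is a minimal sketch that computes TF, IDF, and TF-IDF on a made-up three-document corpus with tidytext’s bind_tf_idf(); the toy_corpus tibble and its documents are invented for illustration.

library(tidyverse)
library(tidytext)

# a made-up corpus of three tiny "documents"
toy_corpus <- tribble(
  ~document, ~text,
  "doc1", "taxes fund the union",
  "doc2", "the union needs a navy",
  "doc3", "taxes pay for the navy"
)

# tokenize, count terms per document, then compute tf, idf, and tf_idf
toy_corpus |>
  unnest_tokens(output = word, input = text) |>
  count(document, word) |>
  bind_tf_idf(term = word, document = document, n = n)

Words that appear in every document, like “the”, receive an IDF of zero and therefore a TF-IDF of zero, while words concentrated in a single document score highest. The tibble printed below is the raw text of The Federalist Papers, presumably downloaded with gutenbergr’s gutenberg_download(gutenberg_id = 1404) as in the full example in Section 24.6; the fed_paper1 object tokenized further down appears to be a subset containing only Federalist No. 1.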
# A tibble: 58,077 × 2
gutenberg_id text
<int> <chr>
1 1404 "THE FEDERALIST PAPERS"
2 1404 "THE FEDERALIST PAPERS"
3 1404 "THE FEDERALIST PAPERS"
4 1404 ""
5 1404 ""
6 1404 ""
7 1404 "By Alexander Hamilton, John Jay, and James Madison"
8 1404 "By Alexander Hamilton, John Jay, and James Madison"
9 1404 "By Alexander Hamilton, John Jay, and James Madison"
10 1404 ""
# ℹ 58,067 more rows
Token
A meaningful unit of text. Tokens include words, phrases, and sentences.
Tokenization
The process of splitting a larger unit of text into tokens. For example, “data science is useful” can be split into “data”, “science”, “is”, and “useful”.
# tidytext can tokenize text with unnest_tokens()
tidy_fed_paper1 <- fed_paper1 |>
  unnest_tokens(output = word, input = text)

tidy_fed_paper1
# A tibble: 1,297 × 2
gutenberg_id word
<int> <chr>
1 1404 by
2 1404 alexander
3 1404 hamilton
4 1404 john
5 1404 jay
6 1404 and
7 1404 james
8 1404 madison
9 1404 by
10 1404 alexander
# ℹ 1,287 more rows
Stemming
A method of removing the end of a word and keeping only its root. Stemming is unaware of the context or use of the word.
# SnowballC has a stemmer that works well with tidytext
library(SnowballC)

tidy_fed_paper1 |>
  mutate(stem = wordStem(word)) |>
  filter(word != stem)
# A tibble: 471 × 3
gutenberg_id word stem
<int> <chr> <chr>
1 1404 alexander alexand
2 1404 jay jai
3 1404 james jame
4 1404 alexander alexand
5 1404 jay jai
6 1404 james jame
7 1404 general gener
8 1404 introduction introduct
9 1404 general gener
10 1404 introduction introduct
# ℹ 461 more rows
Lemmatizing
A method of returning the base (dictionary) form of a word. Unlike stemming, lemmatization considers the context in which the word is used. Lemmatization requires natural language processing, so it relies on tools like Stanford CoreNLP or the Python package spaCy, which can be accessed in R via library(spacyr).
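tidytext does not lemmatize on its own. Below is a minimal sketch of lemmatizing with library(spacyr); it assumes a working Python installation of spaCy with an English model such as "en_core_web_sm" already downloaded, and the example sentence is made up.

library(spacyr)

# start a spaCy session (assumes the en_core_web_sm model is installed)
spacy_initialize(model = "en_core_web_sm")

# spacy_parse() returns one row per token; the lemma column holds each
# token's base form (e.g., "armies" becomes "army")
spacy_parse("The armies were better provisioned than the militias", lemma = TRUE)

# shut down the spaCy session when finished
spacy_finalize()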
Stop words
Extremely common words that are often not useful for text analysis. library(tidytext) contains stop words from the onix, SMART, and snowball lexicons.
stop_words
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
Text can also be stored as a string, a corpus, or a document-term matrix; a sketch of casting the tidy tokens into a document-term matrix follows the output below.
# tidy text format
tidy_fed_paper1
# A tibble: 1,297 × 2
gutenberg_id word
<int> <chr>
1 1404 by
2 1404 alexander
3 1404 hamilton
4 1404 john
5 1404 jay
6 1404 and
7 1404 james
8 1404 madison
9 1404 by
10 1404 alexander
# ℹ 1,287 more rows
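As one example of another storage format, a tidy token tibble can be cast into a document-term matrix with tidytext’s cast_dtm(). A minimal sketch, assuming the tm package is installed and treating gutenberg_id as the (single) document identifier:

# count tokens per document, then cast the counts into a document-term matrix
tidy_fed_paper1 |>
  count(gutenberg_id, word) |>
  cast_dtm(document = gutenberg_id, term = word, value = n)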
24.6 Example 1
# load necessary packages
library(tidyverse)
library(tidytext)
library(gutenbergr)
library(SnowballC)

# download the Federalist Papers from Project Gutenberg
fed_papers <- gutenberg_download(
  gutenberg_id = 1404,
  mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg"
)

# this data cleaning comes from Emil Hvitfeldt
# https://www.hvitfeldt.me/blog/predicting-authorship-in-the-federalist-papers-with-tidytext/
hamilton <- c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)
madison <- c(10, 14, 18:20, 37:48)
jay <- c(2:5, 64)
unknown <- c(49:58, 62:63)

fed_papers <- pull(fed_papers, text) |>
  str_c(collapse = " ") |>
  str_split(pattern = "\\.|\\?|\\!") |>
  unlist() %>%
  tibble(text = .) |>
  mutate(sentence = row_number())

tidy_fed_papers <- fed_papers |>
  mutate(
    paper_number = cumsum(str_detect(
      text, regex("FEDERALIST No", ignore_case = TRUE)
    ))
  ) |>
  unnest_tokens(word, text) |>
  mutate(
    author = case_when(
      paper_number %in% hamilton ~ "hamilton",
      paper_number %in% madison ~ "madison",
      paper_number %in% jay ~ "jay",
      paper_number %in% unknown ~ "unknown"
    )
  )
24.6.1 Approach 1
Here we’ll calculate term frequency without stop words and with stemming.
# filter to main authors
tidy_fed_papers1 <- tidy_fed_papers |>
  filter(author %in% c("hamilton", "madison"))

# remove stop words and stem the words
tidy_fed_papers1 <- tidy_fed_papers1 |>
  anti_join(stop_words, by = "word") |>
  mutate(word = wordStem(word))

# count Hamilton's most-frequent words
tidy_fed_papers1 |>
  filter(author == "hamilton") |>
  count(word, sort = TRUE)
# A tibble: 2,728 × 2
word n
<chr> <int>
1 govern 648
2 nation 510
3 peopl 363
4 power 360
5 union 315
6 time 234
7 author 198
8 danger 192
9 confederaci 186
10 countri 183
# ℹ 2,718 more rows
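The second set of counts, below, is presumably produced by the analogous call for Madison, mirroring the Hamilton count above:

# count Madison's most-frequent words
tidy_fed_papers1 |>
  filter(author == "madison") |>
  count(word, sort = TRUE)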
# A tibble: 1,592 × 2
word n
<chr> <int>
1 govern 174
2 nation 153
3 union 108
4 power 96
5 war 81
6 confederaci 69
7 peopl 69
8 form 63
9 object 63
10 time 63
# ℹ 1,582 more rows
24.6.2 Approach 2
Here we’ll calculate TF-IDF.
# calculate tf-idf
tf_idf <- tidy_fed_papers |>
  count(author, word, sort = TRUE) |>
  bind_tf_idf(term = word, document = author, n = n)
Warning: A value for tf_idf is negative:
Input should have exactly one row per document-term combination.
tidy_fed_papers |>
  count(author, word, sort = TRUE) |>
  filter(word == "upon")
# A tibble: 5 × 3
author word n
<chr> <chr> <int>
1 <NA> upon 813
2 hamilton upon 225
3 madison upon 66
4 unknown upon 39
5 jay upon 18
24.7 Example 2
Let’s consider ten of Shakespeare’s plays.
ids <- c(
  2265, # Hamlet
  1795, # Macbeth
  1522, # Julius Caesar
  2235, # The Tempest
  1780, # 1 Henry IV
  1532, # King Lear
  1513, # Romeo and Juliet
  1110, # King John
  1519, # Much Ado About Nothing
  1539  # The Winter's Tale
)

# download corpus
shakespeare <- gutenberg_download(
  gutenberg_id = ids,
  meta_fields = "title"
)
Determining mirror for Project Gutenberg from
https://www.gutenberg.org/robot/harvest.
Using mirror http://aleph.gutenberg.org.
# create tokens and drop character cues
shakespeare_clean <- shakespeare |>
  unnest_tokens(word, text, to_lower = FALSE) |>
  filter(word != str_to_upper(word))

# calculate TF-IDF
shakespeare_tf_idf <- shakespeare_clean |>
  count(title, word, sort = TRUE) |>
  bind_tf_idf(term = word, document = title, n = n)

# plot
shakespeare_tf_idf |>
  group_by(title) |>
  top_n(8, tf_idf) |>
  mutate(word = reorder(word, tf_idf)) |>
  ggplot(aes(tf_idf, word, fill = title)) +
  geom_col() +
  facet_wrap(~title, scales = "free", ncol = 2) +
  theme_minimal() +
  guides(fill = "none")
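One note on the plotting code: top_n() has been superseded in recent versions of dplyr, and slice_max() is the current replacement. A minimal sketch of the equivalent selection step (with_ties = FALSE keeps exactly eight rows per play, whereas top_n() keeps ties):

# current-dplyr replacement for top_n(8, tf_idf)
shakespeare_tf_idf |>
  group_by(title) |>
  slice_max(tf_idf, n = 8, with_ties = FALSE)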