In this chapter, we introduce fundamental ideas for text analysis.
24.1 Motivation
Before providing motivation, let’s define two key terms.
Text Corpus
A collection of texts. A corpus generally has a theme, such as “The Federalist Papers” or Jane Austen's novels.
Text analysis
The process of deriving information from text using algorithms and statistical analysis.
Text analysis has broad applications that extend well beyond policy analysis. We will briefly focus on four applications that are related to policy analysis.
24.1.1 1. Document Summarization
The process of condensing the amount of information in a document into a useful subset of information. Techniques range from counting words to using complex machine learning algorithms.
Example: Frederick Mosteller and David Wallace used Bayesian statistics and the frequency of certain words to identify the authorship of the twelve unclaimed Federalist Papers. Before estimating any models, they spent months cutting out each word of the Federalist Papers and then counting the frequency of words. (blog)
24.1.2 2. Text Classification (supervised)
The process of labeling documents with a predetermined set of labels.
Example: Researchers in the Justice Policy Center at the Urban Institute classified millions of tweets containing the word “cop” or “police”. For sentiment, each tweet was labeled “positive”, “negative”, “neutral”, or “not applicable”. For topic, each tweet was labeled “crime or incident information”, “department or event information”, “traffic or weather updates”, “person identification”, or “other”. The researchers created content and metadata features, manually labeled a few thousand tweets as training data, and used gradient boosting for the supervised classification task. (blog)
24.1.3 3. Document Grouping (unsupervised)
The algorithmic grouping or clustering of documents using features extracted from the documents. This includes unsupervised classification of documents into meaningful groups. Techniques like topic modeling result in lists of important words that can be used to summarize and label the documents while techniques like K-means clustering result in arbitrary cluster labels.
Example: Pew Research used unsupervised and semi-supervised methods to create topic models of open-ended text responses about where Americans find meaning in their lives. (blog)
24.1.4 4. Text Extraction
Text often contains important unstructured information that would be useful to have in a structured format like a table. Text extraction is the process of searching and identifying key entities or concepts from unstructured text and then placing them in a structured format.
Example: Researchers from Cornell counted sources of misinformation across 38 million articles. (NYTimes, Methodology)
24.1.5 Other
Speech recognition, machine translation, question answering, and text autocompletion are other common forms of text analysis and language processing, but they are rarely implemented with R packages.
24.2 Tools
24.2.1 Frequency
Term Frequency
A count or relative frequency of each word in a document or set of documents. It is often used to find the most common words.
Term Frequency-Inverse Document Frequency (TF-IDF)
Some words are naturally more common than other words. TF-IDF quantifies the importance of words/terms in one document relative to other documents in a corpus.
\[\text{TF-IDF}(t, d) = TF(t, d) \cdot IDF(t)\]
where \(TF(t, d)\) is the relative frequency of term \(t\) in document \(d\) and \(IDF(t)\) is the inverse document frequency of term \(t\), which decreases as the number of documents containing the term increases.
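As a minimal sketch of the formula above, the following base R code computes TF-IDF by hand for a toy corpus. The three documents are made up for illustration. The IDF is assumed to be the natural log of the number of documents divided by the number of documents containing the term, which is the definition documented for tidytext's bind_tf_idf().

```r
# Toy corpus: three hypothetical one-line "documents" (not from the chapter)
docs <- c(
  d1 = "the union the power",
  d2 = "the power of the people",
  d3 = "the people"
)

# Split each document into word tokens, keeping the document names
tokens <- setNames(strsplit(docs, " ", fixed = TRUE), names(docs))

# Term frequency: share of a document's tokens accounted for by each term
term_freq <- function(x) {
  counts <- table(x)
  setNames(as.numeric(counts) / length(x), names(counts))
}
tf <- lapply(tokens, term_freq)

# Inverse document frequency (assumed form): ln(n documents / n documents with the term)
vocab <- unique(unlist(tokens))
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(tokens, function(x) term %in% x))
})
idf <- log(length(docs) / doc_freq)

# TF-IDF for every term in document d1
tf_idf_d1 <- tf$d1 * idf[names(tf$d1)]
round(tf_idf_d1, 3)
```

The word "the" appears in all three documents, so its IDF is ln(3/3) = 0 and its TF-IDF is 0 no matter how often it occurs. The word "union" appears only in d1, so it receives the largest TF-IDF in that document.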
Token
A meaningful unit of text. Tokens include words, phrases, and sentences.
Tokenization
The process of splitting a larger unit of text into tokens. For example, “data science is useful” can be split into “data”, “science”, “is”, and “useful”.
# tidytext can tokenize text with unnest_tokens()
tidy_fed_paper1 <- fed_paper1 |>
  unnest_tokens(output = word, input = text)

tidy_fed_paper1
# A tibble: 1,612 × 2
gutenberg_id word
<int> <chr>
1 1404 federalist
2 1404 no
3 1404 1
4 1404 general
5 1404 introduction
6 1404 for
7 1404 the
8 1404 independent
9 1404 journal
10 1404 saturday
# ℹ 1,602 more rows
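Tokens do not have to be single words. As a hedged sketch, the code below tokenizes the example sentence from the definition above into two-word phrases (bigrams) using the token and n arguments of unnest_tokens(). The data frame example_text and the column name bigram are made up for illustration.

```r
library(tibble)
library(tidytext)

# Hypothetical one-row data frame reusing the example sentence from the text
example_text <- tibble(text = "data science is useful")

# token = "ngrams" with n = 2 splits the text into overlapping two-word phrases
bigrams <- example_text |>
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

bigrams
```

This should return three rows: "data science", "science is", and "is useful".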
Stemming
A method of removing the end of a word and keeping only the root. Stemming is unaware of the context or use of the word.
Example:
# SnowballC has a stemmer that works well with tidytext
library(SnowballC)

tidy_fed_paper1 |>
  mutate(stem = wordStem(word)) |>
  filter(word != stem)
# A tibble: 536 × 3
gutenberg_id word stem
<int> <chr> <chr>
1 1404 general gener
2 1404 introduction introduct
3 1404 independent independ
4 1404 saturday saturdai
5 1404 october octob
6 1404 people peopl
7 1404 unequivocal unequivoc
8 1404 experience experi
9 1404 inefficacy inefficaci
10 1404 subsisting subsist
# ℹ 526 more rows
Lemmatization
A method of returning the base form of a word. Unlike stemming, lemmatization considers the context of a word. Lemmatizing requires natural language processing tools such as Stanford CoreNLP or the Python package spaCy, which can be accessed in R via library(spacyr).
Stop words
Extremely common words that are often not useful for text analysis. library(tidytext) contains stop words from the onix, SMART, and snowball lexicons.
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
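One common way to drop stop words from tokenized text is an anti-join against stop_words. The sketch below assumes one word per row in a column named word, as unnest_tokens() produces. The toy_tokens data frame is made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, one word per row, as unnest_tokens() would produce
toy_tokens <- tibble(word = c("the", "power", "of", "the", "people"))

# anti_join() keeps only the rows whose word does not appear in stop_words
toy_tokens_no_stop <- toy_tokens |>
  anti_join(stop_words, by = "word")

toy_tokens_no_stop
```

Because stop_words combines three lexicons, a word is dropped if it appears in any of them. To use a single lexicon, filter stop_words on the lexicon column before joining.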
# A tibble: 2,863 × 2
word n
<chr> <int>
1 govern 318
2 power 311
3 constitut 213
4 peopl 144
5 author 125
6 execut 120
7 feder 118
8 form 99
9 depart 98
10 union 97
# ℹ 2,853 more rows
24.6.2 Approach 2
Here we’ll perform TF-IDF.
# calculate tf-idf
tf_idf <- tidy_fed_papers |>
  count(author, word, sort = TRUE) |>
  bind_tf_idf(term = word, document = author, n = n)

# plot
tf_idf |>
  filter(author %in% c("hamilton", "madison")) |>
  group_by(author) |>
  top_n(15, tf_idf) |>
  mutate(word = reorder(word, tf_idf)) |>
  ggplot(aes(tf_idf, word, fill = author)) +
  geom_col() +
  facet_wrap(~author, scales = "free") +
  theme_minimal() +
  guides(fill = "none")
24.6.3 Approach 3
# count how often each author uses the word "upon"
tidy_fed_papers |>
  count(author, word, sort = TRUE) |>
  filter(word == "upon")
# A tibble: 4 × 3
author word n
<chr> <chr> <int>
1 hamilton upon 374
2 madison upon 9
3 unknown upon 3
4 jay upon 1
24.7 Example 2
Let’s consider ten of Shakespeare’s plays.
ids <- c(
  2265, # Hamlet
  1795, # Macbeth
  1522, # Julius Caesar
  2235, # The Tempest
  1780, # 1 Henry IV
  1532, # King Lear
  1513, # Romeo and Juliet
  1110, # King John
  1519, # Much Ado About Nothing
  1539  # The Winter's Tale
)

# download corpus
shakespeare <- gutenberg_download(
  gutenberg_id = ids,
  meta_fields = "title"
)
# create tokens and drop character cues
shakespeare_clean <- shakespeare |>
  unnest_tokens(word, text, to_lower = FALSE) |>
  filter(word != str_to_upper(word))

# calculate TF-IDF
shakespeare_tf_idf <- shakespeare_clean |>
  count(title, word, sort = TRUE) |>
  bind_tf_idf(term = word, document = title, n = n)

# plot
shakespeare_tf_idf |>
  group_by(title) |>
  top_n(8, tf_idf) |>
  mutate(word = reorder(word, tf_idf)) |>
  ggplot(aes(tf_idf, word, fill = title)) +
  geom_col() +
  facet_wrap(~title, scales = "free", ncol = 2) +
  theme_minimal() +
  guides(fill = "none")