The Schoolbooks Project: Automated Text Analysis

This article is a terrific overview of developments in automated text analysis. My hope is that these technologies can open many doors to academic and non-academic research in digital archives.

Loon, Austin van. 2022. “Three Families of Automated Text Analysis.” Social Science Research 108 (November): 102798.

This page is a précis of the article, with important definitions of functions that could be applied to OCR-scanned digital archival materials.

A key concept: "text as data"

The idea of "text" has changed. The article introduces three major changes in the role of text in society that make it an important source of research information. Evidence of this new role comes, first, from two sources:

Poe, M. T. (2010). A History of Communications: Media and Society from the Evolution of Speech to the Internet. Cambridge University Press.

Evidence of the influence of social media.

Roser, M., & Ortiz-Ospina, E. (2016). Literacy. Our World in Data.

Evidence of the prevalence of text.

https://ourworldindata.org/rise-of-social-media?ref=tms

Here is a key quote from van Loon: "...the text-producing segment of the population has become more representative of the population as a whole." This is important, I think, because it improves the chance that the text at our disposal as researchers is representative of society as a whole. And since research is but a representation shaped by theory, or shaped by patterns of human behaviour, we have a good chance to measure social realities. Van Loon calls this "the goings-on of social life."

Van Loon defines three changes in the nature of text in society that make text important as a source of insight into social realities: 1) text producers are more representative of all of society, 2) more of social life, such as dating, happens through text, and 3) machine-readable text is more plentiful. As examples of the third change, van Loon cites Project Gutenberg (https://www.gutenberg.org/) and the Google Books Corpus (https://www.english-corpora.org/googlebooks/). One could also cite the Internet Archive (https://archive.org/).

What are the advantages of automated text analysis?

Natural language is more expressive and representative of thoughts and feelings than a Likert scale response.
Digitally stored text exchanges are available in perfect fidelity: there is no reduction from the phenomenon being analyzed to the data available.
Because text corpora are ubiquitous, ATA gives access to populations who might otherwise not participate in research, such as judges, celebrities, and CEOs.

What obstacles does ATA present?

ATA grossly simplifies written language.
Meaning is an elusive construct. (e.g. although the phrases “Mark Granovetter is smart” and “The author of ‘The Strength of Weak Ties’ possesses great intelligence” might mean the same thing for all intents and purposes, they actually have no words in common).
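The point about the two Granovetter phrases can be checked with a minimal sketch. It assumes a very simple tokenizer (lowercased runs of letters) and uses Jaccard overlap as the measure of shared vocabulary; both choices are mine, not van Loon's.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase a phrase and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct words that two phrases have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

phrase_1 = "Mark Granovetter is smart"
phrase_2 = "The author of 'The Strength of Weak Ties' possesses great intelligence"

# The phrases share no tokens, so a purely word-based measure sees no similarity.
overlap = jaccard(phrase_1, phrase_2)
print(overlap)
```

A method that only counts shared words would treat these two phrases as unrelated, which is the obstacle the semantic similarity family tries to address.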

What are the three families of automated text analysis?

term frequency analysis. This analysis represents text as observations that vary in how often certain strings of characters (e.g., words) appear.
document structure analysis. This analysis assumes one can extract from word co-occurrence statistics what any given document is “about” (i.e., what the appropriate keywords or themes are) and represents text as observations that vary on this feature.
semantic similarity analysis. This analysis attempts to quantify the meaning of strings of characters and represents texts as collections of such meanings.

What are two reviews of ATA?

Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.

Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis: An Annual Publication of the Methodology Section of the American Political Science Association 21 (3): 267–97.

What are some of the advantages of this more analytical typology over other, more practical ones?

Features can be applied systematically across theoretical constructs.
The features lens helps theories of social science keep up with innovations in computational linguistics.

What is the "term frequency analysis" family of ATA?

The systematic analysis of word choice in communication. This is measured by how often particular terms are used at scale.
There are two categories of such methods: closed-vocabulary (a priori sets of terms tied to theoretical constructs) and open-vocabulary (inductive analysis of patterns in some aspect of the text).
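As a rough illustration of what "how often particular terms are used" means in practice, here is a minimal sketch that counts terms in one passage and computes a relative frequency. The sample sentence and the function names are invented for illustration.

```python
from collections import Counter
import re

def term_frequencies(text: str) -> Counter:
    """Count how often each lowercased word appears in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def relative_frequency(text: str, term: str) -> float:
    """Occurrences of a term divided by the total number of tokens."""
    counts = term_frequencies(text)
    total = sum(counts.values())
    return counts[term.lower()] / total if total else 0.0

# Hypothetical passage standing in for one page of an OCR-scanned schoolbook.
page = "The teacher read the lesson. The pupils copied the lesson."

counts = term_frequencies(page)
lesson_rate = relative_frequency(page, "lesson")
print(counts.most_common(2))
print(lesson_rate)
```

Dividing by the total token count makes documents of different lengths comparable, which matters when the corpus mixes short and long texts.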

What is the closed-vocabulary approach?

The key assumption, often backed by some validation exercise(s), is that the prevalence of one or a set of terms meaningfully corresponds to the theorized construct.
Example: The "threat dictionary". An uptick in the use of terms from this dictionary indicates a cultural feeling of threat.
Development process: a) develop "seed words," b) expand the seed words using human judgement and word embeddings or WordNet, c) prune terms that are "too distant," d) repeat.
Demonstrating the validity of a lexicon: 1) convergent validity: show that lexical measures correlate with plausible events associated with their theoretical construct, use words elicited in a survey, or examine the contexts of terms in a corpus; 2) divergent validity: show that the lexical measure is uncorrelated with related but distinct theoretical constructs.
Sample dictionaries: Linguistic Inquiry and Word Count (LIWC), a general-purpose dictionary.
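A closed-vocabulary measure can be sketched in a few lines. The word list below is a made-up stand-in, not the published threat dictionary and not LIWC, and the two sample documents are invented; the sketch only shows the general idea of scoring a document by the share of its words that fall in a lexicon.

```python
import re

# Hypothetical lexicon for illustration only; not the published threat dictionary.
THREAT_TERMS = {"danger", "attack", "fear", "crisis"}

def lexicon_rate(text: str, lexicon: set[str]) -> float:
    """Fraction of a document's tokens that belong to the lexicon."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in lexicon for word in words) / len(words)

# Invented sample documents.
documents = {
    "doc_a": "The crisis spread fear of attack across the town",
    "doc_b": "The harvest festival brought music to the town",
}

rates = {name: lexicon_rate(text, THREAT_TERMS) for name, text in documents.items()}
for name, rate in rates.items():
    print(name, round(rate, 3))
```

In a real study the scores would then be validated, for instance by checking that they rise around events plausibly associated with the construct.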

What is the open-vocabulary approach?

Definition: the analysis of term frequencies is entirely inductive, allowing for interesting relationships between term frequency and metadata to emerge from the corpus.
Review of the approach:

Tam, Vivian, Nikunj Patel, Michelle Turcotte, Yohan Bossé, Guillaume Paré, and David Meyre. 2019. “Benefits and Limitations of Genome-Wide Association Studies.” Nature Reviews. Genetics 20 (8): 467–84.
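To make the open-vocabulary idea concrete, here is a minimal sketch that correlates the relative frequency of every term in a tiny corpus with one piece of metadata. The four documents and their years are invented, and plain Pearson correlation is my assumption for the association measure; real toolkits offer more careful statistics.

```python
from collections import Counter
from math import sqrt
import re

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def relative_counts(text):
    """Map each term in a text to its share of the text's tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return {term: n / len(words) for term, n in Counter(words).items()}

# Invented corpus: (text, publication year) pairs.
corpus = [
    ("duty and empire and duty", 1900),
    ("duty to the empire", 1920),
    ("science for the nation", 1940),
    ("science and science education", 1960),
]

doc_freqs = [relative_counts(text) for text, _ in corpus]
years = [year for _, year in corpus]
vocabulary = sorted(set().union(*doc_freqs))

# No term is chosen in advance: every term is tested against the metadata.
associations = {
    term: pearson([freqs.get(term, 0.0) for freqs in doc_freqs], years)
    for term in vocabulary
}
ranked = sorted(associations.items(), key=lambda item: item[1])
for term, r in ranked:
    print(f"{term:10s} {r:+.2f}")
```

The terms at either end of the ranking are the ones whose use moves most strongly with the metadata, which is where the inductive reading would start.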

What is a resource for differential language analysis?

Schwartz, H. Andrew, Salvatore Giorgi, Maarten Sap, Patrick Crutchley, Johannes C. Eichstaedt, and Lyle Ungar. n.d. Differential Language Analysis ToolKit. Github. Accessed April 9, 2023. https://github.com/dlatk.

"The core assumption of term frequency analyses is that a term’s prevalence consistently reflects something meaningful about the document, its author, or the context in which the document was produced." p. 5

What is the document structure analysis family of ATA?

Documents: for example, specific tweets, policy platforms, and text messages.
Document structure analysis relies on the assumption that co-occurrence statistics, and therefore the boundaries of documents, are meaningful.
Document structure analysis finds "patterns at the level of the document, seeking to estimate hidden patterns in the way words are distributed amongst them."
Topics: "...can instead be thought of as the clusters of words whose combined presence meaningfully divide [or classify] the documents."
Two approaches to theory building and testing: the inductive ("grounded theory") approach versus the abductive approach.
There are two dominant approaches to document structure analysis.

The first is a set of methods which infer topics through Bayesian inference and are what are widely referred to as “topic models”.
The second set of approaches treats the document-term matrix (or a transformation of it) as an adjacency matrix, which is then modeled as a network.
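Both approaches start from a document-term matrix, so a small sketch may help. The three documents are invented, and counting the number of documents in which two terms appear together is just one possible transformation into an adjacency matrix.

```python
import re

# Invented documents; each one is a row of the document-term matrix.
documents = [
    "tax budget vote",
    "budget vote senate",
    "rain storm flood",
]

tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
vocabulary = sorted({word for doc in tokenized for word in doc})
index = {term: i for i, term in enumerate(vocabulary)}

# Document-term matrix: dtm[d][t] is how often term t occurs in document d.
dtm = [[0] * len(vocabulary) for _ in documents]
for row, doc in zip(dtm, tokenized):
    for word in doc:
        row[index[word]] += 1

# Term-by-term adjacency matrix: cooc[a][b] is the number of documents
# in which terms a and b both appear.
size = len(vocabulary)
cooc = [[0] * size for _ in range(size)]
for row in dtm:
    present = [j for j, count in enumerate(row) if count > 0]
    for a in present:
        for b in present:
            if a != b:
                cooc[a][b] += 1

print(vocabulary)
print(cooc[index["budget"]][index["vote"]])
```

Note that the result depends entirely on where one document ends and the next begins, which is why the family assumes document boundaries are meaningful.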

Bayesian approaches: In this approach, a “topic” is modeled as a multinomial distribution over all terms in the vocabulary. Some terms might be assigned higher probabilities than others.
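What it means for a topic to be a distribution over the vocabulary can be shown with a toy sketch. The two topics and their probabilities are invented, not fitted to any corpus, and the sketch scores a document under one topic at a time, whereas a full topic model treats each document as a mixture of topics and estimates everything from the data.

```python
from math import isclose, log

# Hypothetical topics: each assigns a probability to every term in a
# five-word vocabulary, and each set of probabilities sums to 1.
topics = {
    "schooling": {"teacher": 0.40, "lesson": 0.30, "pupil": 0.20,
                  "harvest": 0.05, "field": 0.05},
    "farming": {"teacher": 0.05, "lesson": 0.05, "pupil": 0.10,
                "harvest": 0.40, "field": 0.40},
}
for distribution in topics.values():
    assert isclose(sum(distribution.values()), 1.0)

def log_likelihood(words, topic):
    """Log-probability of a bag of words under one topic.

    The multinomial coefficient is omitted because it is the same for
    every topic and so does not affect the comparison.
    """
    return sum(log(topic[word]) for word in words)

document = ["teacher", "lesson", "lesson", "pupil"]
scores = {name: log_likelihood(document, dist) for name, dist in topics.items()}
best_topic = max(scores, key=scores.get)
print(scores)
print(best_topic)
```

Terms with higher probability under a topic pull a document toward that topic, which is the sense in which some terms are "assigned higher probabilities than others."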

Process: Represent each document as a "bag of words," that is, as counts of its terms without regard to word order, and assemble these counts into a document-term matrix. The model then infers topics from the matrix. Once the topics are estimated, the researcher can use them to examine other documents.

Example: Given the hypothesis that college applicants from different socio-economic backgrounds write systematically different college admission essays, a researcher could use a correlated topic model to automatically categorize admissions essays. (p. 7)

Example 2: Create a word network from State of the Union addresses by American presidents. Connect words if they occur in the same paragraph of the same address, identify clusters of words and what the clusters represent, and trace these over time to follow political consciousness.
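The word-network example can be sketched as follows. The paragraphs are invented placeholders rather than real speech text, and connected components are used as a deliberately simple stand-in for the clustering step; an actual study would more likely use a community-detection method and repeat the analysis by year.

```python
from collections import defaultdict
from itertools import combinations
import re

# Invented paragraphs standing in for paragraphs of one address.
paragraphs = [
    "war army navy",
    "army navy defense",
    "farm crop price",
    "crop price market",
]

# Connect two words whenever they occur in the same paragraph; the edge
# weight counts the paragraphs they share.
edge_weights = defaultdict(int)
for paragraph in paragraphs:
    words = sorted(set(re.findall(r"[a-z]+", paragraph.lower())))
    for a, b in combinations(words, 2):
        edge_weights[(a, b)] += 1

neighbors = defaultdict(set)
for a, b in edge_weights:
    neighbors[a].add(b)
    neighbors[b].add(a)

def connected_components(graph):
    """Group nodes that are linked by any chain of edges."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(graph[node] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters

clusters = connected_components(neighbors)
for cluster in clusters:
    print(cluster)
```

Interpreting what each cluster represents is still the researcher's job; the code only finds which words travel together.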
