Recent Developments in Humanities Topic Modeling: Matthew Jockers's Macroanalysis and the Journal of Digital Humanities

1. Ongoing Concerns

Matthew Jockers’s Macroanalysis: Digital Methods and Literary History arrived in the mail yesterday, and I finished reading it just a short while ago. Between it and the recent Journal of Digital Humanities issue on the “Digital Humanities Contribution to Topic Modeling,” I’ve had quite a lot to read and think about. John Laudun and I also finished editing our forthcoming article in The Journal of American Folklore on using topic-models to map disciplinary change. Our article takes a strongly interpretive and qualitative approach, and I want to review what Jockers and some of the contributors to the JDH volume have to say about the interpretation of topic models.

Before I get to that, however, I want to talk about the status of the Representations project, which was based on viewing the same corpus through models with different numbers of topics. I had an intuition that documents that were highly cited outside of the journal, such as Pierre Nora’s “Between Memory and History,” might tend to be more reflective of the journal’s overall thematic structure than those less cited. The fact that citation count is (to some degree) correlated with publication date complicates this, of course, and I also began to doubt the premise. The opposite might be just as likely to be true: articles that cut against the overall thematic structure may be more notable than those doing “normal science.” I was also concerned by the mathematical naivety of my approach compared to existing work on topic modeling and document influence, such as the Gerrish and Blei paper I linked to in the original post.
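The basic comparison is easy enough to state, even if it is naive: measure how far each article’s topic distribution sits from the journal-wide average, then see whether that distance tracks citation counts. A minimal sketch in Python, with placeholder data standing in for the real doc-topic table and citation numbers:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: rows are articles, columns are topic proportions
doc_topics = np.random.dirichlet(np.ones(50), size=200)
citations = np.random.poisson(5, size=200)  # placeholder citation counts

# Treat the journal's "overall thematic structure" as the mean topic distribution
corpus_profile = doc_topics.mean(axis=0)

# Manhattan distance of each article from that corpus-wide profile
distances = np.abs(doc_topics - corpus_profile).sum(axis=1)

# Do highly cited articles sit closer to (or farther from) the center?
rho, p = spearmanr(citations, distances)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```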

One important and useful feature missing from the browsers I had built was the display of related documents for each article. After spending one morning reading through early issues of Computers and the Humanities, I built a browser of it and then began working on computing similarity scores for individual articles. I used what seemed to be the simplest and most intuitive measure: the sum of absolute differences of topic assignments (this is known as Manhattan distance). Travis Brown pointed out to me on Twitter that Kullback-Leibler divergence would likely give better results.* (Sure enough, in the original LDA paper, KL divergence is recommended.) The Computers and the Humanities browser currently uses the simpler distance measure, and the results are not very good. (This browser also did not filter for research articles only, and I used only the default stop-words list, which means it is far less useful than it could be.)
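For concreteness, here is roughly what the two measures look like when applied to a pair of per-document topic distributions. This is just a sketch with toy numbers; in practice the proportions need a little smoothing so the KL term is defined when a topic has (near-)zero weight in one document, and KL is usually symmetrized since it is not a true distance:

```python
import numpy as np

def manhattan(p, q):
    """Sum of absolute differences between two topic distributions."""
    return np.abs(p - q).sum()

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(p || q); asymmetric, hence not a true distance."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log(p / q))

def symmetric_kl(p, q):
    """A common symmetrization: average the divergence in both directions."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Two toy documents described over five topics
doc_a = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
doc_b = np.array([0.10, 0.45, 0.20, 0.15, 0.10])

print(manhattan(doc_a, doc_b))      # the measure my browsers currently use
print(symmetric_kl(doc_a, doc_b))   # the measure Brown suggested
```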

While the KL divergence is not hard to calculate, I didn’t have time at the end of the semester to rewrite the similarity-score script to use it.** And since I wanted the next iteration of the browsers to use the presumably more accurate document-similarity scores, I’ve decided to postpone that project for a month or so. Having a JavaScript interface that lets you instantly switch views between pre-generated models with varying numbers of topics also seemed like a useful idea; I haven’t seen anyone do that yet (please let me know if there are existing examples of something like this).
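The data side of that interface is the easy part. Something like the following sketch (file names invented; it assumes MALLET’s newer --output-doc-topics format, where the columns after the document name are proportions in topic order) would bundle several models into one JSON file for a front end to toggle between:

```python
import csv
import json

# Hypothetical MALLET doc-topics files, one per model size
doc_topic_files = {25: "model_25_doctopics.txt",
                   50: "model_50_doctopics.txt",
                   100: "model_100_doctopics.txt"}

models = {}
for num_topics, path in doc_topic_files.items():
    with open(path) as f:
        rows = [r for r in csv.reader(f, delimiter="\t") if not r[0].startswith("#")]
    # Keep each document's name and its vector of topic proportions
    models[num_topics] = {r[1]: [float(x) for x in r[2:]] for r in rows}

# One file the browser's JavaScript can load and switch between instantly
with open("models.json", "w") as f:
    json.dump(models, f)
```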

2. Interpretation

I’m only going to write about a small section of Macroanalysis here; a full review might come in the future. I think that the rhetorical strategies of Jockers’s book (and also of Stephen Ramsay’s Reading Machines, an earlier volume in the Topics in the Digital Humanities series published by the University of Illinois Press) contrast interestingly with those of other scholarly monographs in literary studies, and that this rhetoric is worth examining in the context of the current crisis in the humanities and the salvific role of computational methods therein. But what I’m going to discuss here is Jockers’s take on labeling and interpreting the topics generated by LDA.

In our interpretation of the folklore-journals corpus, John and I did de facto labeling and clustering of the topics. We were particularly interested in a cluster of topics related to the performative turn in folklore. Several of these topics matched our expectations both in their related terms and in their chronological trends. (Ben Schmidt’s cautions about graphing chronological trends in topics are persuasive, though I’m more optimistic than he is about the use of dynamic topic modeling for secondary literature.) The documents associated with these apparently performance-related topics also accorded with our expectations, and we took this as evidence that the algorithm’s co-occurrence and relative-frequency assignments were working as expected. If that were all, then the results would be only another affirmation of the long-attested usefulness of LDA for classification and information retrieval. And this goes a long way: if it works for things we know, then it works for things we don’t, and there are many texts we don’t know much about.

The real interest in using topic modeling to examine scholarship comes when the results contrast with received understanding. When they mostly accord with what someone would expect to find, but there are oddities and discrepancies, we must interpret the results to determine whether the fault lies in the algorithm’s classification or in the discipline’s received understanding of its history. By definition, this received understanding is based more on generalization and oral lore than on analytic scrutiny and revision (which obviously drives much inquiry, but is almost always selective in its target), so there will always be discrepancies. Bibliometric approaches to humanities scholarship lag far behind those of the sciences, as I understand it, and I think they are of intrinsic interest independent of their contribution to disciplinary history.

Jockers describes efforts to label topics algorithmically in Macroanalysis (135, fn1). He mentions that his own work in successively revising the labels of his topic model of nineteenth-century novels is being used by David Mimno to train a classifying algorithm. He also cites “Automatic Labelling of Topic Models” and “Best Topic Word Selection for Topic Labelling” by Jey Han Lau and co-authors. Both of these papers explore automatically assigning labels to topics, either from the terms themselves or by querying an external source, such as Wikipedia, to correlate with the terms. My browsers just use the first four terms of a topic as the label, but I can see how a human-assigned label would make them more consistently understandable. Of course, with many models and large numbers of topics, this process becomes laborious, hence the interest in automatic assignment.
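The naive labeling my browsers do amounts to a one-liner; something like this sketch (with an invented topic) is all there is to it:

```python
def label_topic(top_terms, n=4):
    """Label a topic with its n highest-weighted terms, joined into a short string."""
    return " ".join(top_terms[:n])

# An invented topic's highest-weighted terms, in descending order of weight
topic_terms = ["memory", "history", "commemoration", "monument", "past", "nation"]
print(label_topic(topic_terms))  # "memory history commemoration monument"
```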

But some topics cannot be interpreted. (These are described as “uninterruptable” topics in Macroanalysis [129], in what I assume is a spell-check mistake.) Ignoring ambiguous topics is “a legitimate use of the data and should not be viewed with suspicion by those who may be wary of the ‘black box’” (130). I agree with Jockers here. In my experience modeling JSTOR data, there are always “evidence/argument” topics that are highly represented in a hyperparametrized model, and these topics are so general as to be useless for analytic purposes. There are also “OCR error” topics and “bibliography” topics. I wouldn’t describe these latter ones as ambiguous so much as useless, but the point is that you don’t have to account for the entire model to interpret some of the topics. Topics near the bottom of a hyperparametrized model tend not to be widely represented in the corpus and thus are not of very high quality: the “dewey ek chomsky” topic from the browser I created out of five theory-oriented journals is a good example.
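A crude but serviceable way to set those bottom-of-the-model topics aside is simply to rank topics by their total weight across the corpus and drop the tail. A sketch, with placeholder data standing in for a real document-by-topic table:

```python
import numpy as np

# Placeholder document-by-topic proportions (rows: documents, columns: topics)
doc_topics = np.random.dirichlet(np.full(100, 0.1), size=500)

# With hyperparameter optimization, a handful of topics absorb much of the corpus
# while many others sit near the bottom with very little weight overall.
corpus_weight = doc_topics.mean(axis=0)
ranked = np.argsort(corpus_weight)[::-1]  # topics from heaviest to lightest

cutoff = 0.005  # arbitrary threshold on average corpus-wide proportion
keep = [t for t in ranked if corpus_weight[t] > cutoff]
print(f"Keeping {len(keep)} of {doc_topics.shape[1]} topics for interpretation")
```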

I was particularly intrigued by Jockers’s description of combining topic-model and stylometric classifications into a similarity matrix. I would be bewildered and intimidated by the underlying statistical difficulties of combining these two types of classification myself, but the results are certainly striking. The immortal George Payne Rainsford James’s The False Heir, for example, was classified as the closest non-Dickens novel to A Tale of Two Cities (161).
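I don’t know exactly how Jockers weights or combines the two signals, so the following is only a guess at the general shape of such a procedure: compute a distance matrix from topic proportions and another from stylometric features, put them on a common scale, and average them.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder feature tables for the same set of 120 novels:
# topic proportions and stylometric features (e.g., frequent-word frequencies)
topic_props = np.random.dirichlet(np.ones(50), size=120)
style_feats = np.random.rand(120, 200)

def normalized_distances(features, metric):
    """Pairwise distance matrix scaled to [0, 1] so two matrices are comparable."""
    d = squareform(pdist(features, metric=metric))
    return d / d.max()

combined = 0.5 * normalized_distances(topic_props, "cityblock") \
         + 0.5 * normalized_distances(style_feats, "cosine")

# The five nearest neighbors of novel 0, excluding itself
print(np.argsort(combined[0])[1:6])
```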

3. The JDH Issue

Scott Weingart and Elijah Meeks, as I noted above, co-edited a recent issue of JDH devoted to topic modeling in the humanities. Many of the articles are versions of widely circulated posts from the last few months, such as the aforementioned Ben Schmidt article and Andrew Goldstone and Ted Underwood’s piece on topic-modeling PMLA. (Before I got distracted by topic-browsers, I created some network visualizations of topics similar to those in the Underwood and Goldstone piece; a sketch of that kind of graph-building is below. I get frustrated easily with Gephi for some reason, but the network visualization packages in R don’t generally produce graphs as handsome as Gephi’s.) The issue also includes a shortened version of David Blei’s “Probabilistic Topic Models” review article and the slides from David Mimno’s very informative presentation at November’s topic-modeling workshop at the University of Maryland. Megan R. Brett does a good job of explaining what’s interesting about the process to a non-specialist audience. I’ve tried this myself two or three times, and it’s much more difficult than I expected it would be. The slightly decontextualized meanings of “topic,” “theme,” “document,” and possibly even “word” that are used to describe the process cause confusion, from what I’ve observed, and it’s also quite difficult to grasp why the “bag of words” approach can produce coherent results if you’re unaccustomed to thinking about the statistical properties of language. Formalist training and methods are hard to reconcile with frequency-based analysis.
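For what it’s worth, those topic networks don’t require anything fancier than topic-topic correlations across documents. The sketch below (not what I actually used; it’s a Python/networkx guess at the workflow) builds such a graph and writes a GEXF file that Gephi can open and lay out:

```python
import numpy as np
import networkx as nx

# Placeholder document-by-topic proportions (rows: documents, columns: topics)
doc_topics = np.random.dirichlet(np.ones(40), size=300)
corr = np.corrcoef(doc_topics.T)  # topic-topic correlation across documents

G = nx.Graph()
for i in range(corr.shape[0]):
    G.add_node(i, label=f"topic_{i}")
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr[i, j] > 0.1:  # arbitrary threshold to keep the graph sparse
            G.add_edge(i, j, weight=float(corr[i, j]))

# Gephi reads GEXF directly; with placeholder data the graph may be nearly empty,
# but real models produce clusters of positively correlated topics.
nx.write_gexf(G, "topic_network.gexf")
```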

Lisa Rhody’s article describes using LDA to model ekphrastic poetry. I was impressed with Rhody’s discussion of interpretation here, as poetry presents a different level of abstraction from secondary texts and even from other forms of creative writing. I had noticed in the rhetoric browser I created out of College English, jac, Rhetoric Review, Rhetoric Society Quarterly, and CCC that the poems often published in College English consistently clustered together (and that topic would presumably have held together even had I stop-worded “poems,” which I probably should have done). Rhody’s article is the longest of the contributions, I believe, and it has a number of observations about the interpretation of topics that I want to think about more carefully.

Finally, the overview of tools available for topic modeling was very helpful. I’ve never used Paper Machines on my Zotero collections, but I look forward to trying it out in the near future. A tutorial on using the R lda package might have been a useful addition, though perhaps its target audience would be too small to bother. I think I might be one of the few humanists to have experimented with dynamic topic models, which I find a useful and productive, if daunting, LDA variant. (MALLET has a built-in hierarchical LDA model, but I haven’t yet experimented with it.)

*Here is an informative storified conversation about distance measurements for topic models that Brown showed me.

**Possibly interesting detail: at no point do any of my browser-creation programs use objects or any data structure more complicated than a hash. If you’re familiar with the types of data manipulation necessary to create one of these, that probably sounds somewhat crazy, hence my reluctance to share the code on GitHub or similar. I know enough to know that it’s not the best way to solve the problem, but it also works, and I don’t feel the need to rewrite it for legibility and some imagined community’s approval. I’m fascinated by the ethos of code-sharing, and I might write something longer about this later.

***I disagree with the University of Illinois Press’s decision to use sigils instead of numbered notes in this book. As a reader, I prefer endnotes, though I know how hard they are to typeset, but Jockers’s book has enough of them that they should be numbered.