Since I wrote my last set of directions for creating a topic-browser from the “Genre-specific wordcounts for 178,381 volumes from the HathiTrust Digital Library”, Andrew Goldstone has updated his dfrtopics package to make it easier to use for this purpose. I haven’t completely rewritten the post, but where his instructions conflict with mine (using R instead of dropping into perl for ligature substitution, for example), his solutions are both more elegant and more technically reliable.
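For anyone wondering what the ligature substitution involves: OCR-derived word lists often contain single-codepoint ligatures (ﬁ, ﬂ) where plain letter pairs belong. Goldstone does this in R; here is a minimal equivalent sketch in Python, where the ligature table is my own assumption about which forms appear in the data:

```python
# Map common Unicode ligature codepoints to their plain-letter spellings.
# This table is illustrative; check your corpus for which forms occur.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

def expand_ligatures(text):
    """Replace each ligature codepoint with its plain-letter equivalent."""
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    return text
```

The same one-line substitution is easy to express in R with gsub or in perl; the point is simply to normalize the word lists before counting.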
The “Word Frequencies in English-Language Literature, 1700-1922” data set from the HathiTrust Digital Library was released last month. (See Ted Underwood’s post for more detail.) It contains word-frequency lists for texts from the digitized HathiTrust collection published between 1700 and 1922, divided into fiction, poetry, and drama. (A description of the method used to classify the documents can be found here.)
There are many approaches to exploring this data. What I’m going to describe is building a topic browser of a model created with LDA.
Of the many interesting things in Matthew Jockers’s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I’m somewhat skeptical of the notion of an “uninterpretable” topic. I prefer to think of it as a topic that hasn’t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting—terms that are either separated from their natural place or agglomerated into an unhappy mass.
1. Ongoing Concerns
Matthew Jockers’s Macroanalysis: Digital Methods & Literary History arrived in the mail yesterday, and I finished reading just a short while ago. Between it and the recent Journal of Digital Humanities issue on the “Digital Humanities Contribution to Topic Modeling,” I’ve had quite a lot to read and think about. John Laudun and I also finished editing our forthcoming article in The Journal of American Folklore on using topic-models to map disciplinary change.
I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because citation counts have never much caught on there as a metric of importance. Another important reason is a generalized distrust of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation counts in any discipline.
I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I’ve been using to try to test this—LDA and the dynamic LDA variant—do a very good job of picking up the patterns that you would expect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect to find.
Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn’t commonly build browsers for their models. For me, the answer was quite simple: I couldn’t figure out how to get the necessary output files from MALLET to use with Allison Chaney’s topic modeling visualization engine. I’m sure that MALLET can be configured to produce them, and I’ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn’t actually used lda-c itself (except through an R package front-end) for my own models.
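The first step toward any browser is reading MALLET’s per-document output back into a usable form. The layout of the --output-doc-topics file differs across MALLET releases; the sketch below assumes the older sparse format, where each non-comment line lists a document index, the source name, and alternating topic-id/proportion pairs, so treat it as a starting point to adapt rather than a definitive parser:

```python
def parse_doc_topics(lines):
    """Parse MALLET --output-doc-topics lines (older sparse format:
    doc-index, doc-name, then alternating topic-id/proportion pairs).
    Returns {doc_name: {topic_id: proportion}}."""
    docs = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header comment and blank lines
        fields = line.split()
        name = fields[1]
        pairs = fields[2:]
        docs[name] = {int(t): float(p)
                      for t, p in zip(pairs[0::2], pairs[1::2])}
    return docs
```

Newer MALLET versions emit one proportion per topic in topic order instead of sparse pairs, so check a few lines of your own output before relying on this.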
After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.
I downloaded all of the articles (word-frequency data for each article, that is) in New Literary History, Critical Inquiry, boundary 2, Diacritics, Cultural Critique, and Social Text.
When I started experimenting with graphing changes in topic-proportions over time, I didn’t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy to adjust using ggplot2’s many parameters.
It wasn’t. It didn’t take me too long to figure out that I needed to change the data from discrete to continuous in order to see anything like a sparkline, but it was also apparent from the other data sets I was working with that taking the mean at intervals was the only way to make a reasonably clean graph.
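The interval-mean smoothing described above is simple to state: bucket the observations by fixed spans of years and average within each bucket. I did the actual graphs in R with ggplot2; this is only a hedged Python sketch of the idea, and the five-year bin width is an arbitrary illustration:

```python
from collections import defaultdict

def interval_means(observations, width=5):
    """Average topic proportions within fixed-width year bins.
    observations is a list of (year, proportion) pairs for one topic;
    returns sorted (bin_start_year, mean_proportion) pairs."""
    bins = defaultdict(list)
    for year, proportion in observations:
        bins[(year // width) * width].append(proportion)
    return sorted((start, sum(vals) / len(vals))
                  for start, vals in bins.items())
```

Plotting the binned means instead of every yearly value is what turns the over-busy discrete graph into something closer to a sparkline.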
I was impressed with Ian Milligan’s visualizations of Canadian parliamentary debates, and I wanted to try to visualize some of the topic models I’ve been creating from JSTOR’s Data for Research.
I thought ELH would be an interesting journal to try, as each issue publishes articles on quite a range of literary periods, often from medieval to twentieth-century material. I assumed that LDA would be likely to identify each of these periods as a topic.
When I first began reading about topic modeling, I very much wanted to experiment with “dynamic” topic modeling, or the tracking of changes in topics over time. David Blei and John Lafferty describe their algorithm in this paper. They also have made a dynamic topic model browser of Science available. I was very impressed with this project and wanted to apply the technique to journals in the humanities using JSTOR’s Data for Research (DfR).
Here are some instructions for creating the same types of topic models of JSTOR’s journals that I did with Critical Inquiry and Signs.
These instructions are designed for someone using a Mac or Linux platform. (The minor differences between the two, mainly where files are stored, should be apparent to anyone who uses Linux, so I won’t indicate them here.) All of this should work on Windows, but you’ll need to install Cygwin or use alternate shell commands.
Here are two topic models of Critical Inquiry, generated with the same algorithm but different implementations (MALLET and the R topicmodels package) and slightly different stopword lists; the latter was also generated with a minimum word frequency of seven:
0 black music musical white african jazz sound performance american racial negro song cultural sounds race rap cage singer composer
1 meaning theory interpretation question philosophy language point claim philosophical sense truth fact argument knowledge intention metaphor text account speech
2 american duke james john trans culture life william things modern cambridge michael david robert soviet shame henry york objects
3 trans time question subject derrida language place order object relation word thing reading moment longer things work thought writing
4 history historical narrative discourse account contemporary terms status context social ways relation discussion essay sense form representation specific position
5 god christian history religious greek ancient modern tradition divine century body early philosophy latin nature religion medieval church soul
6 film cinema films camera screen images frame image movie theater shot early visual cinematic narrative kiss hollywood scene documentary
7 science scientific human knowledge media theory sciences natural studies life social history technology communication machine humanities disciplines system psychology
8 body time game process space affect play form motion hand ways level attention bodies turn making figure physical parts
9 political social politics cultural power culture theory society ideology critique intellectual ideological state economic class liberal struggle revolution marx
10 law legal public case justice trial political war court state violence rights states moral crime speech abuse slave united
11 art painting work visual image picture artist images fig paintings works artists artistic photography aesthetic museum photograph objects object
12 form work time terms art nature individual structure order reality analysis general style works experience concept process elements theory
13 literary literature criticism text reading book work critics texts writing reader fiction language english author readers works read critic
14 life human moral good man sense great experience work fact kind find personal idea mind character people view social
15 time years people day great long house young called read book wrote man word times year men left english
16 story love death man dead face life eyes point scene narrative moment long real james stories heart narrator characters
17 italian di del fig della il fascist spanish inca ii italy autumn giovanni st saint che text building verdi
18 german history benjamin trans historical freud von germany art modern das memory panofsky essay berlin early hegel walter war
19 public war time national city american education work social economic space people urban culture corporate building united market business
20 poetry poem poet language poems poetic poets english lines literary lyric word romantic text verse poetics prose pound milton
21 women sexual female woman male feminist desire sex men sexuality mother gender freud identity body psychoanalytic child psychoanalysis feminine
22 jewish jews israel israeli palestinian state arab jew religious land people political religion identity palestinians muslim islamic rabbi al
23 french en france title sur qui dans paris une paul text ne foucault letter est man jean derrida au
24 cultural european culture colonial western american national chinese african indian native english identity white british racial race south africa
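The minimum word frequency cutoff mentioned above, dropping any word with fewer than seven total occurrences in the corpus before fitting the model, can be sketched as follows. This is a hedged Python stand-in for the idea, not the actual R topicmodels workflow:

```python
from collections import Counter

def filter_by_min_frequency(docs, min_count=7):
    """Drop words whose total corpus frequency falls below min_count.
    docs is a list of token lists; returns the filtered token lists."""
    totals = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if totals[w] >= min_count] for doc in docs]
```

Such a cutoff shrinks the vocabulary considerably, which partly explains why the two models above surface somewhat different word lists despite using the same algorithm.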
Natalia Cecire tweeted during the topic-modeling workshop that she was momentarily excited by thinking that a presentation on the journal Science was on Signs: Journal of Women in Culture and Society. As it turns out, I have been experimenting with creating topic models from JSTOR’s Data for Research, and I decided to see what the Signs corpus would come up with.
I downloaded word-frequency data for all the issues of the journal.
I have finally done some experimenting with topic modeling. Though there are a variety of pre-existing packages for this (MALLET and the Stanford Topic Modeling Toolbox are two), I used the R package topicmodels combined with Will Lowe’s JFreq to create a topic model of some of Marjorie Bowen’s historical novels.
Why Marjorie Bowen? Let me admit something: I haven’t actually read any of her historical fiction. And besides, as I understand it, topic modeling works best on large and distinct corpora.