It has come to my attention that some instructions I posted about four years ago are in need of revision. I appreciate the attention given to this post by the authors, and I wouldn’t even want to notice that the dead link they identified was moved by the same organization that one of them works for, because what kind of precedent does that set?
In all seriousness, I am sure that the container models that Hathi now provides are more reliable ways of performing these kinds of tasks.
Like many netizens, I was amused by Andrej Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks” when it first appeared. I don’t mean the explanation of what a recurrent neural network is or the claim that there’s much wisdom in Paul Graham’s essays. The text-generation samples, however, were really neat. RNN text-generators power many bots on the social media platform known as “twitter,” and I suspect that they may also be used in commercial solicitations.
When I wrote my last post about modeling Darko Suvin’s genres of Victorian science fiction, I did not have access to Suvin’s comprehensive bibliography. What can I say? The Louisiana State Library loaned me theirs, but it took a few days. I was forced to model the texts that Suvin claimed were not science fiction. While I could guess what many of the books that Suvin would admit to the Victorian SF canon were, I preferred to wait until I could see them in cold print before gathering them.
You can’t go far in reading about science fiction’s genres without encountering the work of Darko Suvin. His Metamorphoses of Science Fiction is among the most widely cited and influential works in the field. Suvin published a reference work devoted to Victorian SF after that: Victorian Science Fiction in the UK. Two related articles appeared in Science Fiction Studies in 1979 and 1980: “On What Is and Is Not an SF Narration; With a List of 101 Victorian Books That Should Be Excluded from SF Bibliographies” and “Seventy-Four More Victorian Books That Should Be Excluded from Science Fiction Bibliographies”.
Kieran Healy posted last year about “sleeping beauties” in philosophy: papers that went several years before receiving any citations but that ended up accumulating many. This pattern is unusual, as most papers receive a good number of citations immediately and continue to do so (or the opposite). I think literary studies and history are less paper-driven than philosophy, and I would encourage everyone to read this for more context on citations in the humanities.
Since I wrote my last set of directions for creating a topic-browser from the “Genre-specific wordcounts for 178,381 volumes from the HathiTrust Digital Library”, Andrew Goldstone updated his dfrtopics package to make it easier to use for this purpose. I haven’t completely rewritten the post, though where his instructions conflict with mine (using R instead of dropping into perl for ligature substitution, for example), his solution is both more elegant and more technically reliable.
The “Word Frequencies in English-Language Literature, 1700-1922” data set from the HathiTrust digital library was released last month. (See Ted Underwood’s post for more detail.) It contains word-frequency lists of texts from the digitized HathiTrust collection published between 1700 and 1922, divided into fiction, poetry, and drama. (A description of the method used to classify the documents can be found here.)
There are many approaches to exploring this data. What I’m going to describe is building a topic browser of a model created with LDA.
The Current Situation
Around last year at this time, I became interested in what the archived editions of the MLA Job Information List could tell us about how the profession has changed over time. The MLA provided page-scans of all the JILs going back to 1965, and Jim Ridolfo used commercial OCR software to make them searchable. Once the documents were searchable, finding the first occurrence of various key words and graphing their frequency over time became feasible.
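As a rough illustration of the kind of counting that searchable text makes possible, here is a minimal Python sketch. It is not the code used for the JIL project: the function names and the toy `sample` texts are hypothetical, and it assumes the OCR output has already been gathered into one string per year.

```python
import re


def keyword_counts_by_year(texts_by_year, keyword):
    """Count case-insensitive whole-word matches of `keyword` in each year's text."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return {
        year: len(pattern.findall(text))
        for year, text in sorted(texts_by_year.items())
    }


def first_occurrence(counts):
    """Return the earliest year with a nonzero count, or None if there is none."""
    for year in sorted(counts):
        if counts[year] > 0:
            return year
    return None


# Hypothetical toy data standing in for the OCR'd text of each year's list.
sample = {
    1965: "Assistant professor of English, Victorian literature.",
    1975: "Specialist in composition and rhetoric; rhetoric courses.",
    1985: "Theory and rhetoric; computers and writing.",
}

counts = keyword_counts_by_year(sample, "rhetoric")
print(counts)
print(first_occurrence(counts))
```

The per-year counts are raw totals. To graph frequency over time in a comparable way, they would probably need to be divided by the total number of words (or ads) in each year.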
“The Real Nature of Control” [I originally didn’t quite have the kinks worked out of my org-mode HTML export process that produced the document below, but I have updated the post. There is also a pdf of these remarks about Brian Lennon’s “The Digital Humanities and National Security.”]
“The Real Nature of Control”
The last text I assigned in my recent “Modernism, Fascism, and Sexuality” seminar was Gravity’s Rainbow. Among its many oddities is a scene where the spirit of Walther Rathenau is summoned through a medium for the entertainment and mockery of an elite “corporate Nazi crowd”: These signs are real.
Two stories caught my attention yesterday. The first was a review of some recent studies of citation practices by field, broadly considered. The claim that alarmed a number of people on twitter was that “82%” of humanities scholarship was never cited. I pointed out that it was a mistake to assume that “never cited” means “never read.” That someone would even make this inference is quite mysterious to me. Let me explain: this semester, I have been teaching, for the first time, a course on the Victorian novel.
I don’t remember exactly when the MLA digitized all of the issues of the Job Information List, but I was excited about what these documents could tell us about institutional history, the job market, salary trends, and many other things. The PDFs hosted by MLA are image scans, however, which are not immediately searchable as plain text. A variety of OCR solutions are available, but I personally was too lazy to attempt to use any of them.
A problem that many of the co-citation graphs I discussed in the last post share is that they are too dense to be easily readable. I created the sliders as a way of alleviating this problem, but some of the data sets are too dense at any citation-threshold. Being able to view only one of the communities at a time seemed like a plausible solution, but I was far from sure how to implement it using d3.
I’ve created several new co-citation graphs recently. While I enjoy looking at the visualizations, I haven’t yet analyzed any of them thoroughly. The film studies network was intriguing to me for several reasons, and I’m going to explore it now in more detail.
I downloaded just over 12K articles from various film studies journals in Web of Science. The journals are Sight and Sound; Film Comment; Literature/Film Quarterly; American Film; Cinema Journal; Screen; Historical Journal of Film, Radio, and Television; Journal of Popular Film & Television; Wide Angle; Film Quarterly; Journal of Film and Video; Film Criticism; and Quarterly Review of Film & Video.
I’ve written here and here about creating co-citation networks in D3 from Web of Science data. My first experiment, described above, was creating a threshold slider. I next wanted to try to create a chronological slider that would allow you to adjust the dates of the citations in the network.
There are doubtless many ways of going about doing this, and I’m reasonably sure that the method I’m going to describe is far from ideal.
After reading Kieran Healy’s latest post about women and citation patterns in philosophy, I wanted to revisit the co-citation graph I had made of five journals in literary and cultural theory. As I noted, one of these journals is Signs, which is devoted specifically to feminist theory. I didn’t think that its presence would skew the results too much, but I wanted to test it. Here are the top thirty citations in those five journals:
I wanted to modify this script by Neal Caren to create an adjustable graph that allows you to control the threshold of citations for nodes that will appear on the graph. If for example, you wanted to see only those nodes with twenty or more citations, you can just move the slider over to see those, and the data will automatically update. I have created three of these: Modernist Journals, Literary Theory, and Rhetoric and Composition.
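The filtering step behind such a slider can be sketched in a few lines. This is only an illustration in Python, not Neal Caren’s script or the d3 code behind the graphs: the `nodes` and `links` structures, their field names, and the sample values are all assumptions.

```python
def filter_by_threshold(nodes, links, threshold):
    """Keep nodes with at least `threshold` citations and the links between them."""
    kept = {node["id"] for node in nodes if node["citations"] >= threshold}
    visible_nodes = [node for node in nodes if node["id"] in kept]
    # A link stays visible only if both of its endpoints survive the cut.
    visible_links = [
        link for link in links
        if link["source"] in kept and link["target"] in kept
    ]
    return visible_nodes, visible_links


# Hypothetical co-citation data; the names and counts are invented.
nodes = [
    {"id": "A", "citations": 35},
    {"id": "B", "citations": 22},
    {"id": "C", "citations": 8},
]
links = [
    {"source": "A", "target": "B"},
    {"source": "A", "target": "C"},
]

visible_nodes, visible_links = filter_by_threshold(nodes, links, 20)
print([node["id"] for node in visible_nodes])
print(visible_links)
```

In a browser, the slider’s change event would call something like this function and redraw the graph from the filtered data.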
I’ve been interested in humanities citation analysis for some time now, though I had been somewhat frustrated in that work by JSTOR pulling its citation data from its DfR portal a year or so ago. It was only a day or two ago with Kieran Healy’s fascinating post on philosophy citation networks that I noticed that the Web of Science database has this information in a relatively accessible format. Healy used Neal Caren’s work on sociology journals as a model.
Of the many interesting things in Matthew Jockers’s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I’m somewhat skeptical of the notion of an “uninterpretable” topic. I prefer to think of it as a topic that hasn’t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting—terms that are either separated from their natural place or agglomerated into an unhappy mass.
1. Ongoing Concerns
Matthew Jockers’s Macroanalysis: Digital Methods & Literary History arrived in the mail yesterday, and I finished reading just a short while ago. Between it and the recent Journal of Digital Humanities issue on the “Digital Humanities Contribution to Topic Modeling,” I’ve had quite a lot to read and think about. John Laudun and I also finished editing our forthcoming article in The Journal of American Folklore on using topic-models to map disciplinary change.
I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on there. Another important reason is a generalized distrust and suspicion of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation-counts in any discipline.
One of my secret vices is reading polemics about whether or not some group of people, usually humanists or librarians, should learn how to code. What’s meant by “to code” in these discussions varies quite a lot. Sometimes it’s a markup language. More frequently it’s an interpreted language (usually python or ruby). I have yet to come across an argument for why a humanist should learn how to allocate memory and keep track of pointers in C, or master the algorithms and data structures in this typical introductory computer science textbook; but I’m sure they’re out there.
I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I’ve been using to try to test this (LDA and the dynamic LDA variant) do a very good job of picking up the patterns that you would expect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect to find.
Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn’t commonly build browsers for their models. For me, the answer was quite simple: I couldn’t figure out how to get the necessary output files from MALLET to use Allison Chaney’s topic modeling visualization engine. I’m sure that the output can be configured to do so, and I’ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn’t actually used lda-c (except through an R package front-end) for my own models.
No one likes gamification or MOOCs, as far as I can tell. What I should say is that anyone trained in the hermeneutics of suspicion might even find it hard to accept their existence. It’s hard to come up with a hypothetical concept that would cry more piteously to the heavens for critique, for example. True to form, until a few weeks ago I had never earned a badge in my life and would have regarded the prospect of doing so with contempt and a touch of pity for whoever was naive enough to suggest it.
After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.
I downloaded all of the articles (word-frequency data for each article, that is) in New Literary History, Critical Inquiry, boundary 2, Diacritics, Cultural Critique, and Social Text.
When I started experimenting with graphing changes in topic-proportions over time, I didn’t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy to adjust using ggplot2’s many parameters.
It wasn’t. It didn’t take me too long to figure out that I needed to change the data from discrete to continuous in order to see anything like a sparkline, but it was also apparent from the other data sets I was working with that taking the mean at intervals was the only way to make a reasonably clean graph.
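Taking the mean at intervals amounts to something like the following. This is a sketch in Python rather than the R and ggplot2 code I was actually using, and the function name, the five-year width, and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def mean_by_interval(points, width):
    """Average (year, proportion) pairs within fixed-width year intervals.

    Each interval is labeled by its starting year, e.g. 1990 for 1990-1994
    when width is 5.
    """
    buckets = defaultdict(list)
    for year, proportion in points:
        start = (year // width) * width
        buckets[start].append(proportion)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }


# Hypothetical topic proportions for individual articles, keyed by year.
points = [(1990, 0.10), (1992, 0.20), (1996, 0.30), (1999, 0.50)]

smoothed = mean_by_interval(points, 5)
print(smoothed)
```

Plotting one point per interval instead of one per article is what turns the busy scatter into something closer to a sparkline, at the cost of hiding variation within each interval.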
I was impressed with Ian Milligan’s visualizations of Canadian parliamentary debates, and I wanted to try to visualize some of the topic models I’ve been creating from JSTOR’s Data for Research.
ELH I thought would be an interesting journal to try, as it publishes articles in each issue on quite a range of literary periods, often ranging from medieval to twentieth-century material. I assumed that LDA would be likely to identify each of these periods as a topic.
When I first began reading about topic modeling, I very much wanted to experiment with “dynamic” topic modeling, or the tracking of changes in topics over time. David Blei and John Lafferty describe their algorithm in this paper. They also have made a dynamic topic model browser of Science available. I was very impressed with this project and wanted to apply the technique to journals in the humanities using JSTOR’s Data for Research (DfR).
Here are some instructions for creating the same types of topic models of JSTOR’s journals that I did with Critical Inquiry and Signs.
These instructions are designed for someone using a Mac or Linux platform. (The differences below between using Linux and a Mac should be apparent to anyone who uses Linux, so I’m not going to indicate them here; it’s mainly where files are stored.) All of this should work on Windows, but you’ll need to install Cygwin or use alternate shell commands.
Here are two topic models of Critical Inquiry, generated with the same algorithm but different implementations (MALLET and the R topicmodels package, with slightly different stopword lists; the latter was also generated with a minimum word frequency of seven):
0 black music musical white african jazz sound performance american racial negro song cultural sounds race rap cage singer composer
1 meaning theory interpretation question philosophy language point claim philosophical sense truth fact argument knowledge intention metaphor text account speech
2 american duke james john trans culture life william things modern cambridge michael david robert soviet shame henry york objects
3 trans time question subject derrida language place order object relation word thing reading moment longer things work thought writing
4 history historical narrative discourse account contemporary terms status context social ways relation discussion essay sense form representation specific position
5 god christian history religious greek ancient modern tradition divine century body early philosophy latin nature religion medieval church soul
6 film cinema films camera screen images frame image movie theater shot early visual cinematic narrative kiss hollywood scene documentary
7 science scientific human knowledge media theory sciences natural studies life social history technology communication machine humanities disciplines system psychology
8 body time game process space affect play form motion hand ways level attention bodies turn making figure physical parts
9 political social politics cultural power culture theory society ideology critique intellectual ideological state economic class liberal struggle revolution marx
10 law legal public case justice trial political war court state violence rights states moral crime speech abuse slave united
11 art painting work visual image picture artist images fig paintings works artists artistic photography aesthetic museum photograph objects object
12 form work time terms art nature individual structure order reality analysis general style works experience concept process elements theory
13 literary literature criticism text reading book work critics texts writing reader fiction language english author readers works read critic
14 life human moral good man sense great experience work fact kind find personal idea mind character people view social
15 time years people day great long house young called read book wrote man word times year men left english
16 story love death man dead face life eyes point scene narrative moment long real james stories heart narrator characters
17 italian di del fig della il fascist spanish inca ii italy autumn giovanni st saint che text building verdi
18 german history benjamin trans historical freud von germany art modern das memory panofsky essay berlin early hegel walter war
19 public war time national city american education work social economic space people urban culture corporate building united market business
20 poetry poem poet language poems poetic poets english lines literary lyric word romantic text verse poetics prose pound milton
21 women sexual female woman male feminist desire sex men sexuality mother gender freud identity body psychoanalytic child psychoanalysis feminine
22 jewish jews israel israeli palestinian state arab jew religious land people political religion identity palestinians muslim islamic rabbi al
23 french en france title sur qui dans paris une paul text ne foucault letter est man jean derrida au
24 cultural european culture colonial western american national chinese african indian native english identity white british racial race south africa
Natalia Cecire tweeted during the topic-modeling workshop that she was momentarily excited by thinking that a presentation on the journal Science was on Signs: Journal of Women in Culture and Society. As it turns out, I have been experimenting with creating topic models from JSTOR’s Data for Research, and I decided to see what the Signs corpus would come up with.
I downloaded word-frequency data for all the issues of the journal.
The Guardian recently posted some sales data of the Booker Prize winners. I thought it would be interesting to compare those figures with LibraryThing ownership to see how reliable that latter figure might be in determining a book’s total sales. The median was 2.77%, mean 3.88%.
The table is below, not very well-formatted I’m afraid.
1969 PH Newby Something To Answer For Faber & Faber 421 64 15.20%
1970 Bernice Rubens The Elected Member Eyre & Spottiswoode 3,901 133 3.
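For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes that the last three columns of each row are sales, LibraryThing owners, and owners as a percentage of sales, which is consistent with the 1969 row (64 of 421). The function name and the list of percentages used to demonstrate the summary statistics are hypothetical, not the real Booker data.

```python
from statistics import mean, median


def ownership_percentage(sales, owners):
    """LibraryThing owners as a percentage of reported sales."""
    return owners / sales * 100


# The 1969 row from the table: 421 sales, 64 owners (column order assumed).
newby = ownership_percentage(421, 64)
print(round(newby, 2))

# Hypothetical percentages, used only to show how the summary figures
# would be computed across all the rows.
example_percentages = [1.0, 2.0, 6.0]
print(median(example_percentages), mean(example_percentages))
```

A mean above the median, as in the real figures (3.88% against 2.77%), suggests that a few titles with unusually high ownership ratios are pulling the average up.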
I have finally done some experimenting with topic modeling. Though there are a variety of pre-existing packages for this (MALLET and the Stanford Topic Modeling Toolbox are two), I used the R package topicmodels combined with Will Lowe’s JFreq to create a topic model of some of Marjorie Bowen’s historical novels.
Why Marjorie Bowen? Let me admit something: I haven’t actually read any of her historical fiction. And besides, topic modeling is thought to work best on large and distinct corpora as I understand it.
Using a modified (and depleted) version of this script written by digital historian blogger William J. Turkel, I tried to see how useful an automated source generator Amazon’s Statistically Improbable Phrases and Capitalized Phrases data would be for Gene Wolfe’s Soldier of Sidon:
SIP red land
Atlantis: Insights from a Lost Civilization
Soldier of Sidon
Red Land Yellow River: A Story from the Cultural Revolution
The Golden Star of Halich: A Tale of the Red Land in 1362
My contribution to the Moretti event:
“Suppose at this juncture we were to state the blindingly obvious: that, whatever their other properties, literary texts do not possess genes” (59). So begins the “Perils of Analogy” section of Christopher Prendergast’s response* to Moretti. Notwithstanding the Paris Review interviews, it does seem difficult to maintain that literature has genes. Does it have memes, however? Ideologemes? Maybe. And I will discuss metaphors of cultural transmission and evolutionary analogies in Moretti’s argument.
I’ve announced the upcoming Valve book event on Franco Moretti’s Graphs, Maps, Trees, about which I’m excited.
Also, Mark Bauerlein has an article (currently subscription) in the Chronicle about adolescent culture and the decline of literacy. In many ways, I think Bauerlein misses the mark here; but for now I just want to note that this:
The fact that involvement fell while access rose signals a new stance toward literature and the arts among the young.