Topic Models and Highly Cited Articles Pierre Nora's Between Memory and History in Representations
I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on there. Another important reason is a generalized distrust and suspicion of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation-counts in any discipline.
I used google scholar to search for most-cited articles in several journals in literary studies and allied fields. (Its default search behavior is to return the most-cited article in its database, which, while having a very broad reach, is far from comprehensive or error-free.) By far the most-cited article I found in any of the journals I looked at was Pierre Nora’s “Between Memory and History: Les Lieux de Memoire.” A key to success in citation-gathering is multidisciplinary appeal, and Nora’s article has it. It is cited in history, literary studies, anthropology, sociology, and several other fields. (It would be interesting to consider Nora’s argument about the ever-multiplying sites of memory in era of mass quantification, but I’ll have to save that for another time.)
The next question that came to mind would be where Nora’s article would be classified in a topic model of all of the journal’s articles. Representations was first published in 1983. The entire archive in JSTOR contains 1036 documents. For many of my other topic-modeling work with journals, I have only used what JSTOR classifies as research articles. Here, because of the relatively small size of the sample (and also because I wanted to see how the algorithm would classify front matter, back matter, and the other paraphernalia), I used everything. In order to track “Between Memory and History,” I created several different models. It is always a heuristic process to match the number of topics with the size and density of a given corpus. Normally, I would have guessed that somewhere between 30-50 would have been good enough to catch most of the distinct topics while minimizing the lumping together of unrelated ones.
For this project, however, I decided to create six separate models with an incrementally increasing number of topics. The number of topics in each is 10, 30, 60, 90, 120, and 150. I have also created browsers for each model. The index page of each browser shows the first four words of each topic for that model. The topics are sorted in descending order of their proportion in the model. Clicking on one of the topics takes you to a page which shows the full list of terms associated with that topic, the articles most closely associated with that topic (also sorted in descending order—the threshold is .05), and a graph that shows the annual mean of that topic over time. Clicking on any given journal article will take you to a page showing that journal’s bibliographic information, along with a link to JSTOR. The four topics most closely associated with that article are also listed there.
In the ten-topic browser, whose presence here is intended to demonstrate
my suspicion that ten topics would not be nearly enough to capture the
range of discourse in Representations, Nora’s article is in the
‘French’ topic,
a lumped-together race/memory
topic, a
generalized
social/history
topic, and the suggestive “time, death, narrative”
topic.
With a .05 threshold, 32% of the documents in the corpus appear in
the ten-topic browser. [UPDATE: 3/16, this figure turned out to be
based on a bug in the browser-building program.] None of these
classifications are particularly surprising or revealing, given how
broad the topics have to be at this level of detail; but one idea that I
want to return is the ability of topic-models to identify influential
documents in a given corpus. Nora’s article has clearly been very
influential, but are there any detectable traces of this influence in a
model of the journal in which it appeared?
Sean M. Gerrish and David Blei’s article “Language-based Approach to Measuring Scholarly Impact” uses dynamic topic models to infer which documents are (or will be) most influential in a given collection. What I have done with these Representations models is not dynamic topic modeling but the regular LDA model. I have experimented with dynamic topic models in the past, and I would like to apply the particular techniques described in their article once I can understand them better.
Here is how Nora’s article is classified in each of the topic models (sorted vertically from most to least representative):
There is a notable consistency between the topics the article is assigned to no matter how many there are to choose from. A logical question to ask is if Nora’s article is assigned to more or less topics than the average article across these six models. The percentage of all articles that are assigned to a topic with a proportional threshold >= .05 ranges from 32% with the ten-topic model to 52% in the 150-topic.
In my next post, I am going to describe the relative frequency of the average article in the different models and try to identify which ones (including Nora’s, if it turns out to be) are disproportionately represented in the topics. I will also begin interpreting these results in light of what I felt was historicism’s relative absence in the theory-journals corpus I created earlier.
[UPDATE: 3/16. I corrected a bug in the browser-building program and generated a new table above with the correct topics linked for Nora’s article. The previous table had omitted a few.]