Experimenting with Dynamic Topic Models

Sun Nov 18, 2012

When I first began reading about topic modeling, I very much wanted to experiment with “dynamic” topic modeling, or the tracking of changes in topics over time. David Blei and John Lafferty describe their algorithm in this paper. They have also made a dynamic topic model browser of Science available. I was very impressed with this project and wanted to apply the technique to journals in the humanities using JSTOR’s Data for Research (DfR).

Thankfully, the source code for creating dynamic topic models is also available. (Be sure to apply the patch listed under “Issues,” or you’ll get segmentation faults, at least on my computer.) Building this code into an executable program may require several steps, depending on your setup. If you use Linux, it’s just a matter of installing the GNU Scientific Library (GSL). On a Mac, you have to download the Xcode command-line tools from Apple and then install GSL. For Windows, I’d guess that Cygwin would be the best way to go, though I used to compile programs written for Unix systems with Visual C++ without too much trouble.

The code itself requires two inputs: a matrix of word frequencies in the LDA-C (“ldac”) format (I use text2ldac to generate these) and a file of time-sequences. The matrix can be a bit tricky to generate from the DfR data. You first have to decide what kind of time-slices you want, and then you have to rename all of the files to reflect their date (the DOIs will not automatically correlate to date of publication). I wrote a Perl script to do this. It is sadly far too embarrassing to share with anyone at this point, but I’ll see what I can do if there’s any interest. You then need to count how many documents are in each time-slice. I also used a Perl script to do this. Once you have that information, you can write the “-seq.dat” file: the total number of time-slices goes on the first line, followed by the number of documents in each time-slice, one per line.
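The counting step and the “-seq.dat” layout described above can be sketched in a few lines. This is a minimal illustration in Python, not the Perl script mentioned above; the helper names (slice_index, seq_dat_text), the decade-wide slices starting in 1900, and the example years are all assumptions made for the sake of the example.

```python
from collections import Counter


def slice_index(year, first_year=1900, width=10, n_slices=11):
    """Return the 0-based time-slice for a publication year (hypothetical helper).

    Years past the last full slice fold into the final one, so 2010
    lands in the same slice as 2000-2009.
    """
    if year < first_year:
        raise ValueError(f"{year} is before the first slice")
    return min((year - first_year) // width, n_slices - 1)


def seq_dat_text(years, n_slices=11):
    """Build the contents of a "-seq.dat" file from document years.

    First line: number of time-slices. Following lines: number of
    documents in each slice, in chronological order.
    """
    counts = Counter(slice_index(y, n_slices=n_slices) for y in years)
    lines = [str(n_slices)] + [str(counts.get(i, 0)) for i in range(n_slices)]
    return "\n".join(lines) + "\n"


# Toy input: one publication year per document (made-up values).
example_years = [1901, 1905, 1913, 1999, 2003, 2010]
print(seq_dat_text(example_years), end="")
```

The counts only line up with the word-frequency matrix if the documents in that matrix appear in chronological order, which is presumably why the files have to be renamed by date before text2ldac processes them.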

Now you can execute the dtm code. It has many options, and there’s a shell script in the main directory which outlines several of them for you. I would just copy that code and make the necessary changes. Once it has finished running, the “lda-seq” directory of the output folder will contain one file per topic, each holding the log probability of every term in that topic at each time-slice.

I used R to read these and correlate them with the “.vocab” file, or word-list, that text2ldac generates. You have to reshape each file into a matrix with one column per time-slice, import the word list, and then sort by each column to get the top terms for each time-slice. I wrote an R function to do this as well, but it’s, if anything, more embarrassing than the Perl.
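To make the reshape-and-sort step concrete, here is a small sketch in Python using toy data rather than real dtm output. It assumes, as the R function at the end of this post does, that each topic file is a flat list of log probabilities laid out term by term, with one value per time-slice for each term. The function name top_terms and the sample vocabulary and probabilities are made up for illustration.

```python
import math


def top_terms(log_probs, vocab, n_slices, n_top):
    """Return the n_top most probable terms for each time-slice.

    log_probs is the flat list of numbers read from one topic file,
    assumed to hold one row per term with one value per time-slice.
    """
    if len(log_probs) != len(vocab) * n_slices:
        raise ValueError("values do not match vocabulary size x slices")
    result = []
    for t in range(n_slices):
        # Column t: the probability of every term at time-slice t.
        column = [math.exp(log_probs[w * n_slices + t]) for w in range(len(vocab))]
        # Term indices sorted from most to least probable.
        ranked = sorted(range(len(vocab)), key=lambda w: column[w], reverse=True)
        result.append([vocab[w] for w in ranked[:n_top]])
    return result


# Toy data: three terms, two time-slices (made-up probabilities).
vocab = ["french", "mardi", "gras"]
probs = [0.6, 0.2,   # french
         0.3, 0.5,   # mardi
         0.1, 0.3]   # gras
log_probs = [math.log(p) for p in probs]
for slice_terms in top_terms(log_probs, vocab, n_slices=2, n_top=2):
    print(slice_terms)
```

With the toy numbers, "french" leads the first slice and "mardi" leads the second, which is the same kind of drift the real output shows on a larger scale.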

My colleagues John Laudun and Clai Rice have been working on citation networks in The Journal of American Folklore. We are wondering if correlating regular LDA topic models or dynamic ones with changes in citation frequencies can reveal anything about disciplinary changes. To test the dynamic topic model, I ran it on the research articles in this journal from 1900 to 2010. I generated 50 topics, not all of which I will reproduce here. This topic is of local interest, however:

[1900-1909] "french" "dream" "german" "dreams" "yuma" "mohave" "ethnic" "local" "louisiana" "american"
[1910-1919] "french" "dream" "german" "dreams" "yuma" "mohave" "ethnic" "local" "louisiana" "american"
[1920-1929] "french" "dream" "german" "dreams" "yuma" "mohave" "ethnic" "local" "louisiana" "american"
[1930-1939] "french" "german" "dream" "dreams" "ethnic" "yuma" "mohave" "local" "louisiana" "american"
[1940-1949] "french" "german" "dream" "dreams" "ethnic" "local" "louisiana" "mohave" "yuma" "identity"
[1950-1959] "french" "german" "dream" "ethnic" "dreams" "local" "identity" "louisiana" "american" "group"
[1960-1969] "german" "french" "ethnic" "dream" "local" "identity" "dreams" "group" "american" "louisiana"
[1970-1979] "german" "ethnic" "french" "identity" "local" "dream" "mardi" "group" "gras" "american"
[1980-1989] "german" "ethnic" "french" "identity" "mardi" "local" "gras" "social" "american" "group"
[1990-1999] "mardi" "gras" "ethnic" "identity" "french" "german" "local" "american" "people" "louisiana"
[2000-2010] "mardi" "gras" "ethnic" "identity" "french" "american" "canadian" "people" "local" "cultural"

I am not ready to make any interpretive claims yet. Another thing I’ve been working on is using the WordNet lemmatizer from Python’s Natural Language Toolkit to lemmatize the corpus before topic-modeling it. It works, but it is also very slow.

Here is some code in R that you can use to extract the topics. First, setwd to the “lda-seq” directory and copy the “wordcounts.vocab” file into it. Here x is the number of time-slices, and y is the number of terms you want printed for each one. The code requires library(lda) for the read.vocab() function.

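library(lda)  # provides read.vocab()

# x: number of time-slices; y: number of top terms to print per slice
dynamic.topic <- function(x, y) {
    # One file per topic, holding log probabilities for every term and slice
    file.list <- list.files(pattern = "^topic-.*-var-e-log-prob\\.dat$")
    words <- read.vocab("wordcounts.vocab")

    for (file in file.list) {
        a <- scan(file)
        # Rows are terms, columns are time-slices
        b <- matrix(a, ncol = x, byrow = TRUE)
        # Convert log probabilities back to probabilities
        b[] <- exp(b)

        for (i in 1:x) {
            # Sort the terms by probability in slice i and print the top y
            to.sort <- words[order(-b[, i])]
            print(head(to.sort, y))
        }
    }
}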