Beginning Adventures in Topic Modeling

Sat, Aug 11, 2012

I have finally done some experimenting with topic modeling. Though there are a variety of pre-existing packages for this (MALLET and the Stanford Topic Modeling Toolbox are two), I used the R package topicmodels combined with Will Lowe’s JFreq to create a topic model of some of Marjorie Bowen’s historical novels.

Why Marjorie Bowen? Let me admit something: I haven’t actually read any of her historical fiction. And besides, as I understand it, topic modeling is thought to work best on large and distinct corpora. It’s not likely to reveal something about the works of a single author that a human reader wouldn’t notice. But this human reader hasn’t read them and was curious to try it out, even on a corpus of under a million words.

There are two techniques provided by the topicmodels package, as far as I can tell: Gibbs sampling and VEM (variational expectation maximization). With twenty topics modeled, the results are generally comparable. For instance, topic 9 of the Gibbs-sampled model was {boccage, children, woman, paris, herself, duke, duchess, monsieur, st, faubourg} and VEM topic 2 was {boccage, woman, herself, children, man, know, little, never, even, well}. VEM seems to track more closely to general word frequencies, but I haven’t done much looking under the hood of these.
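For anyone who wants to try the same comparison, here is a minimal sketch of fitting both kinds of model with topicmodels. It is not the code I ran on the Bowen novels: as a stand-in it uses the AssociatedPress document-term matrix that ships with the package, and the subset size, the number of topics, the seed, and the burn-in and iteration counts are all arbitrary assumptions chosen to keep the run short.

```r
library(topicmodels)

# Stand-in corpus (an assumption): the AssociatedPress document-term
# matrix bundled with topicmodels, cut down to 50 documents for speed.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:50, ]

# The post modeled twenty topics; five keeps this sketch quick.
k <- 5

# Variational EM, which is the default method of LDA().
fit_vem <- LDA(dtm, k = k, method = "VEM",
               control = list(seed = 2012))

# Gibbs sampling; burnin and iter are illustrative values only.
fit_gibbs <- LDA(dtm, k = k, method = "Gibbs",
                 control = list(seed = 2012, burnin = 100, iter = 500))

# Top ten terms per topic, one column per topic, for eyeballing
# how the two fits compare.
top_vem <- terms(fit_vem, 10)
top_gibbs <- terms(fit_gibbs, 10)
print(top_vem)
print(top_gibbs)
```

Topic numbers are not aligned between the two fits, so matching a Gibbs topic to its VEM counterpart means reading down the columns and pairing them up by hand.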

I suppose I might have allowed myself to hope for a series of juicy abstract nouns that would be relatively easy to relate to the contemporary historical circumstances of these novels, and with a more generous stop-word list, who knows what might emerge? It’s actually much more of a pain-in-the-ass to get one’s greedy little hands on digitized books than I had imagined beforehand.

Update: One tip I have is that if you get “invalid dimnames” errors when you import from the JFreq data, eliminate any double-quotes that are in the word.csv file.
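The quote-stripping can be done in base R rather than by hand. This is only a sketch: strip_double_quotes is a hypothetical helper name of my own, and the demonstration runs on a throwaway temporary file, so point it at your own word.csv (ideally a copy) when using it for real.

```r
# Hypothetical helper: remove every double-quote character from a text
# file. By default the cleaned text overwrites the input file.
strip_double_quotes <- function(path_in, path_out = path_in) {
  lines <- readLines(path_in, warn = FALSE)
  cleaned <- gsub('"', "", lines, fixed = TRUE)
  writeLines(cleaned, path_out)
  invisible(path_out)
}

# Demonstration on a temporary file standing in for word.csv.
demo_file <- tempfile(fileext = ".csv")
writeLines(c('said', '"well', 'duke"'), demo_file)
strip_double_quotes(demo_file)
cleaned_words <- readLines(demo_file)
print(cleaned_words)
```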