Two Topic Browsers

Wed, Feb 13, 2013

Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn’t commonly build browsers for their models. For me, the answer was quite simple: I couldn’t figure out how to get the necessary output files from MALLET to use Allison Chaney’s topic modeling visualization engine. I’m sure that the output can be configured to do so, and I’ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn’t actually used lda-c (except through an R package front-end) for my own models.

It occurred to me that a simple browser wouldn’t be that hard to build myself, so I made one for Clancy’s explorations of the rhetoric/composition journals in JSTOR and another for the theory corpus. (I did use Chaney’s CSS file.) I used my old graphs without the scatterplots layer for the theory-browser, as I didn’t want to take the time to regenerate those yet. And I’m not sure quite what’s going on with Unicode/non-ASCII characters; theoretically the code I wrote should convert those properly. [UPDATE: Thanks to a pointer from Andrew Goldstone on Twitter, I fixed the encoding issue. Calling binmode with ":utf8" on all filehandles is the answer, in Perl at least.]

The articles shown for each topic are those that have that topic most strongly associated with them. It’s quite possible that other articles have higher proportions of a given topic but have another topic even more strongly associated with them, in which case they won’t appear under it. I should also rewrite the code so that it grabs all articles whose proportion for a topic passes a certain threshold of significance.
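To make the difference between those two selection rules concrete, here is a minimal sketch in Python (not the Perl the browsers were actually built with). Everything in it is an assumption for illustration: the function names, the article names, the proportions, and the 0.30 cutoff are all made up, and the input is assumed to be a simple mapping from each article to its list of topic proportions.

```python
from collections import defaultdict

# Hypothetical document-topic proportions (each row sums to 1).
doc_topics = {
    "article_a": [0.50, 0.45, 0.05],
    "article_b": [0.30, 0.10, 0.60],
    "article_c": [0.40, 0.35, 0.25],
}


def articles_by_dominant_topic(doc_topics):
    """List each article only under the topic with its highest proportion."""
    groups = defaultdict(list)
    for doc, props in doc_topics.items():
        top = max(range(len(props)), key=lambda t: props[t])
        groups[top].append((doc, props[top]))
    # Strongest articles first within each topic.
    for topic in groups:
        groups[topic].sort(key=lambda pair: pair[1], reverse=True)
    return dict(groups)


def articles_above_threshold(doc_topics, threshold):
    """List each article under every topic whose proportion meets the threshold."""
    groups = defaultdict(list)
    for doc, props in doc_topics.items():
        for topic, proportion in enumerate(props):
            if proportion >= threshold:
                groups[topic].append((doc, proportion))
    for topic in groups:
        groups[topic].sort(key=lambda pair: pair[1], reverse=True)
    return dict(groups)


dominant = articles_by_dominant_topic(doc_topics)
thresholded = articles_above_threshold(doc_topics, 0.30)  # 0.30 is an arbitrary cutoff
```

With this toy data, topic 1 gets no articles at all under the dominant-topic rule, even though article_a is 45% topic 1, because topic 0 edges it out. Under the threshold rule, article_a shows up under both topics.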