Two Topic Browsers
Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn’t commonly build browsers for their models. For me, the answer was quite simple: I couldn’t figure out how to get the necessary output files from MALLET to use Allison Chaney’s topic modeling visualization engine. I’m sure that the output can be configured to do so, and I’ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn’t actually used lda-c (except through an R package front-end) for my own models.
It occurred to me that a simple browser wouldn’t be that hard to build myself, so I made one for Clancy’s explorations of the rhetoric/composition journals in JSTOR and another for the theory corpus. (I did use Chaney’s CSS file.) I used my old graphs without the scatterplots layer for the theory-browser, as I didn’t want to take the time to regenerate those yet. And I’m not sure quite what’s going on with Unicode/non-ASCII characters; theoretically the code I wrote should convert those properly. [UPDATE: Thanks to a pointer from Andrew Goldstone on Twitter, I fixed the encoding issue. Calling binmode with ":utf8" on all filehandles is the answer, in Perl at least.]
The articles shown for each topic are those for which that topic is the single most strongly associated one. It’s quite possible that other articles have a higher proportion of a given topic but are left off its list because another topic is even more strongly associated with them. I should also rewrite the code so that it grabs every article in which the topic’s proportion clears a certain threshold.
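To make the difference between those two selection rules concrete, here is a minimal Python sketch. It is not the code behind the browsers; the article names, the proportions, and the function names are all made up for illustration, and it reads the threshold as a minimum topic proportion, which is an assumption on my part.

```python
from collections import defaultdict

# Hypothetical document-topic proportions (one row per article, three topics).
doc_topics = {
    "article_a": [0.50, 0.45, 0.05],
    "article_b": [0.10, 0.25, 0.65],
    "article_c": [0.25, 0.40, 0.35],
    "article_d": [0.70, 0.20, 0.10],
}


def _sorted_groups(groups):
    # Sort each topic's articles by proportion, highest first.
    return {
        topic: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for topic, pairs in groups.items()
    }


def articles_by_dominant_topic(doc_topics):
    """List each article only under its single strongest topic."""
    groups = defaultdict(list)
    for doc, props in doc_topics.items():
        top = max(range(len(props)), key=props.__getitem__)
        groups[top].append((doc, props[top]))
    return _sorted_groups(groups)


def articles_above_threshold(doc_topics, threshold):
    """List each article under every topic whose proportion is >= threshold."""
    groups = defaultdict(list)
    for doc, props in doc_topics.items():
        for topic, proportion in enumerate(props):
            if proportion >= threshold:
                groups[topic].append((doc, proportion))
    return _sorted_groups(groups)


if __name__ == "__main__":
    print(articles_by_dominant_topic(doc_topics)[1])
    print(articles_above_threshold(doc_topics, 0.3)[1])
```

In this toy data, article_a has the highest proportion of topic 1 (0.45), yet the dominant-topic rule files it only under topic 0, so topic 1's list contains article_c alone. With a threshold of 0.3, both articles show up under topic 1.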