One Hundred Topics (1700-1922)

Sat, Oct 10, 2015

Since I wrote my last set of directions for creating a topic-browser from the “Genre-specific wordcounts for 178,381 volumes from the HathiTrust Digital Library”, Andrew Goldstone updated his dfrtopics package to make it easier to use for this purpose. I haven’t completely rewritten the post, though where his instructions conflict with mine (using R instead of dropping into Perl for ligature substitution, for example), his solution is both more elegant and more technically reliable.

Adventures in De-duplication

Most of the time I’ve spent working on this project over the last two weeks has been devoted to detecting and eliminating duplicates in the fiction data. (I’ve also taught Kafka, Joyce, and Woolf in that time, and, though I hesitate to mention it, administered the largest undergraduate program on my campus of 17K students. That not much time overall has been spent, as another overworked administrator once put it, “is the salient fucking point.”) It’s a harder problem than I initially realized. Ted Underwood graciously pointed me to some code he had written to assist with this fearsome procedure. Something there is (in me) that doesn’t love Java, however, and I have been reluctant to delve into its intricacies.

A brute-force solution appealed to my better instincts. Why not compare each text (as a bag-of-words) to every other? Cosine similarity should give values very close to 1 for reprints. Some tests of known duplicates revealed this to be the case. Comparing each bag-of-words to every other, however, would require ten billion operations. Furthermore, the pre-processing required to convert the bag-of-words into a document-term matrix that could be efficiently compared was significant. My R code for managing this stupendous process was making about twenty comparisons per minute, so I’ll leave it as an exercise to the reader to determine whether it would finish before the nanobots summon the Old Ones.
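
For the curious, here is a minimal sketch in base R of the measure in question, plus the arithmetic behind the operation count. The function name cosine_similarity and the toy word counts are made up for illustration, and the figure of 100,000 texts is only an assumed round number, not the exact size of the fiction corpus.

```r
# Cosine similarity between two bags-of-words stored as named count vectors.
cosine_similarity <- function(a, b) {
  vocab <- union(names(a), names(b))
  # Align both vectors on the combined vocabulary; absent words count as 0.
  x <- setNames(numeric(length(vocab)), vocab)
  y <- x
  x[names(a)] <- a
  y[names(b)] <- b
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical toy counts: an edition, a reprint of it, and an unrelated novel.
first_edition <- c(whale = 14, sea = 30, ship = 22, captain = 9)
reprint       <- c(whale = 15, sea = 29, ship = 23, captain = 9, preface = 2)
other_novel   <- c(garden = 18, tea = 12, letter = 9, sea = 1)

sim_reprint <- cosine_similarity(first_edition, reprint)      # near 1
sim_other   <- cosine_similarity(first_edition, other_novel)  # near 0

# Cost of comparing everything to everything, for an assumed corpus size.
n_texts <- 100000
n_all_ordered <- n_texts^2                    # every text against every text
n_pairs       <- n_texts * (n_texts - 1) / 2  # each unordered pair once
```

Counting each pair only once roughly halves the work, which still leaves billions of comparisons.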

Using only a small number of comparison words (which seems to be the approach that Underwood’s code linked above uses) and rewriting the code to be more efficient (or in a faster language) would improve performance, but I don’t think it would be enough to be feasible for someone with my limited computational resources. If you browse the HathiTrust catalog, you’ll notice that they recommend similar texts. I suspect, though I haven’t investigated this thoroughly, that within their supercomputing clusters and quantum fusion storage centers, they have a correlation matrix of all the texts. I could ask for some additional metadata. That’s one potential solution.
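
A rough sketch of the restricted-vocabulary idea, in base R, might look like the following. It is not Underwood’s code, and I am only assuming it resembles his approach in spirit; the toy texts, the five-word vocabulary, and the 0.95 threshold are all hypothetical. The point is that once every text is reduced to the same short vector, a single matrix product yields every pairwise cosine similarity.

```r
# Toy bags-of-words as named count vectors (hypothetical data).
texts <- list(
  vol_a = c(whale = 14, sea = 30, ship = 22, captain = 9),
  vol_b = c(whale = 15, sea = 29, ship = 23, captain = 9, preface = 2),
  vol_c = c(garden = 18, tea = 12, letter = 9, sea = 1)
)

# A small, fixed comparison vocabulary (assumed; a real list would be longer).
vocab <- c("sea", "ship", "whale", "garden", "tea")

# Build a texts-by-words matrix restricted to that vocabulary.
dtm <- t(sapply(texts, function(counts) {
  row <- setNames(numeric(length(vocab)), vocab)
  shared <- intersect(names(counts), vocab)
  row[shared] <- counts[shared]
  row
}))

# Scale each row to unit length so that one matrix product gives all
# pairwise cosine similarities. (A text containing none of the vocabulary
# words would produce NaN here and would need to be dropped first.)
unit_rows  <- dtm / sqrt(rowSums(dtm^2))
similarity <- unit_rows %*% t(unit_rows)

# Keep each pair once (upper triangle) and flag those above the threshold.
threshold  <- 0.95  # arbitrary cut-off for illustration
candidates <- which(similarity > threshold & upper.tri(similarity),
                    arr.ind = TRUE)
```

With these toy counts, only the vol_a and vol_b pair is flagged. The matrix still grows with the square of the number of texts, so this shrinks the cost per comparison rather than the number of comparisons.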

Another would be only comparing texts that are likely to be duplicates. Many of the duplicates can be identified by author and title similarity, and a matrix of (I’m guessing) 5000x5000 would be tractable. All of the editions of Cervantes wouldn’t be identified this way, however, and that’s just one prominent example.
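
As a sketch of what that pre-filtering could look like, the base R below normalizes author and title strings into a key and uses edit distance (adist, from the utils package) to propose candidate pairs. The column names htid, author, and title, the sample rows, and the cut-off of three edits are assumptions for illustration, not a description of the actual metadata file.

```r
# Hypothetical metadata rows; the column names are assumed.
meta <- data.frame(
  htid   = c("v1", "v2", "v3", "v4"),
  author = c("Dickens, Charles", "Dickens, Charles, 1812-1870",
             "Cervantes Saavedra, Miguel de", "Abbott, Jacob"),
  title  = c("Bleak House", "Bleak house.", "Don Quixote", "Rollo at work"),
  stringsAsFactors = FALSE
)

# Lower-case, drop punctuation and digits (such as life dates), and
# collapse runs of spaces.
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  trimws(gsub(" +", " ", x))
}

key <- paste(normalize(meta$author), normalize(meta$title))

# Pairwise edit distances between keys; small distances suggest duplicates.
distances <- adist(key)
max_edits <- 3  # arbitrary tolerance for illustration
pairs <- which(distances <= max_edits & upper.tri(distances), arr.ind = TRUE)

likely_duplicates <- data.frame(
  first  = meta$htid[pairs[, "row"]],
  second = meta$htid[pairs[, "col"]],
  stringsAsFactors = FALSE
)
```

Only the pairs proposed this way would then go on to the more expensive bag-of-words comparison. Editions catalogued under different titles would slip through, which is exactly the Cervantes problem.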

The One Hundred Topics

Thwarted for the time being by my failed efforts to eliminate duplicates in the 1900-1922 slice of the fiction corpus, I thought that sampling the entire corpus might be a welcome diversion. I downloaded all of the fiction word lists, read the fiction_metadata.csv file into R, and used this simple command (from dplyr) to sample five percent of the total:

meta <- meta %>% sample_n(as.integer(nrow(meta) * .05))

(Using a rounding function rather than as.integer is probably a good idea. You could also keep adding some magrittr pipes to write that expression in a consistent style.)
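
To show the difference the rounding function makes, here is a small sketch that uses only base R so that it runs without dplyr. The stand-in metadata table of 1,015 rows and the seed are hypothetical; the same sample size could just as well be passed to sample_n.

```r
set.seed(2015)  # arbitrary seed, only so the sample is reproducible

# Stand-in for the real metadata table (hypothetical contents).
meta <- data.frame(htid = sprintf("vol%04d", 1:1015),
                   stringsAsFactors = FALSE)

# Five percent of 1,015 rows is 50.75.
n_truncated <- as.integer(nrow(meta) * 0.05)  # truncates toward zero
n_rounded   <- round(nrow(meta) * 0.05)       # goes to the nearest integer

# Draw the sample of rows without replacement.
sampled <- meta[sample(nrow(meta), n_rounded), , drop = FALSE]
```

On a table this small the two sizes differ by one row; on the full metadata file the difference is negligible, which is why I only call rounding "probably a good idea."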

I then followed the procedure outlined here to create this browser of one hundred topics. One hundred topics of a presumably representative sample of much of the English-language fiction published between 1700 and 1922? I was dizzy contemplating the possibilities while waiting about an hour for the model to be generated. I knew that the aggressive stop-word list I had cultivated for the 1900-1922 slice was not going to capture many of the names that would appear in earlier fiction. I also could anticipate that some of the OCR errors only seen in earlier texts would cluster in a topic. I did not anticipate, however, the ineluctable force of Jacob Abbott and his ubiquitous Rollo books.

At first I thought this topic was headed by a largely unknown volume of potential Kafkan interest. But it seems to have been a classifier error. Though I haven’t written much about it, the data itself includes page-level classifications. Some texts have only a few pages designated as fiction, which means that distance measurements between them and other texts with few pages will often report false positives. That’s another wrinkle for duplicate detection. Dumas (along with “nice name he has”), Dickens, Defoe, and Disraeli all have topics of their own, along with several other prolific writers. It was optimistic of me to think that a random sample would eliminate many of the duplicates, but I think the model is still worth exploring.
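
One way to keep those few-page texts from polluting the comparisons would be to set them aside before measuring any distances. The sketch below assumes a per-volume count of pages classified as fiction in a column I am calling fiction_pages; that column name, the sample values, and the cut-off of twenty pages are all hypothetical.

```r
# Hypothetical per-volume counts of pages classified as fiction.
volumes <- data.frame(
  htid          = c("v1", "v2", "v3", "v4"),
  fiction_pages = c(312, 4, 287, 2),
  stringsAsFactors = FALSE
)

min_pages <- 20  # arbitrary cut-off for illustration

# Volumes with enough fiction pages to compare meaningfully.
comparable <- volumes[volumes$fiction_pages >= min_pages, ]

# Volumes to inspect by hand instead (possible classifier errors).
set_aside <- volumes[volumes$fiction_pages < min_pages, ]
```

The right threshold would have to be found by looking at what ends up in set_aside.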

This war topic does not show quite the time-series that you would expect, but I attribute that largely to the sampling process. The early texts in this detective topic are also potentially interesting. I don’t know how well known this autobiography of a London detective is, for example, but I wouldn’t have found it any other way. As a model of the dominant themes in fiction over two centuries, this model is at best incomplete—at worst, a complete failure. As a serendipity-enhancer, however, I think it has some value.

After I get the de-duplication procedure sorted out, I will write about a twentieth-century model of fiction (and possibly the poetry and drama as well).