Creating a Topic Browser of HathiTrust Data

Tue Sep 22, 2015

The “Word Frequencies in English-Language Literature, 1700-1922” data set from the HathiTrust digital library was released last month. (See Ted Underwood’s post for more detail.) It contains word-frequency lists of texts from the digitized HathiTrust collection published between 1700 and 1922, divided into fiction, poetry, and drama. (A description of the method used to classify the documents can be found here.)

There are many approaches to exploring this data. What I’m going to describe is building a topic browser of a model created with LDA. I’ll be using slightly modified versions of Andrew Goldstone’s dfrtopics and dfr-browser. (See this post for more detail on Goldstone’s powerful, flexible, and well-documented package.) This browser is a wonderfully detailed interactive topic-model visualization: for examples of it being used to visualize models created from JSTOR’s Data for Research, see Signs, seven literary studies journals, and the Journal of Modern Literature.

[Update: 10/10/15. Andrew has written a very useful post detailing some updates to his package that make it easier to work with HathiTrust data. Where his instructions conflict with mine below, I recommend his approach.]

Data Gathering and Preparation

The first step is to download fiction_metadata.csv. Then download the slice of the fiction data from 1920-22 (fiction_1920-1922.tar.gz). I would put the archive into a directory named “tsv”. Enter that directory and type:

tar -xzf fiction_1920-1922.tar.gz

(You may also use a GUI archive utility, depending on your preferences and platform.)

There will now be six thousand or so *.tsv files in the directory. In order to process them, we will need to filter the metadata file so that it matches the files we’ve downloaded. For this task and many of the others, I’ll be using R.

library(dplyr)
setwd("~/hathi/")
meta <- read.csv("fiction_metadata.csv", stringsAsFactors=FALSE)
meta <- filter(meta,date >= 1920)

The dplyr package is not strictly necessary to perform these tasks, but my examples will rely on it. If you don’t have it installed, you’ll need to run install.packages("dplyr") first. Also, change “~/hathi” above to whatever directory you stored your “fiction_metadata.csv” file in.
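As a quick sanity check, something like the following sketch can confirm that the filtered metadata and the extracted files line up. It assumes that the id column is named htid and that each file is named by its id plus “.tsv” (both are suggested by the copy script later in this post). The helper name missing_files is my own invention.

```r
# Hypothetical helper: return the ids that have no matching
# <id>.tsv file in the given directory.
missing_files <- function(ids, dir) {
  on_disk <- sub("\\.tsv$", "", list.files(dir, pattern = "\\.tsv$"))
  setdiff(ids, on_disk)
}

# Usage, assuming `meta` has an htid column and the files are in "tsv":
# missing_files(meta$htid, "tsv")
```

An empty result means every row of the metadata has a file on disk.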

Duplicates

We now have a data frame that matches what we have in our “tsv” directory. If you start to examine its contents, you will notice several things. Perhaps the most difficult of these from a modeling perspective is that there are a large number of duplicate texts. Some are exact copies, and some are clearly the same book with minor modifications in the associated metadata. Eliminating these programmatically is surprisingly difficult. I have found no consistently reliable method. One approach would be to eliminate exact title duplicates before filtering (this might also help with the reprints problem discussed below):

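# .keep_all = TRUE keeps the other metadata columns in current versions of dplyr
meta <- distinct(meta, title, .keep_all = TRUE)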

If you enter the above command before filtering by date, the 1920-22 slice will be reduced from 6019 rows to 3014. If you enter it after filtering, it will be reduced to 3692. The problem with the former approach, of course, is that different books can have the same title. This problem is less pronounced with a chronologically reduced slice, but it’s still there. Either approach will still keep many duplicates whose titles do not match exactly. Normalizing the titles by case and eliminating punctuation marks would help, I think, but I’m not going to describe that step here. What I recommend is writing the file out and reviewing it in a spreadsheet program. Use that program’s sort features to identify duplicate titles (and also reprints, if that concerns you), and manually delete them. This approach is manageable for smaller slices of the data, but it would not be practical for the entire file.
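For the normalization mentioned above, a minimal sketch in base R might look like the following. The helper name normalize_title and the sample titles are illustrations of my own, not part of dplyr or of the data set. The idea is simply to lower-case, strip punctuation, and collapse whitespace before checking for duplicates.

```r
# Hypothetical helper: lower-case a title, strip punctuation, and
# collapse runs of whitespace so near-identical titles compare equal.
normalize_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Made-up sample titles for illustration
titles <- c("The Age of Innocence.", "the age of innocence", "Main Street")
keep <- !duplicated(normalize_title(titles))
titles[keep]
```

On the real data this could be applied as meta[!duplicated(normalize_title(meta$title)), ], with the same caveat as before: different books that share a title will be collapsed into one.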

To write your csv file for processing,

write.csv(meta, file="fiction_metadata_1920-22.csv", row.names=FALSE)

I use OpenOffice’s Calc for reviewing the data. Make sure to export the results as a csv file. The defaults for separators and quotes should be fine. I do not know about Excel’s behavior, but make sure that nothing has been changed in the format. If it has, the following steps will likely fail.

Reprints

There are many reprints in this world. If you manually processed your filtered csv file, you doubtless noticed many prominent 18th- and 19th-century authors in your data. If you took the most aggressive approach to eliminating duplicates described above (selecting distinct titles before filtering by date), these are the most prolific authors left in the 1920-22 data:

meta %>%
group_by(author) %>%
summarise(count=n()) %>%
arrange(desc(count))
author count
France, Anatole 14
Hope, Laura Lee 14
Fitzhugh, Percy Keese 12
Fletcher, J. S. 10
Curwood, James Oliver 9
Grey, Zane 9
James, Henry 9
Roy, Lillian Elizabeth 9
Cobb, Irvin S. 8

(I omit “Anon.,” who leads with 61.)

If we filtered the original data set by date without removing duplicate titles at all, the list would be

meta <- read.csv("fiction_metadata.csv", stringsAsFactors=FALSE)
meta %>%
filter(date >= 1920) %>%
group_by(author) %>%
summarise(count=n()) %>%
arrange(desc(count))
author count
James, Henry 71
Turgenev, Ivan 55
Hardy, Thomas 51
Conrad, Joseph 37
France, Anatole 36
Stevenson, Robert Louis 36
Twain, Mark 30
Hudson, W. H. 28
Tarkington, Booth 26

If we remove duplicate titles after filtering for our time period:

meta %>%
filter(date >= 1920) %>%
distinct(title, .keep_all = TRUE) %>%
group_by(author) %>%
summarise(count=n()) %>%
arrange(desc(count))
author count
James, Henry 26
France, Anatole 23
Conrad, Joseph 20
Turgenev, Ivan 20
Stevenson, Robert Louis 17
Tarkington, Booth 15
Henry, O. 14
Hope, Laura Lee 14
Curwood, James Oliver 13

Though it can be misleading, a quick glance at these authors supports the contention that eliminating duplicate titles before filtering by date will produce data with the fewest reprints of these simple methods.

Preparing for Topic Modeling

The data set that we will topic model will be the one produced by the following commands:

meta <- read.csv("fiction_metadata.csv", stringsAsFactors=FALSE)
meta <- meta %>%
distinct(title, .keep_all = TRUE) %>%
filter(date >= 1920)

It should have 3014 rows. Before we load the package that will perform the topic modeling, we have to save our filtered csv metadata file and create a new directory containing only those .tsv files that correspond to it. To save the file:

write.csv(meta, file="fiction_metadata_1920-22.csv", row.names=FALSE)

I would create a new directory named “20-22-tsv” or similar for the files that match our filtered metadata. To copy them over, you can use a simple perl script like this:

use strict;
use warnings;
use File::Copy qw(copy);

# The filtered metadata file written out from R above
open(my $meta, "<:encoding(UTF-8)", "fiction_metadata_1920-22.csv")
  or die "Can't open metadata file: $!";

while (my $line = <$meta>) {
  # The HathiTrust id is the first comma-separated field
  my ($record) = split /,/, $line;
  $record =~ s/"//g;

  next if $record eq "htid"; # skip the header row

  my $fi = $record . ".tsv";

  # File::Copy does not go through the shell, so '$' characters
  # in hathi ids do not need to be escaped
  copy("tsv/$fi", "20-22-tsv/$fi") or warn "Could not copy $fi: $!";
}

close($meta);

Save that code as “copy.pl” in the directory that contains your metadata file and the tsv directories and run perl copy.pl. (I’m aware that things would be simpler if I kept all the code to one language, but I don’t have the patience to do something in R that I can do very quickly in perl. It’s one of my many failings.)
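For anyone who would rather stay in R, the same copy can probably be done with base R’s file.copy. This is only a sketch: the helper name copy_matching is hypothetical, and it assumes, as the perl script does, that each file is named by its htid plus “.tsv”.

```r
# Hypothetical helper: copy <id>.tsv for each id from one directory to
# another, returning a named logical vector of which copies succeeded.
copy_matching <- function(ids, from, to) {
  dir.create(to, showWarnings = FALSE)
  files <- paste0(ids, ".tsv")
  ok <- file.copy(file.path(from, files), file.path(to, files))
  names(ok) <- ids
  ok
}

# Usage, assuming `meta` is the filtered metadata with an htid column:
# copied <- copy_matching(meta$htid, "tsv", "20-22-tsv")
# sum(!copied)  # number of ids that were not copied
```

Because no shell is involved, ids containing '$' need no special handling here.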

We are almost ready to load the topic-modeling package. It is likely that if you skip the following step, a spurious topic will be generated consisting of words that begin with the unicode ligatures ﬁ and ﬂ.

cd 20-22-tsv
perl -CSD -pi -e 's/[\x{FB01}]/fi/g' *.tsv
perl -CSD -pi -e 's/[\x{FB02}]/fl/g' *.tsv

(See this stackoverflow answer (and this legendary one) for more details on this irksome issue. This step requires a version of perl later than 5.08. There are doubtless other ways to handle this with python or even R, but perl is what I resort to in times of crisis, and I’m too old to go changing my ways.)
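As one of those other ways, here is a rough R equivalent. It is a sketch rather than something I have run over the whole corpus: the helper name replace_ligatures is my own, and the commented-out loop assumes the files are UTF-8.

```r
# Replace the fi and fl ligatures (U+FB01 and U+FB02) in a character vector.
replace_ligatures <- function(x) {
  x <- gsub("\ufb01", "fi", x, fixed = TRUE)
  gsub("\ufb02", "fl", x, fixed = TRUE)
}

# Usage on the extracted files, assuming they are UTF-8:
# for (f in list.files("20-22-tsv", full.names = TRUE)) {
#   lines <- readLines(f, encoding = "UTF-8")
#   writeLines(replace_ligatures(lines), f, useBytes = TRUE)
# }
```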

Now that we have prepared our files, we need to install the R package that will topic model them. I have added a few functions to Andrew Goldstone’s dfrtopics to process the HathiTrust files. To install my fork in R:

library(devtools)
install_github("joncgoodwin/dfrtopics")

Before loading the library, we will need to allocate more memory to java:

options(java.parameters="-Xmx8g")

This would allocate 8 gigabytes of memory. My laptop has 16, and this much memory usually won’t cause any serious problems. Four may be enough, but Java (which is called by the RMallet package that is loaded by the package) will run out of memory quite quickly with larger models.

Now we load the library:

library(dfrtopics)

Next we must process the metadata file and read the tsv files:

meta <- read_hathi_metadata("fiction_metadata_1920-22.csv")
counts <- read_wordcounts_hathi(list.files("20-22-tsv", full.names=TRUE))

The next step involves making a stopword list. Stopwords are those that the model will ignore. MALLET, the tool that this R package uses, comes with a default stopword list. The dfrtopics package also includes an expanded one. Adding stopwords is an interpretive decision. Without the most common English prepositions and articles stopped, the model will be almost useless. What you do from there, however, depends on what you are looking for. The list that I use has many names. The one packaged with dfrtopics has even more names. Names will dominate any topic model made of fiction if they are not blocked. There are imaginable reasons, however, why someone might want to see this behavior.
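If you want to assemble your own list from several sources (function words plus character names, say), a sketch along these lines would do it. The helper name write_stoplist and the tiny word vectors are made up for illustration. A real list would be far longer.

```r
# Hypothetical helper: merge several word vectors into one lower-cased,
# de-duplicated stoplist and write it out one word per line.
write_stoplist <- function(path, ...) {
  words <- tolower(unlist(list(...)))
  words <- sort(unique(words[nzchar(words)]))
  writeLines(words, path)
  invisible(words)
}

# Made-up sample vectors
function_words <- c("the", "of", "and", "a")
character_names <- c("John", "Mary", "john")

# Usage:
# write_stoplist("stopwords.txt", function_words, character_names)
```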

Once you have decided on a stopword list, apply it to your data with

stoplist <- readLines("stopwords.txt")
counts <- counts %>% wordcounts_remove_stopwords(stoplist)

(This assumes that your stopword list is in the working directory and named “stopwords.txt”.)

As Andrew Goldstone explains here, LDA does not perform well with many words that are used infrequently. (As you can see, I’m following this tutorial almost step-by-step here.) To keep only the 20K most frequently used words:

counts <- counts %>% wordcounts_remove_rare(20000)
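To make the idea concrete, here is a toy version of that kind of filter in base R. It is an illustration of the general technique, not the dfrtopics implementation, and the column names id, word, and weight in the sample data frame are assumptions on my part.

```r
# Toy illustration (not the dfrtopics implementation): keep only rows
# whose word is among the n most frequent words in the whole corpus.
keep_top_words <- function(counts, n) {
  totals <- tapply(counts$weight, counts$word, sum)
  ranked <- names(totals)[order(totals, decreasing = TRUE)]
  top <- head(ranked, n)
  counts[counts$word %in% top, , drop = FALSE]
}

# Made-up word counts for two documents
toy <- data.frame(
  id = c("a", "a", "b", "b", "b"),
  word = c("sea", "ship", "sea", "ship", "keel"),
  weight = c(5, 2, 4, 1, 1),
  stringsAsFactors = FALSE
)
keep_top_words(toy, 2)  # drops the row for "keel"
```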

The next two commands will transform our data to a form suitable to be sent to MALLET:

docs <- wordcounts_texts(counts)
ilist <- make_instances(docs)

Determining how many topics to model is the next step. With about 3000 files of this size and an aggressive stopword approach, I have found that somewhere between 100 and 150 topics gives the best results. Typically, you will need to run the model several times to find a useful number of topics. We will choose 100 topics for this example:

m <- train_model(ilist, n_topics=100, n_iters=200,metadata=meta)

There are several parameters that can be passed to the train_model function. The “n_iters” parameter affects the quality of the model and also the length of time it will take to run. With a 100-topic model and 200 iterations on my laptop, this code will finish in about thirty minutes.

As the MALLET process is running, you will see it report on the total number of tokens in your corpus. It will list a series of log-likelihood figures (these are negative, and they should generally move closer to zero as the iterations proceed). Every fifty iterations it will print out a list of the topics it has inferred at that point.

Once it has finished, we are ready to make a browser to explore the model. First, we need to export the data we just created:

export_ht_browser_data(m, "data")

The final step of this function creates a “topic_scaled.csv” file. It will currently take significantly longer than the rest of the files. This file is not necessary to see the majority of the browser, and an update to the dfrtopics package has been made to make it much faster. My fork does not yet contain it, however. I will update this section when it does.

I have modified dfr-browser to work with this HathiTrust data. Install my fork with git clone https://github.com/joncgoodwin/dfr-browser.git. Now copy the files from that “data” directory you just created into “dfr-browser/data”. To complete the process, enter these shell commands:

cd dfr-browser
bin/server

Open “localhost:8888” in your web browser. If all went according to plan, you should see your topic browser. A browser of the 1920-1922 fiction data created with the filtering-distinct-titles-before-filtering-by-date method is here.