Darko Suvin's Genres of Victorian SF Revisited

Mon Oct 17, 2016

Engaging Preliminaries

When I wrote my last post about modeling Darko Suvin’s genres of Victorian science fiction, I did not have access to Suvin’s comprehensive bibliography. What can I say? The Louisiana State Library loaned me theirs, but it took a few days. I was forced to model the texts that Suvin claimed were not science fiction. While I could guess what many of the books that Suvin would admit to the Victorian SF canon were, I preferred to wait until I could see them in cold print before gathering them.

Doubtlessly many of these texts are to be found in the fiction part of the Word Frequencies in English-Language Literature, 1700-1922 data, but I chose a different way. I decided to get the texts from Hathi directly, you see. It was my first time using the research portal. Although I contemplated listing all of the titles and using the batch-search set-building functionality, in the end I added the titles one by one. The original Suvin workset was public, but I accidentally deleted the first 72 files. This was a bit easier to do than would have been ideal, though I have to accept all blame. The remaining data set can be found in a public list, creatively titled “Suvin.”

What happens once you build a workset? There are several choices. Algorithms that run clustering and topic-models on the workset are available, and you can also download the page-level features and MARC records for metadata. Both of these are required to convert the files into word-frequency lists and metadata usable with Ted Underwood’s genre-modeling code. To download the page-level features, you have to ask the system to build you a shell script that uses rsync to transfer the files to your machine. The files currently come as bzip2-compressed json, buried very deeply in often complex directory structures. I recommend using a command [1] similar to this to extract them (assuming you’ve created a directory called “json” to house the results):

find . -name "*.json.bz2" -print -exec cp {} json/ \;

Then run bunzip2 *.bz2 inside that directory (gunzip will not handle bzip2 files). Now we have a directory full of json files, but we need word-frequency lists. Luckily, there is code provided to generate these for you. This python script is designed to work on a single file, but you can easily wrap it in a shell script or use the find solution above to run it on all the json files. I also recommend changing the “results.txt” to a variable based on the filename and appending “.fic.tsv” to it for consistency. Now we copy those files over to the “newdata” directory in Underwood’s repository and remember to delete the lexicon file in “lexicon.”
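The extraction and counting steps can also be sketched in a few lines of Python. This is not the script linked above, just a minimal sketch. It assumes each feature file is bzip2-compressed JSON whose pages sit under features → pages, each with a body.tokenPosCount mapping of token → part-of-speech tag → count. Those field names, the function names, and the decision to lowercase tokens are all assumptions to check against the files you actually download.

```python
import bz2
import json
from collections import Counter
from pathlib import Path


def count_words(volume):
    """Sum body-token counts over every page of one parsed feature file.

    Assumed layout: features -> pages -> body -> tokenPosCount.
    """
    counts = Counter()
    for page in volume.get("features", {}).get("pages", []):
        token_pos = page.get("body", {}).get("tokenPosCount", {})
        for token, pos_counts in token_pos.items():
            # Lowercasing is a simplification, not a requirement of the data.
            counts[token.lower()] += sum(pos_counts.values())
    return counts


def convert_tree(root, outdir):
    """Find every *.json.bz2 under root and write one .fic.tsv per volume."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(root).rglob("*.json.bz2")):
        # bz2.open decompresses on the fly, so no separate bunzip2 pass.
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            counts = count_words(json.load(handle))
        stem = path.name[: -len(".json.bz2")]
        target = outdir / (stem + ".fic.tsv")
        with open(target, "w", encoding="utf-8") as out:
            for token, n in counts.most_common():
                out.write(f"{token}\t{n}\n")
        written.append(target)
    return written
```

Calling convert_tree(".", "newdata") from the top of the rsync'd tree would write one tab-separated file per volume. Because it only matches names ending in “.json.bz2”, it does not try to re-read its own output.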

Metadata, however, remains. Metadata is (are?) always enjoyable, especially when it comes in XML MARC records. Nothing parses like XML! I use XML::Simple for my XML-parsing needs. I followed the logic of the obscure MARC records in this helpful script. A more typical user would have doubtlessly just modified this script, but XML induces a type of crisis in programming-confidence in me, and I had to regress to perl. Properly parsing the dates, as you can see from the aforelinked code, is difficult. You may, depending on the extravagances of your data, need to correct some of the date fields manually.
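For readers who would rather stay in Python, here is a hedged sketch of the same idea using only the standard library. It is not a port of my Perl or of the linked script. It assumes the records use the standard MARCXML namespace, takes the year from positions 07-10 of the 008 control field, and falls back to the first four-digit run in 260 $c. The function names and the output keys are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Standard MARCXML namespace; an assumption about how the records arrive.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield(record, tag, code):
    """Return the text of the first matching datafield/subfield, or ''."""
    node = record.find(
        f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']", MARC_NS
    )
    return (node.text or "").strip() if node is not None else ""


def parse_record(record):
    """Pull author, title, and a best-guess year out of one MARC record."""
    year = ""
    field_008 = record.find("marc:controlfield[@tag='008']", MARC_NS)
    if field_008 is not None and field_008.text:
        candidate = field_008.text[7:11]  # "Date 1" sits at positions 07-10
        if candidate.isdigit():
            year = candidate
    if not year:
        # Fall back to the first four-digit run in the imprint date (260 $c).
        match = re.search(r"\d{4}", subfield(record, "260", "c"))
        year = match.group(0) if match else ""
    return {
        "author": subfield(record, "100", "a"),
        "title": subfield(record, "245", "a"),
        "date": year,
    }
```

Dates such as “18uu” or “[189-?]” will still come back empty or wrong, which is where the manual correction comes in.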

Files from Hathi sometimes include illegal or at least questionable characters, either in the filenames themselves (though maybe they aren’t by definition; who knows) or, more likely, in the document identifiers. You’ll notice that Underwood’s script converts those to legal filenames, so make sure to do that on your own. Filenames containing “$” can also cause trouble on the command line, since bash treats “$” as the start of a variable expansion unless the name is quoted. Also, the rsync commands generated by Hathi have consistently missed the first file in the workset. So there will be issues. Given that only about half [2] of Suvin’s bibliography appears to be present in Hathi, I wasn’t worried about completeness and didn’t mind dropping a handful of files here and there.
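A minimal sketch of that kind of cleanup follows. It assumes the pairtree-style substitutions commonly seen in Hathi filenames, where “:” becomes “+” and “/” becomes “=”. Both function names are hypothetical, and this is not Underwood’s code, so compare the output against the filenames his scripts actually expect.

```python
import shlex


def clean_volume_id(volume_id):
    """Map a Hathi-style identifier to a filesystem-safe name.

    Assumption: ':' -> '+' and '/' -> '=' (pairtree-style cleaning).
    """
    return volume_id.replace(":", "+").replace("/", "=")


def shell_safe(name):
    """Quote a name so characters such as '$' survive a shell command line."""
    return shlex.quote(name)
```

For example, clean_volume_id("uc2.ark:/13960/t4mk66f1d") gives a name with no slashes or colons in it, and shell_safe wraps any name containing “$” in single quotes before it is pasted into a shell command.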

Exciting Results

Once you’ve added the metadata entries to the “finalmeta.csv” file in the “meta” directory, you can then model them against a random selection to see how well Suvin’s genre-designation coheres. I predicted in the last post that Suvin’s positive identifications would cohere better than his negative ones. It just seems intuitive that this would be the case. And it does. The accuracy and F1 scores are around 90% in all of the models I’ve run. (They vary a point or so on each run because of the different qualities of the random texts chosen. Some in the random sample are much more SF-like than others.) To run a model, choose option #12 from “replicate.py” and then choose whatever genre tag you used in your metadata. I chose “realsuvin,” as I had earlier used “suvin” for the negative examples.
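For reference, accuracy and F1 follow the standard definitions, and the sketch below computes both from a list of true labels and logistic scores. It is not taken from “replicate.py”; the function name and the 0.5 cutoff are assumptions made for illustration.

```python
def accuracy_and_f1(labels, scores, threshold=0.5):
    """Return (accuracy, F1) for binary labels and logistic scores.

    labels: 1 for volumes tagged as the genre, 0 for the contrast set.
    scores: the model's probability that each volume belongs to the genre.
    threshold: assumed cutoff; scores at or above it count as predictions.
    """
    tp = fp = tn = fn = 0
    for is_genre, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and is_genre:
            tp += 1
        elif predicted and not is_genre:
            fp += 1
        elif not predicted and is_genre:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1
```

F1 is the harmonic mean of precision and recall, so it drops quickly when the model either misses many of Suvin’s texts or flags many random ones, even if overall accuracy still looks respectable.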

Here is an interactive chart of the Suvin-designated SF (101 volumes) modeled against a random selection:

I never would have guessed that Arthur Scott Bailey’s The Tale of Grumpy Weasel would score so low on the SF-index. Ah, well. Pater and Lafargue remain, in this sample as they were in the last, texts with perhaps surprising SF-ness. The strongest signals from Suvin’s list were:

author                   title                           logistic
Blair, Andrew.           Annals of the twenty-ninth cen  0.90
Cobbe, Frances Power,    The age of science.             0.90
Hermes,                  Another world;                  0.90
Fiske, Amos Kidder,      Beyond the bourn;               0.89
Harting, P.              Anno Domini 2071.               0.88
Dodd, Anna Bowman,       The republic of the future, or  0.87
Hertzka, Theodor,        Freeland :                      0.87
Hinton, Charles Howard.  Scientific romances.            0.87
Greg, Percy,             Across the zodiac:              0.86
Dudgeon, R. E.           Colymbia.                       0.83
Verne, Jules,            From the earth to the moon,     0.83

And the ten weakest:

author                       title                          logistic
Bailey, Arthur Scott,        The tale of Grumpy Weasel      0.08
Abbott, Eleanor Hallowell    The white linen nurse          0.09
Raine, William MacLeod,      Oh: you Tex!                   0.13
Martin, Helen Reimensnyder,  The schoolmaster of Hessville  0.14
Stringer, Arthur,            The prairie omnibus            0.15
Alcott, Louisa May,          Kitty’s class day ;            0.16
Kerr, Doris Boake,           Painted clay                   0.18
Winterburn, Florence Hull    Liberty hall                   0.18
Isham, Frederic Stewart,     Under the rose                 0.20
Hunt, Edward Eyre,           Tales from a famished lan      0.21

Words like “universal,” “vast,” “destroyed,” and, yes, “science” were strongly correlated with SF in this model. The most notable semantic cluster with negative correlation was words associated with domestic description: “diningroom,” for example, but also, in this model, “Europe,” “English,” and, for some reason, “azure.”

What this result shows is that yes, Suvin’s categorization of Victorian SF coheres even when a relatively simple classification algorithm tests it using word frequencies. What would be a better test of Suvin’s critical acumen, perhaps, is to test his designated works of Victorian SF against those he claims are not in fact science fiction.

The accuracy and F1 of Suvin-designated SF trained against Suvin-excluded writing is not as high as the former trained against a random selection, but it’s still pretty high: about 81%. Here is a chart similar to the one above with the two genres represented as a scatterplot: [3]

A quick glance at the scatterplot shows that there aren’t many false positives. The false negatives, books that Suvin did classify as science fiction but that the model scored below 0.5, are somewhat interesting:

author                  title                           logistic
Bellamy, Edward,        Dr. Heidenhoff’s process.       0.32
Miller, Joaquin,        The destruction of Gotham.      0.34
Tincker, Mary Agnes,    San Salvador /                  0.35
O’Brien, Fitz James,    The diamond lens,               0.36
Payn, James,            The eavesdropper;               0.37
Pemberton, Max.         The iron pirate;                0.38
Morris, William,        A dream of John Ball, and A ki  0.39
Cobban, J. Maclaren     Julius Courtney, or, Master of  0.41
Hyne, C. J. Cutcliffe,  The new Eden,                   0.41
Fox, S. M.              Our own Pompeii :               0.42
Nisbet, Hume,           Valdmer the Viking :            0.42
Dixie, Florence,        Gloriana, or, The revolution o  0.42
Collins, Mortimer,      Transmigration.                 0.43
Doyle, Arthur Conan,    The doings of Raffles Haw.      0.44
Hudson, W. H.           A crystal age.                  0.44
Twain, Mark,            A Connecticut Yankee in King A  0.44
Besant, Walter,         The ivory gate.                 0.44

Why is Bellamy’s Dr. Heidenhoff’s Process so much more like the non-SF than the rest? Here’s Suvin’s capsule summary: “Invention prevents erasure of painful memories. All was a dream.” As I mentioned before, his criteria are not always clear to me. As for The Destruction of Gotham: “Dickensian tale with seduced girl and noble reporter [4] who befriends her, issues into alternate history; in the last chapter mob burns down New York City. Well written though sentimental. Marginal.” So, the alternative history is enough, but it’s mostly written in a recognizable tradition of mild satire and light fantasy, two qualities shared by many of the disqualified texts.

Abrupt Conclusion

I remain highly intrigued by the use of machine classifiers to explore genre. Discovery compels the most interest for me at the moment, though the evolution of genre is also fascinating. I could spend a year just reading some of the obscure texts that Suvin noted, and it’s quite tantalizing that I can use a model trained on those to find others like them in some reasonably accurate way. The time signal of Victorian prose is clearly quite strong, and I know that explains at least partially the ease with which Suvin’s SF coheres against the random selection of (some) newer texts. What I may try next is to model the Victorian SF against the contemporary SF from the Chicago collection in Underwood’s repository. Some of this (Tek War) is clearly “sci-fi” and not “SF”; but can that valuable distinction be made by a machine? We’ll see.

  1. One annoyance here is that the find command will start trying to copy files in the “json” directory to themselves. I’m sure there’s an option to cp to fix this, or you could just use a directory above where the *.bz2 files are stored. It will still work, though.
  2. There was a predictable pattern. Works that Suvin classified as SF that projected a near-future in which a Reform Bill passed, or women got to vote, or the Irish achieved independence, were almost all limited-print editions that were never purchased (or were later discarded) by U. S. libraries. Why Suvin classifies these marginal satires as SF is another, more interesting, question.
  3. Note that texts with the same logistic score are not overplotted, so this graph does not contain everything. I know there must be a geom_jitter equivalent in d3, but I didn’t implement it in this graph.
  4. The “noble reporter” bit reminds me that I can never hear this word without imagining Gus from The Wire saying it quizzically to himself. Did I imagine that David Simon said in an interview that he didn’t care for Dickens because he was too rich?