Modeling Darko Suvin's Genres of Science Fiction
You can’t go far in reading about science fiction’s genres without encountering the work of Darko Suvin. His Metamorphoses of Science Fiction is among the most widely cited and influential works in the field. Suvin published a reference work devoted to Victorian SF after that: Victorian Science Fiction in the UK. Two related articles appeared in Science Fiction Studies in 1979 and 1980: “On What Is and Is Not an SF Narration; With a List of 101 Victorian Books That Should Be Excluded from SF Bibliographies” and “Seventy-Four More Victorian Books That Should Be Excluded from Science Fiction Bibliographies”. It is rare, in my experience, for critics to be so categorical in discussing genre. I thought Suvin’s exclusions would make for a useful test case to begin my exploration of some recent attempts to model literary genre quantitatively.
Ted Underwood’s “The Life Cycles of Genres” is a recent example of this work that both contains an analysis of science fiction as a genre and has publicly available code and data that I can use for my own purposes. I want to note how useful releasing the code and data is. I know that it’s not always easy to do. I personally don’t like to release code that I feel is subpar, and almost everyone who writes code non-professionally feels this way to some extent. (And many professionals don’t have to release code, of course.) I don’t mean to suggest that Underwood’s code is subpar. On the contrary, it is admirably documented. I don’t use Python very much, and I was able to follow its often-complex logic without too much difficulty.
Underwood’s article advances many arguments about the development of literary genre. This brief post concerns only a very small point: do Suvin’s excluded books themselves constitute a recognizable genre, if we use Underwood’s methods? I dislike suspense: the answer is yes. I will now discuss how I found the data and adapted Underwood’s code to make this test.
Data Gathering and Cleaning
I printed out Suvin’s two bibliographies of works that some other bibliographers took to be science fiction but which, in his judgment, are not. His capsule summaries are often quite entertaining: witness his description of Marie Corelli’s The Sorrows of Satan: Or, the Strange Experiences of One Geoffrey Tempest, Millionaire: A Romance, which reads: “The narrator, Tempest, meets a sorrowful Satan.” I used the “Word Frequencies in English-Language Literature, 1700-1922” data set, the fiction metadata in particular. I searched for each of Suvin’s titles in that list, found 72 of them,[1] and transferred the tsv files (rename them to *.fic.csv from *.csv first) to the “newdata” folder in Underwood’s repository.[2]
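The title lookup described above was done by hand, but it could plausibly be scripted. What follows is a minimal sketch, not the procedure actually used: the function names are hypothetical, and it assumes the fiction metadata has been read into dictionaries with "htid" and "title" keys (for instance with csv.DictReader). Matching is deliberately loose, comparing only the main title before any colon or semicolon, so the hits still need to be checked by eye.

```python
import re


def normalize_title(title):
    """Lowercase a title, drop any subtitle, and strip punctuation."""
    # Keep only the part before the first colon or semicolon.
    main = re.split(r"[:;]", title, maxsplit=1)[0]
    # Replace everything except ASCII letters, digits and spaces, then squeeze spaces.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", main.lower()).split())


def match_titles(wanted_titles, metadata_rows):
    """Map each wanted title to the htids of metadata rows with the same main title.

    Titles with no match are simply left out of the result.
    """
    index = {}
    for row in metadata_rows:
        index.setdefault(normalize_title(row["title"]), []).append(row["htid"])
    return {
        title: index[normalize_title(title)]
        for title in wanted_titles
        if normalize_title(title) in index
    }
```

A loose match like this will both miss variant titles and pick up unrelated books that share a short title, which is one reason a manual pass remains sensible.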
The next step involves transforming the individual metadata entries for each of the volumes into the format found in finalmeta.csv. I didn’t do anything fancier than manually find and add the entries to a new csv file. I then loaded it into R (I strongly recommend using the read_csv function from readr). These simple commands will perform the requested operation on a data frame named “suvin” (requires dplyr)[3]:
suvin <- suvin %>% select(htid,recordid,oclc,locnum,
   author,imprint,date,enumcron,subjects,title)
suvin <- suvin %>% mutate(gender=NA)
suvin <- suvin %>% mutate(nationality=NA)
suvin <- suvin %>% mutate(birthdate=NA)
suvin <- suvin %>% mutate(genretags="suvin")
suvin <- suvin %>% mutate(firstpub=date)
suvin <- suvin %>% select(htid,recordid,oclc,locnum,
   author,imprint, date,birthdate,firstpub,enumcron,
   subjects,title,nationality,gender,genretags)
suvin <- suvin %>% rename(docid=htid)
Now write the file with write_csv and append it to “finalmeta.csv”. Or load the latter file and append it with R. There are so many pathways to enlightenment. You may note with alarm that I have cavalierly disregarded several important categories and erased the distinction between date of first publication and the date of publication listed in the Hathi metadata. I am aware that these omissions are not ideal, but the metadata categories that I’ve set to NA are not used in the modeling process, though they can be used to filter the models. It wouldn’t be too much extra work to fill those in, and I just realized that the context of Suvin’s bibliographic work means that many of them have nationality of “uk.”
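For anyone who would rather stay in Python for this step, here is a rough equivalent of the R commands above, using only the standard library. It is a sketch under assumptions: the column order is copied from the select() call above and should be checked against the header of your own finalmeta.csv, the helper names are hypothetical, and missing values are written as the literal string "NA", as readr's write_csv does by default.

```python
import csv
import io

# Column order taken from the R select() call above; verify it against finalmeta.csv.
FINALMETA_COLUMNS = [
    "docid", "recordid", "oclc", "locnum", "author", "imprint", "date",
    "birthdate", "firstpub", "enumcron", "subjects", "title",
    "nationality", "gender", "genretags",
]


def to_finalmeta_row(row, tag="suvin"):
    """Reshape one Hathi metadata row (a dict keyed by column name)."""
    out = {column: row.get(column, "") for column in FINALMETA_COLUMNS}
    out["docid"] = row["htid"]      # htid is renamed to docid
    out["firstpub"] = row["date"]   # same shortcut as in the R code: firstpub = date
    out["genretags"] = tag
    for unknown in ("gender", "nationality", "birthdate"):
        out[unknown] = "NA"         # not filled in, as in the R code
    return out


def finalmeta_lines(rows):
    """Render rows as header-less CSV text, ready to append to finalmeta.csv."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FINALMETA_COLUMNS, lineterminator="\n")
    for row in rows:
        writer.writerow(to_finalmeta_row(row))
    return buffer.getvalue()
```

The returned text can be appended to a copy of finalmeta.csv opened in append mode; working on a copy keeps the original metadata intact if something goes wrong.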
Minor Code Tweaking
Once we have prepared the metadata, we need to modify replicate.py so that we can model the newly added Suvin material. Modifying this particular file is not necessary, but I like to just add an additional option to the menu. Doing so is as simple as adding:
print('  16) Model Suvin.')
And
elif userchoice == 16:
        tagstomodel = ['suvin']
        modelname = 'SUVIN'
        allvolumes = model_taglist(tagstomodel, modelname)
        print('Results are in allvolumes.')
We’re almost ready to run this code. You’ll need Python 3 installed on your system, along with the numpy, pandas, scikit-learn, and matplotlib modules. Installation methods vary for those, and I’m certainly not the person to ask.[4] You’ll also need to delete the existing lexicon file. The code will regenerate it and include the newly added files.
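Before launching replicate.py, it may save some frustration to confirm that the modules listed above can actually be found. This is a small sketch using only the standard library; the function name is hypothetical, and the one detail worth knowing is that scikit-learn is imported under the name "sklearn".

```python
import importlib.util

# Import names for the packages mentioned above (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib"]


def missing_modules(names=REQUIRED):
    """Return the module names from `names` that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules() with no arguments returns an empty list when everything is in place, and otherwise names whatever still needs installing.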
The Classification Method
Before running the code, you may be wondering: just what is this going to do? How can a computer detect the coherence of such a complex cultural concept as a literary genre from word-frequency lists? Underwood’s article does a good job of addressing these questions. I’m also mindful of the idea that going into too much detail about methods is less important than sharing and interpreting the results.[5] In short, it is a classification method. Andrew Goldstone provides a rational and detailed explanation of the classifier and its assumptions in his replication project of Underwood and Jordan Sellers’s earlier work on predicting reviewed works of poetry. The philosophical issues that arise when thinking about what might fairly be called the unreasonable effectiveness of these methods are deep and fascinating. I do not know what I think about the statistical analysis of culture or language. I have sympathy for the frequently caricatured argument by Noam Chomsky that corpus linguistics (which is not the same thing exactly as applying a classifier to word-frequency lists) can detect many exquisite patterns that are meaningless without the context of explanatory theory.[6] But this brief note is not the place to elaborate on these questions.
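For readers who want something concrete to hold on to, here is a toy illustration of the general idea: turn each text into relative word frequencies, then fit a regularized logistic regression that outputs a probability of belonging to the tagged group. This is emphatically not Underwood’s pipeline, which has its own feature selection, sampling and cross-validation; every function name, the tiny vocabulary, and the plain gradient-descent fitting are assumptions made for the sake of a self-contained sketch.

```python
import math
from collections import Counter


def relative_frequencies(text, vocabulary):
    """Represent a text as the relative frequency of each vocabulary word."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on an empty text
    return [counts[word] / total for word in vocabulary]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability that a feature vector belongs to class 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic(X, y, l2=0.01, rate=1.0, epochs=2000):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_probability(features, weights, bias) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Average log-loss gradient plus the L2 penalty term on the weights.
        weights = [w - rate * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias
```

A volume whose predicted probability is above .5 is assigned to the tagged group, which is the threshold that the false positives and false negatives discussed below are measured against.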
Suvin’s Exiles
Before running replicate.py the first time, make sure that you create a “results” directory in the parent directory. As the program begins, it will select 72 random volumes from the collection to train against. As there is a degree of randomness in the process, the results will not be the same from trial to trial. Through my various experiments with the same 72 Suvin-exiled volumes, I’ve seen a range from .77 to .80 in both F1 and accuracy. This range is generally not as high as Underwood reports for the various genres surveyed. I find that consistent with my intuition, as a negative claim about a group of books’ generic status should not necessarily mean that they share anything other than being misclassified by an earlier bibliographer (in Suvin’s opinion).
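Since accuracy and F1 carry the argument here, it may help to see how both are computed from the real classes and the model’s probabilities. The sketch below uses a hypothetical function name and the .5 threshold; it follows the standard definitions and is not taken from Underwood’s code.

```python
def classification_scores(real_classes, probabilities, threshold=0.5):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(real_classes, predicted))
    tp = sum(1 for real, pred in pairs if real == 1 and pred == 1)
    fp = sum(1 for real, pred in pairs if real == 0 and pred == 1)
    fn = sum(1 for real, pred in pairs if real == 1 and pred == 0)
    tn = sum(1 for real, pred in pairs if real == 0 and pred == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

As a worked check, assuming a run compares 72 exiles with 72 random volumes, fifteen false positives and sixteen false negatives give an accuracy of 113/144, about .785, and an F1 of 112/143, about .783, both inside the range reported above.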
There are also false positives (random works whose probability is above .5) and false negatives (Suvin’s exiles below .5). In my last run of the model, there were fifteen of the former and sixteen of the latter. I was intrigued by these books. What, in the random sampling of volumes from the HathiTrust, would be classified as the most similar to Suvin’s heterogeneous list of Victorian un-SF? In order to find this out, read the output from the “results” folder into R. It will be titled “SUVIN2016-09-30.csv,” with your date substituted. To see the false positives, assuming your data frame is called “results,” try:
results %>% filter(realclass==0 & logistic > .5) %>%
   select(author,title,logistic) %>% 
   arrange(-logistic) %>% print(n=5)

| author | title | logistic |
|---|---|---|
| Pater, Walter, | Marius the Epicurean | 0.6411802 | 
| Bierce, Ambrose, | The monk and the hangman' | 0.6019957 | 
| Lafargue, Paul, | The sale of an appetite | 0.5720375 | 
| Cowan, James, | The adventures of Kimble | 0.5657525 | 
| Lummis, Charles Fletcher, | The gold fish of Gran Chi | 0.5512368 | 
Pater and Bierce, I’m pleased to say, I recognize and can see the general consistency between these odd works and Suvin’s outcasts. Paul Lafargue’s The Sale of An Appetite I must admit I do not know, though the author of The Right to be Lazy is clearly a kindred spirit. As for The Adventures of Kimble Bent, I’ll quote an anonymous Wikipedia editor: “it was something of sensation at the time.” I’m also impressed by what I just learned about Charles Fletcher Lummis, though I haven’t (yet!) read The Gold Fish of Gran Chimu.
What about the false negatives? A simple adjustment to the code snippet above reveals:
results %>% filter(realclass==1 & logistic < .5) %>%
   select(author,title,logistic) %>% 
   arrange(-logistic) %>% print(n=5)

| author | title | logistic |
|---|---|---|
| Diehl, Alice (Mangold) | Dr. Paull’s theory; | 0.4988972 | 
| Adderley, James Granville, | Stephen Remarx; | 0.4968556 | 
| Frith, Walter. | The sack of Monte Carlo | 0.4945157 | 
| Brailsford, Henry Noel, | The broom of the war-god | 0.4921997 | 
| Collins, Wilkie, | The moonstone | 0.4918188 | 
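The same two filters can be run without R. This sketch reads the results file with the standard library and lists the misclassified volumes, highest probability first. The column names (realclass, logistic, author, title) are the ones used in the R snippets above; the function names are hypothetical.

```python
import csv


def load_results(path):
    """Read a results csv into a list of dicts, one per volume."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def misclassified(rows, realclass, n=5, threshold=0.5):
    """List up to n misclassified volumes of the given real class.

    realclass=0 gives false positives (random volumes scored above the threshold);
    realclass=1 gives false negatives (tagged volumes scored below it).
    """
    hits = []
    for row in rows:
        probability = float(row["logistic"])
        # int(float(...)) tolerates classes written as "1" or "1.0".
        if int(float(row["realclass"])) != realclass:
            continue
        wrong = probability > threshold if realclass == 0 else probability < threshold
        if wrong:
            hits.append((row["author"], row["title"], probability))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits[:n]
```

With the rows loaded through load_results, misclassified(rows, 0) corresponds to the first table above and misclassified(rows, 1) to the second.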
What does Suvin say about The Moonstone?
“The only novelties are the mysterious Brahmins and the assumption that sleepwalking under the influence of opium leads to a reenactment of past behavior. I do not know why Bailey[7] calls this assumption a ‘relatively scientific theory’; it is not such, and even if it were, that would not make the whole narration SF.”
Interested readers may consult the rest of Suvin’s articles for details on the other texts. I admit to being intrigued by The Sack of Monte Carlo for my own personal research reasons. Perhaps I’m not reading everything, but I don’t remember much general writing about the pleasures of discovery, particularly aleatory discovery, when using computational methods. When I built a topic-browser of fiction from HathiTrust, I was often amazed by the stupendous variety of unexpected texts among the “great unread.” I love obscurities, novelties, and the justly forgotten. If nothing else, these methods offer unprecedented opportunities for discovery.
Here’s an interactive visualization of the Suvin exclusions. “True” indicates those that Suvin claimed were not science fiction; “false” are the random selections that the model tested against. Mouse over the dots to see the names of the books:
And here’s a chart of Underwood’s science fiction collection:
It’s immediately notable that the overall scores of Underwood’s larger SF collection are higher. I suspect that if I were able to gather all 175 of Suvin’s exclusions, they would have some higher scores as well. Again, conceptually, it is not surprising that books only considered science fiction by overzealous bibliographers would not have as strong a generic coherence as mostly unambiguous examples.
Future Directions
The logical thing to do would be to model Suvin’s exclusions against an equal number of texts that his bibliography does classify as science fiction. I suspect that many of those texts are already included in the Underwood model, so the data-gathering wouldn’t be too difficult. Another thing to try would be experimenting with different classifiers. I would also like to examine the words most closely associated and negatively associated with each classification, though I am aware that these are easily overinterpreted. I no longer have comments enabled, but please send me an email or contact me on Twitter if you have any questions about this project.
Notes
1. Why only 72, you ask? That’s a good question. I know that there are more than 72 books on Suvin’s list in the HathiTrust collection. I’m reasonably sure, however, that only 72 are in the fiction collection of the “Word Frequencies in English-Language Literature” data set. If I had searched the catalog for the others, I would have needed to collect the remaining volumes separately, which would have delayed the project. I think that 72 is a reasonable number for the process, though more would be ideal.
2. I recommend cloning his repository first, if you’re going to replicate or extend this work. My modifications are very slight, so I’m not going to upload my cloned repository. Instead I’ll describe the very minor changes I made here.
3. These simple tasks could easily be performed in base R, of course. I understand that some hipsters are now called “basers” for their refusal to use the Hadleyverse packages. I’m regrettably too dependent on them to be influenced by this trend.
4. I came very close to borrowing Andrew Goldstone’s code to reimplement the entire procedure in R, just because I’m more comfortable with it. I still may do this, just to see if the results vary in any noticeable way.
5. I don’t mean to imply that I’m capable of going into great detail about logistic regression. I’m currently working on a book that’s partially about the history of statistics. It’s complicated stuff!
6. Here is one typical example of his argument: “In linguistics we all know that the kind of phenomena that we inquire about are often exotic. They are phenomena that almost never occur. In fact, those are the most interesting phenomena, because they lead you directly to fundamental principles. You could look at data forever, and you’d never figure out the laws, the rules, that are structure dependent. Let alone figure out why. And somehow that’s missed by the Silicon Valley approach of just studying masses of data and hoping something will come out. It doesn’t work in the sciences, and it doesn’t work here.”
7. J. O. Bailey, Pilgrims Through Space and Time: Trends and Patterns in Scientific and Utopian Fiction. Greenwood Press, 1972 [1947].