The Stronghold of Bioinformatics

No one likes gamification or MOOCs, as far as I can tell. What I should say is that anyone trained in the hermeneutics of suspicion might find it hard even to accept their existence. It’s hard to come up with a hypothetical concept that would cry more piteously to the heavens for critique. True to form, until a few weeks ago I had never earned a badge in my life and would have regarded the prospect of doing so with contempt and a touch of pity for whoever was naive enough to suggest it.

Topics in Theory

After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.

I downloaded all of the articles (word-frequency data for each article, that is) in New Literary History, Critical Inquiry, boundary 2, Diacritics, Cultural Critique, and Social Text. I then fitted a model with one hundred topics. I had to adjust the stop-word list to account for common words and, unsuccessfully, for words in other languages. What I should have done was use the supplied stop-word lists in those languages as well. At least this way there is a chance that interesting words in those languages will cluster together.

Same Stuff, Different Graph

When I started experimenting with graphing changes in topic-proportions over time, I didn’t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy to adjust using ggplot2’s many parameters.

It wasn’t. It didn’t take me too long to figure out that I needed to change the data from discrete to continuous in order to see anything like a sparkline, but it was also apparent from the other data sets I was working with that taking the mean at intervals was the only way to make a reasonably clean graph. I ended up using the aggregate function to create the n-year averages, though I read some intriguing descriptions of the power of data.tables in R. (I refuse to ask for help on Stack Overflow, even though it would have saved many hours’ worth of work. Character flaw.)
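The n-year averaging step can be sketched roughly as follows. This is a Python illustration rather than the R aggregate call I actually used, and the function name, the five-year bin width, and the sample proportions are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def n_year_means(observations, n=5):
    """Average (year, value) pairs within consecutive n-year bins.

    Each bin is labeled by its first year, e.g. 1990 covers 1990-1994
    when n is 5. This is a hypothetical helper, not the original R code.
    """
    bins = defaultdict(list)
    for year, value in observations:
        bin_start = year - (year % n)
        bins[bin_start].append(value)
    # Return the bins in chronological order.
    return {start: mean(values) for start, values in sorted(bins.items())}


# Invented proportions of a single topic, one value per article.
sample = [(1990, 0.10), (1991, 0.20), (1994, 0.30), (1995, 0.40), (1998, 0.60)]
averaged = n_year_means(sample, n=5)

for start, value in averaged.items():
    print(f"{start}: {value:.2f}")
```

Plotting one averaged value per bin instead of one point per article is what makes the line readable; a wider bin smooths more but hides short-lived spikes.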

Visualizing Topics in ELH

I was impressed with Ian Milligan’s visualizations of Canadian parliamentary debates, and I wanted to try to visualize some of the topic models I’ve been creating from JSTOR’s Data for Research.

I thought ELH would be an interesting journal to try, as each issue publishes articles on quite a range of literary periods, often from medieval to twentieth-century material. I assumed that LDA would be likely to identify each of these periods as a topic. To test this, I downloaded the entire set of articles from JSTOR and created a fifty-topic model. From there, I wanted to chart the proportion of each topic in each document. I was able to import the data into R and use ggplot2 to create the following graph:

Experimenting with Dynamic Topic Models

When I first began reading about topic modeling, I very much wanted to experiment with “dynamic” topic modeling, or the tracking of changes in topics over time. David Blei and John Lafferty describe their algorithm in this paper. They also have made a dynamic topic model browser of Science available. I was very impressed with this project and wanted to apply the technique to journals in the humanities using JSTOR’s Data for Research (DfR).

Creating Topic Models with JSTOR's Data for Research (DfR)

Here are some instructions for creating the same types of topic models of JSTOR’s journals that I did with Critical Inquiry and Signs.

These instructions are designed for someone using a Mac or Linux platform. (The differences below between using Linux and a Mac should be apparent to anyone who uses Linux, so I’m not going to indicate them here; it’s mainly where files are stored.) All of this should work on Windows, but you’ll need to install Cygwin or use alternate shell commands. MALLET has slightly different installation instructions for the Windows platform as well, I believe.

Two Critical Inquiry Topic Models

Here are two topic models of Critical Inquiry, generated with the same algorithm but different implementations (MALLET and the R topicmodels package) and slightly different stop-word lists; the latter was also generated with a minimum word frequency of seven:

0 black music musical white african jazz sound performance american racial negro song cultural sounds race rap cage singer composer
1 meaning theory interpretation question philosophy language point claim philosophical sense truth fact argument knowledge intention metaphor text account speech
2 american duke james john trans culture life william things modern cambridge michael david robert soviet shame henry york objects
3 trans time question subject derrida language place order object relation word thing reading moment longer things work thought writing
4 history historical narrative discourse account contemporary terms status context social ways relation discussion essay sense form representation specific position
5 god christian history religious greek ancient modern tradition divine century body early philosophy latin nature religion medieval church soul
6 film cinema films camera screen images frame image movie theater shot early visual cinematic narrative kiss hollywood scene documentary
7 science scientific human knowledge media theory sciences natural studies life social history technology communication machine humanities disciplines system psychology
8 body time game process space affect play form motion hand ways level attention bodies turn making figure physical parts
9 political social politics cultural power culture theory society ideology critique intellectual ideological state economic class liberal struggle revolution marx
10 law legal public case justice trial political war court state violence rights states moral crime speech abuse slave united
11 art painting work visual image picture artist images fig paintings works artists artistic photography aesthetic museum photograph objects object
12 form work time terms art nature individual structure order reality analysis general style works experience concept process elements theory
13 literary literature criticism text reading book work critics texts writing reader fiction language english author readers works read critic
14 life human moral good man sense great experience work fact kind find personal idea mind character people view social
15 time years people day great long house young called read book wrote man word times year men left english
16 story love death man dead face life eyes point scene narrative moment long real james stories heart narrator characters
17 italian di del fig della il fascist spanish inca ii italy autumn giovanni st saint che text building verdi
18 german history benjamin trans historical freud von germany art modern das memory panofsky essay berlin early hegel walter war
19 public war time national city american education work social economic space people urban culture corporate building united market business
20 poetry poem poet language poems poetic poets english lines literary lyric word romantic text verse poetics prose pound milton
21 women sexual female woman male feminist desire sex men sexuality mother gender freud identity body psychoanalytic child psychoanalysis feminine
22 jewish jews israel israeli palestinian state arab jew religious land people political religion identity palestinians muslim islamic rabbi al
23 french en france title sur qui dans paris une paul text ne foucault letter est man jean derrida au
24 cultural european culture colonial western american national chinese african indian native english identity white british racial race south africa

Topic Modeling Signs

Natalia Cecire tweeted during the topic-modeling workshop that she was momentarily excited by thinking that a presentation on the journal Science was on Signs: Journal of Women in Culture and Society. As it turns out, I have been experimenting with creating topic models from JSTOR’s Data for Research, and I decided to see what the Signs corpus would come up with.

I downloaded word-frequency data for all the issues of the journal. I then used a script to convert the CSV files into *.txt files with the word frequencies duplicated (basically the same approach described by Andrew Goldstone here. UPDATE: Andrew wrote some more extensive code for this task.) I then used text2ldac, a Python script, to convert the text files into a sparse matrix readable by the LDA algorithms.
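For anyone wondering what "word frequencies duplicated" means in practice, here is a minimal Python sketch of the idea, not the script I used. It assumes each CSV has a header row followed by word,count rows; the column names and the sample data are hypothetical, so check the actual layout of the downloaded files first.

```python
import csv
import io


def expand_counts(csv_text):
    """Turn word,count rows into a bag-of-words string.

    A row such as "signs,2" becomes "signs signs", so that tools
    expecting running text see each word as often as it occurred.
    Assumes the first row is a header (an assumption, not verified).
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the assumed header row
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        word, count = row[0], int(row[1])
        words.extend([word] * count)
    return " ".join(words)


# Invented sample standing in for one article's word-frequency file.
sample_csv = "word,count\nthe,3\nsigns,2\nfeminist,1\n"
expanded = expand_counts(sample_csv)
print(expanded)
```

Word order is lost either way, which does not matter for LDA, since it treats each document as a bag of words. In real use the function would be applied to each CSV file and the result written to a matching *.txt file.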

Errol Morris’s A Wilderness of Error: The Trials of Jeffrey MacDonald

I am from North Carolina. I’m quite familiar with the eastern part of the state, having lived there off and on for almost a quarter-century. Nothing surprised me more in this unusual book than learning there was apparently a thriving “hippie” scene in Fayetteville in 1970. It seems unimaginable from what I experienced, but the dynamic created by military personnel returning from SE Asia, heroin, etc. was quite different from anything I remember. Anyway, while I was familiar with the broad outlines of the Jeffrey MacDonald case, I had never read any of the books about it (or seen the mini-series or any of the other documentaries). It’s an intrinsically fascinating story, and Errol Morris is in many ways an ideal author to explore it. Morris has been a philosophy of science student and a private investigator in addition to being the documentary filmmaker responsible for freeing an innocent man from a Texas prison, among other provocations. In particular, Morris is fascinated with epistemology and what he describes as “postmodernist” attacks upon it. He has written amusingly about his encounters with Thomas Kuhn in this regard, and his interest in this case was certainly furthered by Janet Malcolm’s The Journalist and the Murderer, a book about the Joe McGinniss/Jeffrey MacDonald relationship that Morris thinks argues that the truth of the case is either essentially or practically unknowable. Morris rejects such an attitude with the entirety of his being, it seems. I find his objections either to be overstated or grounded in philosophical presuppositions that I don’t share, but it’s a witty and bracing attitude all the same.

LibraryThing Ownership Relative to Total Sales

The Guardian recently posted some sales data of the Booker Prize winners. I thought it would be interesting to compare those figures with LibraryThing ownership to see how reliable that latter figure might be in determining a book’s total sales. Across the winners, the median ratio of LibraryThing owners to sales was 2.77% and the mean 3.88%.

The table is below, not very well-formatted I’m afraid.

Year | Author | Title | Publisher | Sales | LibraryThing owners | Owners / sales
1969 | PH Newby | Something To Answer For | Faber & Faber | 421 | 64 | 15.20%
1970 | Bernice Rubens | The Elected Member | Eyre & Spottiswoode | 3,901 | 133 | 3.41%
1971 | VS Naipaul | In a Free State | Deutsch | 13,533 | 532 | 3.93%
1972 | John Berger | G | Weidenfeld & Nicolson | 3,863 | 434 | 11.23%
1973 | JG Farrell | The Siege of Krishnapur | Weidenfeld & Nicolson | 50,246 | 1097 | 2.18%
1974 | Stanley Middleton | Holiday | Hutchinson | 1,463 | 55 | 3.76%
1974 | Nadine Gordimer | The Conservationist | Jonathan Cape | 11,282 | 387 | 3.43%
1975 | Ruth Prawer Jhabvala | Heat and Dust | John Murray | 12,729 | 684 | 5.37%
1976 | David Storey | Saville | Jonathan Cape | 4,224 | 112 | 2.65%
1977 | Paul Scott | Staying On | William Heinemann | 19,105 | 452 | 2.37%
1978 | Iris Murdoch | The Sea, The Sea | Chatto & Windus | 94,986 | 1918 | 2.02%
1979 | Penelope Fitzgerald | Offshore | Collins | 15,638 | 550 | 3.52%
1980 | William Golding | Rites of Passage | Faber & Faber | 10,888 | 600 | 5.51%
1981 | Salman Rushdie | Midnight's Children | Jonathan Cape | 201,629 | 8251 | 4.09%
1982 | Thomas Keneally | Schindler's Ark | Hodder & Stoughton | 43,498 | 3898 | 8.96%
1983 | JM Coetzee | Life & Times of Michael K | Secker & Warburg | 30,838 | 1662 | 5.39%
1984 | Anita Brookner | Hotel du Lac | Jonathan Cape | 21,766 | 1492 | 6.85%
1985 | Keri Hulme | The Bone People | Hodder & Stoughton | 27,311 | 2249 | 8.23%
1986 | Kingsley Amis | The Old Devils | Hutchinson | 13,875 | 711 | 5.12%
1987 | Penelope Lively | Moon Tiger | Deutsch | 25,287 | 1070 | 4.23%
1988 | Peter Carey | Oscar and Lucinda | Faber & Faber | 65,858 | 2596 | 3.94%
1989 | Kazuo Ishiguro | The Remains of the Day | Faber & Faber | 178,753 | 8078 | 4.52%
1990 | AS Byatt | Possession | Chatto & Windus | 92,766 | 8549 | 9.22%
1991 | Ben Okri | The Famished Road | Jonathan Cape | 47,996 | 1295 | 2.70%
1992 | Barry Unsworth | Sacred Hunger | Hamish Hamilton | 14,978 | 892 | 5.96%
1992 | Michael Ondaatje | The English Patient | Bloomsbury | 94,391 | 7085 | 7.51%
1993 | Roddy Doyle | Paddy Clarke Ha Ha Ha | Secker & Warburg | 95,090 | 2707 | 2.85%
1994 | James Kelman | How Late It Was, How Late | Secker & Warburg | 12,506 | 813 | 6.50%
1995 | Pat Barker | The Ghost Road | Viking | 92,080 | 1718 | 1.87%
1996 | Graham Swift | Last Orders | Picador | 66,643 | 1784 | 2.68%
1997 | Arundhati Roy | The God of Small Things | Flamingo | 596,847 | 11905 | 1.99%
1998 | Ian McEwan | Amsterdam | Jonathan Cape | 306,579 | 4866 | 1.59%
1999 | JM Coetzee | Disgrace | Secker & Warburg | 257,218 | 6724 | 2.61%
2000 | Margaret Atwood | The Blind Assassin | Bloomsbury | 508,945 | 10311 | 2.03%
2001 | Peter Carey | True History of the Kelly Gang | Faber & Faber | 260,971 | 2632 | 1.01%
2002 | Yann Martel | Life of Pi | Canongate | 1,318,508 | 27429 | 2.08%
2003 | DBC Pierre | Vernon God Little | Faber & Faber | 364,949 | 3562 | 0.98%
2004 | Alan Hollinghurst | The Line of Beauty | Picador | 242,146 | 2981 | 1.23%
2005 | John Banville | The Sea | Picador | 199,275 | 3320 | 1.67%
2006 | Kiran Desai | The Inheritance of Loss | Hamish Hamilton | 184,441 | 4680 | 2.54%
2007 | Anne Enright | The Gathering | Jonathan Cape | 225,425 | 2531 | 1.12%
2008 | Aravind Adiga | The White Tiger | Atlantic | 556,764 | 5534 | 0.99%
2009 | Hilary Mantel | Wolf Hall | Fourth Estate | 630,869 | 4810 | 0.76%
1970 "Lost Booker" | JG Farrell | Troubles | Phoenix | 43,430 | 610 | 1.40%
2010 | Howard Jacobson | The Finkler Question | Bloomsbury | 285,531 | 1329 | 0.47%
2011 | Julian Barnes | The Sense of an Ending | Jonathan Cape | 285,421 | 2341 | 0.82%
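
The percentage column is LibraryThing owners divided by sales. Here is a small Python sketch of that arithmetic using only the first three rows of the table; the function and variable names are mine, and the summary it prints covers just those three rows, not the full list.

```python
from statistics import median


def ownership_pct(sales, owners):
    """LibraryThing owners as a percentage of sales, to two decimals."""
    return round(100 * owners / sales, 2)


# (title, sales, LibraryThing owners) copied from the first three rows.
rows = [
    ("Something To Answer For", 421, 64),
    ("The Elected Member", 3901, 133),
    ("In a Free State", 13533, 532),
]

percentages = [ownership_pct(sales, owners) for _, sales, owners in rows]
sample_median = median(percentages)  # median of these three rows only

for (title, _, _), pct in zip(rows, percentages):
    print(f"{title}: {pct:.2f}%")
```

Running the same function over every row and taking the median and mean of the results is how the summary figures would be obtained.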