No one likes gamification or MOOCs, as far as I can tell. What I should
say is that anyone trained in the hermeneutics of suspicion might even
find it hard to accept their existence. It’s hard to come up with a
hypothetical concept that would cry more piteously to the heavens for
critique, for example. True to form, until a few weeks ago I had never
earned a badge in my life and would have regarded the prospect of doing
so with contempt and a touch of pity for whoever was naive enough to
suggest it.
After experimenting with topic models
of Critical Inquiry, I thought it would be interesting to collect
several of the theoretical journals that JSTOR has in their collection
and run the model on a bigger collection with more topics to see how the
algorithm would chart developments in theory.
I downloaded all of the articles (word-frequency data for each article,
that is) in New Literary History, Critical Inquiry, boundary 2,
Diacritics, Cultural Critique, and Social Text. I then ran a model
fitted to one hundred topics. I had to adjust the stop-word list to
account for common words and, unsuccessfully, for words in other
languages. What I should have done was use the supplied stop-word lists
in those languages as well. At least this way there is a chance that
interesting words in those languages will cluster together.
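The stop-word fix described above can be sketched in a few lines of Python. This is only an illustration: the function names `build_stopwords` and `remove_stopwords` are hypothetical, and the three tiny word lists stand in for the full per-language lists that would normally be read from files.

```python
def build_stopwords(*word_lists):
    """Merge several stop-word lists into one lowercase set."""
    merged = set()
    for words in word_lists:
        merged.update(w.strip().lower() for w in words if w.strip())
    return merged


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]


# Tiny illustrative lists (assumptions); real lists would be much longer.
english = ["the", "of", "and"]
french = ["le", "la", "de", "qui", "dans"]
german = ["der", "das", "von"]

stopwords = build_stopwords(english, french, german)
tokens = ["the", "trace", "de", "derrida", "dans", "das", "kapital"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Merging the lists into one set keeps the filtering step identical no matter how many languages turn up in the corpus.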
When I started experimenting with
graphing changes in topic-proportions over time, I didn’t pay much
attention to the design of the graph. I could see that it was far too
busy, but I assumed that this would be relatively easy to adjust using
ggplot2’s many parameters.
It wasn’t. It didn’t take me too long to figure out that I needed to
change the data from discrete to continuous in order to see anything
like a sparkline, but it was also apparent from the other data sets I
was working with that taking the mean at intervals was the only way to
make a reasonably clean graph. I ended up using the aggregate function
to create the n-year averages, though I read some intriguing
descriptions of the power of data.table in R. (I refuse to ask for help
on Stack Overflow, even though it would have saved many hours’ worth of
work. Character flaw.)
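The averaging step can be sketched as follows. This is a Python analogue of the idea, not the R aggregate call used for the graphs; the function name `n_year_means`, the choice of bins aligned to multiples of n, and the sample numbers are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def n_year_means(rows, n=5):
    """Average one topic's proportions within n-year bins.

    rows: iterable of (year, proportion) pairs.
    Returns a list of (bin_start_year, mean_proportion), sorted by year.
    """
    bins = defaultdict(list)
    for year, proportion in rows:
        bin_start = year - (year % n)  # e.g. 1994 falls in the 1990 bin when n=5
        bins[bin_start].append(proportion)
    return [(start, mean(values)) for start, values in sorted(bins.items())]


# Made-up proportions for a single topic, one value per document year.
rows = [(1990, 0.10), (1991, 0.20), (1994, 0.30), (1995, 0.40), (1999, 0.60)]
averaged = n_year_means(rows, n=5)
print(averaged)
```

Plotting the binned means instead of every document gives one point per interval, which is what makes a sparkline-style line readable.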
I was impressed with Ian Milligan’s visualizations of Canadian
parliamentary debates, and I wanted to try to visualize some of the
topic models I’ve been creating from JSTOR’s Data for Research.
ELH I thought would be an interesting journal to try, as it publishes
articles in each issue on quite a range of literary periods, often
ranging from medieval to twentieth-century material. I assumed that LDA
would be likely to identify each of these periods as a topic. To test
this, I downloaded the entire set of articles from JSTOR and created a
fifty-topic model. From there, I wanted to chart the proportion of each
topic in each document. I was able to import the data into R and use
ggplot2 to create the following graph:
When I first began reading about topic modeling, I very much wanted to
experiment with “dynamic” topic modeling, or the tracking of changes in
topics over time. David Blei and John Lafferty describe their algorithm
in this paper. They have also made a dynamic topic model browser of
Science available. I was very
impressed with this project and wanted to apply the technique to
journals in the humanities using JSTOR’s Data for Research (DfR).
Here are some instructions for creating the same types of topic models
of JSTOR’s journals that I did with Critical Inquiry and Signs.
These instructions are designed for someone using a Mac or Linux
platform. (The differences below between using Linux and a Mac should be
apparent to anyone who uses Linux, so I’m not going to indicate them
here; it’s mainly where files are stored.) All of this should work on
Windows, but you’ll need to install Cygwin or
use alternate shell commands. MALLET has slightly different installation
instructions for the Windows platform as well, I believe.
Here are two topic models of Critical Inquiry, generated with the same
algorithm but different implementations: MALLET and the R topicmodels
package. The two runs used slightly different stopword lists, and the
topicmodels run was also generated with a minimum word frequency of
seven:
0 black music musical white african jazz sound performance american racial negro song cultural sounds race rap cage singer composer
1 meaning theory interpretation question philosophy language point claim philosophical sense truth fact argument knowledge intention metaphor text account speech
2 american duke james john trans culture life william things modern cambridge michael david robert soviet shame henry york objects
3 trans time question subject derrida language place order object relation word thing reading moment longer things work thought writing
4 history historical narrative discourse account contemporary terms status context social ways relation discussion essay sense form representation specific position
5 god christian history religious greek ancient modern tradition divine century body early philosophy latin nature religion medieval church soul
6 film cinema films camera screen images frame image movie theater shot early visual cinematic narrative kiss hollywood scene documentary
7 science scientific human knowledge media theory sciences natural studies life social history technology communication machine humanities disciplines system psychology
8 body time game process space affect play form motion hand ways level attention bodies turn making figure physical parts
9 political social politics cultural power culture theory society ideology critique intellectual ideological state economic class liberal struggle revolution marx
10 law legal public case justice trial political war court state violence rights states moral crime speech abuse slave united
11 art painting work visual image picture artist images fig paintings works artists artistic photography aesthetic museum photograph objects object
12 form work time terms art nature individual structure order reality analysis general style works experience concept process elements theory
13 literary literature criticism text reading book work critics texts writing reader fiction language english author readers works read critic
14 life human moral good man sense great experience work fact kind find personal idea mind character people view social
15 time years people day great long house young called read book wrote man word times year men left english
16 story love death man dead face life eyes point scene narrative moment long real james stories heart narrator characters
17 italian di del fig della il fascist spanish inca ii italy autumn giovanni st saint che text building verdi
18 german history benjamin trans historical freud von germany art modern das memory panofsky essay berlin early hegel walter war
19 public war time national city american education work social economic space people urban culture corporate building united market business
20 poetry poem poet language poems poetic poets english lines literary lyric word romantic text verse poetics prose pound milton
21 women sexual female woman male feminist desire sex men sexuality mother gender freud identity body psychoanalytic child psychoanalysis feminine
22 jewish jews israel israeli palestinian state arab jew religious land people political religion identity palestinians muslim islamic rabbi al
23 french en france title sur qui dans paris une paul text ne foucault letter est man jean derrida au
24 cultural european culture colonial western american national chinese african indian native english identity white british racial race south africa
Natalia Cecire tweeted during the topic-modeling workshop that
she was momentarily excited by thinking that a presentation on the
journal Science was on Signs: Journal of Women in Culture and
Society. As it turns out, I have been experimenting with creating topic
models from JSTOR’s Data for Research, and I
decided to see what the Signs corpus would come up with.
I downloaded word-frequency data for all the issues of the journal. I
then used a script to convert the CSV files into *.txt files in which
each word is repeated as many times as it occurs (basically the same
approach described by Andrew Goldstone here. UPDATE: Andrew wrote some
more extensive code for this task.) I then used text2ldac, a Python
script, to convert the text files into a sparse matrix readable by the
LDA algorithms.
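The CSV-to-text step can be sketched like this. It is not the script used for the Signs corpus; it assumes each CSV has a header row followed by `word,count` rows (the `WORDCOUNTS,WEIGHT` header in the sample is an assumption about the DfR format), and `counts_to_text` is a hypothetical name.

```python
import csv
import io


def counts_to_text(csv_text):
    """Turn word,count rows into one string with each word repeated count times."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed to be present)
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        word, count = row[0], int(row[1])
        words.extend([word] * count)
    return " ".join(words)


# A made-up two-word article in the assumed format.
sample = "WORDCOUNTS,WEIGHT\nfeminist,3\nlabor,2\n"
text = counts_to_text(sample)
print(text)
```

Word order is lost in the process, which does not matter for LDA because the model treats each document as a bag of words.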
I am from North Carolina. I’m quite familiar with the eastern part of
the state, having lived there off and on for almost a quarter-century.
Nothing surprised me more in this unusual book than learning there was
apparently a thriving “hippie” scene in Fayetteville in 1970. It seems
unimaginable from what I experienced, but the dynamic created by
soldiers returning from Southeast Asia, heroin, and so on was quite
different from anything I remember.
Anyway, while I was familiar with the broad outlines of the Jeffrey
MacDonald case, I have never read any of the books about it (or seen the
mini-series or any of the other documentaries). It’s an intrinsically
fascinating story, and Errol Morris is in many ways an ideal author to
explore it. Morris has been a philosophy of science student and a
private investigator in addition to being the documentary filmmaker
responsible for freeing an innocent man from a Texas prison, among other
provocations. In particular, Morris is fascinated with epistemology and
what he describes as “postmodernist” attacks upon it. He has written
amusingly
about his encounters with Thomas Kuhn in this regard, and his interest
in this case was certainly furthered by Janet Malcolm’s The Journalist
and the Murderer, a book about the Joe McGinniss/Jeffrey MacDonald
relationship that Morris thinks argues that the truth of the case is
either essentially or practically unknowable. Morris rejects such an
attitude with the entirety of his being, it seems. I find his objections
either to be overstated or grounded in philosophical presuppositions
that I don’t share, but it’s a witty and bracing attitude all the same.
The Guardian recently posted some sales data of the Booker Prize
winners. I thought it would be interesting to compare those figures
with LibraryThing ownership to see how reliable the latter figure might
be in determining a book’s total sales. LibraryThing ownership as a
percentage of sales had a median of 2.77% and a mean of 3.88%.
The table is below, not very well-formatted I’m afraid.
Year | Author | Title | Publisher | Sales | LibraryThing owners | LibraryThing / sales
1969 | PH Newby | Something To Answer For | Faber & Faber | 421 | 64 | 15.20%
1970 | Bernice Rubens | The Elected Member | Eyre & Spottiswoode | 3,901 | 133 | 3.41%
1971 | VS Naipaul | In a Free State | Deutsch | 13,533 | 532 | 3.93%
1972 | John Berger | G | Weidenfeld & Nicolson | 3,863 | 434 | 11.23%
1973 | JG Farrell | The Siege of Krishnapur | Weidenfeld & Nicolson | 50,246 | 1097 | 2.18%
1974 | Stanley Middleton | Holiday | Hutchinson | 1,463 | 55 | 3.76%
1974 | Nadine Gordimer | The Conservationist | Jonathan Cape | 11,282 | 387 | 3.43%
1975 | Ruth Prawer Jhabvala | Heat and Dust | John Murray | 12,729 | 684 | 5.37%
1976 | David Storey | Saville | Jonathan Cape | 4,224 | 112 | 2.65%
1977 | Paul Scott | Staying On | William Heinemann | 19,105 | 452 | 2.37%
1978 | Iris Murdoch | The Sea, The Sea | Chatto & Windus | 94,986 | 1918 | 2.02%
1979 | Penelope Fitzgerald | Offshore | Collins | 15,638 | 550 | 3.52%
1980 | William Golding | Rites of Passage | Faber & Faber | 10,888 | 600 | 5.51%
1981 | Salman Rushdie | Midnight's Children | Jonathan Cape | 201,629 | 8251 | 4.09%
1982 | Thomas Keneally | Schindler's Ark | Hodder & Stoughton | 43,498 | 3898 | 8.96%
1983 | JM Coetzee | Life & Times of Michael K | Secker & Warburg | 30,838 | 1662 | 5.39%
1984 | Anita Brookner | Hotel du Lac | Jonathan Cape | 21,766 | 1492 | 6.85%
1985 | Keri Hulme | The Bone People | Hodder & Stoughton | 27,311 | 2249 | 8.23%
1986 | Kingsley Amis | The Old Devils | Hutchinson | 13,875 | 711 | 5.12%
1987 | Penelope Lively | Moon Tiger | Deutsch | 25,287 | 1070 | 4.23%
1988 | Peter Carey | Oscar and Lucinda | Faber & Faber | 65,858 | 2596 | 3.94%
1989 | Kazuo Ishiguro | The Remains of the Day | Faber & Faber | 178,753 | 8078 | 4.52%
1990 | AS Byatt | Possession | Chatto & Windus | 92,766 | 8549 | 9.22%
1991 | Ben Okri | The Famished Road | Jonathan Cape | 47,996 | 1295 | 2.70%
1992 | Barry Unsworth | Sacred Hunger | Hamish Hamilton | 14,978 | 892 | 5.96%
1992 | Michael Ondaatje | The English Patient | Bloomsbury | 94,391 | 7085 | 7.51%
1993 | Roddy Doyle | Paddy Clarke Ha Ha Ha | Secker & Warburg | 95,090 | 2707 | 2.85%
1994 | James Kelman | How Late It Was, How Late | Secker & Warburg | 12,506 | 813 | 6.50%
1995 | Pat Barker | The Ghost Road | Viking | 92,080 | 1718 | 1.87%
1996 | Graham Swift | Last Orders | Picador | 66,643 | 1784 | 2.68%
1997 | Arundhati Roy | The God of Small Things | Flamingo | 596,847 | 11905 | 1.99%
1998 | Ian McEwan | Amsterdam | Jonathan Cape | 306,579 | 4866 | 1.59%
1999 | JM Coetzee | Disgrace | Secker & Warburg | 257,218 | 6724 | 2.61%
2000 | Margaret Atwood | The Blind Assassin | Bloomsbury | 508,945 | 10311 | 2.03%
2001 | Peter Carey | True History of the Kelly Gang | Faber & Faber | 260,971 | 2632 | 1.01%
2002 | Yann Martel | Life of Pi | Canongate | 1,318,508 | 27429 | 2.08%
2003 | DBC Pierre | Vernon God Little | Faber & Faber | 364,949 | 3562 | 0.98%
2004 | Alan Hollinghurst | The Line of Beauty | Picador | 242,146 | 2981 | 1.23%
2005 | John Banville | The Sea | Picador | 199,275 | 3320 | 1.67%
2006 | Kiran Desai | The Inheritance of Loss | Hamish Hamilton | 184,441 | 4680 | 2.54%
2007 | Anne Enright | The Gathering | Jonathan Cape | 225,425 | 2531 | 1.12%
2008 | Aravind Adiga | The White Tiger | Atlantic | 556,764 | 5534 | 0.99%
2009 | Hilary Mantel | Wolf Hall | Fourth Estate | 630,869 | 4810 | 0.76%
1970 "Lost Booker" | JG Farrell | Troubles | Phoenix | 43,430 | 610 | 1.40%
2010 | Howard Jacobson | The Finkler Question | Bloomsbury | 285,531 | 1329 | 0.47%
2011 | Julian Barnes | The Sense of an Ending | Jonathan Cape | 285,421 | 2341 | 0.82%
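
The percentage column and the summary figures can be recomputed with a short script. This is a sketch only: it uses just the first three rows of the table to stay brief, so its median and mean will not match the full-table figures quoted above, and the function name `ownership_ratios` is made up for illustration.

```python
from statistics import mean, median


def ownership_ratios(rows):
    """Return LibraryThing ownership as a percentage of sales for each row."""
    return [100.0 * owners / sales for sales, owners in rows]


# (sales, LibraryThing owners) for the 1969, 1970, and 1971 winners.
rows = [(421, 64), (3901, 133), (13533, 532)]
ratios = ownership_ratios(rows)

# Summary statistics for this three-row subset only.
subset_median = median(ratios)
subset_mean = mean(ratios)
print(round(subset_median, 2), round(subset_mean, 2))
```

Extending `rows` to every line of the table would give the median and mean for the whole list.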