Topic Modeling Signs
Natalia Cecire tweeted during the topic-modeling workshop that she was momentarily excited by thinking that a presentation on the journal Science was on Signs: Journal of Women in Culture and Society. As it turns out, I have been experimenting with creating topic models from JSTOR’s Data for Research, and I decided to see what the Signs corpus would come up with.
I downloaded word-frequency data for all the issues of the journal. I then used a script to convert the CSV files into *.txt files in which each word is repeated as many times as its frequency count (basically the same approach described by Andrew Goldstone here. UPDATE: Andrew wrote some more extensive code for this task.) I then used text2ldac, a Python script, to convert the text files into a sparse matrix readable by the LDA algorithms.
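To make the conversion step concrete, here is a minimal Python sketch of the idea. It is not the script I used or Andrew's code. It assumes each CSV file has a header row followed by two columns, a word and its count; the function name expand_counts is hypothetical.

```python
import csv
import io


def expand_counts(csv_text, has_header=True):
    """Turn 'word,count' rows into one string with each word repeated count times.

    Assumes a two-column layout (word, count); adjust if your files differ.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        word = row[0].strip()
        count = int(row[1])
        words.extend([word] * count)
    return " ".join(words)


if __name__ == "__main__":
    sample = "WORDCOUNTS,WEIGHT\nwomen,3\nlabor,1\n"
    print(expand_counts(sample))
```

Writing the returned string to a *.txt file, one file per article, gives the kind of plain-text input that a converter such as text2ldac expects.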
To generate the topic models, I am currently using the R package topicmodels. (There are additional details about how to import the ldac files created by text2ldac into R, but I’ll skip those for the time being.) I ran a 25-topic model with Gibbs sampling on the Signs data, with the minimum word frequency set at seven (text2ldac handles this as an option). I also used a stop-list of common English words, plus the name of the journal and a few French articles and prepositions.
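The vocabulary filtering and the sparse format can be sketched in a few lines of Python. This is only an illustration of the idea, not text2ldac's actual code. The function names are hypothetical, and I am assuming the minimum frequency means total occurrences across the corpus. The output follows the LDA-C convention, in which each document is one line: the number of distinct terms, then term_id:count pairs.

```python
from collections import Counter


def build_vocab(docs, stopwords, min_count):
    """Map each kept word to an integer id.

    A word is kept if it is not a stop word and occurs at least
    min_count times across all documents (an assumption about
    what "minimum frequency" means).
    """
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    kept = sorted(
        w for w, c in totals.items() if c >= min_count and w not in stopwords
    )
    return {w: i for i, w in enumerate(kept)}


def to_ldac_line(tokens, vocab):
    """Format one document as an LDA-C line: 'N id:count id:count ...'."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{len(counts)} {pairs}".strip()


if __name__ == "__main__":
    docs = [["women", "labor", "labor", "the"], ["labor", "work", "the"]]
    vocab = build_vocab(docs, stopwords={"the"}, min_count=1)
    for tokens in docs:
        print(to_ldac_line(tokens, vocab))
```

Adding a word to the stopwords set removes it from the vocabulary entirely, so it never reaches the model.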
This is what resulted:
[1] “art” “pocahontas” “english” “woman” “women” “female” “sappho”
[2] “role” “human” “social” “models” “behavior” “women” “primate”
[3] “women” “feminist” “gender” “social” “men” “university” “work”
[4] “women” “marriage” “men” “hispanic” “honor” “village” “university”
[5] “women” “state” “rights” “law” “social” “public” “political”
[6] “women” “rape” “percent” “white” “bermuda” “sex” “men”
[7] “nuclear” “language” “feminist” “defense” “war” “weapons” “text”
[8] “feminist” “social” “language” “women” “relations” “gender” “theory”
[9] “women” “labor” “work” “family” “children” “economic” “men”
[10] “women” “psychology” “feminist” “female” “mother” “woolf” “male”
[11] “women” “jane” “michel” “life” “woman” “early” “louise”
[12] “moral” “women” “muslim” “care” “theory” “social” “morality”
[13] “women” “science” “studies” “education” “social” “students” “feminist”
[14] “women” “library” “men” “beauvoir” “language” “ddc” “classification”
[15] “medical” “martineau” “women” “harriet” “alice” “cases” “mother”
[16] “women” “club” “literacy” “printed” “press” “university” “print”
[17] “women” “body” “religious” “female” “st” “dance” “christian”
[18] “freud” “charcot” “field” “lucy” “women” “bronte” “social”
[19] “feminist” “incest” “women” “distance” “theory” “aerial” “sexual”
[20] “women” “role” “female” “woman” “control” “german” “social”
[21] “women” “feminist” “mizrahi” “political” “movement” “israeli” “social”
[22] “black” “women” “white” “race” “community” “work” “men”
[23] “sexual” “lesbian” “sexuality” “women” “sex” “love” “female”
[24] “lesbian” “work” “lesbians” “nature” “gender” “identity” “women”
[25] “women” “chinese” “suffrage” “white” “african” “china” “national”
There’s a lot of experimenting involved in determining a suitable number of topics. Some of these look like they might be separated from others, and others look like they belong in different topics. I also see some words to add to the stop-list (“ddc” and “st,” for example). (I couldn’t decide about “women/woman.” Are they too common, or would it distort the results to leave them out?) Stemming would also help.
UPDATE:
I ran the model without “women” or “woman,” since those words were possibly over-represented in the terms above. Here is the result:
[1] “moral” “role” “care” “social” “art” “theory” “models”
[2] “black” “white” “race” “children” “group” “community” “racial”
[3] “men” “african” “kenya” “marriage” “groups” “university” “marry”
[4] “studies” “university” “american” “education” “history” “female” “students”
[5] “science” “social” “human” “sex” “differences” “field” “behavior”
[6] “work” “labor” “family” “social” “men” “children” “economic”
[7] “work” “gender” “martineau” “men” “hispanic” “feminization” “cultural”
[8] “rape” “percent” “bermuda” “white” “sex” “men” “raped”
[9] “religious” “male” “life” “female” “medieval” “early” “university”
[10] “law” “muslim” “legal” “court” “rights” “state” “islamic”
[11] “feminist” “nuclear” “language” “distance” “defense” “theory” “aerial”
[12] “freud” “pocahontas” “english” “charcot” “masque” “smith” “virginia”
[13] “michel” “death” “life” “female” “love” “louise” “political”
[14] “film” “body” “irigaray” “work” “university” “female” “desire”
[15] “chinese” “men” “indian” “china” “public” “library” “asian”
[16] “sexual” “sex” “gender” “men” “sexuality” “female” “social”
[17] “political” “feminist” “gender” “state” “movement” “rights” “social”
[18] “female” “love” “lesbian” “literary” “university” “life” “story”
[19] “nature” “control” “carson” “sea” “human” “natural” “life”
[20] “jane” “bronte” “lucy” “novel” “rochester” “eyre” “marriage”
[21] “club” “literacy” “printed” “press” “university” “print” “white”
[22] “incest” “feminist” “mother” “sexual” “psychology” “mothers” “men”
[23] “feminist” “social” “gender” “theory” “feminism” “political” “relations”
[24] “suffrage” “white” “national” “german” “data” “movement” “nigerian”
[25] “mizrahi” “lesbian” “feminist” “lesbians” “israeli” “identity” “israel”
Still a few oddities, but it looks reasonable.