Topic Modeling Signs

Mon, Nov 12, 2012

Natalia Cecire tweeted during the topic-modeling workshop that she was momentarily excited by thinking that a presentation on the journal Science was on Signs: Journal of Women in Culture and Society. As it turns out, I have been experimenting with creating topic models from JSTOR’s Data for Research, and I decided to see what the Signs corpus would come up with.

I downloaded word-frequency data for all the issues of the journal. I then used a script to convert the CSV files into *.txt files in which each word is repeated as many times as its frequency count (basically the same approach described by Andrew Goldstone. UPDATE: Andrew has since written more extensive code for this task.) I then used text2ldac, a Python script, to convert the text files into the sparse-matrix format read by LDA implementations.
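As a rough illustration of the conversion step described above, here is a minimal Python sketch. It is not the script actually used for this post. It assumes each CSV file holds two columns, a word and its count, under a header row; the header names in the sample and the function name `expand_counts` are assumptions made for the example.

```python
import csv
import io

def expand_counts(csv_text, has_header=True):
    """Turn 'word,count' rows into one string with each word repeated count times."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        word, count = row[0].strip(), int(row[1])
        words.extend([word] * count)
    return " ".join(words)

# Hypothetical sample data; the header names are an assumption.
sample = "WORDCOUNTS,WEIGHT\nwomen,3\nlabor,2\n"
print(expand_counts(sample))
```

Writing the returned string to a *.txt file, one file per article, gives a bag-of-words "document" that a tool expecting plain text can tokenize and count again.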

To generate the topic models, I am currently using the R package topicmodels. (There are additional details about how to import the ldac files created by text2ldac into R, but I’ll skip those for the time being.) I ran a 25-topic model with Gibbs sampling on the Signs data, with the minimum word frequency set at seven (text2ldac handles this as an option). I also used a stop-list of common English words, plus the name of the journal and a few French articles and prepositions.
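To make the preprocessing concrete, here is a small Python sketch of the kind of filtering and encoding involved: drop stop-listed words, drop words below a minimum count, and write each document in the LDA-C sparse format (the number of distinct terms, followed by `term_id:count` pairs). This is not text2ldac's code. The function name `build_ldac` is hypothetical, and applying the threshold to total occurrences across the corpus is an assumption about how such a cutoff might work.

```python
from collections import Counter

def build_ldac(docs, stop_words, min_count):
    """docs is a list of token lists. Returns (vocab, lines in LDA-C format)."""
    # Total occurrences of each token across the corpus (assumed threshold basis).
    totals = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(
        w for w, n in totals.items()
        if n >= min_count and w not in stop_words
    )
    index = {w: i for i, w in enumerate(vocab)}
    lines = []
    for doc in docs:
        # Count only tokens that survived the filtering.
        counts = Counter(index[t] for t in doc if t in index)
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        lines.append(f"{len(counts)} {pairs}".strip())
    return vocab, lines

# Toy corpus for illustration only.
docs = [["women", "labor", "labor", "the"], ["labor", "women", "signs"]]
vocab, lines = build_ldac(docs, stop_words={"the", "signs"}, min_count=2)
print(vocab)
print(lines)
```

The vocabulary list doubles as the lookup table for turning term ids back into words when inspecting the top terms of each topic.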

This is what resulted:

[1] “art” “pocahontas” “english” “woman” “women” “female” “sappho”

[2] “role” “human” “social” “models” “behavior” “women” “primate”

[3] “women” “feminist” “gender” “social” “men” “university” “work”

[4] “women” “marriage” “men” “hispanic” “honor” “village” “university”

[5] “women” “state” “rights” “law” “social” “public” “political”

[6] “women” “rape” “percent” “white” “bermuda” “sex” “men”

[7] “nuclear” “language” “feminist” “defense” “war” “weapons” “text”

[8] “feminist” “social” “language” “women” “relations” “gender” “theory”

[9] “women” “labor” “work” “family” “children” “economic” “men”

[10] “women” “psychology” “feminist” “female” “mother” “woolf” “male”

[11] “women” “jane” “michel” “life” “woman” “early” “louise”

[12] “moral” “women” “muslim” “care” “theory” “social” “morality”

[13] “women” “science” “studies” “education” “social” “students” “feminist”

[14] “women” “library” “men” “beauvoir” “language” “ddc” “classification”

[15] “medical” “martineau” “women” “harriet” “alice” “cases” “mother”

[16] “women” “club” “literacy” “printed” “press” “university” “print”

[17] “women” “body” “religious” “female” “st” “dance” “christian”

[18] “freud” “charcot” “field” “lucy” “women” “bronte” “social”

[19] “feminist” “incest” “women” “distance” “theory” “aerial” “sexual”

[20] “women” “role” “female” “woman” “control” “german” “social”

[21] “women” “feminist” “mizrahi” “political” “movement” “israeli” “social”

[22] “black” “women” “white” “race” “community” “work” “men”

[23] “sexual” “lesbian” “sexuality” “women” “sex” “love” “female”

[24] “lesbian” “work” “lesbians” “nature” “gender” “identity” “women”

[25] “women” “chinese” “suffrage” “white” “african” “china” “national”

There’s a lot of experimenting involved in determining a suitable number of topics. Some of these topics look like they could be split apart, and some of the words look like they belong in different topics. I also see some words to add to the stop-list (“ddc” and “st,” for example). (I couldn’t decide about “women/woman.” Are they too common, or would leaving them out distort the results?) Stemming would also help.

UPDATE:

I ran the model without “women” or “woman,” since those words were possibly over-represented in the terms above. Here is the result:

[1] “moral” “role” “care” “social” “art” “theory” “models”

[2] “black” “white” “race” “children” “group” “community” “racial”

[3] “men” “african” “kenya” “marriage” “groups” “university” “marry”

[4] “studies” “university” “american” “education” “history” “female” “students”

[5] “science” “social” “human” “sex” “differences” “field” “behavior”

[6] “work” “labor” “family” “social” “men” “children” “economic”

[7] “work” “gender” “martineau” “men” “hispanic” “feminization” “cultural”

[8] “rape” “percent” “bermuda” “white” “sex” “men” “raped”

[9] “religious” “male” “life” “female” “medieval” “early” “university”

[10] “law” “muslim” “legal” “court” “rights” “state” “islamic”

[11] “feminist” “nuclear” “language” “distance” “defense” “theory” “aerial”

[12] “freud” “pocahontas” “english” “charcot” “masque” “smith” “virginia”

[13] “michel” “death” “life” “female” “love” “louise” “political”

[14] “film” “body” “irigaray” “work” “university” “female” “desire”

[15] “chinese” “men” “indian” “china” “public” “library” “asian”

[16] “sexual” “sex” “gender” “men” “sexuality” “female” “social”

[17] “political” “feminist” “gender” “state” “movement” “rights” “social”

[18] “female” “love” “lesbian” “literary” “university” “life” “story”

[19] “nature” “control” “carson” “sea” “human” “natural” “life”

[20] “jane” “bronte” “lucy” “novel” “rochester” “eyre” “marriage”

[21] “club” “literacy” “printed” “press” “university” “print” “white”

[22] “incest” “feminist” “mother” “sexual” “psychology” “mothers” “men”

[23] “feminist” “social” “gender” “theory” “feminism” “political” “relations”

[24] “suffrage” “white” “national” “german” “data” “movement” “nigerian”

[25] “mizrahi” “lesbian” “feminist” “lesbians” “israeli” “identity” “israel”

Still a few oddities, but it looks reasonable.