Gender in Twitter: Styles, Stances, and Social Networks Tyler Schnoebelen (reporting joint work with David Bamman and Jacob Eisenstein)
At its most basic • Assumption 1: Men and women use different vocabularies • Hypothesis I: Computational methods can cut through noise and predict speaker gender based on the words they use • Assumption 2: Social networks are typically “homophilous” (birds of a feather flock together) • Hypothesis II: Adding the gender make-up of a user’s social network should get even better prediction
Actual goal • Problematize gender prediction as a task • Define a system where we could just “stop” and call it good • But NOT ACTUALLY STOP • Demonstrate that simple gender binaries aren’t actually descriptively accurate • Show ways to combine social theory and computational methods that expand the questions on both sides
“Standard” is a keyword • Although standard poodles isn’t what Cheshire (2004), Cameron & Coates (1989), Eckert & McConnell-Ginet (1999), Holmes (1997), or Romaine (2003) have in mind.
Typical findings • Women use standard variables more often than men. • In fact, early dialectologists ignored women completely because they wanted “NORMS”: non-mobile, older, rural male speakers, seen as preserving the purest regional (non-standard) forms. See Chambers and Trudgill (1980). • Do women use standard forms for prestige (to acquire social capital)? • To avoid losing status? • Are women actually creating norms, not following them? • Check out your textbook (“Whose speech is more standard”) for more complications to this picture
More computational work • People are fascinated by gender differences • To reach statistical significance, you need enough data to detect a signal • In the past, this has led researchers to roll up words into word classes
The most common distinctions • Men use informative language: prepositions, attributive adjectives, longer words • Women use involved language: first and second person pronouns, present tense verbs, contractions
Or by “contextuality” • Men are formal and explicit: nouns, adjectives, prepositions, articles • Women are deictic and contextual: pronouns, verbs, adverbs, interjections • “Contextuality” decreases when unambiguous understanding is more important or harder to achieve, for example when people are physically or socially farther apart
Our approach also lumps • It’s just at a lower level because instead of “nouns” or “blog words”, we have “unigrams”. • We also ran our work with part-of-speech tagged unigrams for one level less lumping; the results are basically the same but are not reported here. • Lumping itself isn’t a problem. In fact, you have to lump. • But ideologies are going to structure your lumpings, so watch out!
Data • Public Twitter messages in same-gender and cross-gender social networks • Word frequencies (unigrams) • Gender induced from first names, e.g., the Social Security Administration says: Tyler is a male name 97.36% of the time; Penny and Annette are female names 100% of the time; Robin is female 87.69% of the time • 14,464 Twitter users (56% male): geolocated in the US; must use 50 of the top 1,000 most frequent words; between 4 and 100 “mutual @’s” separated by 14 days • Women have 58% female friends • Men have 67% male friends • 9.2M tweets, Jan-Jun 2011
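A minimal sketch of inducing gender from a first name, using only the four name statistics quoted on the Data slide. The lookup table, the function name `induce_gender`, and the confidence threshold are assumptions made for illustration; the slides do not say what cutoff, if any, was applied.

```python
# Share of people with each first name recorded as female.
# The four figures come from the slide; the table itself is a toy stand-in
# for the full Social Security Administration name data.
FEMALE_SHARE = {
    "tyler": 0.0264,   # male 97.36% of the time
    "penny": 1.0,
    "annette": 1.0,
    "robin": 0.8769,
}


def induce_gender(first_name, min_confidence=0.85):
    """Return 'F', 'M', or None if the name is unknown or too ambiguous.

    min_confidence is a hypothetical cutoff, not a value from the slides.
    """
    share = FEMALE_SHARE.get(first_name.strip().lower())
    if share is None:
        return None
    if share >= min_confidence:
        return "F"
    if 1 - share >= min_confidence:
        return "M"
    return None


for name in ["Tyler", "Penny", "Robin", "Zed"]:
    print(name, induce_gender(name))
```

Raising `min_confidence` trades coverage for label quality: at 0.9, Robin would no longer receive a label.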
First step: take the “normal” route • Train a statistical model (logistic regression) on part of the data. • Test it on a different part of the data, hiding the gender labels. • 10-fold cross-validation: 10 unique training/test splits (each test set is a different 10% of the data) • State-of-the-art prediction accuracy: 88.9% • Lexical features do strongly predict gender • Ignoring syntax (treating tweets as “bags of words”) does pretty well
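The “normal” route above can be sketched in plain Python: bag-of-words features, a logistic regression fit by stochastic gradient descent, and 10-fold cross-validation. This is an illustrative toy, not the authors’ implementation. The function names, learning rate, epoch count, and the 20 synthetic “tweets” (built from marker words listed elsewhere in the deck) are all assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Unigram counts; word order (syntax) is discarded."""
    return Counter(text.lower().split())


def k_fold_splits(n_items, k=10):
    """Yield (train, test) index lists; every item is tested exactly once."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def train_logistic(docs, labels, epochs=50, rate=0.1):
    """Fit one weight per word by stochastic gradient descent on log-loss."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            score = bias + sum(weights[w] * c for w, c in doc.items())
            error = label - 1 / (1 + math.exp(-score))
            bias += rate * error
            for w, c in doc.items():
                weights[w] += rate * error * c
    return weights, bias


def predict(model, doc):
    """Return 1 (female) if the linear score is positive, else 0 (male)."""
    weights, bias = model
    score = bias + sum(weights[w] * c for w, c in doc.items())
    return 1 if score > 0 else 0


def cross_validate(texts, labels, k=10):
    """Average accuracy over k held-out folds; labels are hidden at test time."""
    docs = [bag_of_words(t) for t in texts]
    correct = 0
    for train, test in k_fold_splits(len(docs), k):
        model = train_logistic([docs[i] for i in train],
                               [labels[i] for i in train])
        correct += sum(predict(model, docs[i]) == labels[i] for i in test)
    return correct / len(docs)


# Synthetic toy data (assumption): each "tweet" is two marker words.
female_words = ["okay", "yes", "yesss", "nooo", "cannot"]
male_words = ["yessir", "nah", "nobody", "ain't"]
texts = [f"{female_words[i % 5]} {female_words[(i + 1) % 5]}" for i in range(10)]
texts += [f"{male_words[i % 4]} {male_words[(i + 1) % 4]}" for i in range(10)]
labels = [1] * 10 + [0] * 10

accuracy = cross_validate(texts, labels)
print(f"10-fold accuracy on toy data: {accuracy:.2f}")
```

The toy data is perfectly separable, so its accuracy says nothing about the 88.9% reported on real tweets; the point is only the shape of the train/test procedure.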
Are women less standard? • Female markers: okay, yes, yess, yesss, yessss; nooo, noooo; cannot • Male markers: yessir; nah, nobody; ain’t • What counts as standard?
Hand classification (94.2% agreement) • At a corpus level, women use more non-dictionary words and men use more named entities. In a moment we’ll ask how universal this is.
But wait • “Dictionary” words are really diverse • There’s a sense that dude (m), cute (f), epic (m), and lovely (f) are “stylistic” in a way that ability (m), correct (m), lipstick (f) and sleepy (f) are not, but how would we pin this down? • Part of speech? • But in what way do cute (f), hot (f), epic (m), and solid (m) belong with correct (m), offensive (m), sleepy (f) and glad (f)? • And for hot and solid, the “style” or “content” division depends on the intended word sense.
Involvement • Using traditional definitions, we’d say that our data confirm men as more informational (all those named entities) and women as more interactive/involved (pronouns, emoticons, etc.). • Recall that most of the named entities for the men are sports figures and teams.
Shit Girls Say http://www.youtube.com/watch?feature=player_embedded&v=u-yLGIH7W9Y
Notice • That gender wasn’t really limited to the “gender” column • “Moms” and “dads” are gendered social roles • And that the words “guys” and “girls” aren’t really the same as “male” and “female” • What are the plausible age ranges and social styles for “guys” and “girls”?
Clustering without regard to gender • We clustered authors into 20 clusters, ignoring their gender • Clustering considered text only • K-means with log-linear distributions • (Eisenstein, Ahmed, and Xing, ICML 2011) • Many clusters have strong demographic orientations, including gender, race, and age
Clusters that are majority female • At the population level, women use few named entities and many non-dictionary words. But there are clusters of (mostly) women who do the opposite.
Clusters that are majority male • At the population level, men use many named entities and few non-dictionary words. But there are clusters of men who do the opposite.
Erasure! • Clusters are highly gendered • For example, let’s consider clusters in which 60% or more of the authors share the same gender • That covers 72.79% of all the authors • But what about the 1,420 men who are part of female-majority clusters? • The 1,219 women who are part of male-majority clusters? • The 782 people who are part of clusters that aren’t gender-skewed? • Are they just noise? Odd-balls? Is there no structure to what they’re doing?
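The bookkeeping behind the counts above can be sketched as follows. The cluster sizes are made-up toy numbers, not the study’s clusters, and the function name and the reading of “gender-skewed” as “at least 60% one gender” are assumptions taken from the slide’s wording.

```python
def summarize_clusters(clusters, threshold_percent=60):
    """Count who gets "erased" when clusters are labeled by majority gender.

    clusters: list of (n_women, n_men) pairs, one per cluster.
    A cluster counts as skewed when one gender makes up at least
    threshold_percent of its authors.
    """
    summary = {
        "authors_in_skewed": 0,
        "men_in_female_clusters": 0,
        "women_in_male_clusters": 0,
        "authors_in_unskewed": 0,
    }
    for women, men in clusters:
        total = women + men
        # Integer comparison avoids floating-point edge cases at the cutoff.
        if women * 100 >= threshold_percent * total:
            summary["authors_in_skewed"] += total
            summary["men_in_female_clusters"] += men
        elif men * 100 >= threshold_percent * total:
            summary["authors_in_skewed"] += total
            summary["women_in_male_clusters"] += women
        else:
            summary["authors_in_unskewed"] += total
    return summary


# Toy clusters (assumption): one female-skewed, one male-skewed, one balanced.
toy_clusters = [(80, 20), (30, 70), (50, 50)]
toy_summary = summarize_clusters(toy_clusters)
print(toy_summary)
```

Every author who falls into one of the last three counters is someone a majority-gender label on the cluster would describe incorrectly.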
The classifier does best classifying women with female networks
The classifier does best classifying men with male networks.
Markers go beyond “you” • The decile of men with the most female-skewed social networks uses far more female lexical markers than male markers (only 25% of the markers they use are male). • For the decile of men with the most male-skewed networks, male and female markers are used at roughly equal rates (because the female markers include more common words). • For the decile of women with the most female-skewed networks, 85% of the lexical markers that they use are female. • For the decile of women with the most male-skewed networks, 75% of the lexical markers that they use are female.
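The percentages above are shares of gendered markers among all the markers a group uses. A minimal sketch of that calculation, assuming the short marker lists from the “Are women less standard?” slide stand in for the full marker sets; the function name and the example tokens are hypothetical.

```python
# Marker words taken from the earlier slide; the real marker sets are larger.
FEMALE_MARKERS = {"okay", "yes", "yess", "yesss", "yessss",
                  "nooo", "noooo", "cannot"}
MALE_MARKERS = {"yessir", "nah", "nobody", "ain't"}


def female_marker_share(tokens):
    """Fraction of the gendered markers in tokens that are female markers.

    Returns None when the tokens contain no markers at all.
    """
    female = sum(t in FEMALE_MARKERS for t in tokens)
    male = sum(t in MALE_MARKERS for t in tokens)
    used = female + male
    return female / used if used else None


# Hypothetical pooled tokens for one group of users.
example_tokens = "okay nah yes i cannot".split()
share = female_marker_share(example_tokens)
print(share)
```

Words that are not in either list (such as “i” here) do not affect the share, which is why a group can have a 75% female share while using mostly unmarked vocabulary.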
Does social network help prediction? • 89% accuracy with text alone • Logistic regression, 10-fold cross-validation • State-of-the-art • Add network information… • Still 89% accuracy
Wait, why not? • A new feature is only going to improve classification accuracy if it adds new information. • There is strong homophily: 63% of the connections are between same-gender individuals. • But language and social network can’t mutually disambiguate because they aren’t independent views on gender • Individuals who use linguistic resources from “the opposite gender” have consistently denser social network connections to the opposite gender. • Performance, style, accommodation • Gender is not an “A or B” kind of thing
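The homophily figure above is the share of connections that link two users of the same gender. A minimal sketch of that measurement on a made-up four-person network; the user names, edges, and function names are assumptions for illustration only.

```python
def same_gender_share(edges, gender):
    """Fraction of edges whose two endpoints have the same gender label."""
    same = sum(gender[a] == gender[b] for a, b in edges)
    return same / len(edges)


def female_friend_share(user, edges, gender):
    """Fraction of a user's connections that are labeled female."""
    friends = [b if a == user else a for a, b in edges if user in (a, b)]
    if not friends:
        return None
    return sum(gender[f] == "F" for f in friends) / len(friends)


# Toy network (assumption): two women, two men, four mutual connections.
toy_gender = {"ann": "F", "bea": "F", "cal": "M", "dan": "M"}
toy_edges = [("ann", "bea"), ("ann", "cal"), ("cal", "dan"), ("bea", "dan")]

print(same_gender_share(toy_edges, toy_gender))
print(female_friend_share("ann", toy_edges, toy_gender))
```

A per-user value such as `female_friend_share` is the kind of network feature that could be added to a text classifier. As the slide argues, it only helps when it disagrees with the text in informative ways, and here the two tend to move together.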
Not so simple • If we want to understand categories, we should start with people in interactions. • Counting is great but we have to watch our bins and investigate them, too. • A binary model of gender is only adequate if you have blinders on • “My mom has never in her life said that’s lovely or omg!...nevermind that!” • And we can’t trust the idea that we’ll just figure out each of the independent parts, as if figuring out “woman” and “African American” were enough to understand “African American women”. • Big data offers us the opportunity to let clusters emerge (and test them against our big bins). • In other words, Twitter and other forms of big data offer a way to show how language reflects and creates the social worlds we live in.