Author Gender Identification from Text. By: El Hebri Khiari 200790830 COE 589 – Digital Forensics Due: Tuesday 25 th September 2012. Outline. Introduction & Motivation Authorship Attribution Detecting Genders Contribution(s) Problem Formulation Data Pre-Processing
Author Gender Identification from Text By: El Hebri Khiari 200790830 COE 589 – Digital Forensics Due: Tuesday 25th September 2012
Outline • Introduction & Motivation • Authorship Attribution • Detecting Genders • Contribution(s) • Problem Formulation • Data Pre-Processing • Reuters Newsgroup Dataset • Enron Email dataset • Feature Selection & Extraction • Classification Techniques • Experimental Results • Tool Results • Conclusion
Introduction & Motivation • Text is the most prevalent medium on the Internet • Applications • Twitter • Craigslist • Facebook • Statistics • 2008: 33.1% increase in online crime • October 2009: 1.69 billion Internet users • Motivations [??] • Anonymity • Faking gender • “MySpace mom” • Email • Blogs • Chat rooms
Introduction & Motivation cont. • Question • “Given a short text document, can we identify if the author is a man or a woman?”
Authorship Attribution • Features • Stylistic tendency → Stylometric analysis • Over 1000 features • Author’s state of mind • Statistical methods • Word-length Distribution • Bayesian Classifier • Principal Component Analysis • Cluster Analysis • Machine Learning • Decision Tree • Neural Networks • Support Vector Machine (SVM)
Authorship Attribution cont. • Different problem • Abstraction • Length of messages • Special linguistic elements (emoticons) • Time constraints
Detecting Genders • Socially-constructed Gender • Fundamental questions • “Do men & women inherently use different classes of language styles?” • “What are reliable linguistic features that indicate gender?” • Robin Lakoff (1975) • Lexical, syntactic & pragmatic features • Specialized vocabulary, expletives, etc.
Detecting Genders cont. • Mary Talbot (1998) • Influence of social divisions • Mulac et al. (1990), Mulac & Lundell (1994) • Students’ impromptu essays • Descriptions of photographs • Dyadic interactions between strangers • Written communication & face-to-face interaction
Contribution(s) • Little work on gender identification (GI) [??] • Propose • Robust Classifier • Based on content-free text messages • Internet text messages • Feature types • Design • Set of measures • Classifiers & Parameter optimization
Problem Formulation • Binary problem • Class1 if author of e is male • Class2 if author of e is female • Set of features • Constant for same gender • d-dimensional vector
Problem Formulation cont.(1) • Classifier • Learn a classifier $y = f(x)$ from a set of training examples $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ • Let $X = \{x_i,\ i = 1, 2, \dots, N\}$, where $x_i$ is a $d$-dimensional vector • Let $Y = \{y_i,\ i = 1, 2, \dots, N\}$, where $y_i \in \{+1, -1\}$ indicates Class1 ($-1$) or Class2 ($+1$)
Dataset Pre-processing • Two extremes • Newsgroup messages • Reuters newsgroup dataset • Private Emails • Enron email dataset
Dataset Pre-processing cont.(1) • Reuters newsgroup dataset • Stories by Reuters journalists, 1996 – 1997 • Few Hundred to Thousand words • Discard neutral names • Remove unnecessary info & XML formatting • Limiting quotes, 0.002 per character • >200 and <1000 words
Dataset Pre-processing cont.(2) • Enron email dataset • Emails made public by Federal Energy Regulatory Commission • Integrity problems → some emails removed • Invalid emails • Final set • 517,431 emails • 150 users, 3.5 years • Plain text, no attachments • Removed headers & reply texts • Removed duplicated emails • Removed ultra-short emails • >50 and <100 words
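The word-count limits in the pre-processing slides can be sketched as a simple filter. This is a minimal illustration, not the author's script: the function names `word_count` and `filter_by_length` are hypothetical, and words are assumed to be whitespace-separated tokens.

```python
def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def filter_by_length(docs, min_words, max_words):
    """Keep documents with more than min_words and fewer than max_words words."""
    return [d for d in docs if min_words < word_count(d) < max_words]


if __name__ == "__main__":
    docs = ["too short", "word " * 250, "word " * 1500]
    # Bounds taken from the Reuters slide: >200 and <1000 words.
    kept = filter_by_length(docs, 200, 1000)
    print(len(kept))
```

Passing the bounds as parameters lets the same filter serve both datasets, whose limits differ.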
Feature Set Selection • Question • “What are good linguistic features that indicate gender?” • Human psychology & extensive experimentation • Character-based • Word-based • Syntactic • Structure-based • Function words • Total of 545 features
Feature Set Selection cont.(1) • Character-based features • 29 Stylometric features • Widely adopted in Authorship attribution • Examples • Number of white space characters • Number of special characters
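A few character-based counts of the kind listed above might be extracted as follows. This is a sketch only: the slide gives two of the 29 features, so the remaining keys and the `SPECIAL_CHARS` set are illustrative assumptions.

```python
# Hypothetical set of "special" characters; the slides do not list the real one.
SPECIAL_CHARS = set("@#$%^&*~_+=<>/\\|")


def char_features(text):
    """Return a few simple character-level counts for one message."""
    return {
        "total_chars": len(text),
        "whitespace": sum(c.isspace() for c in text),
        "special": sum(c in SPECIAL_CHARS for c in text),
        "digits": sum(c.isdigit() for c in text),
        "uppercase": sum(c.isupper() for c in text),
    }


if __name__ == "__main__":
    print(char_features("Hi there! #1"))
```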
Feature Set Selection cont.(2) • Word-based features • 33 statistical metrics • Vocabulary richness • Yule’s K measure • Entropy measure • 68 psycho-linguistic features • Linguistic Inquiry and Word Count (LIWC) • Individuals benefiting from writing • Positive & negative emotional words • Cognitive words (cause, know) • Switch use of pronouns
Feature Set Selection cont.(3) • Syntactic features • Sentence level • Regular and informal punctuation • Mulac (1998) • Women use more question marks
Feature Set Selection cont.(4) • Structure-based features • Layout • Paragraph length • Use of greetings • Big influence in online documents
Feature Set Selection cont.(5) • Function words • Ambiguous meaning • Grammatical relationships • Different set from word-based • Important role • 9 gender-linked features • Women use emotionally-intensive & affective adjectives • Men express ‘independence’ → First-person singular pronouns
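Two of the vocabulary-richness metrics named above have standard definitions that are easy to sketch. Yule's K is 10^4 · (Σ m²·V_m − N) / N², where N is the number of tokens and V_m is the number of word types occurring m times; the entropy measure here is the Shannon entropy of the word distribution. The tokenizer is a simplifying assumption, and the slides do not say which exact variants were used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def yules_k(tokens):
    """Yule's K: higher values mean more repetition (a less rich vocabulary)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens)
    # Sum over m of m^2 * V_m equals the sum of squared frequencies over types.
    sum_sq = sum(f * f for f in freqs.values())
    return 1e4 * (sum_sq - n) / (n * n)


def word_entropy(tokens):
    """Shannon entropy (in bits) of the empirical word distribution."""
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens)
    return -sum((f / n) * math.log2(f / n) for f in freqs.values())


if __name__ == "__main__":
    toks = tokenize("the cat saw the dog and the dog saw the cat")
    print(yules_k(toks), word_entropy(toks))
```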
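Sentence-level punctuation counts, including informal runs such as "?!" or "!!!", could be gathered along these lines. The feature names and the naive sentence splitter are assumptions made for illustration.

```python
import re


def syntactic_features(text):
    """Count regular and informal punctuation in one message."""
    # Naive sentence split on runs of terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "sentences": len(sentences),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "commas": text.count(","),
        # Informal punctuation: two or more consecutive '!' or '?' characters.
        "multi_punct": len(re.findall(r"[!?]{2,}", text)),
        "ellipses": len(re.findall(r"\.{3,}", text)),
    }


if __name__ == "__main__":
    print(syntactic_features("Really?! Are you sure... Yes, I am!!!"))
```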
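Layout features such as paragraph length and the use of a greeting can be approximated as below. The `GREETINGS` tuple is a hypothetical list, and paragraphs are assumed to be separated by blank lines.

```python
import re

# Hypothetical greeting words; the slides do not specify the real list.
GREETINGS = ("hi", "hello", "dear", "hey")


def structure_features(text):
    """Return simple layout features for one message."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    lengths = [len(p.split()) for p in paragraphs]
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "paragraphs": len(paragraphs),
        "avg_paragraph_words": sum(lengths) / len(lengths) if lengths else 0.0,
        # True when the very first word of the message is a greeting.
        "has_greeting": bool(words) and words[0] in GREETINGS,
    }


if __name__ == "__main__":
    print(structure_features("Hi Bob,\n\nSee you at noon tomorrow.\n\nThanks"))
```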
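Function-word features are typically relative frequencies of a fixed word list. The sketch below assumes a short illustrative list, `FUNCTION_WORDS`, that includes some first-person singular pronouns; the actual list behind the slides is not given.

```python
import re
from collections import Counter

# Illustrative list only; the real feature set is much larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "i", "my", "me", "you")


def function_word_freqs(text):
    """Map each function word to its relative frequency in the message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    return {w: counts[w] / total for w in FUNCTION_WORDS}


if __name__ == "__main__":
    print(function_word_freqs("I like the cat and the dog"))
```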
Automatic Extraction • Normalization
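The slide does not say which normalization was applied. One common choice, assumed here purely for illustration, is min-max scaling of each feature column to the range [0, 1] so that features measured on different scales are comparable.

```python
def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns become 0.0."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


if __name__ == "__main__":
    # Each row is one message, each column one feature.
    print(min_max_normalize([[1, 10], [3, 10], [5, 10]]))
```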
Classification Techniques • Three classifiers • Bayesian-based logistic regression • AdaBoost Decision tree • Support Vector Machine (SVM)
Classification Techniques cont.(1) • Bayesian-based logistic regression • Probability • Threshold set to 0.5
Classification Techniques cont.(2) • Avoid overfitting by placing a prior on the model parameters • Assume each parameter has a Normal distribution • Mean = 0, with its own variance • Assume each variance has an exponential distribution • Together these give a Laplace distribution on each parameter
Classification Techniques cont.(3) • Assume components of the parameter vector are independent • Overall prior of the parameter vector • Posterior density of the parameters given dataset D
Classification Techniques cont.(4) • Use log posterior l • Minimizing the negative log posterior –l is a convex problem • Suitable for optimization
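The objective described on these slides can be sketched for the standard case: logistic likelihood with labels in {+1, −1} and an independent Laplace prior on each parameter. The names `beta` (parameter vector) and `lam` (Laplace rate) are assumptions, since the original symbols did not survive, and the value returned is the negative log posterior up to an additive constant.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log1p_exp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))


def neg_log_posterior(beta, X, y, lam):
    """Logistic loss plus the L1 penalty implied by a Laplace prior."""
    data_term = sum(log1p_exp(-yi * dot(beta, xi)) for xi, yi in zip(X, y))
    prior_term = lam * sum(abs(b) for b in beta)
    return data_term + prior_term


def predict(beta, x, threshold=0.5):
    """Return +1 when P(y = +1 | x) reaches the threshold, otherwise -1."""
    p_pos = math.exp(-log1p_exp(-dot(beta, x)))  # stable sigmoid
    return 1 if p_pos >= threshold else -1


if __name__ == "__main__":
    X, y = [[1.0, 2.0], [3.0, 4.0]], [1, -1]
    print(neg_log_posterior([0.0, 0.0], X, y, lam=1.0))
```

Both terms are convex in `beta`, which is why the slide calls the objective suitable for optimization; the L1 term also tends to push some parameters to exactly zero.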
Classification Techniques cont.(5) • Decision Tree • Flowchart-like tree structure • Attribute → Internal node • Outcome → Branch • Class → Terminal node • High variance → Overfitting • AdaBoost • Solid theoretical background • Simple • Accurate predictions • Proven to be successful
Classification Techniques cont.(6) • Assign equal weights to all training examples • Weights follow distribution $D_t$ at the $t$-th round • Generate weak learner $h_t : X \to Y$ • Test $h_t$, compute new weight distribution $D_{t+1}$ • Repeat T times
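The boosting rounds listed above can be sketched with one-level decision trees (stumps) as the weak learner. This is the textbook discrete AdaBoost update, not necessarily the exact variant used in the experiments; all function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Pick the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                preds = [sign if x[j] <= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] <= thr else -sign


def adaboost(X, y, rounds):
    """Return a list of (alpha, stump) pairs after the given number of rounds."""
    n = len(X)
    w = [1.0 / n] * n  # equal weights to start
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise weights of misclassified examples, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    X = [[1], [2], [3], [4], [5], [6]]
    y = [1, 1, -1, -1, 1, 1]
    model = adaboost(X, y, rounds=3)
    print([adaboost_predict(model, x) for x in X])
```

No single stump can separate the toy data in the usage example, which is the situation boosting is meant to handle.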
Classification Techniques cont.(7) • Support Vector Machine • Linearly separable classes • Optimal • Linearly inseparable
Classification Techniques cont.(8) • Non-linear problem • Use Kernel trick • Linear • Polynomial • Radial basis
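The three kernels named above have standard forms, sketched here for plain Python lists. The default values for `degree`, `coef0`, and `gamma` are illustrative assumptions, not the tuned parameters from the experiments.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return dot(x, z)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    x, z = [1.0, 2.0], [3.0, 4.0]
    print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```

Each function returns an inner product in some feature space, so a linear SVM applied through the kernel yields a non-linear boundary in the original feature space.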
Experimental Results • Feature Extraction → Python • Classifiers → MATLAB • Each experiment repeated 10 times
Experimental Results cont.(1) • SVM outperforms (76.75% & 82.23%) • Sharp improvements in AdaBoost • Small changes in Bayesian Logistic Regression
Experimental Results cont.(2) • Impact of parameters • >50 words • >100 words • >200 words
Experimental Results cont.(3) • Significance of feature sets • >100 words • One feature at a time
Experimental Results cont.(4) • Optimization • 5% Feature size reduction: 157 out of 545 • Extraction time reduced from 3.77 to 1.35 seconds • 3.03% drop in accuracy
Tool Results • male 64.46% • male 75.83% • male 59.89% • neutral 96.98% ?? • male 58.31% • male 72.60% • male 63.30% • male 57.57% • male 73.89% • male 59.07% • Actual Results: 5 male out of 10
Conclusion • Differences do exist between genders • SVM outperforms • Significant features [??] • Word-based features • Function words • Structural features • Larger data set → better accuracy