
Author Gender Identification from Text



  1. Author Gender Identification from Text By: El Hebri Khiari 200790830 COE 589 – Digital Forensics Due: Tuesday 25th September 2012

  2. Outline • Introduction & Motivation • Authorship Attribution • Detecting Genders • Contribution(s) • Problem Formulation • Data Pre-Processing • Reuters Newsgroup Dataset • Enron Email dataset • Feature Selection & Extraction • Classification Techniques • Experimental Results • Tool Results • Conclusion

  3. Introduction & Motivation • Text is the most prevalent medium on the Internet • Applications • Twitter • Craigslist • Facebook • Statistics • 2008 → 33.1% increase in online crime • October 2009 → 1.69 billion Internet users • Motivations • Anonymity • Faking gender • “MySpace mom” • Email • Blogs • Chat rooms

  4. Introduction & Motivation cont. • Question • “Given a short text document, can we identify if the author is a man or a woman?”

  5. Authorship Attribution • Features • Stylistic tendency → Stylometric analysis • Over 1000 features • Author’s state of mind • Statistical methods • Word-length Distribution • Bayesian Classifier • Principal Component Analysis • Cluster Analysis • Machine Learning • Decision Tree • Neural Networks • Support Vector Machine (SVM)

  6. Authorship Attribution cont. • Different problem • Abstraction • Length of messages • Special linguistic elements (emoticons) • Time constraints

  7. Detecting Genders • Socially-constructed gender • Fundamental questions • “Do men & women inherently use different classes of language styles?” → “What are reliable linguistic features that indicate gender?” • Robin Lakoff (1975) • Lexical, syntactic & pragmatic features • Specialized vocabulary, expletives, etc.

  8. Detecting Genders cont. • Mary Talbot (1998) • Influence of social divisions • Mulac et al. (1990), Mulac & Lundell (1994) • Students’ impromptu essays • Descriptions of photographs • Dyadic interactions between strangers • Written communication & face-to-face interaction

  9. Contribution(s) • Little prior work on gender identification (GI) • Propose • Robust classifier • Based on content-free text messages • Internet text messages • Feature types • Design • Set of measures • Classifiers & parameter optimization

  10. Problem Formulation • Binary problem • Class1 if author of e is male • Class2 if author of e is female • Set of features • Constant for same gender • d-dimensional vector

  11. Problem Formulation cont.(1) • Classifier • Learn a classifier y = f(x) from a set of training examples D = {(x1,y1), (x2,y2), … , (xN,yN)} • Let X = {xi, i = 1,2, … , N}, where xi is a d-dimensional feature vector • Let Y = {yi, i = 1,2, … , N}, where yi ∈ {+1,−1}, indicating class1 (−1) or class2 (+1)
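The formulation above can be sketched as a toy data structure (the feature values and dimensionality here are illustrative, not from the study):

```python
# Binary gender classification: each example is a d-dimensional
# feature vector x_i with a label y_i in {+1, -1}
# (class1 = male -> -1, class2 = female -> +1).

# Toy training set D = {(x_i, y_i)}: 3 examples, d = 2 features.
D = [
    ([0.2, 1.5], -1),   # male-authored document
    ([0.9, 0.3], +1),   # female-authored document
    ([0.1, 1.1], -1),
]

X = [x for x, y in D]   # feature vectors
Y = [y for x, y in D]   # labels

d = len(X[0])           # dimensionality of the feature space
```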

  12. Problem Formulation cont.(2)

  13. Dataset Pre-processing • Two extremes • Newsgroup messages • Reuters newsgroup dataset • Private Emails • Enron email dataset

  14. Dataset Pre-processing cont.(1) • Reuters newsgroup dataset • Stories by Reuters journalists, 1996 – 1997 • A few hundred to a thousand words each • Discard gender-neutral names • Remove unnecessary info & XML formatting • Limit quotes to at most 0.002 quote characters per character • Keep stories of >200 and <1000 words
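The two filtering rules above (length window and quote density) can be sketched as a single predicate; the exact quote-counting rule is not spelled out on the slide, so counting `"` characters is an assumption:

```python
def keep_reuters_story(text):
    """Apply the slide's Reuters filtering rules:
    keep only stories of >200 and <1000 words whose quote
    density is at most 0.002 quote characters per character."""
    n_words = len(text.split())
    if not (200 < n_words < 1000):
        return False
    quote_density = text.count('"') / len(text)
    return quote_density <= 0.002
```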

  15. Dataset Pre-processing cont.(2) • Enron email dataset • Emails made public by the Federal Energy Regulatory Commission • Integrity problems → some emails removed • Invalid emails • Final set • 517,431 emails • 150 users, 3.5 years • Plain text, no attachments • Removed headers & reply texts • Removed duplicated emails • Removed ultra-short emails • Kept emails of >50 and <100 words

  16. Feature Set Selection • Question • “What are good linguistic features that indicate gender?” • Human psychology & extensive experimentation • Character-based • Word-based • Syntactic • Structure-based • Function words • Total of 545 features

  17. Feature Set Selection cont.(1) • Character-based features • 29 Stylometric features • Widely adopted in Authorship attribution • Examples • Number of white space characters • Number of special characters
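A few of the character-based counts can be sketched directly; this is an illustrative subset of the 29 features, not the study's exact list:

```python
import string

def char_features(text):
    """A small illustrative subset of character-based
    stylometric features (whitespace, digits, special
    characters, uppercase letters)."""
    return {
        "n_chars": len(text),
        "n_whitespace": sum(c.isspace() for c in text),
        "n_digits": sum(c.isdigit() for c in text),
        "n_special": sum(c in string.punctuation for c in text),
        "n_upper": sum(c.isupper() for c in text),
    }
```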

  18. Feature Set Selection cont.(2) • Word-based features • 33 statistical metrics • Vocabulary richness • Yule’s K measure • Entropy measure • 68 psycho-linguistic features • Linguistic Inquiry & Word Count (LIWC) • Individuals benefiting from writing • Positive & negative emotional words • Cognitive words (cause, know) • Switch use of pronouns
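Both vocabulary-richness measures named above are standard; a minimal sketch over a token list (tokenization details are an assumption):

```python
import math
from collections import Counter

def yules_k(tokens):
    """Yule's K vocabulary-richness measure:
    K = 1e4 * (sum_m m^2 * V_m - N) / N^2,
    where V_m is the number of word types occurring exactly
    m times and N is the total number of tokens."""
    N = len(tokens)
    freqs = Counter(tokens)        # type -> occurrence count
    Vm = Counter(freqs.values())   # m -> number of types V_m
    return 1e4 * (sum(m * m * v for m, v in Vm.items()) - N) / (N * N)

def entropy(tokens):
    """Shannon entropy (bits) of the word distribution."""
    N = len(tokens)
    return -sum((c / N) * math.log2(c / N)
                for c in Counter(tokens).values())
```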

  19. Feature Set Selection cont.(3) • Syntactic features • Sentence level • Regular and informal punctuation • Mulac (1998) • Women use more question marks

  20. Feature Set Selection cont.(4) • Structure-based features • Layout • Paragraphs length • Use of greetings • Big influence in online documents

  21. Feature Set Selection cont.(5) • Function words • Ambiguous meaning • Grammatical relationships • Different set from word-based features • Important role • 9 gender-linked features • Women use emotionally intensive & affective adjectives • Men express ‘independence’ → first-person singular pronouns

  22. Automatic Extraction • Normalization
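The slide does not specify the normalization scheme; a common choice is to length-normalize raw counts so documents of different sizes are comparable, sketched here as an assumption:

```python
def normalize(features, text_len):
    """Length-normalize raw feature counts by document length
    (one common scheme; the slides do not name the exact one)."""
    return {name: count / text_len for name, count in features.items()}
```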

  23. Classification Techniques • Three classifiers • Bayesian-based logistic regression • AdaBoost Decision tree • Support Vector Machine (SVM)

  24. Classification Techniques cont.(1) • Bayesian-based logistic regression • Probability • Threshold set to 0.5
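The probability model with its 0.5 threshold can be sketched as a plain logistic prediction (the parameter vector `beta` is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(beta, x, threshold=0.5):
    """Logistic regression: model P(class2 | x) = sigmoid(beta . x)
    and predict class2 (+1) when the probability exceeds 0.5."""
    p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
    return +1 if p > threshold else -1
```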

  25. Classification Techniques cont.(2) • Avoid overfitting via a prior on the parameters • Assume each parameter follows a Normal distribution • Mean = 0, variance τ • Assume the variance follows an exponential distribution • Integrating it out transforms the prior into a Laplace distribution

  26. Classification Techniques cont.(3) • Assume the components of β are independent • Overall prior of β = product of the component priors • Posterior density of β given the dataset D

  27. Classification Techniques cont.(4) • Use the log posterior l(β) • Minimize −l(β), a convex function • Suitable for optimization
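With the Laplace prior, the negative log posterior −l(β) is the L1-regularized logistic loss; a minimal sketch (λ is an assumed hyperparameter name, not from the slides):

```python
import math

def neg_log_posterior(beta, X, Y, lam=1.0):
    """-l(beta): logistic log-loss plus the L1 penalty induced
    by the Laplace prior; convex, hence easy to optimize."""
    loss = sum(
        math.log(1.0 + math.exp(-y * sum(b * xi for b, xi in zip(beta, x))))
        for x, y in zip(X, Y)
    )
    return loss + lam * sum(abs(b) for b in beta)
```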

  28. Classification Techniques cont.(5) • Decision Tree • Flowchart-like tree structure • Attribute → Internal node • Outcome → Branch • Class → Terminal node • High variance → Overfitting • AdaBoost • Solid theoretical background • Simple • Accurate predictions • Proven to be successful

  29. Classification Techniques cont.(6) • Assign equal weights to all training examples • Weights follow distribution Dt at the t-th round • Generate weak learner ht : X → Y • Test ht, compute the new weight distribution Dt+1 • Repeat T times
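The reweighting step above can be sketched as a single AdaBoost round, taking the weak learner's predictions as input:

```python
import math

def adaboost_round(weights, preds, labels):
    """One AdaBoost round: from the weak learner h_t's
    predictions, compute its weighted error eps_t, its vote
    alpha_t, and the renormalized distribution D_{t+1}.
    Misclassified examples are up-weighted."""
    eps = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    new = [w * math.exp(-alpha * p * y)
           for w, p, y in zip(weights, preds, labels)]
    Z = sum(new)                      # normalization constant
    return alpha, [w / Z for w in new]
```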

  30. Classification Techniques cont.(7) • Support Vector Machine • Linearly separable classes • Optimal separating hyperplane • Linearly inseparable classes

  31. Classification Techniques cont.(8) • Non-linear problem • Use Kernel trick • Linear • Polynomial • Radial basis
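The three kernels listed can be sketched directly (`degree`, `c` and `gamma` are illustrative defaults, not values from the study):

```python
import math

def linear(x, z):
    """Linear kernel: plain dot product."""
    return sum(a * b for a, b in zip(x, z))

def polynomial(x, z, degree=2, c=1.0):
    """Polynomial kernel: (x . z + c) ** degree."""
    return (linear(x, z) + c) ** degree

def rbf(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))
```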

  32. Experimental Results • Feature extraction → Python • Classifiers → MATLAB • Each experiment repeated 10 times

  33. Experimental Results cont.(1) • SVM outperforms (76.75% & 82.23%) • Sharp improvements in AdaBoost • Small changes in Bayesian Logistic Regression

  34. Experimental Results cont.(2) • Impact of parameters • >50 words • >100 words • >200 words

  35. Experimental Results cont.(3) • Significance of feature sets • >100 words • One feature at a time

  36. Experimental Results cont.(4) • Optimization • 5% feature size reduction → 157 out of 545 features retained • Extraction time reduced from 3.77 to 1.35 seconds • 3.03% drop in accuracy

  37. Tool Results • male 64.46% • male 75.83% • male 59.89% • neutral 96.98% ?? • male 58.31% • male 72.60% • male 63.30% • male 57.57% • male 73.89% • male 59.07% • Actual Results: 5 male out of 10

  38. Conclusion • Differences do exist between genders • SVM outperforms the other classifiers • Most significant features • Word-based features • Function words • Structural features • Increasing the data set → better accuracy
