Author Gender Identification from Text

Author Gender Identification from Text By: El Hebri Khiari 200790830 COE 589 – Digital Forensics Due: Tuesday 25th September 2012

Outline • Introduction & Motivation • Authorship Attribution • Detecting Genders • Contribution(s) • Problem Formulation • Data Pre-Processing • Reuters Newsgroup Dataset • Enron Email dataset • Feature Selection & Extraction • Classification Techniques • Experimental Results • Tool Results • Conclusion

Introduction & Motivation • Text most prevalent on Internet • Applications • Twitter • Craigslist • Facebook • Statistics • 2008 → 33.1% increase in online crime • October 2009 → 1.69 billion Internet users • Motivations [??] • Anonymity • Faking gender • “MySpace mom” • Email • Blogs • Chat rooms

Introduction & Motivation cont. • Question • “Given a short text document, can we identify if the author is a man or a woman?”

Authorship Attribution • Features • Stylistic tendency → Stylometric analysis • Over 1000 features • Author’s state of mind • Statistical methods • Word-length Distribution • Bayesian Classifier • Principal Component Analysis • Cluster Analysis • Machine Learning • Decision Tree • Neural Networks • Support Vector Machine (SVM)

Authorship Attribution cont. • Different problem • Abstraction • Length of messages • Special linguistic elements (emoticons) • Time constraints

Detecting Genders • Socially-constructed Gender • Fundamental questions • “Do men & women inherently use different classes of language styles?” → “What are reliable linguistic features that indicate gender?” • Robin Lakoff (1975) • Lexical, syntactic & pragmatic features • Specialized vocabulary, expletives, etc.

Detecting Genders cont. • Mary Talbot (1998) • Influence of social divisions • Mulac et al. (1990), Mulac & Lundell (1994) • Students’ impromptu essays • Descriptions of photographs • Dyadic interactions between strangers • Written communication & face-to-face interaction

Contribution(s) • Little work on GI [??] • Propose • Robust Classifier • Based on content-free text messages • Internet text messages • Feature types • Design • Set of measures • Classifiers & Parameter optimization

Problem Formulation • Binary problem • Class1 if author of e is male • Class2 if author of e is female • Set of features • Constant for same gender • d-dimensional vector

Problem Formulation cont.(1) • Classifier • Learn a classifier $y = f(x)$ from a set of training examples $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ • Let $X = \{x_i,\ i = 1, 2, \ldots, N\}$, where each $x_i$ is a $d$-dimensional vector • Let $Y = \{y_i,\ i = 1, 2, \ldots, N\}$, where $y_i \in \{+1, -1\}$ indicates class 1 ($-1$) or class 2 ($+1$)
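The formulation above can be sketched in a few lines of Python. This is only an illustration: the feature values, the weights, and the helper name `make_linear_classifier` are made up here, and a linear rule is just one possible form of f.

```python
# Training set D = {(x_i, y_i)}: each x_i is a d-dimensional feature vector,
# each y_i is -1 (class 1, male author) or +1 (class 2, female author).
from typing import Callable, List, Tuple

Vector = List[float]
Example = Tuple[Vector, int]

# Toy data with d = 3; the numbers are invented for illustration only.
D: List[Example] = [
    ([0.12, 0.40, 0.05], -1),
    ([0.30, 0.10, 0.22], +1),
    ([0.08, 0.35, 0.02], -1),
]

X = [x for x, _ in D]  # feature vectors
Y = [y for _, y in D]  # class labels


def make_linear_classifier(w: Vector, b: float) -> Callable[[Vector], int]:
    """Return f(x) = sign(w . x + b), one possible shape for y = f(x)."""
    def f(x: Vector) -> int:
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        return 1 if score >= 0 else -1
    return f


# Hypothetical weights chosen by hand so that the toy data is separated.
f = make_linear_classifier([1.0, -1.0, 1.0], 0.0)
predictions = [f(x) for x in X]
```

In practice the classifier would be learned from D rather than written by hand.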

Problem Formulation cont.(2)

Dataset Pre-processing • Two extremes • Newsgroup messages • Reuters newsgroup dataset • Private Emails • Enron email dataset

Dataset Pre-processing cont.(1) • Reuters newsgroup dataset • Stories by Reuters journalists, 1996 – 1997 • Few Hundred to Thousand words • Discard neutral names • Remove unnecessary info & XML formatting • Limiting quotes, 0.002 per character • >200 and <1000 words
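A minimal sketch of the length and quote filters listed above. It assumes that "0.002 per character" means at most 0.002 quotation marks per character of text, which is one reading of the slide; the function names are illustrative.

```python
QUOTE_CHARS = ('"', '\u201c', '\u201d')  # straight and curly double quotes


def quote_density(text: str) -> float:
    """Quotation marks per character of text (0.0 for empty text)."""
    if not text:
        return 0.0
    quotes = sum(text.count(q) for q in QUOTE_CHARS)
    return quotes / len(text)


def keep_story(text: str,
               min_words: int = 200,
               max_words: int = 1000,
               max_quote_density: float = 0.002) -> bool:
    """Keep stories with >200 and <1000 words and few quoted passages."""
    n_words = len(text.split())
    if not (min_words < n_words < max_words):
        return False
    return quote_density(text) <= max_quote_density
```

Stories dominated by quoted speech are dropped because the quoted words reflect the speaker's style, not the journalist's.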

Dataset Pre-processing cont.(2) • Enron email dataset • Emails made public by Federal Energy Regulatory Commission • Integrity problems → some emails removed • Invalid emails • Final set • 517,431 emails • 150 users, 3.5 years • Plain text, no attachments • Removed headers & reply texts • Removed duplicated emails • Removed ultra-short emails • >50 and <100 words

Feature Set Selection • Question • “What are good linguistic features that indicate gender?” • Human psychology & extensive experimentation • Character-based • Word-based • Syntactic • Structure-based • Function words • Total of 545 features

Feature Set Selection cont.(1) • Character-based features • 29 Stylometric features • Widely adopted in Authorship attribution • Examples • Number of white space characters • Number of special characters
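As an illustration, a few character-based counts can be extracted as below. The slides do not list all 29 features, so the set of "special" characters and the extra digit and uppercase counts are assumptions.

```python
# Assumed set of special characters; the deck does not enumerate them.
SPECIAL = set("@#$%^&*~_+=<>/\\|")


def char_features(text: str) -> dict:
    """Return a handful of character-level counts for one message."""
    return {
        "total_chars": len(text),
        "whitespace": sum(c.isspace() for c in text),
        "special": sum(c in SPECIAL for c in text),
        "digits": sum(c.isdigit() for c in text),
        "uppercase": sum(c.isupper() for c in text),
    }


example = char_features("Hi  there @home #1")
```

Raw counts like these are usually divided by the total character count so that long and short messages are comparable.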

Feature Set Selection cont.(2) • Word-based features • 33 statistical metrics • Vocabulary richness • Yule’s K measure • Entropy measure • 68 psycho-linguistic features • Linguistic Inquiry and Word Count (LIWC) • Individuals benefiting from writing • Positive & negative emotional words • Cognitive words (cause, know) • Switch use of pronouns
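Two of the vocabulary-richness metrics named above can be sketched as follows. The code uses the common definition of Yule's K, K = 10^4 (Σ m² V_m − N) / N², where V_m is the number of word types occurring m times and N is the number of tokens, and Shannon entropy over the word distribution; whether the presented system uses exactly these variants is an assumption.

```python
import math
from collections import Counter
from typing import List


def yules_k(words: List[str]) -> float:
    """Yule's K: higher values mean more repetition (poorer vocabulary)."""
    n = len(words)
    if n == 0:
        return 0.0
    freq = Counter(words)              # word -> occurrences
    spectrum = Counter(freq.values())  # m -> V_m (types seen m times)
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def word_entropy(words: List[str]) -> float:
    """Shannon entropy (bits) of the word frequency distribution."""
    n = len(words)
    if n == 0:
        return 0.0
    freq = Counter(words)
    return -sum((c / n) * math.log2(c / n) for c in freq.values())


tokens = ["a", "b", "a", "c"]
k = yules_k(tokens)       # 1250.0
h = word_entropy(tokens)  # 1.5
```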

Feature Set Selection cont.(3) • Syntactic features • Sentence level • Regular and informal punctuation • Mulac (1998) • Women use more question marks

Feature Set Selection cont.(4) • Structure-based features • Layout • Paragraph length • Use of greetings • Big influence in online documents
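A rough sketch of how layout features of this kind might be computed. The greeting list and the blank-line paragraph rule are assumptions made for the example.

```python
import re

# Hypothetical greeting words; the slides do not list the ones used.
GREETINGS = ("hi", "hello", "dear", "hey")


def structure_features(text: str) -> dict:
    """Paragraph count, mean paragraph length in words, greeting flag."""
    stripped = text.strip()
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", stripped) if p.strip()]
    lengths = [len(p.split()) for p in paragraphs]
    words = stripped.split()
    first_word = words[0].lower().strip(",.!:") if words else ""
    return {
        "num_paragraphs": len(paragraphs),
        "avg_paragraph_words": sum(lengths) / len(lengths) if lengths else 0.0,
        "has_greeting": first_word in GREETINGS,
    }


feats = structure_features("Hi John,\n\nThe report is ready.\n\nThanks")
```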

Feature Set Selection cont.(5) • Function words • Ambiguous meaning • Grammatical relationships • Different set from word-based • Important role • 9 gender-linked features • Women use emotionally-intensive & affective adjectives • Men express ‘independence’ → First-person singular pronouns
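Function-word features are typically relative frequencies over a fixed word list. The sketch below uses a tiny hypothetical list; the actual list behind the 545 features is not given in the slides.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, very short function-word list for illustration.
FUNCTION_WORDS = ("the", "of", "and", "i", "my", "me", "to", "in")


def function_word_freqs(text: str) -> List[float]:
    """Relative frequency of each function word, in list order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / n for w in FUNCTION_WORDS]


freqs = function_word_freqs("I lost my keys and I blame the cat")
```

Because the list is fixed, every message maps to a vector of the same length regardless of its topic.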

Automatic Extraction • Normalization
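The slide does not say which normalization is used. As one common choice, assumed here, each feature column can be rescaled to the range [0, 1] with min-max scaling:

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every feature (column) to [0, 1]; constant columns become 0."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


scaled = min_max_normalize([[1, 10], [3, 10], [2, 10]])
```

Scaling keeps features with large raw ranges, such as character counts, from dominating features that are already ratios.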

Classification Techniques • Three classifiers • Bayesian-based logistic regression • AdaBoost Decision tree • Support Vector Machine (SVM)

Classification Techniques cont.(1) • Bayesian-based logistic regression • Probability • Threshold set to 0.5
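A minimal sketch of the decision rule above: the logistic model outputs a probability, and the 0.5 threshold turns it into a class label. The weights `w` and bias `b` are assumed to have been fitted already, and treating the probability as that of class +1 is an assumption.

```python
import math
from typing import List


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    """P(y = +1 | x) under a logistic model, computed in a stable way."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def predict_label(w: List[float], b: float, x: List[float],
                  threshold: float = 0.5) -> int:
    """Return +1 if the probability reaches the threshold, else -1."""
    return 1 if predict_proba(w, b, x) >= threshold else -1
```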

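Classification Techniques cont.(2) • Avoid overfitting • Assume each model parameter has a Normal prior distribution • Mean = 0, parameter-specific variance • Assume each variance has an exponential prior distribution • Transform into Laplace distribution (by integrating out the variance)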

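Classification Techniques cont.(3) • Assume components of the parameter vector are independent • Overall prior of the parameter vector = product of the component priors • Posterior density of the parameter vector given dataset D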

Classification Techniques cont.(4) • Use log posterior l(·) • Minimize −l(·), a convex function • Suitable for optimization
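The formulas on the Bayesian logistic regression slides did not survive in the transcript. The block below writes out the standard Laplace-prior derivation that the bullet points describe. The symbols β (parameter vector), τ_j (per-parameter variance), γ and λ (hyperparameters) are assumed notation, not taken from the slides, and the labels follow the $y_i \in \{+1, -1\}$ convention from the problem formulation.

```latex
\begin{align}
  % Likelihood for labels y in {+1, -1}
  p(y \mid x, \beta) &= \frac{1}{1 + \exp(-y\, \beta^{T} x)} \\
  % Normal prior on each parameter, exponential prior on its variance
  \beta_j \mid \tau_j &\sim \mathcal{N}(0, \tau_j), \qquad
  p(\tau_j \mid \gamma) = \frac{\gamma}{2} \exp\!\left(-\frac{\gamma}{2}\, \tau_j\right) \\
  % Integrating out the variance yields a Laplace prior
  p(\beta_j \mid \lambda) &= \int_0^{\infty} p(\beta_j \mid \tau_j)\, p(\tau_j \mid \gamma)\, d\tau_j
    = \frac{\lambda}{2} \exp(-\lambda\, |\beta_j|), \qquad \lambda = \sqrt{\gamma} \\
  % Independent components
  p(\beta) &= \prod_{j=1}^{d} p(\beta_j \mid \lambda) \\
  % Log posterior, up to an additive constant
  l(\beta) &= -\sum_{i=1}^{N} \ln\!\left(1 + \exp(-y_i\, \beta^{T} x_i)\right)
    - \lambda \sum_{j=1}^{d} |\beta_j| + \text{const}
\end{align}
```

Both sums in $-l(\beta)$ are convex in β, which is why minimizing it is a well-behaved optimization problem; the $|\beta_j|$ penalty also pushes many parameters to exactly zero.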

Classification Techniques cont.(5) • Decision Tree • Flowchart-like tree structure • Attribute → Internal node • Outcome → Branch • Class → Terminal node • High variance → Overfitting • AdaBoost • Solid theoretical background • Simple • Accurate predictions • Proven to be successful

Classification Techniques cont.(6) • Assign equal weights to all training examples • Weights follow distribution $D_t$ at the $t$-th round • Generate weak learner $h_t : X \to Y$ • Test $h_t$, compute new weight distribution $D_{t+1}$ • Repeat $T$ times
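The boosting loop above can be sketched with one-level decision trees (stumps) as weak learners. This is a generic discrete AdaBoost written for illustration, not the MATLAB implementation used in the experiments; the stump search and all function names are assumptions.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, int, float, int]  # (weighted error, feature, threshold, polarity)


def train_stump(X: List[List[float]], y: List[int], w: List[float]) -> Stump:
    """Pick the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                preds = [pol if x[j] >= thr else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump: Stump, x: List[float]) -> int:
    _, j, thr, pol = stump
    return pol if x[j] >= thr else -pol


def adaboost(X: List[List[float]], y: List[int], T: int = 10):
    """Return a list of (alpha_t, h_t) pairs after at most T rounds."""
    n = len(X)
    w = [1.0 / n] * n                      # D_1: equal weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)       # weak learner h_t under D_t
        err = max(stump[0], 1e-10)         # guard against log(1/0)
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # D_{t+1}: raise the weight of misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x: List[float]) -> int:
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate.
X_toy = [[1.0], [2.0], [3.0]]
y_toy = [1, -1, 1]
model = adaboost(X_toy, y_toy, T=3)
```

On this toy set each stump alone misclassifies at least one point, while the weighted vote of three stumps labels all three correctly.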

Classification Techniques cont.(7) • Support Vector Machine • Linearly separable classes • Optimal • Linearly inseparable

Classification Techniques cont.(8) • Non-linear problem • Use Kernel trick • Linear • Polynomial • Radial basis
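The three kernels listed above, written out as plain functions. The parameter names and default values (`degree`, `coef0`, `gamma`) follow common convention and are assumptions, not the settings used in the experiments.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u: List[float], v: List[float]) -> float:
    """K(u, v) = u . v"""
    return dot(u, v)


def polynomial_kernel(u: List[float], v: List[float],
                      degree: int = 2, coef0: float = 1.0) -> float:
    """K(u, v) = (u . v + coef0) ** degree"""
    return (dot(u, v) + coef0) ** degree


def rbf_kernel(u: List[float], v: List[float], gamma: float = 0.5) -> float:
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

Each kernel returns the inner product of the two inputs in some implicit feature space, so a linear SVM in that space can draw a non-linear boundary in the original one.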

Experimental Results • Feature Extraction → Python • Classifiers → MATLAB • Each experiment → 10 times

Experimental Results cont.(1) • SVM outperforms (76.75% & 82.23%) • Sharp improvements in AdaBoost • Small changes in Bayesian Logistic Regression

Experimental Results cont.(2) • Impact of parameters • >50 words • >100 words • >200 words

Experimental Results cont.(3) • Significance of feature sets • >100 words • One feature at a time

Experimental Results cont.(4) • Optimization • 5% Feature size reduction → 157 out of 545 • Extraction time reduced from 3.77 to 1.35 seconds • 3.03% drop in accuracy

Tool Results • male 64.46% • male 75.83% • male 59.89% • neutral 96.98% ?? • male 58.31% • male 72.60% • male 63.30% • male 57.57% • male 73.89% • male 59.07% • Actual Results: 5 male out of 10

Conclusion • Differences do exist between genders • SVM outperforms • Significant features [??] • Word-based features • Function words • Structural features • Increase data set → better accuracy
