Authorship Attribution

Authorship Attribution Erik Goldman & Abel Allison

Problem Definition
Identify the author of an anonymously written document, given a set of candidate authors.

Applications
• Historical Scholarship
• Investigative Forensic Identification
• Example: Fake Steve Jobs

Related Work
• Support Vector Machine methods [Diederich et al. (2003)]
• Document prototypes (interesting documents, or parts of extracted salient text) matched against a document database [Visa et al. (2001)]
• Numerical method of fractional counts [Burrell and Rousseau (1995)]

Approach
• For each work in the training set, count occurrences of each feature (more on features on the next slide) and store the counts as histograms.
• Make the same counts for the unknown input document.
• Compare each author's histograms with those of the unknown document. Each feature contributes a weighted vote.
• Choose the author with the highest comparison score.
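The voting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `similarity` function (overlap of relative frequencies), the feature names, and the weights are all assumptions chosen for the example.

```python
from collections import Counter


def similarity(hist_a, hist_b):
    """Hypothetical similarity: overlap of relative frequencies, in [0, 1]."""
    total_a = sum(hist_a.values()) or 1
    total_b = sum(hist_b.values()) or 1
    keys = set(hist_a) | set(hist_b)
    return sum(
        min(hist_a.get(k, 0) / total_a, hist_b.get(k, 0) / total_b) for k in keys
    )


def attribute(unknown, authors, weights):
    """Pick the author whose histograms best match the unknown document.

    unknown: {feature_name: Counter}
    authors: {author_name: {feature_name: Counter}}
    weights: {feature_name: float}, the weight of each feature's vote (assumed)
    """
    scores = {}
    for name, features in authors.items():
        scores[name] = sum(
            weight * similarity(features[feature], unknown[feature])
            for feature, weight in weights.items()
        )
    best = max(scores, key=scores.get)
    return best, scores


def toy_features(text):
    """Two toy features for the example: word counts and character counts."""
    lowered = text.lower()
    return {
        "words": Counter(lowered.split()),
        "chars": Counter(ch for ch in lowered if not ch.isspace()),
    }


authors = {
    "A": toy_features("the cat sat on the mat"),
    "B": toy_features("zebra zooms quickly"),
}
unknown = toy_features("the cat on the mat")
best, scores = attribute(unknown, authors, {"words": 0.6, "chars": 0.4})
print(best, scores)
```

Here each feature's vote is its similarity score scaled by a fixed weight. In practice the weights would have to be tuned, and the slides do not say how they were chosen.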

Metrics
• Limit Word Frequency: words frequently used by the author across multiple works.
• Grapheme Frequency: counts of alphanumeric and symbol characters.
• Part-of-speech Bigram Frequency
• Preterminal Tag Bigram Model
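As a rough sketch of how such counts might be collected, the functions below build histograms for words, graphemes, and tag bigrams. The tokenization regex, the `top_n` cutoff, and the assumption that part-of-speech tags arrive as a ready-made list from some external tagger are illustrative choices, not details taken from the slides.

```python
import re
from collections import Counter


def word_frequency(text, top_n=None):
    """Count lowercase word tokens; optionally keep only the top_n most common."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if top_n is not None:
        counts = Counter(dict(counts.most_common(top_n)))
    return counts


def grapheme_frequency(text):
    """Count every non-whitespace character (letters, digits, symbols)."""
    return Counter(ch for ch in text if not ch.isspace())


def bigram_frequency(tags):
    """Count adjacent pairs in a tag sequence, e.g. part-of-speech tags.

    The tags are assumed to come from an external tagger or parser.
    """
    return Counter(zip(tags, tags[1:]))


words = word_frequency("The cat and the hat")
graphemes = grapheme_frequency("a b!a")
tag_bigrams = bigram_frequency(["DT", "NN", "VBD", "DT", "NN"])
print(words, graphemes, tag_bigrams)
```

Each function returns a `Counter`, so the resulting histograms can be compared key by key even when the two documents do not share every word, character, or tag pair.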

Histogram Comparisons
Two methods used:
• Chi-Squared Metric
• Difference Formula: similar to the Chi-Squared formula, except that it accounts for the sparsity of bigram counts by normalizing them with respect to the average counts.
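For illustration, here is a sketch of a chi-squared comparison between two histograms. It covers only the Chi-Squared Metric, not the Difference Formula. The slides do not give the exact form used, so this assumes one common symmetric variant computed on relative frequencies, where a lower value means the histograms are more alike.

```python
from collections import Counter


def chi_squared_distance(hist_a, hist_b):
    """Symmetric chi-squared distance between two count histograms.

    Assumed form: sum over bins of (p - q)^2 / (p + q), where p and q are
    relative frequencies. Returns 0.0 for identical distributions and 2.0
    for distributions with no bins in common.
    """
    total_a = sum(hist_a.values())
    total_b = sum(hist_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must be non-empty")
    distance = 0.0
    for key in set(hist_a) | set(hist_b):
        p = hist_a.get(key, 0) / total_a
        q = hist_b.get(key, 0) / total_b
        if p + q > 0:  # skip bins that are empty in both histograms
            distance += (p - q) ** 2 / (p + q)
    return distance


same = chi_squared_distance(Counter("aabbc"), Counter("aabbc"))
disjoint = chi_squared_distance(Counter("aaa"), Counter("bbb"))
print(same, disjoint)
```

Working with relative frequencies rather than raw counts keeps documents of different lengths comparable. Whether the original project normalized in this way is not stated.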

Tests
• Used the power set of our set of authors.
• For each subset in the power set, we ran our tests using each of the authors in turn as the unknown and recorded the results.
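The test procedure might be organized along these lines. This is a sketch under assumptions: subsets with fewer than two authors are skipped because they make the choice trivial, the unknown is drawn from the authors in the current subset, and `classify` is a hypothetical callback standing in for the attribution step.

```python
from itertools import chain, combinations


def power_set(items, min_size=2):
    """Yield every subset of items with at least min_size elements."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, size) for size in range(min_size, len(items) + 1)
    )


def run_trials(authors, classify):
    """Run one trial per (candidate subset, held-out author) pair.

    classify(candidates, true_author) is a hypothetical callback that returns
    the predicted author name for a held-out document by true_author.
    """
    results = []
    for candidates in power_set(authors):
        for true_author in candidates:
            predicted = classify(candidates, true_author)
            results.append((candidates, true_author, predicted == true_author))
    return results


# Stand-in classifier that is always right, just to exercise the harness.
trials = run_trials(["a", "b", "c"], lambda candidates, true_author: true_author)
accuracy = sum(correct for _, _, correct in trials) / len(trials)
print(len(trials), accuracy)
```

Running over every subset shows how accuracy changes with the number of candidate authors, at the cost of a trial count that grows exponentially with the size of the author set.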

Results
