This presentation addresses identifying the author of an anonymously written document from a set of candidate authors. Related work includes Support Vector Machines, document prototypes, and the numerical method of fractional counts. The approach counts features such as word frequency and grapheme frequency, stores them as histograms, and compares each candidate author's histograms with those of the unknown document. Two comparison methods are tested: the Chi-Squared metric and a Difference Formula. Applications include historical scholarship and investigative forensic identification.
Authorship Attribution Erik Goldman & Abel Allison
Problem Definition: Identification of the author of an anonymously written document given a set of candidate authors. Applications: • Historical Scholarship • Investigative Forensic Identification • Example: Fake Steve Jobs
Related Work • Support Vector Machine methods [Diederich et al. (2003)] • Document prototypes: interesting documents, or salient parts extracted from texts, matched against a document database [Visa et al. (2001)] • Numerical method of fractional counts [Burrel and Rousseau (1995)]
Approach • For each work in the training set, count the feature data (more on features on the next slide) and store the counts as histograms. • Take the unknown document as input and make the same counts. • Compare the histograms of each author with those of the unknown document. Each feature contributes a weighted vote. • Choose the author with the highest comparison score.
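The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the presenters' implementation: the two features, their weights, the `similarity` function, and all function names are hypothetical, and the sketch pools each author's works into one histogram per feature rather than keeping one per work.

```python
from collections import Counter

def _normalize(counts):
    """Turn raw counts into relative frequencies (empty input gives an empty histogram)."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}

def char_histogram(text):
    """Hypothetical feature: relative frequency of each non-whitespace character."""
    return _normalize(Counter(c for c in text.lower() if not c.isspace()))

def word_histogram(text):
    """Hypothetical feature: relative frequency of each whitespace-separated word."""
    return _normalize(Counter(text.lower().split()))

def similarity(h1, h2):
    """Placeholder comparison (assumption): 1 minus half the L1 distance.

    Returns 1.0 for identical histograms and 0.0 for disjoint ones.
    """
    keys = set(h1) | set(h2)
    return 1.0 - 0.5 * sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)

# Feature name -> (extractor, vote weight). The weights are arbitrary placeholders.
FEATURES = {
    "chars": (char_histogram, 1.0),
    "words": (word_histogram, 2.0),
}

def train(corpus):
    """corpus maps author -> list of texts; returns author -> {feature: histogram}."""
    return {
        author: {name: fn(" ".join(texts)) for name, (fn, _) in FEATURES.items()}
        for author, texts in corpus.items()
    }

def attribute(model, unknown_text):
    """Score every candidate author and return (best_author, all_scores)."""
    unknown = {name: fn(unknown_text) for name, (fn, _) in FEATURES.items()}
    scores = {
        author: sum(
            weight * similarity(hists[name], unknown[name])
            for name, (_, weight) in FEATURES.items()
        )
        for author, hists in model.items()
    }
    return max(scores, key=scores.get), scores
```

A call such as `attribute(train({"A": [...], "B": [...]}), text)` returns the highest-scoring candidate together with the per-author scores, so the weighted votes can be inspected.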
Metrics • Limit Word Frequency: words frequently used by the author across multiple works. • Grapheme Frequency: counts of alphanumeric and symbol characters. • Part-of-speech Bigram Frequency • Preterminal Tag Bigram Model
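As a rough illustration of how two of these counts might be collected, the sketch below tallies graphemes and tag bigrams with `collections.Counter`. The function names are hypothetical. It assumes that a grapheme is a single non-whitespace character, that case is preserved, and that the tag sequence comes from some external part-of-speech tagger that is not shown.

```python
from collections import Counter

def grapheme_counts(text):
    """Count every non-whitespace character: letters, digits, punctuation and symbols."""
    return Counter(ch for ch in text if not ch.isspace())

def bigram_counts(tags):
    """Count adjacent pairs in a sequence, e.g. a list of part-of-speech tags."""
    return Counter(zip(tags, tags[1:]))

if __name__ == "__main__":
    print(grapheme_counts("Hello, world!").most_common(3))
    # The tag list is a made-up example, not the output of a real tagger.
    print(bigram_counts(["DT", "NN", "VB", "DT", "NN"]))
```

Both functions return raw counts; dividing by the total would turn them into the relative-frequency histograms used for comparison.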
Histogram Comparisons • Two Methods Used • Chi-Squared Metric • Difference Formula: similar to the Chi-Squared formula, except that it accounts for the sparsity of bigram counts by normalizing them with respect to the average counts.
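The sketch below covers only the Chi-Squared Metric, in one common form for comparing two histograms: the sum over bins of (a - b)² / (a + b). Whether the presenters used exactly this variant is an assumption, and the function name is hypothetical. The sketch does not attempt the Difference Formula, whose definition is not given in the text.

```python
def chi_squared_distance(h1, h2):
    """Chi-squared distance between two histograms stored as dicts.

    Bins missing from one histogram are treated as zero. Bins that are
    zero in both histograms are skipped to avoid dividing by zero.
    """
    total = 0.0
    for key in set(h1) | set(h2):
        a = h1.get(key, 0.0)
        b = h2.get(key, 0.0)
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total
```

This is a distance, so smaller values mean more similar histograms. To use it where the highest comparison score wins, one option is to negate it before voting.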
Tests • Used the power set of our set of authors. • For each element in the power set, we ran our tests using each of the authors as the unknown and recorded the results.
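One way to read this test procedure is sketched below. The names `power_set`, `run_trials`, and the `classify` callback are hypothetical. Skipping subsets with fewer than two candidates is an assumption, since attribution among zero or one candidates is trivial. How the unknown author's text is held out of training is left to the callback.

```python
from itertools import chain, combinations

def power_set(items, min_size=2):
    """Yield every subset of items with at least min_size elements, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(min_size, len(items) + 1)
    )

def run_trials(authors, classify):
    """Run one trial per (candidate subset, unknown author) pair.

    classify(candidates, true_author) is a hypothetical callback that
    returns the predicted author for a held-out text by true_author.
    Returns a list of (candidates, true_author, was_correct) records.
    """
    results = []
    for candidates in power_set(authors):
        for true_author in candidates:
            predicted = classify(candidates, true_author)
            results.append((candidates, true_author, predicted == true_author))
    return results

def accuracy(results):
    """Fraction of trials in which the true author was chosen."""
    return sum(ok for _, _, ok in results) / len(results) if results else 0.0
```

Grouping the records by subset size would show how accuracy changes as the number of candidate authors grows.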