Authorship Attribution

This presentation addresses identifying the author of an anonymously written document from a set of candidate authors. It reviews related techniques such as Support Vector Machines, document prototypes, and numerical methods, then describes an approach that compares histograms of features such as word frequency and grapheme frequency. Two comparison methods are tested: the Chi-Squared metric and a Difference Formula. Suggested applications include historical scholarship, forensic identification, and counterfeit detection.

Presentation Transcript


  1. Authorship Attribution (Erik Goldman & Abel Allison)

  2. Problem Definition: Identify the author of an anonymously written document, given a set of candidate authors.
     Applications:
     • Historical scholarship
     • Investigative forensic identification
     • Example: Fake Steve Jobs

  3. Related Work
     • Support Vector Machine methods [Diederich et al. (2003)]
     • Document prototypes (interesting documents, or salient extracted parts of texts, matched against a document database) [Visa et al. (2001)]
     • Numerical method of fractional counts [Burrell and Rousseau (1995)]

  4. Approach
     • For each work in the training set, count the feature data (the features are described on the next slide) and store the counts as histograms.
     • Take the unknown document as input and make the same counts.
     • Compare each author's histograms with those of the unknown document. Each feature contributes a weighted vote.
     • Choose the author with the highest comparison score.
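The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the two features, the equal weights in `FEATURES`, and the use of histogram intersection as the comparison score are all assumptions made for the example.

```python
from collections import Counter


def char_histogram(text):
    """Relative frequency of each non-whitespace character."""
    counts = Counter(ch for ch in text.lower() if not ch.isspace())
    total = sum(counts.values()) or 1
    return {key: value / total for key, value in counts.items()}


def word_histogram(text):
    """Relative frequency of each whitespace-separated word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return {key: value / total for key, value in counts.items()}


# Hypothetical feature set: name -> (histogram function, vote weight).
FEATURES = {
    "chars": (char_histogram, 1.0),
    "words": (word_histogram, 1.0),
}


def similarity(hist_a, hist_b):
    """Histogram intersection (assumed stand-in score): 1.0 identical, 0.0 disjoint."""
    keys = set(hist_a) | set(hist_b)
    return sum(min(hist_a.get(k, 0.0), hist_b.get(k, 0.0)) for k in keys)


def attribute(training, unknown):
    """Return (best_author, scores) for training = {author: [work, ...]}."""
    scores = {}
    for author, works in training.items():
        joined = " ".join(works)
        # Each feature contributes a weighted vote to the author's score.
        scores[author] = sum(
            weight * similarity(make_hist(joined), make_hist(unknown))
            for make_hist, weight in FEATURES.values()
        )
    return max(scores, key=scores.get), scores
```

Calling `attribute({"a": [...], "b": [...]}, unknown_text)` returns the highest-scoring candidate together with the per-author scores, which makes it easy to see how close the vote was.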

  5. Metrics
     • Limit Word Frequency: words frequently used by the author across multiple works.
     • Grapheme Frequency: counts of alphanumeric and symbol characters.
     • Part-of-speech Bigram Frequency
     • Preterminal Tag Bigram Model
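As a rough illustration of how counts like these might be collected, the sketch below uses only the standard library. The function names, the `min_works` threshold, and the whitespace tokenization are assumptions for the example. The tag-bigram counter expects a sequence of part-of-speech tags that has already been produced by some tagger, which is not shown here.

```python
from collections import Counter


def grapheme_counts(text):
    """Count every non-whitespace character (letters, digits, symbols)."""
    return Counter(ch for ch in text if not ch.isspace())


def words_across_works(works, min_works=2):
    """Count words, keeping only those that occur in at least min_works works."""
    tokenized = [work.lower().split() for work in works]
    # Number of works each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Total occurrences of each word over all works.
    totals = Counter(word for tokens in tokenized for word in tokens)
    return Counter({w: c for w, c in totals.items() if doc_freq[w] >= min_works})


def tag_bigram_counts(tags):
    """Count adjacent pairs in a sequence of part-of-speech tags."""
    return Counter(zip(tags, tags[1:]))
```

Each function returns a `Counter`, so the results can be used directly as the per-feature histograms described in the approach.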

  6. Histogram Comparisons
     Two methods used:
     • Chi-Squared Metric
     • Difference Formula: similar to the Chi-Squared formula, except that it accounts for the sparsity of bigram counts by normalizing them with respect to the average counts.
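For reference, one common form of the chi-squared distance between two histograms is sketched below. The slide does not state which variant the presenters used, so the symmetric form (p - q)^2 / (p + q) over normalized counts is an assumption. The Difference Formula is not sketched because its definition is not given in the transcript.

```python
def chi_squared_distance(hist_a, hist_b):
    """Symmetric chi-squared distance between two count histograms.

    Counts are normalized to relative frequencies first, so documents of
    different lengths can be compared. Returns 0.0 for identical
    distributions and 2.0 for distributions with no shared keys.
    """
    total_a = sum(hist_a.values()) or 1
    total_b = sum(hist_b.values()) or 1
    distance = 0.0
    for key in set(hist_a) | set(hist_b):
        p = hist_a.get(key, 0) / total_a
        q = hist_b.get(key, 0) / total_b
        if p + q > 0:
            distance += (p - q) ** 2 / (p + q)
    return distance
```

Because this is a distance, a smaller value means a closer match. To use it as a comparison score where the highest value wins, it would need to be inverted or negated.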

  7. Tests
     • We used the power set of our set of authors.
     • For each element in the power set, we ran our tests using each of the authors as the unknown and recorded the results.
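A test harness along these lines could look like the following sketch. The `classify` callback and the `min_size` cutoff (skipping subsets with fewer than two candidates) are hypothetical. The slide does not say how the unknown document was held out of training, so that detail is left to the callback.

```python
from itertools import chain, combinations


def power_set(items):
    """Yield every subset of items as a tuple, from the empty set upward."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )


def run_trials(authors, classify, min_size=2):
    """Run one trial per (candidate subset, true author) pair.

    classify(candidates, true_author) is a hypothetical callback that
    returns the predicted author for a document by true_author.
    Returns a list of (candidates, true_author, was_correct) tuples.
    """
    results = []
    for candidates in power_set(authors):
        if len(candidates) < min_size:
            continue  # attribution needs at least two candidates to choose from
        for true_author in candidates:
            predicted = classify(candidates, true_author)
            results.append((candidates, true_author, predicted == true_author))
    return results
```

With three authors this yields nine trials: two for each of the three pairs and three for the full set. The recorded tuples can then be aggregated by subset size to see how accuracy changes as the candidate pool grows.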

  8. Results
