Authorship Attribution Using Probabilistic Context-Free Grammars

Sindhu Raghavan, Adriana Kovashka, Raymond Mooney
The University of Texas at Austin
Authorship Attribution
• Task of identifying the author of a document
• Applications
  • Forensics (Luyckx and Daelemans, 2008)
  • Cyber crime investigation (Zheng et al., 2009)
  • Automatic plagiarism detection (Stamatatos, 2009)
• The Federalist Papers study (Mosteller and Wallace, 1984)
  • The Federalist Papers are a set of essays written in support of the US Constitution
  • The authorship of these papers was unknown at the time of publication
  • Statistical analysis was used to find the authors of these documents
Existing Approaches
• Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008)
• Character-level n-grams (Peng et al., 2003)
• Syntactic features from parse trees (Baayen et al., 1996)
• Limitations
  • Capture mostly lexical information
  • Do not necessarily capture the author's syntactic style
Our Approach
• Use a probabilistic context-free grammar (PCFG) to capture the syntactic style of the author
• Construct a PCFG from the documents written by the author and use it as a language model for classification
• Requires annotated parse trees of the documents
How do we obtain these annotated parse trees?
Algorithm – Step 1
Treebank each training document using a statistical parser trained on a generic corpus:
• Stanford parser (Klein and Manning, 2003)
• WSJ or Brown corpus from the Penn Treebank (http://www.cis.upenn.edu/~treebank)
[Figure: each author's training documents (Alice, Bob, Mary, John) are parsed into treebanked documents]
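A minimal sketch of this step, assuming NLTK with a Stanford CoreNLP server running locally on port 9000; this is a stand-in for the paper's exact Stanford parser setup, not a reproduction of it:

```python
# Step 1 sketch: treebank each training document with a pretrained
# statistical parser. Assumes a CoreNLP server is running locally;
# the paper used the Stanford parser trained on WSJ/Brown.
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')  # assumed local server

def treebank_document(sentences):
    """Return one parse tree (nltk.Tree) per sentence of a document."""
    trees = []
    for sentence in sentences:
        # raw_parse returns an iterator over parses; keep the best one
        trees.append(next(parser.raw_parse(sentence)))
    return trees
```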
Algorithm – Step 2
Train a PCFG for each author using the treebanked documents from Step 1
[Figure: one grammar per author (Alice, Bob, Mary, John), each listing rule probabilities such as S → NP VP .8, S → VP .2, NP → Det A N .4, NP → NP PP .35, NP → PropN .25, ...]
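A sketch of this training step using NLTK's maximum-likelihood PCFG induction, a reasonable stand-in for the paper's PCFG training; the 'ROOT' start symbol and the `treebanked` dictionary are assumptions carried over from the Step 1 sketch:

```python
# Step 2 sketch: induce one PCFG per author from that author's parse
# trees, using relative-frequency (maximum-likelihood) rule estimates.
from nltk import Nonterminal, induce_pcfg

def train_author_pcfg(trees):
    """Estimate rule probabilities from an author's treebanked documents."""
    productions = [prod for tree in trees for prod in tree.productions()]
    return induce_pcfg(Nonterminal('ROOT'), productions)  # 'ROOT' assumed

# treebanked maps each author name to a list of parse trees (from Step 1)
author_grammars = {author: train_author_pcfg(trees)
                   for author, trees in treebanked.items()}
```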
Algorithm – Step 3
For each author, multiply the probabilities of the top parse of each sentence in the test document under that author's PCFG; the author whose grammar assigns the highest likelihood gives the label for the test document
[Figure: the test document is scored under each author's grammar (Alice .6, Bob .5, Mary .33, John .75); John, the highest-scoring author, is the label for the test document]
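A sketch of the scoring step, summing the log-probabilities of each sentence's best parse (equivalent to multiplying probabilities) and picking the argmax author. Sentences the grammar cannot cover receive a large penalty here; the paper's exact handling of unseen words is not assumed:

```python
# Step 3 sketch: score a test document under each author's PCFG and
# return the author whose grammar assigns the highest likelihood.
from nltk.parse import ViterbiParser

def document_logprob(grammar, tokenized_sentences):
    parser = ViterbiParser(grammar)
    total = 0.0
    for tokens in tokenized_sentences:
        try:
            best = next(parser.parse(tokens))  # most probable parse
            total += best.logprob()            # log P(top parse)
        except (ValueError, StopIteration):
            total += -1e9                      # sentence not covered by grammar
    return total

def classify(test_sentences, author_grammars):
    return max(author_grammars,
               key=lambda a: document_logprob(author_grammars[a], test_sentences))
```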
Data
[Chart: data set statistics; blue bars are news articles, red bars are literary works]
Data sets available at www.cs.utexas.edu/users/sindhu/acl2010
Methodology
• Bag-of-words model (baseline)
  • Naïve Bayes, MaxEnt
• N-gram models (baseline)
  • N = 1, 2, 3
• Basic PCFG model
• PCFG-I (interpolation)
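For reference, a minimal sketch of the bigram baseline using NLTK's language-model package; Laplace smoothing is an assumption here, since the slides do not state the paper's exact smoothing scheme:

```python
# Bigram baseline sketch: one smoothed bigram model per author.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends

def train_bigram_model(tokenized_sentences):
    train, vocab = padded_everygram_pipeline(2, tokenized_sentences)
    model = Laplace(2)  # assumed smoothing; the paper may differ
    model.fit(train, vocab)
    return model

def sentence_logprob(model, tokens):
    padded = list(pad_both_ends(tokens, n=2))
    return sum(model.logscore(w, [c]) for c, w in zip(padded, padded[1:]))
```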
Basic PCFG
• Train the PCFG using only the documents written by the author
• Poor performance when few documents are available for training
  • Could increase the number of documents in the training set
  • But in forensics we do not always have access to many documents written by the same author
• Need for alternative techniques when few documents are available for training
PCFG-I
• Uses the method of interpolation for smoothing
• Augments the training data by adding sections of the WSJ/Brown corpus
• Up-samples the data for the author (see the sketch below)
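A sketch of this smoothing-by-augmentation, reusing the Step 2 induction: the author's productions are replicated k times before being mixed with the generic treebank, so the author's own style still dominates the rule counts. The multiplier k is an assumed knob, not a value from the paper:

```python
# PCFG-I sketch: mix up-sampled author trees with a generic treebank
# (e.g., WSJ/Brown sections) before inducing the grammar.
from nltk import Nonterminal, induce_pcfg

def train_pcfg_i(author_trees, generic_trees, k=5):
    productions = []
    for tree in author_trees:
        productions.extend(tree.productions() * k)  # up-sample author data
    for tree in generic_trees:
        productions.extend(tree.productions())      # generic corpus sections
    return induce_pcfg(Nonterminal('ROOT'), productions)
```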
Performance of Baseline Models
[Chart: accuracy in % per data set for the baseline models]
Inconsistent performance for the baseline models: the same model does not necessarily perform poorly on all data sets
Performance of PCFG and PCFG-I
[Chart: accuracy in % per data set for the basic PCFG and PCFG-I models]
PCFG-I performs better than the basic PCFG model on most data sets
PCFG Models vs. Baseline Models
[Chart: accuracy in % per data set for the PCFG and baseline models]
The best PCFG model outperforms the worst baseline on all data sets, but does not outperform the best baseline on all data sets
PCFG-E
• PCFG models do not always outperform N-gram models
• Lexical features from N-gram models are useful for distinguishing between authors
• PCFG-E (Ensemble) combines:
  • PCFG-I (best PCFG model)
  • Bigram model (best N-gram model)
  • MaxEnt-based bag-of-words (discriminative classifier)
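A sketch of one plausible way to combine the three component models; the `predict` wrappers and the unweighted majority vote are assumptions for illustration, not necessarily the paper's exact combination rule:

```python
# PCFG-E sketch: combine PCFG-I, the bigram model, and a MaxEnt
# bag-of-words classifier by simple majority vote (assumed scheme).
from collections import Counter

def pcfg_e_predict(doc, pcfg_i, bigram, maxent):
    # each component is assumed to expose a predict(doc) -> author wrapper
    votes = [pcfg_i.predict(doc), bigram.predict(doc), maxent.predict(doc)]
    winner, _ = Counter(votes).most_common(1)[0]
    return winner
```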
Performance of PCFG-E
[Chart: accuracy in % per data set for PCFG-E and the baselines]
PCFG-E outperforms or matches the best baseline on all data sets
Significance of PCFG (PCFG-E – PCFG-I)
[Chart: accuracy in % per data set for PCFG-E with and without the PCFG-I component]
Performance drops on most data sets when PCFG-I is removed from PCFG-E
Conclusions
• PCFGs are useful for capturing an author's syntactic style
• Novel approach for authorship attribution using PCFGs
• Both syntactic and lexical information are necessary to capture an author's writing style