Authorship Verification as a One-Class Classification Problem Moshe Koppel Jonathan Schler
Introduction • Goal • Authorship verification: given examples of the writing of a single author, determine whether a given text was written by that author • Authorship attribution: given examples of the writing of several authors, determine which of them wrote a given anonymous text
Challenge • Negative examples are neither exhaustive nor representative • A single author may consciously vary his/her style from text to text
Authorship Verification • Naïve Approach • Given examples of the writing of author A • Concoct a mishmash of works by other authors • Learn a model for A vs. not-A • Learn a model for A vs. X (a mystery work) • If A and X are easy to distinguish → different author • Otherwise → same author (see the sketch below)
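A minimal sketch of this naïve test in Python, assuming the two texts are available as plain strings; the chunk size, the 250-word vocabulary, the 10-fold setup, and the 90% accuracy threshold are illustrative assumptions rather than the paper's exact settings:

```python
# Naive authorship test: if chunks of A and chunks of X are easy to
# separate under cross-validation, guess "different author".
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def chunk(text, words_per_chunk=500):
    """Split a text into fixed-length chunks of words (size is an assumption)."""
    w = text.split()
    return [" ".join(w[i:i + words_per_chunk])
            for i in range(0, len(w) - words_per_chunk + 1, words_per_chunk)]

def naive_same_author(text_a, text_x, threshold=0.90):
    chunks_a, chunks_x = chunk(text_a), chunk(text_x)
    docs = chunks_a + chunks_x
    y = np.array([0] * len(chunks_a) + [1] * len(chunks_x))
    # 250 most frequent words across both texts as features
    X = CountVectorizer(max_features=250).fit_transform(docs)
    # Assumes each text yields at least 10 chunks for 10-fold CV
    acc = cross_val_score(LinearSVC(), X, y, cv=10).mean()
    # Easy separation suggests different authors; hard separation, same author
    return acc < threshold
```

High cross-validation accuracy means A and X are easy to tell apart, which under the naïve reasoning points to different authors; the paper's point is that this reasoning is too crude on its own.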
Authorship Verification • Unmasking basic idea • A small number of features do most of the work in distinguishing between books • Iteratively remove the most useful features • Gauge the speed with which cross-validation accuracy degrades
Authorship Verification Unmasking The House of the Seven Gables against Hawthorne (actual author), Melville, and Cooper
Experiment • One-class SVM baseline (see the sketch below) • 6 of 20 same-author pairs are correctly classified • 143 of 189 different-author pairs are correctly classified
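For comparison, a one-class baseline along these lines can be sketched with scikit-learn's OneClassSVM; it reuses the chunk helper from the earlier sketch, and the nu value and the majority-vote decision rule are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def one_class_same_author(text_a, text_x, nu=0.1):
    chunks_a, chunks_x = chunk(text_a), chunk(text_x)
    # Vocabulary of the 250 most frequent words in A's chunks
    vec = CountVectorizer(max_features=250).fit(chunks_a)
    Xa, Xx = vec.transform(chunks_a), vec.transform(chunks_x)
    # Train only on the known author's chunks; no negative examples used
    occ = OneClassSVM(kernel="linear", nu=nu).fit(Xa)
    preds = occ.predict(Xx)           # +1 = looks like A, -1 = outlier
    return (preds == 1).mean() > 0.5  # majority vote over X's chunks
```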
Experiment • Using the Unmasking Approach • Choose as the feature set the 250 words with the highest average frequency in A and X • Build the degradation curve (a runnable sketch follows below):

Use 10-fold cross-validation of A against X; for each fold
do 10 iterations {
    Build a model for A against X
    Evaluate the accuracy result
    Append the accuracy to the degradation curve
    Remove the 6 top-contributing features from the data
}
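A runnable sketch of the degradation-curve loop, again reusing the chunk helper above; scoring each iteration with 10-fold cross-validation and dropping the three most positively and three most negatively weighted features are one reading of the slide, not guaranteed to match the paper's exact procedure:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def degradation_curve(text_a, text_x, iterations=10, drop_per_iter=6):
    chunks_a, chunks_x = chunk(text_a), chunk(text_x)
    docs = chunks_a + chunks_x
    y = np.array([0] * len(chunks_a) + [1] * len(chunks_x))
    X = CountVectorizer(max_features=250).fit_transform(docs).toarray()
    active = np.arange(X.shape[1])   # indices of surviving features
    curve = []
    for _ in range(iterations):
        clf = LinearSVC()
        # Record cross-validation accuracy with the current feature set
        curve.append(cross_val_score(clf, X[:, active], y, cv=10).mean())
        # Refit on all data to read off the feature weights
        clf.fit(X[:, active], y)
        w = clf.coef_[0]
        # Remove the most strongly weighted features (3 from each side)
        top_pos = np.argsort(w)[-drop_per_iter // 2:]
        top_neg = np.argsort(w)[:drop_per_iter // 2]
        active = np.delete(active, np.concatenate([top_pos, top_neg]))
    return curve
```

A same-author curve is expected to collapse quickly as the handful of superficially distinguishing words is stripped away, while a different-author curve stays high.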
Experiment Unmasking An Ideal Husband against each of the ten authors
Experiment • Distinguish same-author curves from different-author curves • Represent each degradation curve as a feature vector (a numerical summary of the curve's essential shape) • Same-author indicators: accuracy after 6 elimination rounds < 89%; the 2nd-highest accuracy drop over two iterations > 16% • Test each degradation curve against these criteria (see the sketch below)
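A small sketch of how the two curve features and the thresholds quoted on the slide could be applied to a degradation curve; reading the "drop in two iterations" as the difference between accuracies two rounds apart is an assumption:

```python
def looks_like_same_author(curve):
    """curve: list of CV accuracies per elimination round, e.g. from degradation_curve()."""
    # Feature 1: accuracy remaining after six elimination rounds
    acc_after_6 = curve[6]
    # Feature 2: second-largest drop in accuracy over two consecutive rounds
    drops = sorted(curve[i] - curve[i + 2] for i in range(len(curve) - 2))
    second_highest_drop = drops[-2]
    # Same-author curves collapse quickly; different-author curves stay flat
    return acc_after_6 < 0.89 and second_highest_drop > 0.16
```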
Experiment Results • 19 of 20 same-author pairs are correctly classified • 181 of 189 different-author pairs are correctly classified • Accuracy: 95.7%
Extension • Use negative examples to eliminate some false positives from the unmasking phase • In our case, the elimination method improved accuracy • 189 of 189 different-author pairs are correctly classified • At the cost of a single new misclassified same-author pair
Extension • Elimination (a runnable sketch follows below)

If alternative authors {A1, …, An} exist then {
    build a model M for classifying A vs. all alternative authors
    test each chunk of X with M
    for each alternative author Ai {
        build a model Mi for classifying Ai vs. {A and all other alternative authors}
        test each chunk of X with Mi
    }
    if the number of chunks assigned to some Ai > the number of chunks assigned to A
        then return different-author
}
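A minimal sketch of the elimination step, assuming texts for the candidate author A and the alternative authors are given as strings and reusing the chunk helper above; collapsing the separate models on the slide into a single multi-class LinearSVC is a simplification for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def eliminate(text_a, alt_texts, text_x):
    """Return True if X should be reassigned away from A (different author)."""
    authors = [text_a] + list(alt_texts)
    docs, labels = [], []
    for label, text in enumerate(authors):
        for c in chunk(text):
            docs.append(c)
            labels.append(label)          # label 0 = candidate author A
    vec = CountVectorizer(max_features=250).fit(docs)
    clf = LinearSVC().fit(vec.transform(docs), labels)
    preds = clf.predict(vec.transform(chunk(text_x)))
    counts = np.bincount(preds, minlength=len(authors))
    # If some alternative author claims more of X's chunks than A does,
    # overrule unmasking and return "different author"
    return counts[1:].max() > counts[0]
```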
Actual Literary Mystery • Two 19th-century collections of Hebrew-Aramaic documents • RP includes 509 documents (written by Ben Ish Chai) • TL includes 524 documents (which Ben Ish Chai claims to have found in an archive)
Actual Literary Mystery Unmasking TL against Ben Ish Chai and four impostors
Conclusion • Unmasking – completely ignores negative examples • High accuracy • Unmasking + Elimination (uses a little negative data) • Even better accuracy • More experiments are needed to confirm that these methods also work well for other languages