Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter
The Ultimate Problem • Let’s skip right to the hardest problem: Given two anonymous short documents, determine if they were written by the same author. • If we can solve this, we can solve pretty much any variation of the attribution problem.
Experimental Setup • Construct pairs <Bi,Ej> by choosing the first 500 words of blog i and the last 500 words of blog j. • Create 1000 such pairs, half of which are same-author pairs (i=j). (In the real world, there are many more different-author pairs than same-author pairs, but let’s keep the bookkeeping simple for now.) Note: no individual author appears in more than one pair. • The task is to label each pair as same-author or different-author.
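A minimal sketch of this pair construction in Python. The blogs dict (author → list of words), the pair count, and the helper name build_pairs are all hypothetical; the slide only fixes the 500-word windows, the 50/50 same/different split, and the no-author-reuse constraint:

import random

def build_pairs(blogs, n_pairs=1000, seed=0):
    """Build <B, E> pairs: B is the first 500 words of one blog,
    E the last 500 words of the same or another blog.
    Half the pairs are same-author; no author is reused across pairs."""
    rng = random.Random(seed)
    # 500 same-author pairs use 1 author each; 500 different-author pairs use 2
    authors = rng.sample(sorted(blogs), n_pairs // 2 * 3)
    pairs = []
    for a in authors[: n_pairs // 2]:  # same-author pairs (i = j)
        pairs.append((" ".join(blogs[a][:500]), " ".join(blogs[a][-500:]), 1))
    rest = authors[n_pairs // 2 :]
    for a1, a2 in zip(rest[::2], rest[1::2]):  # different-author pairs
        pairs.append((" ".join(blogs[a1][:500]), " ".join(blogs[a2][-500:]), 0))
    rng.shuffle(pairs)
    return pairs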
A Simple Unsupervised Baseline Method • Vectorize B and E (e.g., as frequencies of character n-grams) • Compute the cosine similarity of B and E. • If (and only if) it exceeds some threshold, assign the pair <B,E> to same-author. This method yields accuracy of 70.6% (using the optimal threshold).
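A sketch of this baseline with scikit-learn. Character 4-grams and the 0.7 threshold are illustrative assumptions; the slide leaves the n-gram size open and chooses the threshold optimally on the data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def baseline_same_author(doc_b, doc_e, threshold=0.7):
    """Cosine similarity over character n-gram counts, thresholded.
    Cosine is scale-invariant, so raw counts and relative
    frequencies give the same score."""
    vec = CountVectorizer(analyzer="char", ngram_range=(4, 4))
    x = vec.fit_transform([doc_b, doc_e])
    sim = cosine_similarity(x[0], x[1])[0, 0]
    return sim >= threshold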
A Simple Supervised Baseline Method • Suppose that, in addition to the (test) corpus just described, we have a training corpus constructed the same way, but with each pair labeled. • We can do the obvious thing: • Vectorize B and E (e.g., as frequencies of character n-grams) • Compute the difference vector (e.g., with terms |bi - ei| / (bi + ei)) • Train some suitable classifier on the training corpus • With a lot of effort, we get accuracy of 79.8%. But we suspect we can do better, even without (much) use of a labeled training corpus.
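A sketch of the difference-vector step, assuming B and E have already been vectorized to numpy arrays b and e over a shared feature set; LinearSVC is our stand-in for "some suitable classifier", not necessarily the one used on the slide:

import numpy as np
from sklearn.svm import LinearSVC

def diff_vector(b, e, eps=1e-9):
    """Term-wise |b_i - e_i| / (b_i + e_i); eps guards against 0/0."""
    return np.abs(b - e) / (b + e + eps)

# With labeled training pairs (b_k, e_k, y_k), y_k in {0, 1}:
# X = np.stack([diff_vector(b_k, e_k) for b_k, e_k, _ in train_pairs])
# y = np.array([y_k for _, _, y_k in train_pairs])
# clf = LinearSVC().fit(X, y)
# clf.predict(diff_vector(b_test, e_test).reshape(1, -1))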
Exploiting the Many-Authors Method • Given B and E, generate a list of impostors E1,…,En. • Use our algorithm for the many-candidate problem with anonymous text B and candidates {E, E1,…,En}. • If (and only if) E is selected as the author with a sufficiently high score, assign the pair to same-author. • (Optionally, add impostors to B and check if anonymous document E is assigned to author B.)
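A sketch of the scoring step, assuming (as in the earlier many-candidates lecture) that score(B,E) is the fraction of random feature subsets on which E is more similar to B than every impostor; the function names and parameters here are hypothetical:

import random
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def impostor_score(b_vec, e_vec, impostor_vecs, n_iter=100,
                   feature_frac=0.5, seed=0):
    """Fraction of iterations in which E beats all impostors as the
    most B-like candidate, each time on a random half of the features."""
    rng = random.Random(seed)
    n_features = len(b_vec)
    wins = 0
    for _ in range(n_iter):
        idx = rng.sample(range(n_features), int(feature_frac * n_features))
        cands = [e_vec] + list(impostor_vecs)
        sims = [cosine(b_vec[idx], c[idx]) for c in cands]
        if int(np.argmax(sims)) == 0:  # E was the top match this round
            wins += 1
    return wins / n_iter

The pair is then labeled same-author iff this score reaches the threshold k.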
Design Choices There are some obvious questions we need to consider: • How many impostors is optimal? (Fewer impostors means more false positives; more impostors means more false negatives.) • Where should we get the impostors from? (If the impostors are not convincing enough, we'll get too many false positives; if the impostors are too convincing – e.g., drawn from B's genre when E belongs to a different genre – we'll get too many false negatives.)
How Many Impostors? • We generated a random corpus of 25,000 impostor documents (results of Google searches for medium-frequency words in our corpus). • For each pair, we randomly selected N of these documents as impostors and applied our algorithm (using a fixed score threshold k=5%). • Here are the accuracy results (y-axis) for different values of N:
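The sweep itself is a simple loop. This sketch reuses impostor_score() from above and assumes a hypothetical vectorize() helper mapping a text to its feature vector:

def accuracy_for_n(pairs, impostor_pool, n, k=0.05, seed=0):
    """Accuracy when each pair is judged against n random impostors.
    vectorize() is a hypothetical text -> feature-vector helper."""
    rng = random.Random(seed)
    correct = 0
    for b_text, e_text, label in pairs:
        imps = [vectorize(t) for t in rng.sample(impostor_pool, n)]
        same = impostor_score(vectorize(b_text), vectorize(e_text), imps) >= k
        correct += int(same == bool(label))
    return correct / len(pairs)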
Random Impostors (chart: accuracy vs. number of impostors N) • Best result: 83.4% at 625 impostors
Which Impostors? • Now, instead of using random impostors, for each pair <B,E>, we choose the N impostors that have the most “lexical overlap” with B (or E). • The idea is that more convincing impostors should prevent false positives.
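One way to pick the most similar impostors, approximating "lexical overlap" by cosine similarity to E (an assumption; the slides don't pin down the overlap measure), reusing cosine() from above:

def most_similar_impostors(e_vec, impostor_vecs, n):
    """Return the n impostor vectors most similar to E."""
    ranked = sorted(impostor_vecs, key=lambda v: cosine(e_vec, v), reverse=True)
    return ranked[:n]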
Similar Impostors (chart: accuracy vs. N, k=5%) • Best result: 83.8% at 50 impostors • Only 2% false positives
Which Impostors? • It turns out that (for a fixed score threshold k) using similar impostors doesn’t improve accuracy, but it allows us to use fewer impostors. • We can also try to match impostors to the suspect’s genre. • For example, suppose that we know that B and E are drawn from a blog corpus. We can limit impostors to blog posts.
Same-Genre Impostors (chart: accuracy vs. N, k=5%) • Best result: 86.3% at 58 impostors
Impostors Protocol Optimizing on a development corpus, we settle on the following protocol: • From a large blog universe, choose as potential impostors the 250 blogs most similar to E. • Randomly choose 25 actual impostors from among the potential impostors. • Say that <B,E> is a same-author pair if score(B,E) ≥ k, where k is used to trade off precision against recall.
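The full protocol as a sketch, reusing the hypothetical helpers defined above:

def verify_pair(b_text, e_text, blog_universe_vecs, k, seed=0):
    """Protocol: 250 potential impostors most similar to E, 25 actual
    impostors sampled at random, then threshold the score at k."""
    rng = random.Random(seed)
    b_vec, e_vec = vectorize(b_text), vectorize(e_text)
    potential = most_similar_impostors(e_vec, blog_universe_vecs, 250)
    actual = rng.sample(potential, 25)
    return impostor_score(b_vec, e_vec, actual) >= k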
Results Optimizing thresholds on a development corpus, we obtain accuracies as follows:
Conclusions • We can use (almost) unsupervised methods to determine if two short documents are by the same author. This actually works better than a supervised baseline method. • The trick is to see how robustly the two documents can be tied together from among some set of impostors. • The right number of impostors to use depends on the quality of the impostors and the relative cost of false positives vs. false negatives. • We assumed throughout that the prior probability of same-author is 0.5; we have obtained similar results for skewed corpora (just by changing the score threshold).
Open Questions • What if the two documents are in two different genres (e.g., blogs and Facebook statuses)? • What if a text was merely "influenced" by one author but mostly written by another? Can we discern (or maybe quantify) this influence? • Can we use these methods (or related ones) to identify outlier texts in a corpus (e.g., a play attributed to Shakespeare that wasn't really written by Shakespeare)?