Evaluation of a Stylometry System on Various Length Portions of Books. Ida Schulstad, Mark Boga, Cranston Jordan, Kara Pally, Vinnie Monaco, Richard DeStefano, John Stewart, and Charles Tappert.
Stylometry • “Stylometry is the application of the study of linguistic style, usually to written language …” and “… is often used to attribute authorship to anonymous or disputed documents” – Wikipedia
Book Text Experiments • In this study, stylometry was used to verify the identity of authors • Data: 30 authors, with 10 books from each author • System: a previously developed stylometry system, enhanced with additional features • Performance of the stylometry system was measured on these literary texts • In particular, the study measured how much performance improves as text length increases
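To evaluate performance at different text lengths, each book has to be cut into samples of a fixed number of words. The slides do not say how this was done, so the following is only a minimal sketch under stated assumptions: words are separated by whitespace, samples do not overlap, and a trailing partial sample is discarded. The function name `split_into_samples` is hypothetical.

```python
def split_into_samples(text, words_per_sample):
    """Cut a text into non-overlapping samples of a fixed word count.

    Assumptions (not from the slides): whitespace tokenization, and any
    leftover words that do not fill a whole sample are dropped.
    """
    if words_per_sample <= 0:
        raise ValueError("words_per_sample must be positive")
    words = text.split()
    full_samples = len(words) // words_per_sample
    return [
        " ".join(words[i * words_per_sample:(i + 1) * words_per_sample])
        for i in range(full_samples)
    ]


# Toy "book" of 25 words, cut into 10-word samples.
book = " ".join(f"w{i}" for i in range(25))
samples = split_into_samples(book, 10)
print(len(samples))
```

Dropping the partial tail keeps every sample at exactly the stated length, so results for different sample sizes stay comparable.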
Classification System: Cha’s Dichotomy Model • Used in all of our biometric authentication systems • The feature space is transformed into a feature-difference space by calculating vector distances between pairs of samples from the same person (intra-person distances) and between pairs of samples from different people (inter-person distances) • Figure: transformation from (a) feature space to (b) feature-difference space
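The transformation into feature-difference space can be sketched as follows. This is an illustration, not the authors' implementation: it assumes the "vector distance" is the element-wise absolute difference between two feature vectors, and the function name `to_difference_space` and the toy feature values are hypothetical.

```python
from itertools import combinations


def to_difference_space(samples):
    """Map labeled feature vectors to intra- and inter-person differences.

    samples: list of (person_id, feature_vector) pairs.
    Returns (intra, inter), two lists of difference vectors.
    Assumption: the difference is the element-wise absolute difference.
    """
    intra, inter = [], []
    for (person_a, feats_a), (person_b, feats_b) in combinations(samples, 2):
        diff = [abs(x - y) for x, y in zip(feats_a, feats_b)]
        if person_a == person_b:
            intra.append(diff)   # same person: "within" class
        else:
            inter.append(diff)   # different people: "between" class
    return intra, inter


# Toy data: two authors, two samples each, two features per sample.
samples = [
    ("A", [0.10, 0.50]),
    ("A", [0.12, 0.48]),
    ("B", [0.40, 0.20]),
    ("B", [0.43, 0.22]),
]
intra, inter = to_difference_space(samples)
print(len(intra), len(inter))
```

Whatever the number of authors, the classifier only has to separate two classes (same person vs. different people), which is what makes this a dichotomy model.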
Book Text Experiments - #1 • The 30-Author Main Experiment • Training and testing sets each contained 5 books per author • Strong training: the system was trained on the test subjects • EERs for sample sizes of 2K, 5K, and 10K words: 34%, 30%, and 25% • Receiver Operating Characteristic (ROC) curves for 250, 500, 1K, 2K, 5K, and 10K words • The Equal Error Rate (EER) decreases as text length increases
Book Text Experiments - #2 • Strong training on 15 of the authors • Trained on 5 books from each author, tested on the remaining 5 • Performance improved with fewer subjects • EERs: ~20% for 10K, 24% for 5K, and 30% for 2K word samples • Receiver Operating Characteristic (ROC) curves for 2K, 5K, and 10K words
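The EER reported in these experiments is the point on the ROC curve where the false accept rate equals the false reject rate. The sketch below shows one simple way to estimate it from two lists of distances. It assumes a smaller distance means a better match and that a pair is accepted when its distance is at or below a threshold. The function name `equal_error_rate` and the toy distances are hypothetical and are not data from the study.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping a threshold over all observed distances.

    genuine:  distances for same-author pairs (should be accepted).
    impostor: distances for different-author pairs (should be rejected).
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False reject: a genuine pair whose distance exceeds the threshold.
        frr = sum(d > threshold for d in genuine) / len(genuine)
        # False accept: an impostor pair at or below the threshold.
        far = sum(d <= threshold for d in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy distances, for illustration only.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.5, 0.7, 0.9]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```

With finite data the two error rates rarely cross exactly, so the sketch reports the average of FAR and FRR at the threshold where they are closest.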
Equal Error Rate (EER) vs. Text Length in Literary Book Texts from 30 Authors • EER decreases logarithmically as a function of text length
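A logarithmic trend of this kind can be checked with a least-squares fit of EER against the natural log of the text length. The sketch below uses only the three EER values reported for the 30-author experiment (34%, 30%, and 25% at 2K, 5K, and 10K words). The model form EER = a + b·ln(words) and the function name `fit_log` are assumptions for illustration, and three points are far too few to confirm the shape of the curve.

```python
import math


def fit_log(lengths, eers):
    """Least-squares fit of eer = a + b * ln(length). Returns (a, b)."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(eers) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, eers))
    b = sxy / sxx             # slope per unit of ln(words)
    a = mean_y - b * mean_x   # intercept
    return a, b


# EERs (in percent) reported for the 30-author experiment.
lengths = [2000, 5000, 10000]
eers = [34.0, 30.0, 25.0]
a, b = fit_log(lengths, eers)
print(f"EER ~ {a:.1f} + ({b:.2f}) * ln(words)")
```

A negative slope `b` corresponds to the EER falling as samples get longer.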