JStylo: An Authorship-Attribution Platform and its Applications

• Introduction
  • JStylo is a platform designed to conduct supervised stylometry experiments: authorship attribution using linguistic style.
  • It uses NLP techniques to extract linguistic features from documents, and supervised machine learning methods to classify those documents based on the extracted features.
  • The platform's feature extraction core is based on the JGAAP API [1], and the classifiers available include Weka [2] classifiers and an implementation of the Writeprints [3] classifier.
  • Source: https://github.com/psal/JStylo-Anonymouth

• Motivation
  • Important for research in history, literature and forensics
  • Impact on privacy and anonymity in online environments:
    • Reveal identity: users can use various tools to hide their location, but their writing style may still be exposed. JStylo provides a convenient platform for developing methods to reveal anonymous identities.
    • Preserve anonymity: on the other hand, JStylo can be used for developing and testing methods to secure anonymous communication, like Anonymouth [4].
  • Stylometry research is useful not only for revealing identities, but also author characteristics, like age, gender, native language and personality type.

• Novelty
  • Cumulative feature-set analysis (vs. one feature at a time)
  • Added feature extractors and processing tools
  • Readability / complexity metrics
  • Regular-expression-based features
  • Counters (word / letter / regular expression)
  • High feature-level customizability
  • Factoring and Normalization
  • Uses Weka classifiers
  • Provides implementation of Writeprints

Platform Overview
[Figure: platform overview. Recoverable labels: Problem Definition; Training Documents (Author 1, Author 2, …, Author N); Test Documents (?)]

Applications
• Document Anonymization
  • Using Anonymouth [4]
  • JStylo as an authorship-attribution engine to evaluate anonymization level
[Figure label: My docs]
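The counters, regular-expression-based features and normalization listed under Novelty can be illustrated with a minimal Python sketch. This is not JStylo's API (JStylo is built on the Java-based JGAAP and Weka); the function names `count_features` and `normalize` and the two example patterns are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical regular-expression features, chosen only for illustration.
REGEX_FEATURES = {
    "digits": re.compile(r"\d"),
    "all_caps_words": re.compile(r"\b[A-Z]{2,}\b"),
}

def count_features(text):
    """Return raw word, letter and regular-expression counts for one document."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = {
        "words": len(words),
        "letters": sum(ch.isalpha() for ch in text),
    }
    for name, pattern in REGEX_FEATURES.items():
        counts[name] = len(pattern.findall(text))
    return counts

def normalize(counts, by):
    """Divide every other count by the count named `by` (0.0 if it is zero)."""
    total = counts[by]
    return {
        name: (value / total if total else 0.0)
        for name, value in counts.items()
        if name != by
    }

counts = count_features("I AM 42 years old")
print(counts)   # {'words': 4, 'letters': 11, 'digits': 2, 'all_caps_words': 1}
rates = normalize(counts, by="words")
print(rates)    # {'letters': 2.75, 'digits': 0.5, 'all_caps_words': 0.25}
```

Normalizing by the word count makes documents of different lengths comparable; here the normalized letter count is simply the average word length, a basic complexity measure.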
[Figure: document anonymization with Anonymouth. Recoverable labels: Document to Anonymize; "Blend-in" Corpus; Learn Styles; Check if Anonymized (YES: Document Anonymized; NO: Suggest Changes, Change Document)]
[Figure: platform overview. Recoverable labels: Feature Selection (feature set f1, f2, f3, …, fM); Classifiers Selection (c1, c2, c3, …, cL); Analysis (document pre-processing; feature extraction; feature post-processing with normalization and factoring; train and test; CV results; classification results)]

• Personal Traits Identification: Native Language
  • Using Language-Family Information
  • Classify documents by native language
  • Set a classification-probability threshold T
  • Use language-family reclassification for instances classified with probability p < T to improve language classification
[Figure: language-family reclassification. Recoverable labels: candidate languages; candidate families F1, F2, F3, …, Fi with languages Li1, Li2, Li3, …; P > T: classify language; P < T: classify family, then classify language]

• Evaluation
  • A sample evaluation using the Writeprints feature set with the Weka SMO SVM classifier on the Extended Brennan-Greenstadt Adversarial corpus [5]:
  • 45 authors
  • > 6,500 words per author, divided into ~500-word documents
  • 10-fold cross-validation: [results not preserved]

• Stylometry-Based Authentication
  • An attacker may have user credentials
  • Learn the legitimate user's writing style
  • Record user activity and use stylometry to verify that the user is who s/he claims to be
[Figure: train on the legitimate user's writing; test on text entered with legitimate credentials by a malicious user ("…I AM A MALICIOUS USER, BEWARE…")]

References
[1] Juola, P., et al.: JGAAP, a Java-Based, Modular, Program for Textual Analysis, Text Categorization, and Authorship Attribution (2009)
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The Weka Data Mining Software: An Update (2009)
[3] Abbasi, A., Chen, H.: Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace (2008)
[4] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A., Greenstadt, R.: Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization (2012)
[5] Brennan, M., Greenstadt, R.: Practical Attacks Against Authorship Recognition Techniques (2009)
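The threshold rule from the native-language application (reclassify through the language family when the top probability p falls below T) can be sketched in Python as follows. This is a sketch under stated assumptions, not the poster's implementation: the function name `classify_with_family_fallback` is hypothetical, the family is scored by summing its members' probabilities (the poster does not say how the family is classified), and p equal to T is treated as confident.

```python
def classify_with_family_fallback(language_probs, families, threshold):
    """Pick a native language, falling back to language-family information.

    language_probs: dict mapping language -> probability from a first-pass classifier.
    families: dict mapping family name -> list of member languages.
    threshold: the probability threshold T.
    """
    best_language = max(language_probs, key=language_probs.get)
    if language_probs[best_language] >= threshold:
        # Confident enough (assumption: p == T counts as confident).
        return best_language

    # Low confidence. Assumption: score each family by the summed probability
    # of its member languages, then pick the best language inside that family.
    family_scores = {
        family: sum(language_probs.get(lang, 0.0) for lang in members)
        for family, members in families.items()
    }
    best_family = max(family_scores, key=family_scores.get)
    return max(families[best_family], key=lambda lang: language_probs.get(lang, 0.0))

# Illustrative inputs only; the numbers are made up for the example.
probs = {"Spanish": 0.35, "Russian": 0.33, "Polish": 0.32}
families = {"Romance": ["Spanish", "Italian"], "Slavic": ["Russian", "Polish"]}

print(classify_with_family_fallback(probs, families, threshold=0.5))  # Russian
print(classify_with_family_fallback(probs, families, threshold=0.3))  # Spanish
```

With T = 0.5 the top guess (Spanish, 0.35) is not trusted, the Slavic family wins on combined probability, and the best Slavic language is returned; with T = 0.3 the first-pass answer stands.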