JStylo: An Authorship-Attribution Platform and its Applications

Introduction
• JStylo is a platform for conducting supervised stylometry experiments: authorship attribution based on linguistic style.
• It uses NLP techniques to extract linguistic features from documents, and supervised machine learning methods to classify those documents based on the extracted features.
• The platform's feature-extraction core is based on the JGAAP API [1]. The available classifiers include Weka [2] classifiers and an implementation of the Writeprints [3] classifier.
• Source: https://github.com/psal/JStylo-Anonymouth

Motivation
• Important for research in history, literature and forensics.
• Impact on privacy and anonymity in online environments:
  • Reveal identity: users can use various tools to hide their location, but their writing style may still be exposed. JStylo provides a convenient platform for developing methods to reveal anonymous identities.
  • Preserve anonymity: on the other hand, JStylo can be used for developing and testing methods to secure anonymous communication, like Anonymouth [4].
• Stylometry research is useful for revealing not only identities but also author characteristics, such as age, gender, native language and personality type.

Novelty
• Cumulative feature-set analysis (vs. one feature at a time)
• Added feature extractors and processing tools
  • Readability / complexity metrics
  • Regular-expression-based features
  • Counters (word / letter / regular expression)
• High feature-level customizability
• Factoring and normalization
• Uses Weka classifiers
• Provides an implementation of Writeprints

Platform Overview
[Figure: platform pipeline. Labels: problem definition (training documents by author, test documents), feature selection (feature set f1 … fM), classifier selection (c1 … cL), document pre-processing, feature extraction, feature post-processing (normalization, factoring), training set, train / test, CV results, classification results, analysis.]

Applications

Document Anonymization
• Using Anonymouth [4]
• JStylo serves as the authorship-attribution engine used to evaluate the anonymization level
[Figure: Anonymouth workflow. Labels: "my docs", "blend-in" corpus, document to anonymize, learn styles, suggest changes, change document, check if anonymized (yes / no), document anonymized.]

Personal Traits Identification: Native Language
• Using language-family information:
  • Classify documents by native language
  • Set a threshold T on the classification probability
  • For instances classified with probability p < T, use language-family reclassification to improve language classification
[Figure: language-family reclassification. Labels: candidate families Fi, candidate languages Lij; p > T: classify language; p < T: classify family, then classify language.]

Evaluation
• A sample evaluation using the Writeprints feature set with the Weka SMO SVM classifier on the Extended Brennan-Greenstadt Adversarial corpus [5]:
  • 45 authors
  • > 6,500 words per author, divided into ~500-word documents
  • 10-fold cross-validation
[Figure: 10-fold cross-validation results.]

Stylometry-Based Authentication
• An attacker may have the user's credentials
• Learn the legitimate user's writing style
• Record user activity and use stylometry to verify that the user is who s/he claims to be
[Figure: train on the legitimate user's writing; test on text entered by a malicious user with legitimate credentials ("…I AM A MALICIOUS USER, BEWARE…").]

References
[1] Juola, P., et al.: JGAAP, a Java-Based, Modular, Program for Textual Analysis, Text Categorization, and Authorship Attribution (2009)
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The Weka Data Mining Software: An Update (2009)
[3] Abbasi, A., Chen, H.: Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace (2008)
[4] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A. and Greenstadt, R.: Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization (2012)
[5] Brennan, M. and Greenstadt, R.: Practical Attacks Against Authorship Recognition Techniques (2009)
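
Sketch: language-family reclassification
The thresholded reclassification described under Personal Traits Identification (classify by native language, then fall back to language-family information when the probability p is below T) might look roughly like the following. This is a minimal sketch under stated assumptions, not JStylo's implementation: the function name, the classifier interface (a callable returning a label and a probability), the stand-in classifiers with hard-coded outputs, and the choice to keep the first-pass label when p equals T are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed interface: a classifier maps a feature vector to (label, probability).
Classifier = Callable[[Sequence[float]], Tuple[str, float]]


def classify_native_language(
    features: Sequence[float],
    language_clf: Classifier,
    family_clf: Classifier,
    family_language_clfs: Dict[str, Classifier],
    threshold: float,
) -> str:
    """Return a native-language label, using the family fallback when p < T."""
    language, p = language_clf(features)
    if p >= threshold:  # assumption: p == T keeps the first-pass label
        return language
    # Low confidence: pick a language family first, then choose among
    # the candidate languages of that family only.
    family, _ = family_clf(features)
    language, _ = family_language_clfs[family](features)
    return language


# Hypothetical stand-in classifiers with hard-coded outputs, for illustration only.
def confident_clf(_features):
    return ("Spanish", 0.90)


def unsure_clf(_features):
    return ("Spanish", 0.40)


def family_clf(_features):
    return ("Slavic", 0.80)


family_language_clfs = {"Slavic": lambda _features: ("Russian", 0.70)}

doc = [1.2, 12.0, 5.78]  # placeholder feature vector
first = classify_native_language(doc, confident_clf, family_clf, family_language_clfs, 0.6)
second = classify_native_language(doc, unsure_clf, family_clf, family_language_clfs, 0.6)
print(first, second)
```

In a real experiment the three stand-ins would be trained models (for example Weka classifiers), and T would be chosen from cross-validation results rather than fixed by hand.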
