ACM SAC 2010

E-mail Authorship Verification for Forensic Investigation
Farkhund Iqbal, Concordia University, Canada, Iqbal_f@ciise.concordia.ca
Liaquat A. Khan, NUST, Pakistan, liaquatalikhan@gmail.com
Mourad Debbabi, Concordia University, Canada, debbabi@ciise.concordia.ca
Benjamin C. M. Fung, Concordia University, Canada, fung@ciise.concordia.ca

Agenda
• Introduction
• Motivation
• Problem Definition
• Related Work
• Proposed Approach
• Experimental Results
• Conclusion

Motivation
[Figure labels: Fingerprint; Writeprint; Written works]
• From Fingerprint to Wordprint/Writeprint
• Style markers and structural traits, patterns of vocabulary usage, common grammatical and spelling mistakes
• The approach is used in a number of courts in the US, Australia, England (Court of Criminal Appeal), Ireland (Central Criminal Court), and Northern Ireland [H. Chen 2003].
• Authorship Analysis
  • Attribution or identification
  • Verification or similarity detection
  • Characterization or profiling

Authorship Analysis
• Application domain
  • Historic authorial disputes
  • Plagiarism detection
  • Legacy code
  • Cyberforensic investigation

Motivation
• Anonymity abuse cybercrimes
• Identity theft and masquerade
• Phishing and spamming
• Child pornography
• Drug trafficking
• Terrorism
• Infrastructure crimes: denial-of-service attacks
Forensic analysis of e-mails, with a focus on authorship analysis, to collect evidence for prosecuting criminals in a court of law is one way to reduce cybercrime [Teng 2004].

Online document
• Content characteristics
• Short in size and limited in vocabulary
• Informal and interactive communication
• Spelling and grammatical errors
• Symbolic and paralanguage
• Large candidate set, more sample work
• Additional information: time stamp, path, attachment, structural features

Problem Definition
To verify whether suspect S is or is not the author of a given malicious e-mail µ
• Assumption #1: The investigator has access to previously written e-mails of suspect S
• Assumption #2: The investigator has access to e-mails {E1, …, En} collected from a sample population U = {u1, …, un}
• The task is
  • to extract stylometric features and develop two models: a suspect model and a cohort/universal background model (UBM)
  • to classify e-mail µ using the two models
[Figure labels: Sample population; Suspect S; Anonymous e-mail µ; Verified?]

Related Work
• Similarity detection [Abbasi and Chen 2008]
  • Application to detect abuse of reputation systems in online marketplaces (Ensemble SVM)
• Similarity detection for plagiarism detection [Van Halteren 2004]
• Two-class classification problem [Koppel et al. 2007]
  • Application to authorial disputes over literary works

Proposed Approach

Feature Extraction
• Lexical (word/character-based) features
  • Word length, vocabulary richness, digit/caps distribution
• Syntactic features (style markers)
  • Punctuation and function words ('of', 'anyone', 'to')
• Structural and layout features
  • Sentence length, paragraph length, presence of a greeting/signature, types of separators between paragraphs
• Content-specific features
  • Domain-specific keywords, special characters
• Idiosyncratic features
  • Spelling and grammatical mistakes
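The feature families on this slide can be illustrated with a minimal Python sketch. It is not the authors' feature set: the function name `extract_features`, the short `FUNCTION_WORDS` list, and the greeting pattern are all hypothetical choices made for illustration, and only a handful of lexical, syntactic, and structural features are covered.

```python
import re
import string

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = ("of", "anyone", "to", "the", "and", "in")

def extract_features(text):
    """Return a dict of simple stylometric features for one e-mail body."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)   # guard against empty input
    n_words = max(len(words), 1)
    features = {
        # Lexical features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(set(lowered)) / n_words,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "caps_ratio": sum(c.isupper() for c in text) / n_chars,
        # Syntactic feature: punctuation usage
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Structural features
        "avg_sentence_length": n_words / max(len(sentences), 1),
        "has_greeting": float(bool(re.match(r"\s*(hi|hello|dear)\b", text, re.I))),
    }
    # Syntactic features: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lowered.count(fw) / n_words
    return features
```

Calling `extract_features("Hi Bob, send 2 files to me. Thanks!")` yields one numeric vector per e-mail, which is the form a classifier or regression function would consume.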

Model Development
• Model type
  • Universal Background Model
  • Cohort Model
• Verification by classification
• Verification by regression
• Training & validation: 10-fold cross validation
• Model application
  • Classification score
  • Regression score
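The combination of a suspect model, a background model, and 10-fold cross validation might look like the sketch below. It is only a stand-in: a nearest-centroid scorer replaces the WEKA classifiers and regression functions used in the experiments, and every function name here is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def score(x, suspect_centroid, background_centroid):
    """Higher score: x is closer to the suspect model than to the background model."""
    d_s = sum((a - b) ** 2 for a, b in zip(x, suspect_centroid))
    d_b = sum((a - b) ** 2 for a, b in zip(x, background_centroid))
    return d_b - d_s

def cross_validated_scores(vectors, labels, k=10, seed=0):
    """Return one held-out score per sample; labels are 1 (suspect) or 0 (background)."""
    scores = [None] * len(vectors)
    for fold in k_fold_indices(len(vectors), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(vectors)) if i not in held_out]
        suspect = [vectors[i] for i in train if labels[i] == 1]
        background = [vectors[i] for i in train if labels[i] == 0]
        if not suspect or not background:
            continue  # a class is missing from the training part; leave scores as None
        c_s, c_b = centroid(suspect), centroid(background)
        for i in fold:
            scores[i] = score(vectors[i], c_s, c_b)
    return scores
```

Each e-mail is scored by models that never saw it during training, so the resulting scores can be thresholded to estimate error rates.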

Evaluation Metrics
• Two types of error can occur during evaluation
  • False Positive: declaring an innocent person guilty
  • False Negative: declaring a guilty person innocent
• DET (Detection Error Trade-off) curve: plots False Positives vs. False Negatives
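As a rough illustration of how DET curve points arise, the sketch below sweeps a decision threshold over verification scores and records the false positive and false negative rates at each step. The function names and the convention that a score at or above the threshold attributes the e-mail to the suspect are assumptions, not details taken from the slides.

```python
def error_rates(suspect_scores, other_scores, threshold):
    """Return (false_positive_rate, false_negative_rate) at one threshold.

    Assumed convention: an e-mail is attributed to the suspect
    when its score is greater than or equal to the threshold.
    """
    fp = sum(s >= threshold for s in other_scores) / len(other_scores)
    fn = sum(s < threshold for s in suspect_scores) / len(suspect_scores)
    return fp, fn

def det_points(suspect_scores, other_scores):
    """Return (FPR, FNR) pairs for every distinct score used as a threshold."""
    thresholds = sorted(set(suspect_scores) | set(other_scores))
    thresholds.append(thresholds[-1] + 1.0)  # above every score: nothing is accepted
    return [error_rates(suspect_scores, other_scores, t) for t in thresholds]
```

Raising the threshold trades false positives for false negatives, which is the trade-off the DET curve visualizes.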

Evaluation Metrics
• Two evaluation metrics borrowed from the speech processing community (NIST SRE)
  • Equal Error Rate (EER): the point on the DET curve where the probability of false alarm equals the probability of false rejection
  • Minimum Detection Cost Function (minDCF): the minimum over all operating points of
    0.1 × False Rejection Rate + 0.99 × False Acceptance Rate
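Both metrics can be approximated from a list of operating points, as in the sketch below. The weights 0.1 and 0.99 come from the slide; the function names, the input format of (false acceptance rate, false rejection rate) pairs, and the choice to approximate the EER by the point where the two rates are closest are assumptions.

```python
def equal_error_rate(points):
    """Approximate the EER from (far, frr) pairs.

    Picks the operating point where the two rates are closest
    and returns their average.
    """
    far, frr = min(points, key=lambda p: abs(p[0] - p[1]))
    return (far + frr) / 2

def min_dcf(points, w_frr=0.1, w_far=0.99):
    """Minimum over operating points of w_frr * FRR + w_far * FAR."""
    return min(w_frr * frr + w_far * far for far, frr in points)

# Hypothetical operating points, ordered from a low to a high threshold.
example_points = [(1.0, 0.0), (0.5, 0.0), (0.25, 0.25), (0.0, 0.5), (0.0, 1.0)]
```

With only a few operating points the EER estimate is coarse; interpolating between neighbouring points would refine it.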

Experimental Evaluation
• Classifiers:
  • AdaBoost
  • DMNB
  • Bayes Net
Classifiers implemented in WEKA [Witten, I.H. and Frank, E. 2005]

Experimental Evaluation
• Regression functions
  • Linear Regression
  • SVM-SMO Regression
  • SVM with RBF
Regression functions implemented in WEKA [Witten, I.H. and Frank, E. 2005]

Comparative study
[Table: values of EER and minDCF for different functions]

Conclusion
• Application of classifiers and regression functions, and of evaluation metrics (NIST SRE)
• EER of 17% using real-life e-mails (Enron e-mail corpus)
• An EER of 17% is not convincing for forensic investigation
• Corpus issues
• Stylistic variation is hard to capture

Feature Contributions
• Lexical features such as vocabulary richness and word length distribution are not very effective on their own.
• The combination of word-based and syntactic features contributes significantly.
• Structural features are extremely important in e-mail.
• Content-specific features are effective only in specific applications.
• Idiosyncratic features need a comprehensive thesaurus to be maintained.
• Optimization of the feature space

References
• J. Burrows. An ocean where each kind: statistical analysis and some major determinants of literary style. Computers and the Humanities 1989;23(4–5):309–21.
• O. De Vel. Mining e-mail authorship. Paper presented at the Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2000.
• B. C. M. Fung, K. Wang, M. Ester. Hierarchical document clustering using frequent itemsets. In Proc. of the Third SIAM International Conference on Data Mining (SDM), May 2003. p. 59–70.
• I. Holmes. The evolution of stylometry in humanities. Literary and Linguistic Computing 1998;13(3):111–7.
• I. Holmes, R. S. Forsyth. The federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing 1995;10(2):111–27.
• F. Iqbal, R. Hadjidj, B. C. M. Fung, M. Debbabi. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 2008. Elsevier.
• G.-F. Teng, M.-S. Lai, J.-B. Ma, Y. Li. E-mail authorship mining based on SVM for computer forensic. In Proc. of the 3rd International Conference on Machine Learning and Cybernetics, Shanghai, China, August 2004.
• J. Tweedie, R. H. Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 1998;32:323–52.
• G. Yule. On sentence length as a statistical characteristic of style in prose. Biometrika 1938;30:363–90.
• G. Yule. The statistical study of literary vocabulary. Cambridge, UK: Cambridge University Press; 1944.
• R. Zheng, J. Li, H. Chen, Z. Huang. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 2006;57(3):378–93.
