
ACM SAC 2010

ACM SAC 2010. E-mail Authorship Verification for Forensic Investigation. Farkhund Iqbal, Concordia University, Canada, Iqbal_f@ciise.concordia.ca. Liaquat A. Khan, NUST, Pakistan, liaquatalikhan@gmail.com. Mourad Debbabi, Concordia University, Canada, debbabi@ciise.concordia.ca.



  1. ACM SAC 2010
     E-mail Authorship Verification for Forensic Investigation
     Farkhund Iqbal, Concordia University, Canada, Iqbal_f@ciise.concordia.ca
     Liaquat A. Khan, NUST, Pakistan, liaquatalikhan@gmail.com
     Mourad Debbabi, Concordia University, Canada, debbabi@ciise.concordia.ca
     Benjamin C. M. Fung, Concordia University, Canada, fung@ciise.concordia.ca

  2. Agenda
     • Introduction
     • Motivation
     • Problem Definition
     • Related Work
     • Proposed Approach
     • Experimental Results
     • Conclusion

  3. Motivation
     [Figure: Fingerprint, Writeprint, Written works]
     • From fingerprint to wordprint/writeprint
       • Style markers and structural traits, patterns of vocabulary usage, common grammatical and spelling mistakes
       • The approach is used in a number of courts in the US, Australia, England (Court of Criminal Appeal), Ireland (Central Criminal Court), and Northern Ireland [H. Chen 2003].
     • Authorship analysis
       • Attribution or identification
       • Verification or similarity detection
       • Characterization or profiling

  4. Authorship Analysis
     • Application domain
       • Historic authorial disputes
       • Plagiarism detection
       • Legacy code
       • Cyberforensic investigation

  5. Motivation
     • Anonymity abuse cybercrimes
       • Identity theft and masquerade
       • Phishing and spamming
       • Child pornography
       • Drug trafficking
       • Terrorism
       • Infrastructure crimes: denial-of-service attacks
     Forensic analysis of e-mails, with a focus on authorship analysis for collecting evidence to prosecute criminals in a court of law, is one way to reduce cybercrimes [Teng 2004].

  6. Online Document
     • Content characteristics
       • Short in size and limited in vocabulary
       • Informal and interactive communication
       • Spelling and grammatical errors
       • Symbolic language and paralanguage
     • Large candidate set, more sample work
     • Additional information: time stamp, path, attachment, structural features

  7. Problem Definition
     To verify whether suspect S is or is not the author of a given malicious e-mail µ
     • Assumption #1: The investigator has access to previously written e-mails of suspect S
     • Assumption #2: The investigator has access to e-mails {E1,…,En}, collected from a sample population U = {u1,…,un}
     • The task is
       • to extract stylometric features and develop two models: a suspect model and a cohort/universal background model (UBM)
       • to classify e-mail µ using the two models
     [Figure: Sample population, Suspect S, Anonymous e-mail µ, Verified?]

  8. Related Work
     • Similarity detection [Abbasi and Chen 2008]
       • Application to detect abuse of reputation systems in online marketplaces (ensemble SVM)
     • Similarity detection for plagiarism detection [Van Halteren 2004]
     • Two-class classification problem [Koppel et al. 2007]
       • Application to authorial disputes over literary works

  9. Proposed Approach

  10. Feature Extraction
      • Lexical (word/character-based) features
        • Word length, vocabulary richness, digit/caps distribution
      • Syntactic features (style markers)
        • Punctuation and function words (‘of’, ‘anyone’, ‘to’)
      • Structural and layout features
        • Sentence length, paragraph length, presence of a greeting/signature, types of separators between paragraphs
      • Content-specific features
        • Domain-specific keywords, special characters
      • Idiosyncratic features
        • Spelling and grammatical mistakes
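A few of the feature families listed above can be illustrated with a minimal Python sketch. This is not the authors' extractor: the function name `extract_features`, the feature names, the greeting pattern, and the short `FUNCTION_WORDS` list are all assumptions chosen for illustration, and only lexical, syntactic, and structural features are covered.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = ("of", "anyone", "to", "the", "and")

def extract_features(text):
    """Return a small dict of stylometric features for one e-mail body."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(w.lower() for w in words)
    n_words = len(words) or 1   # guard against empty input
    n_chars = len(text) or 1
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    features = {
        # Lexical: word length, vocabulary richness, digit/caps distribution
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        # Syntactic: punctuation usage
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Structural: sentence length and presence of a greeting
        "avg_sentence_length": n_words / (len(sentences) or 1),
        "has_greeting": float(bool(re.match(r"\s*(hi|hello|dear)\b", text, re.I))),
    }
    # Syntactic: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    return features
```

Calling `extract_features(body)` on each e-mail yields one fixed-length feature vector per message, which is the kind of input a classifier or regression function would need.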

  11. Model Development
      • Model type
        • Universal Background Model
        • Cohort Model
      • Verification by classification
      • Verification by regression
      • Training & validation: 10-fold cross-validation
      • Model application
        • Classification score
        • Regression score
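The 10-fold cross-validation step can be sketched in plain Python as follows. The slides ran their experiments in WEKA, so this is only an illustration of the splitting idea; the function name `k_fold_indices` and the `seed` parameter are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i, test_idx in enumerate(folds):
        # Training set is every fold except the held-out one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each of the k rounds trains on roughly nine tenths of the e-mails and scores the held-out tenth, so every message receives a score from a model that never saw it during training.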

  12. Evaluation Metrics
      • Two types of error can occur during evaluation
        • False positive: declaring an innocent person guilty
        • False negative: declaring a guilty person innocent
      • DET (Detection Error Trade-off) curve: plots false positives against false negatives
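As a minimal sketch of how the two error types and the DET points could be computed from verification scores: the convention that a score at or above the threshold means "accept the suspect as author", and the names `error_rates` and `det_points`, are assumptions made for this example.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false_positive_rate, false_negative_rate) at one threshold.

    target_scores:   scores of e-mails the suspect really wrote
    impostor_scores: scores of e-mails written by someone else
    A score >= threshold is treated as "accept" (assumed convention).
    """
    false_pos = sum(s >= threshold for s in impostor_scores)
    false_neg = sum(s < threshold for s in target_scores)
    return false_pos / len(impostor_scores), false_neg / len(target_scores)

def det_points(target_scores, impostor_scores):
    """Use every observed score as a threshold; return (FPR, FNR) pairs."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    return [error_rates(target_scores, impostor_scores, t) for t in thresholds]
```

Plotting the pairs returned by `det_points` traces the trade-off: raising the threshold lowers false positives at the cost of more false negatives.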

  13. Evaluation Metrics
      • Two evaluation metrics borrowed from the speech processing community (NIST SRE)
        • Equal Error Rate (EER): the point on the DET curve where the probability of false alarm equals the probability of false rejection
        • Minimum Detection Cost Function (minDCF): the minimum, over all decision thresholds, of
          0.1 × False Rejection Rate + 0.99 × False Acceptance Rate
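Both metrics can be approximated by sweeping a decision threshold over the observed scores, as in the sketch below. The weights 0.1 and 0.99 are taken from the slide; reading them as the products 10 × 0.01 and 1 × 0.99 (miss cost × target prior, false-alarm cost × non-target prior) is an assumption about their origin. The function names and the "closest point" approximation of the EER are also assumptions.

```python
def sweep(target_scores, impostor_scores):
    """Yield (FRR, FAR) for each candidate threshold (accept if score >= t)."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(float("inf"))  # final threshold rejects everything
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        yield frr, far

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: mean of FRR and FAR where the two are closest."""
    frr, far = min(sweep(target_scores, impostor_scores),
                   key=lambda p: abs(p[0] - p[1]))
    return (frr + far) / 2

def min_dcf(target_scores, impostor_scores, w_frr=0.1, w_far=0.99):
    """Minimum over thresholds of w_frr * FRR + w_far * FAR."""
    return min(w_frr * frr + w_far * far
               for frr, far in sweep(target_scores, impostor_scores))
```

With a finite set of scores the FRR and FAR curves rarely cross exactly, which is why the sketch averages the two rates at their closest point instead of interpolating.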

  14. Experimental Evaluation
      • Classifiers:
        • AdaBoost
        • DMNB
        • Bayes Net
      Classifiers implemented in WEKA [Witten, I.H. and Frank, E. 2005]

  15. Experimental Evaluation
      • Regression functions
        • Linear Regression
        • SVM-SMO Regression
        • SVM with RBF
      Regression functions implemented in WEKA [Witten, I.H. and Frank, E. 2005]

  16. Comparative Study
      Values of EER and minDCF for different functions

  17. Conclusion
      • Application of classifiers and regression functions, and of evaluation metrics (NIST SRE)
      • EER of 17% using real-life e-mails (Enron e-mail corpus)
      • An EER of 17% is not convincing in forensic investigation
        • Corpus issues
        • Stylistic variation is hard to capture

  18. Feature Contributions
      • Lexical features such as vocabulary richness and word length distribution are not very effective on their own.
      • The combination of word-based and syntactic features contributes significantly.
      • Structural features are extremely important in e-mail.
      • Content-specific features are only effective in specific applications.
      • Idiosyncratic features need a comprehensive thesaurus to be maintained.
      • Optimization of the feature space

  19. References
      • J. Burrows. An ocean where each kind: statistical analysis and some major determinants of literary style. Computers and the Humanities 1989;23(4–5):309–21.
      • O. De Vel. Mining e-mail authorship. Paper presented at the Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2000.
      • I. Holmes. The evolution of stylometry in humanities. Literary and Linguistic Computing 1998;13(3):111–7.
      • F. Iqbal, R. Hadjidj, B. C. M. Fung, M. Debbabi. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, Elsevier, 2008.

  20. References
      • B. C. M. Fung, K. Wang, M. Ester. Hierarchical document clustering using frequent itemsets. In: Proceedings of the Third SIAM International Conference on Data Mining (SDM); May 2003. p. 59–70.
      • I. Holmes, R. S. Forsyth. The federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing 1995;10(2):111–27.
      • G.-F. Teng, M.-S. Lai, J.-B. Ma, Y. Li. E-mail authorship mining based on SVM for computer forensic. In: Proc. of the 3rd International Conference on Machine Learning and Cybernetics, Shanghai, China, August 2004.
      • J. Tweedie, R. H. Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 1998;32:323–52.

  21. References
      • G. Yule. On sentence length as a statistical characteristic of style in prose. Biometrika 1938;30:363–90.
      • G. Yule. The statistical study of literary vocabulary. Cambridge, UK: Cambridge University Press; 1944.
      • R. Zheng, J. Li, H. Chen, Z. Huang. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 2006;57(3):378–93.
