Introduction to Automatic Email Classification

Shih-Wen (George) Ke, 7th Dec 2005

Overview • Introduction to Enron Corpus • Traditional Text Classification vs Email Classification • Recent Work on Enron Corpus • Our Work on Enron Corpus • Summary • Future Research Directions in Information Retrieval • Further Discussion

Overview • The nature of email classification is very different from that of traditional text classification tasks. • Email is time-dependent, poorly structured and informally written. • No standard ways of preparing and evaluating email datasets have been proposed.

Introduction • Automatic Email Classification dates back to the mid-1990s • Email Classification received little attention until recently because no standard email dataset was available • The Enron Email Corpus became available in March 2004

Introduction – Enron Corpus • Distributed by William Cohen at Carnegie Mellon University • Consists of 517,431 messages belonging to 150 users at Enron Corporation • Most users use folders to categorise their emails • The upper bound on the number of folders appears to be the log of the number of messages (Klimt & Yang, 2004)

Email Classification: Assumptions • Categorise email into folders – a.k.a. email foldering • Only personal and professional emails are considered here • Assume that users use folders to organise their emails • Other methods of organising emails, e.g. flags or labels, are not considered here, although they may provide more information for Email Classification

Recent Work on Enron Corpus

Our Work on Enron Corpus - Introduction • Users sometimes forget which folders they have created or which folder an email should be filed under • So users tend to create new (duplicate) folders • Newly created folders adversely affect performance (Bekkerman et al., 2004) • Reduce the likelihood of users creating duplicate folders by improving the accuracy of assigning incoming emails to the folders that already exist • Compare state-of-the-art classifiers (kNN, SVM) and our own classifier, PERC, in a simulation of a real-time situation using various parameter settings

Our Work on Enron Corpus - The PERC • The PERC Classifier (PERsonal email Classifier) • Find a centroid c_i for each category C_i • For each test document x: • Find the k nearest neighbouring training documents to x • For each neighbour d_j, add the similarity between x and d_j to the similarity between x and c_i • Sort the similarity scores sim(x, C_i) in descending order • The decision to assign x to C_i can be made using various thresholding strategies

Our Work on Enron Corpus - The PERC • The PERC Classifier (PERsonal email Classifier) • sim(x, C_i) = Σ_{d_j ∈ kNN(x)} [sim(x, d_j) + sim(x, c_i)] · y(d_j, C_i) • where y(d_j, C_i) ∈ {0,1} is the classification for training document d_j with respect to category C_i; sim(x, d_j) is the similarity between test document x and training document d_j; and sim(x, c_i) is the similarity between test document x and the centroid c_i of the category that d_j belongs to.
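The PERC scoring steps can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity over sparse term-weight vectors and mean-vector centroids, neither of which is specified on the slides. The function names (`cosine`, `centroids`, `perc_scores`) and the toy folders are hypothetical, and the final thresholding step is left out.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroids(train):
    """Average the term vectors of each folder's training documents (assumed centroid)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, folder in train:
        counts[folder] += 1
        for term, w in vec.items():
            sums[folder][term] += w
    return {f: {t: w / counts[f] for t, w in s.items()} for f, s in sums.items()}

def perc_scores(x, train, k=3):
    """For each of the k nearest neighbours d_j of x, add
    sim(x, d_j) + sim(x, c_i) to the score of the folder C_i that d_j belongs to,
    then return the folders sorted by score in descending order."""
    cents = centroids(train)
    neighbours = sorted(train, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, folder in neighbours:
        scores[folder] += cosine(x, vec) + cosine(x, cents[folder])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy training set: (term-weight vector, folder). Invented for illustration only.
train = [
    ({"meeting": 1.0, "agenda": 1.0}, "meetings"),
    ({"meeting": 1.0, "room": 1.0}, "meetings"),
    ({"gas": 1.0, "price": 1.0}, "trading"),
    ({"gas": 1.0, "contract": 1.0}, "trading"),
]
ranked = perc_scores({"meeting": 1.0, "schedule": 1.0}, train, k=3)
print(ranked)
```

A thresholding strategy would then be applied to `ranked` to decide which folder, if any, the message is assigned to.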

Rationale for the Hybrid Approach • Centroid method overcomes data sparseness: emails tend to be short. • kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against features which are currently dominant.

Our Work on Enron Corpus - Results • SVM settings: SVM1 (c=1, j=1), SVM2 (c=0.01, j=1) • Table: micro-averaged and macro-averaged F1 over all users, with standard deviation, for kNN, SVM and PERC • For macro-averaged evaluation, PERC significantly outperformed kNN (t=2.786, p=0.032), SVM1 (t=2.533, p=0.044) and SVM2 (t=5.926, p=0.001)
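The difference between the two averaging schemes can be shown with a short Python sketch. It uses the standard definitions: micro-averaging pools the counts over all folders before computing F1, while macro-averaging computes F1 per folder and then takes the mean, so small folders weigh as much as large ones. The function names and the per-folder counts are invented for illustration and are not results from the experiments.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(counts):
    """counts maps folder name -> (tp, fp, fn). Returns (micro F1, macro F1)."""
    # Macro: average the per-folder F1 scores, each folder weighted equally.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all folders, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro

# Hypothetical counts: one large, well-classified folder and one small, poorly classified one.
counts = {"large_folder": (90, 10, 10), "small_folder": (1, 3, 3)}
micro, macro = micro_macro_f1(counts)
print(micro, macro)
```

With these made-up counts the micro-averaged score stays close to the large folder's F1, while the macro-averaged score is pulled down by the small folder. This is why macro-averaging is the more sensitive measure of performance on small folders.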

Our Work on Enron Corpus - Conclusions • PERC has the highest accuracy when assigning test documents to small folders • kNN and PERC performed better with smaller k • SVM parameters can be sensitive to the number of training documents available • Further parameter settings and training/test set splits will be investigated • Use of time will be investigated • A questionnaire-based study is being conducted to find out how real users manage their email

Future Research Directions in IR • Use of time information • Training/test set splits • Feature extraction and selection • Document representation • Qualitative evaluation • Thread detection, Topic Detection and Tracking (TDT) for email • Mining sequential patterns • Bursts of activity (Kleinberg, 2002)

References • Bekkerman, R., McCallum, A. and Huang, G. (2004) Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Technical Report IR-418, CIIR, University of Massachusetts. • Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. • Klimt, B. and Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. In European Conference on Machine Learning.
