Enron Corpus: A New Dataset for Email Classification

Enron Corpus: A New Dataset for Email Classification • By Bryan Klimt and Yiming Yang • CEAS 2004 • Presented by Will Lee

Introduction • Motivation • Related Works • The Enron Corpus • Methods • Evaluation • Thread Information • Conclusion

Motivation • Other corpora focus on newsgroups or personal email data • No common data set exists for evaluating the performance of email classification • Previous research uses different personal data sets • Difficult to find data on actual email use within a company • Companies do not like to share their internal emails • Privacy concerns for people working for the company

Related Works • Other corpora • 20 Newsgroups • http://people.csail.mit.edu/people/jrennie/20Newsgroups/ • Related Papers • Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00) • I. Androutsopoulos et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00) • T. Payne, Learning Email Filtering Rules with Magi (Thesis 1994)

20 Newsgroups • Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups • Sample newsgroups: • comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc. • Originally used in Ken Lang’s paper NewsWeeder: Learning to Filter Netnews (ICML 1995) • Newsgroup data, so probably not very useful for research in personal information management

Enron Dataset • 619,446 messages (200,399 after cleaning) by 158 users • Average 757 messages per user • Shows most users do use folders to organize emails • Can use folder information to evaluate effectiveness for folder classification

Enron Corpus’ Characteristics • Number of messages per user varies from a few messages to more than 10K messages • Upper bound on the number of folders seems to correlate with log(number of messages) • Number of messages does not correlate with the lower bound (a user can have many messages but only a few folders) • Question: how can we use this kind of information?

Email Classification Features • Constructive text • BOW approach, the most used feature • Some fields are more important than others • Stemming and stop word removal used, effectiveness not proven • Categorical text • “to” and “from” fields • BOW, useful for classification, but not as useful as constructive text • Numeric data • Size of message, number of replies, number of words, etc. • Not very useful • Thread information • Indicates how messages relate to each other • Not fully exploited

Email Features (Example) • Slide labels annotating the message: Numeric data, Categorical text, Thread information, Contextual text
From: Mark Hills <mhills@cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafa$f3o$1@dcs-news1.cs.uiuc.edu>
References: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>
Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh
The first lecture was today, during the normally scheduled time.
Mark

Classification Method • Vector space model with SVM • Vector weight w_i is computed using “ltc” (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means: • l: new-tf = ln(tf) + 1.0 • t: new-wt = new-tf * log(num-docs / coll-freq-of-term) • c: divide each new-wt by sqrt(sum of (new-wts squared))
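The three “ltc” steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that coll-freq-of-term means the number of documents containing the term and that the logarithm is natural; `ltc_weights` is a hypothetical helper name.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """Return one {term: weight} dict per document (docs = lists of tokens)."""
    num_docs = len(docs)
    # Assumption: "coll-freq-of-term" = number of documents containing the term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        raw = {}
        for term, tf in Counter(tokens).items():
            new_tf = math.log(tf) + 1.0                                # l
            raw[term] = new_tf * math.log(num_docs / doc_freq[term])   # t
        norm = math.sqrt(sum(w * w for w in raw.values()))             # c
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```

A term that occurs in every document gets weight 0 from the t step, and after the c step each non-empty vector has unit length.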

Classification Method (Cont.) • Sort messages in chronological order, then split into training and test sets • Run SVM on term-weighted vectors of • From • Subject • Body • To, CC • All fields • Linear regression on all fields seems to have the best performance
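A rough sketch of the chronological split and of a linear combination of per-field scores follows. The slides do not give the split ratio or the field weights, so `train_fraction` and the weights passed to `combine_field_scores` are placeholders, and both function names are hypothetical.

```python
from datetime import datetime

def chronological_split(messages, train_fraction=0.5):
    """Sort by date, then train on the earlier part and test on the later part.

    train_fraction is an assumed value; the slides do not state the ratio.
    """
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def combine_field_scores(field_scores, field_weights):
    """Weighted sum of per-field classifier scores (weights are hypothetical)."""
    return sum(field_weights[name] * score for name, score in field_scores.items())

# Toy messages; only the date matters for the split.
sample = [
    {"id": "m3", "date": datetime(2001, 5, 3)},
    {"id": "m1", "date": datetime(2001, 5, 1)},
    {"id": "m4", "date": datetime(2001, 5, 4)},
    {"id": "m2", "date": datetime(2001, 5, 2)},
]
train, test = chronological_split(sample)
```

Splitting by time rather than at random keeps the classifier from training on messages that arrived after the ones it is tested on.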

Clustering Effectiveness

Number of Messages vs. F1 • The number of messages does not directly correlate with accuracy • Question: What about the case where the user has only one folder, which makes classification trivial?

Number of Folders vs. F1 • There is a correlation between the number of folders and the F1 score • Question: Is this trivial as well? • Some elements in the messages are not modeled, since the SVM has more messages to train on
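For reference, F1 is the harmonic mean of precision and recall, F1 = 2·TP / (2·TP + FP + FN). The sketch below computes it per folder, assuming each message belongs to exactly one folder (an assumption, as is the helper name `f1_per_folder`). It also shows why the one-folder case is trivial.

```python
from collections import Counter

def f1_per_folder(true_folders, predicted_folders):
    """Per-folder F1, assuming each message has exactly one true folder."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_folders, predicted_folders):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted folder gains a false positive
            fn[truth] += 1   # true folder gains a false negative
    scores = {}
    for folder in set(true_folders) | set(predicted_folders):
        denom = 2 * tp[folder] + fp[folder] + fn[folder]
        scores[folder] = 2 * tp[folder] / denom if denom else 0.0
    return scores

two_folders = f1_per_folder(["inbox", "inbox", "deals", "deals"],
                            ["inbox", "deals", "deals", "deals"])
one_folder = f1_per_folder(["inbox"] * 3, ["inbox"] * 3)
```

With a single folder every prediction is correct by construction, so F1 is 1.0 regardless of how many messages the user has.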

Thread Information • 200,399 messages, 101,786 threads, 71,696 threads with only one message • 61.63% of the messages in the corpus are in a thread • Average thread size is 4.1 messages • Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder) • Question: It is not clear how threads are detected. How can we use this information?

More Thread • D. Lewis et al., Threading Electronic Mail: A Preliminary Study (1997) • Lewis studied finding the parent message using a BOW, TF-IDF weighted, vector space approach on constructive text • [Formulas on slide: document weight, query weight, similarity]
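The general idea of finding a parent by text similarity can be sketched as below. This is not Lewis' exact weighting, which is not recoverable from the slide. It uses plain tf × smoothed idf and cosine similarity as stand-ins, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """tf * log((N + 1) / df) weights; the +1 smoothing is an assumption."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return [{t: c * math.log((n + 1) / df[t]) for t, c in Counter(tokens).items()}
            for tokens in token_lists]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def guess_parent(reply_index, token_lists):
    """Return the index of the most similar earlier message, or None."""
    vecs = tfidf_vectors(token_lists)
    return max(range(reply_index),
               key=lambda i: cosine(vecs[reply_index], vecs[i]),
               default=None)

messages = [
    ["lunch", "friday", "cafeteria"],
    ["first", "lecture", "course", "page"],
    ["lecture", "was", "today", "course"],
]
parent = guess_parent(2, messages)
```

Here the reply is treated as the query and every earlier message as a candidate document. Messages are assumed to be listed in chronological order.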

More Thread (Cont.) • Lewis’ work assumes that the thread information in the message header is incomplete • This may not be the case • The algorithm by Jamie Zawinski, used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively • http://www.jwz.org/doc/threading.htm • Questions • How can we leverage the thread information in email messages more effectively? • Does this model extend to more recent forms of conversation such as blogs and web forums?
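When the headers are intact, messages can be grouped into threads from Message-ID, References and In-Reply-To alone. The sketch below is a deliberately simplified union-find grouping, not Zawinski's algorithm, which also builds the reply tree and handles missing messages. The dictionary keys and the name `group_threads` are assumptions.

```python
def group_threads(messages):
    """Group message IDs that are linked through References / In-Reply-To."""
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]   # path compression
            x = root[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            root[rb] = ra

    for msg in messages:
        linked = list(msg.get("references", []))
        if msg.get("in_reply_to"):
            linked.append(msg["in_reply_to"])
        find(msg["id"])               # register messages that have no links
        for other in linked:
            union(other, msg["id"])

    threads = {}
    for msg in messages:
        threads.setdefault(find(msg["id"]), []).append(msg["id"])
    return list(threads.values())

threads = group_threads([
    {"id": "a"},
    {"id": "b", "in_reply_to": "a", "references": ["a"]},
    {"id": "c"},
])
```

Referenced IDs that are missing from the collection still link their replies together, so two replies to a lost parent end up in the same thread.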

Conclusion • Pros • Introduces a new corpus that can be useful for evaluating classification performance on a large collection of personal mail • Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company • Cons • Details on how the SVM was run and on the linear weights for the various fields are missing • Not clear how threads are detected
