Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset
Yingjie Zhou, Research Assistant, RPI
Mark Goldberg, Professor, RPI
Malik Magdon-Ismail, Associate Professor, RPI
William A. Wallace, Professor, RPI
Supported by the NSF Grants #0324947, #0323324, #0634875, #0522672, and by the ONR Grant #N00014-06-1-0466

Outline
• Introduction
• Properties of Organizational Emails
• Difficulties in Cleaning Organizational Emails
• Procedures of Cleaning Organizational Emails
• Introduction to Enron Email Dataset
• Application of Cleaning Procedures to Enron Email Dataset
• Results
• Conclusions and Future Work
NAACSOS 2007

Introduction
• Emails
• Organizational emails
  • Inter-organizational emails
  • Intra-organizational emails
• The features of organizational email data make it a promising source for various studies
• Email data has its own problems and is noisy

Properties of Organizational Emails
• Emails are formatted, and the format is usually defined and followed.
• Emails are normally stored on a server and can be easily collected.
• Emails are unobtrusive.
• Emails are time stamped.
In addition,
• The senders and recipients of the emails are employees of the organization.
• Each employee is normally assigned one or more unique email addresses within the organizational domain.

Difficulties in Cleaning Organizational Emails
• Multiple email addresses, names, or IDs exist for the same person.
• Duplicate emails exist.
• The content of the email is difficult to extract.

Procedures of Cleaning Organizational Emails
• Map aliases to employees
• Parse last name, first name, and email ID in headers
[Figure labels: Organizational Email Dataset; Employee 1, Employee 2, ..., Employee N; Raw Formats; Extracted Formats; Generalized Formats]

Procedures of Cleaning Organizational Emails (Cont’d)
• Remove duplicate emails
  • content + date + recipients
• Consolidate date and time
  • Convert to machine time
• Extract email content
  • Signatures
  • Features of parent email message
  • Greetings and names
[Figure labels: Organizational Email Dataset; Remove Duplicates; Generalized Formats; Unique Message Email Dataset; Employee Email Dataset; Date & Time Consolidation; Content Extraction; Cleaned Employee Email Dataset]

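The duplicate-removal and date-consolidation steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the slides only say that duplicates are identified by content + date + recipients and that dates are converted to machine time. The function names, the dictionary keys ("body", "date", "to"), the whitespace normalization, and the use of a SHA-1 hash as the comparison key are all assumptions made for the example.

```python
import hashlib
from email.utils import parsedate_to_datetime
from typing import Dict, List


def to_machine_time(date_header: str) -> int:
    """Convert an RFC 2822 style Date header to seconds since the Unix epoch."""
    return int(parsedate_to_datetime(date_header).timestamp())


def dedup_key(body: str, date_header: str, recipients: List[str]) -> str:
    """Build a key from content + date + recipients (assumed normalization)."""
    normalized_body = " ".join(body.split())  # collapse whitespace differences
    normalized_rcpts = ",".join(sorted(r.strip().lower() for r in recipients))
    raw = f"{normalized_body}\n{to_machine_time(date_header)}\n{normalized_rcpts}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def remove_duplicates(messages: List[Dict]) -> List[Dict]:
    """Keep the first message seen for each content + date + recipients key."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg["body"], msg["date"], msg["to"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Comparing on machine time rather than on the raw Date string means two copies of a message whose dates are written with different UTC offsets still compare as equal.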
Introduction to Enron Email Dataset
• The Federal Energy Regulatory Commission (FERC) posted the Enron email dataset on the web in May 2002
  • 619,446 emails
• Professor Leslie Kaelbling of MIT purchased the dataset
• SRI - integrity and security
• Professor William W. Cohen - CMU dataset
  • 150 user folders
  • 517,431 emails
  • 400 MB

Introduction to Enron Email Dataset (Cont’d)
Parts of an email message:
• Sender
• Receiver/Receivers
• Date + Time
• Subject
• Body
  • Forwarded or replied text
  • Signature
• Attachment

Sample message:
Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
From: eugenio.perez@enron.com
To: sally.beck@enron.com
Subject: Self Evaluation - Short Version
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Eugenio Perez
X-To: Sally Beck
X-cc:
X-bcc:
X-Folder: \Sally_Beck_Nov2001\Notes Folders\All documents
X-Origin: BECK-S
X-FileName: sbeck.nsf

Please let me know if you need anything else.
Regards,
Eugenio

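As a rough illustration of how the parts of a message in this format can be separated, the sketch below uses Python's standard email package on an abridged copy of the sample message. It is an assumption-laden example rather than the tool used in the study: the function name split_fields, the returned dictionary keys, and the exact line breaks of the body are invented for the illustration, and the X-Folder header is omitted from the string for brevity.

```python
from email import message_from_string
from email.utils import getaddresses

# Abridged copy of the sample message (body line breaks are assumed).
RAW_MESSAGE = """\
Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
From: eugenio.perez@enron.com
To: sally.beck@enron.com
Subject: Self Evaluation - Short Version
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Eugenio Perez
X-To: Sally Beck
X-Origin: BECK-S
X-FileName: sbeck.nsf

Please let me know if you need anything else.

Regards,
Eugenio
"""


def split_fields(raw: str) -> dict:
    """Split a raw message into sender, recipients, date, subject, and body."""
    msg = message_from_string(raw)
    # Gather every recipient header that is present (To, Cc, Bcc).
    recipient_headers = (
        msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    )
    return {
        "sender": msg["From"],
        "recipients": [addr for _name, addr in getaddresses(recipient_headers)],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "x_from": msg["X-From"],
        "body": msg.get_payload(),
    }


fields = split_fields(RAW_MESSAGE)
print(fields["sender"], "->", fields["recipients"])
```

Forwarded or replied text and the signature are still part of the returned body here; separating them is a content-extraction step of its own.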
Introduction to Enron Email Dataset (Cont’d)
Header fields: From, To, Cc, Bcc and X-From, X-To, X-cc, X-bcc

Example 1: davis-d\deleted_items\101
From: dana.davis@enron.com
To: dana.davis@enron.com
X-From: Davis, Mark Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MDAVIS>
X-To: Davis, Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavis>

Example 2: cash-m\sent_items\505
From: michelle.cash@enron.com
To: legal <.taylor@enron.com>
X-From: Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>
X-To: Taylor, Mark E (Legal) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mtaylo1>

Slide annotations on these examples: "Wrong!" and "Doesn't make sense!"

Application of Cleaning Procedures to Enron Email Dataset
• phillip k allen
• phillip allen
• allen, phillip
• allen, phillip k.
• phillip k allen <phillip k allen/hou/ect@ect>
• allen, phillip </o=enron/ou=na/cn=recipients/cn=notesaddr/cn=ba4cd662-58db2db2-862564b8-5b412a>
• allen, phillip k. </o=enron/ou=na/cn=recipients/cn=pallen>
• phillip.k.allen@enron.com
• phillip.allen@enron.com
• pallen@enron.com
• pallen70@hotmail.com
• pallen@ect.enron.com
• pallen@hotmail.com

pallen@enron.com
pallen@enron.com
"phillip allen" <pallen@enron.com>
"pallen@enron.com" <pallen@enron.com>
phillip <pallen@enron.com>
phillip allen <pallen@enron.com>
"allen, phillip k" <pallen@enron.com>
<pallen@enron.com>

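One way to map alias forms like these to a single employee is sketched below. It is a hypothetical illustration only: the slides do not describe the matching rules, so the normalization steps, the (first, last) name key, the HYPOTHETICAL_EMPLOYEES table, and the choice of pallen@enron.com as the canonical ID are assumptions made for the example. Matching on first and last name alone can merge distinct people, which is why name disambiguation remains a separate problem.

```python
import re
from typing import Dict, Optional, Tuple


def normalize_alias(raw: str) -> str:
    """Lowercase, drop straight and curly quotes, and collapse whitespace."""
    text = raw.lower().replace("\u201c", '"').replace("\u201d", '"')
    return " ".join(text.replace('"', "").split())


def split_display_and_address(alias: str) -> Tuple[str, str]:
    """Split 'display <address>' into its parts; either part may be empty."""
    match = re.fullmatch(r"(.*?)\s*<([^<>]*)>", alias)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    if "@" in alias and " " not in alias:
        return "", alias  # a bare email address
    return alias, ""      # a bare display name


def name_key(display: str) -> Optional[Tuple[str, str]]:
    """Reduce 'last, first [middle]' or 'first [middle] last' to (first, last)."""
    display = display.replace(".", "")
    if not display or "@" in display:
        return None
    if "," in display:
        last, rest = [part.strip() for part in display.split(",", 1)]
        if not rest:
            return None
        return rest.split()[0], last
    parts = display.split()
    if len(parts) < 2:
        return None
    return parts[0], parts[-1]


# Hypothetical lookup table: canonical ID -> known names and addresses.
HYPOTHETICAL_EMPLOYEES: Dict[str, Dict[str, set]] = {
    "pallen@enron.com": {
        "names": {("phillip", "allen")},
        "addresses": {
            "pallen@enron.com", "phillip.k.allen@enron.com",
            "phillip.allen@enron.com", "pallen@ect.enron.com",
            "pallen70@hotmail.com", "pallen@hotmail.com",
        },
    },
}


def resolve(raw: str, employees: Dict[str, Dict[str, set]]) -> Optional[str]:
    """Return the canonical ID for an alias, or None if nothing matches."""
    display, address = split_display_and_address(normalize_alias(raw))
    key = name_key(display)
    for canonical, known in employees.items():
        if address in known["addresses"]:
            return canonical
        if key is not None and key in known["names"]:
            return canonical
    return None
```

For example, resolve("allen, phillip k.", HYPOTHETICAL_EMPLOYEES) matches through the name key, while resolve("phillip <pallen@enron.com>", HYPOTHETICAL_EMPLOYEES) matches through the address.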
Application of Cleaning Procedures to Enron Email Dataset (Cont’d)
• 150 folders => 156 employees
• 517,431 emails => 252,830 unique emails
• All emails are from the same time zone, and emails with wrong dates are discarded
• 22,241 emails among 156 employees from Nov. 1998 – Jun. 2002
• “Original Message”, “Forwarded by”, “Thanks”, “Regards”, etc.
• Signatures

Sample signature:
Susan S. Bailey
Senior Legal Specialist
Enron Wholesale Services
Legal Department
1400 Smith Street, Suite 3803A
Houston, Texas 77002
phone: (713) 853-4737
fax: (713) 646-3490
email: susan.bailey@enron.com

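The marker phrases above suggest a simple line-based cut for content extraction, sketched below in Python. This is an assumed heuristic for illustration, not the procedure used in the study: the regular expressions, the function name extract_content, and the rule of cutting at the first marker line are choices made for the example. A signature block that is not preceded by a closing such as "Regards" would need a separate rule.

```python
import re

# A line that starts quoted parent text, e.g. "-----Original Message-----"
# or "------- Forwarded by ...". Patterns are assumptions for illustration.
PARENT_MARKER = re.compile(
    r"^\s*-*\s*(original message|forwarded by)\b", re.IGNORECASE
)

# A line consisting only of a closing word such as "Thanks," or "Regards".
CLOSING_LINE = re.compile(r"^\s*(thanks|regards)\b[\s,.!]*$", re.IGNORECASE)


def extract_content(body: str) -> str:
    """Keep the text before the first parent-message marker or closing line."""
    kept = []
    for line in body.splitlines():
        if PARENT_MARKER.match(line) or CLOSING_LINE.match(line):
            break  # everything after this line is treated as non-content
        kept.append(line)
    return "\n".join(kept).strip()


print(extract_content(
    "Please let me know if you need anything else.\n\nRegards,\n\nEugenio\n"
))
```

Requiring the closing word to stand alone on its line keeps sentences that merely begin with "Thanks" in the extracted content.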
Conclusions and Future Work
• Conclusions
  In general, the procedures are practical and served well in cleaning the Enron emails.
• Future Work
  • Name disambiguation
  • Misdirected email detection
  • Broadcast email removal
  • Various analyses

Thank you! Any comments?