Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset

Yingjie Zhou, Research Assistant, RPI; Mark Goldberg, Professor, RPI; Malik Magdon-Ismail, Associate Professor, RPI; William A. Wallace, Professor, RPI

Presentation Transcript


  1. Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset • Yingjie Zhou, Research Assistant, RPI • Mark Goldberg, Professor, RPI • Malik Magdon-Ismail, Associate Professor, RPI • William A. Wallace, Professor, RPI • Supported by NSF Grants #0324947, #0323324, #0634875, #0522672, and by ONR Grant #N00014-06-1-0466

  2. Outline • Introduction • Properties of Organizational Emails • Difficulties in Cleaning Organizational Emails • Procedures of Cleaning Organizational Emails • Introduction to Enron Email Dataset • Application of Cleaning Procedures to Enron Email Dataset • Results • Conclusions and Future Work NAACSOS 2007

  3. Introduction • Emails • Organizational emails • Inter-organizational emails • Intra-organizational emails • The features of organizational email data make it a promising source for various studies • Email data has its own problems and is noisy

  4. Properties of Organizational Emails • Emails are formatted, and the format is usually defined and followed. • Emails are normally stored on a server and can be easily collected. • Emails are unobtrusive. • Emails are time-stamped. In addition, • The senders and recipients of the emails are employees of the organization. • Each employee is normally assigned one or more unique email addresses within the organizational domain.

  5. Difficulties in Cleaning Organizational Emails • Multiple email addresses, names, or IDs exist for the same person. • Duplicate emails exist. • The content of the email is difficult to extract.

  6. Procedures of Cleaning Organizational Emails • Map aliases to employees • Parse last name, first name, and email ID in headers • [Diagram: Organizational Email Dataset; Employee 1, Employee 2, …, Employee N; Raw Formats per employee; Extracted Formats per employee; Generalized Formats]
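
The header-parsing step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the regular expression and the function name `parse_x_header` are assumptions, and the pattern only covers the "Last, First </O=.../CN=ID>" form that appears in the Enron examples later in the slides.

```python
import re

# Hypothetical pattern for X-From / X-To values such as
# "Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>".
# The trailing <...> part is optional; the ID is the last CN= component.
X_HEADER = re.compile(
    r"^\s*(?P<last>[^,<]+),\s*(?P<first>[^<]+?)\s*"
    r"(?:<.*?/CN=(?P<id>[^/>]+)>)?\s*$",
    re.IGNORECASE,
)


def parse_x_header(value):
    """Return (last, first, email_id), lower-cased, or None if no match."""
    m = X_HEADER.match(value)
    if m is None:
        return None
    email_id = m.group("id")
    return (
        m.group("last").strip().lower(),
        m.group("first").strip().lower(),
        email_id.lower() if email_id else None,
    )
```

Values without a comma, such as "phillip k allen", return `None` here and would need a separate rule.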

  7. Procedures of Cleaning Organizational Emails (Cont’d) • Remove duplicate emails: content + date + recipients • Consolidate date and time: convert to machine time • Extract email content: signatures; features of parent email message; greetings and names • [Diagram: Organizational Email Dataset; Remove Duplicates; Generalized Formats; Unique Message Email Dataset; Employee Email Dataset; Date & Time Consolidation; Content Extraction; Cleaned Employee Email Dataset]
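
Duplicate removal on content + date + recipients, with dates converted to machine time, might look like the sketch below. The message layout (dicts with `body`, `date`, and `to` keys), the whitespace and case normalization, and the use of a SHA-1 digest as the comparison key are all assumptions made for illustration; the slides do not specify them.

```python
import hashlib
from email.utils import parsedate_to_datetime


def to_machine_time(date_header):
    """Convert an RFC 2822 Date header to seconds since the Unix epoch."""
    return int(parsedate_to_datetime(date_header).timestamp())


def dedup_key(body, date_header, recipients):
    """Build one comparison key from content + date + recipients."""
    normalized_body = " ".join(body.split())  # collapse whitespace
    normalized_rcpts = ",".join(sorted(r.strip().lower() for r in recipients))
    raw = f"{normalized_body}\n{to_machine_time(date_header)}\n{normalized_rcpts}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def remove_duplicates(messages):
    """Keep the first message seen for each key (hypothetical dict layout)."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg["body"], msg["date"], msg["to"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Comparing timestamps rather than raw date strings means two copies of a message still match if their Date headers are formatted differently.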

  8. Introduction to Enron Email Dataset • Federal Energy Regulatory Commission (FERC) posted the Enron email dataset on the web in May of 2002 • 619,446 emails • Professor Leslie Kaelbling from MIT purchased the dataset • SRI: integrity and security • Professor William W. Cohen: CMU dataset • 150 user folders • 517,431 emails • 400 MB

  9. Introduction to Enron Email Dataset (Cont’d) • Parts of a message: Sender; Receiver/Receivers; Date + Time; Subject; Body; ? Forwarded or replied text; ? Signature; Attachment • Sample message:
      Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
      Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
      From: eugenio.perez@enron.com
      To: sally.beck@enron.com
      Subject: Self Evaluation - Short Version
      Mime-Version: 1.0
      Content-Type: text/plain; charset=us-ascii
      Content-Transfer-Encoding: 7bit
      X-From: Eugenio Perez
      X-To: Sally Beck
      X-cc:
      X-bcc:
      X-Folder: \Sally_Beck_Nov2001\Notes Folders\All documents
      X-Origin: BECK-S
      X-FileName: sbeck.nsf

      Please let me know if you need anything else.
      Regards,
      Eugenio
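
One way to split a raw message like the one above into sender, recipients, date, subject, and body is Python's standard `email` package. This is a sketch under assumptions: the sample is abbreviated, the line breaks in its body are guessed from the flattened slide text, and the function name `split_fields` is hypothetical.

```python
from email import message_from_string

# Abbreviated copy of the sample message from the slide.
RAW = """Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
From: eugenio.perez@enron.com
To: sally.beck@enron.com
Subject: Self Evaluation - Short Version
X-From: Eugenio Perez
X-To: Sally Beck
X-Origin: BECK-S

Please let me know if you need anything else.

Regards,
Eugenio
"""


def split_fields(raw):
    """Return the header fields of interest and the undecoded body text."""
    msg = message_from_string(raw)
    to_header = msg["To"] or ""
    return {
        "sender": msg["From"],
        "recipients": [a.strip() for a in to_header.split(",") if a.strip()],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "body": msg.get_payload(),  # a plain string for non-multipart mail
    }
```

The body returned here still contains any forwarded text and signature; separating those is a later step.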

  10. Introduction to Enron Email Dataset (Cont’d) • Header fields: From, To, Cc, Bcc and X-From, X-To, X-cc, X-bcc
      • Example 1: davis-d\deleted_items\101
        From: dana.davis@enron.com
        To: dana.davis@enron.com
        X-From: Davis, Mark Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MDAVIS>
        X-To: Davis, Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavis>
      • Example 2: cash-m\sent_items\505
        From: michelle.cash@enron.com
        To: legal <.taylor@enron.com>
        X-From: Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>
        X-To: Taylor, Mark E (Legal) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mtaylo1>
      • Slide annotations on the examples: Wrong! Doesn’t make sense!

  11. Application of Cleaning Procedures to Enron Email Dataset
      • phillip k allen
      • phillip allen
      • allen, phillip
      • allen, phillip k.
      • phillip k allen <phillip k allen/hou/ect@ect>
      • allen, phillip </o=enron/ou=na/cn=recipients/cn=notesaddr/cn=ba4cd662-58db2db2-862564b8-5b412a>
      • allen, phillip k. </o=enron/ou=na/cn=recipients/cn=pallen>
      • phillip.k.allen@enron.com
      • phillip.allen@enron.com
      • pallen@enron.com
      • pallen70@hotmail.com
      • pallen@ect.enron.com
      • pallen@hotmail.com
      [Diagram: raw forms of the address pallen@enron.com]
        pallen@enron.com
        "phillip allen" <pallen@enron.com>
        "pallen@enron.com" <pallen@enron.com>
        phillip <pallen@enron.com>
        phillip allen <pallen@enron.com>
        "allen, phillip k" <pallen@enron.com>
        <pallen@enron.com>
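
Mapping such variants to one employee can be sketched as a lookup against a hand-built alias table. Everything named here is an assumption for illustration: the table `ALIASES`, the canonical key `"pallen"`, the address pattern, and the function `canonical_employee`. Only a few of the variants listed on the slide are included.

```python
import re

# Loose pattern for something that looks like an email address.
ADDRESS = re.compile(r"[\w.+-]+@[\w.-]+\w")

# Hypothetical alias table: each known variant points to one employee key.
ALIASES = {
    "phillip k allen": "pallen",
    "allen, phillip k.": "pallen",
    "phillip.allen@enron.com": "pallen",
    "phillip.k.allen@enron.com": "pallen",
    "pallen@enron.com": "pallen",
    "pallen@ect.enron.com": "pallen",
}


def canonical_employee(raw):
    """Return the employee key for a raw header value, or None if unknown."""
    # Lower-case, unify curly quotes, and collapse whitespace.
    text = raw.lower().replace("\u201c", '"').replace("\u201d", '"')
    text = " ".join(text.split())
    # Prefer an address embedded anywhere in the value.
    m = ADDRESS.search(text)
    if m and m.group(0) in ALIASES:
        return ALIASES[m.group(0)]
    # Otherwise fall back to the display name before any <...> part.
    name = text.split("<", 1)[0].strip(' "')
    return ALIASES.get(name)
```

Unknown values return `None`, so they can be collected and reviewed before the table is extended.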

  12. Application of Cleaning Procedures to Enron Email Dataset (Cont’d) • 150 folders => 156 employees • 517,431 emails => 252,830 unique emails • All emails are from the same time zone, and emails with wrong dates are discarded • 22,241 emails among 156 employees from Nov. 1998 – Jun. 2002 • Marker phrases for content extraction: “Original Message”, “Forwarded by”, “Thanks”, “Regards”, etc. • Signatures, for example:
      Susan S. Bailey
      Senior Legal Specialist
      Enron Wholesale Services
      Legal Department
      1400 Smith Street, Suite 3803A
      Houston, Texas 77002
      phone: (713) 853-4737
      fax: (713) 646-3490
      email: susan.bailey@enron.com
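
Content extraction driven by marker phrases like those on this slide could be sketched as below. The marker phrases come from the slide; the matching rules are assumptions: everything from the first parent-message marker onward is dropped, and a line consisting only of a closing word is treated as the start of the sign-off. The function name `extract_content` is hypothetical.

```python
import re

# Start of quoted parent text, e.g. "-----Original Message-----".
PARENT_MARKERS = re.compile(
    r"-+\s*(Original Message|Forwarded by)", re.IGNORECASE
)
# A line that contains nothing but a closing phrase.
CLOSINGS = re.compile(
    r"^\s*(thanks|thank you|regards|best regards)\s*[,.!]?\s*$",
    re.IGNORECASE,
)


def extract_content(body):
    """Return the new text of a message without parent text or sign-off."""
    m = PARENT_MARKERS.search(body)
    if m:
        body = body[:m.start()]  # drop the forwarded or replied-to message
    kept = []
    for line in body.splitlines():
        if CLOSINGS.match(line):
            break  # the closing and anything after it is the signature block
        kept.append(line)
    return "\n".join(kept).strip()
```

A signature that does not follow a closing word, such as a bare contact block, would pass through this sketch and need an additional rule.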

  13. Conclusions and Future Work • Conclusions: In general, the procedures are practical and served well in cleaning the Enron emails. • Future Work • Name disambiguation • Misdirected email detection • Broadcast email removal • Various analyses

  14. Thank you! Any Comments?
