
Enron email datasets

LING 575, Fei Xia, 01/04/2011


Presentation Transcript


  1. Enron email datasets
     LING 575, Fei Xia, 01/04/2011

  2. History of Enron
     • Enron was formed in 1985 under the direction of Kenneth Lay.
     • In 1999, Enron officials began to use the “special purpose entities” (SPE) trick.
     • In Dec 2000, Jeffrey Skilling took over the position of CEO from Kenneth Lay.
     • In Aug 2001, Skilling unexpectedly resigned and Lay became CEO again. Watkins wrote an anonymous letter to Lay about possible fraud.
     • In Oct 2001, the losses transferred from Enron to SPEs totaled over $618 million. The SEC started an inquiry into Enron.
     • In Jan 2002, Lay resigned as chairman and CEO. Enron collapsed in the same year.
     • In 2003, Enron emerged from bankruptcy as two separate companies. Most creditors would receive about 1/5 of the $67 billion they were owed.

  3. History of Enron email dataset
     • Made public by the Federal Energy Regulatory Commission in May 2002, during its investigation
     • Later collected and prepared by SRI for the CALO project
     • William Cohen from CMU put the dataset on the web for researchers (the CMU dataset) in March 2004
     • ISI cleaned the CMU dataset and created a MySQL database (the ISI database)
     • Various teams did data cleaning and annotation

  4. Several corpora
     • Raw data: emails between 1998 and 2002
       • the CMU dataset
       • the ISI database
       • …
     • Annotated data
       • Personal vs. business
       • Email zoning
       • …

  5. The CMU dataset

  6. The CMU dataset
     • Paper: (B. Klimt and Y. Yang, 2004)
     • Available at http://www.cs.cmu.edu/~enron/
     • Stored on patas under /corpora/enron_email_dataset/cmu/

  7. CMU dataset
     • Raw corpus: 619,446 messages from 158 users
     • Cleanup:
       • remove folders such as “discussion_threads”
       • remove duplicates
     • Cleaned corpus: 200,399 messages from 158 users
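
The two cleanup steps can be sketched roughly as follows. This is only an illustration, not the procedure used for the CMU dataset: the slide does not say how duplicates were detected, so the fingerprint below (sender, date, subject, and body with whitespace collapsed) is an assumption, as are the names `SKIPPED_FOLDERS`, `message_key`, and `clean`.

```python
import hashlib
from email import message_from_string

# Folders to drop. The slide only names "discussion_threads"; extend as needed.
SKIPPED_FOLDERS = {"discussion_threads"}

def message_key(raw: str) -> str:
    """Fingerprint a message by sender, date, subject, and body (assumed criterion)."""
    msg = message_from_string(raw)
    body = "" if msg.is_multipart() else msg.get_payload()
    parts = [msg.get("From", ""), msg.get("Date", ""), msg.get("Subject", ""), body]
    # Collapse runs of whitespace so trivial formatting differences do not matter.
    normalized = "\n".join(" ".join(str(p).split()) for p in parts)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def clean(messages):
    """messages: iterable of (folder, raw_text) pairs. Returns the pairs that are kept."""
    seen = set()
    kept = []
    for folder, raw in messages:
        if folder in SKIPPED_FOLDERS:
            continue  # step 1: drop whole folders
        key = message_key(raw)
        if key in seen:
            continue  # step 2: drop duplicates of an already-kept message
        seen.add(key)
        kept.append((folder, raw))
    return kept
```

The first copy of each message is kept, so the folder order in which messages are read decides which copy survives.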

  8. Messages per user
     A few people sent out a lot of messages.

  9. Correlation of folders and messages
     Most users do use folders to organize their emails, but their usage of folders varies a lot.

  10. Distribution of thread sizes
      • Thread: messages with the same subject line exchanged among the same users.
      • Of the 200,399 messages, 61.6% are in threads (123,501 emails in 30,091 threads).
      • Most threads are small.
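
A minimal sketch of this thread definition, under stated assumptions: reply and forward prefixes ("Re:", "Fw:", "Fwd:") are stripped before subjects are compared, "the same users" is read as the same set of participants (sender plus recipients), and a group counts as a thread only if it has at least two messages. None of these details are given on the slide, and the function names are hypothetical. The last lines redo the slide's arithmetic.

```python
import re
from collections import defaultdict

# One or more leading reply/forward markers, e.g. "RE: FW: ".
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def thread_key(subject, sender, recipients):
    """Key = (normalized subject, set of participants)."""
    normalized = PREFIX.sub("", subject).strip().lower()
    participants = frozenset([sender, *recipients])
    return normalized, participants

def group_threads(messages):
    """messages: dicts with 'subject', 'sender', 'recipients'. Returns groups of size >= 2."""
    groups = defaultdict(list)
    for m in messages:
        groups[thread_key(m["subject"], m["sender"], m["recipients"])].append(m)
    return [g for g in groups.values() if len(g) >= 2]

# Figures from the slide.
in_threads, total, threads = 123_501, 200_399, 30_091
share = in_threads / total        # fraction of messages that are in threads
mean_size = in_threads / threads  # average number of messages per thread
print(f"{share:.1%} of messages are in threads; mean thread size {mean_size:.1f}")
```

The mean of about 4.1 messages per thread is consistent with the observation that most threads are small.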

  11. The ISI database

  12. The ISI database
      • Paper: Shetty and Adibi’s report
      • Report and data are available at http://www.isi.edu/~adibi/Enron/Enron.htm
      • Stored on patas under $data_dir/isi/
      • Stored on capuchin as a MySQL database called “enron”

  13. Data cleaning
      • Start from the CMU dataset
      • Remove duplicate emails
      • Remove folders such as “discussion_threads”, “all documents”, and “sent_mail”
      • …

  14. Cleaned Enron email dataset
      • 252,759 emails
      • from 151 employees
      • distributed in about 3000 user-defined folders
      • The dataset has been used by many research groups.

  15. MySQL database: four tables
      • rtype: TO, CC, or BCC
      • rvalue: recipient’s email address
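
To show how the rtype and rvalue fields might be queried, here is a small sketch that uses an in-memory SQLite database as a stand-in for the MySQL server, so it runs without access to capuchin. Only `rtype` and `rvalue` come from the slide; the table names `message` and `recipientinfo`, the other columns, and every row of data are assumptions made up for illustration.

```python
import sqlite3

# Toy stand-in for the "enron" database. Table and column names other than
# rtype and rvalue are assumptions, and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE message (mid INTEGER PRIMARY KEY, sender TEXT, subject TEXT);
    CREATE TABLE recipientinfo (
        rid INTEGER PRIMARY KEY, mid INTEGER, rtype TEXT, rvalue TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO message (mid, sender, subject) VALUES (?, ?, ?)",
    [(1, "alice@example.com", "Q3 numbers"),
     (2, "alice@example.com", "Lunch"),
     (3, "bob@example.com", "Re: Lunch")],
)
conn.executemany(
    "INSERT INTO recipientinfo (mid, rtype, rvalue) VALUES (?, ?, ?)",
    [(1, "TO", "bob@example.com"), (1, "CC", "carol@example.com"),
     (2, "TO", "bob@example.com"), (3, "TO", "alice@example.com")],
)

def recipients(conn, mid, rtype):
    """Addresses that received message `mid` as TO, CC, or BCC."""
    rows = conn.execute(
        "SELECT rvalue FROM recipientinfo WHERE mid = ? AND rtype = ? ORDER BY rid",
        (mid, rtype),
    )
    return [r[0] for r in rows]

def sent_counts(conn):
    """Number of messages sent per sender, most active first."""
    rows = conn.execute(
        "SELECT sender, COUNT(*) AS n FROM message "
        "GROUP BY sender ORDER BY n DESC, sender"
    )
    return rows.fetchall()
```

For example, `recipients(conn, 1, "CC")` lists the CC recipients of message 1, and `sent_counts(conn)` gives the per-sender totals behind a distribution of sent emails per user.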

  16. Distribution of sent emails per user
      A few employees sent out a lot of messages.

  17. Distribution of email over time
      Notice the spike around Nov 2001.
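
A distribution like this one could be computed by bucketing the Date header of each message by month. The sketch below assumes the dates are available as RFC 2822 strings; the function names are hypothetical, and missing or malformed dates are simply skipped.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

def messages_per_month(date_headers):
    """Count messages per (year, month) from RFC 2822 Date header strings."""
    counts = Counter()
    for header in date_headers:
        try:
            sent = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            continue  # skip missing or malformed dates
        counts[(sent.year, sent.month)] += 1
    return counts

def busiest_month(counts):
    """Return the (year, month) with the most messages, or None if empty."""
    return max(counts, key=counts.get) if counts else None
```

Each message is bucketed by the month in its own UTC offset, so messages sent near a month boundary may land in a different bucket than they would after conversion to a single time zone.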

  18. Social network
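
One simple way to derive a social network from email is to treat each (sender, recipient) pair as a directed edge weighted by the number of messages. The slide does not say how its network was built, so the sketch below is just one possible construction; `build_edges`, `out_degree`, and the `min_count` threshold are hypothetical.

```python
from collections import Counter

def build_edges(pairs, min_count=1):
    """pairs: iterable of (sender, recipient). Returns {(sender, recipient): weight}."""
    counts = Counter((s, r) for s, r in pairs if s != r)  # ignore self-addressed mail
    return {edge: n for edge, n in counts.items() if n >= min_count}

def out_degree(edges):
    """Number of distinct people each sender wrote to."""
    degree = Counter()
    for sender, _recipient in edges:
        degree[sender] += 1
    return degree
```

Raising `min_count` keeps only pairs that exchanged several messages, which is a common way to filter out one-off contacts.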
