
Modeling Identity in Archival Collections of Email: A Preliminary Study

Modeling Identity in Archival Collections of Email: A Preliminary Study. Tamer Elsayed and Douglas W. Oard. Institute for Advanced Computer Studies. Department of Computer Science. College of Information Studies. Conference on Email and Anti-Spam (CEAS), July 28th, 2006.

Presentation Transcript


  1. Modeling Identity in Archival Collections of Email: A Preliminary Study • Tamer Elsayed and Douglas W. Oard • Institute for Advanced Computer Studies • Department of Computer Science • College of Information Studies • Conference on Email and Anti-Spam (CEAS), July 28th, 2006

  2. Real Problem • National Archives • Clinton White House search request: Tobacco Policy • 32 million emails • 80,000 • hired 25 persons for 6 months … 200,000

  3. Email Search Searcher • Meaning → Modeling Content • People → Modeling Identity

  4. Identity (diagram) • Nodes: Sender, Receivers, Mentioned, and the Email itself • Attributes shown for each person: Name, Nickname, Email Address • Relation labels: sent, received, sent email to, mentions, mentioned, mentioned to

  5. Outline • Problem • Identity Resolution Architecture • Evaluation • Conclusion

  6. Entity Example • Nickname: “Bob” • Name: “Robert Bruce” • Email Address: “robert.bruce@enron.com” • Evidence counts shown: Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Static Signature (140) • Signature Block: Robert E. Bruce, Senior Counsel, Enron North America Corp., T (713) 345-7780, F (713) 646-3393, robert.bruce@enron.com

  7. Enron Collection • Example of a large organizational collection • CMU version • about half a million emails • 133,581 unique email addresses • ~52% of emails are duplicates (same address, subject, and body)
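The duplicate criterion on slide 7 (same address, subject, body) can be illustrated with a minimal sketch. It assumes "address" means the sender address and applies light normalization (lowercasing, trimming whitespace); the `Email` type and `unique_emails` function are hypothetical names, not part of the presented system.

```python
from typing import Iterable, List, NamedTuple


class Email(NamedTuple):
    sender: str
    subject: str
    body: str


def unique_emails(emails: Iterable[Email]) -> List[Email]:
    """Keep the first copy of each (sender, subject, body) triple."""
    seen = set()
    unique = []
    for email in emails:
        # Assumption: addresses compare case-insensitively and
        # surrounding whitespace is not significant.
        key = (
            email.sender.strip().lower(),
            email.subject.strip(),
            email.body.strip(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique
```

A hash-based key keeps the pass linear in the number of messages, which matters at the scale of about half a million emails.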

  8. Typical Enron Email
     Message Header:
       Message-ID: <1494.1584620.JavaMail.evans@thyme>
       Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
       From: elizabeth.sager@enron.com
       To: sstack@reliant.com
       Subject: RE: Shhhh.... it's a SURPRISE !
       X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
       X-To: 'SStack@reliant.com@ENRON'
     Message Body:
       Salutation: Hi Shari
       Main Body: Hope all is well. Count me in for the group present. See ya next week if not earlier
       Signature Block: Liza Elizabeth Sager 713-853-6349
     Quoted Text:
       Quoted Header:
         -----Original Message-----
         From: SStack@reliant.com@ENRON
         Sent: Monday, July 30, 2001 2:24 PM
         To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
         Cc: ntillett@reliant.com
         Subject: Shhhh.... it's a SURPRISE !
       Quoted Main Body: Please call me (713) 207-5233 Thanks!
       Quoted Signature: Shari

  9. Identity Resolution Architecture Entities Clustering Associations Address-Address Associations Address-Name Associations Address-Nickname Associations Nickname Extraction Salutation lines Signature lines Extraction from Quoted Header Signature Line Detection Salutation Line Detection Main body Quoted headers Extraction from Main Header Body and Quoted Text Separation Unique emails Duplicate Detection

  10. Extraction From Main Headers
      Message-ID: <1486175.1075858665169.JavaMail.evans@thyme>
      Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
      From: jmathes@nbchamber.com
      To: mark.vandini@enron.com, steve.urbon@enron.com, sapienza.tony@enron.com, o'rourke.tom@enron.com, lyons.tom@enron.com
      Subject: New Email Address
      X-From: Jim Mathes <jmathes@nbchamber.com>
      X-To: Vandini, Mark <Mark_Vandini@nstaronline.com>, Urbon Steve <surbon@s-t.com>, Tony Sapienza <sapiena@gftusa.com>, Tom O'Rourke <tom@plymouthchamber.com>, Tom Lyons <tlyons@frfive.com>, Tom Hodgson <sheriff@BCSO-MA.org>
      X-cc:
      X-bcc:
      We have just launched our "New & Improved Website", www.newbedfordchamber.com and I have a new email address: jmathes@newbedfordchamber.com Please make the appropriate changes in your email address book.
      Thank you,
      Jim Mathes, President
      New Bedford Area Chamber of Commerce
      Slide callouts: Name-Address Association; Address-Address Association; Name-Address Association
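As a rough illustration of the name-address extraction on slide 10, the sketch below reads the `X-From` header with Python's standard `email` package and returns an (address, name) pair. The function name and the trimmed sample message are hypothetical, and the slides do not say how the original system parsed headers. Headers such as `X-To` with unquoted "Last, First" names would need extra handling that is not shown here.

```python
from email import message_from_string
from email.utils import parseaddr

# Trimmed version of the slide's example message (illustrative only).
RAW = """\
From: jmathes@nbchamber.com
To: mark.vandini@enron.com
Subject: New Email Address
X-From: Jim Mathes <jmathes@nbchamber.com>

We have just launched our new website.
"""


def name_address_from_headers(raw: str):
    """Return (address, name) from X-From, or None if either part is missing."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg.get("X-From", ""))
    # Only keep pairs with a display name and something address-like.
    if name and "@" in address:
        return (address.lower(), name)
    return None
```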

  11. Extraction From Quoted Headers
      Example 1 (Name-Address Association):
        Hi Jeff, Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students. Shawn
        On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:
        >
        >
        > ok, don't shoot me, but what's the deadline for scheduling for classes?
        >
        > signed,
        > clueless
      Example 2 (Name-Address Association):
        ---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM ---------------------------
        "Patricia Young" <PYoung@eei.org> on 02/09/2000 08:50:59 AM
        To: Elizabeth Sager/HOU/ECT@ECT
        cc:
        Subject: If possible, would you forward your resume to me electronically? Thanks.
        If possible, would you forward your resume to me electronically? Thanks.

  12. Signature & Salutation Detection
      From: susan.scott@enron.com
      Example 1:
        Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon!
        love, sooz
        Procurement, Logistics, and Contracts
        Enron Broadband Services, Inc.
        1400 Smith, Suite EB-4573A
        Houston, TX 77002
      Example 2:
        The week is going OK. All the tennis and swimming has left me with sore muscles so this is my night off. Am planning to do some more house chores so I do not end up with another weekend like the last. I'm still planning on coming to Austin next weekend, I'm just not sure when, but I'll let you know. Call if you get lonely!
        Love, Sooz
        Procurement, Logistics, and Contracts
        Enron Broadband Services, Inc.
        1400 Smith, Suite EB-4573A
        Houston, TX 77002
      Example 3:
        The kiddies are going back to school already so now would be a good time to plan a trip to D.C. at last. Maybe early Sept? Also I'd be game for a girls' trip to Destin. Time to work!
        Love, -Sooz
        Procurement, Logistics, and Contracts
        Enron Broadband Services, Inc.
        1400 Smith, Suite EB-4573A
        Houston, TX 77002

  13. Nickname Extraction • 3,151 address-nickname associations
      From: susan.scott@enron.com
        Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon!
        love, sooz (slide callout: nickname)
        Procurement, Logistics, and Contracts
        Enron Broadband Services, Inc.
        1400 Smith, Suite EB-4573A
        Houston, TX 77002
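To make the idea on slide 13 concrete, here is a minimal sketch of pulling a nickname from a sign-off line such as "love, sooz" or "Love, -Sooz". The list of closing words and the regular expression are assumptions made for illustration; the slides do not list the cues or heuristics the authors actually used.

```python
import re
from typing import Optional

# Hypothetical closing words; the real cue list is not given in the slides.
CLOSINGS = r"(?:love|thanks|thank you|regards|cheers)"

# A closing word, optional punctuation, optional dash, then a single
# alphabetic token that ends the line (same line or the next one).
SIGNOFF = re.compile(
    rf"^\s*{CLOSINGS}\s*[,!.]?\s*[-~]*\s*([A-Za-z]+)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_nickname(body: str) -> Optional[str]:
    """Return the lowercased token following a closing word, if any."""
    match = SIGNOFF.search(body)
    return match.group(1).lower() if match else None
```

Pairing the returned token with the address in the `From` header would yield one address-nickname association per message; the three sign-offs by susan.scott@enron.com on slide 12 would all map to the same nickname under this sketch.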

  14. Identifying Entities • Totals: 3,151 addr-nickname, 82,084 addr-name, and 19,708 addr-addr associations; 66,715 entities • Example entity: Nickname “Bob”; Names “Robert Bruce” and “Robert”; Email Addresses “robert.bruce@enron.com” and “rbruce@hotmail.com” • Evidence counts shown: Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Main Headers (7), Static Signature (140), Quoted Headers (5) • Signature Block: Robert E. Bruce, Senior Counsel, Enron North America Corp., T (713) 345-7780, F (713) 646-3393, robert.bruce@enron.com

  15. Outline • Problem • Identity Resolution Architecture • Evaluation • Conclusion • Future Work

  16. Stratified Sampling

  17. Judgment Process • Incorrect: kmpresto@msn.com → "home email"; terrie.james@enron.com → "alexis james-petty" • Correct but not informative: june-deadrick@reliantenergy.com → "june deadrick"; robbie.lewis@enron.com → "robbie lewis" • Correct and somewhat informative: terriecovarrubias@hotmail.com → "terrie covarrubias"; randal.maffett@enron.com → "randy" • Correct and very informative: lemelpe@nu.com → "phyllis"; piazzet@wharton.upenn.edu → "tom"

  18. Evaluation Measures (diagram labels: Judged Associations, Correct, Very Informative, Informative)

  19. Accuracy • 100% accuracy with multiple sources of evidence • Address-name associations were nearly perfect • At least 80% accuracy for address-nickname associations • 96.7% entity accuracy • Charts (not reproduced): Address-Name Associations, Address-Nickname Associations, Address-Address Associations

  20. Informativeness • Charts (not reproduced): Address-Name Associations, Address-Nickname Associations, Address-Address Associations

  21. Outline • Problem • Identity Resolution Architecture • Evaluation • Conclusion

  22. Conclusion • Introduced a computational model of identity • a set of simple techniques put together • provide a useful baseline • assessed its potential utility in the context of one fairly complex email collection • Automatic detection of nicknames in salutations and signature lines • The most informative results came from the weakest and least accurate evidence • Accuracy and informativeness are both important

  23. Limitations • Email address associated with single identity • Strength of evidence not exploited • Heuristics hand-tuned for Enron collection • Focus on personal attributes • No reconciliation of multiple identities for single person • No attempt to classify identities as machines or groups • Recall?

  24. Thank You! Questions?

  25. Backup

  26. Future Work • extend the model to exploit temporal features and behavioral evidence • implement machine learning techniques • perform ablation studies • characterize the coverage of our methods in more detail • replicate this work in other contexts • integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis).

  27. Helping in Judgments

  28. Identity Framework (diagram labels: Group, Person, Machine; Identity ×3; Entity ×6; Candidates)

  29. Modeling Identity • Attributes (stable explicit features) • email addresses, names, nicknames, contact info • Associations • Link attributes together • Based on observations • Entities • Representation of an identity • Set of attributes in undirected graph • Linked by weighted associations
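One way to picture the model on slide 29 is as an undirected graph whose nodes are attributes and whose edges are associations. The sketch below is only an illustration: the `IdentityGraph` class is hypothetical, and using per-source observation counts as edge weights is an assumption suggested by the evidence counts on the example slides, not a documented design.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Tuple

# An attribute is a (kind, value) pair, e.g. ("address", "robert.bruce@enron.com").
Attribute = Tuple[str, str]


class IdentityGraph:
    """Undirected graph of attributes linked by weighted associations."""

    def __init__(self) -> None:
        # Edge key is an unordered pair; value counts observations per source.
        self.edges: Dict[FrozenSet[Attribute], Counter] = defaultdict(Counter)

    def observe(self, a: Attribute, b: Attribute, source: str) -> None:
        """Record one observation linking two attributes."""
        self.edges[frozenset((a, b))][source] += 1

    def weight(self, a: Attribute, b: Attribute) -> int:
        """Total observations for the association (assumed weighting)."""
        counts = self.edges.get(frozenset((a, b)))
        return sum(counts.values()) if counts else 0
```

Keeping the counts split by source (main header, salutation, and so on) would let a later stage weigh strong and weak evidence differently, which the Limitations slide notes is not yet exploited.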

  30. Identifying Entities • First round • limited transitive closure • Merging associations • based on unique attributes • Address-address associations • No use of strength of evidence yet • 66,715 entities • Covering 77,420 unique email addresses (58% of all addresses)
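Merging addresses through address-address associations, as on slide 30, can be sketched with a union-find structure. This is a hedged illustration: it computes the full transitive closure, whereas the slide describes a "limited" closure whose limits are not specified, and `cluster_addresses` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cluster_addresses(addr_pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group addresses connected by address-address associations."""
    parent: Dict[str, str] = {}

    def find(addr: str) -> str:
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            # Path halving keeps later lookups short.
            parent[addr] = parent[parent[addr]]
            addr = parent[addr]
        return addr

    for left, right in addr_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for addr in list(parent):
        clusters[find(addr)].add(addr)
    return list(clusters.values())
```

Each resulting cluster is a candidate entity; names and nicknames associated with any address in the cluster could then be attached to it.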

  31. Related Work • Attribute/association extraction • Name recognition and reference resolution • Applications: • Social network analysis • Finding experts

  32. Unjudged Associations • Only 19 (~3%) • Charts (not reproduced): Address-Name Associations, Address-Nickname Associations, Address-Address Associations
