Modeling Identity in Archival Collections of Email: A Preliminary Study. Tamer Elsayed and Douglas W. Oard. Institute for Advanced Computer Studies. Department of Computer Science. College of Information Studies. Conference on Email and Anti-Spam (CEAS), July 28th, 2006.
National Archives Real Problem Clinton White House search request Tobacco Policy 32 million emails 80,000 hired 25 persons for 6 months … 200,000
Email Search Searcher • Meaning Modeling Content • People Modeling Identity
Identity [Diagram: an Email linking a Sender, Receivers, and Mentioned people, each described by Email Address, Name, and Nickname; relation labels: sent email to, sent, received, mentioned to, mentions, mentioned]
Outline • Problem • Identity Resolution Architecture • Evaluation • Conclusion
Entity Example Nickname Name “Bob” “Robert Bruce” Main Headers (915) Quoted Headers (8) Salutations (7) Free Signatures (9) Email Address “robert.bruce@enron.com” Static Signature (140) Robert E. Bruce Senior Counsel Enron North America Corp. T (713) 345-7780 F (713) 646-3393 robert.bruce@enron.com Signature Block
Enron Collection • Example of a large organizational collection • CMU version • about half a million emails • 133,581 unique email addresses • ~52% of emails are duplicates! • same address, subject, body
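The duplicate criterion on this slide (same address, subject, body) can be sketched as a hash over those three fields. This is a minimal illustration, not the authors' implementation: the dictionary keys, the `message_key` and `unique_messages` names, and the normalization (lowercasing the address, trimming whitespace) are all assumptions.

```python
import hashlib

def message_key(sender: str, subject: str, body: str) -> str:
    """Hash the three fields the slide names: address, subject, body."""
    # Assumption: addresses compare case-insensitively and surrounding
    # whitespace is ignored. The slides do not say how fields were normalized.
    parts = [sender.strip().lower(), subject.strip(), body.strip()]
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()

def unique_messages(messages):
    """Keep the first message seen for each (address, subject, body) key."""
    seen = set()
    kept = []
    for msg in messages:
        key = message_key(msg["from"], msg["subject"], msg["body"])
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept

# Hypothetical sample: the second message duplicates the first.
sample = [
    {"from": "a@enron.com", "subject": "Hi", "body": "See you."},
    {"from": "A@enron.com", "subject": "Hi", "body": "See you.\n"},
    {"from": "b@enron.com", "subject": "Hi", "body": "See you."},
]
deduped = unique_messages(sample)
```

Keeping the first copy is an arbitrary choice here; any copy would serve for extraction, since duplicates carry the same address, subject, and body.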
Typical Enron Email
[Message Header]
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: 'SStack@reliant.com@ENRON'
[Message Body: Main Body]
[Salutation] Hi Shari
Hope all is well. Count me in for the group present. See ya next week if not earlier
[Signature Block] Liza Elizabeth Sager 713-853-6349
[Message Body: Quoted Text]
[Quoted Header]
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !
[Quoted Main Body] Please call me (713) 207-5233 Thanks!
[Quoted Signature] Shari
Identity Resolution Architecture Entities Clustering Associations Address-Address Associations Address-Name Associations Address-Nickname Associations Nickname Extraction Salutation lines Signature lines Extraction from Quoted Header Signature Line Detection Salutation Line Detection Main body Quoted headers Extraction from Main Header Body and Quoted Text Separation Unique emails Duplicate Detection
Extraction From Main Headers Name-Address Association Message-ID: <1486175.1075858665169.JavaMail.evans@thyme> Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT) From: jmathes@nbchamber.com To: mark.vandini@enron.com, steve.urbon@enron.com, sapienza.tony@enron.com, o'rourke.tom@enron.com, lyons.tom@enron.com Subject: New Email Address X-From: Jim Mathes <jmathes@nbchamber.com> X-To: Vandini, Mark <Mark_Vandini@nstaronline.com>, Urbon Steve <surbon@s-t.com>, Tony Sapienza <sapiena@gftusa.com>, Tom O'Rourke <tom@plymouthchamber.com>, Tom Lyons <tlyons@frfive.com>, Tom Hodgson <sheriff@BCSO-MA.org> X-cc: X-bcc: We have just launched our "New & Improved Website", www.newbedfordchamber.com and I have a new email address: jmathes@newbedfordchamber.com Please make the appropriate changes in your email address book. Thank you, Jim Mathes, President New Bedford Area Chamber of Commerce Address-Address Association Name-Address Association
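A name-address association of the kind shown on this slide can be read from the headers with Python's standard `email` package. The sketch below is only illustrative: pairing the display name in `X-From` with the address in `From`, and the function name `name_address_from_headers`, are assumptions about how the association might be formed, not the authors' code.

```python
from email import message_from_string
from email.utils import parseaddr

# A shortened version of the message on the slide.
raw = (
    "From: jmathes@nbchamber.com\n"
    "X-From: Jim Mathes <jmathes@nbchamber.com>\n"
    "Subject: New Email Address\n"
    "\n"
    "I have a new email address.\n"
)

def name_address_from_headers(raw_message: str):
    """Return (address, name) if the sender headers carry a display name."""
    msg = message_from_string(raw_message)
    _, from_addr = parseaddr(msg.get("From", ""))
    x_name, x_addr = parseaddr(msg.get("X-From", ""))
    # Assumption: prefer the address inside X-From, fall back to From.
    address = (x_addr or from_addr).lower()
    name = x_name.strip()
    if name and address:
        return (address, name)
    return None

association = name_address_from_headers(raw)
```

Recipient headers such as `X-To: Vandini, Mark <...>` contain unquoted commas inside display names, so a standards-based address-list parser may split them incorrectly; they would need collection-specific handling that this sketch does not attempt.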
Extraction From Quoted Headers Hi Jeff, Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students. Shawn On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote: > > > ok, don't shoot me, but what's the deadline for scheduling for classes? > > signed, > clueless Name-Address Association ---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM --------------------------- "Patricia Young" <PYoung@eei.org> on 02/09/2000 08:50:59 AM To: Elizabeth Sager/HOU/ECT@ECT cc: Subject: If possible, would you forward your resume to me electronically? Thanks. If possible, would you forward your resume to me electronically? Thanks. Name-Address Association
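The two quoted-header styles on this slide can be matched with regular expressions. The patterns below are a sketch fitted to these two examples only; the names `SMTP_STYLE`, `FORWARD_STYLE`, and `extract_quoted_associations` are hypothetical, and a real collection would contain more header variants than these.

```python
import re

# "On <date>, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:"
SMTP_STYLE = re.compile(
    r"^On .+?, (?P<name>[^,\[\]]+?) \[SMTP:(?P<addr>[^\]\s]+)\] wrote:"
)
# '"Patricia Young" <PYoung@eei.org> on 02/09/2000 08:50:59 AM'
FORWARD_STYLE = re.compile(
    r'^"(?P<name>[^"]+)" <(?P<addr>[^>\s]+)> on \d{2}/\d{2}/\d{4}'
)

def extract_quoted_associations(body_lines):
    """Return (address, name) pairs found in quoted-header lines."""
    found = []
    for line in body_lines:
        line = line.strip()
        for pattern in (SMTP_STYLE, FORWARD_STYLE):
            m = pattern.match(line)
            if m:
                found.append((m.group("addr").lower(), m.group("name").strip()))
                break
    return found

lines = [
    "Hi Jeff,",
    "On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich "
    "[SMTP:jdasovic@enron.com] wrote:",
    '"Patricia Young" <PYoung@eei.org> on 02/09/2000 08:50:59 AM',
]
quoted = extract_quoted_associations(lines)
```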
Signature & Salutation Detection From: susan.scott@enron.com Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon! love, sooz Procurement, Logistics, and Contracts Enron Broadband Services, Inc. 1400 Smith, Suite EB-4573A Houston, TX 77002 The week is going OK. All the tennis and swimming has left me with sore muscles so this is my night off. Am planning to do some more house chores so I do not end up with another weekend like the last. I'm still planning on coming to Austin next weekend, I'm just not sure when, but I'll let you know. Call if you get lonely! Love, Sooz Procurement, Logistics, and Contracts Enron Broadband Services, Inc. 1400 Smith, Suite EB-4573A Houston, TX 77002 The kiddies are going back to school already so now would be a good time to plan a trip to D.C. at last. Maybe early Sept? Also I'd be game for a girls' trip to Destin. Time to work! Love, -Sooz Procurement, Logistics, and Contracts Enron Broadband Services, Inc. 1400 Smith, Suite EB-4573A Houston, TX 77002
Nickname Extraction From: susan.scott@enron.com 3,151 address-nickname associations Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon! love, sooz Procurement, Logistics, and Contracts Enron Broadband Services, Inc. 1400 Smith, Suite EB-4573A Houston, TX 77002 nickname
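Nickname candidates like "sooz" can be pulled from sign-off and salutation lines with simple patterns. This is a sketch under stated assumptions: the greeting and closing word lists, the function names, and the handling of a sign-off written on one line ("love, sooz") are illustrative choices, not the hand-tuned heuristics the authors used.

```python
import re

# Hypothetical word lists; a real system would need a longer, tuned set.
SALUTATION = re.compile(
    r"^(?:hi|hello|hey|dear)\s+(?P<nick>[A-Za-z]+)\s*[,:!]?\s*$", re.IGNORECASE
)
SIGNOFF = re.compile(
    r"^(?:love|thanks|regards|cheers|best),\s*-?\s*(?P<nick>[A-Za-z]+)\s*$",
    re.IGNORECASE,
)

def salutation_nickname(line: str):
    """Nickname of the addressee, e.g. 'Hi Jeff,' gives 'jeff'."""
    m = SALUTATION.match(line.strip())
    return m.group("nick").lower() if m else None

def signoff_nickname(line: str):
    """Nickname of the sender, e.g. 'Love, -Sooz' gives 'sooz'."""
    m = SIGNOFF.match(line.strip())
    return m.group("nick").lower() if m else None

sender = "susan.scott@enron.com"
body = ["Have a good afternoon!", "love, sooz"]
# Sign-off nicknames are paired with the sender's address.
associations = [
    (sender, nick) for nick in map(signoff_nickname, body) if nick
]
```

A salutation nickname would instead be paired with a recipient address, which is only unambiguous when the message has a single recipient. Patterns this loose also match lines like "Hi there", so a stop-word filter would be needed in practice.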
Identifying Entities Nickname Name “Bob” “Robert Bruce” Main Headers (915) Quoted Headers (8) Salutations (7) Free Signatures (9) 3,151 addr-nickname 82,084 addr-name Email Address “robert.bruce@enron.com” 19,708 addr-addr Main Headers (7) Static Signature (140) Email Address Robert E. Bruce Senior Counsel Enron North America Corp. T (713) 345-7780 F (713) 646-3393 robert.bruce@enron.com “rbruce@hotmail.com” Quoted Headers (5) Signature Block “Robert” 66,715 entities Name
Outline • Problem • Identity Resolution Architecture • Evaluation • Conclusion • Future Work
Judgment Process Incorrect kmpresto@msn.com "home email" terrie.james@enron.com "alexis james-petty" Correct but not informative june-deadrick@reliantenergy.com “june deadrick” robbie.lewis@enron.com “robbie lewis” Correct and somewhat informative terriecovarrubias@hotmail.com "terrie covarrubias" randal.maffett@enron.com "randy" Correct and very informative lemelpe@nu.com "phyllis" piazzet@wharton.upenn.edu "tom"
Evaluation Measures Judged Associations Correct Very Informative Informative
Accuracy • 100% accuracy with multiple sources of evidence. • Address-name association was nearly perfect • 80% minimum accuracy in address-nickname • 96.7% entity accuracy Address-Name Associations Address-Nickname Associations Address-Address Associations
Informativeness Address-Name Associations Address-Nickname Associations Address-Address Associations
Outline • Problem • Identity Resolution Architecture • Evaluation • Conclusion
Conclusion • Introduced a computational model of identity • a set of simple techniques put together • provide a useful baseline • assessed its potential utility in the context of one fairly complex email collection • Automatic detection of nicknames in salutations and signature lines. • Most informative results from weakest evidence & least accurate • Accuracy and informativeness are both important
Limitations • Email address associated with single identity • Strength of evidence not exploited • Heuristics hand-tuned for Enron collection • Focus on personal attributes • No reconciliation of multiple identities for single person • No attempt to classify identities as machines or groups • Recall?
Thank You! Questions?
Future Work • extend the model to exploit temporal features and behavioral evidence • implement machine learning techniques • perform ablation studies • characterize the coverage of our methods in more detail • replicate this work in other contexts • integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis).
Identity Framework [Diagram labels: Group, Person, Machine; Identity; Entity; Candidates]
Modeling Identity • Attributes (stable explicit features) • email addresses, names, nicknames, contact info • Associations • Link attributes together • Based on observations • Entities • Representation of an identity • Set of attributes in undirected graph • Linked by weighted associations
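The entity described here, a set of attributes in an undirected graph linked by weighted associations, can be sketched as a small class. The `Entity` class, the `(kind, value)` attribute tuples, and the use of observation counts as weights are assumptions for illustration. The counts come from the Entity Example slide, but which edge each count belongs to is a reading of the flattened figure.

```python
from collections import defaultdict

class Entity:
    """Attributes as nodes; undirected, weighted associations as edges."""

    def __init__(self):
        # frozenset({attr_a, attr_b}) -> number of supporting observations
        self.weights = defaultdict(int)

    def observe(self, attr_a, attr_b, count=1):
        """Record evidence linking two attributes (order does not matter)."""
        self.weights[frozenset((attr_a, attr_b))] += count

    def attributes(self):
        """All attributes that appear in at least one association."""
        return set().union(*self.weights) if self.weights else set()

address = ("address", "robert.bruce@enron.com")
name = ("name", "Robert Bruce")
nickname = ("nickname", "Bob")

bruce = Entity()
bruce.observe(address, name, 915)    # main headers (assumed edge)
bruce.observe(name, address, 8)      # quoted headers (assumed edge)
bruce.observe(address, nickname, 7)  # salutations (assumed edge)
bruce.observe(address, nickname, 9)  # free signatures (assumed edge)
```

Using a `frozenset` as the edge key makes the association undirected, so evidence recorded in either order accumulates on the same edge.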
Identifying Entities • First round • limited transitive closure • Merging associations • based on unique attributes • Address-address associations • No use of strength of evidence yet • 66,715 entities • Covering 77,420 unique email addresses (58% of all addresses)
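Merging through address-address associations can be sketched with a union-find structure. Note the assumption: the slide says the transitive closure is "limited" but does not say how, so this sketch computes the full closure, and the `merge_entities` function and its input format are hypothetical.

```python
def merge_entities(addr_addr, addr_name):
    """Group addresses linked by address-address pairs, then attach names."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in addr_addr:
        union(a, b)
    for addr, _ in addr_name:
        find(addr)  # register addresses that have no address-address link

    entities = {}
    for addr in list(parent):
        root = find(addr)
        entities.setdefault(root, {"addresses": set(), "names": set()})
        entities[root]["addresses"].add(addr)
    for addr, name in addr_name:
        entities[find(addr)]["names"].add(name)
    return list(entities.values())

merged = merge_entities(
    addr_addr=[("robert.bruce@enron.com", "rbruce@hotmail.com")],
    addr_name=[
        ("robert.bruce@enron.com", "Robert Bruce"),
        ("rbruce@hotmail.com", "Robert"),
        ("elizabeth.sager@enron.com", "Elizabeth Sager"),
    ],
)
```

The edge weights play no part here, matching the slide's note that strength of evidence is not used yet. An unrestricted closure can chain unrelated people through a shared address, which may be why the authors limited it.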
Related Work • Attribute/association extraction • Name recognition and reference resolution • Applications: • Social network analysis • Finding experts
Unjudged Associations Address-Name Associations Address-Nickname Associations Address-Address Associations Only 19 ~3%