590 likes | 664 Views
Identity Resolution in Email Collections. Tamer Elsayed and Douglas W. Oard. Department of Computer Science, UMIACS, and iSchool. CLIP Colloquium, UMD, Feb 2009. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.
E N D
Identity Resolution in Email Collections Tamer Elsayed and Douglas W. Oard Department of Computer Science, UMIACS, and iSchool CLIP Colloquium, UMD, Feb 2009
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ National Archives Real Problem Clinton White House search request Tobacco Policy 32 million emails 80,000 hired 25 persons for 6 months … 200,000
Identity Resolution in Email Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. Sheila WHO?
55 Sheila’s !! weisman pardo glover rich jones breeden huckaby tweed mcintyre chadwick birmingham kahanek foraker tasman fisher petitt Dombo Robbins chang maynes nacey ferrarini dey macleod howard darling watson perlick advani hester kenner lewis walton whitman berggren osowski kelly jarnot kirby knudsen boehringer lutz glover wollam jortner neylon whanger nagel graves mclaughlin venville rappazzo miller swatek hollis Enron Collection Message-ID: <1494.1584620.JavaMail.evans@thyme> Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT) From: elizabeth.sager@enron.com To: sstack@reliant.com Subject: RE: Shhhh.... it's a SURPRISE ! X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER> X-To: 'SStack@reliant.com@ENRON' Hi Shari Hope all is well. Count me in for the group present. See ya next week if not earlier Rank Candidates Liza Elizabeth Sager 713-853-6349 -----Original Message----- From: SStack@reliant.com@ENRON Sent: Monday, July 30, 2001 2:24 PM To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com Cc: ntillett@reliant.com Subject: Shhhh.... it's a SURPRISE ! Please call me (713) 207-5233 Thanks! Shari
Generative Model • Choose “person” c to mention p(c) • Choose appropriate “context” X to mention c p(X | c) • Choose a “mention” l p(l | X, c) GEconferencecall “sheila”
Posterior Distribution 3-Step Solution (1) IdentityModeling (2) Context Reconstruction (3) Mention Resolution
Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion and Future Work
“Easy References” of Identity Message-ID: <1494.1584620.JavaMail.evans@thyme> Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT) From: elizabeth.sager@enron.com To: sstack@reliant.com Subject: RE: Shhhh.... it's a SURPRISE ! X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER> X-To: 'SStack@reliant.com@ENRON' Email Standards Hi Shari Hope all is well. Count me in for the group present. See ya next week if not earlier Liza Elizabeth Sager 713-853-6349 Email-Client Behavior User Regularities -----Original Message----- From: SStack@reliant.com@ENRON Sent: Monday, July 30, 2001 2:24 PM To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com Cc: ntillett@reliant.com Subject: Shhhh.... it's a SURPRISE ! Please call me (713) 207-5233 Thanks! Shari
Representational Model sheila glover 1170 (User Name) sheila.glover@enron.com 932 (Main Headers) 216 (Signature) 19 (Salutation) 14 (Quoted Headers) sheila sheila glover sg 19 (Signature) Representational Model of Identity 77,240 “non-trivial” models
identity c name type t m observed mention Computational Model of Identity
Candidates Likelihood: p ( “sheila” | c) Identity Models Candidates
Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion and Future Work
Topical Context Conversational Context LocalContext Contextual Space
Topical Context Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. Date: Fri Dec 15 05:33:00 EST 2000 From: david.oxley@enron.com To: vince j kaminski <vince.kaminski@enron.com> Cc: sheila walton <sheila.walton@enron.com> Subject: Re: Grant Masson Great news. Lets get this moving along. Sheila, can you work out GE letter? Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone. GE Sheila call sheila.walton@enron.com GE Sheila call
Topical Context Social Context Conversational Context LocalContext Contextual Space
Social Context Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. kay.mann@enron.com Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST) From: rebecca.walker@enron.com To: kay.mann@enron.com Subject: ESA Option Execution Kay Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland. Thanks, Rebecca kay.mann@enron.com Sheila Tweed
Formally • A context of an email is a probability distribution over emails • Probability estimated based on type of context • Contextual Space is a linear combination of 4 contexts
local conversational Context Expansion people topical social time content Temporal similarity affects social and topical similarity
Temporal Similarity • Decay over time • Gaussian and Linear functions • Time difference / Rank
Social Similarity • Two sets of participants (email adresses) • Binary, Overlap, Jacaard, Both
Temporal Effect Temporal Sim Pure Social Sim Normalize Social Context Social Sim
email reply / forward Topical Similarity • Standard IR Similarity: BM25 • Email as a DOCUMENT? • Subject • Body (+Subject) • Root of thread • Concatenated path to root • Combined similarly with temporal similarity
Topical Context Social Context Conversational Context LocalContext Contextual Space (emails)
Contextual Space (mentions) “Sheila Tweed” “jsheila@enron.com” social social “Sheila Walton” “Sheila” topical topical “sheila” social “Sheila” topical conversational “sg”
Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion and Future Work
Mention Resolution 1 Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. ? Sheila 2 3 Likelihood: p ( “sheila” | c) Candidates Goal: estimate p(c|m, X(m)) and rank accordingly
[1] Context-Free Resolution Context-FreeResolution “Sheila Tweed” “jsheila@enron.com” social social “Sheila Walton” “Sheila” “Sheila” topical topical “sheila” social “Sheila” topical conversational “sg” X
“Sheila Tweed” “jsheila@enron.com” “Sheila Walton” “sheila” “Sheila” “sg” [2] Contextual Resolution Context-FreeResolution social social “Sheila” topical social
Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion
Test Collections Enron-all Enron-subset Sager Shapiro
Evaluation Measures Commonly used in “known-item” retrieval • Success @1 (i.e., Precision @1) • One-best • MRR (Mean Reciprocal Rank) • Inverse of the harmonic mean of the ranks of true answer ri
Comparison w/Literature ContextExpansion Lit.Best ContextExpansion Lit.Best Earlier expansion approach, reported in ACL 2008 Improved expansion 0.92 0.87
Limitations • Resolving single mentions • All mention-queries are sampled from Enron to Enron emails • All mention-queries refer to Enron Employee • Small for train/test split Scalable Implementation for Resolving All Mentions New Test Collection
Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion
Scalable Implementation Two Bottlenecks: • Context expansion of ALL emails For each email: ranked list of “Similar” emails • Resolution of ALL mentions Resolution of one mention depends on resolution of all other mentions in context
Context Expansion of ALL Emails • Goal: For each email: ranked list of “Similar” emails • Need for BOTH social and topical contexts • Efficient implementation Abstract Problem:Computing Pairwise Similarity
Trivial Solution • load each vector o(N) times • load each term o(dft2) times Goal scalable and efficient solutionfor large collections
Better Solution Each term contributes only if appears in • Load weights for each term once • Each term contributes o(dft2) partial scores
MapReduce Framework (a) Map (b) Shuffle (c) Reduce (k1, v1) [k2, v2] Shuffling group values by: [keys] [(k3, v3)] map (k2, [v2]) input reduce output map input reduce output map input reduce output map input handles low-level detailstransparently
Decomposition Each term contributes only if appears in • Load weights for each term once • Each term contributes o(dft2) partial scores reduce map
contextsim model temporal sim model topical : body/root/path ~~~~~~~~~~~-- ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ contextgraph doc rep. social : participants time window df-cut rankcut-off Expansion Using MapReduce • Using generic pairwise-similarity for both topical and social expansion
“Sheila Tweed” “jsheila@enron.com” “Sheila Walton” “sheila” “Sheila” “sg” Context Mention-Graph map Context-FreeResolution map social social reduce “Sheila” topical topical map social topical map conversational
Preprocessing Resolution System Using MapReduce Threads Emails Identity Models Packing Dict. Conv.Expansion LocalExpansion Topical Expansion Social Expansion Mention Recognition and Prior Computation Expansion Conv. Graph Local Graph Topical Graph Social Graph Prior Prior Prior Conv.Resolution LocalResolution Topical Resolution Social Resolution Prior Resolution Resolution Merging Context Resolutions Posterior Resolution
Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion
New Test Collection • Random Sample from CMU-Enron collection • “Annotation + Search” interface available • Total annotators: 3 • Annotation time: ~50 hours. • Not only resolutions • Time, difficulty, confidence, evidence, and comments • Total mention-queries : 584 • 80% resolvable, 82% of them to Enron domain • Overall inter-annotator agreement: ~81 %