
Contextual Search and Name Disambiguation in Email using Graphs

Explores a graph-based approach to contextual search in email data, applied to person name disambiguation and message threading using lazy graph walks and learned re-ranking. Evaluation results and future research directions are also discussed.


Presentation Transcript


  1. Contextual Search and Name Disambiguation in Email using Graphs • Einat Minkov, William W. Cohen, Andrew Y. Ng • SIGIR 2006

  2. Outline • Extended similarity measure using graph walks • Instantiation for Email • Learning • Evaluation • Person Name Disambiguation • Threading • Summary and future directions

  3. Object Similarity • Textual similarity measures model document-document (or query-document) similarity • However, in structured data, documents are not isolated • We are interested in extending text-based similarity measures to structured settings: represent the structured data as a graph, and derive object similarity using lazy graph walks • We instantiate this framework for Email (a special case of structured data)

  4. Email as a Graph [Figure: example graph for one message. A file node (file1) links to email-address nodes (chris.germany@enron.com, mgermany@ch2m.com) via sent_from_email / sent_to_email edges, to person nodes (Chris, Melissa Germany) via sent_from / sent_to and alias edges, to a date node (1.22.00) via on_date, and to term nodes such as "work" via has_term / has_subj_term edges.]

  5. Email as a Graph • A directed graph • A node carries an entity type • An edge carries a relation type • Edges are bi-directional (cyclic) • Nodes inter-connect via linked entities.
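
To make the representation concrete, here is a minimal Python sketch of such a typed graph; the class name EmailGraph, the dict-based storage, and the example node and edge names are illustrative choices, not the authors' implementation.

    from collections import defaultdict

    class EmailGraph:
        """A toy typed graph: nodes carry an entity type, edges carry a relation label."""
        def __init__(self):
            self.node_type = {}              # node id -> entity type
            self.edges = defaultdict(list)   # node id -> list of (edge label, target node)

        def add_node(self, node, ntype):
            self.node_type[node] = ntype

        def add_edge(self, x, label, y, inverse_label=None):
            # Store both directions so walks can traverse an edge either way.
            self.edges[x].append((label, y))
            self.edges[y].append((inverse_label or label + "_inverse", x))

    g = EmailGraph()
    g.add_node("file1", "file")
    g.add_node("chris.germany@enron.com", "email-address")
    g.add_node("germany", "term")
    g.add_edge("file1", "sent_from_email", "chris.germany@enron.com")
    g.add_edge("file1", "has_term", "germany")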

  6. Edge Weights • Graph G: nodes x, y, z with node types T(x), T(y), T(z), edge labels ℓ, and parameters θ (one parameter per edge label) • The weight of an edge x → y is defined by a two-step probability distribution: a. pick an outgoing edge label ℓ with probability proportional to θ for that label, b. pick the target node y uniformly among the nodes reachable from x via ℓ
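
A sketch of that two-step distribution, building on the EmailGraph class above; the dict theta of per-label weights and the default weight of 1.0 for unlisted labels are assumptions made for illustration.

    from collections import defaultdict

    def transition_probs(graph, x, theta):
        """One-step distribution from node x: pick an edge label in proportion to
        theta[label], then pick a target node uniformly among that label's edges."""
        by_label = defaultdict(list)
        for label, y in graph.edges[x]:
            by_label[label].append(y)
        z = sum(theta.get(label, 1.0) for label in by_label)  # normalizer over labels at x
        probs = defaultdict(float)
        for label, targets in by_label.items():
            p_label = theta.get(label, 1.0) / z
            for y in targets:
                probs[y] += p_label / len(targets)
        return dict(probs)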

  7. Graph Similarity • Defined by lazy graph walks over k steps • Given: a stay probability γ (larger values favor shorter paths), a transition matrix M, and an initial node distribution Vq • Output node distribution: the distribution over nodes after k lazy-walk steps from Vq, where each step stays at the current node with probability γ • We use this platform to perform search for related items in the graph: a query is an initial distribution Vq over nodes plus a desired output type Tout
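
A sketch of the k-step lazy walk using the transition_probs helper above; the explicit stay/move mixing with probability gamma is one reading of the "stay probability", and the function and parameter names are illustrative.

    def lazy_walk(graph, v_q, theta, k=2, gamma=0.5, t_out=None):
        """Start from the query distribution v_q, take k lazy steps, and return the
        resulting nodes (optionally restricted to type t_out) ranked by probability."""
        dist = dict(v_q)
        for _ in range(k):
            nxt = {x: gamma * p for x, p in dist.items()}           # stay with prob. gamma
            for x, p in dist.items():
                for y, q in transition_probs(graph, x, theta).items():
                    nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * p * q  # move otherwise
            dist = nxt
        if t_out is not None:
            dist = {x: p for x, p in dist.items() if graph.node_type.get(x) == t_out}
        return sorted(dist.items(), key=lambda kv: -kv[1])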

  8. Relation to IDF [Figure: a small bipartite graph of file nodes (file1, file2, file3) and term nodes (term1 through term7)] • Reduce the graph to files and terms only • One-dimensional search of files, over one step (query = multiple source term nodes) • A natural IDF filter: terms occurring in multiple files will 'spread' their probability mass into small fractions over many file nodes
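
A small usage example of the sketches above for this one-step, term-to-file search; the query term and weights are arbitrary toy values.

    theta = {}                                  # empty: every edge label gets weight 1.0
    v_q = {"germany": 1.0}                      # query distribution over term nodes
    ranked_files = lazy_walk(g, v_q, theta, k=1, gamma=0.0, t_out="file")
    # A term linked to many file nodes would split its mass into small fractions,
    # so files sharing rare query terms end up ranked higher, much like IDF weighting.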

  9. Learning • Learn how to better rank graph nodes for a particular task • The parameters θ can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005) • We suggest a re-ranking approach (Collins and Koo, Computational Linguistics, 2005), which can take advantage of 'global' features • A training example includes: a ranked list of li nodes, each node represented through m features, and at least one known correct node • The features describe the graph walk paths
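
Below is a schematic linear re-ranker with a perceptron-style weight update; it is only a simplified stand-in for the re-ranking method the slide cites, and all names are illustrative.

    def score(feats, weights):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())

    def rerank(candidates, weights):
        # candidates: list of (node, feature_dict); higher score ranks first
        return sorted(candidates, key=lambda c: -score(c[1], weights))

    def perceptron_update(candidates, correct_node, weights, lr=1.0):
        # Promote the features of a known-correct node over the current top-ranked one.
        top_node, top_feats = rerank(candidates, weights)[0]
        if top_node == correct_node:
            return weights
        correct_feats = dict(candidates)[correct_node]
        for f, v in correct_feats.items():
            weights[f] = weights.get(f, 0.0) + lr * v
        for f, v in top_feats.items():
            weights[f] = weights.get(f, 0.0) - lr * v
        return weights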

  10. Path-describing Features • The full set of paths to a target node within k steps can be recovered • 'Edge unigrams': was edge type l used in reaching x from Vq • 'Edge bigrams': were edge types l1 and l2 used (in that order) in reaching x from Vq • 'Top edge bigrams': were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths • [Figure: example walk over nodes x1 through x5 for k = 0, 1, 2; paths reaching x3 at k = 2: x2 → x3, x2 → x1 → x3, x4 → x1 → x3, x2 → x2 → x3]
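
A sketch of turning recorded walk paths into these features; the feature-name strings and the representation of a path as a list of edge labels are assumptions.

    def path_features(paths, top_paths=None):
        """paths: edge-label sequences of all walks that reached the candidate node;
        top_paths: the label sequences of the two highest-scoring paths, if known."""
        feats = {}
        for path in paths:
            for label in path:
                feats["unigram:" + label] = 1.0
            for l1, l2 in zip(path, path[1:]):
                feats["bigram:%s>%s" % (l1, l2)] = 1.0
        for path in (top_paths or []):
            for l1, l2 in zip(path, path[1:]):
                feats["top-bigram:%s>%s" % (l1, l2)] = 1.0
        return feats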

  11. Outline • Extended similarity measure using graph walks • Instantiation for Email • Learning • Evaluation • Person Name Disambiguation • Threading • Summary and future directions

  12. Person Name Disambiguation [Figure: the query "who is Andy?" shown as a term node "andy" linked to file nodes, which in turn link to candidate person nodes] • Given: a term that is known to be a personal name and is not mentioned 'as is' in the header (otherwise the problem is easy) • Output: ranked person nodes

  13. Corpora and Datasets • Example types: Andy → Andrew, Kai → Keiko, Jenny → Xing • Two-fold problem: map terms to person nodes (co-occurrence), and disambiguate between candidates (context)

  14. Methods • 1. Baseline: string matching (& common nicknames): find persons that are similar to the name term (Jaro measure); lexical similarity • 2. Graph walk: Term. Vq: the name term node; captures co-occurrence • 3. Graph walk: Term + File. Vq: name term + file nodes; captures co-occurrence and ambiguity, but incorporates additional noise • 4. Graph walk: Term + File, Reranked. Re-rank (3) using: path-describing features, 'source count' (do the paths originate from a single source node or from two), and string similarity
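
A sketch of the string-matching baseline (method 1); the paper uses the Jaro measure, but difflib's ratio is substituted here because it ships with Python, and the tiny nickname table is purely illustrative. Methods 2 and 3 correspond to running lazy_walk (sketched earlier) with Vq placed on the name-term node, and on the name-term plus file nodes, respectively.

    import difflib

    NICKNAMES = {"andy": "andrew", "dave": "david"}    # illustrative, not the authors' list

    def baseline_rank(name_term, person_nodes):
        # Rank candidate person nodes by lexical similarity to the (expanded) name term.
        query = NICKNAMES.get(name_term.lower(), name_term.lower())
        scored = [(p, difflib.SequenceMatcher(None, query, p.lower()).ratio())
                  for p in person_nodes]
        return sorted(scored, key=lambda kv: -kv[1])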

  15-18. Results: Mgmt. game corpus [results charts shown incrementally across four slides; values not recoverable from the transcript]

  19. Results [results charts for the Mgmt. Game, Enron:Sager-E, and Enron:Shapiro-R corpora; values not recoverable from the transcript]

  20. Threading • There are often irregularities in thread structural information (Lewis and Knowles, 1997) • Threading can improve message categorization into topical folders (Klimt and Yang, 2004) • Adjacent messages in a thread can be assumed to be the most similar messages to each other in the corpus, so threading serves as an approximation for finding similar messages • Given: a file • Output: ranked file nodes, where files adjacent in the thread are the correct answers
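
A sketch of how this evaluation could be scored with mean average precision (MAP), treating the thread-adjacent files as the relevant set; the function names and input format are assumptions.

    def average_precision(ranked_files, relevant):
        hits, total = 0, 0.0
        for i, f in enumerate(ranked_files, start=1):
            if f in relevant:
                hits += 1
                total += hits / i
        return total / max(len(relevant), 1)

    def mean_average_precision(queries):
        # queries: list of (ranked file list, set of thread-adjacent files) pairs
        return sum(average_precision(r, rel) for r, rel in queries) / len(queries)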

  21. The Joint Graph filex Shared content Social network Timeline

  22. Threading: experiments • 1. Baseline: TF-IDF similarity. Consider all the available information (header & body) as text • 2. Graph walk: uniform weights. Vq: file, 2 steps • 3. Graph walk: random weights. Vq: file, 2 steps (best out of 10) • 4. Graph walk: reranked. Re-rank the output of (3) using the path-describing features
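
A sketch of the TF-IDF baseline (method 1), assuming scikit-learn is available; treating each message's header and body as one plain-text document is the only preprocessing shown.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def tfidf_rank(messages, query_index):
        """messages: list of strings (header & body text per file);
        returns the other message indices ranked by cosine similarity to the query."""
        tfidf = TfidfVectorizer().fit_transform(messages)
        sims = cosine_similarity(tfidf[query_index], tfidf).ravel()
        order = sims.argsort()[::-1]
        return [i for i in order if i != query_index]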

  23. Results: Threading (MAP) • Mgmt. Game: 73.8, 71.5, 60.3, 58.4, 50.2, 36.2 • Enron:Farmer: 79.8, 65.7, 65.1, 36.1 • Text fields per configuration: header & body only; header & body + subject; header & body + subject + reply lines • [original table layout mapping scores to methods not recoverable from the transcript]

  24. Main Contributions • Presented an extended similarity measure incorporating non-textual objects • Performed finite lazy random walks for typed search • Proposed a re-ranking paradigm to improve on graph-walk results • Instantiated this framework for Email • The Enron datasets and corpora are available online

  25. Future directions • Scalability: • Sampling-based approximation to iterative matrix multiplication • 10-step walks on a million-node corpus in 10-15 seconds • Language Model • Learning: • Adjust the weights • Eliminate noise in contextual/complex queries • Timeline
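
A sketch of what a sampling-based approximation might look like, simulating independent lazy walks and counting where they end instead of multiplying matrices; this reuses transition_probs from the earlier sketch, and the sample count is an arbitrary illustrative value.

    import random
    from collections import Counter

    def sampled_walk(graph, v_q, theta, k=10, gamma=0.5, n_samples=1000):
        counts = Counter()
        nodes, weights = zip(*v_q.items())
        for _ in range(n_samples):
            x = random.choices(nodes, weights=weights)[0]   # draw a start node from v_q
            for _ in range(k):
                if random.random() < gamma:
                    continue                                # lazy step: stay in place
                trans = transition_probs(graph, x, theta)
                if not trans:
                    break
                ys, ps = zip(*trans.items())
                x = random.choices(ys, weights=ps)[0]
            counts[x] += 1
        return counts.most_common()                         # approximate output ranking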

  26. Related Research • IR: infinite walks for node centrality, graph walks for query expansion, spreading activation over semantic/association networks • Data Mining: relational data representation • Machine Learning: semi-supervised learning in graphs

  27. Thank you! Questions?
