Contextual Search and Name Disambiguation in Email using Graphs Einat Minkov William W. Cohen Andrew Y. Ng SIGIR-2006
Outline • Extended similarity measure using graph walks • Instantiation for Email • Learning • Evaluation • Person Name Disambiguation • Threading • Summary and future directions
Object Similarity • Textual similarity measures model document-to-document (or query-to-document) similarity • However, in structured data, documents are not isolated • We are interested in extending text-based similarity measures to structured settings: • Represent the structured data as a graph • Derive object similarity using lazy graph walks • We instantiate this framework for Email (a special case of structured data)
Email as a Graph [figure: an example graph centered on message node file1, with edges sent_from / sent_from_email to Chris.germany@enron.com (alias term “Chris”), sent_to / sent_to_email to Mgermany@ch2m.com, on_date to 1.22.00, has_subj_term to terms such as “Melissa” and “Germany”, and has_term to body terms such as “work”, “where”, “you”]
Email as a Graph • A directed graph • A node carries an entity type • An edge carries a relation type • Edges are bi-directional (cyclic) • Nodes inter-connect via linked entities.
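A minimal sketch of this representation in Python, assuming nothing beyond the slide: typed nodes plus labeled edges stored in both directions so every relation can be traversed either way. The class and method names (EmailGraph, add_edge) are illustrative, not the authors' code.

```python
from collections import defaultdict

class EmailGraph:
    """Typed nodes connected by labeled, bi-directional edges."""
    def __init__(self):
        self.node_type = {}                               # node id -> entity type (file, person, term, ...)
        self.out = defaultdict(lambda: defaultdict(set))  # x -> edge label -> set of target nodes

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, x, label, y, inverse_label=None):
        # Store the edge and its inverse, so every relation is traversable both ways.
        self.out[x][label].add(y)
        self.out[y][inverse_label or label + "_inv"].add(x)

g = EmailGraph()
g.add_node("file1", "file")
g.add_node("chris.germany@enron.com", "email_address")
g.add_node("germany", "term")
g.add_edge("file1", "sent_from_email", "chris.germany@enron.com")
g.add_edge("file1", "has_term", "germany", "in_file")
```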
Edge Weights • Graph G: nodes x, y, z; node types T(x), T(y), T(z); edge labels ℓ; one parameter θℓ per edge label • Edge weight x → y: sum, over the labels ℓ connecting x to y, of (θℓ / sum of θℓ' over all labels ℓ' leaving x) × 1 / (number of nodes reachable from x by label ℓ) • Equivalently, a probability distribution over the next node: a. Pick an outgoing edge label ℓ with probability proportional to θℓ b. Pick node y uniformly among the nodes reached from x by ℓ
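The rule above can be written as a short function over the EmailGraph sketch from the previous slide; defaulting a missing θ to 1.0 (a uniform choice among labels) is an assumption for illustration.

```python
from collections import defaultdict

def transition_probs(g, x, theta):
    """Pr(x -> y): pick an outgoing edge label with probability proportional to
    theta[label], then pick a target node uniformly among that label's targets."""
    labels = g.out[x]
    norm = sum(theta.get(l, 1.0) for l in labels)
    probs = defaultdict(float)
    for l, targets in labels.items():
        p_label = theta.get(l, 1.0) / norm
        for y in targets:
            probs[y] += p_label / len(targets)
    return probs  # values sum to 1 when x has outgoing edges
```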
Graph Similarity • Defined by lazy graph walks over k steps. Given: • a stay probability γ (larger values favor shorter paths) • a transition matrix M built from the edge weights above • an initial node distribution Vq • the output node distribution is Vq [γI + (1 - γ)M]^k • We use this platform to perform SEARCH of related items in the graph: a query is an initial distribution Vq over nodes plus a desired output type Tout
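A sketch of the k-step lazy walk under these definitions, reusing transition_probs above; the γI + (1 - γ)M form is applied node by node rather than as an explicit matrix, and the function and argument names are assumptions.

```python
from collections import defaultdict

def lazy_walk(g, vq, theta, k=2, gamma=0.5, out_type=None):
    """Run a k-step lazy walk from the initial distribution vq and rank nodes
    of type out_type by their final probability mass."""
    dist = dict(vq)
    for _ in range(k):
        nxt = defaultdict(float)
        for x, p in dist.items():
            nxt[x] += gamma * p                            # stay at x
            for y, q in transition_probs(g, x, theta).items():
                nxt[y] += (1.0 - gamma) * p * q            # move to a neighbor
        dist = nxt
    ranked = [(n, p) for n, p in dist.items()
              if out_type is None or g.node_type.get(n) == out_type]
    return sorted(ranked, key=lambda item: -item[1])
```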
Relation to IDF [figure: a bipartite example with file nodes file1–file3 and term nodes term1–term7] • Reduce the graph to files and terms only • One-dimensional search of files, over one step (query = multiple source term nodes) • A natural IDF filter: terms occurring in multiple files will ‘spread’ their probability mass into small fractions over many file nodes
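A toy illustration of this effect using the sketches above (the node names here are invented): a term that occurs in every file spreads its mass thinly, while a term occurring in a single file concentrates its mass there.

```python
g2 = EmailGraph()
for f in ["file1", "file2", "file3"]:
    g2.add_node(f, "file")
for t in ["common", "budget"]:
    g2.add_node(t, "term")
for f in ["file1", "file2", "file3"]:
    g2.add_edge(f, "has_term", "common", "in_file")    # "common" occurs in every file
g2.add_edge("file1", "has_term", "budget", "in_file")  # "budget" occurs only in file1

vq = {"common": 0.5, "budget": 0.5}
print(lazy_walk(g2, vq, theta={}, k=1, gamma=0.0, out_type="file"))
# file1 ~ 0.67 (0.5 + 0.5/3); file2 and file3 ~ 0.17 each
```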
Learning • Learn how to better rank graph nodes for a particular task • The parameters θ can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005) • We suggest a re-ranking approach (Collins and Koo, Computational Linguistics, 2005) • takes advantage of ‘global’ features • A training example includes: • a ranked list of l_i nodes • each node represented by m features • at least one known correct node • Features will describe the graph walk paths
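A minimal sketch of the re-ranking idea, not the paper's actual learner: each candidate keeps its walk score plus a set of binary features, and a linear model with learned weights re-scores the list. How the weights are trained is left out, and every name here is an assumption.

```python
def rerank(candidates, weights):
    """candidates: list of (node, walk_score, feature_set);
    weights: dict mapping a feature (or 'walk_score') to a learned weight."""
    def score(cand):
        node, walk_score, feats = cand
        return (weights.get("walk_score", 1.0) * walk_score
                + sum(weights.get(f, 0.0) for f in feats))
    return sorted(candidates, key=score, reverse=True)
```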
Path-describing Features • The full set of paths reaching a target node x within k steps can be recovered. [figure: a small example graph x1–x5 unrolled over k = 0, 1, 2, showing the paths that reach x3 at k = 2] • ‘Edge unigrams’: was edge type l used in reaching x from Vq? • ‘Edge bigrams’: were edge types l1 and l2 used (in that order) in reaching x from Vq? • ‘Top edge bigrams’: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?
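A sketch of the first two feature families; it assumes the walk has been instrumented to record, for each candidate, the sequences of edge labels followed from Vq (that bookkeeping is not shown here).

```python
def path_features(paths):
    """paths: list of edge-label sequences that reached the candidate,
    e.g. [["in_file", "sent_from_email"], ...]."""
    feats = set()
    for labels in paths:
        for l in labels:
            feats.add(("edge_unigram", l))
        for l1, l2 in zip(labels, labels[1:]):
            feats.add(("edge_bigram", l1, l2))
    return feats
```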
Outline • Extended similarity measure using graph walks • Instantiation for Email • Learning • Evaluation • Person Name Disambiguation • Threading • Summary and future directions
Person Name Disambiguation [figure: the query “who is Andy?” shown as a term node “andy” linked through file nodes to candidate person nodes] • Given: • a term that is known to be a personal name • the name is not mentioned ‘as is’ in the header (otherwise, the problem is easy) • Output: • ranked person nodes
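Under the sketches above, the disambiguation query is simply a walk whose initial distribution sits on the name-term node (and, in the richer variants, also on the file it appears in) and whose output type is person. Node names and parameters below are hypothetical.

```python
vq = {"andy": 1.0}   # all mass on the ambiguous name-term node (hypothetical)
people = lazy_walk(g, vq, theta={}, k=2, gamma=0.5, out_type="person")
```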
Corpora and Datasets • Example name types: • Andy → Andrew • Kai → Keiko • Jenny → Xing • Two-fold problem: • Map terms to person nodes (co-occurrence) • Disambiguation (context)
Methods • 1. Baseline: string matching (& common nicknames) • Find persons that are similar to the name term (Jaro measure) • Captures lexical similarity • 2. Graph walk: Term • Vq: the name term node • Captures co-occurrence • 3. Graph walk: Term + File • Vq: name term + file nodes • Captures co-occurrence and ambiguity • but incorporates additional noise • 4. Graph walk: Term + File, Reranked • Re-rank (3) using: • path-describing features • ‘source count’: do the paths originate from a single source node or from both? • string similarity
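A sketch of the baseline step only. The slide names the Jaro measure; since no particular library is specified, difflib's ratio is used below purely as a stand-in string-similarity score.

```python
import difflib

def string_match_baseline(name_term, person_names):
    """Rank candidate person names by string similarity to the query term."""
    def sim(p):
        return difflib.SequenceMatcher(None, name_term.lower(), p.lower()).ratio()
    return sorted(person_names, key=sim, reverse=True)

print(string_match_baseline("andy", ["Andrew Wilson", "Andrea Ring", "Kai Tanaka"]))
```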
Results [figures: name disambiguation results on the Mgmt. Game corpus and on the Enron Sager-E and Shapiro-R corpora]
Threading • There are often irregularities in thread structural information (Lewis and Knowles, 1997) • Threading can improve message categorization into topical folders (Klimt and Yang, 2004) • Adjacent messages in a thread can be assumed to be the messages most similar to each other in the corpus, so thread recovery approximates the task of finding similar messages in a corpus • Given: • a file • Output: • ranked file nodes • files adjacent to the query file in its thread are the correct answers
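Under the earlier sketches, a threading query is a walk whose initial distribution is a single file node and whose output type is file; the evaluation rewards ranking thread-adjacent files highly. The helper names below are assumptions.

```python
def thread_query(g, file_node, theta, k=2, gamma=0.5):
    """Rank other files by similarity to file_node via the lazy walk."""
    ranking = lazy_walk(g, {file_node: 1.0}, theta, k=k, gamma=gamma, out_type="file")
    return [(f, s) for f, s in ranking if f != file_node]   # drop the query file itself

def reciprocal_rank(ranking, correct_files):
    """1/rank of the first thread-adjacent file retrieved, 0 if none appears."""
    for rank, (f, _) in enumerate(ranking, start=1):
        if f in correct_files:
            return 1.0 / rank
    return 0.0
```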
The Joint Graph [figure: a query file node filex reaches related files through shared content, the social network, and the timeline]
Threading: experiments • 1. Baseline: TF-IDF similarity • Consider all the available information (header & body) as text • 2. Graph walk: uniform weights • Vq: file, 2 steps • 3. Graph walk: random weights • Vq: file, 2 steps (best out of 10) • 4. Graph walk: reranked • Re-rank the output of (3) using the path-describing features
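A sketch of the TF-IDF baseline only, under the assumption that scikit-learn is available: each message's available text is treated as one document, and messages are ranked by cosine similarity to the query message.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_ranking(query_text, corpus_texts):
    """Rank corpus messages by TF-IDF cosine similarity to the query message."""
    vec = TfidfVectorizer()
    matrix = vec.fit_transform([query_text] + corpus_texts)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(range(len(corpus_texts)), key=lambda i: -sims[i])
```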
Results (MAP) [figures: Mgmt. Game with values 36.2, 50.2, 58.4, 60.3, 71.5, 73.8, and Enron: Farmer with values 36.1, 65.1, 65.7, 79.8; the bars compare the TF-IDF baseline over different text fields (header & body, subject, reply lines) against the graph-walk methods]
Main Contributions • An extended similarity measure incorporating non-textual objects • Finite lazy random walks for typed search • A re-ranking paradigm to improve on graph-walk results • An instantiation of this framework for Email • The Enron datasets and corpora are available online
Future directions • Scalability: • Sampling-based approximation to iterative matrix multiplication • 10-step walks on a million-node corpus in 10-15 seconds • Language Model • Learning: • Adjust the weights • Eliminate noise in contextual/complex queries • Timeline
Related Research • IR: • Infinite walks for node centrality • Graph walks for query expansion • Spreading activation over semantic/association networks • Data Mining: • Relational data representation • Machine Learning: • Semi-supervised learning in graphs