Explore a graph-based approach to contextual search in email, applied to person name disambiguation and message threading using lazy graph walks and learned reranking. Evaluation and future research directions are also discussed.
Contextual Search and Name Disambiguation in Email using Graphs • Einat Minkov, William W. Cohen, Andrew Y. Ng • SIGIR-2006
Outline • Extended similarity measure using graph walks • Instantiation for Email • Learning • Evaluation • Person Name Disambiguation • Threading • Summary and future directions
Object Similarity • Textual similarity measures model D-D (or Q-D) similarity • However, in structured data, documents are not isolated • We are interested in extending text-based similarity measures to settings with complex structure: • represent structured data as a graph • derive object similarity using lazy graph walks • We instantiate this framework for Email (a special case of structured data)
Email as a Graph [diagram: an example graph fragment in which a file node (file1) connects to person nodes (Chris Germany, Melissa Germany), email-address nodes (chris.germany@enron.com, mgermany@ch2m.com), a date node (1.22.00), and term nodes (work, where, yo, I'm, you) via edges such as sent_from, sent_from_email, sent_to, sent_to_email, alias, on_date, has_subj_term, and has_term]
Email as a Graph • A directed graph • A node carries an entity type • An edge carries a relation type • Edges are bi-directional (cyclic) • Nodes inter-connect via linked entities.
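As a concrete illustration of this representation, the sketch below shows one way a typed, edge-labeled graph like the example above could be stored. The adjacency structure and helper names are assumptions made for the sketch, not the authors' implementation.

```python
# Minimal sketch of a typed, edge-labeled email graph (illustrative only).
# Node and edge names are hypothetical; every edge is added in both directions.
from collections import defaultdict

node_type = {}                                  # node id -> entity type
edges = defaultdict(lambda: defaultdict(set))   # node -> edge label -> neighbor set

def add_node(node, ntype):
    node_type[node] = ntype

def add_edge(x, label, y, inverse_label=None):
    """Add a bi-directional typed edge x --label--> y and its inverse."""
    edges[x][label].add(y)
    edges[y][inverse_label or label + "_inv"].add(x)

# A tiny fragment of the example graph.
add_node("file1", "file");           add_node("chris.germany@enron.com", "email-address")
add_node("Chris Germany", "person"); add_node("work", "term")
add_edge("file1", "sent_from_email", "chris.germany@enron.com")
add_edge("chris.germany@enron.com", "alias", "Chris Germany")
add_edge("file1", "has_term", "work")

print(dict(edges["file1"]))   # outgoing labeled edges of the file node
```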
Edge Weights • Graph G: nodes x, y, z; node types T(x), T(y), T(z); edge labels ℓ ∈ L; per-label parameters θℓ • Edge weight for x →ℓ y: w(x →ℓ y) = (θℓ / Σℓ'∈L(x) θℓ') · (1 / |{y' : x →ℓ y'}|) • This defines a probability distribution over neighbors: a. pick an outgoing edge label ℓ (with probability proportional to θℓ) b. pick node y uniformly among the nodes reachable from x via ℓ
Graph Similarity • Defined by lazy graph walks over k steps • Given: a stay probability γ (larger values favor shorter paths), a transition matrix M, and an initial node distribution Vq • Output node distribution after k lazy-walk steps: Rk = Vq [γI + (1 − γ)M]^k • We use this platform to perform search for related items in the graph: a query is an initial distribution Vq over nodes and a desired output type Tout
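As an illustrative sketch of the walk just described (assumptions: a small dictionary-based toy graph and uniform edge-label parameters; this is not the paper's implementation), the code below runs a k-step lazy walk in which the walker stays put with probability gamma, and otherwise picks an outgoing edge label and then a neighbor uniformly.

```python
# Illustrative k-step lazy graph walk (hypothetical toy graph, uniform label weights).
from collections import defaultdict

# node -> edge label -> list of neighbor nodes
graph = {
    "andy":  {"has_term_inv": ["file1", "file2"]},
    "file1": {"has_term": ["andy"], "sent_from": ["Andrew Smith"]},
    "file2": {"has_term": ["andy"], "sent_from": ["Andy Zipper"]},
    "Andrew Smith": {"sent_from_inv": ["file1"]},
    "Andy Zipper":  {"sent_from_inv": ["file2"]},
}

def lazy_walk(graph, v_q, k, gamma=0.5):
    """Return the node distribution after k lazy-walk steps from distribution v_q."""
    dist = dict(v_q)
    for _ in range(k):
        nxt = defaultdict(float)
        for x, p in dist.items():
            nxt[x] += gamma * p                      # stay at x with probability gamma
            labels = graph.get(x, {})
            if not labels:
                nxt[x] += (1 - gamma) * p            # dead end: stay put
                continue
            for label, neighbors in labels.items():  # pick a label (uniform weights here)
                for y in neighbors:                  # then a neighbor uniformly
                    nxt[y] += (1 - gamma) * p / (len(labels) * len(neighbors))
        dist = nxt
    return dict(dist)

print(lazy_walk(graph, {"andy": 1.0}, k=2))
```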
Relation to IDF [diagram: term nodes term1–term7 linked to file nodes file1–file3] • Reduce the graph to files and terms only • One-dimensional search of files, over one step (query = multiple source term nodes) • A natural IDF filter: terms occurring in multiple files will ‘spread’ their probability mass into small fractions over many file nodes
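For a small worked example of this filter: a query term that appears in ten files passes only 1/10 of its probability mass to each of them in a one-step walk, while a term appearing in a single file passes its entire mass to that one file, so rarer terms dominate the resulting file ranking much as high-IDF terms dominate a TF-IDF score.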
Learning • Learn how to better rank graph nodes for a particular task • The parameters θ can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005) • We suggest a re-ranking approach (Collins and Koo, Computational Linguistics, 2005) • takes advantage of ‘global’ features • A training example includes: • a ranked list of l_i nodes • each node represented by m features • at least one known correct node • Features describe the graph walk paths (next slide)
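The paper adopts a Collins-and-Koo-style reranking framework; the snippet below is only a minimal stand-in (a single perceptron-style update over candidate feature vectors, not the reranker used in the paper) to show the shape of a training example: walk-ranked candidates, a feature vector per candidate, and at least one known correct node. The feature names and data are hypothetical.

```python
# Minimal reranking sketch (illustrative; not the reranker used in the paper).
# Each training example: candidate nodes with feature vectors and a correctness label.

def perceptron_rerank_update(weights, candidates):
    """candidates: list of (features: dict, is_correct: bool), pre-ranked by the walk."""
    def score(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    best = max(candidates, key=lambda c: score(c[0]))
    if not best[1]:                                   # predicted top candidate is wrong
        gold = next(c for c in candidates if c[1])    # promote a known correct node
        for f, v in gold[0].items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in best[0].items():
            weights[f] = weights.get(f, 0.0) - v
    return weights

example = [                                # hypothetical path-based features
    ({"edge.has_term": 1.0, "bigram.has_term,sent_from": 1.0}, False),
    ({"edge.alias": 1.0, "bigram.has_term,alias": 1.0}, True),
]
print(perceptron_rerank_update({}, example))
```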
Path-describing Features • The full set of paths reaching a target node within k steps can be recovered • ‘Edge unigrams’: was edge type l used in reaching x from Vq? • ‘Edge bigrams’: were edge types l1 and l2 used (in that order) in reaching x from Vq? • ‘Top edge bigrams’: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths? [diagram: example walk over nodes x1–x5 for k = 0, 1, 2, listing the paths that reach x3 within two steps]
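To make these feature definitions concrete, here is a small illustrative sketch (hypothetical paths and edge labels, not the authors' code) that turns the edge-label sequences of the paths reaching a candidate node into edge-unigram and edge-bigram indicator features.

```python
# Illustrative extraction of edge-unigram / edge-bigram features from walk paths.
# Each path is the sequence of edge labels followed from Vq to the candidate node.

def path_features(paths):
    """paths: list of edge-label sequences, e.g. [["has_term", "sent_from"], ...]."""
    feats = {}
    for labels in paths:
        for l in labels:                              # edge unigrams
            feats["edge." + l] = 1.0
        for l1, l2 in zip(labels, labels[1:]):        # edge bigrams (ordered)
            feats["bigram." + l1 + "," + l2] = 1.0
    return feats

# Hypothetical paths that reached one candidate person node.
paths = [["has_term", "sent_from"], ["has_term", "alias"]]
print(path_features(paths))
```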
Outline • Extended similarity measure using graph walks • Instantiation for Email • Learning • Evaluation • Person Name Disambiguation • Threading • Summary and future directions
Person Name Disambiguation [diagram: the query term node “andy” (“who is Andy?”) linked to file nodes, which link to candidate person nodes] • Given: a term that is known to be a personal name, and is not mentioned ‘as is’ in the header (otherwise the problem is easy) • Output: ranked person nodes
Corpora and Datasets • Example name types: • Andy → Andrew • Kai → Keiko • Jenny → Xing • Two-fold problem: • map terms to person nodes (co-occurrence) • disambiguation (context)
Methods • 1. Baseline: string matching (& common nicknames) • find persons whose names are similar to the name term (Jaro measure) • lexical similarity • 2. Graph walk: Term • Vq: name term node • co-occurrence • 3. Graph walk: Term + File • Vq: name term + file nodes • co-occurrence + ambiguity • but incorporates additional noise • 4. Graph walk: Term + File, reranked • re-rank (3) using: path-describing features, ‘source count’ (do the paths originate from one or from both source nodes), string similarity
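To illustrate how methods 2–4 differ mainly in how the query distribution Vq is built and how the walk output is post-processed, here is a hedged sketch: the toy graph, helper names, and score combination are assumptions, and difflib's similarity ratio stands in for the Jaro measure used in the paper.

```python
# Illustrative name-disambiguation queries (toy data; not the paper's implementation).
import difflib

node_type = {"andy": "term", "file1": "file",
             "Andrew Smith": "person", "Andy Zipper": "person"}

def disambiguate(walk_output, name_term, string_weight=0.1):
    """Keep person nodes from the walk output; nudge the ranking with string similarity."""
    ranked = []
    for node, prob in walk_output.items():
        if node_type.get(node) != "person":
            continue
        sim = difflib.SequenceMatcher(None, name_term, node.lower()).ratio()
        ranked.append((prob + string_weight * sim, node))   # crude stand-in for reranking
    return [n for _, n in sorted(ranked, reverse=True)]

# Method 2: Vq = {name term}; Method 3: Vq = {name term, the file the name appears in}.
v_q_term      = {"andy": 1.0}
v_q_term_file = {"andy": 0.5, "file1": 0.5}

# walk_output would come from a lazy walk as sketched earlier; shown here as a stub.
walk_output = {"Andrew Smith": 0.30, "Andy Zipper": 0.25, "file1": 0.20}
print(disambiguate(walk_output, "andy"))
```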
Results (name disambiguation) [accuracy charts for the Mgmt. Game corpus and the Enron Sager-E and Shapiro-R corpora; per-method values not recoverable here]
Threading • There are often irregularities in thread structural information (Lewis and Knowles, 1997) • Threading can improve message categorization into topical folders (Klimt and Yang, 2004) • Adjacent messages in a thread can be assumed to be among the most similar messages in the corpus, so recovering them approximates the task of finding similar messages • Given: a file • Output: ranked file nodes, where adjacent files in the same thread are the correct answers
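In this framing, threading is just another typed search: the query distribution puts all mass on a single file node, the walk runs for a couple of steps, and only file nodes other than the query itself are kept. A minimal sketch, assuming a walk output in the same form as the earlier sketches (node names are hypothetical):

```python
# Illustrative threading query: rank other file nodes given one file node (toy data).
node_type = {"file1": "file", "file2": "file", "file3": "file",
             "Chris Germany": "person", "work": "term"}

def rank_thread_candidates(walk_output, query_file):
    """Keep file nodes other than the query, ranked by walk probability."""
    hits = [(p, n) for n, p in walk_output.items()
            if node_type.get(n) == "file" and n != query_file]
    return [n for _, n in sorted(hits, reverse=True)]

v_q = {"file1": 1.0}                      # query: a single file node, 2-step walk
walk_output = {"file1": 0.40, "file2": 0.25, "file3": 0.10, "Chris Germany": 0.15}
print(rank_thread_candidates(walk_output, "file1"))   # -> ['file2', 'file3']
```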
The Joint Graph [diagram: the query file node file_x reaches adjacent thread messages through shared content, the social network, and the timeline]
Threading: experiments • 1. Baseline: TF-IDF similarity • consider all the available information (header & body) as text • 2. Graph walk: uniform weights • Vq: file, 2 steps • 3. Graph walk: random weights • Vq: file, 2 steps (best out of 10) • 4. Graph walk: reranked • re-rank the output of (3) using the graph-describing features
Results (threading) [bar charts of MAP by the text available to each method: header & body / subject / reply lines; MAP values between 36.2 and 73.8 on Mgmt. Game and between 36.1 and 79.8 on Enron: Farmer; the per-method pairing of values is not recoverable here]
Main Contributions • Presented an extended similarity measure incorporating non-textual objects • Perform finite lazy random walks for typed search • A re-ranking paradigm to improve on graph walk results • Instantiation of this framework for Email • Enron Datasets and corpora are available online
Future directions • Scalability: • Sampling-based approximation to iterative matrix multiplication • 10-step walks on a million-node corpus in 10-15 seconds • Language Model • Learning: • Adjust the weights • Eliminate noise in contextual/complex queries • Timeline
Related Research • IR: • infinite walks for node centrality • graph walks for query expansion • spreading activation over semantic/association networks • Data Mining: • relational data representation • Machine Learning: • semi-supervised learning in graphs