On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data

On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data William W. Cohen Center for Automated Learning and Discovery + Language Technology Institute + Center for Bioimage Informatics + Joint CMU-Pitt Program in Bioinformatics Carnegie Mellon University

On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data William W. Cohen Machine Learning Department + Language Technology Institute + Center for Bioimage Informatics + Joint CMU-Pitt Program in Bioinformatics Carnegie Mellon University

On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data William W. Cohen Carnegie Mellon University joint work with: Einat Minkov (CMU) Andrew Ng (Stanford)

Outline
• Motivation: why I’m interested in
  • structured data that is partly text;
  • structured data represented as graphs;
  • measuring similarity of nodes in graphs
• Contributions:
  • a simple query language for graphs;
  • experiments on natural types of queries;
  • techniques for learning to answer queries of a certain type better

“A Little Knowledge is A Dangerous Thing” [A. Pope, 1709] • Three centuries later, we’ve learned that a lot of knowledge is also sort of dangerous.... • ... so how do we deal with information overload?

One approach: adding structure to unstructured information
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
IE: ... by recognizing entity names... and relationships between them...
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

One approach: adding structure to unstructured information [Carvalho, Cohen SIGIR05; Cohen, Carvalho, Mitchell EMNLP 04]

One approach: adding structure to unstructured information [Mitchell et al CEAS 2004]


One approach: adding structure to unstructured information [McCallum et al IJCAI05]

Is converting unstructured data to structured data enough?

Limitations of structured data
• What is the email address for the person named “Halevy” mentioned in this presentation?
• What files from my home machine will I need for this meeting?
• What people will attend this meeting?
• ...
• Diversity: many different types of information from many different sources, that arise to fill many different needs.
• Uncertainty: information from many sources (like IE programs or the web) need not be correct.
• Complexity of interaction: formulating ‘information needs’ as queries to a DB can be difficult... especially a heterogeneous DB, with a complex/changing schema.
How do you discover & access the tens or hundreds of structured databases?
How do you understand & combine the hundreds of schemata, with thousands of fields?
How do you relate the thousands or millions or ... of entity identifiers from the different databases?
How can you include many diverse sources of information in a single database?

When are two entities the same? When is referent(oid1)=referent(oid2)?
• Bell Labs
• Bell Telephone Labs
• AT&T Bell Labs
• AT&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations
[1925]
History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com]
Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]

Is there a definition of ‘entity identity’ that is user- and purpose-independent?
[Figure: entity names marked as = or ≠; only the label “Bell Telephone Labs” survives]

When are two entities the same? “Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)… King Milinda and Nagasena (the Buddhist sage) discuss … personal identity… Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: … not … the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc… Milinda concludes that "Nagasena" doesn't stand for anything… If we can't say what a person is, then how do we know a person is the same person through time? … There's really no you, and if there's no you, there are no beliefs or desires for you to have… The folk psychology picture is profoundly misleading and believing it will make you miserable.” -S. LaFave

Traditional approach: Linkage, then Queries
Uncertainty about what to link must be decided by the integration system, not the end user

WHIRL vision:
Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b
Link items as needed by Q
• Strongest links: those agreeable to most users
• Weaker links: those agreeable to some users
• even weaker links…

WHIRL vision: DB1 + DB2 ≠ DB
Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ means TFIDF-similar)
Link items as needed by Q
Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.
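The ranked "soft join" on this slide can be illustrated with a small sketch. This is not WHIRL itself: the tokenizer, the exact TF-IDF weighting, and the names `tfidf_vectors` and `soft_join` are assumptions made for illustration, and the sketch scores all pairs eagerly rather than producing the list incrementally.

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """Map each string to a unit-length TF-IDF vector (dict: token -> weight)."""
    docs = [Counter(s.lower().split()) for s in strings]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)   # document frequency per token
    vecs = []
    for d in docs:
        # One common weighting choice (an assumption, not taken from the slides).
        v = {t: math.log(1 + tf) * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def soft_join(left, right):
    """Return (score, l, r) for every pair with nonzero cosine similarity, best first."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for l, a in zip(left, lv):
        for r, b in zip(right, rv):
            score = sum(w * b[t] for t, w in a.items() if t in b)
            if score > 0:
                pairs.append((score, l, r))
    return sorted(pairs, reverse=True)

# Toy relations R.a and S.a (hypothetical data).
links = soft_join(["AT&T Bell Labs", "Carnegie Mellon University"],
                  ["Bell Labs", "Shannon Labs", "Carnegie Mellon"])
for score, l, r in links:
    print(f"{score:.3f}  {l!r} ~ {r!r}")
```

A consumer of `links` can stop reading the list at whatever score it considers too weak, which is the point of returning a ranking instead of a hard join.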

Outline
• Motivation: why I’m interested in
  • structured data that is partly text: similarity!
  • structured data represented as graphs;
  • measuring similarity of nodes in graphs
• Contributions:
  • a simple query language for graphs;
  • experiments on natural types of queries;
  • techniques for learning to answer queries of a certain type better
There are general-purpose, fast, robust similarity measures for text, which are useful for data integration... and hence for combining information from multiple sources.

Limitations of structured data
• What is the email address for the person named “Halevy” mentioned in this presentation?
• What files from my home machine will I need for this meeting?
• What people will attend this meeting?
• ...
• Diversity: many different types of information from many different sources, that arise to fill many different needs.
• Uncertainty: information from many sources (like IE programs or the web) need not be correct.
• Complexity of interaction: formulating ‘information needs’ as queries to a DB can be difficult... especially a heterogeneous one.
How can you exploit structure without understanding the structure?

Schema-free structured search
• DataSpot (DTL)/Mercado Intuifind: [VLDB 98]
• Proximity Search: [VLDB98]
• Information units (linked Web pages): [WWW10]
• Microsoft DBExplorer, Microsoft English query
• BANKS (Browsing ANd Keyword Search): [Chakrabarti & others, VLDB 02, VLDB 05]

BANKS: Basic Data Model
• Database is modeled as a graph
  • Nodes = tuples
  • Edges = references between tuples
    • edges are directed.
    • foreign key, inclusion dependencies, ..
User need not know organization of database to formulate queries.
[Figure labels: BANKS: Keyword search…; paper: MultiQuery Optimization; writes; author: Charuta, S. Sudarshan, Prasan Roy]

BANKS: Answer to Query
Query: “sudarshan roy”
Answer: subtree from graph
[Figure labels: paper: MultiQuery Optimization; writes, writes; author: S. Sudarshan; author: Prasan Roy]


A not-quite-so-basic data model (extending the BANKS basic data model)
• All information is modeled as a graph
  • Nodes = tuples or documents or strings or words
  • Edges = references between nodes
    • edges are directed, labeled and weighted
    • foreign key, inclusion dependencies, ...
    • doc/string D to word contained by D (TFIDF weighted, perhaps)
    • word W to doc/string containing W (inverted index)
    • [string S to strings ‘similar to’ S]
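A minimal sketch of how strings and words might be poured into such a graph. The edge labels `has_term` and `in_doc`, the tokenizer, and the weighting formula are assumptions for illustration, not the representation used in the talk's system.

```python
import math
import re
from collections import Counter, defaultdict

def build_graph(docs):
    """docs: dict id -> text. Returns edges[src][label] = {dst: weight}."""
    edges = defaultdict(lambda: defaultdict(dict))
    counts = {d: Counter(re.findall(r"[a-z0-9]+", text.lower()))
              for d, text in docs.items()}
    df = Counter(w for c in counts.values() for w in c)
    n = len(docs)
    for d, c in counts.items():
        for w, tf in c.items():
            # doc/string -> word edge with a TF-IDF-style weight (one possible choice)
            edges[("doc", d)]["has_term"][("term", w)] = tf * math.log(1 + n / df[w])
            # word -> doc/string edge, i.e. the inverted index
            edges[("term", w)]["in_doc"][("doc", d)] = 1.0
    return edges

def two_step_paths(edges, src, dst):
    """Count the length-2 paths src -> mid -> dst."""
    count = 0
    for mids in edges[src].values():
        for mid in mids:
            for outs in edges[mid].values():
                if dst in outs:
                    count += 1
    return count

graph = build_graph({"s1": "William W. Cohen, CMU", "s2": "Dr. W. W. Cohen"})
shared = two_step_paths(graph, ("doc", "s1"), ("doc", "s2"))
print(shared)  # one path per shared word
```

Because similar strings share words, they end up connected by several two-step paths even without explicit "similar to" edges.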

Outline
• Motivation: why I’m interested in
  • structured data that is partly text – similarity!
  • structured data represented as graphs; all sorts of information can be poured into this model.
  • measuring similarity of nodes in graphs
• Contributions:
  • a simple query language for graphs;
  • experiments on natural types of queries;
  • techniques for learning to answer queries of a certain type better

Yet another schema-free query language
• Assume data is encoded in a graph with:
  • a node for each object x
  • a type of each object x, T(x)
  • an edge for each binary relation r: x → y
• Queries are of this form:
  • Given type t* and node x, find y: T(y)=t* and y~x.
• We’d like to construct a general-purpose similarity function x~y for objects in the graph (node similarity).
• We’d also like to learn many such functions for different specific tasks (like “who should attend a meeting”)
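The query form "given type t* and node x, find y: T(y)=t* and y~x" can be sketched as a filter-then-rank step over any similarity function. The function name `answer_query`, the toy node types, and the hand-written similarity table are hypothetical.

```python
def answer_query(x, t_star, node_type, similarity):
    """Return nodes y with T(y) == t_star, ranked by similarity to x (best first)."""
    scores = similarity(x)                      # dict: node y -> score for y ~ x
    hits = [(y, s) for y, s in scores.items()
            if node_type.get(y) == t_star and s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): node types and a hand-written similarity table.
node_type = {"meeting1": "meeting", "alice": "person",
             "bob": "person", "notes.txt": "file"}
toy_scores = {"meeting1": {"alice": 0.4, "notes.txt": 0.3,
                           "bob": 0.1, "meeting1": 0.2}}

ranked = answer_query("meeting1", "person", node_type, lambda x: toy_scores[x])
print(ranked)
```

Keeping the similarity function as a parameter mirrors the slide's goal of swapping in different learned functions for different tasks.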

Similarity of Nodes in Graphs
Given type t* and node x, find y: T(y)=t* and y~x.
• Similarity defined by “damped” version of PageRank
• Similarity between nodes x and y:
  • “Random surfer model”: from a node z,
    • with probability α, stop and “output” z
    • otherwise, pick an edge label r using Pr(r | z) ... e.g. uniform
    • pick a y uniformly from { y’ : z → y’ with label r }
    • repeat from node y ....
  • Similarity x~y = Pr( “output” y | start at x)
• Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
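The random surfer model above can be computed exactly on a small graph by pushing probability mass forward step by step. This is a minimal sketch assuming uniform Pr(r | z), a truncation after `max_steps` edge traversals, and that mass arriving at a node with no outgoing edges is simply dropped if it does not stop there; the function name and the toy graph are illustrative.

```python
from collections import defaultdict

def walk_similarity(edges, start, alpha=0.5, max_steps=10):
    """Approximate Pr("output" y | start at x), truncated after max_steps traversals.

    edges: dict node -> dict edge label -> list of neighbour nodes.
    """
    current = {start: 1.0}            # probability of being at each node, still walking
    output = defaultdict(float)
    for _ in range(max_steps + 1):
        nxt = defaultdict(float)
        for z, p in current.items():
            output[z] += alpha * p    # with probability alpha, stop and output z
            labels = {r: ys for r, ys in edges.get(z, {}).items() if ys}
            for ys in labels.values():
                # uniform over edge labels, then uniform over targets of that label
                share = (1 - alpha) * p / len(labels) / len(ys)
                for y in ys:
                    nxt[y] += share
        current = nxt
    return dict(output)

toy_edges = {"x": {"a": ["y1", "y2"]}, "y1": {"b": ["z"]}}
sim = walk_similarity(toy_edges, "x", alpha=0.5)
print(sorted(sim.items(), key=lambda kv: -kv[1]))
```

Each extra edge on a path multiplies its contribution by (1 - α) times the branching probabilities, which is the exponential decay with path length mentioned above.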

A not-quite-so-basic data model: example
[Figure: the string nodes “William W. Cohen, CMU” and “Dr. W. W. Cohen” linked to the word nodes william, w, cohen, cmu, dr]
The edge type [string S to strings ‘similar to’ S] is optional: strings that are similar in TFIDF/cosine distance will still be “nearby” in the graph (connected by many length=2 paths).

Similarity of Nodes in Graphs
• Random surfer on graphs:
  • natural extension to PageRank
  • closely related to Lafferty’s heat diffusion kernel
    • but generalized to directed graphs
  • somewhat amenable to learning parameters of the walk (gradient search, w/ various optimization metrics):
    • Toutanova, Manning & Ng, ICML2004
    • Nie et al, WWW2005
    • Xi et al, SIGIR 2005
  • can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
  • our current implementation (GHIRL): Lucene + Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
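To give a feel for the sampling idea, here is a plain Monte Carlo estimate of the damped walk. It is a generic sketch, not the cited matrix-multiplication sampling method and not the GHIRL implementation; the function name, the default sample count, and the toy graph are assumptions.

```python
import random
from collections import Counter

def sampled_similarity(edges, start, alpha=0.5, n_walks=20000, max_steps=10, seed=0):
    """Estimate Pr("output" y | start) by simulating n_walks random surfers.

    edges: dict node -> dict edge label -> list of neighbour nodes.
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_walks):
        z = start
        for _ in range(max_steps + 1):
            if rng.random() < alpha:          # stop and output the current node
                hits[z] += 1
                break
            targets = [ys for ys in edges.get(z, {}).values() if ys]
            if not targets:                   # dead end: this walk outputs nothing
                break
            z = rng.choice(rng.choice(targets))   # uniform label, then uniform target
    return {y: c / n_walks for y, c in hits.items()}

toy_edges = {"x": {"a": ["y1", "y2"]}, "y1": {"b": ["z"]}}
estimate = sampled_similarity(toy_edges, "x")
print(estimate)
```

The cost depends on the number of sampled walks rather than on the graph size, and popular nodes are visited many times, which is why caching pays off.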

Query: “sudarshan roy”
Answer: subtree from graph
[Figure labels: paper: MultiQuery Optimization; writes, writes; author: S. Sudarshan; author: Prasan Roy]

Query: “sudarshan roy”
Answer: subtree from graph
y: paper(y) & y~“roy” AND w: paper(y) & w~“roy”

Evaluation on Personal Information Management Tasks [Minkov et al, SIGIR 2006]
• What is the email address for the person named “Halevy” mentioned in this presentation?
• What files from my home machine will I need for this meeting?
• What people will attend this meeting?
• ...
Many tasks can be expressed as simple, non-conjunctive search queries in this framework. Such as:
• Person Name Disambiguation in Email
• Threading
• Finding email-address aliases given a person’s name
• Finding relevant meeting attendees
[Slide annotations, in extracted order: novel; [eg Diehl, Getoor, Namata, 2006]; [eg Lewis & Knowles 97]; novel]
Also consider a generalization: x → Vq, where Vq is a distribution over nodes x (novel)

Email as a graph
[Figure: graph of an email corpus. Node types: file (file1, file2), email address (address1 to address5), person name (name1 to name5), date (date1, date2), term (term1 to term11). Edge labels: sent_from, sent_to, sent_date, alias, in_file, in_subj, +1_day, and the inverse edges sf_inv, st_inv, sd_inv, a_inv, if_inv, is_inv.]

Person Name Disambiguation
Q: “who is Andy?”
• Given: a term that is known to be a personal name and that is not mentioned ‘as is’ in the header (otherwise, easy)
• Output: ranked person nodes.
[Figure labels: file; term:andy; Person; Person: Andrew Johns]
* This task is complementary to person name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005)

Corpora and Datasets
a. Corpora [table not recovered]
b. Types of names [table not recovered]
Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing

Person Name Disambiguation
• 1. Baseline: String matching (& common nicknames)
  • Find persons that are similar to the name term (Jaro)
  • Successful in many cases
  • Not successful for some nicknames
  • Cannot handle ambiguity (arbitrary)
• 2. Graph walk: term
  • Vq: name term node (2 steps)
  • Models co-occurrences.
  • Cannot handle ambiguity (dominant)
• 3. Graph walk: term+file
  • Vq: name term + file nodes (2 steps)
  • The file node is naturally available context
  • Solves the ambiguity problem!
  • But, incorporates additional noise.
• 4. Graph walk: term+file, reranked using learning
  • Re-rank the output of (3), using:
    • path-describing features
    • ‘source count’: do the paths originate from a single or two source nodes
    • string similarity
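The difference between methods 2 and 3 can be seen on a toy graph: walking from the name term alone favors whichever person co-occurs with the term most often, while adding the file node to Vq pulls the ranking toward the person in that file. Everything here is hypothetical, including the graph, the edge labels, the person "andy_smith", the stopping probability, and the equal weighting of the two query nodes.

```python
from collections import defaultdict

def walk(edges, v_q, alpha=0.5, steps=2):
    """Damped walk started from a distribution v_q (dict node -> probability)."""
    current, output = dict(v_q), defaultdict(float)
    for _ in range(steps + 1):
        nxt = defaultdict(float)
        for z, p in current.items():
            output[z] += alpha * p                      # stop and output z
            labels = {r: ys for r, ys in edges.get(z, {}).items() if ys}
            for ys in labels.values():
                for y in ys:                            # uniform label, uniform target
                    nxt[y] += (1 - alpha) * p / len(labels) / len(ys)
        current = nxt
    return output

def rank_persons(scores):
    """Keep only person nodes, best first."""
    persons = [(n, s) for n, s in scores.items() if n.startswith("person:")]
    return [n for n, _ in sorted(persons, key=lambda kv: -kv[1])]

# "andy" appears in three files; two involve Andy Smith, one involves Andrew Johns.
edges = {
    "term:andy": {"in_file": ["file:1", "file:2", "file:3"]},
    "file:1": {"sent_to": ["person:andy_smith"]},
    "file:2": {"sent_to": ["person:andy_smith"]},
    "file:3": {"sent_to": ["person:andrew_johns"]},
}

term_only = rank_persons(walk(edges, {"term:andy": 1.0}))
term_and_file = rank_persons(walk(edges, {"term:andy": 0.5, "file:3": 0.5}))
print(term_only[0], term_and_file[0])
```

In this toy case the term-only walk returns the dominant co-occurring person, and the term+file walk returns the person tied to the file in which the question was asked.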

Results
[Plots not recovered. Legend: after learning-to-rank; graph walk from {name, file}; graph walk from name; baseline: string match, nicknames. One panel is labeled “Enron execs”.]

Learning
• There is no single “best” measure of similarity:
  • How can you learn how to better rank graph nodes, for a particular task?
• Learning methods for graph walks:
  • The parameters can be adjusted using gradient descent methods (Diligenti et al, IJCAI 2005)
  • We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning)
• Features of candidate answer y describe the set of paths from query x to y

Re-ranking overview
Boosting-based reranking, following (Collins and Koo, Computational Linguistics, 2005):
A training example includes:
• a ranked list of l_i nodes.
• Each node is represented through m features
• At least one known correct node
Scoring function: [formula not recovered], a linear combination of features and the original score y~x
Find w that minimizes (boosted version): [formula not recovered]
Requires binary features and has a closed form formula to find best feature and delta in each iteration.
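The slide's formulas did not survive extraction, so the following is only a sketch of the general shape of such a reranker, written under assumptions: the score is taken to be a weight times the log of the original walk score plus the summed weights of the active binary features, and the objective is taken to be an exponential ranking loss over (correct, incorrect) pairs. The exact forms used in the talk may differ, and the boosting update itself is not shown.

```python
import math

def rerank_score(orig_score, features, w0, w):
    """Assumed form: w0 * log(original score) + sum of weights of active binary features."""
    return w0 * math.log(orig_score) + sum(w.get(f, 0.0) for f in features)

def exp_loss(examples, w0, w):
    """Assumed exponential ranking loss.

    examples: list of (correct, others); each candidate is (orig_score, feature set).
    """
    loss = 0.0
    for correct, others in examples:
        s_correct = rerank_score(*correct, w0, w)
        for cand in others:
            # Penalize every incorrect candidate that scores close to or above the correct one.
            loss += math.exp(-(s_correct - rerank_score(*cand, w0, w)))
    return loss

# One training example (hypothetical): the correct node was ranked below a wrong one.
correct = (0.1, {"bigram:in_file,sent_from"})
wrong = (0.2, set())
examples = [(correct, [wrong])]

loss_before = exp_loss(examples, w0=1.0, w={})
loss_after = exp_loss(examples, w0=1.0, w={"bigram:in_file,sent_from": 1.0})
print(loss_before, loss_after)
```

Raising the weight of a feature that fires on correct answers lowers the loss and can move the correct node above a competitor with a higher original score.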

Path describing Features
• The set of paths to a target node in step k is recovered in full.
• ‘Edge unigram’ features: was edge type l used in reaching x from Vq.
• ‘Edge bigram’ features: were edge types l1 and l2 used (in that order) in reaching x from Vq.
• ‘Top edge bigram’ features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest scoring paths.
[Figure: nodes x1 to x5 arranged by step, K=0, K=1, K=2]
• Paths (x3, k=2):
  • x2 → x1 → x3
  • x4 → x1 → x3
  • x2 → x2 → x3
  • x2 → x3
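The three feature families can be sketched as set-valued binary features computed from the recovered paths. The representation of a path as a score plus a list of edge labels, the feature-name strings, and the example labels are assumptions for illustration.

```python
def path_features(paths, top_n=2):
    """Binary features describing the paths that reach one target node.

    paths: list of (path score, [edge labels in traversal order]).
    """
    def bigrams(labels):
        return zip(labels, labels[1:])

    feats = set()
    for _, labels in paths:
        feats.update(f"unigram:{l}" for l in labels)
        feats.update(f"bigram:{a},{b}" for a, b in bigrams(labels))
    # 'Top' bigrams come only from the top_n highest-scoring paths.
    for _, labels in sorted(paths, key=lambda p: p[0], reverse=True)[:top_n]:
        feats.update(f"top_bigram:{a},{b}" for a, b in bigrams(labels))
    return feats

# Hypothetical paths from the query nodes to one candidate person node.
paths = [
    (0.5, ["in_file", "sent_from"]),
    (0.3, ["in_file", "sent_to"]),
    (0.1, ["in_subj", "sent_to"]),
]
feats = path_features(paths)
print(sorted(feats))
```

Since every feature is either present or absent, the result plugs directly into a reranker that requires binary features.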
