A Framework for Learning to Query Heterogeneous Data. William W. Cohen, Machine Learning Department and Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Joint work with: Einat Minkov, Andrew Ng, Richard Wang, Anthony Tomasic, Bob Frederking.
Outline • Two views on data quality: • Cleaning your data vs living with the mess. • “A lazy/Bayesian view of data cleaning” • A framework for querying dirty data • Data model • Query language • Baseline results (biotext and email) • How to improve results with learning • Learning to re-rank query output • Conclusions
A Bayesian Looks at Record Linkage • Record linkage problem: given two sets of records A={a1,…,am} and B={b1,…,bn}, determine when referent(ai)=referent(bj) • Idea: compute for each ai,bj pair Pr(referent(ai)=referent(bj)) • Pick two thresholds: • Pr(a=b) > HI accept pairing • Pr(a=b) < LO reject pairing • otherwise, “clerical review” by a human clerk • Every optimal decision boundary is defined by a threshold on the ranked list. • Thresholds depend on prior probability of a and b matching.
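The two-threshold rule on this slide can be sketched in a few lines. This is an illustration only: the function name `classify_pairs`, the record identifiers, and the probabilities are made up, and the match probabilities are assumed to come from some upstream model.

```python
def classify_pairs(match_prob, lo, hi):
    """Apply the two-threshold rule to estimated match probabilities.

    match_prob maps a record pair (a_i, b_j) to an estimate of
    Pr(referent(a_i) = referent(b_j)); lo and hi are the two thresholds.
    """
    decisions = {}
    for pair, p in match_prob.items():
        if p > hi:
            decisions[pair] = "accept"
        elif p < lo:
            decisions[pair] = "reject"
        else:
            decisions[pair] = "clerical review"
    return decisions


# Hypothetical probabilities for three candidate pairs.
probs = {("a43", "b7"): 0.98, ("a83", "b2"): 0.55, ("a21", "b9"): 0.04}
decisions = classify_pairs(probs, lo=0.1, hi=0.9)
for pair, decision in decisions.items():
    print(pair, decision)
```

Only the pairs that fall between the thresholds reach the human clerk, which is what keeps the manual effort bounded.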
A Bayesian Looks at Record Linkage • Every optimal decision boundary is defined by a threshold on the ranked list of n*m pairs. • There are 2^(n*m) ways to link. In other words: • 2^(n*m) − n*m linkages can be discarded as impossible* • of the remaining n*m, all but HI-LO can be discarded as “improbable” . . . • But wait: why doesn’t the human clerk pick a threshold between LO and HI?
A Bayesian Looks at Record Linkage • An alternate view of the process: • F-S’s (Fellegi-Sunter’s) method answers the question directly for the cases that everyone would agree on. • Human effort is used to answer the cases that are a little harder. • Q: is A43 in B? A: yes (p=0.98) • Q: is A83 in B? A: not clear… • Q: is A21 in B? A: unlikely
Passing linkage decisions along to the user • Usual goal: link records and create a single highly accurate database for users to query. • Equality is often uncertain, given available information about an entity • “name: T. Kennedy occupation: terrorist” • The interpretation of “equality” may change from user to user and application to application • Does “Boston Market” = “McDonalds” ? • Alternate goal: wait for a query, then answer it, propagating uncertainty about linkage decisions on that query to the end user
WHIRL project (1997-2000) • WHIRL initiated when at AT&T Bell Labs / AT&T Research / AT&T Labs - Research / AT&T Research / AT&T Labs / AT&T Research – Shannon Laboratory / AT&T Shannon Labs
When are two entities the same? • Name variants: Bell Labs / Bell Telephone Labs / AT&T Bell Labs / A&T Labs / AT&T Labs—Research / AT&T Labs Research, Shannon Laboratory / Shannon Labs / Bell Labs Innovations / Lucent Technologies/Bell Labs Innovations • [1925] History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com] • Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]
When are two entities the same? “Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)… King Milinda and Nagasena (the Buddhist sage) discuss … personal identity… Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: … not … the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc… Milinda concludes that "Nagasena" doesn't stand for anything… If we can't say what a person is, then how do we know a person is the same person through time? … There's really no you, and if there's no you, there are no beliefs or desires for you to have… The folk psychology picture is profoundly misleading and believing it will make you miserable.” -S. LaFave
Traditional approach: linkage first, then queries • Uncertainty about what to link must be decided by the integration system, not the end user
WHIRL vision: link items as needed by Q • Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b • Strongest links: those agreeable to most users • Weaker links: those agreeable to some users • even weaker links…
WHIRL vision: DB1 + DB2 ≠ DB • Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ TFIDF-similar) • Link items as needed by Q • Incrementally produce a ranked list of possible links, with “best matches” first. • User (or downstream process) decides how much of the list to generate and examine.
WHIRL queries • Assume two relations: review(movieTitle,reviewText): archive of reviews listing(theatre, movieTitle, showTimes, …): now showing
WHIRL queries • “Find reviews of sci-fi comedies” [movie domain]: FROM review as r SELECT * WHERE r.text~’sci fi comedy’ (like standard ranked retrieval of “sci-fi comedy”) • “Where is [that sci-fi comedy] playing?”: FROM review as r, listing as s SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’ (best answers: titles are similar to each other, e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005”, and the review text is similar to “sci-fi comedy”)
WHIRL queries • Similarity is based on TFIDF: rare words are most important. (Years are common in the review archive, so have low weight.) • Search for high-ranking answers uses inverted indices: - It is easy to find the (few) items that match on “important” terms - Search for strong matches can prune “unimportant terms”
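A rough sketch of a TFIDF similarity join driven by an inverted index follows. It is not WHIRL's actual algorithm: the weighting formula, the helper names (`tokenize`, `tfidf_vectors`, `soft_join`), and the toy titles are assumptions, and IDF is computed over both columns together for simplicity. It also scores every pair that shares a term instead of pruning the search as WHIRL does.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length TFIDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Rare terms (low document frequency) get the largest weights.
        vec = {t: math.log(1 + c) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def soft_join(left, right, k=3):
    """Return the k best (score, left item, right item) pairs by cosine similarity."""
    vecs = tfidf_vectors(left + right)
    lvecs, rvecs = vecs[:len(left)], vecs[len(left):]
    # Inverted index over the right column: term -> [(row id, weight)].
    index = defaultdict(list)
    for j, vec in enumerate(rvecs):
        for t, w in vec.items():
            index[t].append((j, w))
    scored = []
    for i, vec in enumerate(lvecs):
        scores = defaultdict(float)
        # Only rows sharing at least one term are ever touched.
        for t, w in vec.items():
            for j, w2 in index.get(t, ()):
                scores[j] += w * w2
        scored.extend((s, left[i], right[j]) for j, s in scores.items())
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


reviews = ["The Hitchhiker's Guide to the Galaxy, 2005",
           "Star Wars Episode III, 2005",
           "Kingdom of Heaven, 2005"]
listings = ["Hitchhiker's Guide to the Galaxy",
            "Star Wars: Episode III",
            "Kingdom of Heaven"]
top = soft_join(reviews, listings, k=3)
for score, review_title, listing_title in top:
    print(f"{score:.2f}  {review_title!r} ~ {listing_title!r}")
```

Because the output is a ranked list, the caller decides how far down to read, matching the "best matches first" idea.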
WHIRL results • This sort of worked: • Interactive speeds (<0.3s/q) with a few hundred thousand tuples. • For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains. • Average precision better than 90% on 5-way joins
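For readers unfamiliar with the metric, here is a minimal sketch of non-interpolated average precision over a ranked list. It assumes this common definition is the one meant on the slide; the function name and the example ranking are illustrative.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Mean of precision@rank taken at each rank where a correct answer appears.

    ranked_relevance: booleans, best-ranked answer first.
    total_relevant: number of correct answers overall; defaults to the
    number found in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


# Correct answers at ranks 1, 3 and 4.
ap = average_precision([True, False, True, True])
print(round(ap, 4))
```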
WHIRL and soft integration • WHIRL worked for a number of web-based demo applications, e.g., integrating data from 30-50 smallish web DBs with <1 FTE labor • WHIRL could link many data types reasonably well, without engineering • WHIRL generated numerous papers (Sigmod98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001) • WHIRL was relational (but see ELIXIR, SIGIR2001) • WHIRL users need to know the schema of the source DBs • WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics: text fields with few misspellings • WHIRL was memory-based: all data must be centrally stored (no federated data); small datasets only
WHIRL vision: very radical, everything was inter-dependent • Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ TFIDF-similar) • Link items as needed by Q • Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine. • To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...)
Outline • Two views on data quality: • Cleaning your data vs living with the mess. • A lazy/Bayesian view of data cleaning • A framework for querying dirty data • Data model • Query language • Baseline results (biotext and email) • How to improve results with learning • Learning to re-rank query output • Conclusions
BANKS: Basic Data Model • Database is modeled as a graph • Nodes = tuples • Edges = references between tuples • foreign key, inclusion dependencies, .. • Edges are directed. • User need not know organization of database to formulate queries. [Figure: BANKS keyword search example; node labels: paper “MultiQuery Optimization”; writes; authors Charuta, S. Sudarshan, Prasan Roy]
BANKS: Answer to Query • Query: “sudarshan roy” • Answer: subtree from graph [Figure: the paper node “MultiQuery Optimization” linked through two writes nodes to the author nodes S. Sudarshan and Prasan Roy]
BANKS: (Not Quite So) Basic Data Model • All information is modeled as a graph • Nodes = tuples or documents or strings or words • Edges = references between nodes • edges are directed, labeled and weighted • foreign key, inclusion dependencies, ... • doc/string D to word contained by D (TFIDF weighted, perhaps) • word W to doc/string containing W (inverted index) • [string S to strings ‘similar to’ S]
Similarity in a BANKS-like system • Motivation: why I’m interested in • structured data that is partly text – similarity! • structured data represented as graphs; all sorts of information can be poured into this model. • measuring similarity of nodes in graphs • Coming up next: • a simple query language for graphs; • experiments on natural types of queries; • techniques for learning to answer queries of a certain type better
Yet another schema-free query language • Assume data is encoded in a graph with: • a node for each object x • a type for each object x, T(x) • an edge for each binary relation r: x → y • Queries are of this form: • Given type t* and node x, find y:T(y)=t* and y~x. • We’d like to construct a general-purpose node similarity function x~y for objects in the graph • We’d also like to learn many such functions for different specific tasks (like “who should attend a meeting”)
Similarity of Nodes in Graphs • Given type t* and node x, find y:T(y)=t* and y~x. • Similarity defined by “damped” version of PageRank • Similarity between nodes x and y: • “Random surfer model”: from a node z, • with probability α, stop and “output” z • pick an edge label r using Pr(r | z) ... e.g. uniform • pick a y uniformly from { y’ : z → y’ with label r } • repeat from node y .... • Similarity x~y = Pr( “output” y | start at x) • Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
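The random-surfer definition can be sketched by pushing probability mass through the graph for a fixed number of steps. Everything here is an assumption made for illustration, not the talk's implementation: the graph encoding (node → edge label → list of targets), the names `walk_similarity` and `query`, the treatment of dead ends, the truncation at `max_steps`, and the toy string/word graph.

```python
from collections import defaultdict


def walk_similarity(graph, start, alpha=0.5, max_steps=10):
    """Approximate Pr("output" y | start at x) for every reachable y.

    At node z the walker stops with probability alpha; otherwise it picks an
    edge label uniformly, then a target uniformly among edges with that label.
    Mass still walking after max_steps is dropped (truncation).
    """
    current = {start: 1.0}
    output = defaultdict(float)
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for z, mass in current.items():
            edges = graph.get(z, {})
            labels = [r for r, targets in edges.items() if targets]
            if not labels:
                # Assumption: at a dead end the walker can only stop.
                output[z] += mass
                continue
            output[z] += alpha * mass
            remaining = (1 - alpha) * mass
            for r in labels:
                targets = edges[r]
                share = remaining / (len(labels) * len(targets))
                for y in targets:
                    nxt[y] += share
        current = nxt
    return dict(output)


def query(graph, types, x, t_star, **walk_args):
    """Given type t* and node x, return nodes y with T(y)=t*, ranked by y~x."""
    sim = walk_similarity(graph, x, **walk_args)
    candidates = [(p, y) for y, p in sim.items() if types.get(y) == t_star and y != x]
    return [y for p, y in sorted(candidates, reverse=True)]


# Toy graph: strings linked to the words they contain, and back.
graph = {
    "s1": {"has_word": ["william", "w", "cohen", "cmu"]},
    "s2": {"has_word": ["dr", "w", "cohen"]},
    "s3": {"has_word": ["dr", "smith"]},
    "william": {"in_string": ["s1"]},
    "w": {"in_string": ["s1", "s2"]},
    "cohen": {"in_string": ["s1", "s2"]},
    "cmu": {"in_string": ["s1"]},
    "dr": {"in_string": ["s2", "s3"]},
    "smith": {"in_string": ["s3"]},
}
types = {n: ("string" if n in ("s1", "s2", "s3") else "word") for n in graph}

sim = walk_similarity(graph, "s1")
ranked = query(graph, types, "s1", "string")
print(ranked)
```

Strings that share words are joined by many short paths, so they rank above strings that are only reachable through longer ones.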
BANKS: (Not Quite So) Basic Data Model • All information is modeled as a graph • Nodes = tuples or documents or strings or words • Edges = references between nodes • edges are directed, labeled and weighted • foreign key, inclusion dependencies, ... • doc/string D to word contained by D (TFIDF weighted, perhaps) • word W to doc/string containing W (inverted index) • [string S to strings ‘similar to’ S]: optional, since strings that are similar in TFIDF/cosine distance will still be “nearby” in graph (connected by many length=2 paths) • Example: the strings “William W. Cohen, CMU” and “Dr. W. W. Cohen” are connected through the shared word nodes w and cohen; the other word nodes are william, cmu and dr
Similarity of Nodes in Graphs • Random surfer on graphs: • natural extension to PageRank • closely related to Lafferty’s heat diffusion kernel • but generalized to directed graphs • somewhat amenable to learning parameters of the walk (gradient search, w/ various optimization metrics): • Toutanova, Manning & Ng, ICML2004 • Nie et al, WWW2005 • Xi et al, SIGIR 2005 • can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering • our current implementation (GHIRL): Lucene + Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
Query: “sudarshan roy” • Answer: subtree from graph • As a conjunction of similarity queries: { y : paper(y) & y~“sudarshan” } AND { w : paper(w) & w~“roy” }
Evaluation on Personal Information Management Tasks [Minkov et al, SIGIR 2006] • Many tasks can be expressed as simple, non-conjunctive search queries in this framework, such as: • Person Name Disambiguation in Email • Threading • Finding email-address aliases given a person’s name • Finding relevant meeting attendees • Example questions: What is the email address for the person named “Halevy” mentioned in this presentation? What files from my home machine will I need for this meeting? What people will attend this meeting? ... • Slide annotations mark some of these tasks as novel and others as previously studied [eg Diehl, Getoor, Namata, 2006] [eg Lewis & Knowles 97] • Also consider a generalization: x → Vq, where Vq is a distribution over nodes x
Email as a graph [Figure: graph with file, email address, person name, date and term nodes, connected by edges labeled sent_from, sent_to, sent_date, alias, in_file, in_subj and +1_day, together with the inverse edges sf_inv, st_inv, sd_inv, a_inv, if_inv and is_inv]
Person Name Disambiguation • Q: “who is Andy?” • Given: a term that is known to be a personal name and is not mentioned ‘as is’ in the header (otherwise, easy) • Output: ranked person nodes. [Figure: walk from the node term:andy through file nodes to person nodes, e.g. Person: Andrew Johns] * This task is complementary to person name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005)
Corpora and Datasets • a. Corpora [table not recovered] • b. Types of names [table not recovered] • Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing
Person Name Disambiguation • 1. Baseline: String matching (& common nicknames) • Find persons that are similar to the name term (Jaro) • Successful in many cases • Not successful for some nicknames • Cannot handle ambiguity (arbitrary) • 2. Graph walk: term • Vq: name term node (2 steps) • Models co-occurrences. • Cannot handle ambiguity (dominant) • 3. Graph walk: term+file • Vq: name term + file nodes (2 steps) • The file node is naturally available context • Solves the ambiguity problem! • But, incorporates additional noise. • 4. Graph walk: term+file, reranked using learning • Re-rank the output of (3), using: • path-describing features • ‘source count’: do the paths originate from a single or two source nodes • string similarity
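The string-matching baseline relies on Jaro similarity. Below is a sketch of the standard Jaro measure; the slide does not say which variant or implementation was used, and the candidate names are made up for illustration.

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


# Rank hypothetical first names against the query term "andy".
candidates = ["anthony", "keiko", "andrew"]
ranked_names = sorted(candidates, key=lambda name: jaro("andy", name), reverse=True)
print(ranked_names)
```

A nickname that shares few characters with the full name, such as Kai for Keiko, gets a lower score than a near-spelling like andy/andrew, which is one reason the baseline fails on some nicknames.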
Results [Figure: curves for four methods: baseline (string match, nicknames); graph walk from name; graph walk from {name, file}; after learning-to-rank]
Results: Enron execs [figure not recovered]
Learning • There is no single “best” measure of similarity: how can you learn to better rank graph nodes for a particular task? • Learning methods for graph walks: • The parameters can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005) • We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning) • Features of candidate answer y describe the set of paths from query x to y
Re-ranking overview • Boosting-based reranking, following (Collins and Koo, Computational Linguistics, 2005) • A training example includes: • a ranked list of l_i nodes • each node is represented through m features • at least one known correct node • Scoring function: the original score y~x combined with a linear combination of features, with weights w [equation not recovered] • Find w that minimizes the boosted version of the loss [equation not recovered] • Requires binary features and has a closed form formula to find best feature and delta in each iteration.
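The slide's equations did not survive extraction. As a hedged reconstruction, the general form used by Collins and Koo is sketched below. The symbols s, f_k, y_{i,j} and the choice of a log on the original score are assumptions here and may differ from the slide's notation.

```latex
% Assumed notation: s(y) is the original graph-walk score y~x,
% f_k(y) are the m binary features, and y_{i,1} is the known correct
% node in the i-th training list, which has l_i nodes.
F(y; w) = w_0 \log s(y) + \sum_{k=1}^{m} w_k f_k(y)

\mathrm{ExpLoss}(w) = \sum_{i} \sum_{j=2}^{l_i}
    \exp\bigl( -\bigl( F(y_{i,1}; w) - F(y_{i,j}; w) \bigr) \bigr)
```

Minimizing this loss pushes the correct node's score above that of every other node in the same list.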
Path describing Features • The set of paths to a target node in step k is recovered in full. • ‘Edge unigram’ features: was edge type l used in reaching x from Vq. • ‘Edge bigram’ features: were edge types l1 and l2 used (in that order) in reaching x from Vq. • ‘Top edge bigram’ features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest scoring paths. • Example [Figure: nodes x1…x5 over steps K=0, K=1, K=2]: Paths (x3, k=2): • x2 → x1 → x3 • x4 → x1 → x3 • x2 → x2 → x3 • x2 → x3
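The three feature families can be sketched as a function over the recovered paths. The representation of a path as a (score, list of edge labels) pair, the function name `path_features`, and the example paths and scores are assumptions for illustration; only the edge label names echo the email graph.

```python
def path_features(paths, top_k=2):
    """Binary path-describing features for one candidate node.

    paths: list of (score, [edge labels in traversal order]) pairs covering
    every path from the query distribution Vq to the candidate.
    """
    features = set()
    for _, labels in paths:
        for label in labels:
            features.add(("unigram", label))
        for l1, l2 in zip(labels, labels[1:]):
            features.add(("bigram", l1, l2))
    # Bigrams restricted to the top_k highest-scoring paths.
    best = sorted(paths, key=lambda p: p[0], reverse=True)[:top_k]
    for _, labels in best:
        for l1, l2 in zip(labels, labels[1:]):
            features.add(("top_bigram", l1, l2))
    return features


# Hypothetical paths from a name term to a person node.
paths = [
    (0.040, ["in_file", "sent_from", "alias"]),
    (0.010, ["in_subj", "sent_to", "alias"]),
    (0.002, ["in_file", "sent_to", "alias"]),
]
feats = path_features(paths)
for feature in sorted(feats):
    print(feature)
```

Each feature is present or absent, which fits the requirement of binary features in the boosting-based reranker.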