Self-tuning in Graph-Based Reference Disambiguation Rabia Nuray-Turan Dmitri V. Kalashnikov Sharad Mehrotra Computer Science Department University of California, Irvine
Overview • Intro to Data Cleaning • Entity resolution • RelDC Framework • Past work • Adapting to data • The new part • Reduction to an Optimization problem • Linear programming • Experiments DASFAA 2007, Bangkok, Thailand
Data Cleaning • Analysis on bad data leads to wrong conclusions
Example of the problem: CiteSeer top-K • Suspicious entries • Let's go to the DBLP website • which stores bibliographic entries of many CS authors • Let's check two people • “A. Gupta” • “L. Zhang” [Figures: CiteSeer list of the top-k most cited authors; two DBLP pages]
Two Most Common Entity-Resolution Challenges • Fuzzy lookup • reference disambiguation • match references to objects • list of all objects is given • Fuzzy grouping • group together object representations that correspond to the same object
Standard Approach to Entity Resolution
Overview • Intro to Data Cleaning • RelDC Framework • Past work • Adapting to data • The new part • Reduction to an Optimization problem • Linear programming • Experiments
RelDC Framework
RelDC Framework • Past work • SDM’05, TODS’06 • Domain-independent framework • Views the dataset as an entity-relationship graph • Analyzes paths in this graph • Solid theoretical foundation • Optimization problem • Scales to large datasets • Robust under uncertainty • High disambiguation quality • No self-tuning • This paper addresses that limitation
Entity-Relationship Graph • Choice node • For uncertain references • To encode options/possibilities yr1, …, yrN • Among options yr1, …, yrN • Pick the most strongly connected one • CAP principle • Analyze paths in G • that exist between xr and yrj, for all j • Use a model to measure connection strength • “Connection strength” model • c(u,v), for nodes u and v in G • how strongly u and v are connected in G • RandomWalk-based • Fixed • Based on intuition! • This paper, instead, learns such a model from data.
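The selection step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resolve_reference`, the `strength` callback standing in for c(u,v), and the toy strength values are all hypothetical and do not come from the RelDC implementation.

```python
from typing import Callable, Hashable, Sequence


def resolve_reference(
    source: Hashable,
    options: Sequence[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option most strongly connected to `source`.

    `strength(u, v)` plays the role of the connection strength c(u, v);
    how it is computed is left to the caller (assumption of this sketch).
    """
    if not options:
        raise ValueError("a choice node needs at least one option")
    return max(options, key=lambda option: strength(source, option))


# Toy, hand-made strengths for one uncertain reference x_r with two options.
toy_strength = {("x_r", "y_r1"): 0.2, ("x_r", "y_r2"): 0.7}
best = resolve_reference(
    "x_r", ["y_r1", "y_r2"], lambda u, v: toy_strength[(u, v)]
)
print(best)
```

Passing the strength model as a callback keeps the selection rule independent of whether the model is fixed (RandomWalk-based) or learned from data.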
Overview • Intro to Data Cleaning • RelDC Framework • Past work • Adapting to data • The new part • Reduction to an Optimization problem • Linear programming • Experiments
Adaptive Solution • Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN} • If paths p1 and p2 are of the same type, they are treated as identical • The connection between nodes u and v can be represented by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v • If each path type Ti is associated with a weight wi, the connection strength is: c(u,v) = c1·w1 + c2·w2 + … + cN·wN
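As a small worked illustration of combining a path-type count vector with per-type weights, the sketch below computes a weighted sum of counts. The function name `connection_strength` and the numbers are assumptions made for this example only.

```python
from typing import Sequence


def connection_strength(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of path-type counts: sum over i of counts[i] * weights[i]."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))


# Hypothetical example: three path types between u and v.
t_uv = [2, 0, 1]          # two paths of type T1, none of T2, one of T3
w = [0.5, 1.0, 0.25]      # made-up weights for T1, T2, T3
print(connection_strength(t_uv, w))  # 2*0.5 + 0*1.0 + 1*0.25 = 1.25
```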
Problems to Answer • How will we classify the paths? • How will we associate each path type with a weight?
Classifying Paths • Path Type Model (PTM): • Views each path as a sequence of edges • <e1,e2,e3,…,en> • Each edge ei has a type Ei associated with it • Thus, each path p can be associated with a string • <E1,E2,E3,…,En> • Different strings correspond to different path types • Associate a weight with each string • Different models are also possible
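The path-type idea above can be sketched as follows: each path is mapped to the sequence of its edge types, and paths with the same sequence are counted together. The edge identifiers, the type labels, and the function names are hypothetical; the sketch also assumes every edge has exactly one known type.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

PathType = Tuple[str, ...]


def path_type(path: Sequence[str], edge_type: Dict[str, str]) -> PathType:
    """Map a path <e1, ..., en> (edge ids) to its type string <E1, ..., En>."""
    return tuple(edge_type[edge] for edge in path)


def count_path_types(
    paths: Iterable[Sequence[str]], edge_type: Dict[str, str]
) -> Counter:
    """Count how many of the given paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)


# Hypothetical edges a..f, each labelled with an edge type.
edge_type = {"a": "E1", "b": "E3", "c": "E1", "d": "E3", "f": "E1"}
paths = [["a", "b", "c"], ["c", "d", "f"], ["a", "c", "b"]]

counts = count_path_types(paths, edge_type)
print(counts[("E1", "E3", "E1")])  # the first two paths share this type
print(counts[("E1", "E1", "E3")])  # the third path has a different type
```

The resulting counter is one way to obtain the path-type count vector for a pair of nodes, once the path types are put in a fixed order.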
Learning Path Weights: Optimization Problem • The CAP principle states that the right option will be better connected • Linear programming • Learn the path-type weights wi
Final Solution • The value of c(xr,yrj) − c(xr,yrl) should be maximized for all r, l ≠ j • Then final solution:
Example: Graph • P1 = e1-e3-e1 • P2 = e1-e1-e3 • P3 = e1-e2-e2-e3 • P4 = e1-e2-e3-e2-e3
Example: Solution • w1 = 1 • w3 = w4 = 0 • w2 can be anything between 0 and 1.
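The weight-learning step can be illustrated with a deliberately simplified sketch. It assumes the objective is the sum of the differences c(xr,yrj) − c(xr,yrl) over training pairs and that every weight is constrained to [0, 1]; under those assumptions the linear program separates per weight and needs no solver. This is not the authors' exact formulation, which may include further constraints, and the function name and toy counts below are invented rather than taken from the example graph.

```python
from typing import List, Sequence


def learn_weights(
    right: Sequence[Sequence[int]], wrong: Sequence[Sequence[int]]
) -> List[float]:
    """Maximize sum_k (right[k] - wrong[k]) . w subject to 0 <= w_i <= 1.

    right[k] and wrong[k] are path-type count vectors for the correct option
    and for one incorrect option of the same reference (assumed inputs).
    """
    if not right or len(right) != len(wrong):
        raise ValueError("need equally many right and wrong count vectors")
    n = len(right[0])
    gain = [0] * n
    for t_right, t_wrong in zip(right, wrong):
        for i in range(n):
            gain[i] += t_right[i] - t_wrong[i]
    # A linear objective over a box is maximized coordinate by coordinate:
    # positive gain -> weight 1, negative gain -> weight 0.
    # Zero gain leaves the weight free in [0, 1]; 0.0 is an arbitrary pick.
    return [1.0 if g > 0 else 0.0 for g in gain]


# Invented counts for four path types and a single training pair.
weights = learn_weights(right=[[2, 1, 0, 0]], wrong=[[0, 1, 1, 2]])
print(weights)
```

In this toy run the second path type occurs equally often for both options, so its weight does not affect the objective and any value in [0, 1] would be equally good.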
Overview • Intro to Data Cleaning • RelDC Framework • Past work • Adapting to data • The new part • Reduction to an Optimization problem • Linear programming • Experiments
Experimental Setup • Parameters • When looking for L-short simple paths, L = 5 • L is the path-length limit • RealMov: • movies (12K) • people (22K): actors, directors, producers • studios (1K): producing, distributing • ground truth is known • SynPub datasets: • many datasets of five different types • emulation of RealPub • publications (5K) • authors (1K) • organizations (25K) • departments (125K) • ground truth is known
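To make the path-length parameter concrete, here is a minimal depth-first sketch that enumerates simple paths of at most L edges between two nodes. It assumes "L-short" means at most L edges and that the graph is given as an adjacency list; the function name and the toy graph are hypothetical, and a real implementation would likely need pruning to scale.

```python
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def short_simple_paths(
    graph: Graph, source: Hashable, target: Hashable, max_len: int = 5
) -> List[List[Hashable]]:
    """Return all simple paths from source to target with at most max_len edges."""
    found: List[List[Hashable]] = []

    def extend(path: List[Hashable]) -> None:
        node = path[-1]
        if node == target:
            found.append(list(path))
            return
        if len(path) - 1 == max_len:   # edge budget used up
            return
        for neighbor in graph.get(node, []):
            if neighbor not in path:   # keep the path simple (no repeated node)
                path.append(neighbor)
                extend(path)
                path.pop()

    extend([source])
    return found


# Toy undirected graph written as a symmetric adjacency list.
toy_graph = {
    "x": ["a", "b"],
    "a": ["x", "y"],
    "b": ["x", "c"],
    "c": ["b", "y"],
    "y": ["a", "c"],
}
print(short_simple_paths(toy_graph, "x", "y", max_len=5))
print(short_simple_paths(toy_graph, "x", "y", max_len=2))
```

With the limit lowered to 2, the three-edge path through b and c is no longer returned, which shows how L trades coverage for search cost.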
Experimental Results on Movies • Parameters: • Fraction: fraction of uncertain references in the dataset • Each reference has 2 choices
Experimental Results on Movies II • Number of options based on PMF distribution
Hybrid Model: Experimental Results on SynPub • RandomWalk, PTM, and the Hybrid Model have the same accuracy • Is RandomWalk the optimal model for the Publications domain?
Effect of Random Relationships in the Publications Domain
Summary • Main Contribution • An adaptive solution for connection strength • Model learns the weights of different path types • Ongoing work • Using different models to learn the importance of paths in the connection strength • Using standard machine learning techniques, such as decision trees, for learning • Different ways to classify paths
Contact Information • RelDC project • www.ics.uci.edu/~dvk/RelDC • www.itr-rescue.org (RESCUE) • Rabia Nuray-Turan (contact author) • www.ics.uci.edu/~rnuray • Dmitri V. Kalashnikov • www.ics.uci.edu/~dvk • Sharad Mehrotra • www.ics.uci.edu/~sharad