Constraint-Based Entity Matching
Warren Shen, Xin Li, AnHai Doan
Database & AI Groups, University of Illinois, Urbana

Entity Matching
• Decide if mentions refer to the same real-world entity
• Key problem in numerous applications
  • Information integration
  • Natural language understanding
  • Semantic Web
Example mentions:
  Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
  Chen Li, Doug Chan. “Ensemble Learning”
  C. Li, D. Chan. “Ensemble Learning”. ICML 2003

State of the Art
• Numerous solutions in the AI, Database, and Web communities
  • Cohen, Ravikumar, & Fienberg 2003
  • Li, Morie, & Roth 2004
  • Bhattacharya & Getoor 2004
  • McCallum, Nigam, & Ungar 2000
  • Pasula et al. 2003
  • Wellner et al. 2004
• Most solutions largely exploit only syntactic similarity
  • “Jeff Smith” ≈ “J. Smith”
  • “(217) 235-1234” ≈ “235-1234”

Semantic Constraints
[Figure: mentions from Chris Li’s Homepage, DBLP, and Chen Li’s Homepage, annotated with three constraint types: Incompatible, Subsumption, Layout]
Mentions shown in the figure:
  C. Li. “User Interfaces”. SIGCHI 2000
  C. Li, J. Smith. “Numerical Analysis”. SIAM 2001
  “Numerical Analysis”, SIAM 2001 with J. Smith.
  Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
  Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003
  C. Li. “Data Mining”. KDD 2000

Our Contributions
• Develop a solution to exploit semantic constraints
  • Models constraints in a uniform probabilistic manner
  • Clusters mentions using a generative model
  • Uses relaxation labeling to handle constraints
  • Adds a pairwise layer to further improve accuracy
• Experimental results on two real-world domains
  • Researchers, IMDB
  • Improved accuracy over state of the art by 3-12% F-1

Probabilistic Modeling of Constraints
• Each constraint is modeled by its effect on the probability that a mention refers to a real-world entity
• Example: “If two mentions in the same document share similar names, they are likely to match”
  • Mentions m1: Chen Li and m2: C. Li, entity e1
  • P(m2 = e1 | m1 = e1) = 0.8
• Constraint probabilities have a natural interpretation
• Can be learned or manually specified by a domain expert

The Entity Matching Problem
• Documents: d1, d2
• Mentions: m1: Chen Li, m2: C. Li, m3: Chris Lee
• Constraints: c1 = layout constraint, p(c1) = 0.8
• Matching pairs: m1 = m2
Solution
• Model document generation
• Cluster mentions using this model

Modeling Document Generation
• Generate mentions for each document
  • Select entities
  • Generate and “sprinkle” mentions
• Check constraints for each mention
  • Decide whether to enforce constraint c
  • If enforced, check if mention violates c
  • If yes, discard documents and repeat process
(Extension of model in Li, Morie & Roth 2004)
[Figure: entity set E (e1: Chen Li, e2: Chris Lee) generating mentions m1: Chen Li, m2: C. Li, m3: Chris Lee in documents d1 and d2; c1: layout constraint, p(c1) = 0.8]

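The generate-then-check loop on this slide can be read as rejection sampling. The Python sketch below is only an illustration under stated assumptions, not the authors' model: the helper names (`name_key`, `violates_layout`, `generate_document`), the crude "similar name" test, the third entity e3, and the uniform random choices are all hypothetical, and the constraint is checked per document rather than per mention to keep the example short.

```python
import random


def name_key(name):
    """Hypothetical similarity key: (first initial, last name), lowercased."""
    parts = name.split()
    return (parts[0][0].lower(), parts[-1].lower())


def violates_layout(mentions):
    """True if two similar-named mentions in one document have different entities."""
    for i, (name_a, ent_a) in enumerate(mentions):
        for name_b, ent_b in mentions[i + 1:]:
            if name_key(name_a) == name_key(name_b) and ent_a != ent_b:
                return True
    return False


def generate_document(entities, constraints, rng,
                      n_entities=2, n_mentions=3, max_tries=1000):
    """Sample one document as a list of (mention text, entity id) pairs.

    constraints is a list of (probability, violates) pairs; each constraint
    is enforced with its probability, and a violating draw is discarded.
    """
    for _attempt in range(max_tries):
        # Select entities for this document.
        chosen = rng.sample(sorted(entities), k=n_entities)
        # Generate and "sprinkle" mentions of the chosen entities.
        mentions = []
        for _slot in range(n_mentions):
            entity = rng.choice(chosen)
            mentions.append((rng.choice(entities[entity]), entity))
        # Decide per constraint whether to enforce it, then check it.
        rejected = False
        for prob, violates in constraints:
            enforced = rng.random() < prob
            if enforced and violates(mentions):
                rejected = True
                break
        if not rejected:
            return mentions
    raise RuntimeError("no constraint-satisfying document found")


# Made-up entities and name variants, for illustration only.
ENTITIES = {
    "e1": ["Chen Li", "C. Li"],
    "e2": ["Chris Lee", "C. Lee"],
    "e3": ["Chris Li", "C. Li"],
}
document = generate_document(ENTITIES, [(0.8, violates_layout)], random.Random(0))
```

With p(c1) = 0.8, a violating document still survives about one time in five, which is how a soft constraint differs from a hard one in this reading.
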
Clustering with the Generative Model
• Find mention assignments F and model parameters θ to maximize P(D, F | θ)
• Difficult to compute exactly, so use a variant of EM

Incorporating Constraints
• Extend the step that assigns mentions
• Basic mention assignment:
• Extension: Use constraints to improve mention assignments

Enforcing Constraints on Clusters
• Apply constraints at each iteration
• Use relaxation labeling to apply constraints to mention assignments
[Figure: iteration loop: Compute parameters, Assign mentions, Apply constraints]

Relaxation Labeling
• Start with an initial labeling of mentions with entities
• Iteratively improve mention labels, given constraints
• Can be extended to probabilistic constraints
• Scalable
Initial labeling: Chen Li = e1, C. Li = e2, Y. Lee = e3, Chris Lee = e2, Jane Smith = e4, C. Lee = e2, Smith, J = e4
Constraints: c1 = layout constraint, p(c1) = 0.8
After applying c1: C. Li is relabeled from e2 to e1

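The slide does not give the update rule, so the following Python sketch is a simplified, greedy variant of relaxation labeling rather than the authors' procedure. The prior scores, the function names, and the rule of multiplying by p when a constrained pair agrees and by 1 - p otherwise are all assumptions made for illustration.

```python
def score(mention, entity, labels, prior, constraints):
    """Prior score times one factor per constraint that touches the mention."""
    s = prior.get((mention, entity), 0.0)
    for a, b, p in constraints:
        if mention not in (a, b):
            continue
        other = b if mention == a else a
        # Assumed combination rule: p if the pair agrees, 1 - p otherwise.
        s *= p if labels[other] == entity else 1.0 - p
    return s


def relaxation_labeling(initial_labels, prior, constraints, max_iters=10):
    """Relabel mentions one at a time until no label changes."""
    labels = dict(initial_labels)
    entities = sorted(set(labels.values()))
    for _ in range(max_iters):
        changed = False
        for mention in labels:
            best = max(
                entities,
                key=lambda e: score(mention, e, labels, prior, constraints),
            )
            current = score(mention, labels[mention], labels, prior, constraints)
            if score(mention, best, labels, prior, constraints) > current:
                labels[mention] = best
                changed = True
        if not changed:
            break
    return labels


# Initial labels follow the slide; the prior scores are made up.
initial = {"Chen Li": "e1", "C. Li": "e2"}
prior = {
    ("Chen Li", "e1"): 0.9, ("Chen Li", "e2"): 0.1,
    ("C. Li", "e1"): 0.45, ("C. Li", "e2"): 0.55,
}
layout = [("Chen Li", "C. Li", 0.8)]  # c1: the two mentions likely match
result = relaxation_labeling(initial, prior, layout)
```

With these made-up scores, C. Li moves from e2 to e1, matching the relabeling shown on the slide, and the loop stops on the second pass because nothing else changes. Several constraints on the same mention simply contribute several factors, which is one possible way to combine them.
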
Handling Probabilistic Constraints
• Relaxation labeling can combine multiple probabilistic constraints

Pairwise Layer
• So far, we have applied constraints to clusters
• It may be unclear how to enforce constraints on clusters
  • Example cluster: Li, Chen; Chen Li; C. Li; Li, C.
  • Constraint: C. Li ≠ Li, C.
  • Remove C. Li or Li, C.?
• Add a pairwise layer
  • Convert clusters into predicted matching pairs
  • Remove only the pairs to which negative pairwise hard constraints apply
[Figure: iteration loop: Compute parameters, Assign mentions, Apply constraints]

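A minimal Python sketch of the pairwise layer as described on this slide: expand each cluster into all of its unordered mention pairs, then drop only the pairs that a negative hard constraint forbids. The function names are hypothetical, and mentions are represented as plain strings for brevity.

```python
from itertools import combinations


def clusters_to_pairs(clusters):
    """Expand each cluster into the set of its unordered mention pairs."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add(frozenset((a, b)))
    return pairs


def apply_negative_constraints(pairs, must_not_match):
    """Drop only the pairs named by a negative hard constraint."""
    forbidden = {frozenset(pair) for pair in must_not_match}
    return pairs - forbidden


# The cluster and the constraint from the slide.
cluster = {"Li, Chen", "Chen Li", "C. Li", "Li, C."}
predicted = clusters_to_pairs([cluster])
kept = apply_negative_constraints(predicted, [("C. Li", "Li, C.")])
```

The four-mention cluster yields six predicted pairs, and only the C. Li / Li, C. pair is removed. Neither mention has to leave the cluster, which sidesteps the "remove which one?" question.
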
Empirical Evaluation
• Two real-world domains
  • Researchers, IMDB
• For each domain
  • Collected documents
    • Researchers: homepages from DBLP and the web
    • IMDB: text and structured records from IMDB
  • Marked up mentions and their attributes
    • 4,991 researcher mentions
    • 3,889 movie titles from IMDB
  • Manually identified all correct matching pairs
• Evaluation metric:
  Precision = # true positives / # predicted pairs
  Recall = # true positives / # correct pairs
  F1 = (2 * P * R) / (P + R)

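The three formulas above translate directly into code. The sketch below assumes pairs are unordered, so (m1, m2) and (m2, m1) count as the same pair; the function name and the choice to return 0.0 when a denominator is zero are illustrative, not taken from the slides.

```python
def pairwise_prf(predicted_pairs, correct_pairs):
    """Return (precision, recall, F1) over unordered mention pairs."""
    predicted = {frozenset(pair) for pair in predicted_pairs}
    correct = {frozenset(pair) for pair in correct_pairs}
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


# Two predicted pairs, one of which is correct.
p, r, f1 = pairwise_prf(
    predicted_pairs=[("m1", "m2"), ("m1", "m3")],
    correct_pairs=[("m2", "m1")],
)
```

In this toy case precision is 1/2, recall is 1, and F1 is 2/3.
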
Using Constraints Improves Accuracy
F1 (P / R)                     Researchers      Movies
Baseline                       .66 (.67/.65)    .69 (.61/.79)
Baseline + Relax               .78 (.78/.78)    .72 (.63/.83)
Baseline + Relax + Pairwise    .79 (.80/.79)    .73 (.64/.83)
• The relaxation labeler improves F-1 by 3-12% (Movies: .69 to .72; Researchers: .66 to .78)
• Relaxation labeling is very fast

Using Constraints Individually
Researchers       F1 (P / R)
Baseline          .66 (.67/.65)
+ Rare Value      .66 (.67/.66)
+ Subsumption     .67 (.68/.65)
+ Neighborhood    .70 (.68/.72)
+ Individual      .70 (.77/.64)
+ Layout          .71 (.68/.74)

Movies            F1 (P / R)
Baseline          .69 (.61/.79)
+ Incompatible    .70 (.62/.79)
+ Neighborhood    .70 (.62/.81)
+ Individual      .71 (.62/.82)
• Each constraint makes a contribution

Related Work
• Much work in entity matching
  • Cohen, Ravikumar, & Fienberg 2003
  • Li, Morie, & Roth 2004
  • Bhattacharya & Getoor 2004
  • McCallum, Nigam, & Ungar 2000
  • Pasula et al. 2003
  • Wellner et al. 2004
• Recent work has looked at exploiting semantic constraints
  • Personal Information Management (Dong et al. 2004)
  • Profiler-based entity matching (Doan et al. 2003)
• Semantic constraints successfully exploited in other applications
  • Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)

Summary and Future Work
• Exploit semantic constraints in entity matching
  • Models constraints in a uniform probabilistic manner
  • Uses a generative model and relaxation labeling to handle constraints in a scalable way
• Experimental results on two real-world domains show effectiveness
• Future work: Learning constraints effectively from current or external data