Active Sampling for Entity Matching. Aditya Parameswaran (Stanford University). Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)
Entity Matching. Goal: find duplicate entities in a given data set. A fundamental data cleaning primitive, with decades of prior work. Example of a duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto".
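A tiny sketch of why the two listings above are hard to match with a fixed rule: a simple token-overlap score (Jaccard similarity here is an illustrative choice, not the talk's feature set) lands in an ambiguous middle range, which motivates learning a matcher instead of hand-tuning a threshold.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# The two listings from the slide share only 4 of 10 distinct tokens,
# so a hand-tuned similarity threshold is brittle for this true match.
score = jaccard("Homma's Brown Rice Sushi California Avenue Palo Alto",
                "Homma's Sushi Cal Ave Palo Alto")
# → 0.4
```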
Why is it important? Applications: • Business Listings in Y! Local • Celebrities in Y! Movies • Events in Y! Upcoming • …. [Figure: dirty entities from websites (Yelp, Zagat, Foursquare), databases, and content providers pass through a "Find Duplicates" step to yield deduplicated entities.]
How? Reformulated Goal: Construct a high quality classifier identifying duplicate entity pairs Problem: How do we select training data? Answer: Active Learning with Human Experts!
Reformulated Workflow. [Figure: the same pipeline, with "Our Technique" replacing the "Find Duplicates" step between the dirty entities (websites, databases, content providers) and the deduplicated entities.]
Active Learning (AL) Primer. Properties of an AL algorithm: • Label Complexity • Time Complexity • Consistency. Prior work: • Uncertainty Sampling • Query by Committee • … • Importance Weighted Active Learning (IWAL). We build on online IWAL without constraints: • Implemented in Vowpal Wabbit (VW) • Targets the 0-1 metric • Works even under noisy settings • Time and label efficient • Provably consistent
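A minimal sketch of one round of an importance-weighted active learning loop, to make the primer concrete. This is not VW's exact query rule; `iwal_round`, `predict_pair`, and `query_label` are illustrative names, and the disagreement-based query probability is a simplification of IWAL's rejection threshold.

```python
import random

def iwal_round(x, predict_pair, query_label, min_prob=0.1):
    """One round of a simplified IWAL loop (a sketch, not VW's exact rule).

    predict_pair(x) -> (y_hat, disagreement in [0, 1]) is assumed to come
    from the current hypothesis set; query_label(x) asks the human expert.
    """
    y_hat, disagreement = predict_pair(x)
    p = max(min_prob, disagreement)   # query probability for this example
    if random.random() < p:
        y = query_label(x)
        return (x, y, 1.0 / p)        # importance weight corrects sampling bias
    return None                       # skip: keep the prediction, pay no label
```

Weighting each queried example by 1/p is what keeps the procedure consistent: examples that were unlikely to be queried count more when they are, so the weighted sample is unbiased.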
Problem One: Imbalanced Data. • Non-matches outnumber matches, typically 100:1 even after blocking • Degenerate solution: predict every pair as a non-match. Precision is vacuously 100% and 0-1 error ≈ 0, yet no matches are found • Solution: metric from [Arasu11]: maximize recall (% of correct matches identified), such that precision > τ
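The degenerate classifier on this slide is easy to check numerically. The sketch below (helper names are mine) scores the all-non-match predictor on a 100:1 imbalanced set: 0-1 error looks excellent while recall is zero, which is exactly why the precision-constrained recall objective is needed.

```python
def metrics(y_true, y_pred):
    """Precision, recall, and 0-1 error for binary labels (1 = match)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0   # vacuous when nothing predicted
    recall = tp / (tp + fn) if tp + fn else 0.0
    error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, error

# 100:1 imbalance: predicting "non-match" everywhere looks great on 0-1 error
y_true = [1] + [0] * 100
prec, rec, err = metrics(y_true, [0] * 101)
# precision is vacuously 1.0, recall is 0.0, 0-1 error is only 1/101
```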
Problem Two: Guarantees. • Prior work on entity matching has no guarantees on recall/precision; even when it does, it has high time + label complexity • Can we adapt prior AL work to the new objective (maximize recall, such that precision > τ) with: • Sub-linear label complexity • Efficient time complexity
Overview of Our Approach. Pipeline: Recall Optimization with Precision Constraint → [reduction: convex-hull search in relaxed Lagrangian; this talk] → Weighted 0-1 Error → [reduction: rejection sampling; in the paper] → Active Learning with 0-1 Error.
Objective. Given: • hypothesis class H • threshold τ ∈ [0,1]. Objective: find h in H that • maximizes recall(h) • such that precision(h) ≥ τ. Equivalently: • maximize −falseneg(h) • such that ε·truepos(h) − falsepos(h) ≥ 0 • where ε = (1−τ)/τ (since precision ≥ τ ⇔ (1−τ)·truepos ≥ τ·falsepos)
Unconstrained Objective. Current formulation: maximize X(h) = −falseneg(h) such that Y(h) = ε·truepos(h) − falsepos(h) ≥ 0. Introducing a Lagrange multiplier λ, we maximize X(h) + λ·Y(h), which can be rewritten as a weighted 0-1 objective: minimize δ·falseneg(h) + (1 − δ)·falsepos(h).
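The rewrite on this slide can be checked numerically. Since truepos = (#positives) − falseneg, expanding X + λY gives a constant minus a positive multiple of the weighted 0-1 error, so maximizing one is minimizing the other. The sketch below (function and variable names are mine) verifies that identity; δ falls out as (1 + λε) normalized against the total weight.

```python
def objectives(fn, fp, n_pos, lam, eps):
    """Check the Lagrangian rewrite numerically (a sketch).

    X(h) = -falseneg(h), Y(h) = eps*truepos(h) - falsepos(h),
    with truepos = n_pos - falseneg.  Expanding X + lam*Y gives
    lam*eps*n_pos - (1 + lam*eps)*falseneg - lam*falsepos,
    i.e. a constant minus a positive multiple of the weighted error
    delta*falseneg + (1 - delta)*falsepos.
    """
    tp = n_pos - fn
    lagrangian = -fn + lam * (eps * tp - fp)
    scale = (1 + lam * eps) + lam            # total weight on fn and fp
    delta = (1 + lam * eps) / scale          # false-negative weight
    weighted = delta * fn + (1 - delta) * fp
    # identity: lagrangian == lam*eps*n_pos - scale*weighted
    return lagrangian, lam * eps * n_pos - scale * weighted
```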
Convex Hull of Classifiers. Plot each classifier at (X(h), Y(h)); the convex hull is the shape formed by joining the classifiers that dominate the others. There can be an exponential number of classifiers inside the hull. We want the classifier that maximizes X(h) such that Y(h) ≥ 0, which lies on the hull near the Y(h) = 0 line.
Convex Hull of Classifiers. For any λ > 0, there is a vertex u (or an edge u–v) of the hull with the largest value of X + λY. Plugging λ into the weighted objective yields the classifier h with highest X(h) + λ·Y(h). If λ = −1/slope of a hull edge, we get a classifier on that edge; otherwise we get a vertex classifier. Objective: maximize X(h) such that Y(h) ≥ 0.
Convex Hull of Classifiers. Naïve strategy: try all λ (equivalently, all slopes). Too long! And in the worst case we still land on a suboptimal vertex. Instead, do binary search for λ. • Problem: when to stop? • 1) Bounds • 2) Discretization of λ • Details in the paper!
Algorithm I (Ours Weighted). • Given: AL black box C for weighted 0-1 error • Goal: the precision-constrained objective • Range of λ: [Λmin, Λmax] • Don't enumerate all candidate λ: too expensive, O(n³) • Instead, discretize using factor θ (see paper!) • Binary search over the discretized values: same complexity as binary search, O(log n) calls to C
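A sketch of the binary search in Algorithm I, under assumptions of mine: `weighted_al` stands in for the weighted 0-1 AL black box C and is assumed to return a trained classifier together with the sign of its constraint slack Y(h); the geometric θ-grid and the stopping details are simplified relative to the paper.

```python
def search_lambda(weighted_al, lam_min, lam_max, theta):
    """Binary search over a theta-discretized lambda grid (a sketch).

    weighted_al(lam) -> (classifier h, slack) where slack is the sign of
    Y(h) = eps*truepos(h) - falsepos(h), the precision-constraint slack.
    """
    # geometric discretization: lam_min, lam_min*theta, lam_min*theta^2, ...
    grid, lam = [], lam_min
    while lam <= lam_max:
        grid.append(lam)
        lam *= theta
    lo, hi, best = 0, len(grid) - 1, None
    while lo <= hi:                       # O(log |grid|) black-box calls
        mid = (lo + hi) // 2
        h, slack = weighted_al(grid[mid])
        if slack >= 0:                    # constraint met: try a smaller lambda
            best, hi = h, mid - 1         # (smaller lambda favors recall, X(h))
        else:                             # constraint violated: penalize more
            lo = mid + 1
    return best
```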
Algorithm II (Weighted 0-1). • Given: AL black box B for 0-1 error • Goal: AL black box C for weighted 0-1 error • Use a trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one • Reduction by rejection sampling
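A sketch of the rejection-sampling reduction, in the spirit of [Zadrozny03] but with my own illustrative names and the two-cost weighted 0-1 setting from the earlier slide: positives carry cost δ (the false-negative weight), negatives carry cost 1 − δ, and each example is kept with probability proportional to its cost.

```python
import random

def rejection_sample(examples, delta):
    """Reduce weighted 0-1 error to plain 0-1 error by rejection sampling.

    examples: list of (x, y) with y in {0, 1}.  Keeping an example with
    probability cost / max_cost makes the unweighted 0-1 error on the
    kept sample match the weighted error on the original distribution,
    in expectation.
    """
    max_cost = max(delta, 1 - delta)
    kept = []
    for x, y in examples:
        cost = delta if y == 1 else 1 - delta
        if random.random() < cost / max_cost:
            kept.append((x, y))           # accepted: feed to the 0-1 AL box B
    return kept
```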
Overview of Our Approach. Pipeline with complexities: Recall Optimization with Precision Constraint → [reduction: convex-hull search in relaxed Lagrangian, O(log n); this talk] → Weighted 0-1 Error → [reduction: rejection sampling, O(log n); in the paper] → Active Learning with 0-1 Error. Overall: Labels = O(log² n)·L(B) • Time = O(log² n)·T(B).
Experiments. • Four real-world data sets • All labels known, so we can simulate active learning • Two approaches for AL with a precision constraint: • Ours, with Vowpal Wabbit as the 0-1 AL black box • Monotone [Arasu11]: assumes monotonicity of the similarity features; high computational + label complexity
Results I (Runtime with #Features). [Figure: computational complexity on UCI Person.]
Results II (Quality & #Label Queries). [Figures: Business and Person data sets.]
Results II (Contd.). [Figures: DBLP-ACM and Scholar data sets.]
Results III (0-1 Active Learning). [Figure: precision constraint satisfaction % of 0-1 AL.]
Conclusion. • Active learning for entity matching • Can use any 0-1 AL algorithm as a black box • Great real-world performance: • Computationally efficient (600k examples in 25 seconds) • Label efficient, with better F-1 on four real-world tasks • Guarantees on: • Precision of the matcher • Time and label complexity