Active Sampling for Entity Matching. Aditya Parameswaran (Stanford University). Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)
Entity Matching. Goal: find duplicate entities in a given data set. A fundamental data cleaning primitive, with decades of prior work. Example of a duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto".
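A tiny sketch of why the two listings above are hard to match with a fixed rule: a simple token-overlap score (Jaccard similarity here is an illustrative choice, not the talk's feature set) lands in an ambiguous middle range, which motivates learning a matcher instead of hand-tuning a threshold.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# The two listings from the slide share only 4 of 10 distinct tokens,
# so a hand-tuned similarity threshold is brittle for this true match.
score = jaccard("Homma's Brown Rice Sushi California Avenue Palo Alto",
                "Homma's Sushi Cal Ave Palo Alto")
# → 0.4
```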
Why is it important? Applications: • Business Listings in Y! Local • Celebrities in Y! Movies • Events in Y! Upcoming • …. [Figure: dirty entities from websites (Yelp, Zagat, Foursquare), databases, and content providers pass through a "Find Duplicates" step to yield deduplicated entities.]
How? Reformulated Goal: Construct a high quality classifier identifying duplicate entity pairs Problem: How do we select training data? Answer: Active Learning with Human Experts!
Reformulated Workflow. [Figure: the same pipeline, with "Our Technique" replacing the "Find Duplicates" step between the dirty entities (websites, databases, content providers) and the deduplicated entities.]
Active Learning (AL) Primer. Properties of an AL algorithm: • Label Complexity • Time Complexity • Consistency. Prior work: • Uncertainty Sampling • Query by Committee • … • Importance Weighted Active Learning (IWAL). We build on online IWAL without constraints: • Implemented in Vowpal Wabbit (VW) • Targets the 0-1 metric • Works even under noisy settings • Time and label efficient • Provably consistent
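A minimal sketch of one round of an importance-weighted active learning loop, to make the primer concrete. This is not VW's exact query rule; `iwal_round`, `predict_pair`, and `query_label` are illustrative names, and the disagreement-based query probability is a simplification of IWAL's rejection threshold.

```python
import random

def iwal_round(x, predict_pair, query_label, min_prob=0.1):
    """One round of a simplified IWAL loop (a sketch, not VW's exact rule).

    predict_pair(x) -> (y_hat, disagreement in [0, 1]) is assumed to come
    from the current hypothesis set; query_label(x) asks the human expert.
    """
    y_hat, disagreement = predict_pair(x)
    p = max(min_prob, disagreement)   # query probability for this example
    if random.random() < p:
        y = query_label(x)
        return (x, y, 1.0 / p)        # importance weight corrects sampling bias
    return None                       # skip: keep the prediction, pay no label
```

Weighting each queried example by 1/p is what keeps the procedure consistent: examples that were unlikely to be queried count more when they are, so the weighted sample is unbiased.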
Problem One: Imbalanced Data. • Non-matches outnumber matches, typically 100:1 even after blocking • Degenerate solution: predict every pair as a non-match. Precision is vacuously 100% and 0-1 error ≈ 0, yet no matches are found • Solution: metric from [Arasu11]: maximize recall (% of correct matches identified), such that precision > τ
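The degenerate classifier on this slide is easy to check numerically. The sketch below (helper names are mine) scores the all-non-match predictor on a 100:1 imbalanced set: 0-1 error looks excellent while recall is zero, which is exactly why the precision-constrained recall objective is needed.

```python
def metrics(y_true, y_pred):
    """Precision, recall, and 0-1 error for binary labels (1 = match)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0   # vacuous when nothing predicted
    recall = tp / (tp + fn) if tp + fn else 0.0
    error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, error

# 100:1 imbalance: predicting "non-match" everywhere looks great on 0-1 error
y_true = [1] + [0] * 100
prec, rec, err = metrics(y_true, [0] * 101)
# precision is vacuously 1.0, recall is 0.0, 0-1 error is only 1/101
```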
Problem Two: Guarantees. • Prior work on entity matching has no guarantees on recall/precision; even when it does, it has high time + label complexity • Can we adapt prior AL work to the new objective (maximize recall, such that precision > τ) with: • Sub-linear label complexity • Efficient time complexity
Overview of Our Approach. Pipeline: Recall Optimization with Precision Constraint → [reduction: convex-hull search in relaxed Lagrangian; this talk] → Weighted 0-1 Error → [reduction: rejection sampling; in the paper] → Active Learning with 0-1 Error.
Objective. Given: • hypothesis class H • threshold τ ∈ [0,1]. Objective: find h in H that • maximizes recall(h) • such that precision(h) ≥ τ. Equivalently: • maximize −falseneg(h) • such that ε·truepos(h) − falsepos(h) ≥ 0 • where ε = (1−τ)/τ (since precision ≥ τ ⇔ (1−τ)·truepos ≥ τ·falsepos)
Unconstrained Objective. Current formulation: maximize X(h) = −falseneg(h) such that Y(h) = ε·truepos(h) − falsepos(h) ≥ 0. Introducing a Lagrange multiplier λ, we maximize X(h) + λ·Y(h), which can be rewritten as a weighted 0-1 objective: minimize δ·falseneg(h) + (1 − δ)·falsepos(h).
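The rewrite on this slide can be checked numerically. Since truepos = (#positives) − falseneg, expanding X + λY gives a constant minus a positive multiple of the weighted 0-1 error, so maximizing one is minimizing the other. The sketch below (function and variable names are mine) verifies that identity; δ falls out as (1 + λε) normalized against the total weight.

```python
def objectives(fn, fp, n_pos, lam, eps):
    """Check the Lagrangian rewrite numerically (a sketch).

    X(h) = -falseneg(h), Y(h) = eps*truepos(h) - falsepos(h),
    with truepos = n_pos - falseneg.  Expanding X + lam*Y gives
    lam*eps*n_pos - (1 + lam*eps)*falseneg - lam*falsepos,
    i.e. a constant minus a positive multiple of the weighted error
    delta*falseneg + (1 - delta)*falsepos.
    """
    tp = n_pos - fn
    lagrangian = -fn + lam * (eps * tp - fp)
    scale = (1 + lam * eps) + lam            # total weight on fn and fp
    delta = (1 + lam * eps) / scale          # false-negative weight
    weighted = delta * fn + (1 - delta) * fp
    # identity: lagrangian == lam*eps*n_pos - scale*weighted
    return lagrangian, lam * eps * n_pos - scale * weighted
```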
Convex Hull of Classifiers. Plot each classifier at (X(h), Y(h)); the convex hull is the shape formed by joining the classifiers that dominate the others. There can be an exponential number of classifiers inside the hull. We want the classifier that maximizes X(h) such that Y(h) ≥ 0, which lies on the hull near the Y(h) = 0 line.
Convex Hull of Classifiers. For any λ > 0, there is a vertex u (or an edge u–v) of the hull with the largest value of X + λY. Plugging λ into the weighted objective yields the classifier h with highest X(h) + λ·Y(h). If λ = −1/slope of a hull edge, we get a classifier on that edge; otherwise we get a vertex classifier. Objective: maximize X(h) such that Y(h) ≥ 0.
Convex Hull of Classifiers. Naïve strategy: try all λ (equivalently, all slopes). Too long! And in the worst case we still land on a suboptimal vertex. Instead, do binary search for λ. • Problem: when to stop? • 1) Bounds • 2) Discretization of λ • Details in the paper!
Algorithm I (Ours Weighted). • Given: AL black box C for weighted 0-1 error • Goal: the precision-constrained objective • Range of λ: [Λmin, Λmax] • Don't enumerate all candidate λ: too expensive, O(n³) • Instead, discretize using factor θ (see paper!) • Binary search over the discretized values: same complexity as binary search, O(log n) calls to C
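A sketch of the binary search in Algorithm I, under assumptions of mine: `weighted_al` stands in for the weighted 0-1 AL black box C and is assumed to return a trained classifier together with the sign of its constraint slack Y(h); the geometric θ-grid and the stopping details are simplified relative to the paper.

```python
def search_lambda(weighted_al, lam_min, lam_max, theta):
    """Binary search over a theta-discretized lambda grid (a sketch).

    weighted_al(lam) -> (classifier h, slack) where slack is the sign of
    Y(h) = eps*truepos(h) - falsepos(h), the precision-constraint slack.
    """
    # geometric discretization: lam_min, lam_min*theta, lam_min*theta^2, ...
    grid, lam = [], lam_min
    while lam <= lam_max:
        grid.append(lam)
        lam *= theta
    lo, hi, best = 0, len(grid) - 1, None
    while lo <= hi:                       # O(log |grid|) black-box calls
        mid = (lo + hi) // 2
        h, slack = weighted_al(grid[mid])
        if slack >= 0:                    # constraint met: try a smaller lambda
            best, hi = h, mid - 1         # (smaller lambda favors recall, X(h))
        else:                             # constraint violated: penalize more
            lo = mid + 1
    return best
```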
Algorithm II (Weighted 0-1). • Given: AL black box B for 0-1 error • Goal: AL black box C for weighted 0-1 error • Use a trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one • Reduction by rejection sampling
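A sketch of the rejection-sampling reduction, in the spirit of [Zadrozny03] but with my own illustrative names and the two-cost weighted 0-1 setting from the earlier slide: positives carry cost δ (the false-negative weight), negatives carry cost 1 − δ, and each example is kept with probability proportional to its cost.

```python
import random

def rejection_sample(examples, delta):
    """Reduce weighted 0-1 error to plain 0-1 error by rejection sampling.

    examples: list of (x, y) with y in {0, 1}.  Keeping an example with
    probability cost / max_cost makes the unweighted 0-1 error on the
    kept sample match the weighted error on the original distribution,
    in expectation.
    """
    max_cost = max(delta, 1 - delta)
    kept = []
    for x, y in examples:
        cost = delta if y == 1 else 1 - delta
        if random.random() < cost / max_cost:
            kept.append((x, y))           # accepted: feed to the 0-1 AL box B
    return kept
```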
Overview of Our Approach. Pipeline with complexities: Recall Optimization with Precision Constraint → [reduction: convex-hull search in relaxed Lagrangian, O(log n); this talk] → Weighted 0-1 Error → [reduction: rejection sampling, O(log n); in the paper] → Active Learning with 0-1 Error. Overall: Labels = O(log² n)·L(B) • Time = O(log² n)·T(B).
Experiments. • Four real-world data sets • All labels known, so we can simulate active learning • Two approaches for AL with a precision constraint: • Ours, with Vowpal Wabbit as the 0-1 AL black box • Monotone [Arasu11]: assumes monotonicity of the similarity features; high computational + label complexity
Results I (Runtime with #Features). [Figure: computational complexity on UCI Person.]
Results II (Quality & #Label Queries). [Figures: Business and Person data sets.]
Results II (Contd.). [Figures: DBLP-ACM and Scholar data sets.]
Results III (0-1 Active Learning). [Figure: precision constraint satisfaction % of 0-1 AL.]
Conclusion. • Active learning for entity matching • Can use any 0-1 AL algorithm as a black box • Great real-world performance: • Computationally efficient (600k examples in 25 seconds) • Label efficient, with better F-1 on four real-world tasks • Guarantees on: • Precision of the matcher • Time and label complexity