Exploiting Context Analysis for Combining Multiple Entity Resolution Systems

Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra • University of California, Irvine • ACM SIGMOD 2009 Conference, Providence, RI, USA, June 30 – July 2, 2009 • © 2009 Dmitri V. Kalashnikov

Information Quality • Data processing flow: (Raw) Data → Analysis → Decisions • Quality of Data → Quality of Analysis → Quality of Decisions • Quality of data is critical • $1 Billion market • Estimated by Forrester Group

Entity Resolution (ER) • One of the Information Quality challenges • Disambiguating uncertain references to objects (in raw data) • Lookup • List of all objects is given • Match references to objects • Grouping • No list of objects is given • Group references that corefer

Example of Analysis on Bad Data: CiteSeer • CiteSeer: top-k most cited authors • Unexpected entries • Let's check two people in DBLP • “A. Gupta” • “L. Zhang” • Analysis on bad data can lead to incorrect results • Fix errors before analysis • [Figure: Raw Data → Data Quality Engine → Analysis → Decisions]

Motivating ER Ensembles • Many ER solutions exist • No single ER solution is consistently the best in terms of quality • Different ER solutions perform better in different contexts • Example: • Let K be the true number of clusters • K is part of the context • Assume that we use Agglomerative Clustering (Merging): if (K is large) then use Solution1: high threshold; if (K is small) then use Solution2: low threshold • Observe that K is unknown beforehand in this case!
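The threshold trade-off in the example can be sketched with a toy merger. This is only an illustration under stated assumptions, not one of the base-level systems from the talk: the `agglomerative` helper, the single-link rule, the similarity values, and both thresholds are made up for the example.

```python
from itertools import combinations

def agglomerative(sim, n, threshold):
    """Single-link merging (an assumed linkage choice): keep merging the two
    most similar clusters while their similarity is at least `threshold`."""
    clusters = [frozenset([i]) for i in range(n)]

    def link(a, b):
        # Single-link similarity: the best pairwise similarity across a and b.
        return max(sim[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(a, b) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical pairwise similarities between 4 references (keys are i < j).
sim = {(0, 1): 0.9, (0, 2): 0.4, (0, 3): 0.1,
       (1, 2): 0.5, (1, 3): 0.1, (2, 3): 0.2}

high = agglomerative(sim, 4, threshold=0.8)  # "Solution1": merges rarely
low = agglomerative(sim, 4, threshold=0.3)   # "Solution2": merges eagerly
print(len(high), len(low))
```

On this toy input the high threshold leaves more clusters than the low one, so which setting is closer to the truth depends on K, which is not known in advance.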

Graphical View of ER Problem • Virtual Connected Subgraph (VCS) • Use simple techniques to create similarity edges (or connect all refs.) • Similarity edges form VCSs • VCS properties • (1) Virtual: contains only similarity edges • (2) Connected: a path exists between any 2 nodes • (3) Subgraph: a subgraph of the ER graph • (4) Complete: adding more nodes/edges would violate (1) or (2) • [CKM: JCDL 2007] • Logically, the goal of ER is to partition each VCS correctly
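Read this way, the VCSs are the connected components of the graph formed by the similarity edges. A minimal sketch with union-find follows; the function name `find_vcs` and the edge list are hypothetical, and only the node names A to G are borrowed from the toy example.

```python
def find_vcs(nodes, similarity_edges):
    """Group nodes into connected components over similarity edges."""
    parent = {v: v for v in nodes}

    def find(v):
        # Walk up to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in similarity_edges:
        parent[find(u)] = find(v)  # union the two components

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    # Sort components by their member names for a stable result.
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical similarity edges; the real toy graph may differ.
edges = [("A", "B"), ("B", "E"), ("E", "F"), ("C", "D"), ("D", "G")]
vcs_list = find_vcs("ABCDEFG", edges)
print(vcs_list)
```

Each returned set is maximal: no further node can be added without breaking the "connected via similarity edges" condition, which matches property (4).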

Problem Definition • Base-level ER systems S1, S2, …, Sn are black boxes • Apply each to the dataset • Outputs as graphs: • nodes: one per reference • edges: connect each pair of references • For each edge ej, system Si makes decision dji ∈ {-1,+1} • Goal: combine dj1, dj2, …, djn to make the final decision a*j for ej, such that the final clustering is as close to the ground truth as possible • [Figure: Raw Dataset → Base-level ER Systems S1, S2, …, SN → Output of S1, Output of S2, …, Output of SN → Ensemble Techniques → Final Result]

Toy Example: Notation • [Figure: three panels over nodes A, B, E, F and C, D, G, showing the graph with VCS1 and VCS2, the output of ER system S1, and the output of ER system S2]

Naïve Solutions: Voting and Weighted Voting • Voting • For each edge ej, sum the decisions dji made by each Si: if (sum ≥ 0) then ej is positive (+1), else ej is negative (-1) • Weighted Voting • Assign weight wi to each system Si • For ej, sum the weighted decisions wi·dji made by the Si's • Proceed as in voting • Notice: if (n-1) systems perform poorly and only one performs well, the majority will still win…
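Both rules reduce to the sign of a (weighted) sum, which a short sketch makes concrete. The function names, the three decisions, and the weights below are illustrative assumptions.

```python
def vote(decisions):
    """Plain voting: the edge is positive if the decisions sum to >= 0."""
    return 1 if sum(decisions) >= 0 else -1

def weighted_vote(decisions, weights):
    """Weighted voting: the same rule applied to the weighted sum."""
    total = sum(w * d for w, d in zip(weights, decisions))
    return 1 if total >= 0 else -1

# Hypothetical edge: S1 and S2 say "split" (-1), S3 says "merge" (+1).
decisions = [-1, -1, +1]
plain = vote(decisions)                             # majority decides
weighted = weighted_vote(decisions, [0.1, 0.1, 0.9])  # S3 is trusted more
print(plain, weighted)
```

Here plain voting follows the two-system majority, while the weights let the single trusted system prevail. The weights are fixed for every edge, though, so they cannot react to context.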

Limitations of Weighted Voting • No matter how we choose the weights, in our running example Accuracy ≤ 56% • Problem: WV is static, i.e., not adaptive to the context

Choosing Context Features • Effectiveness: features should capture well which ER systems work well in the given context • Generality: features should be generic, not present in just a few datasets • Error Features • Measure how far the prediction of a parameter by Si is from the estimated true value of that parameter • The larger the error, the more likely it is that Si's solution is off • Combining Features • Number of Clusters (K) • K+ can help (merging ex.) • But K+ is unknown! • Use regression to predict: K1, K2, …, Kn → K* • Ki is # of clusters by Si • Features for edge ej: • Node Fanout • Nv+ is # of pos. edges of v • Also unknown • Use regression to predict: Nv1, Nv2, …, Nvn → Nv* • Nvi is # according to Si • Features for edge ej:
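A rough sketch of the number-of-clusters feature: fit a regression from the per-system counts Ki to the true count, then measure how far each Ki lies from the estimate K*. Several choices here are assumptions rather than details from the slides: the regression uses only the mean of the Ki as its input, the error measure is the relative distance |Ki - K*| / K*, and the training data is invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean(values):
    return sum(values) / len(values)

# Hypothetical training data: per-system cluster counts (K1..Kn) and true K+.
train = [([2, 4, 6], 7), ([1, 3, 5], 5), ([4, 6, 8], 11)]
slope, intercept = fit_line([mean(ks) for ks, _ in train],
                            [k_true for _, k_true in train])

def predict_k(ks):
    # K*: regression estimate of the true number of clusters.
    return slope * mean(ks) + intercept

def error_features(ks):
    # Assumed error measure: relative distance of each Ki from K*.
    k_star = predict_k(ks)
    return [abs(k - k_star) / k_star for k in ks]

ks_new = [3, 5, 7]          # cluster counts reported by S1, S2, S3
k_star = predict_k(ks_new)
errors = error_features(ks_new)
print(k_star, errors)
```

A system whose Ki sits far from K* gets a large error feature, signalling that its clustering is more likely to be off. The same pattern would apply to node fanout, with Nvi and Nv* in place of Ki and K*.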

Training & Testing (training only)

Approach 1: Context-Extended Classification • Three Methods • Method1: learn • Method2: • Method3: 2n features → n • Confidence in “merge” • Learn • Context features: • [Figure: example decision tree that splits on context feature f2 (≤0.9 vs. >0.9) and on decisions d1 and d2, with leaves C=-1 and C=1]

Approach 2: Context-Weighted Classification • Idea • For each Si, learn a model Mi of how well Si performs in context • Learn fj → cj • Algorithm • Apply Si, get dj and fj for ej • Apply Mi to fj, get c*ji and pji • pji is the confidence in c*ji • vji = dji · c*ji · pji; vj = (vj1, vj2, …, vjn) • May reverse some decisions • Learn/use the vj → a*j mapping
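The vji formula can be traced on one edge. The sketch below assumes c*ji is encoded as +1 ("Si is predicted to be right") or -1 ("predicted to be wrong"); the function name and all numbers are hypothetical, and the final learned vj → a*j mapping is left out.

```python
def context_weighted_vector(decisions, predicted_correct, confidences):
    """Compute v_ji = d_ji * c*_ji * p_ji for every base-level system Si."""
    return [d * c * p
            for d, c, p in zip(decisions, predicted_correct, confidences)]

# Hypothetical values for one edge ej and three systems S1..S3.
d_j = [+1, -1, +1]        # decisions d_ji
c_star_j = [-1, +1, +1]   # M_i predicts S1 is wrong here, S2 and S3 right
p_j = [0.9, 0.6, 0.5]     # confidence p_ji in each prediction

v_j = context_weighted_vector(d_j, c_star_j, p_j)
print(v_j)
```

S1 voted +1, but its model predicts with high confidence that S1 is wrong in this context, so its component becomes strongly negative. That is the sense in which the scheme may reverse some decisions before vj is passed to the final classifier.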

Clustering • Correlation Clustering (CC) • Once a*j ∈ {-1,+1} are known, we need to cluster • CC is designed to handle conflicts in labeling • Finds the clustering that agrees the most with the labeling • CC can behave as Agglomerative Clustering • Set parameters accordingly • More generic scheme • Example • Simple merging will merge • CC will not • 2 negative vs. 1 positive
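The "2 negative vs. 1 positive" case can be reproduced on a small graph. The sketch below scores a clustering by the number of edge labels it agrees with and finds the best one by brute force over all partitions. Correlation clustering is NP-hard in general, so this exhaustive search is only workable for toy inputs; the node names, labels, and helper names are made up for the illustration.

```python
def partitions(items):
    """Yield every partition of `items` as a list of sets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        # ...or give it a block of its own.
        yield part + [{first}]

def agreements(partition, labels):
    """Count +1 edges inside a cluster plus -1 edges across clusters."""
    score = 0
    for (u, v), label in labels.items():
        together = any(u in block and v in block for block in partition)
        score += (label == 1) == together
    return score

# Hypothetical labels a*_j: {A,B} and {C,D} are internally positive, and the
# two groups are linked by one positive edge and two negative edges.
labels = {("A", "B"): 1, ("C", "D"): 1, ("B", "C"): 1,
          ("A", "C"): -1, ("B", "D"): -1}

best = max(partitions(list("ABCD")), key=lambda p: agreements(p, labels))
merged_all = [set("ABCD")]  # what following positive edges transitively gives
print(sorted(map(sorted, best)), agreements(best, labels),
      agreements(merged_all, labels))
```

Merging along every positive edge collapses all four nodes into one cluster, which violates both negative labels. The agreement-maximizing clustering keeps the two groups apart and sacrifices only the single positive edge between them.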

Experimental Setup • Datasets • Web domain: [WWW’05] • Publication domain: RealPub [TODS’05] • Baseline Algorithms • BestBase: the Si that produces the best overall result • Majority Voting • Weighted Voting • Three clustering-aggregation algos from [GMT05] • Standard ER ensemble [ZR05] • Base-level Systems Si • TF-IDF + merging, with different merging thresholds • Feature + relationship + Correlation Clustering • Etc.

Sample of Base-level systems

Experiment 1: “Sanity Check” • Introduce one “perfect” base-level system that always gets perfect results • Does not exist in practice • Utilizes the ground truth (unknown, of course) • As expected, the algorithms were able to learn to use that “perfect” system, and to ignore the results of other base-level systems

Comparing Various Aggregation Algorithms • Measures: Fp, FB, F1 • Num. systems: 5, 10, 20 • WeightedERE is #1 • ExtendedERE is #2 • Both are statistically better • According to t-test, α = 0.05 • Consistent improvement • 5 → 10 → 20 • MajorVot < BestBase • Many base algorithms do not perform well

Detailed results for 20 systems and Fp • None of the baselines is consistently better • See “BestIndiv” • That is why ER Ensemble outperforms the rest

Results on RealPub • Results are similar to those on WePS data

Comparing Different Combinations of Base-level Systems on RealPub • Combination 1 • 1 Context, 3 RelER (t=0.05; 0.01; 0.005), and 1 RelAA (t=0.1) • Combination 2 • 3 RelER (t=0.0005; 0.0001; 0.00005) and 2 RelAA (t=0.01; 0.001) • W_ERE #1, E_ERE #2, Comb2 > Comb1

Efficiency Issues • Running time consists of • Running (in parallel) base-level systems • To get decision features • Running (in parallel) two regression classifiers • To get context features • Applying meta-classifier • Depends on the type of classifier • Usually not a bottleneck (1-5 sec on 5K to 50K data) • Applying correlation clustering • Not a bottleneck (under a second) • Blocking • 1-2 orders of magnitude improvement

Future Work • Efficiency • How to determine which base-level systems to run • And on which parts of data • Trade efficiency for quality • Features • Look into more feature types • Improve the quality of predictions • Apply framework iteratively

Questions? • Stella Chen • Sharad Mehrotra: www.ics.uci.edu/~sharad • Dmitri V. Kalashnikov: www.ics.uci.edu/~dvk • GDF Project: www.ics.uci.edu/~dvk/GDF
