A Comparison of String Matching Distance Metrics for Name-Matching Tasks

William Cohen, Pradeep Ravikumar, Stephen Fienberg

Motivating Example • List of people and some attributes compiled by one source • Updates by another source need to be merged • Need to locate matching records • Forcing exact match not sufficient • Typographical errors (letter “B” vs. letter “V”) • Scanning errors (letter “I” vs. numeral “1”) • Such errors exceed 20% in some cases • Decide when two records match → decide when two strings (or words) should be treated as identical

History – String Matching • Statistics • Treat as a classification problem [Fellegi & Sunter] • Use of other prior knowledge • String represented as a feature vector • Databases • No prior knowledge • Use of distance functions – edit distance, Monge & Elkan, TFIDF • Knowledge-intensive approaches • User interaction [Hernandez & Stolfo] • Artificial Intelligence • Learn the parameters of the edit distance functions • Combine the results of different distance functions • Compare string matching distance functions for the task of name matching

Edit Distance • Number of edit operations needed to go from string s to string t • Operations: insert, delete, substitute • Levenshtein: assigns unit cost to each operation • Distance (“smile”, “mile”) = 1 • Distance (“meet”, “meat”) = 1 • Computed by dynamic programming • Reordering of words can be misleading • “Cohen, William” vs. “William Cohen”
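
The dynamic-programming computation mentioned on this slide can be sketched as follows. This is a minimal illustration of unit-cost (Levenshtein) distance, not code from the paper; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))  # cost of turning "" into each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # cost of turning s[:i] into ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cs from s
                curr[j - 1] + 1,            # insert ct into s
                prev[j - 1] + (cs != ct),   # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("smile", "mile"))  # 1
print(levenshtein("meet", "meat"))   # 1
# Reordered words produce a large distance even though the names match.
print(levenshtein("Cohen, William", "William Cohen"))
```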

Edit Distance • Monge-Elkan: assigns relatively lower cost to a sequence of insertions or deletions • Cost A + B*(n – 1) for a run of n insertions or deletions (B < A) • Other methods assign decreasing costs to subsequent insertions
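
To make the A + B*(n – 1) gap cost concrete, here is a rough sketch of a global alignment distance with affine gaps (a Gotoh-style recurrence). It is a generic illustration, not the exact Monge-Elkan function; the parameter names and the default costs `open_cost=1.0` (A), `extend_cost=0.5` (B) and `sub_cost=1.0` are assumptions made for the example.

```python
import math


def affine_gap_distance(s, t, open_cost=1.0, extend_cost=0.5, sub_cost=1.0):
    """Edit distance where a run of n insertions or deletions costs A + B*(n - 1)."""
    n, m = len(s), len(t)
    inf = math.inf
    # M: alignment ends in a match/substitution
    # D: alignment ends in a deletion from s; I: ends in an insertion from t
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    I = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        D[i][0] = open_cost + extend_cost * (i - 1)
    for j in range(1, m + 1):
        I[0][j] = open_cost + extend_cost * (j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = min(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1])
            M[i][j] = diag + (0.0 if s[i - 1] == t[j - 1] else sub_cost)
            # Opening a new gap costs A; extending the current one costs B.
            D[i][j] = min(M[i - 1][j] + open_cost, I[i - 1][j] + open_cost,
                          D[i - 1][j] + extend_cost)
            I[i][j] = min(M[i][j - 1] + open_cost, D[i][j - 1] + open_cost,
                          I[i][j - 1] + extend_cost)
    return min(M[n][m], D[n][m], I[n][m])


# Inserting the three characters "W. " as one run costs 1.0 + 0.5 * 2.
print(affine_gap_distance("William Cohen", "William W. Cohen"))
```

Setting `extend_cost` equal to `open_cost` recovers the unit-cost behaviour, which makes it easy to see how much the affine cost discounts a long inserted run such as a middle initial.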

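Edit Distance • Jaro (s, t) • Let s’ be the characters in s that are common with t • Let t’ be the characters in t that are common with s • Let T (s’, t’) be half the number of transpositions between s’ and t’ • Jaro (s, t) = 1/3 · (|s’| / |s| + |t’| / |t| + (|s’| – T (s’, t’)) / |s’|)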
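
The definitions on this slide can be turned into a short sketch. The slide does not say how far apart two "common" characters may be; the code below assumes the widely used matching window of max(|s|, |t|) // 2 – 1 positions, which is an assumption of this example rather than something stated in the slides.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Assumed window: characters count as common only if their positions are close.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_used = [False] * len(s)
    t_used = [False] * len(t)
    common = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_used[j] and t[j] == ch:
                s_used[i] = t_used[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    # s' and t': the common characters, each in its original order.
    s_common = [c for c, used in zip(s, s_used) if used]
    t_common = [c for c, used in zip(t, t_used) if used]
    # T(s', t'): half the number of positions where s' and t' disagree.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (common / len(s) + common / len(t)
            + (common - half_transpositions) / common) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```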

Improvements to Jaro • McLaughlin • Exact match – weight of 1.0 • Similar characters – weight of 0.3 • Scanning error (“I” vs. “1”) • Typographical error (“B” vs. “V”) • Pollock and Zamora • Error rates increase as the position in string moves to the right • Adjust output of Jaro by fixed amount depending upon how many of the first 4 characters match
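
The prefix-based adjustment in the last bullet is commonly known as the Winkler variant of Jaro. A minimal sketch follows; it takes an already computed Jaro score as input, and the scaling factor `scale=0.1` is the value usually quoted for this variant, assumed here because the slide gives no number.

```python
def common_prefix_len(s: str, t: str, cap: int = 4) -> int:
    """Length of the shared prefix of s and t, counted up to `cap` characters."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == cap:
            break
        n += 1
    return n


def prefix_adjusted(score: float, s: str, t: str,
                    scale: float = 0.1, cap: int = 4) -> float:
    """Boost a similarity score according to how many leading characters agree."""
    p = common_prefix_len(s, t, cap)
    return score + p * scale * (1.0 - score)


# "MARTHA" and "MARHTA" share the prefix "MAR"; their Jaro score is 17/18.
print(round(prefix_adjusted(17 / 18, "MARTHA", "MARHTA"), 4))  # 0.9611
```

Because the boost is proportional to (1 – score), the adjusted value never exceeds 1.0, and pairs that already score high move only slightly.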

Term Based • Treat strings s & t as bags S and T of words • Examples • Jaccard similarity = |S∩T| / |S∪T| • TFIDF
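
As a small illustration of the Jaccard formula, the sketch below treats each name as a set of lowercased words. The tokenization rule (split on whitespace, drop commas) is an assumption made for the example.

```python
def tokens(name: str) -> set:
    """Lowercased word set of a name; commas are treated as separators."""
    return set(name.lower().replace(",", " ").split())


def jaccard(s: str, t: str) -> float:
    """|S ∩ T| / |S ∪ T| over the word sets of s and t."""
    S, T = tokens(s), tokens(t)
    if not S and not T:
        return 1.0
    return len(S & T) / len(S | T)


print(jaccard("Cohen, William", "William Cohen"))    # 1.0
print(jaccard("William Cohen", "William W. Cohen"))  # 2 shared words out of 3
```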

Term Based • Words may be weighted to make the common words count less • Advantages • Exploits frequency information • Ordering of words doesn’t matter (Cohen, William vs. William Cohen) • Disadvantages • Sensitive to errors in spelling (Cohen vs. Cohon) and abbreviations (Univ. vs. University) • Ordering of words ignored (City National Bank vs. National City Bank)
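
One way to weight words so that common ones count less is TFIDF with cosine similarity. The sketch below uses the weight log(TF + 1) · log(N / DF), normalized to unit length; the tokenizer and the five-name corpus are made up for illustration.

```python
import math
from collections import Counter


def tokenize(name: str) -> list:
    return name.lower().replace(",", " ").split()


def tfidf_vectors(names: list) -> list:
    """One unit-length {word: weight} vector per name."""
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    df = Counter(w for d in docs for w in d)  # number of names containing each word
    vectors = []
    for d in docs:
        raw = {w: math.log(tf + 1) * math.log(n_docs / df[w]) for w, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({w: v / norm for w, v in raw.items()} if norm else {})
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * v.get(w, 0.0) for w, weight in u.items())


names = ["Cohen, William", "William Cohen", "City National Bank",
         "National City Bank", "First National Bank"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # word order does not matter
print(cosine(vecs[2], vecs[3]))  # different banks, yet the same bag of words
print(cosine(vecs[2], vecs[4]))  # shares only the frequent words
```

The second comparison shows the disadvantage listed on the slide: “City National Bank” and “National City Bank” get the same vector because word order is ignored.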

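Hybrid Distance Functions • Recursive Matching • Let s = (a1, a2, …, aK) and t = (b1, b2, …, bL) be the substrings (e.g., words) of s and t • sim (s, t) = (1/K) · Σ over i = 1..K of max over j = 1..L of sim’ (ai, bj) • sim’ is the level-two matching function used to compare substrings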
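
A rough sketch of recursive matching: each word of s is paired with its best-matching word of t, and the best scores are averaged. The level-two function is left pluggable; the default used here, `difflib.SequenceMatcher` from the Python standard library, is only a stand-in and not the function used in the paper.

```python
from difflib import SequenceMatcher


def char_sim(a: str, b: str) -> float:
    """Stand-in level-two similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def recursive_sim(s: str, t: str, sim2=char_sim) -> float:
    """Average, over the words of s, of the best level-two score against words of t."""
    a = s.lower().replace(",", " ").split()
    b = t.lower().replace(",", " ").split()
    if not a or not b:
        return 0.0
    return sum(max(sim2(ai, bj) for bj in b) for ai in a) / len(a)


print(recursive_sim("Cohen, William", "William Cohen"))  # 1.0
print(recursive_sim("William Cohon", "Cohen, William"))
```

Note that the score is not symmetric: it averages over the words of the first argument only, so an extra word in s lowers the score while an extra word in t does not.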

Blocking / Pruning Methods • Comparing all pairs is too expensive when lists are large • A pair (s, t) is a candidate for a match if s and t share some substring v that appears in at most a fraction f of all names • Using substrings v of length 4 and f = 1% finds, on average, 99% of the correct pairs
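
The blocking rule can be sketched with an inverted index from substrings to the names that contain them. For brevity this example works on a single list rather than two lists being merged, and it uses a much larger `max_fraction` than 1% because the toy list has only six names; both simplifications are assumptions of the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names: list, q: int = 4, max_fraction: float = 0.01) -> set:
    """Index pairs of names sharing a length-q substring that is rare enough."""
    index = defaultdict(set)  # substring -> indices of names containing it
    for idx, name in enumerate(names):
        text = name.lower()
        for k in range(len(text) - q + 1):
            index[text[k:k + q]].add(idx)
    limit = max_fraction * len(names)
    pairs = set()
    for ids in index.values():
        # Skip substrings that are unique (no pair) or too common (uninformative).
        if 2 <= len(ids) <= limit:
            pairs.update(combinations(sorted(ids), 2))
    return pairs


names = ["William Cohen", "Wiliam Cohen", "Stephen Fienberg",
         "Steven Feinberg", "Pradeep Ravikumar", "P. Ravikumar"]
print(sorted(candidate_pairs(names, q=4, max_fraction=0.5)))
```

Only the surviving candidate pairs would then be scored with one of the more expensive distance functions.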

Results - Metric • Output of each algorithm is a list of candidate pairs ranked by distance • Non-interpolated average precision of a ranking • Other metrics used • Interpolated precision
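
Non-interpolated average precision can be computed directly from a ranked list of match/non-match labels: at each rank that holds a correct pair, take the precision up to that rank, then average over the correct pairs. A minimal sketch, where the optional `total_correct` argument is an addition of this example for the case in which some correct pairs never appear in the ranking:

```python
def average_precision(ranked_labels, total_correct=None) -> float:
    """Mean of precision-at-rank over the ranks that hold a correct pair."""
    hits = 0
    total = 0.0
    for rank, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            total += hits / rank  # precision among the top `rank` pairs
    denom = hits if total_correct is None else total_correct
    return total / denom if denom else 0.0


# Correct pairs at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(average_precision([True, False, True]), 4))  # 0.8333
```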

Results - Matching • Term based: TFIDF most accurate • Edit distance based: Monge-Elkan most accurate • Jaro as accurate as Monge-Elkan, but much faster • Combine TFIDF and Jaro
