360 likes | 406 Views
Data Quality. Class 7. Agenda. Record Linkage Data Cleansing. Record Linkage. Critical component of data quality applications Linkage involves finding a link between a pair of records, either through an exact match or through an approximate match Linkage is useful for data cleansing
E N D
Data Quality Class 7
Agenda • Record Linkage • Data Cleansing
Record Linkage • Critical component of data quality applications • Linkage involves finding a link between a pair of records, either through an exact match or through an approximate match • Linkage is useful for • data cleansing • data correction • enhancement • householding
Record Linkage 2 • Two records are linked when they match with enough weighted similarity • Matching can be a combination of exact matching on particular fields, to approximate matching with some degree of similarity • For example: two customer records can be linked via an account number, if account numbers are uniquely assigned to customers
Record Linkage 3 • Example: • David Loshin • 633 Evergreen Terrace • Montclair, NJ • 201-765-8293 vs. • H. David Loshin • 633 Evergreen • Montclair, NJ • 201-765-8293 • In this case, we can establish a link based solely on telephone number
Record Linkage 4 • Frequently, pivot attributes exists and can be used for exact matching (such as social security number, account number, telephone number, student ID) • Often, there is no pivot attribute, and therefore approximate matching techniques must be used • Approximate matching is a process of looking for similarity
Deduplification • Duplicates are sets of records that represent the same entity • Duplicate elimination is the process of finding records that belong to the same equivalence class based on similarity above a specified threshold • When duplicates are found, one record is created to replace all the duplicates • That record is composed of the “best” data gleaned from all equivalence class members
Merge/Purge • Similar to duplicate elimination • Application used when merging two or more different database • Goal: find all the records associated with the same entity • Example: when two banks merge, find all accounts owned by the same person in both banks
Enhancement • We can enhance data by merging it with other data sets • Linkage may be based on profile information extracted from each record
Householding • We link on address as well as some permutation of the entity name • Look to establish a location match and some relation between entities at that location
Similarity • We use measures of similarity to establish measures and thresholds for linkage • Most interesting areas for similarity are in string matching
Similarity and Distance • To link 2 values, we need a notion of distance and similarity • Obviously, similarity is measured as a function of distance • All items to be explored are projected into an n-dimensional space • Distance can be measured based on traditional distance functions, or specialized distance functions
Distance Measures • Euclidean distance: the distance between any two points is the square root of the sum of the squares of the difference between the corresponding data points • City Block distance: this measures the distance in terms of walking along the line segments in a grid • Exact-match distance: For variables whose values are not allocated along continuous dimensions, the values assigned are either 1 for an exact match, or 0 for not a match
String Similarity • How do we characterize similarity between strings? • We can see it with our eyes, or hear it inside our heads: • example: Smith, Smythe • example: John and Mary Jones, John Jones • How do we transfer this to automation?
Edit Distance • Edit distance operations • Insertion, where an extra character is inserted into the string • Deletion, where a character has been removed from the string • Transposition, in which two characters are reversed in their sequence • Substitution, which is an insertion followed by a deletion
Edit Distance 2 • Strings with a small edit distance are likely to be similar • Edit distance is measured as a count of edit distance operations from one string to another: • internatianl • transposition • internatinal • insertion • international • internatianl to international has an edit distance of 2
Computing Edit Distance • Use dynamic programming • Given two strings, x = x1x2 .. xn, and y = y1y2 ..ym • the edit distance f(i, j) is computed as the best match of two substrings x1x2 .. xi and y1y2 ..yjwhere • f(0,0) = 0 • f(i, j) = min[f(i-1, j) + 1), f(i, j-1) +1, f(i-1, j-1) + d(xi,yj)]
Phonetic Similarity • Words that sound the same may be misspelled • Phonetic similarity reduces the complexity of the strings • Effectively, it compresses the strings, with some loss, then performs a similarity test
Soundex • The first character of the name string is retained, and then numbers are assigned to following characters • The numbers are assigned using this breakdown: • 1 = B P F V • 2 = C S K G J Q X Z • 3 = D T • 4 = L • 5 = M N • 6 = R • Vowels are ignored
Soundex – 2 • Examples: • Fitzgerald = F322 • Smith = S530 • Smythe = S530 • Loshin = L250
Soundex – 3 • Regular soundex is flawed • geared towards English names • can’t account for incorrect first letter • longer names are truncated • Options • encode entire string, not just the first 4 consonant sounds • reverse the words, then encode both forward and backward • Use different phonetic encodings
Other Phonetic Encoding • NYSIIS • Similar to soundex, but • does not use numeric encoding, instead uses mapping to smaller set of consonants • replaces all vowels with “A” • Metaphone • Tries to be more exact with multiple-letter sounds (sh, tio, th, etc.)
N-gramming • Another means of representing a “compressed” form of a string • An n-gram is a chunk of text of length n • We slide a window of size n across a string to generate the set of n-grams for that string
N-gram Example • INTERNATIONAL is comprised of these 2-grams: • IN, NT, TE, ER, RN, NA, AT, TI, IO,ON, NA, AL • Compare this with INTERNATIANL • IN, NT, TE, ER, RN, NA, AT, TI, IA, AN, NL • These two strings share 8 2-grams
N-gram Measures • 1) Absolute overlap – this is the absolute ratio of matching n-grams to the total number of n-grams. This is equal to (2 (|ngram(X) ngram(Y)|)) (|ngram(X)| + |ngram(Y)|) • 2) Source overlap – this is the number of matching n-grams divided by the number of n-grams in the source string X: (|ngram(X) ngram(Y)|) |ngram(X)| • 3) Search overlap: this is the number of matching n-grams divided by the number of n-grams in the search string Y: (|ngram(X) ngram(Y)|) |ngram(Y)|
Thresholding • We have functions to determine similarity • Tune these functions to return a value between 0 and 1 (i.e., 0% and 100%) • 0 = absolutely no match • 1 = absolute match
Thresholding 2 • For any pair of values, we can apply the similarity function and get a score • For different kinds of data, we can set a minimum threshold, above which the two values are said to match • For example, for an n-gram match, we can set the threshold at 75%
Thresholding 3 • When comparing 2 records, we apply the similarity functions in a pairwise manner to each of the important attribute value pairs • We assign some weight to each attribute value pair score • Total similarity score for each pair of records is the sum of individual attribute value pairs scores, adjusted by weight • We can assign an overall threshold indicating a match
Thresholds and Matches • We actually assign two thresholds, to partition scores into three groups: • Definite matches • User review • Definite no-matches • Scores above the high threshold are definite matches • Scores between low and high thresholds are user review matches • All others are not matches
Naive Algorithm • Goal: find all possible matches between any pair of records • Approach: Perform a pairwise similarity score for all record pairs • Downside: O(n2)
Improvements • Desire to reduce the number of candidates for pairwise similarity testing • We can use fast matching for fixed pivot values when merging data sets • We use a concept called blocking to reduce the search set
Fast Matching for Merging • We have looked at Bloom filters for fast matching • We can load all (recordID,attribute value) tuples from the first data set into the Bloom filter O(n) • We can then test each of the values from the second set to see if the pivot matches in the first set
Blocking • Goal: reduce number of match candidates by using some form of “compression” on the records to be linked • Example: phonetic encodings • Example: limit by fixing one of the attributes • Example: find a pivot attribute and use that for affinity
Blocking 2 • Example: if we want to perform householding on a mailing list: • Block by ZIP code, since we don’t expect to find members of the same household living in different locations
Linkage Algorithms • All linkage algorithms make use of these ideas: • A blocking mechanism • We must choose based on the data available and what makes sense for the application • Similarity functions • Every data type and data domain should have an associated similarity function, even if it is a 0/1 exact match test • Weights for the similarity functions • This requires more insight into the problem, to see how each attribute’s scores should weigh for the overall score