BIN401 Master Data Management – Merging from Multiple Sources Dejan Sarka, Solid Quality Mentors
Agenda • The Problem • Soundex, Difference and Beyond • MDS String Similarity Functions • Avoiding Large Cross Joins • Fuzzy Lookup • No Master Data Source Defined
The Problem (2) • In an enterprise, multiple sources of master data (e.g., customer data) can exist • There is no common key to make simple joins • Data merging has to be done based on similarities of strings • Name, address, e-mail, city, etc. • Which similarity algorithm to use?
The Problem (3) • For approximate merging, any row from one side can be joined to any row from the other side • We have a cross join • Even small data sets can produce performance problems • E.g., a cross join of 18,484 rows with 18,484 rows (AdventureWorksDW vTargetMail) means 341,658,256 rows!
The Problem (4) • Data can change in any source • We would like to merge changes only • Can have master data defined • A privileged, trustworthy source • Update other sources to values from master source • No master data defined • All sources have equal priority
Soundex and Difference • Soundex takes a word, such as a person's name, as input and produces a character string that identifies a set of words that are (roughly) phonetically alike • Phonetically alike means US English phonetically alike • Difference returns an integer indicating the difference between the SOUNDEX values of two strings • The value is between 0 (no match) and 4 (strong match) • Soundex is very basic
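As an illustration, the American Soundex scheme behind T-SQL's SOUNDEX can be sketched in Python (an illustrative reimplementation, not the actual T-SQL source):

```python
def soundex(name):
    # American Soundex: keep the first letter, encode the rest as digits,
    # collapse adjacent equal codes, ignore H/W, pad/truncate to 4 chars
    codes = {c: d for d, letters in
             {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
              "4": "L", "5": "MN", "6": "R"}.items()
             for c in letters}
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if ch in "HW":
            continue              # H and W do not separate equal codes
        if code and code != prev:
            result += code
        prev = code               # vowels reset prev, so repeats re-encode
    return (result + "000")[:4]
```

For example, soundex("Robert") and soundex("Rupert") both yield R163, which is why DIFFERENCE rates them as a strong match.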
More Soundex Algorithms • There are many additional algorithms developed for comparing strings phonetically • T-SQL Soundex is actually American Soundex System • Russell Soundex Code, Henry Soundex Code • Daitch-Mokotoff Soundex System • New York State Identification and Intelligence System (NYSIIS) Phonetic Encoder • Some T-SQL functions already written
MDS Similarity Algorithms • Master Data Services implements a couple of well-known similarity algorithms • Levenshtein (aka Edit Distance) • Jaccard index • Jaro-Winkler • Simil (aka longest common substring, aka Ratcliff-Obershelp) • The SSIS Fuzzy Lookup algorithm should be implemented in MDS through T-SQL CLR functions
Levenshtein Distance • Levenshtein (edit) distance measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into the other • E.g., the distance between kitten and sitting is 3 • kitten → sitten (substitution of 's' for 'k') • sitten → sittin (substitution of 'i' for 'e') • sittin → sitting (insert 'g' at the end) • Similarity is normalized between 0 and 1
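The kitten/sitting example above can be reproduced with a standard dynamic-programming sketch (illustrative Python; MDS exposes this as an mdq function, not this code):

```python
def levenshtein(a, b):
    # Dynamic programming over a rolling row of the edit-distance matrix
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a, b):
    # Normalize the distance into [0, 1] by the longer string's length
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

levenshtein("kitten", "sitting") returns 3, matching the three edits listed on the slide.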
Jaccard Index • The Jaccard index (similarity coefficient) measures similarity between sample sets • The size of the intersection divided by the size of the union of the sample sets
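The definition above is a one-liner over any two token sets (a minimal sketch; whether tokens are characters, n-grams, or words is the caller's choice):

```python
def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B| over any two iterables of tokens
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For example, on word tokens, jaccard("dejan sarka".split(), "sarka dejan ltd".split()) gives 2/3: two shared words out of three distinct ones.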
Jaro Distance • Jaro distance combines matches and transpositions: dj = 0 when m = 0, otherwise dj = 1/3 · (m/|s1| + m/|s2| + (m − t)/m) • m is the number of matching characters • t is the number of transpositions (half the number of matching characters that appear in a different order) • Two characters match if they are no farther apart than ⌊max(|s1|, |s2|)/2⌋ − 1 positions
Jaro-Winkler Distance • Jaro-Winkler distance uses a prefix scale p which gives more favorable ratings to strings that match from the beginning: dw = dj + ℓ · p · (1 − dj) • dj is the Jaro distance • ℓ is the length of the common prefix at the start of the strings, up to a maximum of 4 characters • p is a scaling factor for common prefixes • p should not exceed 0.25, otherwise the distance can become larger than 1 (usually p = 0.1)
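Both formulas can be sketched together in Python (illustrative code following the standard definitions, not the MDS implementation):

```python
def jaro(s1, s2):
    # Jaro distance: count matching characters (m) within the window,
    # then transpositions (t) among the matched characters
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0
    for i, c in enumerate(s1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if c != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost the Jaro score by the common prefix (at most 4 characters)
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)
```

The classic example: jaro("MARTHA", "MARHTA") is 17/18 ≈ 0.944 (m = 6, t = 1), and the 3-character common prefix lifts Jaro-Winkler to ≈ 0.961.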
Simil Algorithm • The Simil algorithm looks for the longest common substring in two strings • Then it searches for the next longest common substring in the remainders to the left and right of the match • It continues recursively until no more common substrings are found • It calculates a coefficient between 0 and 1 by dividing twice the total length of the matched substrings by the sum of the lengths of the two strings
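The recursion above can be sketched directly (an illustrative Python version; the brute-force longest-common-substring search is fine for short strings like names and addresses):

```python
def simil(s1, s2):
    # Ratcliff/Obershelp: twice the total length of recursively matched
    # common substrings, divided by the combined length of both strings
    def longest_common(a, b):
        best = (0, 0, 0)                      # (length, start_a, start_b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        return best

    def matched(a, b):
        n, i, j = longest_common(a, b)
        if n == 0:
            return 0
        # recurse into the left and right remainders of both strings
        return n + matched(a[:i], b[:j]) + matched(a[i + n:], b[j + n:])

    total = len(s1) + len(s2)
    return 2 * matched(s1, s2) / total if total else 1.0
```

Python's standard difflib.SequenceMatcher.ratio() is based on the same Ratcliff/Obershelp idea.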
SSIS Fuzzy Lookup • Fuzzy Lookup and Fuzzy Grouping use a custom, domain- and language-independent distance function • It takes into account the edit distance, the number of common tokens (n-grams), token order, and relative token frequencies • It is not as easily misled by transpositions and can detect higher-level patterns than an approach that uses only edit distance
Divide and Conquer • Divide et impera solution: merge a portion of the rows at a time • Start with an exact merge (inner join) • Then merge similar strings • Then merge less similar strings • … • Then merge manually • How do we define which strings are similar, when similarity is exactly what we are searching for?
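The staged approach above can be sketched as a loop over decreasing similarity thresholds (a hypothetical helper; the similarity function is pluggable, e.g. one of the measures from the previous slides):

```python
def staged_merge(left, right, similarity, thresholds=(1.0, 0.9, 0.8)):
    # Pass 1 demands exact matches (threshold 1.0); later passes accept
    # progressively looser matches; whatever remains goes to manual review
    matched = {}
    todo_left = set(range(len(left)))
    todo_right = set(range(len(right)))
    for th in thresholds:
        for i in sorted(todo_left):
            best = None
            for j in todo_right:
                s = similarity(left[i], right[j])
                if s >= th and (best is None or s > best[0]):
                    best = (s, j)
            if best is not None:
                matched[i] = best[1]
                todo_left.discard(i)
                todo_right.discard(best[1])
    return matched, todo_left       # todo_left: rows left for manual merge
```

Each pass only considers rows still unmatched, so the expensive comparisons shrink with every round.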
Pre-Similarity (1) • The Difference T-SQL function • Language dependent • Use bitmasks (Bouche, Sommarskog) • Less frequent characters get higher weights (e.g., E gets 21 and Z gets 225 in English) • Build the frequency table dynamically to get language-independent weights • It is hard to find good matches • Neither too many rows (nearly a cross join) • Nor too few rows (the correct match gets lost)
Pre-Similarity (2) • Tokenize strings into substrings of length n • These tokens are called n-grams • MDS has the mdq.NGrams function • Calculate the overall frequency of each token • Compare only strings that share at least m common n-grams with less than p frequency • You can tweak m, n and p in a loop • Start with stricter matching
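A minimal Python sketch of this pre-filter (the tokenizer mimics the spirit of mdq.NGrams; the frequency threshold p from the slide is omitted for brevity):

```python
from collections import defaultdict

def ngrams(s, n=3):
    # Overlapping character n-grams of a lower-cased string
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def candidate_pairs(left, right, n=3, m=2):
    # Inverted index from n-gram to row numbers on the right side;
    # only pairs sharing at least m n-grams are compared further,
    # which avoids the full cross join
    index = defaultdict(set)
    for j, s in enumerate(right):
        for g in ngrams(s, n):
            index[g].add(j)
    pairs = []
    for i, s in enumerate(left):
        counts = defaultdict(int)
        for g in ngrams(s, n):
            for j in index[g]:
                counts[j] += 1
        pairs += [(i, j) for j, c in counts.items() if c >= m]
    return pairs
```

Only the surviving pairs need an expensive similarity computation such as Jaro-Winkler or Simil.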
Conclusion • String similarity matching is a complex problem • Do not count on 100% exact matches • Try and learn, and then • Try and learn, and then • Try and learn, and then… • SSIS Fuzzy Lookup does a great job • Hope we get it in MDS soon
No Master Source • The problem: we do not have a master source defined • All sources are peers • We do not have a master identification for ID mapping • In addition, we need to do a continuous merge on changes since the previous merge only • The merged data can become the master (privileged) source
The Merging Algorithm (1) • In each source we need a stable PK • Except for a one-time transfer • In each source we define n common columns; not every source must have all of them • For each source we create a “Shadow” table that includes the original PK (O_Pk), a new surrogate PK (S_Pk), the common columns and a delete flag • We create another common table, “StagingData”, that includes the same columns, except O_Pk
The Merging Algorithm (2) • In the first step we transfer (and transform, if needed) all rows from each source into its shadow table and into StagingData • We generate S_Pk; now we have a common PK • Then we merge rows in StagingData using different algorithms • Common columns equal, priority of source, … • When we merge rows, we correct the S_Pks in the shadow tables • Finally, we have to merge the remaining rows manually • We can merge using columns from different rows • We can use different sorts (even functions), searches, …
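A minimal sketch of the first two steps, with Python dictionaries standing in for the shadow and StagingData tables (table and column names follow the slides; the merge rule shown is the simplest one, “common columns equal”, and the column set name/email is an illustrative assumption):

```python
import itertools

surrogate = itertools.count(1)          # generator for the common S_Pk

def load_source(rows, shadow, staging):
    # Step 1: copy each source row into its shadow table and into
    # StagingData, assigning a new surrogate key shared by both
    for row in rows:
        s_pk = next(surrogate)
        shadow.append({"O_Pk": row["id"], "S_Pk": s_pk,
                       "name": row["name"], "email": row["email"],
                       "deleted": False})
        staging.append({"S_Pk": s_pk, "name": row["name"],
                        "email": row["email"]})

def merge_equal(staging, shadows, key=("name", "email")):
    # Step 2: collapse StagingData rows whose common columns are equal
    # and remap S_Pks in every shadow table to the surviving row
    seen, merged, remap = {}, [], {}
    for row in staging:
        k = tuple(row[c].lower() for c in key)
        if k in seen:
            remap[row["S_Pk"]] = seen[k]
        else:
            seen[k] = row["S_Pk"]
            merged.append(row)
    for shadow in shadows:
        for row in shadow:
            row["S_Pk"] = remap.get(row["S_Pk"], row["S_Pk"])
    staging[:] = merged
```

After the remap, the surviving S_Pk links the same customer across all shadow tables, which is exactly the common key the sources were missing.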
The Merging Algorithm (3) • From here on we work on a daily basis, but on changed data only • An insert on the source means an insert into the shadow and StagingData tables, and then merging again • An update on the source means an update in the shadow table and: • An update in StagingData, if the source has priority or if we have defined an automatic algorithm • Otherwise an insert in StagingData, followed by the merging process • A delete on the source means setting the deleted flag in the shadow table to true • When the row is flagged in all shadow tables, we flag it in StagingData as well
The Merging Algorithm (4) • Merged, cleansed rows are copied from StagingData to a new table – MasterData • We define MasterData as another source; we can update the data there manually, as in any other source • We can even write back to any source • MasterData is then the foundation for a centralized MDM solution • And for CRM systems, etc. • Aka an Operational Data Store (ODS)
Review • The Problem • Soundex, Difference and Beyond • MDS String Similarity Functions • Avoiding Large Cross Joins • Fuzzy Lookup • No Master Data Source Defined
References • Books: • Carlo Batini, Monica Scannapieco: Data Quality – Concepts, Methodologies and Techniques • Davide Mauri, Gianluca Hotz, Dejan Sarka: Master Data Management with SQL Server 2008 R2 • Links: • Beyond SoundEx – Functions for Fuzzy Searching in MS SQL Server • Presentations: • Presentations at various conferences by Solid Quality Mentors
Session Evaluations Tell us what you think, and you could win! All evaluations submitted are automatically entered into a daily prize draw* Sign-in to the Schedule Builder at http://europe.msteched.com/topic/list/ * Details of prize draw rules can be obtained from the Information Desk.
© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.