200 likes | 366 Views
Fuzzy Duplicates Analysis with ACL Prepared by: Kevin Legere Date: April 3 rd , 2013. Agenda. Overview Example FUZZYDUP command OMIT() Function Script Editor and RECOFFSET Q&A. Overview. What is a "Fuzzy Duplicate"? Match based on criteria where the values are not exact but very close
E N D
Fuzzy Duplicates Analysis with ACL Prepared by: Kevin Legere Date: April 3rd, 2013
Agenda • Overview • Example • FUZZYDUP command • OMIT() Function • Script Editor and RECOFFSET • Q&A
Overview • What is a "Fuzzy Duplicate"? • Match based on criteria where the values are not exact but very close • EX: "ACL Services" and "ACL Service" • Typically used for: • Keyword matching • Invoice Number matching • Vendor Name matching* • Employee Name matching • Can be simple or complex • Completely depends on your approach and desired accuracy * focus for this presentation
Overview • Simple Match Examples: • Exact or 100% match • "ACL" = "ACL" • Force Upper or Lower case • "ACL" = UPPER("acl") • "acl" = LOWER("ACL") • Removal of special characters • "ACL" = EXCLUDE("*ACL." "!@#$%^&*().") • Only compare numbers or letters • "ACL" = INCLUDE(UPPER("ACL123") "ABCDEFGHIJKLMNOPQRSTUVWXYZ") • "123" = INCLUDE("ACL123" "1234567890")
Overview • Complex Match Examples: • Removal of company type indicators (LLC, INC, LTD, etc) • "ACL Services Ltd." = "ACL Services" • Percent of word match AKA letter by letter • "ACL Services" "ACL Service" • 11/12 character match or 91.6% match • Word by Word* • "ACL Services" "ACL Champions" • "ACL" "ACL" • "Services" "Champions" • = 50% match • Levenshtein distance • Sounds like • NYSIIS *Most used by ACL Consultants
Vendor Master Analysis • Fuzzy Duplicates on Vendor Name • Possible Risk • Payments are being sent to more than one vendor • May not involve risk. The desire can be to normalize the vendor master list to ensure that duplicates do not exist. • Ideally, one unique vendor should exist in your vendor master list with one or more address records in your vendor address table
Vendor Master Analysis • Sample file contains 75 vendors • Only Vendor Code and Vendor Name • Where do you start for Vendor Name matching? • Look for exact duplicates • Focus on Simple matching • Sort or Summarize!
Vendor Master Analysis • Step 1:Summarize your Vendor Master File • Choose Vendor Name as your key field • Add Vendor Code as the Other Fields for Summarizing • Be sure to check "Presort"
Vendor Master Analysis • Step 2:Quickly comb over the data to identify a common trend. • We will focus on this issue, in the sample data: • Create a computed field that corrects the trend (or cleans the data).
Vendor Master Analysis • Functions used in Default Value text box: INCLUDE(UPPER(ALLTRIM(Vendor_Name)) 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') • Within ACL, the computed field will return the following:
Vendor Master Analysis • Step 3:Perform a Duplicates Command on the computed field
Vendor Master Analysis • Results are as follows:
FUZZYDUP command • ACL 9.3 has new features that make Fuzzy Duplicate analysis easier • FUZZYDUPcommand • OMIT() function • ISFUZZYDUP() function • LEVDIST() function • Important parameters to understand • Levenshtein Distance • Difference Percentage
FUZZYDUP command • Syntax • FUZZYDUP ON {key_field} <OTHERfields> {LEVDISTANCEvalue} <DIFFPCTvalue><RESULTSIZEvalue> <EXACT> TO table_name • Example • FUZZYDUP ON Vendor_NameOTHER ALL LEVDISTANCE 2 DIFFPCT 50 TO My_Results • Levenshtein Distance (LEVDISTANCE) • The number of edits required to make the strings equal • EX: "Smith" and "Smythe" have a Levenshtein Distance of 2 • Difference Percentage (DIFFPCT) • The threshold for percentage difference between two strings • EX:"Smith" and "Smythe" have a Percentage Difference of 40% • (2/5) * 100%
OMIT() Function • When Do I use OMIT()? • When you want to refine fuzzy duplicate analysis • Look for repeating strings you want to remove from your Vendor Name field • Syntax • OMIT(string1, string2 <,case_sensitive>) • Specify T to make substrings specified for removal case-sensitive, or F to ignore case • Example • OMIT(Vendor_Name " Ltd, Inc, Corp, Corporation" F)
Contact Information Kevin Legere Implementation Consultant ACL Services Ltd. 1550 Alberni Street, Vancouver, BC, Canada V6G 1A5 kevin_legere@acl.com | @aclkevin www.acl.com/linkedin | www.acl.com/twitter | www.acl.com/facebook