210 likes | 544 Views
Dirty Data - Can You Afford It?. Faron Kincheloe, Baylor University. Types of Dirty Data. “Too many wrong mistakes” – Yogi Berra Nonidentical duplicates (names & addresses) Missing data (gender) Non-standard entities (church names). Identity Crisis. Ever Get 2 of the Same Piece of Mail?.
E N D
Dirty Data - Can You Afford It? Faron Kincheloe, Baylor University
Types of Dirty Data • “Too many wrong mistakes” – Yogi Berra • Nonidentical duplicates (names & addresses) • Missing data (gender) • Non-standard entities (church names)
Identity Crisis Ever Get 2 of the Same Piece of Mail? Laura Lauren • 2 Girls • 5 Viewbooks • Daughters of Baylor administrators
Duplication Dirt Devils • Character recognition • Misspelling • Marriage/Divorce • Middle name preference • Nicknames • First & last name reversal • Electronic downloads • Delimiters in data • Variable field lengths
How Bad Was the Problem? • 1.15% – 725 out of 62,000 (1450 pairs) • $2175 at $3 per viewbook • 3% yield on 725 prospects = 21 students • $15,000 per year per student not enrolled • $315,000 upper limit • Misapplied data • Lost credibility
Where’s My Church? How many ways can you say, “First Baptist?”
It’s Not My Mess!!! • Why should I clean it up? • Overall financial impact • Data expertise • “Data Mine” instead of “Not Mine” • Improved accuracy • Partner with data owners
Cleaning Tools • DataFlux dfPower Suite (GUI) • SAS Data Quality Server (Code module)
ACCOUNT NUMBER ADDRESS CITY DATE E-MAIL NAME ORGANIZATION PHONE STATE TEXT ZIP Data Knowledge Definitions
Cleaning Functions • PARSE – Creates delimited text string • GENDER – Estimates gender based on name • MATCH – Creates index for matching • SCHEME – Standardizes data
Using Match Codes • Prepare the data • Create multiple match codes • Create groups to target specific matches • Merge groups together • Remove repeated rows and sort by clusters • Print list for cleanup • Mark records for future match tests
Scheming Against the Data • Create scheme • Church Names • Cities • Customize & finalize schemes • Apply schemes to data • Use SAS code to override exceptions • Compare with original entries • Update student information system • Automate the process
Questions? Faron_Kincheloe@baylor.edu