270 likes | 433 Views
Data Cleanup: Unlock the potential at a corporate scale. Thibault Dambrine. IT professional for 25 years Network Designer ETL Data Warehouse Analyst Interface Specialist ERP Developer Data Quality Experience:
E N D
ThibaultDambrine • IT professional for 25 years • Network Designer • ETL Data Warehouse Analyst • Interface Specialist • ERP Developer • Data Quality Experience: • Tasked to work on the pre-conversion data cleanup project during the Shell JDE to SAP transition
Introduction • Premise: “What would a Company-wide Data Quality Initiative look like?” • Base: • Experience setting up a data cleanup team, prior to a JDE to SAP data conversion • Realizing the potential for increasing the data value within the corporation
Defining Data Quality • Intrinsic Data Quality: • Accuracy, Completeness, Uniqueness • Reliability, Security and Accessibility • Contextual Data Quality: • Timeliness, Relevance • Inter-operability, consistency of identifiers • Accessibility and Representational Data Quality • Ease of understanding • Consistency of identifiers • Consistency in structures
$ Quantifying the cost of Quality: The 1-10-100 Rule - Additional cost = Less Competitive Business $ Prevention Cost $ Correction Cost Failure Cost The 1-10-100 Quality Cost Rule
The 1-10-100 Rule A Data Quality Example: The 1-10-100 Quality Rule Applied to mailing Data: It costs: • $1.00 to verify data at data entry time • $10.00 to clean the data after the fact • $100.00 to mop up errors caused by bad data • Packages mailed in the wrong address • Lost revenue • Lost customers • Bad (sloppy) reputation • Additional carrying cost for bad data
Data Quality: The Up Side! Trust • Consistent data inquiry results build confidence in information systems • Tractability across Business and IT domains • Consistent data identifiers • promotes internal cross-department reporting • Consistency • Confidence in results Productivity • Removing redundant or near-redundant data • Maximizes re-use of data • Reduces the amount of data being processed • Reduces errors Reliability Consistently good quality data is data you can count on!
A Data Cleanup Initiative • Where to start? • Who will enforce such quality initiatives? • How will the data quality be maintained on on-going basis?
DQB Task 1: Identify Sponsor and Data Quality Boss To Identify sponsor: • Communicate clear understanding of the cost of bad data • Use the 1-10-100 rule • Initiative has to be backed with • Money • Authority • Responsibility
Identify Data Quality Boss – DQB • Must be knowledgeable on data quality • Must be knowledgeable on the Business • Will be responsible for data quality • Will have authority to make changes Note: Responsibility without authority will not work
DQB Task 2: Identify Data Sets • Identify/inventory high-level data sets e.g. • CMDB • ERP • Master Data e.g. • Customer Master • Item Master • Transaction Data e.g. • PO’s & Invoices • Inventory movements • Assign data sets to departments, potential lists of Data owners • Note: The final data owners may not be the one initially penciled in at this stage
DQB Task 3: Identify Business Side Data Owners • Data Owners will effectively be the local, more granular, Data Quality Bosses. Again, they will need to • Be responsible for the data at their level • Have authority to request changes at their level • Have bottom up knowledge of the data, understand what “should be there” • Setup meetings with every department, in line with the Data Sets identified, with aim of coming up with a set of Data Owners • Have a presentation ready • Look for individuals who have been in the Business for a long time, who are well respected, who understand the data, the dependencies, and know who to talk to, to get answers, from the bottom up
DQB Task 4: Request from Data Owners the “Data Quality Specification” or DQS • DQS is a document that spells out the data quality rules e.g. • No duplicates or near-duplicates • Data older than x years should be purged or archived • Data Dependencies such as no detail without a header or no invoice without a PO • No duplicates • Consistency e.g. data format • Quality audit e.g. postal code matches address • More… • Note: Some rules will apply in all DQS Documents • There is value in sharing, reviewing and updating the DQS over time. • Data quality issues are not always apparent until a first cut of data is cleaned up
DQS (part of Task 4 ) Also look for: • Data Islands • Lack of consistent identifiers inhibit a single view of the big picture • Data Opportunity • Could correlated data sets be more useful to the Business? • Data Surprises • Misplaced Data • Information buried in free-form fields
DQB Task 5: Build IT Data Quality Team • Data Quality • Cannot be a “side job” or a part-time task • Must be staffed with individuals who understand data. Best candidates • Proficient in SQL, data extract techniques • Understand ETL tools and techniques • Are detailed-oriented • Experience: Data Warehouse staff is good fishing grounds for such individuals
Mid-Presentation Recap: All the Ingredients are now in placeThe real work can start! • Name Data Quality Sponsor & Data Quality Boss (DQB) • Identify Data Sets • Data Quality Owners • Data Quality Specifications (the DQ Roadmaps) • IT Data Cleansing Team
Introducing: the Data Quality Cycle • We now have • a sponsor • Identified data sets and data owners • They have produced Data Quality Specifications • An IT Team ready to work on the first Data Quality measurements, based on the DQS • Next step: Initiate the cleanup • Not a single iteration but one that will be repeated in a cyclic fashion
Data Quality Cycle - Corporate Version Analyze Data Improve Data Continuous Improvement Monitor Progress Formalize Schedule Make Progress VISIBLE
Step 1: Identify Bad DataBad Data Definition: Does not adhere to DQS • Coordinate meetings to translate DQS documents into a suite of repeatable data cleansing procedures • Very important that these procedures should be repeatable, schedulable on regular basis • Initial Focus: Identify Bad Data • Bad data(does not adhere to DQS), • Inconsistent data • Old Data • Note: DQS will spell out rules for “old” and “inconsistent” • Ensure results are reported in a format readable by Management at executive level. This initiative has to be VISIBLE
Step 2: Data Cleansing • Data Cleansing can be done in two ways: • Automated, IT based cleanup • Business-based, manual cleanup • Once the bad data is identified, determine who must do what • Business-based, • manual cleanup appropriate for more subtle tasks, e.g. to determine which of two duplicates identified should be kept. These tasks may require additional research, phone calls etc. • Automated, • IT based cleanup good for simpler tasks e.g. making telephone number formats consistent • Can be also sub-contracted to specialized data quality companies
IT-based Data Cleansing and Outsource Considerations • Data cleansing may take valuable time from the Business, which is not available – Data Cleanup effort may suffer as a result • Not all data cleansing is a simple SQL • Not all data is most confidential When considering data cleansing tasks, look at all possible options • Outsourcing some data cleansing tasks may be more economical than doing it all in-house
Step 3: Measure Progress • All programs, procedures written with the aim of identifying data quality issues should • Be stored, like any other programming assets • Be repeatable and be schedulable • Provide aggregate measures to describe the data cleanup status e.g. • X duplicates • Y old records • Z invoices without PO • Progress • Has to be measured in a published dashboard • Has to be visible by the entire organization to provide a sense of value
Step 4: Data Hygiene: * Schedule the Cleanup/Review Tasks * Ensure results are visible • Bad data is created EVERY DAY • Data quality is an on-going effort • Establish, publish a schedule, part of the dashboard • Ensure there is visibility and accountability to ensure the levels of bad data • are going down with time • Or are kept at a minimal level
Step 5: Sharpen the Saw Once the data cleanup cycle is established • Review Results • Review DQS documents periodically (setup schedule) • Get Business input • Improve process • Give input on improvements to be made • Ask the Business to come up with performance improvement measures born from the Data Quality initiative
Conclusion Two sets of Five activities best define the Data Quality The Foundation Setup • Identify Data Quality Sponsor & Data Quality Boss (DQB) • Identify Data Sets • Identify Business Side Data Owners • Define Data Quality Specifications (DQS) - the DQ Roadmaps • Appoint IT Data Cleansing Team The Data Quality Cycle – Ongoing • Identify Bad Data • Initiate Data Cleansing • Measure Progress • Initiate Data Hygiene, Data Cleanup Cycle • Schedule Cleanup/Reviews • Ensure progress visible • Sharpen the Saw
Links • How to improve Data Qualityhttp://www.informit.com/articles/article.aspx?p=399325&seqNum=3 • Predefined data quality rule definitions http://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ia.quality.doc%2Ftopics%2Fpdr_predef.html • Creating Effective Business Rules: Interview with Graham Witthttp://dataqualitypro.com/data-quality-pro-blog/how-to-create-effective-business-rules-graham-witt • Gartner Magic Quadrant on Data Quality Tools – “Demand for data quality tools remains strong” http://www.citia.co.uk/content/files/50_161-377.pdf