IBM Information Server Data Quality Everywhere
Increasing Focus on Data Quality • Businesses are beginning to realize that data quality issues not only cost them time and money, but also inhibit their ability to address core strategic projects • More and more businesses are establishing programs for data quality, to measure and improve the reliability of information • Analysts contend that companies with focused data quality programs will find more opportunities to outperform their peers
Business Drivers for Investment Depend on Data Quality • Empowering risk & compliance initiatives with the information they require • Optimizing revenue opportunities by ensuring effective and efficient interactions with customers, partners, and suppliers • Enabling collaborative business processes with consistent and trustworthy information • Reducing the total cost of ownership for maintaining consistent information across the enterprise
What is the Impact of Poor Data Quality? • "If you look at...any business function in your company, you're going to find some direct cost there attributed to poor data quality." - Gartner 2006 • Lost Sales Opportunity • “Hard” Losses: SKU misplaced or hard to find; out of stocks attributed to the store (1.5%, 1.7%) • “Soft” Losses: lost potential for cross-sell and up-sell (staff not trained or available); reduced store visit frequency; abandoned carts (poor service or excessive queues) (2-4%, 1-3%, 1-2%) • Total: 7.2%-12% • Source: GMA/FMI/CIES 2003 (US grocery), ECR Europe 2003, Lineraires.com, California Management Review, IBM case studies, interviews and IBM Institute for Business Value analysis
Data Quality is a Subjective Business Standard • Data = facts used as a basis for decision making, suitable for storage on a computer • Quality = the general standard or grade of something • Data Quality = a subjective standard used to determine if a set of facts is suitable for a particular business purpose • Relevant? Accurate? Valid? Complete? • Ultimately, Data Quality = Trust
So, What Constitutes Data Quality? • Data is standardized • Data is fit for purpose (conforms to rules) • Each record is unique • View of information is complete • Records are certified against authoritative sources • Lineage is understood • Data quality is measured over time
Common Data Problems • Lack of information standards • Different formats & structures across different systems • Data surprises in individual fields • Data misplaced in the database • Information buried in free-form fields • Data myopia • Lack of consistent identifiers inhibits a single view • The redundancy nightmare • Duplicate records with a lack of standards

Examples shown on the slide:

Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116

Name                     Tax ID        Telephone
J Smith DBA Lime Cons.   228-02-1975   6173380300
Williams & Co. C/O Bill  025-37-1888   415-392-2000
1st Natl Provident       34-2671434    3380321
HP 15 State St.          508-466-1200  Orlando

WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)

19-84-103 RS232 Cable 6' M-F CandS CS-89641 6 ft. Cable Male-F, RS232 #87951 C&SUCH6 Male/Female 25 PIN 6 Foot Cable

90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
Why Does this Problem Exist? • Most enterprises run distinct sales, services, marketing, manufacturing, and financial applications, each with its own “master” reference data • No one system is the universally agreed-to system of record • Enterprise application vendors do not guarantee a complete & accurate integrated view; they point to their dependence on the quality of the raw input data • Data quality continues to erode at the point of entry, though it is not a data entry problem
What Do You Need to Establish a Data Quality Program? • A foundation platform that centralizes quality rules and provides auditable data quality • Business-driven, data-centric design environment for data quality rules • An ongoing process for data quality • A way to measure quality over time • Universal deployment of quality rules across all points of entry • Data quality ownership and data governance • Management sponsorship and a corporate mandate for data quality improvement
IBM Information Server: A Platform for Data Quality • Unified Deployment • Understand: discover, model, and govern information structure and content • Cleanse: standardize, merge, and correct information • Transform: combine and restructure information for new uses • Deliver: synchronize, virtualize, and move information for in-line delivery • Unified Metadata Management • Parallel Processing • Rich Connectivity to Applications, Data, and Content
A Process For Data Quality • Establish Data Quality Ownership & Sponsorship • Analyze Source Data • Measure & Baseline Data Quality • Standardize • Certify & Enrich • Match • Link or Survive • Re-Measure • Report
Understanding the Problem: Source System Analysis • Quality Controls for Completeness and Validity of data values • Incomplete or Invalid values set by value, range, or reference sources • Consistency checks for data formats
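The completeness and validity controls listed on this slide can be illustrated with a minimal Python sketch. Everything named here (`VALID_STATES`, `ZIP_FORMAT`, `column_report`, and the other helpers) is hypothetical and chosen for illustration; it is not the product's API, only one way such checks could be expressed.

```python
import re

# Hypothetical reference source: the state codes this example accepts.
VALID_STATES = {"FL", "GA", "OH", "MA", "NH"}
# Hypothetical format rule: a ZIP code is exactly five digits.
ZIP_FORMAT = re.compile(r"\d{5}")


def is_complete(value):
    """Completeness: the value is present and not blank."""
    return value is not None and str(value).strip() != ""


def valid_by_reference(value, reference):
    """Validity by reference source: the value appears in a known list."""
    return is_complete(value) and str(value).strip().upper() in reference


def valid_by_range(value, low, high):
    """Validity by range: the value is an integer within [low, high]."""
    if not is_complete(value):
        return False
    try:
        number = int(value)
    except ValueError:
        return False
    return low <= number <= high


def valid_by_format(value, pattern):
    """Consistency of format: the whole value matches the pattern."""
    return is_complete(value) and pattern.fullmatch(str(value).strip()) is not None


def column_report(records, column, validator):
    """Count rows, complete values, and valid values for one column."""
    values = [record.get(column) for record in records]
    return {
        "rows": len(values),
        "complete": sum(1 for v in values if is_complete(v)),
        "valid": sum(1 for v in values if validator(v)),
    }
```

Calling `column_report(records, "state", lambda v: valid_by_reference(v, VALID_STATES))` on a list of dictionaries gives the counts that a baseline measurement would start from.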
Measuring & Resolving: Designing Data Quality Rules • Data quality rules should be embedded into data flows • Investigate source data • Standardize information • Match records together • Survive the best data across sources into a new record
Investigation • Parsing: separating multi-valued fields into individual pieces: "123 St. Virginia St." becomes 123 | St. | Virginia | St. • Lexical analysis: determining business significance of individual pieces: 123 = Number, St. = Street Type, Virginia = Alpha, St. = Street Type • Context sensitive: identifying various data structures and content: 123 = House Number, St. Virginia = Street Name, St. = Street Type • “The instructions for handling the data are inherent within the data itself.”
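The three investigation steps on this slide (parsing, lexical analysis, context-sensitive interpretation) can be sketched in a few lines of Python. This is an illustrative sketch only: the `STREET_TYPES` lookup, the class names, and the single positional rule are assumptions, far simpler than a production rule set.

```python
# Hypothetical lookup of street-type abbreviations (upper-cased).
STREET_TYPES = {"ST", "ST.", "AVE", "AVE.", "RD", "RD.", "BLVD", "BLVD."}


def parse(address):
    """Parsing: split a multi-valued field into individual tokens."""
    return address.split()


def classify(token):
    """Lexical analysis: assign each token a class based on its content."""
    if token.isdigit():
        return "NUMBER"
    if token.upper() in STREET_TYPES:
        return "STREET_TYPE"
    if token.isalpha():
        return "ALPHA"
    return "MIXED"


def interpret(tokens):
    """Context-sensitive step: use token position to decide the meaning.

    A leading NUMBER is the house number. A STREET_TYPE token in last
    position is the street type. Anything left in between, including a
    leading "St.", is treated as part of the street name.
    """
    remaining = list(tokens)
    result = {"house_number": None, "street_name": None, "street_type": None}
    if remaining and classify(remaining[0]) == "NUMBER":
        result["house_number"] = remaining.pop(0)
    if len(remaining) > 1 and classify(remaining[-1]) == "STREET_TYPE":
        result["street_type"] = remaining.pop()
    if remaining:
        result["street_name"] = " ".join(remaining)
    return result


tokens = parse("123 St. Virginia St.")
pattern = [classify(t) for t in tokens]
# pattern == ["NUMBER", "STREET_TYPE", "ALPHA", "STREET_TYPE"]
fields = interpret(tokens)
# fields == {"house_number": "123", "street_name": "St. Virginia", "street_type": "St."}
```

The same token "St." receives two different meanings depending on where it sits in the pattern, which is the point of the context-sensitive step.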
Standardization - Address

Input File:
Address Line 1              Address Line 2
639 N MILLS AVENUE          ORLANDO, FLA 32803
306 W MAIN STR,             CUMMING, GA 30130
3142 WEST CENTRAL AV        TOLEDO OH 43606
843 HEARD AVE               AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234   AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536   EVANS GA 30809

Result File:
House #  Dir  Str. Name  Type  Unit  No.  NYSIIS   City     SOUNDEX  State  Zip    ACCT#
639      N    MILLS      AVE              MAL      ORLANDO  O645     FL     32803
306      W    MAIN       ST               MAN      CUMMING  C552     GA     30130
3142     W    CENTRAL    AVE              CANTRAL  TOLEDO   T430     OH     43606
843           HEARD      AVE              HAD      AUGUSTA  A223     GA     30904
1139          GREENE     ST               GRAN     AUGUSTA  A223     GA     30901  1234
4275          OWENS      RD    STE   536  ON       EVANS    E152     GA     30809

Results in strongly “typed”, fixed-fielded, standardized data
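The SOUNDEX column in the result file holds a phonetic key for the city name, so that spelling variants fall into the same group before matching. As an illustration, here is a minimal Python sketch of the standard American Soundex rules; the function name `soundex` is chosen for this example, and the slide does not say how the product computes the key internally.

```python
# Letter-to-digit table for American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex key for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    # Start from the first letter's code so an immediate repeat is skipped.
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            key += digit
        previous = digit  # vowels reset the previous code to ""
    return (key + "000")[:4]


for city in ("ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"):
    print(city, soundex(city))
# ORLANDO O645
# CUMMING C552
# TOLEDO T430
# AUGUSTA A223
# EVANS E152
```

The printed keys agree with the SOUNDEX values shown in the result file.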
Effective Matching Matching is the most beneficial and technically challenging part of data quality • Matching should be based on statistical probability • Match rules should take into account frequency, discriminating values, & reliability of fields when determining which fields to weight in a match • Matching against more fields of data produces higher quality matches • Matching logic is a very business-sensitive issue – business users should be involved in the design of matching rules • Matching is a science that requires careful calibration of match rules – design should be iterative, and should give results based on real data • Matching design should allow for baseline comparison to ensure rule changes are improving quality • The matching engine should provide clerical review capabilities • Setting up clerical review and match cutoffs should be intuitive
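One common way to base matching on statistical probability is the Fellegi-Sunter model of record linkage, sketched below in Python. The slide does not name the model the product uses, so treat this as an assumed illustration. The field names, the m-probabilities (reliability: how often a field agrees on true matches), the u-probabilities (chance agreement, driven by value frequency), and the cutoffs are all hypothetical.

```python
import math

# Hypothetical per-field probabilities:
#   m = P(field agrees | records truly match)        -> reliability
#   u = P(field agrees | records do not match)       -> chance agreement
FIELDS = {
    "last_name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "phone": (0.85, 0.001),
}


def agreement_weight(m, u):
    """Positive weight added when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Negative weight added when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def match_score(a, b, fields=FIELDS):
    """Sum the field weights for a pair of records."""
    score = 0.0
    for field, (m, u) in fields.items():
        left, right = a.get(field), b.get(field)
        if not left or not right:
            continue  # a missing value contributes no evidence either way
        if left == right:
            score += agreement_weight(m, u)
        else:
            score += disagreement_weight(m, u)
    return score


def decide(score, match_cutoff, clerical_cutoff):
    """Apply the two cutoffs: match, clerical review, or non-match."""
    if score >= match_cutoff:
        return "match"
    if score >= clerical_cutoff:
        return "clerical review"
    return "non-match"
```

A highly discriminating field such as `phone` (very low u) earns a large agreement weight, while a common value such as a shared ZIP code earns less. Tuning the two cutoffs against real data is the calibration step the slide describes.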
Designing Data Quality Rules • Holding area allows experimental match rules to be retained • Visual Histogram allows users to understand results • Pass Composer provides an intuitive overview of match passes • Decision Rules define match criteria • Cutoff Tuning allows match & clerical cutoffs to be visually fine-tuned • Data Viewer provides immediate feedback on match rule effects, using actual data
What Do You Do with Match Results? • Clerical review • Record linkage • Survivorship • Append/fix sources = Cross-reference
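Survivorship and cross-referencing can be illustrated with a small Python sketch. The rule used here (keep the longest non-empty value per field, breaking ties by the most recently updated record) is only one hypothetical survivorship rule, and the `id` and `updated` fields are assumed to exist on every record in the matched group.

```python
from datetime import date


def survive(records, fields):
    """Build one surviving record from a group of matched records.

    For each field, keep the longest non-empty value; ties go to the
    most recently updated record. The source ids are kept as a
    cross-reference back to the original systems.
    """
    survivor = {}
    for field in fields:
        candidates = [r for r in records if r.get(field)]
        if not candidates:
            survivor[field] = None
            continue
        best = max(candidates, key=lambda r: (len(r[field]), r["updated"]))
        survivor[field] = best[field]
    survivor["source_ids"] = [r["id"] for r in records]
    return survivor


# Values taken from the duplicate-record example; the dates are made up.
group = [
    {"id": "90328574", "name": "IBM", "zip": "01456", "updated": date(2005, 1, 1)},
    {"id": "90328575", "name": "I.B.M. Inc.", "zip": "", "updated": date(2006, 3, 1)},
]
best_record = survive(group, ["name", "zip"])
# best_record == {"name": "I.B.M. Inc.", "zip": "01456",
#                 "source_ids": ["90328574", "90328575"]}
```

In practice the choice of rule (longest, most frequent, most recent, most trusted source) is a business decision made per field.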
Deployment Models for Data Quality Rules • Data quality rules need to be applied universally: • In bulk movement and consolidation of data • Applied when data changes in source systems • Available as data quality services in a SOA • Embedded in federated queries • Callable directly from enterprise applications • Diagram labels: Request, Response, Logic Reuse, Query
Measuring Data Quality Over Time • Complete analysis of structure and content • View differences between current state and the baseline • Analysis can be run on a scheduled basis, or embedded in batch processes
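Viewing differences between the current state and the baseline amounts to comparing two sets of metrics. The Python sketch below assumes each measurement run produces a dictionary of metric name to percentage; the metric names and the `compare_to_baseline` helper are hypothetical.

```python
def compare_to_baseline(baseline, current):
    """Report the change in each metric between two measurement runs.

    Both arguments map a metric name to a percentage. Metrics present
    in only one run are reported as "added" or "removed".
    """
    report = {}
    for metric in sorted(set(baseline) | set(current)):
        before = baseline.get(metric)
        after = current.get(metric)
        if before is None:
            report[metric] = "added"
        elif after is None:
            report[metric] = "removed"
        else:
            delta = round(after - before, 2)
            if delta > 0:
                report[metric] = f"improved by {delta}"
            elif delta < 0:
                report[metric] = f"degraded by {abs(delta)}"
            else:
                report[metric] = "unchanged"
    return report


baseline = {"zip_valid_pct": 90.0, "state_complete_pct": 98.0}
current = {"zip_valid_pct": 95.0, "state_complete_pct": 97.5}
changes = compare_to_baseline(baseline, current)
# changes == {"state_complete_pct": "degraded by 0.5",
#             "zip_valid_pct": "improved by 5.0"}
```

Running such a comparison on a schedule, or at the end of a batch process, turns a one-off assessment into a measurement over time.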
Lessons Learned & Best Practice: Involve the Business Early • Recruit an executive sponsor • Signals that the initiative is important • Assures that funds continue to be available • Discourages other business units from implementing conflicting projects • Convene a data quality working group • Assess and report on quality early in the process • May coincide with implementation teams or data warehousing teams • Business leads, but IT coordinates and facilitates • Strive for consensus • Have the business appoint a data quality steward for each business unit • For business units with large user populations, several stewards are appropriate
Lessons Learned & Best Practice: Control Scope Ruthlessly / Focus on Benefits • Business must own scope • Business should be owners, not renters • IT maintains its independence by not taking sides • Controlling scope encourages project discipline • Iterate • Projects which try to do it all in one pass generally fail • Measure, report, and deliver benefits regularly • Initial projects must provide at least some benefit, even a small one, within 6-9 months • Subsequent phases should provide benefits every 3-6 months
Summary • Data quality is becoming an increasingly important organizational issue • Most critical business initiatives depend on quality information • Improving data quality requires a focused programmatic approach • The IBM Information Server provides all of this in a unified platform • At the core of any data quality program is a platform capable of providing auditable data quality services
How Can IBM Help? • Comprehensive platform for data quality • Experience and repeatable process for helping organizations set up data quality programs • Domain and industry-specific expertise in establishing repeatable data quality services • Data quality assessment offering to report on existing data quality and establish the business value of a data quality program • Contact your IBM representative for more information
IBM Information On Demand 2006 • October 15-20, 2006, Anaheim, California • Register now: www.ibm.com/events/informationondemand • The premier information management event for business and IT executives, managers, professionals, DBAs, and developers • Select from over 800 sessions: a 2 1/2 day business leadership track with 180 sessions and a 5 day technical track with 650 sessions • Latest strategy and product announcements • Large Expo Center, hands-on labs • One-on-ones with executives and specialists • Birds of a Feather roundtables • Why attend: • Participate in the premier discussion on the future of Information Management • Learn how the transformation to Information as a Service will help you unlock business value and drive competitive advantage • Hear how your peers are realizing ROI • Understand the roadmap to long-term strategic advantage • Learn best practices in your industry • Receive the best in technical education and free certification • Extensive opportunities for networking with both your peers and industry experts