IBM Information Server Cleanse - QualityStage
IBM Information Server: Delivering information you can trust
Support for Service-Oriented Architectures
• Understand: Discover, model, and govern information structure and content
• Cleanse: Standardize, merge, and correct information
• Transform: Combine and restructure information for new uses
• Deliver: Synchronize, virtualize and move information for in-line delivery
Platform Services: Parallel Processing, Administration, Deployment, Connectivity, Metadata
The IBM Solution: IBM Information Server. Delivering information you can trust
IBM Information Server: Unified Deployment
Understand | Cleanse | Transform | Deliver
• WebSphere QualityStage: Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views
Unified Metadata Management | Parallel Processing | Rich Connectivity to Applications, Data, and Content
Need for Data Quality
• Critical Problems
  • Need to create & maintain 360 degree views of customers, suppliers, products, locations, events
  • Need to leverage data: make reliable decisions, comply with regulations, meet service agreements
• Why?
  • No common standards across organization
  • Unexpected values stored in fields
  • Required information buried in free-form fields
  • Fields evolve: used for multiple purposes
  • No reliable keys for consolidated views
  • Operational data degrades 2% per month
• Alternative Approaches
  • Denial: problem misunderstood and ignored until too late; load and explode
  • Hand-coding: clerical exception processing; very time consuming and resource intensive
  • Simplistic cleansing apps: evolved from direct marketing & list hygiene, lack flexibility
[Figure: the same entities as they appear across data sources. Data values shown: Kentucky Fried Chicken; KFC; Kent Fried Chick; Kentucky Fried; Molly Talber DBA KFC; Mrs. M. Talber; John & Molly Talber; Talber, KFC, ATIMA; 227G CB&NAT STICK P QUE/MOZZ WRAPP.; 227G CB&NATURAL STICK MOZZ WRAPPER]
Why Should I Care About Cleansing Information?
• Lack of information standards
  • Different formats & structures across different systems
• Data surprises in individual fields
  • Data misplaced in the database
• Information buried in free-form fields
• Data myopia
  • Lack of consistent identifiers inhibits a single view
• The redundancy nightmare
  • Duplicate records with a lack of standards
Importance of Data Quality • Low data quality impacts an organization in several ways • Poor data quality leads to misguided marketing promotions • Cross sell opportunities may be missed because same customer appears several times in slightly different ways • Valued customers may not be recognized during support calls or other important touchpoints • Data mining is difficult because related items are not detected as related • What is good data quality? • Two percent of “bad” data doesn’t sound that bad? • Two percent of 10M rows means that you have 200K errors • 200K errors add up to big problem for analytics/operations/anything!
Enterprise initiatives…
• Supply chain collaboration & item synchronization
• Inventory consolidation
• Single view of a customer or supplier
• ERP Implementations
• ERP instance consolidation
• IT System renovation
• Consolidation resulting from M&A activity
• Enterprise Data Warehouse
• Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)
…need high quality data…
…to satisfy critical business requirements.
• Compliance
• Business to Business Standards
• Risk Management
• Reduce Costs & Increase Productivity
• Increase Revenue / CRM Payoff
• Business Intelligence Payoff
IBM WebSphere QualityStage
• Shared design environment with DataStage increases functionality and reduces development time
• Visual match rule interface simplifies match tuning
• Service orientation provides ‘continuous’ quality & delivers confidence in your data
• Parallel architecture shortens execution time
How will you get an accurate, consolidated view of your business?
Sources: Customers | Products / Materials | Transactions | Vendors / Suppliers
WebSphere QualityStage Process:
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Result: Target Database with Consolidated Views
Why Investigate • Discover trends and potential anomalies in the data • 100% visibility of single domain and free-form fields • Identify invalid and default values • Reveal undocumented business rules and common terminology • Verify the reliability of the data in the fields to be used as matching criteria • Gain complete understanding of data within context
Investigation - Free Form
Example input: 123 St. Virginia St.
• Parsing: Separating multi-valued fields into individual pieces
    123 | St. | Virginia | St.
• Lexical analysis: Determining business significance of individual pieces
    123 (number) | St. (street type) | Virginia (state) | St. (street type)
• Context Sensitive: Identifying various data structures and content
    123 (House Number) | St. Virginia (Street Name) | St. (Street Type)
“The instructions for handling the data are inherent within the data itself.”
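The three investigation steps on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how QualityStage rule sets are implemented: the lookup tables `STREET_TYPES` and `STATES`, the function names, and the single context rule are all hypothetical.

```python
# Hypothetical classification tables; a real rule set would be far larger.
STREET_TYPES = {"ST", "AVE", "RD", "CIR"}
STATES = {"VIRGINIA", "GEORGIA"}


def parse(text):
    """Parsing: split a free-form field into individual pieces."""
    return text.split()


def classify(token):
    """Lexical analysis: assign a class to a single piece, ignoring context."""
    key = token.rstrip(".").upper()
    if key.isdigit():
        return "number"
    if key in STREET_TYPES:
        return "street type"
    if key in STATES:
        return "state"
    return "unknown"


def interpret(tokens):
    """Context-sensitive step: use position to decide what each piece means.

    Assumed rule: a leading number is the house number, a trailing street
    type is the street type, and everything in between is the street name.
    """
    classes = [classify(t) for t in tokens]
    if len(tokens) >= 3 and classes[0] == "number" and classes[-1] == "street type":
        return {
            "house_number": tokens[0],
            "street_name": " ".join(tokens[1:-1]),
            "street_type": tokens[-1],
        }
    return None


tokens = parse("123 St. Virginia St.")
classes = [classify(t) for t in tokens]
address = interpret(tokens)
```

Here the first "St." and "Virginia" are classified as a street type and a state when looked at in isolation, but the context rule reassigns them to the street name, which is the point the slide makes.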
Rule Sets
• Pre-defined rules for parsing and standardizing:
  • Name
  • Address
  • Area (City, State and Zip)
• Multi-national address processing
• Validate structure:
  • Tax ID
  • US Phone
  • Date
  • Email
• Append ISO country codes
• Pre-process or filter name, address and area
• Rule sets are stored in the common repository
Standardization - Example

Input File (Address Line 1, Address Line 2):
639 N MILLS AVENUE ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130
3142 WEST CENTRAL AV TOLEDO OH 43606
843 HEARD AVE AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS GA 30809

Result File:
House #  Dir  Str. Name  Type  Unit  No.  NYSIIS   City     SOUNDEX  State  Zip    ACCT#
639      N    MILLS      AVE              MAL      ORLANDO  O645     FL     32803
306      W    MAIN       ST               MAN      CUMMING  C552     GA     30130
3142     W    CENTRAL    AVE              CANTRAL  TOLEDO   T430     OH     43606
843           HEARD      AVE              HAD      AUGUSTA  A223     GA     30904
1139          GREENE     ST               GRAN     AUGUSTA  A223     GA     30901  1234
4275          OWENS      RD    STE   536  ON       EVANS    E152     GA     30809
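The SOUNDEX column in the result file is a phonetic key for the city name. As a rough illustration, the standard American Soundex algorithm can be written as below; whether QualityStage uses exactly this variant is an assumption, although it does reproduce the codes shown on the slide. The function name `soundex` is ours.

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    groups = (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    )
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = [letters[0]]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = codes.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result.append(code)
        prev = code

    # Pad with zeros and truncate to four characters.
    return ("".join(result) + "000")[:4]


cities = ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]
keys = {city: soundex(city) for city in cities}
```

Because similar-sounding spellings collapse to the same key, a key like this is useful for grouping candidate records before matching.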
Why Match • Identify duplicate entities within one or more files • Perform householding • Create consolidated view of customer • Establish cross-reference linkage • Enrich existing data with new attributes from external sources
Two Methods to Decide a Match
Are these two records a match?
WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
• Example: B B A A B D B A = BBAABDBA
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
• Example: +5 +2 +20 +3 +4 -1 +7 +9 = +49
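The probabilistic approach on this slide can be sketched as follows. The per-field weights are the ones shown on the slide; the log-ratio weight formulas follow the classic Fellegi-Sunter record linkage model, which the slide does not name, so treat them as an assumption about how such weights are typically derived. The cutoff values and all function names are hypothetical.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added when a field agrees.

    m: probability the field agrees for true matches
    u: probability the field agrees by chance for non-matches
    """
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added (usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def total_score(weights):
    """Sum the per-field weights into one composite score."""
    return sum(weights)


def decide(score, match_cutoff, clerical_cutoff):
    """Classify a record pair using two hypothetical cutoffs."""
    if score >= match_cutoff:
        return "match"
    if score >= clerical_cutoff:
        return "suspect"
    return "non-match"


# Per-field weights from the slide's example.
slide_weights = [5, 2, 20, 3, 4, -1, 7, 9]
score = total_score(slide_weights)

# The cutoffs 40 and 20 are illustrative only, not values from the source.
outcome = decide(score, match_cutoff=40, clerical_cutoff=20)
```

Rare values that agree contribute a large positive weight, while a disagreement on a reliable field pulls the score down, which is why one field on the slide contributes +20 and another -1.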
Why Survive • Provide consolidated view of data • Provide consolidated view containing the “best-of-breed” data • Resolve conflicting values and fill missing values • Cross-populate best available data • Implement business and mapping rules • Create cross-reference keys
Survivorship - Example

Survivorship Input (Match Output):
Group  Legacy  First   Middle  Last     No.   Dir.  Str. Name   Type  Unit  No.
1      D150    Bob             Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert          Dickson  1500        ROSS CLARK  CIR
23     D689    Ernest  A       Obrian   5901  SW    74TH        ST    STE   202
23     A436    Ernie   Alex    O’Brian  5901  SW    74TH        ST
23     D352    Ernie           Obrian   5901        74          ST    #     202

Consolidated Output:
Group  First   Middle  Last     No.   Dir.  Str. Name   Type  Unit  No.
1      Robert          Dickson  1500  SE    ROSS CLARK  CIR
23     Ernie   Alex    O’Brian  5901  SW    74TH        ST    STE   202

Cross-reference:
Group  Legacy
1      D150
1      A1367
23     D689
23     A436
23     D352
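One way to read the survivorship example above is as a set of per-field rules applied to each matched group. The sketch below uses two rules that happen to reproduce the slide's consolidated output: "most frequent value, ties broken by length" for the first name and "longest value" for every other field. These rules are inferred from the example, not stated in the source, and the field and function names are hypothetical.

```python
from collections import Counter, defaultdict

FIELDS = ["first", "middle", "last", "no", "dir", "name", "type", "unit", "unit_no"]


def longest(values):
    """Survive the longest non-empty value."""
    vals = [v for v in values if v]
    return max(vals, key=len) if vals else ""


def most_frequent(values):
    """Survive the most frequent non-empty value; break ties by length."""
    vals = [v for v in values if v]
    if not vals:
        return ""
    counts = Counter(vals)
    return max(counts, key=lambda v: (counts[v], len(v)))


# Assumed rule assignment: fields not listed here fall back to `longest`.
RULES = {"first": most_frequent}


def survive(records):
    """Build one best-of-breed record per match group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["group"]].append(rec)
    return {
        group: {f: RULES.get(f, longest)([m[f] for m in members]) for f in FIELDS}
        for group, members in groups.items()
    }


def row(group, legacy, *values):
    """Helper to build an input record from positional field values."""
    return {"group": group, "legacy": legacy, **dict(zip(FIELDS, values))}


records = [
    row(1, "D150", "Bob", "", "Dixon", "1500", "SE", "ROSS CLARK", "CIR", "", ""),
    row(1, "A1367", "Robert", "", "Dickson", "1500", "", "ROSS CLARK", "CIR", "", ""),
    row(23, "D689", "Ernest", "A", "Obrian", "5901", "SW", "74TH", "ST", "STE", "202"),
    row(23, "A436", "Ernie", "Alex", "O'Brian", "5901", "SW", "74TH", "ST", "", ""),
    row(23, "D352", "Ernie", "", "Obrian", "5901", "", "74", "ST", "#", "202"),
]

best = survive(records)

# Cross-reference keys: which legacy IDs fed each consolidated record.
xref = [(rec["group"], rec["legacy"]) for rec in records]
```

Keeping the rule per field in a dictionary makes it easy to swap in other policies, such as "most recent" or "from the most trusted source", without touching the grouping logic.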
How Does WebSphere QualityStage Integrate? Seamlessly!
• Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.
• Data Extraction and Load Routines
• QualityStage
  • Investigation
  • Standardization
  • Integration
  • Survivorship
• Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.
WebSphere DataStage and WebSphere QualityStage: Fully Integrated! Seamless!
QualityStage: Data Quality Extensions
• IBM WebSphere QualityStage GeoLocator
• IBM WebSphere QualityStage Postal Verification Products
  • WAVES (WorldWide): IBM WebSphere Worldwide Address Verification Solution
• IBM WebSphere QualityStage Postal Certification Products
  • CASS (United States)
  • SERP (Canada)
  • DPID (Australia)
• IBM Information Server Data Quality Module for SAP
• IBM WebSphere QualityStage for Siebel
Key Strengths for IBM QualityStage • Intuitive, “Design as you think” User Interface • Simple rule design & fine tuning • Seamless Data Flow integration • Intuitive rule design & fine tuning • Defining the technology standard with SOA • Industry leading probabilistic matching engine