960 likes | 1.23k Views
Using Metadata to Drive Data Quality Hunting the Data Dust Bunnies. John Murphy Apex Solutions, Inc. NoCOUG 11-13-2003. Presentation Outline. The Cost - It’s always funny when it’s someone else… Quality - Quality Principles and The Knowledge Worker Data Quality Data Development
E N D
Using Metadata to Drive Data Quality Hunting the Data Dust Bunnies John Murphy Apex Solutions, Inc. NoCOUG 11-13-2003
Presentation Outline • The Cost - It’s always funny when it’s someone else… • Quality - Quality Principles and The Knowledge Worker • Data Quality • Data Development • Metadata Management • Profile and Baseline Statistics • Vendor Tools • Wrap-up • Some light reading
The Cost 1. The Cost…
The Cost… “Quality is free. What’s expensive is finding out how to do it it right the first time.” Philip Crosby • A major credit service was ordered to pay $25M for release of hundreds of customers names of one bank to another bank because a confidentiality indicator was not set. • Stock value of major health care insurer dropped 40% because analysts reported the insurer was unable to determine which “members” were active paying policy holders. Stock value dropped $3.7 Billion in 72 hours. • The sale/merger of a major cable provider was delayed 9 months while the target company determined how many households it had under contract. Three separate processes calculated three values with a 80% discrepancy.
The Cost… • The US Attorney General determined that 14% of all health care dollars or approximately $23 billion were the result of fraud or inaccurate billing. • DMA estimates that greater than 20% of customer information housed by it’s members database is inaccurate or unusable. • A major Telco has over 350 CUSTERMER tables with CUSTOMER data repeated as many as 40 times with 22 separate systems capable of generating or modifying customer data. • A major communications company could not invoice 2.6% of it’s customers because the addresses provided were non-deliverable. Total Cost > $85M annually. • A State Government annually sends out 300,000 motor vehicle registration notices with up to 20% undeliverable addresses. • A B2B office supply company calculated that it saves an average of 70 cents per line item through web sales based on data entry validation at the source of order entry.
The Cost TDWI - 2002
The Regulatory Challenges • No more “Corporate” data • New Privacy Regulations • Direct Marketing Access • Telemarketing Access • Opt – In / Opt - Out • New Customer Managed Data Regulations • HIPAA • New Security Regulations • Insurance Black List Validation • Bank Transfer Validation • Business Management • Certification of Financial Statements • Sarbanes – Oxley These have teeth…
Sources of Data Non-Quality TDWI - 2002
Data Development and Data Quality 2. Data Quality
Data Quality Process • There is a formal process to quantitatively evaluate the quality of corporate data assets. • The process outlined here is based on Larry English’s Total data Quality Management (TdQM) • Audit the current data resources • Assess the Quality • Measure Non-quality Costs • Reengineer and Cleanse data assets • Update Data Quality Process
Determination of Data Quality TDWI -2002
Data Quality Process • Develop a Data Quality Process to Quantitatively evaluate the Quality of Corporate Data Assets. • Establish the Metadata Repository • Implement Data Development and Standardization Process • Profile and Baseline your Data • Use the Metadata to Improve your Data Quality • Revise The Data Quality Process
Determination of Data Quality TDWI - 2002
Customer Satisfaction Profile • Determine who are the most consistent users of specific Data entities. Select a sample set of attributes to be reviewed. • Publish the metadata report for the selected attributes • Select representatives in from the various business areas and knowledge workers to review the selected attributes and metadata. • Distribute the questionnaires, retrieve and score the results. • Report on and distribute the results.
Quality Assessment Results Problem Metadata Acceptability Threshold
Quality Assessment Results Acceptability Threshold Improving…
Quality Assessment Results Acceptability Threshold Got it Right!
Data Development and Standardization 4. Building Data
Data Standardization Process Data Development and Approval Process Integrated Data Model (Data Elements) MDR Data Requirements Resolved Issues Issues Proposal Package Technical Review Functional Review Issue Resolution Data Architect Data Administrator Data Architect Stewards Architect
Data Standardization Process • Proposal Package – Data Model, Descriptive Information, Organization Information, Integration Information, Tool Specific Information • Technical Review – Model Compliance, Metadata Complete and accurate • Functional Review – Integration with Enterprise Model • Issue Resolution – Maintenance and Management • Total Process < 30 days • All based on an integrated web accessible application. Results integrated to the Enterprise Metadata Repository.
Data Standardization • Getting to a single view of the truth • Getting to a corporate owned process of data and information management. Addition of new business needs Describe Existing Data Assets
Data Development Process • There is a formal process for development, certification, modification and retirement of data. • Data Requirements lead directly to Physical Data Structures. • Data Products lead directly to Information Products. • The Data Standards Evaluation Guide • Enterprise level not subject area specific • I can use “customer” throughout the organization • I can use “Quarterly Earnings” throughout the organization • All the data objects have a common minimal set of attributes dependent upon their type. • All data elements have a name, business name, data type, length, size or precision, collection of domain values etc. • There are clear unambiguous examples of the data use • The data standards are followed by all development and management teams. • The same data standards are used to evaluate internally derived data as well as vendor acquired data.
Data Standardization Process • Standardization is the basis of modeling – Why Model? • Find out what you are doing so you can do it better • Discover data • Identify sharing partners for processes and data • Build framework for database that supports business • Establish data stewards • Identify and eliminate redundant processes and data • Check out ICAM / IDEF…
Statutes, Regulations & Policies Funding Acquisition Guidance Requirement Package Solicitation Announcement Industry Resource Data Purchase Conduct Procurement Communication from Contractor Notification to Vendor Proposed Programs & Procurement Issues Purchase Performance Analysis • A0 Company Support Team Purchase Officer An example Process Model
The Data Model… • Data Model- A description of the organization of data in a manner that reflects the information structure of an enterprise • Logical Data Model - User perspective of enterprise information. Independent of target database or database management system • Entity – Person, Place, Thing or Concept • Attribute – Detail descriptive information associated with an Entity • Relation – The applied business rule to one or more entities • Element - A named identifier of each of the entities and their attributes that are to be represented in a database.
Rules to Entities and Attributes • There is more than one state. • Each state may contain multiple cities. • Each city is always associated with a state. • Each city has a population. • Each city may maintain multiple roads. • Each road has a repair status. • Each state has a motto. • Each state adopts a state bird • Each state bird has a color. STATE STATE Code CITY Name CITY POPULATION Quantity CITY ROAD Name CITY ROAD REPAIR Status STATE MOTTO Text STATE BIRD Name STATE BIRD COLOR Name
CITY ROAD City Road Repair Status City Name State Code City Road Name VA Alexandria Route 1 2 VA Alexandria Franconia 1 MD Annapolis Franklin 1 MD Baltimore Broadway 3 AZ Tucson Houghton 2 AZ Tucson Broadway 2 IL Springfield Main 3 MA Springfield Concord 1 Resulting Populated Tables (3NF) STATE CITY STATE State Motto State Bird Name State Code City Name City Pop. State Code Cardinal Oriole Cactus Wren Cardinal Chickadee VA Alexandria 200K MD Annapolis 50K MD Baltimore 1500K AZ Tucson 200K IL Springfield 40K MA Springfield 45K VA “ “ MD “ “ AZ “ “ IL “ “ MA “ “ STATE BIRD State Bird State Bird Color Cardinal Red Oriole Black Cactus Wren Brown Chickadee Brown
STATE Code (FK) CITY Name CITY POPULATION Quantity Example Entity, Attributes, Relationships Becomes Road Kill on/ Kills State Model STATE STATE CITY STATE Code STATE MOTTO Text STATE BIRD Name (FK) Contains Maintains Adopted by/ Adopts CITY ROAD STATE BIRD STATE Code (FK) CITY Name (FK) CITY ROAD Name CITY ROAD REPAIR Status STATE BIRD Name STATE BIRD COLOR Name
Data Standardization • Data Element Standardization -The process of documenting, reviewing, and approving unique names, definitions, characteristics, and representations of data elements according to established procedures and conventions. Prime Word Property Modifier(s) Generic Element Required 1 0 - n Standard Data Element Structure Class word Modifier(s) Class Word 0 - n
The Generic Element The Generic Element - The part of a data element that establishes a structure and limits the allowable set of values of a data element. Generic elements classify the domains of data elements. Generic elements may have specific or general domains. Examples – Code, Amount, Weight, Identifier Domains – The range of values associated with an element. General Domains can be infinite ranges as with an ID number or Fixed as with a State Code.
Standardized Data Element EXAMPLE Element Name: Access Name: Definition Text: Person Eye Color Code PR-EY-CLR-CD The code that represents the natural pigmentation of a person’s iris Authority Reference Text: Steward Name: U.S. Code title 10, chapter 55 USD (P&R) Domain values: BK . . . . . . . . . . . . . . Black BL . . . . . . . . . . . . . . Blue BR . . . . . . . . . . . . . . Brown GR . . . . . . . . . . . . . . Green GY . . . . . . . . . . . . . . Gray HZ . . . . . . . . . . . . . . Hazel VI . . . . . . . . . . . . . . Violet
Standards • Name Standards • Comply with format • Single concept, clear, accurate and self explanatory • According to functional requirements not physical considerations • Upper and lower case alphabetic characters, hyphens (-) and spaces ( ) • No abbreviations or acronyms, conjunctions, plurals, articles, verbs or class words used as modifiers or prime words • Definition Standards • What the data is, not HOW, WHERE or WHEN used or WHO uses • Add meaning to name • One interpretation, no multiple purpose phrases, unfamiliar technical program, abbreviations or acronyms
Integration of the Data Through Metadata Data Integration by Subject area Subject Area 2 Subject Area 1 Subject Area 3
Data Model Integration • Brings together (joins) two or more approved Data Model views • Adds to the scope and usability of the Corporate Data Model (EDM) • Continues to support the activities of the department that the individual models were intended to support • Enables the sharing of information between the functional areas or components which the Data Models support
Enterprise Data Model Use of Enterprise Data Model Component Views SECURITY- LEVEL ORGANIZATION- SECURITY-LEVEL ORGANIZATION Component Models PERSON-SECURITY- LEVEL Functional Models Standard Metadata & Schemas System Models Functional Views
Metadata Management 5. Metadata Management
Data in Context! Mr. End User Sets Context For His Data.
Metadata Metadata is the data about data… Huh? Metadata is the descriptive information used to set the context and limits around a specific piece of data. • The metadata lets data become discreet and understandable by all communities that come in contact with a data element. • Metadata is the intersection of certain facts about data that lets the data become unique. • It makes data unique, understood and unambiguous. • The accumulation of Metadata creates a piece of data. The more characteristics about the data you have the more unique and discreet the data can be.
Relevant Metadata • Technical - Information on the physical warehouse and data. • Operational / Business - Rules on the data and content • Administrative - Security, Group identification etc. • The meta model is the standard content defining the attributes of any given data element in any one of these models. The content should address the needs of each community who comes in contact with the data element. The meta model components make the data element unique to each community and sub community.
Acquiring the Metadata • Data Modeling Tools – API and Extract to Repository • Reverse Engineered RDBMS – Export Extract • ETL Tools – Data mapping, Source to Target Mapping • Scheduling Tools – Refresh Rates and Schedules • Business Intelligence Tools – Retrieval Use • Current Data Dictionary
Technical Metadata • Physical Descriptive Qualities • Standardized Name • Mnemonic • Data type • Length • Precision • Data definition • Unit of Measure • Associated Domain Values • Transformation Rules • Derivation Rule • Primary and Alternate Source • Entity Association • Security And Stability Control
Administrative and Operational Metadata • Relates the Business perspective to the end user and Manages Content • Retention period • Update frequency • Primary and Optional Sources • Steward for Element • Associated Process Model • Modification History • Associated Requirement Document • Business relations • Aggregation Rules • Subject area oriented to insure understanding by end user
Individual Individual View Relationship Subject Area Entity Attribute Entity Alias Encoding / Lookup Tables Source System Attribute Alias Attribute Default The Simple Metamodel
The Common Meta Model Based on Tannenbaum
Required Data Element Technical Metadata • Name • Mnemonic • Definition • Data value source list text • Decimal place count quantity • Authority reference text • Domain definition text • Domain value identifiers • Domain value definition text • High and low range identifiers • Maximum character count quantity • Proposed attribute functional data steward • Functional area identification code • Unit measure name • Data type name • Security classification code • Creation Date
Use of The Enterprise Tools Enterprise Data Repository (EDR) Enterprise Data Model (EDM) Migration/New Information systems Enterprise Data Dictionary System (EDDS) Database Tables Database Columns Database Rows Prime Words Data Elements Metadata Entities Attributes Relationships (business rules) Database Dictionary Associations and Table Joins
Profile the Data 6. Profile and Baseline Data