250 likes | 387 Views
Agenda 03/27/2014. Review first test. Discuss internal data project. Review characteristics of data quality. Types of data. Data quality. Data governance. Define ETL activities. Discuss database analyst/programmer responsibilities for data evaluation. Answers to Multiple Choice Questions.
E N D
Agenda 03/27/2014 • Review first test. • Discuss internal data project. • Review characteristics of data quality. • Types of data. • Data quality. • Data governance. • Define ETL activities. • Discuss database analyst/programmer responsibilities for data evaluation.
Discussed in prior classes... • Lots of data. • Traditional transaction processing systems • Non-traditional data • Call center; Click-stream; Loyalty card; Warranty cards/product registration information, email, twitter, Facebook • External data from government and commercial entities • General classification of data • Transaction data • Referential data/master data • Metadata
Data quality • What is good quality data? • Correct • Accurate • Consistent • Complete • Available • Accessible • Timely • Relevant
How does data “go bad”? Does all “bad” data have to be fixed?
Data governance • Policies, processes and procedures aimed at managing the data in an organization. • Usually high-level cross-department committees that oversee data management across the organization. • Responsible for defining what data is necessary to gather. • Responsible for defining the source and store of data. • Responsible for security policies, processes, procedures. • Responsible for creating the policies, processes and procedures. • Responsible for assigning blame. • Responsible for enforcing policies.
Data quality in data warehouses • Is it more important than data quality in source transaction and reference data? • How is better quality data achieved? • Automated ETL processes to populate the data warehouse • Spot checking programmatically
Populating the data warehouse • Extract • Take data from source systems. • May require middleware to gather all necessary data. • Transformation • Put data into consistent format and content. • Validate data – check for accuracy, consistency using pre-defined and agreed-upon business rules. • Convert data as necessary. • Load • Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how. • Keep a detailed load log to audit updates to the data warehouse.
Data Cleansing • Source systems contain “dirty data” that must be cleansed • ETL software contains rudimentary to very sophisticated data cleansing capabilities • Industry-specific data cleansing software is often used. Important for performing name and address correction • Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP, Microsoft and specialty vendors Informatica, Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (SAP-AG)
Steps in data cleansing • Parsing • Correcting • Standardizing • Matching • Consolidating
Parsing • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. • Examples include parsing the first, middle, and last name; street number and street name; and city and state.
Correcting • Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
Standardizing • Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Matching • Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
Consolidating • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
Source system view – 3 clients Account# 1238891 Policy No.ME309451-2 Transaction B498/97
The reality – ONE client Account# 1238891 Policy No.ME309451-2 Transaction B498/97
William Lewis Beth Parker Karen Parker-Lewis William Parker-Lewis, Jr. Consolidating whole groups
ETL Products • SQL Server 2012 Integration Services from Microsoft • Power Mart/Power Center/Power Exchange from Informatica • Warehouse Builder from Oracle • Teradata Warehouse Builder from Teradata • DataMigrator from Information Builders • SAS System from SAS Institute • Connectivity Solutions from OpenText • Ab Initio
ETL Goal: Data is complete, accurate, consistent, and in conformance with the business rules of the organization. Questions: • Is ETL really necessary? • Has the advent of big data changed our need for ETL? • ETL vs. ELT • Does the use of Hadoop eliminate the need for ETL software??? • Does it matter if the data is stored in the “cloud”?