Agenda 02/21/2013 • Discuss exercise • Answer questions in task #1 • Put up your sample databases for tasks #2 and #3 • Define ETL in more depth by the activities performed • Discuss the “controversy” in ETL activities
Discussed in prior classes... • Lots of data. • Traditional transaction processing systems • Non-traditional transaction processing • Call center; Click-stream; Loyalty card; Warranty cards/product registration information • External data from government and commercial entities. • Lots of poor quality data for lots of reasons that can be traced back to lots of people.
Master Data Management: What does it mean and why is it so difficult to manage master data?
Populating the data warehouse • Extract • Take data from source systems. • May require middleware to gather all necessary data. • Transformation • Put data into a consistent format and content. • Validate data – check for accuracy and consistency using pre-defined and agreed-upon business rules. • Convert data as necessary. • Load • Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how. • Keep a detailed load log to audit updates to the data warehouse.
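A minimal sketch of this extract-transform-load flow, written in Python against a hypothetical customers.csv export and a SQLite target (file, table, and column names are illustrative, not part of the lecture material):

```python
# Minimal extract-transform-load sketch (illustrative only).
# Assumes a hypothetical source export "customers.csv" with columns id, name, state, amount.
import csv
import sqlite3
from datetime import datetime, timezone

def extract(path):
    # Extract: read raw rows from the source system export.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: validate and convert rows into a consistent format.
    clean = []
    for row in rows:
        state = row["state"].strip().upper()
        try:
            amount = round(float(row["amount"]), 2)
        except ValueError:
            continue  # reject rows violating the (assumed) business rule "amount is numeric"
        clean.append((int(row["id"]), row["name"].strip(), state, amount))
    return clean

def load(rows, db_path="warehouse.db"):
    # Load: bulk insert, recording what was loaded, where, and when in a load log.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customer_dim (id INTEGER, name TEXT, state TEXT, amount REAL)")
    con.execute("CREATE TABLE IF NOT EXISTS load_log (loaded_at TEXT, row_count INTEGER, source TEXT)")
    con.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?)", rows)
    con.execute("INSERT INTO load_log VALUES (?, ?, ?)",
                (datetime.now(timezone.utc).isoformat(), len(rows), "customers.csv"))
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("customers.csv")))
```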
Data Cleansing • Source systems contain “dirty data” that must be cleansed • ETL software offers rudimentary to very sophisticated data cleansing capabilities • Industry-specific data cleansing software is often used and is especially important for name and address correction • Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP, and Microsoft, as well as specialty vendors such as Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (Centric)
Steps in data cleansing • Parsing • Correcting • Standardizing • Matching • Consolidating
Parsing • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. • Examples include parsing the first, middle, and last name; street number and street name; and city and state.
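A small illustrative sketch of parsing, assuming free-form name and street strings; the field names and regular expression are hypothetical:

```python
# Illustrative parsing of free-form name and street strings into individual data elements.
import re

def parse_name(full_name):
    # Split "First Middle Last" into separate elements; the middle name is optional.
    parts = full_name.strip().split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1]) or None
    return {"first": first, "middle": middle, "last": last}

def parse_street(street_line):
    # Separate the street number from the street name, e.g. "123 Main St".
    match = re.match(r"\s*(\d+)\s+(.*)", street_line)
    if match:
        return {"street_number": match.group(1), "street_name": match.group(2)}
    return {"street_number": None, "street_name": street_line.strip()}

print(parse_name("Karen Ann Lewis"))   # {'first': 'Karen', 'middle': 'Ann', 'last': 'Lewis'}
print(parse_street("123 Main St"))     # {'street_number': '123', 'street_name': 'Main St'}
```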
Correcting • Correcting fixes the individual parsed data elements using sophisticated algorithms and secondary data sources.
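An illustrative sketch of correcting, assuming a tiny hypothetical ZIP-code reference table standing in for a real secondary data source:

```python
# Illustrative correction of a parsed city/state using a secondary reference table (hypothetical data).
ZIP_REFERENCE = {"53201": ("MILWAUKEE", "WI"), "60601": ("CHICAGO", "IL")}

def correct_city(record):
    # If the ZIP code is known, trust the reference source over the keyed-in city and state.
    ref = ZIP_REFERENCE.get(record.get("zip"))
    if ref:
        record["city"], record["state"] = ref
    return record

print(correct_city({"city": "Milwaukie", "state": "WI", "zip": "53201"}))
```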
Standardizing • Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
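An illustrative sketch of standardizing with a few conversion routines; the lookup tables are assumed, abbreviated stand-ins for real reference data:

```python
# Illustrative conversion routines that transform values into one preferred, consistent format.
STATE_NAMES = {"WISCONSIN": "WI", "WIS": "WI", "ILLINOIS": "IL"}   # assumed lookup, not exhaustive
STREET_SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}  # assumed lookup, not exhaustive

def standardize(record):
    # State names become two-letter codes, street suffixes become standard abbreviations,
    # and phone numbers are reduced to digits only.
    record["state"] = STATE_NAMES.get(record["state"].upper(), record["state"].upper())
    words = [STREET_SUFFIXES.get(w.upper(), w.upper()) for w in record["street_name"].split()]
    record["street_name"] = " ".join(words)
    record["phone"] = "".join(ch for ch in record["phone"] if ch.isdigit())
    return record

print(standardize({"state": "Wisconsin", "street_name": "Main Street", "phone": "(414) 555-0100"}))
```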
Matching • Matching searches for and matches records within and across the parsed, corrected, and standardized data based on predefined business rules to eliminate duplicates.
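An illustrative sketch of matching using a simple string-similarity score from the Python standard library; the 0.85 threshold and the "same ZIP plus similar name" rule are assumed business rules, not ones given in the lecture:

```python
# Illustrative duplicate detection using a simple similarity score (difflib is in the standard library).
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_probable_match(rec1, rec2, threshold=0.85):
    # Assumed business rule: same ZIP code and a highly similar name means the same client.
    return rec1["zip"] == rec2["zip"] and similar(rec1["name"], rec2["name"]) >= threshold

a = {"name": "Karen Parker-Lewis", "zip": "53201"}
b = {"name": "Karen Parker Lewis", "zip": "53201"}
print(is_probable_match(a, b))   # True
```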
Consolidating • Consolidating analyzes and identifies relationships between matched records and consolidates/merges them into ONE representation.
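An illustrative sketch of consolidating matched records into one representation, using an assumed "most recently updated non-empty value wins" survivorship rule:

```python
# Illustrative consolidation: merge a group of matched records into one surviving record.
def consolidate(matched_records):
    # Assumed survivorship rule: keep the most recently updated non-empty value for each field.
    ordered = sorted(matched_records, key=lambda r: r["updated"], reverse=True)
    merged = {}
    for rec in ordered:
        for field, value in rec.items():
            if field != "updated" and value and field not in merged:
                merged[field] = value
    return merged

records = [
    {"name": "Wm Parker-Lewis", "phone": "", "email": "wpl@example.com", "updated": "2012-07-01"},
    {"name": "William Parker-Lewis", "phone": "414-555-0100", "email": "", "updated": "2013-01-15"},
]
print(consolidate(records))   # one merged record drawing the best value for each field
```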
Source system view – 3 clients • Account # 1238891, Policy No. ME309451-2, and Transaction B498/97 appear as three separate clients in the source systems.
The reality – ONE client • The account, the policy, and the transaction all belong to the same client.
Consolidating whole groups • William Parker, Beth Lewis, Karen Parker-Lewis, and William Parker-Lewis, Jr. are consolidated into a single household.
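An illustrative sketch of the group ("householding") consolidation pictured above, grouping people who share an address; the address and record layout are assumed:

```python
# Illustrative householding: group individuals who share an address into one household (assumed data).
from collections import defaultdict

people = [
    {"name": "William Parker",            "address": "123 Main St"},
    {"name": "Beth Lewis",                "address": "123 Main St"},
    {"name": "Karen Parker-Lewis",        "address": "123 Main St"},
    {"name": "William Parker-Lewis, Jr.", "address": "123 Main St"},
]

households = defaultdict(list)
for person in people:
    households[person["address"]].append(person["name"])

for address, members in households.items():
    print(address, "->", members)   # one household containing all four family members
```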
ETL Products • SQL Server 2012 Integration Services from Microsoft • PowerMart/PowerCenter from Informatica • Warehouse Builder from Oracle • Teradata Warehouse Builder from Teradata • DataMigrator from Information Builders • SAS System from SAS Institute • Connectivity Solutions from OpenText • Ab Initio
What about unstructured data? • What is unstructured data? • What percentage of data in organizations is considered to be “unstructured”? • Examples • Why store it in a data warehouse? • Does it do any good in large text fields? • Special ETL for unstructured data
Unstructured Data Example • Notes about post-service of a product: • The hub bent when the bicycle hit a large pothole. • The plane takes off sluggishly during high-altitude departures. • The product won’t allow entry of a 1098-T when the person is declared as a dependent. • “Text analytics” is used to transform the data.
Text analytics • Parses text and extracts facts (complaints, problems, issues) about key entities (customers, products, locations). • Uses natural language processing (NLP). • NLP converts human language into more formal representations that are easier for a computer program to manipulate. • A combination of computational linguistics and artificial intelligence.
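A toy sketch of fact extraction from the service notes above, using hand-written keyword lists; real text analytics products apply full NLP rather than rules like these:

```python
# Toy fact extraction from free-text service notes using keyword rules (illustrative only;
# production text analytics would use NLP rather than hand-written keyword lists).
PRODUCTS = ["hub", "bicycle", "plane", "1098-T"]
PROBLEM_WORDS = ["bent", "sluggishly", "won't allow", "failed"]

def extract_facts(note):
    # Pull out mentioned entities (products) and issue keywords from one note.
    note_lower = note.lower()
    entities = [p for p in PRODUCTS if p.lower() in note_lower]
    issues = [w for w in PROBLEM_WORDS if w.lower() in note_lower]
    return {"entities": entities, "issues": issues}

print(extract_facts("The hub bent when the bicycle hit a large pothole."))
```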
Goal of ETL • Structured and unstructured data stored in a relational database. • Data is complete, accurate, consistent, and in conformance with the business rules of the organization.
Controversy in ETL • Is it necessary? • Has the advent of big data changed our need for ETL? • ETL vs. ELT • Does the use of Hadoop eliminate the need for ETL software?