
Extract, Transform and Load



    1. 1 Extract, Transform and Load Lecture 2

    2. 2 Objectives
    - Describe the key design considerations for transformation and load
    - Describe a recommended process for designing, coding, and testing the ETL process

    4. 4 Data Engineering Considerations for Design
    - Standardization
    - Cleaning
    - Denormalization
    - Data mapping and type conversion
    - Derivation
    - Aggregation(s)
    - Integration – assigning surrogate keys
    - Managing Slowly Changing Dimensions
    - Audit
    - Loading

    5. 5 Standardization Often, data extracted from the source system needs to be cleaned and standardized. The best way to understand this is through some examples. Standardization: consider the following names: Bob Jones, Rob Jones, Bobby Jones, Robert Jones. These can all be standardized as Robert Jones.

    6. 6 Standardization Consider the following addresses: These should be standardized as: 123 Main, Antigonish, Nova Scotia, B2G 2A1

    7. 7 Standardization Standardization is a key step in the process of integrating data. Using the previous example, assume we get a customer record from 2 different sources.

    8. 8 Standardization In the previous example, it is probably safe to assume that the individual referenced in both records is in fact the same person. As such, we would only want one record for this individual in our data warehouse. Assuming we use different primary keys in each system (a pretty safe assumption), we would need to match on name and address to determine that this is the same person. Standardization of name and address gives us the ability to match the individual records from multiple systems and integrate them into a single customer record.
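    The lecture does not include code; as a minimal sketch, name standardization might use a nickname lookup like the one below (the NICKNAMES table and standardize_name function are illustrative assumptions, not part of the original material):

        # Minimal name-standardization sketch (illustrative; the nickname table is an assumption).
        NICKNAMES = {
            "bob": "robert",
            "bobby": "robert",
            "rob": "robert",
        }

        def standardize_name(full_name: str) -> str:
            """Map common nickname variants to a canonical given name."""
            parts = full_name.strip().split()
            first, rest = parts[0].lower(), parts[1:]
            canonical_first = NICKNAMES.get(first, first)
            return " ".join([canonical_first.title()] + [p.title() for p in rest])

        # All of these standardize to "Robert Jones", so records from different
        # source systems can be matched on the standardized value.
        for name in ["Bob Jones", "Rob Jones", "Bobby Jones", "Robert Jones"]:
            print(standardize_name(name))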

    9. 9 Cleaning Cleaning is a process used to remove unwanted 'noise' from data and to make data more presentable. Consider the following shipping address: Address 1: 123 Main * deliver to rear entrance *, City: Antigonish, Postal Code: B2G 2A1, Province: NS. The cleaning process would remove the * comment * in the address field.

    10. 10 Cleaning Consider the following name: BOB JONES. Many tools provide a 'nice name' function that will turn the above into: Bob Jones. Consider the following phone number: 9028630440. Many tools provide the ability to turn this into: (902) 863-0440.
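    A minimal sketch of the cleaning operations described above, assuming the comment noise is delimited by asterisks and the phone number is a 10-digit North American number:

        import re

        def clean_address(address: str) -> str:
            """Strip delivery comments delimited by asterisks, e.g. '* deliver to rear entrance *'."""
            return re.sub(r"\*[^*]*\*", "", address).strip()

        def nice_name(name: str) -> str:
            """Convert an all-caps name like 'BOB JONES' to 'Bob Jones'."""
            return name.title()

        def format_phone(digits: str) -> str:
            """Format a 10-digit number as '(902) 863-0440'."""
            return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

        print(clean_address("123 Main * deliver to rear entrance *"))  # -> "123 Main"
        print(nice_name("BOB JONES"))                                  # -> "Bob Jones"
        print(format_phone("9028630440"))                              # -> "(902) 863-0440"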

    11. 11 Cleaning Disney example: Disney provided kiosks in Epcot that allowed visitors to sign up to receive information regarding Disney vacations (a great marketing tool). They got a high percentage of bogus sign-ups, like: Goofy, 123 Main St; Mickey Mouse; etc. They built rules (using a tool) to exclude these records from the DW. Obscenity checking was also done and offending records eliminated.

    12. 12 Cleaning and standardization tools
    - Required primarily for cleaning and standardizing the customer dimension and, to a lesser extent, the product dimension
    - Provide lots of functionality for cleaning and standardizing names and addresses
    - Key functionality: the ability to match records based on very sophisticated heuristic rules
    - Provide users the ability to define their own rules
    - Allow matching on a percentage match or score – users define what constitutes a match: 80%, 90%, 100%?
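    The commercial tools named here are proprietary, so as a rough stand-in the sketch below uses Python's standard-library difflib to compute a similarity score and apply a user-defined match threshold (the 0.9 cutoff is an illustrative assumption):

        from difflib import SequenceMatcher

        def match_score(a: str, b: str) -> float:
            """Similarity score between 0.0 and 1.0 for two standardized strings."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def is_match(a: str, b: str, threshold: float = 0.9) -> bool:
            """Treat two records as the same entity when the score meets the user-defined threshold."""
            return match_score(a, b) >= threshold

        print(match_score("Robert Jones, 123 Main, Antigonish",
                          "Robert Jones, 123 Main St, Antigonish"))
        print(is_match("Robert Jones, 123 Main, Antigonish",
                       "Robert Jones, 123 Main St, Antigonish"))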

    13. 13 Cleaning and Standardization tools Example of advanced functionality for business name/address matching: are the following companies in fact the same company? Cleaning and standardization tools provide the ability to find that A=B and B=C, so A=B=C.

    14. 14 Cleaning and Standardization Vendors Leading vendors:
    - Id Centric from FirstLogic
    - Trillium from Harte Hanks
    - Code 1 from Group 1 Software
    - IDR from Vality
    Typical cost of software: $250,000 US and up

    15. 15 Denormalization Denormalization is the process of ‘flattening’ dimension and fact tables from multiple tables to (in most cases) a single table. Consider the following order (fact) data from a production system

    16. 16 Denormalization - Dimensions Consider the employee-to-branch relationship. We would probably denormalize that into something like a 'location' dimension.
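    The employee and branch tables from the slide are not in the transcript. The sketch below illustrates the general idea with hypothetical rows, joining employee and branch into a single flattened dimension (all table and column names are assumptions):

        # Hypothetical normalized source rows (illustrative only).
        employees = [
            {"emp_id": 1, "emp_name": "Smith", "branch_id": 10},
            {"emp_id": 2, "emp_name": "Chan",  "branch_id": 20},
        ]
        branches = {
            10: {"branch_name": "Antigonish", "region": "Atlantic"},
            20: {"branch_name": "Halifax",    "region": "Atlantic"},
        }

        # Denormalize: join each employee to its branch so the dimension is a single flat table.
        location_dim = [{**emp, **branches[emp["branch_id"]]} for emp in employees]
        for row in location_dim:
            print(row)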

    17. 17 Derivation We often need to derive data items. A simple example is measures on the fact table. Consider the order fact:
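    The order fact referenced above is not reproduced in the transcript. As a hedged sketch, a derived measure such as an extended line amount might be computed during transformation like this (column names are assumptions):

        # Derive an extended (line) amount from quantity and unit price during transformation.
        order_fact_rows = [
            {"order_id": 1001, "product_key": 7, "quantity": 3, "unit_price": 19.99},
            {"order_id": 1002, "product_key": 9, "quantity": 1, "unit_price": 249.00},
        ]
        for row in order_fact_rows:
            row["extended_amount"] = round(row["quantity"] * row["unit_price"], 2)
        print(order_fact_rows)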

    18. 18 Derivation - Households A more complex example. Often, marketers are interested in understanding the households that customers live in and the relationships that exist (husband, wife, kids, etc.). Typically, this information is not available from any source system and needs to be derived. A household may have more than one address (if, for example, a family has a permanent and a summer address).

    19. 19 Derivation – Household example Typically, household derivation is done using cleaning/standardization tools like Id Centric and Trillium. These tools look to match individuals into households based on standardized names and addresses. Examples:

    20. 20 Derivation - examples
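    The worked examples for this slide appear to have been images and are not in the transcript. As a rough, assumed illustration, households can be derived by grouping individuals on standardized surname and address:

        from collections import defaultdict

        # Illustrative customer records with already-standardized names and addresses.
        customers = [
            {"name": "Robert Jones", "surname": "Jones", "address": "123 Main, Antigonish, NS, B2G 2A1"},
            {"name": "Mary Jones",   "surname": "Jones", "address": "123 Main, Antigonish, NS, B2G 2A1"},
            {"name": "Susan Chan",   "surname": "Chan",  "address": "45 Oak St, Halifax, NS, B3H 1A1"},
        ]

        # Derive households: group on (standardized surname, standardized address).
        households = defaultdict(list)
        for c in customers:
            households[(c["surname"], c["address"])].append(c["name"])

        for key, members in households.items():
            print(key, "->", members)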

    21. 21 Derivation – Households - Conclusions Deriving households can be tricky, and organizations tend to be pretty conservative in their approach. You don't want to assume a relationship that does not exist – that may be offensive. This is balanced against the desire not to make the same marketing offer (for credit cards, for example) multiple times to the same household.

    22. 22 Aggregation Often, we need to generate aggregated fact data for 2 main reasons: to speed up queries and to generate 'snapshot' facts. Speeding up query processing: the DBA continuously monitors the performance of queries and will create an aggregate fact when atomic facts are often rolled up a certain way. The most common rollup is date. Assuming we keep atomic fact data, it is very common to want to analyze orders by a time period such as week or month. If we build aggregate fact tables, these can be used in place of atomic-level facts and speed query processing, as they contain many fewer records.

    23. 23 Aggregation Speeding up queries: note that aggregate fact tables retain references to the other dimension tables. Many query tools are 'aggregate aware': they can detect when an aggregate table exists for a fact and automatically select that table in the query. This is transparent to the user. Snapshot facts: as discussed in a previous lecture, organizations typically want both a transactional and a snapshot view of data. Generation of snapshot facts requires a significant amount of aggregation of data.

    24. 24 Aggregation When calculating aggregates, the following is typically true. Assume we are building an aggregate order fact where we want products ordered by customer by location by month:
    - The data for the current month would not be available until the end of the month
    - The FKs to other relevant dimensions are the same as in the atomic-level fact
    - The FK to the date dimension needs to be to a month, not a day. To achieve this, we need to specify 'levels' in the date dimension so that each unique month has its own unique key. So, in the date dimension, we have records that represent unique days (3/19/02) AND unique months (3/02). This makes things more complex for users.
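    As an illustrative sketch (not from the slides), rolling atomic order facts up to a monthly aggregate might look like the following, with assumed column names and a month-level key standing in for the day-level date key:

        from collections import defaultdict

        # Atomic order fact rows (illustrative). date_key references a day-level date dimension row.
        atomic_facts = [
            {"date_key": 20020318, "month_key": 200203, "customer_key": 1, "location_key": 5,
             "product_key": 7, "quantity": 2, "amount": 40.0},
            {"date_key": 20020319, "month_key": 200203, "customer_key": 1, "location_key": 5,
             "product_key": 7, "quantity": 1, "amount": 20.0},
        ]

        # Aggregate: group by month / customer / location / product; month_key replaces the day-level key.
        aggregate = defaultdict(lambda: {"quantity": 0, "amount": 0.0})
        for f in atomic_facts:
            key = (f["month_key"], f["customer_key"], f["location_key"], f["product_key"])
            aggregate[key]["quantity"] += f["quantity"]
            aggregate[key]["amount"] += f["amount"]

        for key, measures in aggregate.items():
            print(key, measures)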

    25. 25 Data Mapping and Type Conversions
    - Need to implement logic to choose values to map into fields (e.g., different gender codes from different systems)
    - Need to implement IF-THEN-ELSE (or CASE) logic to determine how individual fields will be populated
    - Often need to convert data types from one form to another. Example: a date stored in a CHAR field, converted to DATE. Example: time of day stored as part of a date – parse it out and store as an integer
    - ETL tools are really good at mapping and converting from one data type to another
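    A minimal sketch of this kind of mapping and type conversion, assuming hypothetical source gender codes and CHAR-style date formats:

        from datetime import datetime

        # Map differing source-system gender codes onto a single warehouse code (the codes are assumptions).
        GENDER_MAP = {"M": "M", "MALE": "M", "1": "M", "F": "F", "FEMALE": "F", "2": "F"}

        def map_gender(source_value: str) -> str:
            """CASE-style mapping with a default for unknown codes."""
            return GENDER_MAP.get(source_value.strip().upper(), "U")

        def char_to_date(char_date: str):
            """Convert a date held in a CHAR field (assumed 'YYYYMMDD') to a proper date."""
            return datetime.strptime(char_date, "%Y%m%d").date()

        def time_of_day_seconds(char_timestamp: str) -> int:
            """Parse the time portion of a CHAR timestamp (assumed 'YYYYMMDD HH:MM:SS') into seconds past midnight."""
            t = datetime.strptime(char_timestamp, "%Y%m%d %H:%M:%S")
            return t.hour * 3600 + t.minute * 60 + t.second

        print(map_gender("female"), char_to_date("20020319"), time_of_day_seconds("20020319 14:30:00"))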

    26. 26 Surrogate Key Processing Design needs to consider two aspects of surrogate key processing:
    - Assigning and maintaining the keys associated with dimension records
    - Assigning surrogate keys to fact records

    27. 27 Surrogate Keys For a given dimension table, the initial setup involves two steps:
    - Create a 'sequence' that will provide the surrogate keys to the dimension table. The sequence generates a sequential set of integers that will be assigned to each row in the dimension table
    - Create a 'key map' table that will map the production key(s) to the surrogate key assigned

    28. 28 Surrogate keys – Key Map table Assume we have products coming from 2 systems and there is overlap in the products in the systems. Assume we have a method to tell us that X2334R is the same as STDHMMR.

    29. 29 Surrogate Keys – key map table The role of the key map table is to create and maintain a record of the production key(s) that map to the surrogate key assigned to a given dimension row. Simple Format, using previous example:

    30. 30 Surrogate Key assignment Surrogate key assignment is a two-step process:
    - Manage assignment of keys to dimension records as primary keys
    - Manage assignment of surrogate keys to fact records
    The key map table is critical to this process as a lookup table.
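    A compact sketch of both steps, using an in-memory key map and a simple integer sequence (the production keys X2334R/STDHMMR follow the earlier slides; everything else is an illustrative assumption):

        from itertools import count

        surrogate_seq = count(1)          # the 'sequence' that hands out surrogate keys
        key_map = {}                      # (source_system, production_key) -> surrogate key

        def get_or_assign_surrogate(source_system: str, production_key: str) -> int:
            """Dimension processing: look up the production key; assign a new surrogate if unseen."""
            k = (source_system, production_key)
            if k not in key_map:
                key_map[k] = next(surrogate_seq)
            return key_map[k]

        # Two systems, overlapping product: a prior matching step told us X2334R is the same as STDHMMR,
        # so we register both production keys against the same surrogate key.
        sk = get_or_assign_surrogate("SYS_A", "X2334R")
        key_map[("SYS_B", "STDHMMR")] = sk

        # Fact processing: replace the production key on each incoming fact with the surrogate key.
        fact = {"source": "SYS_B", "product_code": "STDHMMR", "quantity": 3}
        fact["product_key"] = key_map[(fact["source"], fact["product_code"])]
        print(fact)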

    31. 31 Surrogate Key Assignment Process - Dimensions

    32. 32 Surrogate key assignment - facts

    33. 33 Slowly changing dimensions We need to build the logic for processing slowly changing dimensions, and this can be challenging. Note that for a complex dimension like customer, we may implement multiple change strategies, depending on what data is changing!!
    - Type 1
    - Type 2
    - Type 3
    - Demographics
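    The lecture gives no code for this; the sketch below covers Type 2 handling only (expire the current row, insert a new version under a new surrogate key), with assumed column names:

        from datetime import date
        from itertools import count

        surrogate_seq = count(100)

        # Current customer dimension rows (illustrative). 'current' flags the active version.
        customer_dim = [
            {"customer_key": 1, "customer_id": "C123", "city": "Antigonish",
             "effective_date": date(2001, 1, 1), "end_date": None, "current": True},
        ]

        def apply_type2_change(dim_rows, customer_id, changes, change_date):
            """Type 2: close out the current row and add a new version with a new surrogate key."""
            for row in dim_rows:
                if row["customer_id"] == customer_id and row["current"]:
                    row["end_date"] = change_date
                    row["current"] = False
                    new_row = {**row, **changes,
                               "customer_key": next(surrogate_seq),
                               "effective_date": change_date, "end_date": None, "current": True}
                    dim_rows.append(new_row)
                    return new_row

        apply_type2_change(customer_dim, "C123", {"city": "Halifax"}, date(2002, 3, 19))
        for r in customer_dim:
            print(r)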

    34. 34 Audit As with extract, it's important to build audit and error handling logic into the process.

    35. 35 Load Process
    - Always use the bulk loader for ALL inserts – even those from within a database
    - Most DBMSs provide a bulk loader function. Basically, the bulk loader provides the ability to load data more quickly as it takes a number of shortcuts. For example, the bulk loader does not insert records into the recovery log on a record-by-record basis
    - Presort what is to be loaded by primary key. The DBMS provides capabilities; you can also purchase tools like SynchSort which sort data very quickly
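    The slides do not tie the bulk loader to a specific DBMS. As one hedged example, assuming a PostgreSQL target and the psycopg2 driver, presorted rows could be streamed through COPY, the bulk-load path that avoids row-by-row insert overhead (the table, columns, and connection string are placeholders):

        import csv
        import io
        import psycopg2  # assumes a PostgreSQL target; other DBMSs have their own bulk loaders

        rows = [
            (3, "Widget", 19.99),
            (1, "Gadget", 9.99),
            (2, "Gizmo", 4.99),
        ]

        # Presort by primary key before loading, as recommended above.
        rows.sort(key=lambda r: r[0])

        # Write the presorted rows to an in-memory CSV buffer and stream it through COPY.
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        buf.seek(0)

        conn = psycopg2.connect("dbname=dw user=etl")   # placeholder connection string
        with conn, conn.cursor() as cur:
            cur.copy_expert("COPY product_dim (product_key, name, price) FROM STDIN WITH (FORMAT csv)", buf)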

    36. 36 Load Process
    - Many DBMSs allow simple transformations (e.g., aggregation of data) during the load process. Rule of thumb: stay away from any transformations using load facilities
    - Index management – need to determine whether or not to drop the index(es) on a table before load, then rebuild. Rule of thumb: if loading more than 10% of the table, drop and recreate
    - Use the parallel capabilities of the DBMS, OS and hardware to do as much processing simultaneously as possible

    37. 37 Design Process Overview Step 1. Environment Setup
    Test, choose, and implement tools:
    - ETL
    - Cleaning/Standardization
    - Sort utilities
    - Code management
    - Metadata
    Define documentation standards
    Ensure all design documentation is saved as metadata

    38. 38 Design Process Overview Step 2. High-Level Plan
    - Create a very high-level, one-page schematic of the source-to-target flow
    - Identify starting and ending points
    - Label known data sources
    - Label targets
    - Include notes about known gotchas

    39. 39 Example – Process Overview

    40. 40 Design Process Overview Step 3. Detailed Design
    - Drill down by target table or logical group of target tables, graphically sketching any complex data restructuring or transformations
    - Graphically illustrate the surrogate-key generation process
    - Develop a preliminary job sequencing

    41. 41 Design Process Overview Step 4: Design ETL for a simple dimension
    - Start with a static extract, transform and load
    Step 5: Add change logic for the dimension
    - Design and implement the logic required for incremental extract and processing
    - Design and implement logic to manage slowly changing dimensions
    Step 6: Design ETL for the initial (historical) load of facts
    Step 7: Design ETL for incremental ETL of facts
    - 'Tune' the fact table load to take advantage of parallelism, bulk load, etc.

    42. 42 Design Process Overview Step 8: Design and build the process for aggregates and for moving data from the data warehouse to data marts
    - Special consideration required for the design of MDDB-based marts
    Step 9: Build ETL automation, including:
    - Job scheduling
    - Statistics generation
    - Error handling
