1 / 21

ETL

ETL. Prof. Navneet Goyal Department of Computer Science & Information Systems BITS, Pilani. Topics. Requirements Build or Buy? ETL Data Structures Data Flow Extract Clean & Conform Deliver Dimension Tables Fact tables Implementation & Operations. Introduction.

Download Presentation

ETL

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ETL Prof. Navneet Goyal Department of Computer Science & Information Systems BITS, Pilani

  2. Topics • Requirements • Build or Buy? • ETL Data Structures • Data Flow • Extract • Clean & Conform • Deliver • Dimension Tables • Fact tables • Implementation & Operations

  3. Introduction • ETL system is the foundation of any DW system • Still the most under rated system! • An ETL system • Extracts data from source systems • Enforces data quality & consistency • Conforms data so that separate source systems could be used together • Delivers data in a presentation ready format that can be used by end users

  4. Introduction • ETL system makes or breaks a DW • Building an ETL system is a back room activity not visible to end users • Consumes almost 70-80% of resources needed for implementation & maintenance of a DW • Mission of ETL system: get data out of the source systems & load it into the DW

  5. Introduction • Extract • Extract relevant data • Transform • Transform data to DW format • Build keys, etc. • Cleansing of data • Load • Load data into DW • Build aggregates, etc.

  6. Choice of Architecture Tool Based ETL Simpler, Cheaper & Faster development • People with business skills & not much technical skills can use it. • Automatically generate Metadata • Automatically generates data Lineage & data dependency analysis • Offers in-line encryption & compression capabilities • Manage complex load balancing across servers

  7. Choice of Architecture Hand-Coded ETL • Quality of tool by exhaustive unit testing • Better metadata • Requirement may be just file based processes not database-stored procedures • Use of existing legacy routines • Use of in-house programmers • Unlimited flexibility

  8. Middleware & Connectivity Tools • Provide transparent access to source systems in heterogeneous computing environments • Expensive but often prove invaluable because they provide transparent access to DBs of different types, residing on different platforms • Examples: • IBM: Data Joiner • Oracle: Transparent Gateway • SAS: SAS/Connect • Sybase: Enterprise Connect

  9. Extraction Tools • Lot of tools available in the market • Tool selection tedious • Choice of tool depends on following factors: • Source system platform and DB • Built-in extraction or duplication functionality • Batch windows of the operational systems

  10. Extraction Methods • Bulk Extractions • Entire DW is refreshed periodically • Heavily taxes the network connections between the source & target DBs • Easier to set up & maintain • Change-based Extractions • Only data that have been newly inserted or updated in the source systems are extracted & loaded into the DW • Places less stress on the network but requires more complex programming to determine when a new DW record must be inserted or when an existing DW record must be updated

  11. Transformation Tools • Transform extracted data into the appropriate format, data structure, and values that are required by the DW • Features provided: • Field splitting & consolidation • Standardization • Abbreviations, date formats, data types, character formats, etc. • Deduplication

  12. Mission of ETL team To build the back room of the DW • Deliver data most effectively to end user tools • Add value to the data in the cleaning & conforming steps • Protect & document the lineage of data

  13. Mission of ETL team The back room must support 4 key steps • Extracting data from original sources • Quality assuring & cleaning data • Conforming the labels & measures in the data to achieve consistency across the original sources • Delivering the data in a physical format that can be used by query tools and report writers

  14. ETL Data Structures Data Flow Extract  Clean  Conform  Deliver • Back room of a DW is often called the data staging area • Staging means ‘writing to disk’ • ETL team needs a number of different data structures for all kinds of staging needs

  15. To stage or not to stage • Decision to store data in physical staging area versus processing it in memory is ultimately the choice of the ETL architect

  16. To stage or not to stage • A conflict between • getting the data from the operational systems as fast as possible • having the ability to restart without repeating the process from the beginning • Reasons for staging • Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after major processing (cleaning, transformation, etc). • Backup: can reload the data warehouse from the staging tables without going to the sources • Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse

  17. Designing the staging area • The staging area is owned by the ETL team • no indexes, no aggregations, no presentation access, no querying, no service level agreements • Users are not allowed in the staging area for any reason • staging is a “construction” site • Reports cannot access data in the staging area • tables can be added, or dropped without notifying the user community • Controlled environment

  18. Designing the staging area (contd…) • Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area) • The staging area consists of both RDBMS tables and data files

  19. Staging Tables Volumetric Worksheet • Lists each table in the staging area with the following information: • Table Name: name or table or file in the DSA. One row in the WS for each staging table • Update Strategy: Indicates how a table is maintained. For persistent tables it will have data appended, updated, or perhaps deleted. Transient tables are truncated and reloaded with each process • Load Frequency: How often the table is loaded or changed by the ETL process. In real-time environment – continuously • ETL Jobs • Initial Row count:

  20. Staging Area data Structures in the ETL System • Flat files • fast to write, append to, sort and filter (grep) but slow to update, access or join • enables restart without going to the sources • XML Data Sets • Used as a medium of data transfer between incompatible data sources • Gives enough information to create tables using CREATE TABLE • Relational Tables • Metadata, SQL interface, DBA support

More Related