140 likes | 284 Views
DW Lab # 3 Overview of Extraction, Transformation, and Loading. By Lecturer/ Aisha Dawood. LAB EXERCISE #3 Oracle Data Warehousing. Overview of ETL in Data Warehouses
E N D
DW Lab # 3Overview of Extraction, Transformation, and Loading By Lecturer/ Aisha Dawood
LAB EXERCISE #3 Oracle Data Warehousing Overview of ETL in Data Warehouses You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the data warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL refers to a broad process, and not three well-defined steps. The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise.
LAB EXERCISE #3 Oracle Data Warehousing ETL Basics in Data Warehousing What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time. After data is extracted, it has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too.
LAB EXERCISE #3 Oracle Data Warehousing The need of ETL in data warehouse As the data warehouse is a living IT system, sources and targets might change. Those changes must be maintained and tracked through the lifespan of the system without overwriting or deleting the old ETL process flow information. To build and keep a level of trust about the information in the warehouse, the process flow of each individual record in the warehouse can be reconstructed at any point in time in the future in an ideal case.
LAB EXERCISE #3 Oracle Data Warehousing Extraction Methods in Data Warehouses Logical Extraction Methods There are two types of logical extraction: ■ Full Extraction ■ Incremental Extraction Full Extraction The data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there's no need to keep track of changes to the data source since the last successful extraction. An example for a full extraction may be an export file of a distinct table or a remote SQL statement scanning the complete source table. Incremental Extraction At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. This event may be the last time of extraction or a more complex business event like the last booking day of a fiscal period. To identify this delta change there must be a possibility to identify all the changed information since this specific time event. This information can be either provided by the source data itself such as an application column, reflecting the last-changed timestamp.
LAB EXERCISE #3 Oracle Data Warehousing Physical Extraction Methods Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. There are the following methods of physical extraction: ■ Online Extraction ■ Offline Extraction Online Extraction The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Offline Extraction The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data already has an existing structure (for example, redo logs, archive logs or transportable tablespaces) or was created by an extraction routine.
LAB EXERCISE #3 Oracle Data Warehousing Offline Extraction You should consider the following structures: ■ Flat files Data in a defined, generic format. Additional information about the source object is necessary for further processing. ■ Dump files Oracle-specific format. Information about the containing objects may or may not be included, depending on the chosen utility. ■ Redo and archive logs Information is in a special, additional dump file. ■ Transportable tablespaces A powerful way to extract and move large volumes of data between Oracle databases.
LAB EXERCISE #3 Oracle Data Warehousing Change Data Capture There are several techniques for implementing a self-developed change capture on Oracle Database source systems: ■ Timestamps ■ Partitioning ■ Triggers These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners of the source system prior to implementation.
LAB EXERCISE #3 Oracle Data Warehousing Timestamps The tables in some operational systems have timestamp columns. The timestamp specifies the time and date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns. For example, the following query might be useful for extracting to’27-JUN-08’s data from an orders table in OE schema: Such modification would require, first, modifying the operational system's tables to include a new timestamp column and then creating a trigger to update the timestamp column following every operation that modifies a given row.
LAB EXERCISE #3 Oracle Data Warehousing Partitioning Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week's data.
LAB EXERCISE #3 Oracle Data Warehousing Data Warehousing Extraction Examples You can extract data in two ways: ■ Extraction Using Data Files ■ Extraction Through Distributed Operations Most database systems provide mechanisms for exporting or unloading data from the internal database format into flat files. When the source system is an Oracle database, several alternatives are available for extracting data into files: ■ Extracting into Flat Files Using SQL*Plus ■ Extracting into Flat Files Using OCI or Pro*C Programs ■ Exporting into Export Files Using the Export Utility ■ Extracting into Export Files Using External Tables
LAB EXERCISE #3 Oracle Data Warehousing Extracting into Flat Files Using SQL*Plus: The most basic technique for extracting data is to execute a SQL query in SQL*Plus and direct the output of the query to a file. For example, to extract a flat file, country_city.log, with the pipe sign as delimiter between column values, containing a list of the cities in the US in the tables countries and customers, the following SQL script could be run: SPOOL country_city.log SELECT distinct t1.country_name ||'|'|| t2.cust_city FROM countries t1, customers t2 WHERE t1.country_id = t2.country_id AND t1.country_name= 'United States of America'; SPOOL off
LAB EXERCISE #3 Oracle Data Warehousing The output location identified by the command “host cd” … The output file:
LAB EXERCISE #3 Oracle Data Warehousing This extraction technique can be parallelized by initiating multiple, concurrent SQL*Plus sessions, each session running a separate query representing a different portion of the data to be extracted. For example, suppose that you wish to extract data from an orders table, and that the orders table has been range partitioned by month. (OE schema) SPOOL order_SEP.dat SELECT * FROM orders WHERE order_date BETWEEN TO_DATE('01-SEP-07') AND TO_DATE('29-SEP-07'); SPOOL OFF Note: The physical method is based on a range of values. By viewing the data dictionary