600 likes | 761 Views
ISQS 6339, Data Management & Business Intelligence Extraction, Transformation, and Loading (II). Zhangxi Lin Texas Tech University. Agenda. I. Using SSIS for ETL Integration Services Learn by doing Package items Problem-oriented package development II. The Principle of ETL Extraction
E N D
ISQS 6339, Data Management & Business IntelligenceExtraction, Transformation, and Loading (II) Zhangxi Lin Texas Tech University ISQS 6339, Data Management & Business Intelligence
Agenda I. Using SSIS for ETL • Integration Services • Learn by doing • Package items • Problem-oriented package development II. The Principle of ETL • Extraction • Transformation • Loading ISQS 6339, Data Management & Business Intelligence
II. The Principle of ETL ISQS 6339, Data Management & Business Intelligence
Structure and Components of Business Intelligence SSMS SSIS SSAS SSRS SAS EG SAS EM ISQS 6339, Data Management & Business Intelligence
Automating your routine information processing tasks • Your routine information processing tasks • Read online news at 8:00a and collect a few most important pieces • Retrieve data from database to draft a short daily report at 10a • View and reply emails and take some notes that are saved in a database • View 10 companies’ webpage to see the updates. Input the summaries into a database • Browse three popular magazines twice a week. Input the summaries into a database • Generate a few one-way frequency and two-way frequency tables and put them on the web • Merge datasets collected by other people into a main database. • Prepare a weekly report using the database and at 4p every Monday, and publish it to the internal portal site. • Prepare a monthly report at 11a on the first day of a month, which must be converted into a pdf file and uploaded to the website. • Seems there are many things are on going. How to handle them properly in the right time? • Organizer – yes • How about regular data processing tasks? ISQS 6339, Data Management & Business Intelligence
Information Processing and Information Flow • Transaction processing • Interactions between a user and a computer application system with immediate responses from the application • Operational processing • Make use of computer to control a process • Batch processing • Consisting of a series of executions, each of which is applied to a set of data and turns the result to the next one. • Analytical processing • The interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance. ISQS 6339, Data Management & Business Intelligence
Programs Gateways Tools Extraction, Transformation, Loading (ETL) Processes • Extract source data • Transform/clean data • Index and summarize • Load data into warehouse • Detect changes • Refresh data Operational systems ETL Data Warehouse ISQS 6339, Data Management & Business Intelligence
ETL: Tasks, Importance, and Cost Extract Clean up Consolidate Restructure Load Maintain Refresh Data Warehouse Operationalsystems ETL Relevant Useful Quality Accurate Accessible ISQS 6339, Data Management & Business Intelligence
Data mapping Transform Extracting Data • Source systems • Data from various data sources in various formats • Extraction Routines • Developed to select data fields from sources • Consist of business rules, audit trails, error correction facilities Warehouse database Operational databases Data staging area ISQS 6339, Data Management & Business Intelligence
IMS DB2 Oracle Sybase Informix VSAM SAP Shared Medical Systems Dun and Bradstreet Financials Hogan Financials Oracle Financials Production Data • Operating system platforms • File systems • Database systems and vertical applications ISQS 6339, Data Management & Business Intelligence
Archive Data • Historical data • Useful for analysis over long periods of time • Useful for first-time load • May require unique transformations Operation databases Warehouse database ISQS 6339, Data Management & Business Intelligence
Planning Marketing Accounting Internal Data • Planning, sales, and marketing organization data • Maintained in the form of: • Spreadsheets (structured) • Documents (unstructured) • Treated like any other source data Warehouse database ISQS 6339, Data Management & Business Intelligence
A.C. Nielsen, IRI, IMS,Walsh America Purchased databases Competitive information Dun and Bradstreet Economic forecasts Wall Street Journal Barron's Warehousing databases External Data • Information from outside the organization • Issues of frequency, format, and predictability • Described and tracked using metadata ISQS 6339, Data Management & Business Intelligence
Possible ETL Failures • A missing source file • A system failure • Inadequate metadata • Poor mapping information • Inadequate storage planning • A source structural change • No contingency plan • Inadequate data validation ISQS 6339, Data Management & Business Intelligence
Maintaining ETL Quality • ETL must be: • Tested • Documented • Monitored and reviewed • Disparate metadata must be coordinated. ISQS 6339, Data Management & Business Intelligence
Transformation • Transformation eliminates anomalies from operational data: • Cleans and standardizes • Presents subject-oriented data • Transform: • Clean up • Consolidate • Restructure Extract Warehouse Operationalsystems Load Data Staging Area ISQS 6339, Data Management & Business Intelligence
Operationalsystem Operationalsystem Warehouse Warehouse Remote Staging Model Data staging area within the warehouse environment Transform Extract Load Staging area Data staging area in its own environment Transform Extract Load Staging area ISQS 6339, Data Management & Business Intelligence
On-site Staging Model • Data staging area within the operational environment,possibly affecting the operational system Transform Extract Load Operational system Staging area Warehouse ISQS 6339, Data Management & Business Intelligence
Data Anomalies • No unique key • Data naming and coding anomalies • Data meaning anomalies between groups • Spelling and text inconsistencies ISQS 6339, Data Management & Business Intelligence
Transformation Routines • Cleaning data • Eliminating inconsistencies • Adding elements • Merging data • Integrating data • Transforming data before load ISQS 6339, Data Management & Business Intelligence
Transforming Data: Problems and Solutions • Multipart keys • Multiple local standards • Multiple files • Missing values • Duplicate values • Element names • Element meanings • Input formats • Referential Integrity constraints • Name and address ISQS 6339, Data Management & Business Intelligence
Product code = 12M654313 45 Salesperson code Country code Sales territory Productnumber Multipart Keys Problem • Multipart keys ISQS 6339, Data Management & Business Intelligence
Multiple Local Standards Problem • Multiple local standards • Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600 ISQS 6339, Data Management & Business Intelligence
Multiple Files Problem • Added complexity of multiple source files • Start simple Multiple source files Logic to detectcorrect source Transformed data ISQS 6339, Data Management & Business Intelligence
Missing Values Problem • Solution: • Ignore • Wait • Mark rows • Extract when time-stamped If NULL thenfield = ‘A’ ISQS 6339, Data Management & Business Intelligence
Duplicate Values Problem • Solution: • SQL self-join techniques • RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> SELECT ... 2 FROM table_a, table_b 3 WHERE table_a.key (+)= table_b.key 4 UNION 5 SELECT ... 6 FROM table_a, table_b 7 WHERE table_a.key = table_b.key (+); ISQS 6339, Data Management & Business Intelligence
Element Names Problem • Solution: Common naming conventions Customer Client Customer Contact Name ISQS 6339, Data Management & Business Intelligence
Element Meaning Problem • Avoid misinterpretation • Complex solution • Document meaning in metadata Customer’s name All customer details All details except name Customer_detail ISQS 6339, Data Management & Business Intelligence
Input Format Problem EBCDIC ASCII “123-73” 12373 ACME Co. áøåëéí äáàéí Beer (Pack of 8) ISQS 6339, Data Management & Business Intelligence
Referential Integrity Problem • Solution: • SQL anti-join • Server constraints • Dedicated tools ISQS 6339, Data Management & Business Intelligence
Name and Address Problem • Single-field format • Multiple-field format Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565 ISQS 6339, Data Management & Business Intelligence
Quality Data: Importance and Benefits • Quality data: • Key to a successful warehouse implementation • Quality data helps you in: • Targeting right customers • Determining buying patterns • Identifying householders: private and commercial • Matching customers • Identify historical data ISQS 6339, Data Management & Business Intelligence
Data Quality Guidelines • Operational data: • Should not be used directly in the warehouse • Must be cleaned for each increment • Is not simply fixed by modifying applications ISQS 6339, Data Management & Business Intelligence
Transformation Techniques • Merging data • Adding a Date Stamp • Adding Keys to Data ISQS 6339, Data Management & Business Intelligence
Merging Data • Operational transactions do not usually map one-to-one with warehouse data. • Data for the warehouse is merged to provide information for analysis. ISQS 6339, Data Management & Business Intelligence
Merging Data ISQS 6339, Data Management & Business Intelligence
Adding a Date Stamp • Time element can be represented as a: • Single point in time • Time span • Add time element to: • Fact tables • Dimension data ISQS 6339, Data Management & Business Intelligence
Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key Item Table Item_id Dept_id Time_key Adding a Date Stamp:Fact Tables and Dimensions ISQS 6339, Data Management & Business Intelligence
#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00 #5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Data values or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Adding Keys to Data ISQS 6339, Data Management & Business Intelligence
Summarizing Data 1. During extraction on staging area 2. After loading to the warehouse server Operational databases Staging area Warehouse database ISQS 6339, Data Management & Business Intelligence
Sources Stage Rules Publish Extract Transform Load Query Maintaining Transformation Metadata • Transformation metadata contains: • Transformation rules • Algorithms and routines ISQS 6339, Data Management & Business Intelligence
Maintaining Transformation Metadata • Restructure keys • Identify and resolve coding differences • Validate data from multiple sources • Handle exception rules • Identify and resolve format differences • Fix referential integrity inconsistencies • Identify summary data ISQS 6339, Data Management & Business Intelligence
Transformation Timing and Location • Transformation is performed: • Before load • In parallel • Can be initiated at different points: • On the operational platform • In a separate staging area ISQS 6339, Data Management & Business Intelligence
Monitoring and Tracking • Transformations should: • Be self-documenting • Provide summary statistics • Handle process exceptions ISQS 6339, Data Management & Business Intelligence
Loading Data into the Warehouse • Loading moves the data into the warehouse • Loading can be time-consuming: • Consider the load window • Schedule and automate the loading • Initial load moves large volumes of data • Subsequent refresh moves smaller volumes of data Transform Extract Transport,Load Operational databases Staging area Warehouse database ISQS 6339, Data Management & Business Intelligence
Initial Load and Refresh • Initial Load: • Single event that populates the database with historical data • Involves large volumes of data • Employs distinct ETL tasks • Involves large amounts of processing after load • Refresh: • Performed according to a business cycle • Less data to load than first-time load • Less-complex ETL tasks • Smaller amounts of post-load processing ISQS 6339, Data Management & Business Intelligence
Operational databases T1 T2 T3 Data Refresh Models: Extract Processing Environment • After each time interval, build a new snapshot of the database. • Purge old snap shots. ISQS 6339, Data Management & Business Intelligence
Operational databases T1 T2 T3 Data Refresh Models: Warehouse Processing Environment • Build a new database. • After each time interval, add changes to database. • Archive or purge oldest data. ISQS 6339, Data Management & Business Intelligence
Building the Loading Process • Techniques and tools • File transfer methods • The load window • Time window for other tasks • First-time and refresh volumes • Frequency of the refresh cycle • Connectivity bandwidth ISQS 6339, Data Management & Business Intelligence
Building the Loading Process • Test the proposed technique • Document proposed load • Monitor, review, and revise ISQS 6339, Data Management & Business Intelligence