180 likes | 192 Views
Learn about the standard implementation strategy and process for successful data warehouse demonstration. Discover the importance of prototyping, staging areas, and cross-table validations.
E N D
Demonstration 10EDW Implementation Strategy and Process www.InstantBI.com 1/10/2012
Agenda • Standard Implementation Strategy • Standard Implementation Process • Prototyping • Setting up the Staging Area
Prototyping • “Plan to build it twice. You will anyway” • Frederick P Brooks – The Mythical Man Month • The BEST IT development book ever written • DWs have become FAR more sophisticated • Early 90s (90-94) • 10-20 dimension tables and 5-10 fact tables • All about ‘sales’ and campaigns • These were the high value applications • 300+ work days to build the ETL for these in cobol • Today with models like BI4ALL or even custom development • 100+ dimension tables and 30+ fact tables (easily) • 200-300 work days to build the ETL • Much more of the ‘ETL time’ is ‘understanding data’
Prototyping • Prototyping allows you to • Find data understanding errors before writing ETL • Quickly tune database model for performance • Start developing reports earlier • Report development tools have become more ‘complex’ • Ensure data integrity earlier • Data integrity is the #1 killer of ETL productivity • We used to only find errors when we wrote the ETL • Find ‘assumptions’ that do not hold up for this EDW • More easily communicate the ‘end result’ to business users • If you can’t build the prototype, you can’t build the real thing • Bottom Line: Strongly recommended
System 1 Source Systems Reporting Systems BIApps 2 BIApps 1 Data Marts BIApps 3 System 2 Data Mart 1 System 3 Data Mart 2 Validate and Direct Connect Clean System 4 Data Mart 3 System 5 Data Mart 4 Trans Trans form EDW form EDW ODS And And Staging Data Mart 5 System 6 Load Area Load Extracts Extracts Data Mart 6 System 7 Commercial Specific Apps System 8 App 7 App 1 App 2 App 3 App 4 App 6 App 5 System 9 App 9 App 10 App 14 App 13 App 11 App 12 App 8 System 10 1
Standard Implementation Strategy • Data is ‘somehow’ extracted from source systems • Must be careful to detect deletes • Can be a very difficult problem to solve • Data, as raw as possible, sent to Staging Area • Usually as files, sometimes as ODBC links • The OLTP system usually controls extraction schedule • Extract one field -> extract whole table • Data is profiled to determine real data type • Staging area tables use ‘real’ data types • Start developing understanding of raw data • Understanding data is a huge and difficult job usually
Standard Implementation Strategy • Data loaded into ‘staging area’ • Today we can afford RDBMS staging areas (used to be files) • As data goes into the staging area fields are converted from possible native types to RDBMS types • Numeric/string dates, codes, flags, numerics in chars • Errors must be caught so both source and target columns are kept so the source value is visible • Calculations within a row are performed • Durations/elapsed times, ages etc • Three Flags are set • Row deleted from source – if it was deleted • Row valid – valid by default – set to ‘N’ if found invalid • Row sent to EDW = ‘N’ because it has not • It is then possible to run ‘cross table validations’ on the data BEFORE sending it into the EDW • Always beware sending invalid data into a DW
Example Staging Table CREATE TABLE UNBILLED_CALLS ( CUM_START_DATE_TIME DATE NOT NULL, CALL_DIALLED_DIGITS VARCHAR2(18 BYTE), CALL_DURATION NUMBER(10,2) NOT NULL, CALL_RETAIL_PRICE NUMBER(14,3) NOT NULL, CALL_BREAKDOWN_CODE VARCHAR2(5 BYTE) NOT NULL, CALL_DISCOUNT NUMBER(4,1) NOT NULL, CALL_UNITS NUMBER(4) NOT NULL, CALL_PP_ALLOWANCE NUMBER(5,1) NOT NULL, CALL_CLASS VARCHAR2(5 BYTE), CALL_CATEGORY VARCHAR2(5 BYTE), CALL_ORIGINATION VARCHAR2(2 BYTE), SERVICE_CODE VARCHAR2(2 BYTE) NOT NULL, CUM_CUSTOMER NUMBER(10) NOT NULL, CUM_SUBSCRIBER VARCHAR2(18 BYTE) NOT NULL, CALL_DIRECTION VARCHAR2(2 BYTE), CALL_LOCATION VARCHAR2(13 BYTE), CALL_DESTINATION VARCHAR2(5 BYTE), CALL_RECORD_TYPE VARCHAR2(3 BYTE) NOT NULL, SERVICE_TYPE VARCHAR2(2 BYTE) NOT NULL, BUCKET_TYPE VARCHAR2(5 BYTE), ROW_DEL_FRM_SRC_IND VARCHAR2(1 BYTE) DEFAULT 'N' NOT NULL, ROW_VALID_IND VARCHAR2(1 BYTE) DEFAULT 'Y' NOT NULL, ROW_SENT_TO_IWS VARCHAR2(1 BYTE) DEFAULT 'N' NOT NULL )
Getting Data into Staging Area • Get data into a Staging Area asap • Helps in learning to understand the data • Can query/browse much more easily than in files • We now use (free) utilities to load staging area pttype • Using utilities is faster and less costly than using Infa for pptype development • Read a wide variety of files and load the data ‘as is’ • Defaults the flags • Get ALL data for this release into staging area before starting mapping • Or at least as much as possible • Late arriving data can only confuse the issue
Cross Table Validations • Cross table validations might include • Checking customer/account exists for a sales record • Checking address is a valid address • Checking details provided by retailer match other systems such as sell through capture • Checking codes entered on tables exist • The list of ‘possible’ things to check is endless • EA must decided what validations will stop data from flowing into the EDW • These are more likely to be business than technology based • We have only built to capability to do so, not the rules themselves • Can be implemented in Infa or Stored Procedures
Data Quality • Data Quality Measures/Correction • Can be implemented in ETL tools if acquired or ‘home grown’ • Can send data back to source systems if needed • Is a whole ‘other topic’ • But is will be done at some point in the life of EDW
The ‘Mapping Spreadsheet’ • So, Staging Area has data in it‘What to do next?’ • Build the left hand side of the ‘mapping spreadsheet’ • Source to Target Mapping • The right hand side starts blank!! • Once you have developed your understanding of the data, built/loaded the staging area with all the data you want in the ‘current release’ you are ready to perform ‘data mapping’ and Data Modelling. • To do this, you need to understand current models and have some ideas about BI4ALL model… • Now begins the modeling portion of the training…
Data Mapping • Filling in the right hand side of mapping spreadsheet • Defining ‘real keys’ from source data • Defining tables to be joined/split • Defining how to present a ‘view’ of the staging area such that it can be sent into the EDW • Defining changes to EDW model • Two columns in SS - Table Exists/Column Exists • EDW Modeller sets them for the DBA • We will see the ‘in progress’ version later • Each source field required in the EDW is mapped • Notes and comments are included… • Database level transformations are included in SS • Key role in any EDW implementation
Prototype Building • DBA implements changes to ODW/EDW Physical Model • Tables, indexes, naming standards • Key role is to ‘catch mistakes’ by the modeller • With 000’s of fields to map modeller will make mistakes • From the SS we can now generate pptype ETL • Tools available from IBI • One of which was just written (SeETL) • Includes all elements such as • Generated views • Updating control tables • Running and testing on windows 2008+ • All prior to building Informatica/DataStage ETL
Prototype Building • Once Prototype is ‘relatively stable’ • Sizable amounts of test data can be loaded • Prototype can be moved to deployment platform • Presentation Views can be created • Cognos Catalogs/Business Objects started • Early reports started • More learning about real data volumes can happen • When errors are found (and they will be found) • Prototype can be changed easily • Generated ETL can be changed easily • When everyone is happy with the pptype • EDW is ‘made real’ • Real database, ETL, reports
Conclusion • We have talked about the overall process of EDW Implementation • So, once you have developed your understanding of the data, built/loaded the staging area with all the data you want in the ‘current release’ you are ready to perform ‘data mapping’ and BI4ALL Modelling. • To do this, you need to understand the EDW model… • Now begins the modeling portion of the training…