Session id: The Oracle9i Multi-Terabyte Data Warehouse Jeff Parker, Manager, Data Warehouse Development, Amazon.com
The Challenges • Rapidly evolving business • Growing data volumes • Do more with less
The Challenges • Rapidly evolving business • New international markets • Continual innovation of features on Amazon • Buy it used • Magazine subscriptions • Marketplace Partnerships – Toys R Us, Target • Growing data volumes • Do more with less
The Challenges • Rapidly evolving business • Growing data volumes • 2X growth yearly over the past 5 years • Currently 10 Terabytes of raw data • Do more with less
The Challenges • Rapidly evolving business • Growing data volumes • Do more with less • Innovative use of technology and resources • Throwing money and people at the problem is not an option • Leverage existing investment in Oracle
Addressing the issues • Rapidly evolving business • Denormalize only for performance reasons • Create a solution that allows new datasets to be brought in rapidly to the DW, but without high maintenance costs • Growing data volumes • Do more with less
Addressing the issues • Rapidly evolving business • Growing data volumes • Dual-database approach to ETL • Staging database for efficient transformation of large datasets; SQL and hash joins let transforms scale far better than row-at-a-time processing • Second database optimized for analytics • Oracle as an API • Simplifies the ETL architecture • Better scalability than traditional ETL tools • Do more with less
Addressing the issues • Rapidly evolving business • Growing data volumes • Do more with less • One DW schema supports all countries • Cut costs by eliminating unneeded software • Data driven Load functionality
The ETL Process • Extract data from source • The Load process • Dimensional Transforms
The ETL Process • Extract data from source • Can create one or more files to be loaded • Must produce Metadata upon which the Load process can depend • The Load Process • Dimensional Transforms
Extract produced Metadata • Describes each field in database type terms • Changes as the dataset changes • Can reference multiple files • Very reliable • No additional overhead
XML-Based Metadata
<DATA CHARSET="UTF8" DELIMITER="\t" ROWS="1325987">
  <COLUMNS>
    <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0"/>
    <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80"/>
    <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS"/>
    <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8"/>
  </COLUMNS>
  <FILES>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.1"/>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.2"/>
  </FILES>
</DATA>
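The Load process consumes this metadata programmatically rather than relying on hand-written control files. A minimal sketch of how a loader could parse the sample above into columns, files, and a row count — the function name and data shapes here are illustrative, not Amazon's production code:

```python
import xml.etree.ElementTree as ET

# Sample metadata matching the slide; \\t keeps the delimiter literal.
METADATA = """
<DATA CHARSET="UTF8" DELIMITER="\\t" ROWS="1325987">
  <COLUMNS>
    <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0"/>
    <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80"/>
    <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS"/>
    <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8"/>
  </COLUMNS>
  <FILES>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.1"/>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.2"/>
  </FILES>
</DATA>
"""

def parse_metadata(xml_text):
    """Turn the extract's XML metadata into a plain dict the loader can use."""
    root = ET.fromstring(xml_text)
    columns = [dict(col.attrib) for col in root.find("COLUMNS")]
    files = [f.get("PATHNAME") for f in root.find("FILES")]
    return {"rows": int(root.get("ROWS")), "columns": columns, "files": files}

meta = parse_metadata(METADATA)
```

Because the metadata names every column with database types, lengths, and date masks, a change in the extract automatically flows through to the load side without code changes.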
The ETL Process • Extract data from source • The Load Process • Makes extensive use of External Tables • MERGE and Bulk Insert • Contains integrated DBA tasks • Every load is tracked in an operational database • Dimensional Transforms
The Load Process • External Tables • access to files on the operating system • Is a building block in a broader ETL process • MERGE & Bulk Insert • Integrated DBA tasks
The External Table • Created by using Metadata from the Extract process • Data is read-only • No indexes • Use DBMS_STATS to set number of rows
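Since the external table is created from the extract's metadata, the DDL itself can be generated. A hypothetical sketch of that generation step, with illustrative function names and a hard-coded column list standing in for parsed metadata:

```python
def column_ddl(col):
    """Render one COLUMN attribute dict as an Oracle column definition."""
    dtype = col["DATA_TYPE"]
    if dtype == "NUMBER" and "DATA_PRECISION" in col:
        return f'{col["ID"]} NUMBER({col["DATA_PRECISION"]},{col.get("DATA_SCALE", "0")})'
    if dtype == "VARCHAR2":
        return f'{col["ID"]} VARCHAR2({col["DATA_LENGTH"]})'
    return f'{col["ID"]} {dtype}'  # e.g. DATE

def external_table_ddl(table, columns, directory, files, delimiter="\\t"):
    """Assemble external-table DDL from metadata-derived column dicts."""
    col_list = "\n, ".join(column_ddl(c) for c in columns)
    loc = ", ".join(f"'{f}'" for f in files)
    return (
        f"CREATE TABLE {table} (\n  {col_list}\n)\n"
        f"ORGANIZATION EXTERNAL (\n"
        f"  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        f"  ACCESS PARAMETERS (\n"
        f"    RECORDS DELIMITED BY NEWLINE CHARACTERSET UTF8\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n  )\n"
        f"  LOCATION ({loc})\n)"
    )

cols = [
    {"ID": "dataset_id", "DATA_TYPE": "NUMBER", "DATA_PRECISION": "38", "DATA_SCALE": "0"},
    {"ID": "dataset_name", "DATA_TYPE": "VARCHAR2", "DATA_LENGTH": "80"},
]
ddl = external_table_ddl("XT_datasets_77909", cols, "dat_dir",
                         ["datasets_20020923_US.txt.1"])
```

This is what makes the load "data driven": a new dataset only needs new metadata, not new load code.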
Example External Table • Copy the data to the database server • Data must reside in a file-system location registered by the DBAs: CREATE DIRECTORY dat_dir AS '/stage/flat';
Example External Table 2. Create the external table using the DDL generated from the extract's metadata:
CREATE TABLE XT_datasets_77909 (
  dataset_id    NUMBER
, dataset_name  VARCHAR2(80)
, creation_date DATE
, created_by    VARCHAR2(8)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dat_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE CHARACTERSET UTF8
    FIELDS TERMINATED BY '\t'
  )
  LOCATION ('datasets_20020923_US.txt.1', 'datasets_20020923_US.txt.2')
);
The External Table • No pre-staging of data • Ability to describe a flat file to Oracle • Handles horizontally partitioned files • Good error messaging
The Load Process • External Tables • MERGE • Can be run in parallel • Combined with external table provides a powerful set of ETL tools • Integrated DBA tasks
MERGE • Allows for update or insert in a single statement • If key value already exists • Yes, update row • No, insert row • MERGE statement is auto-generated • Row level column transforms are supported
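Because every load is metadata driven, the MERGE statement itself can be generated from the target's key and non-key column lists. A rough sketch of such a generator — `build_merge` and the `t`/`s` aliases are illustrative names, not the production utility:

```python
def build_merge(target, source, key_cols, data_cols):
    """Emit an Oracle MERGE (upsert) for the given key and data columns."""
    on = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in data_cols)
    cols = key_cols + data_cols
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON ({on})\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({', '.join(cols)})\n"
        f"VALUES ({', '.join('s.' + c for c in cols)})"
    )

stmt = build_merge("DATASETS", "XT_datasets_77909",
                   ["dataset_id"],
                   ["dataset_name", "creation_date", "created_by"])
```

Row-level transforms (such as NVL defaults) can be folded into the generated USING subquery.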
MERGE example
MERGE INTO datasets ds
USING (
  SELECT xt.dataset_id
       , xt.dataset_name
       , xt.creation_date
       , NVL(xt.created_by, 'nobody') AS created_by
  FROM XT_datasets_77909 xt
) src
ON ( ds.dataset_id = src.dataset_id )
WHEN MATCHED THEN UPDATE SET
    ds.dataset_name  = src.dataset_name
  , ds.creation_date = src.creation_date
  , ds.created_by    = src.created_by
  , ds.last_updated  = SYSDATE
WHEN NOT MATCHED THEN INSERT
  ( dataset_id, dataset_name, creation_date, created_by, last_updated )
VALUES
  ( src.dataset_id, src.dataset_name, src.creation_date, src.created_by, SYSDATE );
MERGE • Issues we faced • Duplicate records in the dataset (MERGE needs one source row per key) • NESTED LOOPS join plans against the external table • Parallelism is not enabled by default • Bulk-load partition determination
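On the first issue: Oracle's MERGE raises ORA-30926 ("unable to get a stable set of rows") when the source contains two rows for one key, so duplicates must be resolved before the upsert. A small sketch of pre-deduplication keeping the newest row per key — the field names are illustrative:

```python
def dedupe(rows, key="dataset_id", latest="creation_date"):
    """Keep only the most recent row per key value."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[latest] > best[k][latest]:
            best[k] = row
    return sorted(best.values(), key=lambda r: r[key])

rows = [
    {"dataset_id": 1, "creation_date": "2002/09/22", "dataset_name": "old"},
    {"dataset_id": 1, "creation_date": "2002/09/23", "dataset_name": "new"},
    {"dataset_id": 2, "creation_date": "2002/09/23", "dataset_name": "other"},
]
clean = dedupe(rows)
```

In practice the same effect can be had in SQL by ranking rows per key in the staging database and keeping only the top-ranked row.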
The Load Process • External Tables • MERGE • Integrated DBA tasks • Reduces workload required by the DBA team • Streamlines the load process • Eliminates human error
Integrated DBA Tasks • Provided by the DBA team • Managed by the DBA team • ETL team does not need special knowledge of table layout
Integrated DBA Tasks • Truncate Partition developer makes call truncate_partition( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and executes alter table TABLE-NAME truncate partition dbi20020930_101;
Integrated DBA Tasks • Analyze Partition developer makes call analyze_partition( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and executes dbms_stats.gather_table_stats(ownname , tabname , partname , cascade , estimate_percent, granularity);
Integrated DBA Tasks • Return Partition Name developer makes call get_partition_name( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and returns the appropriate name of the partition. This is very useful when bulk loading tables.
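The slides show the utility mapping ('TABLE-NAME', key1, key2, …) to a physical partition name such as dbi20020930_101, but not the mapping itself. A hypothetical reconstruction, assuming a convention of prefix + snapshot date + suffix key; the real mapping lived inside the DBA team's utility:

```python
def get_partition_name(table_name, *partition_keys, prefix="dbi"):
    """Hypothetical naming convention: prefix, then keys joined by underscores.

    E.g. keys ('20020930', '101') -> 'dbi20020930_101', matching the
    partition name shown in the Truncate Partition example.
    """
    return prefix + "_".join(str(k) for k in partition_keys)

name = get_partition_name("DATASETS", "20020930", "101")
```

Centralizing this lookup means the ETL team never hard-codes partition names, so the DBA team can change the physical layout without breaking loads.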
Integrated DBA Tasks • Partitioning utilities • Helps to streamline the process • Reduces workload of DBA team • Helps to eliminate the problem of double loads for Snapshot tables and partitions
The Load Process • External Tables • Provides access to flat files outside the database • MERGE • Parallel “upsert” simplifies ETL • Row level transforms can be performed in SQL • Integrated DBA tasks • Reduces workload required by the DBA team • Streamlines the load process • Eliminates human error • Loads are repeatable processes
Summary • Reduction in time to integrate new subject areas • Oracle parallelism scales well • Eliminated unneeded software
Summary • Oracle has delivered on the DW promise • External tables combined with MERGE are a viable alternative to third-party ETL tools • Oracle's built-in ETL features are ready today
Questions & Answers
Reminder – please complete the OracleWorld session survey. Thank you.