180 likes | 315 Views
An overview of Data Warehousing and OLAP Technology. Original slides by: Manish Desai Modified and presented by Alice Leung. Introduction. Essential elements of decision support Enables The Knowledge Worker to make better and faster decisions Used in many industries like:
E N D
An overview of Data Warehousing and OLAP Technology Original slides by: Manish Desai Modified and presented by Alice Leung
Introduction • Essential elements of decision support • Enables The Knowledge Worker to make better and faster decisions • Used in many industries like: • Manufacturing (for order shipment) • Retail (for inventory management) • Financial Services (claims and risk analysis) • Every major database vendor offers product in this area
What is Data Warehouse ? • A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making” • Typically maintained separately from operational databases
Explanation of definition • Subject-Oriented: • Designed around subject such as customer, vendor, product and activity • Does not includes data that are not needed for Decision support system (DSS) • Integrated: • Most important feature • Consistent naming convention, measurement of variables and so forth • The data should be stored in single globally acceptable fashion
Explanation (continues…) • Time Varying: • All data in the warehouse should be accurate as of some moment in time • Data stored over a long time horizon (5 –10 years) • Key structure contains element of time (implicitly or explicitly) • Data once correctly recorded cant be updated • Non Volatile: • No Update of data allowed • only loading and access of data operations
Why Separate Data Warehouse? High performance for both systems DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation Different functions and different data: missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled August 6, 2014 Data Mining: Concepts and Techniques 6
Tier2: OLAP Server External Sources OLAP Server Operational Databases Extract Transform Load Refresh Serve Data Sources OLAP Engine TieredArchitecture Tier3: Clients Tier1: Data Warehouse Server Analysis Query/Reports Data mining Data Warehouse Data Marts Data Storage Front-End Tools
Architecture (continues…) • Distributed Data warehouse • Load balancing, scalability,higher availability • Meta data replicated and centrally administrated • Too expansive • Data marts • Departmental subset focused on selected subjects • example: marketing department includes customer, sales and product tables • Has own repository and administration • May lead to complex integration problems if not designed properly
Back end tools and Utilities • Data cleaning, loading, refreshing tools • Cleaning • Multiple source, possibility of errors • Example: replace string sex by gender • Loading • Building indices, sorting and making access paths • Large amount of data • Incremental loading • Only updated tuples are inserted ,Process hard to manage • Refresh • Propagating updates • When to refresh ? • Set by administrator depending on user needs and traffic
Conceptual Model and front end tools • Multi dimensional view • Dimensions together uniquely determine the measure • Example: Sales can be represented as city,product, data • Each dimension is described by set of attribute • Example: product consist of • Category of product • Industry of product • Year of introduction • Front end tools • Multi dimensional spreadsheet • Supports Pivoting-reorientation • Roll_up - summarized data • Drill_down - go from high level to low level summary
Conceptual Model Total annual sales of TV in U.S.A. Date 2 1 sum 3 4 TV Product U.S.A PC PVR sum Canada Country Mexico sum ALL
Database design • Two ways to represent Multi dimensional model • Star schema • Database consist of single fact table and single table for each dimension • Each tuples in fact table consist of pointer to each of dimension • Snowflake schema • Refinement over star schema • Dimensional hierarchy is explicitly represented by normalizing dimension tables
Time T_key T_day T_day_week T_month T_quarter T_year Sales Fact Table item I_key I_name I_brand I_type I_supplier_type Time_key Branch Item_key B_key B_name B_type Branch_key Location_key location Units_sold location_key street city province country Measures Dollars_sold Avg_sales Star Schema Star Schema
Item I_key I_name I_brand I_type I_supplier_type time_key Supplier Branch item_key S_key S_type B_key B_name B_type branch_key location_key Location units_sold location_key street city Measures dollars_sold City avg_sales C_key C_city C_province C_country Snowflake Schema Time T_key T_day T_day_week T_month T_quarter T_year Sales Fact Table Snowflake Schema
Warehouse Servers • Specialized SQL servers • Provides advanced query language and query processing support for SQL queries over star and snowflake schemas • Example: Redbrick • ROLAP • Between relational back end and client front end tools • Extend traditional relational servers to support multidimensional queries • Example: Microstratergy • MOLAP • Multidimensional storage engine • Direct mapping • Example: Essbase from Arbor Inc.
Index structures • Bit map indices • Use single bit to indicate specific value of attribute • Example: instead of storing eight characters to record “engineer” as skill of employee use single bit id# Name Skill 1000 John 1 • Join indices • Maintains the relationship between foreign key with its matching primary keys
Meta data and warehouse management • Its data about data • Used for building, maintain, managing and using data warehouse • Administrative meta data • Information about setting up and using warehouse • Business meta data • Business terms and definition • Operational meta data • Information collected during operation of warehouse