580 likes | 839 Views
ADBMS. Data Warehousing OLAP Technology. What is Data Warehouse?. Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization’s operational database
E N D
ADBMS Data Warehousing OLAP Technology
What is Data Warehouse? • Defined in many different ways, but not rigorously. • A decision support database that is maintained separately from the organization’s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis. • “A data warehouse is asubject-oriented, integrated, time-variant, and nonvolatilecollection of data in support of management’s decision-making process.”—W. H. Inmon • Data warehousing: • The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented • Organized around major subjects, such as customer, product, sales. • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Integrated • Constructed by integrating multiple, heterogeneous data sources • relational databases, flat files, on-line transaction records • Data cleaning and data integration techniques are applied. • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. • When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems. • Operational database: current value data. • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) • Every key structure in the data warehouse • Contains an element of time, explicitly or implicitly • But the key of operational data may or may not contain “time element”.
Data Warehouse—Non-Volatile • A physically separate store of data transformed from the operational environment. • Operational update of data does not occur in the data warehouse environment. • Does not require transaction processing, recovery, and concurrency control mechanisms • Requires only two operations in data accessing: • initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS • Traditional heterogeneous DB integration: • Build wrappers/mediators on top of heterogeneous databases • Query driven approach • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set • Complex information filtering, compete for resources • Data warehouse: update-driven, high performance • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS • OLTP (on-line transaction processing) • Major task of traditional relational DBMS • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. • OLAP (on-line analytical processing) • Major task of data warehouse system • Data analysis and decision making • Distinct features (OLTP vs. OLAP): • User and system orientation: customer vs. market • Data contents: current, detailed vs. historical, consolidated • Database design: ER + application vs. star + subject • View: current, local vs. evolutionary, integrated • Access patterns: update vs. read-only but complex queries
Objectives • What is OLAP • Need for OLAP • Features & functions of OLAP • Different OLAP models • OLAP implementations
Demand for OLAP • To develop DM, three approaches • In all approaches, Data Marts rest on Dimensional Model • Data Marts are sufficient for basic data analysis • Users need to go beyond such basic analysis
Demand for OLAP • Need for Multidimensional Analysis • Fast Access & Powerful Calculations • Limitations of other analysis methods like: • SQL • Spreadsheets • Report Writers
Demand for OLAP • Traditional tools of report writers, query products, spreadsheets, & language interfaces do not match the user expectations as far as performing multidimensional analysis with complex calculations is concerned. • Tools used with OLTP and basic DW environments do not match up to the task
OLAP is the Answer! OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into the data through fast, consistent, interactive, access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Why is OLAP useful? • Facilitates multidimensional data analysis by pre-computing aggregates across many sets of dimensions • Provides for: • Greater speed and responsiveness • Improved user interactivity
Data Warehouses • A data warehouse is based on a multidimensional data model which views data in the form of a data cube • A data cube allows data to be modeled and viewed in multiple dimensions • In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
all 0-D(apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier 2-D cuboids time,supplier item,supplier time,location,supplier time,item,location 3-D cuboids item,location,supplier time,item,supplier 4-D(base) cuboid time, item, location, supplier Lattice of Cuboids
day 2 day 1 CUBE Multi-dimensional cube: Fact table view: dimensions = 3
Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE • WHERE date = 1 81
Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE • GROUP BY date
Aggregates • Operators: sum, count, max, min, median, avg • “Having” clause • Using dimension hierarchy • average by region (within store) • maximum by month (within date)
rollup drill-down Cube Aggregation Example: computing sums day 2 . . . day 1 129
Cube Operators day 2 . . . day 1 sale(c1,*,*) 129 sale(c2,p2,*) sale(*,*,*) sale(*,p1,*)
Extended Cube * day 2 sale(*,p2,*) day 1
day 2 day 1 Aggregation Using Hierarchies customer region country (customer c1 in Region A; customers c2, c3 in Region B)
day 2 day 1 Pivoting Fact table view: Multi-dimensional cube:
day 2 day 1 Cube Aggregates Lattice 129 all city product date city, product city, date product, date use greedy algorithm to decide what to materialize city, product, date
Dimension Hierarchies all state city
Dimension Hierarchies all product city date product, date city, product city, date state city, product, date state, date state, product state, product, date not all arcs shown...
Interesting Hierarchy all years weeks quarters conceptual dimension table months days
Date 2Qtr 1Qtr sum 3Qtr 4Qtr TV Product U.S.A PC VCR sum Canada Country Mexico sum All, All, All SAMPLE CUBE Total annual sales of TV in U.S.A. Total annual sales of PC in U.S.A. Total sales In U.S.A Total annual sales of VCR in U.S.A. Total Q1 sales In U.S.A Total sales In Canada Total Q1 sales In Canada Total sales In Mexico Total Q1 sales In Mexico Total Q2 sales In all countries TOTAL SALES Total Q1 sales In all countries
OLAP Operations • Roll-Up • Drill-Down • Slice & Dice • Pivot • Drill-Across • Drill-Through
Other OLAP Operations • Drill-Across: Queries involving more than one fact table • Drill-Through: Makes use of SQL to drill through the bottom level of a data cube down to its back-end relational tables • Pivot (rotate): Pivot (also called "rotate") is a • visualization operation which rotates the data axes in • view in order to provide an alternative presentation of • the data. Other examples include rotating the axes in a • 3-D cube, or transforming a 3-D cube into a series of 2- • D planes.
Other OLAP Operations • Moving Averages • Growth Rates • Depreciation • Currency Conversion • Statistical Functions • Top N or Bottom N queries
Conceptual vs. Actual • The “cube” is a logical way of visualizing the data in an OLAP setting • Not how the data is actually represented on disk • Two ways of storing data: • ROLAP: Relational OLAP • MOLAP: Multidimensional OLAP
OLAP & CUBE • Construction of the data cube is key to the operation of OLAP • The computation process creates a set of aggregates on the various dimensions of the data • The CUBE operator
The CUBE Operator • Proposed by Gray et al* • Effectively involves a series of GROUP-BY operations to aggregate data • Creates power set on all attributes according to: • A measure • An aggregator function *J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,D. Reichart, M. Venkatrao, F. Pellow and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
CUBING Problem • Problem: this generates a lot of data and work (2n sets in total, where n is the number of dimensions) • Solution: optimized algorithms to run faster, consume less memory, and perform fewer I/Os.
Efficient Computation of Data Cubes • ROLAP-based cubing algorithms (Agarwal et al’96) • Array-based cubing algorithm (Zhao et al’97) S. Agarwal, R. Agrawal, P. M. Deshpande, A.Gupta, J. F. Naughton, R. Ramakrishnan and S.Sarawagi. On the computation of multidimensional aggregates. In VLDB'96. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97.
Efficient Computation of Data Cubes • How many cuboids in a cube with 3 dimensions? • Answer: • As many group by operations? • No hierarchies involved!! • π(Li+1), where Li is the number of levels associated with dimension I • 10 dimensions & 4 levels for each dimension • Total Cuboids = 510
Approaches to OLAP Servers • It is all about which DBMS you choose to store your data warehouse data • RDBMS – ROLAP • MDDB – MOLAP • BOTH - HOLAP
Approaches to OLAP Servers Three possibilities for OLAP servers (1) Relational OLAP (ROLAP) • Relational and specialized relational DBMS to store and manage warehouse data • OLAP middleware to support missing pieces (2) Multidimensional OLAP (MOLAP) • Array-based storage structures • Direct access to array data structures (3) Hybrid OLAP (HOLAP) • Storing detailed data in RDBMS • Storing aggregated data in MDBMS • User access via MOLAP tools