690 likes | 888 Views
O N- L INE A NALYTICAL P ROCESSING. Objectives. What is OLAP Need for OLAP Features & functions of OLAP Different OLAP models OLAP implementations. Demand for OLAP. To develop DM, three approaches In all approaches, Data Marts rest on Dimensional Model
E N D
Objectives • What is OLAP • Need for OLAP • Features & functions of OLAP • Different OLAP models • OLAP implementations
Demand for OLAP • To develop DM, three approaches • In all approaches, Data Marts rest on Dimensional Model • Data Marts are sufficient for basic data analysis • Users need to go beyond such basic analysis
Demand for OLAP • Need for Multidimensional Analysis • Fast Access & Powerful Calculations • Limitations of other analysis methods like: • SQL • Spreadsheets • Report Writers
Demand for OLAP • Traditional tools of report writers, query products, spreadsheets, & language interfaces do not match the user expectations as far as performing multidimensional analysis with complex calculations is concerned. • Tools used with OLTP and basic DW environments do not match up to the task
OLAP is the Answer! OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into the data through fast, consistent, interactive, access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Why is OLAP useful? • Facilitates multidimensional data analysis by pre-computing aggregates across many sets of dimensions • Provides for: • Greater speed and responsiveness • Improved user interactivity
Data Warehouses • A data warehouse is based on a multidimensional data model which views data in the form of a data cube • A data cube allows data to be modeled and viewed in multiple dimensions • In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
all 0-D(apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier 2-D cuboids time,supplier item,supplier time,location,supplier time,item,location 3-D cuboids item,location,supplier time,item,supplier 4-D(base) cuboid time, item, location, supplier Lattice of Cuboids
day 2 day 1 CUBE Multi-dimensional cube: Fact table view: dimensions = 3
Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE • WHERE date = 1 81
Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE • GROUP BY date
Aggregates • Operators: sum, count, max, min, median, avg • “Having” clause • Using dimension hierarchy • average by region (within store) • maximum by month (within date)
rollup drill-down Cube Aggregation Example: computing sums day 2 . . . day 1 129
Cube Operators day 2 . . . day 1 sale(c1,*,*) 129 sale(c2,p2,*) sale(*,*,*) sale(*,p1,*)
Extended Cube * day 2 sale(*,p2,*) day 1
day 2 day 1 Aggregation Using Hierarchies customer region country (customer c1 in Region A; customers c2, c3 in Region B)
day 2 day 1 Pivoting Fact table view: Multi-dimensional cube:
day 2 day 1 Cube Aggregates Lattice 129 all city product date city, product city, date product, date use greedy algorithm to decide what to materialize city, product, date
Dimension Hierarchies all state city
Dimension Hierarchies all product city date product, date city, product city, date state city, product, date state, date state, product state, product, date not all arcs shown...
Interesting Hierarchy all years weeks quarters conceptual dimension table months days
Date 2Qtr 1Qtr sum 3Qtr 4Qtr TV Product U.S.A PC VCR sum Canada Country Mexico sum All, All, All SAMPLE CUBE Total annual sales of TV in U.S.A. Total annual sales of PC in U.S.A. Total sales In U.S.A Total annual sales of VCR in U.S.A. Total Q1 sales In U.S.A Total sales In Canada Total Q1 sales In Canada Total sales In Mexico Total Q1 sales In Mexico Total Q2 sales In all countries TOTAL SALES Total Q1 sales In all countries
OLAP Operations • Roll-Up • Drill-Down • Slice & Dice • Pivot • Drill-Across • Drill-Through
Other OLAP Operations • Drill-Across: Queries involving more than one fact table • Drill-Through: Makes use of SQL to drill through the bottom level of a data cube down to its back-end relational tables • Pivot (rotate): Pivot (also called "rotate") is a • visualization operation which rotates the data axes in • view in order to provide an alternative presentation of • the data. Other examples include rotating the axes in a • 3-D cube, or transforming a 3-D cube into a series of 2- • D planes.
Other OLAP Operations • Moving Averages • Growth Rates • Depreciation • Currency Conversion • Statistical Functions • Top N or Bottom N queries
Conceptual vs. Actual • The “cube” is a logical way of visualizing the data in an OLAP setting • Not how the data is actually represented on disk • Two ways of storing data: • ROLAP: Relational OLAP • MOLAP: Multidimensional OLAP
OLAP & CUBE • Construction of the data cube is key to the operation of OLAP • The computation process creates a set of aggregates on the various dimensions of the data • The CUBE operator
The CUBE Operator • Proposed by Gray et al* • Effectively involves a series of GROUP-BY operations to aggregate data • Creates power set on all attributes according to: • A measure • An aggregator function *J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,D. Reichart, M. Venkatrao, F. Pellow and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
CUBING Problem • Problem: this generates a lot of data and work (2n sets in total, where n is the number of dimensions) • Solution: optimized algorithms to run faster, consume less memory, and perform fewer I/Os.
Efficient Computation of Data Cubes • ROLAP-based cubing algorithms (Agarwal et al’96) • Array-based cubing algorithm (Zhao et al’97) S. Agarwal, R. Agrawal, P. M. Deshpande, A.Gupta, J. F. Naughton, R. Ramakrishnan and S.Sarawagi. On the computation of multidimensional aggregates. In VLDB'96. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97.
Efficient Computation of Data Cubes • How many cuboids in a cube with 3 dimensions? • Answer: • As many group by operations? • No hierarchies involved!! • π(Li+1), where Li is the number of levels associated with dimension I • 10 dimensions & 4 levels for each dimension • Total Cuboids = 510
Approaches to OLAP Servers • It is all about which DBMS you choose to store your data warehouse data • RDBMS – ROLAP • MDDB – MOLAP • BOTH - HOLAP
Approaches to OLAP Servers Three possibilities for OLAP servers (1) Relational OLAP (ROLAP) • Relational and specialized relational DBMS to store and manage warehouse data • OLAP middleware to support missing pieces (2) Multidimensional OLAP (MOLAP) • Array-based storage structures • Direct access to array data structures (3) Hybrid OLAP (HOLAP) • Storing detailed data in RDBMS • Storing aggregated data in MDBMS • User access via MOLAP tools
ROLAP • Special schema design: star, snowflake • Special indexes: bitmap, multi-table join • Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets • Products • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
ROLAP • Defines complex, multi-dimensional data with simple model • Reduces the number of joins a query has to process • Allows the data warehouse to evolve with relatively low maintenance • Can contain both detailed and summarized data. • ROLAP is based on familiar, proven, and already selected technologies. BUT!!! • SQL for multi-dimensional manipulation of calculations.
MOLAP • MDDB: a special-purpose data model • Facts stored in multi-dimensional arrays • Dimensions used to index array • Sometimes on top of relational DB • Products • Pilot, Arbor Essbase, Gentia
MOLAP • Pre-calculating or pre-consolidating transactional data improves speed. BUT Fully pre-consolidating incoming data, MDDs require an enormous amount of overhead both in processing time and in storage. An input file of 200MB can easily expand to 5GB MDDBs are great candidates for the < 100GB department data marts. • With MDDs, application design is essentially the definition of dimensions and calculation rules, while the RDBMS requires that the database schema be a star or snowflake.
OLAP Needs • User Needs • Multidimensional view • Excellent Performance • Analytical Flexibility • Real-Time Data Access • High Data Capacity • MIS Needs • Leverages Data Warehouse • Easy Development • Low Structure Maintenance • Low Aggregate Maintenance
OLAP Needs: User Needs Multidimensional View • All true OLAP tools, whether they work with a MDDB or an RDBMS, provide a multidimensional view of data. • For example, decision makers may view sales by office, quarter, representative, product, etc. This perspective on data, which mirrors the way business professional think, allows for more intuitive and more powerful analysis.
OLAP Needs: User Needs Excellent Performance • The performance of your decision support tool directly depends on the way it manages aggregates. • RDBMS • Calculate aggregates on fly (response time suffers) • DBA creates summary tables to store aggregates (enormous amount of disk space)
OLAP Needs: User Needs Excellent Performance • For example, suppose you have a Sales indicator with six dimensions—Representatives, Products, Customers, Regions, Months, and Years. • MOLAP tools will store a given aggregate, such as the November 1997 government sales of product A504 by representative 1040 in New York, in 1 cell of the MDDB. • In contrast, ROLAP tools consume 600% more space, because they require a record of seven values—six foreign keys and the actual aggregate—in a relational summary table.
OLAP Needs: User Needs Excellent Performance
OLAP Needs: User Needs Excellent Performance RDBMSs must use several summary tables to store the aggregates that a MOLAP could store in just one cube. For example, consider a Sales indicator with three dimensions: Months, Regions, and Products. The indicator cube will contain seven sets of aggregates: • Sales by month • Sales by product • Sales by region • Sales by month and product • Sales by month and region • Sales by product and region • Sales by product, month, and region To store these aggregates in an RDBMS, you’d have to create seven summary tables, one for each aggregate set. HOW MANY SUMMARY TABLES FOR 6 DIMENSIONS? (Separate fact table and shrunken dimension table approach for storing aggregates)