Extending Practical Pre-Aggregation for On-Line Analytical Processing

Extending Practical Pre-Aggregation for On-Line Analytical Processing T. B. Pedersen1,2, C. S. Jensen2, and C. E. Dyreson2 1 Center for Health Information Services Kommunedata, www.kmd.dk 2 Nykredit Center for Database Research Department of Computer Science Aalborg University, www.cs.auc.dk/NDB 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Motivation • On-Line Analytical Processing (OLAP) systems are popular. • Based on a dimensional view of data. • Require fast query response times even for large amounts of data. • Pre-aggregation is used to speed up queries. • Full pre-aggregation (all combinations of aggregate levels are materialized) is infeasible. • Takes up too much space (~200 times the size of the raw data). • Takes too long to update when data changes. • Practical (or partial) pre-aggregation techniques • Store only select combinations of aggregates. • Re-use these to compute higher-level results. • Provides a good trade-off between storage space/update time and response time. 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Motivation • Practical pre-aggregation requires well-behaved hierarchies and fact-dimension relationships. • Hierarchies must be • strict • covering • onto • Facts and dimensions must be • many-to-one • have uniform granularity. • Often, these requirements are not satisfied in practice. • We present techniques and algorithms for applying practical pre-aggregation to general, non-summarizable hierarchies and fact-dimension relationships. 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Patient Case Study • Patients are the facts in multidimensional terms. • Each patient has zero or more diagnoses (many-to-many). • The diagnoses may be registered at any level (Low-level, Family, or Group), i.e., the granularity is varying. • A diagnosis may have no children (non-onto) or several parents (non-strict). • Each patient has one address, in a city or a rural area (non-covering). Cities are located in counties. 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Talk Overview • Motivation • Data model context • Hierarchy properties • Hierarchy transformations • Integration in current systems • Conclusion and future work 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Data Model - Schema • Fact type: Patient • Dimension types: Diagnosis and Residence • There are no “measures”, all data are dimensions. • Category types: Low-level Diagnosis, Diagnosis Family, Diagnosis Group, Address, City, County • Top category types: corresponds to ALL of the dimension. • Bottom category types: the lowest level in each dimension • The category types of a dimension type form a lattice. Residence Dimension Type Diagnosis Dimension Type TDiagnosis TResidence Diagnosis Group County Diagnosis Family City LL Diagnosis Address Patient 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Data Model - Instances • Categories: instances of category types, consist of dimension values. • Top categories contain only one “T” value. • Dimensions = categories + partial order on category values, hierarchy may be non-strict. • Facts: instances of fact type with separate identity. • Fact-dimension relations: links facts to dimensions, may map to several values of any granularity in each dimension. • Multidimensional object (MO) = schema + dimension + facts + fact-dimension relations Residence Dimension Diagnosis Dimension T T Cancer PregRel Diabetes Sydney Outback Lung Canc. Ins. Dep. Diab. Diab. Preg. Sydney Ins. Dep. Diab. Preg. 1 Central 2 Rural John Jim Jane 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Hierarchy Properties • Summarizability requires that the dimension hierarchies are covering, onto, and strict. • Non-covering hierarchies occur when links between dimension values “skip” levels. • Non-stricthierarchies occur when one lower-level item has several parents. • Non-onto (into) hierarchies occur when the height of the hierarchy is varying. • These properties cause problems when re-using stored counts of patients to compute new values. Residence Dimension Diagnosis Dimension T T Cancer PregRel Diabetes Sydney Outback 1 0 2? ? 1 Lung Canc. Ins. Dep. Diab. Diab. Preg. Sydney 1 1 1 0 ? Ins. Dep. Diab. Preg. 1 Central 2 Rural 1 No Low-level Diagnosis 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Hierarchy Transformations • The overall task is to transform non-summarizable hierarchies to summarizable hierarchies automatically. • A hierarchy is transformed in three steps: • 1) Transform the hierarchy to be covering. • 2) Transform the result from 1) to be onto. • 3) Transform the result from 2) to be strict. • We give an algorithm for each transformation. Each algorithm assumes that the previous algorithm(s) has been applied. 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Non-covering Hierarchies • The MakeCovering algorithm starts with the bottom category C(Address). • For each parent category P of C (City and County) it looks for parent categories H that are “higher than” P (City < County). • The algorithms finds all the links L=(h,c) from H to C that are not covered by going through P. • For each link L an intermediate value is inserted into P and linked to h and c. • Finally, the algorithm is applied recursively to P (nothing changes in this case). Residence Dimension TResidence T Sydney Outback County H L Sydney CityOutback City P Address C 1 Central 2 Rural 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Non-onto Hierarchies • The MakeOnto algorithm starts with the top category P(Tdiagnosis). • For each child category C of P it finds the values n in P with no children (none are found in the first two calls). • For each childless n in P, the algorithms inserts a placeholder value cn in C and links it to n. • Finally, the algorithm is called recursively on C. Diagnosis Dimension TDiagnosis T P Cancer PregRel Diabetes Diag. Group P C Lung Canc. Ins. Dep. Diab. Diab. Preg. P Diag. Family C n Ins. Dep. Diab. Preg. LL Diagnosis C LLLC 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Non-strict Hierarchies • Schema changes shown first • Idea: “fuse” sets into one value. • The MakeStrict algorithm starts with a child category C. • For each parent category P of C, it looks for non-strictness between C and PifPhas parents. • If non-strictness occurs, a new category N (holding setsof P) is inserted between C and P. • Grandparents G of C are linked to the new category. • “Unsafe” links are removed. • Finally, the algorithm is called recursively on N. TDiagnosis G G G Diag. Group P Set-of-DG P P Diag. Family Set-of-DF C LL Diagnosis C C 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Non-strict Hierarchies • Instance changes shown now. • The MakeStrict algorithm starts with a child category C. • For each parent category P of C, it looks for non-strictness between C and PifPhas parents. • If non-strictness occurs, new “fused” values (setsof P) are inserted into N. • Values in grandparents G of C are linked to the new values. • “Unsafe” links are removed. • Finally, the algorithm is called recursively on N. Diagnosis Dimension T Cancer PregRel Diabetes {PR,D} {C} Lung Canc. Diab. Preg. Ins. Dep. Diab. {DP,IDD} {LC} Ins. Dep. Diab. Preg. LLLC 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Did We Solve The Problem ? • Now we can re-use stored values to correctly compute new higher-level values. Diagnosis Dimension Residence Dimension T 1 T 3 Cancer PregRel Diabetes 1 1 0 Sydney Outback 1 2 {PR,D} {C} 1 0 Diab. Preg. Ins. Dep. Diab. Lung Canc. Sydney CityOutback 1 2 1 1 0 1 Central 2 Rural {DP,IDD} {LC} 1 0 Ins. Dep. Diab. Preg. LLLC 1 0 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Integration in Current Systems • The transformations should be transparent to the user. • Modern OLAP systems have client and server components, communicating using, e.g., OLE DB for OLAP or MD-API. • OLAP queries consist of 80% navigation and 20% aggregation queries [Kimball]. • Integration is achieved using a Query Handler that sends navigation queries to the original DB and aggregation queries to the transformed DB. • Small space overhead (the hierarchies only are stored twice). 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Conclusion and Future Work • We have presented algorithms that automatically transform non-summarizable (non-covering, non-onto, non-strict) hierarchies to become summarizable. • The algorithms have a practical, low complexity and can be applied incrementally to minimize the cost of updates. • The algorithms can be applied to non-summarizable relationships between facts and dimensions. • The presented technique can be implemented in current OLAP systems transparently to the user. • Future work: • Transform only the part of the hierarchy that has been selected for materialization. • Take the aggregate function into account, i.e., MAX and MIN are insensitive to duplicates. 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Previous Work • Much work has focused on selecting the optimal subset of aggregate levels for practical pre-aggregation, given space and/or time constraints. • This work assumes summarizability. Thus, it cannot be applied to general hierarchies and fact-dim. relationships. • One paper has shown how to manually, and not transparently to the user, achieve summarizability for non-covering hierarchies. • We give algorithms and techniques that automatically achieves summarizability for non-covering, non-onto, and non-strict hierarchies, transparently to the user. • The algorithms and techniques can also be applied to non-summarizable relationships between facts and dimensions. 25th VLDB, Edinburgh, Scotland, September 7-10, 1999

Extending Practical Pre-Aggregation for On-Line Analytical Processing