Supporting Imprecision in Multidimensional Databases Using Granularities

Supporting Imprecision in Multidimensional Databases Using Granularities T. B. Pedersen1,2, C. S. Jensen2, and C. E. Dyreson2 1 Center for Health Information Services, Kommunedata, www.kmd.dk 2 Nykredit Center for Database Research, Department of Computer Science, Aalborg University, www.cs.auc.dk/NDB 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Talk Overview • Motivation • Data model and query language context • Handling imprecision • Alternative queries • Imprecision in grouping • Imprecision in computations • Presenting imprecise results • Using pre-aggregated data • Conclusion 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Motivation • Online Analytical Processing (OLAP) tools are increasingly used in many different areas: • Business applications • Medical applications • Other scientific applications • Data “imperfection” is a problem for medical and other applications. • Some data is missing. • Some data has varying degrees of imprecision. • Current OLAP tools assume that data imperfections are handled during the “data cleansing” process. • Not realistic for most cases • Introduces mapping errors • Hides the “true quality” of the data 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Previous Work • Imprecision versus uncertainty • Imprecision is a property of the content of an attribute. • Uncertainty concerns the degree of truth of a statement. • We handle only imprecision. • Our focus is on imprecision in aggregate queries. • Most work on imprecision deals with relational databases • Fuzzy sets - specifies a degree of set membership for a value • Partial values - one of a set of values is the true value • Multiple imputation - substitute multiple values for a missing value • High computational complexity • Only “incomplete datacubes” have treated imprecision in multidimensional databases. • Imprecision fixed at schema level • Imprecision in computations not handled 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

A Diabetes Case Study • E/R schema of case study • Patients have a diagnosis. • Diagnoses may be missing. • Diagnoses may be specified at the Low-level Diagnosis or Diagnosis Group level. • HbA1c% (long-term blood sugar level) measured with several methods of varying precision. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Data Model - Schema • Fact type: Patient • Dimension types: Diagnosis and HbA1c% • There are no “measures”, all data are dimensions. • Category types: Low-level Diagnosis, Diagnosis Group, Precise, Imprecise • Top category types: corresponds to ALL of the dimension. • Bottom category types: the lowest level in each dimension • The category types of a dimension type form a lattice. • Category types ~ granularities HbA1c% Dimension Type Diagnosis Dimension Type TDiagnosis THbA1c% Imprecise Diagnosis Group LL Diagnosis Precise Patient 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Data Model - Instances • Categories: instances of category types, consist of dimension values. • Top categories contain only one “T” value. • Dimensions = categories + partial order on category values • Facts: instances of fact type with separate identity. • Fact-dimension relations: links facts to dimensions, may map to values of any granularity. • Multidimensional object (MO) = schema + dimension + facts + fact-dimension relations HbA1c% Dimension Diagnosis Dimension T T 5 6 7 Diabetes ID Diabetes NID Diabetes 5.4 5.5 .. 7.0 Jane Jim John 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Algebraic Query Language • Close to relational algebra with aggregation functions • Selection, projection, rename, union, difference, identity-based join operators • Aggregation operator: takes “grouping categories”, aggregation function and result dimension as arguments. • Groups together facts characterized by the same dimension values, applies aggregation function to groups, and places result in result dimension. • Example: COUNT by Diagnosis Family and THbA1c% Result Dimension Diagnosis Dimension T T 0-2 >2 Diabetes 0 1 2 3 .. {Jim,John,Jane} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Overview • Motivation • Data model and query language context • Handling imprecision • Alternative queries • Imprecision in grouping • Imprecision in computations • Presenting imprecise results • Using pre-aggregated data • Conclusion 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Handling Imprecision • We use the granularity of the data to capture imprecision. • The dimensions specify an imprecision hierarchy. • Data is mapped to the appropriate granularity. • Pseudo-code: • Procedure EvalImprecise(Q,M) if PreciseEnough(Q,M) then Eval(Q,M) else Q’=Alternative(Q,M) if Q’ is ok then Eval(Q’,M) else Handle Imprecision in Grouping Handle Imprecision in Computation Return Imprecise Result end if end if Most Imprecise=T=Unknown More Imprecise . . . More precise Most precise=bottom category 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Alternative Queries • We use a “Precision MO” (PMO) to capture the data granularities. • The precision MO has a granularity dimension for every MO dimension D. • A granularity dimension has two categories: T and GranularityD. • GranularityD contains a value for every category type in D. • The set of facts stays the same. • Facts are mapped to the appropriate GranularityD value. GranHbA1c% Dimension GranDiagnosis Dimension T T DG TD LLD P I TH Jane Jim John 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Alternative Queries • Queries are rewritten into “testing queries” on the PMO that counts the number of facts mapped to different granularity combinations (details in paper). • The result of the testing queries can be used to suggest alternative queries. • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • Testing query: COUNT by GranularityDiagnosis on PMO • Result shows that data is not precise enough to group on Low-level Diagnosis • Alternative query: COUNT by Diagnosis Group Result Dimension GranDiagnosis Dimension T T DG TD LLD 0 1 2 {Jim} {John,Jane} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Overview • Motivation • Data model and query language context • Handling imprecision • Alternative queries • Imprecision in grouping • Imprecision in computations • Presenting imprecise results • Using pre-aggregated data • Conclusion 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Imprecision in Grouping • If facts are mapped to a dimension value of coarser granularity than the grouping category, with which dimension value should they be grouped ? - we do not know ! • So, we return several answers, based on different groupings of facts: • Conservative grouping: only include in a group what is known to belong to it - discard imprecise data. • Liberal grouping: include in a group everything that might belong to it - facts may be in several groups. • Weighted grouping: include everything that might belong, but give more weight to likely members. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Conservative Grouping • Corresponds to the standard aggregation operator. • Imprecise data are simply discarded. • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • Jim does not show in any group. • The count for both groups is 1. • The result is “too conservative” as not all data is accounted for in the result. Result Dimension Diagnosis Dimension T T Diabetes ID Diabetes NID Diabetes 0 1 2 {Jane} {John} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Liberal Grouping • We modify the aggregation operator from the query language to compute the liberal grouping (formal semantics in the paper). • The computed groups may now overlap, i.e., the same fact may be in several groups. • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • Jim ends up up both groups. • The count for both groups is 2. • The result is “too liberal” as the same data may be counted several times in the result. Result Dimension Diagnosis Dimension T T Diabetes ID Diabetes NID Diabetes 0 1 2 {Jim,Jane} {Jim,John} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Weighted Grouping • Compromise between the conservative and liberal groupings: • Weights assigned to the partial order on dimension values and to fact membership in groups (groups become fuzzy sets). • Aggregation operator modified to compute the weighted grouping (see paper). • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • 80% of Diabetes patient have ID Diabetes, 20% NID. • Jim ends up up both groups, but weighted differently. • Result is a weighted COUNT (details on next slide). Result Dimension Diagnosis Dimension T T 1.0 Diabetes .8 .2 ID Diabetes NID Diabetes 0 1.2 1.8 {Jim.2,Jane1.0} {Jim.8,John1.0} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Imprecision in Computation • We also need to handle imprecision in the aggregate computation itself, e.g., the varying precision for HbA1c%. • In computations, we impute the expected value for values of non-bottom granularity (generalized imputation). This allows normal computation of the result. • A precision computation is performed along with the aggregate computation: • A granularity computation measure (GCM) is used to capture the imprecision of a dimension value during computation. • A measure combination function (MCF)is used to combine GCMs. The MCF must be distributive. • A final granularity measure (FGM) represent the “true” imprecision of a result value. • A final granularity function (FGF) maps from a GCM to a FGM. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Imprecision in Computation • Example: WAVG(HbA1c%) by Low-Level Diagnosis • We use FGM=WAVG(Level) as a measure of precision (Precise values have level 0, etc.) • The GCM for a value e with weight w is: (w * Level(e),w) • The MCF is: h((n1, n2),(n3,n4)) = (n1+n3,n2+n4) • The FGF is: f(n1,n2) = n1/n2 • 6.0 & 7.0 imputed for T & 7 in WAVG(HbA1c%) computation • Jim, John, and Jane have levels 2, 0, and 1, respectively • Results of query: {(IDD,5.7,.9), (NIDD,5.6,1.2)} HbA1c% Dimension Diagnosis Dimension T T 6.0 2 1.0 5 6 7 7.0 Diabetes .8 .2 ID Diabetes NID Diabetes 5.4 5.5 .. 7.0 0 1 Jane Jim John 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Presenting Imprecise Results • Several ways to present the imprecise results to the user: • Show result values along with the corresponding final granularity measure values. • Very precise estimate of result precision, but hard to grasp. • Resulting (value,FGM) = {(IDD,5.7,.9),(NIDD,5.6,1.2)} • Map result values into different granularities using a value coarsening function (VCF). • More intuitive result, but less precise estimate of precision. • Example: Weighted grouping with VCF: r(x) = v such that xv and Level(v) = Ceiling(x). Result Dimension Diagnosis Dimension T T 5 6 7 Diabetes ID Diabetes NID Diabetes 5.4 5.5 .. 7.0 {Jim.2,Jane1.0} {Jim.8,John1.0} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Using Pre-Aggregated Data • The approach can utilize pre-aggregated data effectively. • Aggregate results for the precision MO can usually be fully materialized due to the relatively small multidimensional space (~ 1.000.000 cells). • The aggregate computation with expected values can use standard pre-aggregation techniques, e.g., partial pre-aggregation, for good response-time versus storage/update-time tradeoffs. • The distributive MCF allows for partial pre-aggregation of precision results. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Conclusion • Current OLAP tools and models do not support the imprecision found in real-world data. • We have shown an approach to handling imprecision in OLAP databases based on the common multidimensional concept of granularities. • The approach can suggest alternative queries when data is not precise enough and handles imprecision both in the grouping of data and in the aggregate computation. • The approach has a low computational overhead and is able to exploit pre-aggregated results effectively. • The approach can be implemented using existing technology such as SQL and OLAP tools. • Future work: presentation of results, MIN/MAX functions, etc. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Supporting Imprecision in Multidimensional Databases Using Granularities