1 / 22

Supporting Imprecision in Multidimensional Databases Using Granularities

Supporting Imprecision in Multidimensional Databases Using Granularities. T. B. Pedersen 1,2 , C. S. Jensen 2 , and C. E. Dyreson 2 1 Center for Health Information Services, Kommunedata, www.kmd.dk 2 Nykredit Center for Database Research, Department of Computer Science,

pete
Download Presentation

Supporting Imprecision in Multidimensional Databases Using Granularities

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Supporting Imprecision in Multidimensional Databases Using Granularities T. B. Pedersen1,2, C. S. Jensen2, and C. E. Dyreson2 1 Center for Health Information Services, Kommunedata, www.kmd.dk 2 Nykredit Center for Database Research, Department of Computer Science, Aalborg University, www.cs.auc.dk/NDB 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  2. Talk Overview • Motivation • Data model and query language context • Handling imprecision • Alternative queries • Imprecision in grouping • Imprecision in computations • Presenting imprecise results • Using pre-aggregated data • Conclusion 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  3. Motivation • Online Analytical Processing (OLAP) tools are increasingly used in many different areas: • Business applications • Medical applications • Other scientific applications • Data “imperfection” is a problem for medical and other applications. • Some data is missing. • Some data has varying degrees of imprecision. • Current OLAP tools assume that data imperfections are handled during the “data cleansing” process. • Not realistic for most cases • Introduces mapping errors • Hides the “true quality” of the data 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  4. Previous Work • Imprecision versus uncertainty • Imprecision is a property of the content of an attribute. • Uncertainty concerns the degree of truth of a statement. • We handle only imprecision. • Our focus is on imprecision in aggregate queries. • Most work on imprecision deals with relational databases • Fuzzy sets - specifies a degree of set membership for a value • Partial values - one of a set of values is the true value • Multiple imputation - substitute multiple values for a missing value • High computational complexity • Only “incomplete datacubes” have treated imprecision in multidimensional databases. • Imprecision fixed at schema level • Imprecision in computations not handled 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  5. A Diabetes Case Study • E/R schema of case study • Patients have a diagnosis. • Diagnoses may be missing. • Diagnoses may be specified at the Low-level Diagnosis or Diagnosis Group level. • HbA1c% (long-term blood sugar level) measured with several methods of varying precision. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  6. Data Model - Schema • Fact type: Patient • Dimension types: Diagnosis and HbA1c% • There are no “measures”, all data are dimensions. • Category types: Low-level Diagnosis, Diagnosis Group, Precise, Imprecise • Top category types: corresponds to ALL of the dimension. • Bottom category types: the lowest level in each dimension • The category types of a dimension type form a lattice. • Category types ~ granularities HbA1c% Dimension Type Diagnosis Dimension Type TDiagnosis THbA1c% Imprecise Diagnosis Group LL Diagnosis Precise Patient 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  7. Data Model - Instances • Categories: instances of category types, consist of dimension values. • Top categories contain only one “T” value. • Dimensions = categories + partial order on category values • Facts: instances of fact type with separate identity. • Fact-dimension relations: links facts to dimensions, may map to values of any granularity. • Multidimensional object (MO) = schema + dimension + facts + fact-dimension relations HbA1c% Dimension Diagnosis Dimension T T 5 6 7 Diabetes ID Diabetes NID Diabetes 5.4 5.5 .. 7.0 Jane Jim John 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  8. Algebraic Query Language • Close to relational algebra with aggregation functions • Selection, projection, rename, union, difference, identity-based join operators • Aggregation operator: takes “grouping categories”, aggregation function and result dimension as arguments. • Groups together facts characterized by the same dimension values, applies aggregation function to groups, and places result in result dimension. • Example: COUNT by Diagnosis Family and THbA1c% Result Dimension Diagnosis Dimension T T 0-2 >2 Diabetes 0 1 2 3 .. {Jim,John,Jane} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  9. Overview • Motivation • Data model and query language context • Handling imprecision • Alternative queries • Imprecision in grouping • Imprecision in computations • Presenting imprecise results • Using pre-aggregated data • Conclusion 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  10. Handling Imprecision • We use the granularity of the data to capture imprecision. • The dimensions specify an imprecision hierarchy. • Data is mapped to the appropriate granularity. • Pseudo-code: • Procedure EvalImprecise(Q,M) if PreciseEnough(Q,M) then Eval(Q,M) else Q’=Alternative(Q,M) if Q’ is ok then Eval(Q’,M) else Handle Imprecision in Grouping Handle Imprecision in Computation Return Imprecise Result end if end if Most Imprecise=T=Unknown More Imprecise . . . More precise Most precise=bottom category 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  11. Alternative Queries • We use a “Precision MO” (PMO) to capture the data granularities. • The precision MO has a granularity dimension for every MO dimension D. • A granularity dimension has two categories: T and GranularityD. • GranularityD contains a value for every category type in D. • The set of facts stays the same. • Facts are mapped to the appropriate GranularityD value. GranHbA1c% Dimension GranDiagnosis Dimension T T DG TD LLD P I TH Jane Jim John 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  12. Alternative Queries • Queries are rewritten into “testing queries” on the PMO that counts the number of facts mapped to different granularity combinations (details in paper). • The result of the testing queries can be used to suggest alternative queries. • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • Testing query: COUNT by GranularityDiagnosis on PMO • Result shows that data is not precise enough to group on Low-level Diagnosis • Alternative query: COUNT by Diagnosis Group Result Dimension GranDiagnosis Dimension T T DG TD LLD 0 1 2 {Jim} {John,Jane} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  13. Overview • Motivation • Data model and query language context • Handling imprecision • Alternative queries • Imprecision in grouping • Imprecision in computations • Presenting imprecise results • Using pre-aggregated data • Conclusion 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  14. Imprecision in Grouping • If facts are mapped to a dimension value of coarser granularity than the grouping category, with which dimension value should they be grouped ? - we do not know ! • So, we return several answers, based on different groupings of facts: • Conservative grouping: only include in a group what is known to belong to it - discard imprecise data. • Liberal grouping: include in a group everything that might belong to it - facts may be in several groups. • Weighted grouping: include everything that might belong, but give more weight to likely members. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  15. Conservative Grouping • Corresponds to the standard aggregation operator. • Imprecise data are simply discarded. • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • Jim does not show in any group. • The count for both groups is 1. • The result is “too conservative” as not all data is accounted for in the result. Result Dimension Diagnosis Dimension T T Diabetes ID Diabetes NID Diabetes 0 1 2 {Jane} {John} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  16. Liberal Grouping • We modify the aggregation operator from the query language to compute the liberal grouping (formal semantics in the paper). • The computed groups may now overlap, i.e., the same fact may be in several groups. • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • Jim ends up up both groups. • The count for both groups is 2. • The result is “too liberal” as the same data may be counted several times in the result. Result Dimension Diagnosis Dimension T T Diabetes ID Diabetes NID Diabetes 0 1 2 {Jim,Jane} {Jim,John} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  17. Weighted Grouping • Compromise between the conservative and liberal groupings: • Weights assigned to the partial order on dimension values and to fact membership in groups (groups become fuzzy sets). • Aggregation operator modified to compute the weighted grouping (see paper). • Example: COUNT by Low-level Diagnosis (and THbA1c%)) on MO • 80% of Diabetes patient have ID Diabetes, 20% NID. • Jim ends up up both groups, but weighted differently. • Result is a weighted COUNT (details on next slide). Result Dimension Diagnosis Dimension T T 1.0 Diabetes .8 .2 ID Diabetes NID Diabetes 0 1.2 1.8 {Jim.2,Jane1.0} {Jim.8,John1.0} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  18. Imprecision in Computation • We also need to handle imprecision in the aggregate computation itself, e.g., the varying precision for HbA1c%. • In computations, we impute the expected value for values of non-bottom granularity (generalized imputation). This allows normal computation of the result. • A precision computation is performed along with the aggregate computation: • A granularity computation measure (GCM) is used to capture the imprecision of a dimension value during computation. • A measure combination function (MCF)is used to combine GCMs. The MCF must be distributive. • A final granularity measure (FGM) represent the “true” imprecision of a result value. • A final granularity function (FGF) maps from a GCM to a FGM. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  19. Imprecision in Computation • Example: WAVG(HbA1c%) by Low-Level Diagnosis • We use FGM=WAVG(Level) as a measure of precision (Precise values have level 0, etc.) • The GCM for a value e with weight w is: (w * Level(e),w) • The MCF is: h((n1, n2),(n3,n4)) = (n1+n3,n2+n4) • The FGF is: f(n1,n2) = n1/n2 • 6.0 & 7.0 imputed for T & 7 in WAVG(HbA1c%) computation • Jim, John, and Jane have levels 2, 0, and 1, respectively • Results of query: {(IDD,5.7,.9), (NIDD,5.6,1.2)} HbA1c% Dimension Diagnosis Dimension T T 6.0 2 1.0 5 6 7 7.0 Diabetes .8 .2 ID Diabetes NID Diabetes 5.4 5.5 .. 7.0 0 1 Jane Jim John 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  20. Presenting Imprecise Results • Several ways to present the imprecise results to the user: • Show result values along with the corresponding final granularity measure values. • Very precise estimate of result precision, but hard to grasp. • Resulting (value,FGM) = {(IDD,5.7,.9),(NIDD,5.6,1.2)} • Map result values into different granularities using a value coarsening function (VCF). • More intuitive result, but less precise estimate of precision. • Example: Weighted grouping with VCF: r(x) = v such that xv and Level(v) = Ceiling(x). Result Dimension Diagnosis Dimension T T 5 6 7 Diabetes ID Diabetes NID Diabetes 5.4 5.5 .. 7.0 {Jim.2,Jane1.0} {Jim.8,John1.0} 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  21. Using Pre-Aggregated Data • The approach can utilize pre-aggregated data effectively. • Aggregate results for the precision MO can usually be fully materialized due to the relatively small multidimensional space (~ 1.000.000 cells). • The aggregate computation with expected values can use standard pre-aggregation techniques, e.g., partial pre-aggregation, for good response-time versus storage/update-time tradeoffs. • The distributive MCF allows for partial pre-aggregation of precision results. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

  22. Conclusion • Current OLAP tools and models do not support the imprecision found in real-world data. • We have shown an approach to handling imprecision in OLAP databases based on the common multidimensional concept of granularities. • The approach can suggest alternative queries when data is not precise enough and handles imprecision both in the grouping of data and in the aggregate computation. • The approach has a low computational overhead and is able to exploit pre-aggregated results effectively. • The approach can be implemented using existing technology such as SQL and OLAP tools. • Future work: presentation of results, MIN/MAX functions, etc. 11th SSDBM, Cleveland, Ohio, July 28-30, 1999

More Related