Using Taxonomies to Perform Aggregated Querying over Imprecise Data
Atanu Roy, Chandrima Sarkar, Rafal A. Angryk
Presented by: Rafal A. Angryk
Date: 2010-12-14
Outline of the Presentation
• Idea
• Imprecision
• Motivation
• Limitations of Previous Work
• Definitions
• Approach
• Experimental Setup & Results
• Conclusion and Future Work
Roy, Sarkar, Angryk. Using Taxonomies to Perform Aggregated Querying over Imprecise Data
Idea of the Project
• This paper provides a framework for answering queries over imprecise data found in common databases.
• We propose to solve this problem by classifying the data into taxonomical hierarchies and then capturing them in a weighted hierarchical hypergraph.
Imprecision in Databases: An Example
Constraint: All soybean seeds with the same kind of stem canker should germinate in the same month of the season.
Motivation
• Several recent papers have focused on the retrieval of imprecise data, where every fact can be a region, instead of a point, in a multi-dimensional space.
• The most prominent one is [BDRV07].
• Its authors solve the problem by constructing marginal databases (MDBs) from extended databases (EDBs) with the help of a constraint hypergraph.
Limitations of Previous Work
• Creating marginal databases using a weighted hierarchical hypergraph employs a brute-force method for retrieving connected facts (tuples).
• This increases the overall time complexity and the processing time of the queries.
• [BDRV07] follows a data-specific technique, whereas we propose to use domain-specific knowledge.
Definitions
• Background knowledge: Knowledge required to generate taxonomies.
• Expert knowledge: Domain-specific human expertise.
• Data-derived knowledge: Derived from a historic precise database and used to generate mutually exclusive probabilities.
• Possible worlds: All the possible combinations that an imprecise record can assume.
• Valid worlds: All the possible worlds that satisfy a given set of constraints.
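The distinction between possible and valid worlds can be illustrated with a minimal sketch. The taxonomy, the attribute names, and the constraint below are hypothetical (loosely modeled on the soybean example) and are not taken from the paper.

```python
from itertools import product

# Hypothetical taxonomy: each imprecise (non-leaf) value maps to the
# precise leaf values it may stand for.
TAXONOMY = {
    "summer": ["june", "july", "august"],
    "canker-present": ["below-soil", "above-sec-node"],
}

def leaves(value):
    """Return the precise values covered by a (possibly imprecise) value."""
    return TAXONOMY.get(value, [value])

def possible_worlds(record):
    """Enumerate every combination of precise values the record can assume."""
    names = list(record)
    choices = [leaves(record[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]

def valid_worlds(worlds, constraints):
    """Keep only the possible worlds that satisfy every constraint."""
    return [w for w in worlds if all(check(w) for check in constraints)]

# Hypothetical constraint: below-soil canker is never observed in August.
def no_below_soil_in_august(world):
    return not (world["month"] == "august"
                and world["stem_canker"] == "below-soil")

record = {"month": "summer", "stem_canker": "canker-present"}
worlds = possible_worlds(record)
valid = valid_worlds(worlds, [no_below_soil_in_august])
print(len(worlds), len(valid))  # 6 5
```

A precise record has exactly one possible world, since `leaves` returns the value itself when it is not an inner node of the taxonomy.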
Assignment of Probabilities
EDB Creation
• The probability of a possible world is the product of the unconditional occurrences of all imprecise attributes.
• The sum of the probabilities of all possible worlds of an imprecise record is 1.
• Probability assignment rule creates a set of tuples using
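The first two rules above can be sketched as follows. The leaf probabilities are hypothetical stand-ins for data-derived knowledge, and `extend_record` is an illustrative helper name, not the paper's probability assignment rule.

```python
from itertools import product
from math import isclose, prod

# Hypothetical data-derived probabilities of the precise values that one
# imprecise record may assume; each per-attribute distribution sums to 1.
LEAF_PROBS = {
    "month": {"june": 0.2, "july": 0.3, "august": 0.5},
    "stem_canker": {"below-soil": 0.4, "above-sec-node": 0.6},
}

def extend_record(leaf_probs):
    """Expand one imprecise record into EDB tuples (world, probability)."""
    names = list(leaf_probs)
    edb_tuples = []
    for combo in product(*(leaf_probs[name].items() for name in names)):
        world = {name: value for name, (value, _) in zip(names, combo)}
        # Probability of a world: product over its imprecise attributes.
        probability = prod(p for _, p in combo)
        edb_tuples.append((world, probability))
    return edb_tuples

edb = extend_record(LEAF_PROBS)
total = sum(p for _, p in edb)
print(len(edb), isclose(total, 1.0))  # 6 True
```

Because every per-attribute distribution sums to 1, the products over all combinations also sum to 1, which is the second rule on the slide.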
Hyperedge Creation
MDB Creation
• A weighted hierarchical hypergraph is defined as H(L, E), where L represents the nodes and E is the set of hyperedges between different taxonomies.
• Each hyperedge signifies a distinct combination of attribute values. The weight of a possible world assigned to a hyperedge [AC10] needs to preserve a few properties.
• All t-norms [AC10] (e.g. minimum, product) fulfill these requirements. We choose the product for the purposes of our preliminary investigation.
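A minimal sketch of weighting a hyperedge with a t-norm is given below. It assumes that the weight of a hyperedge is obtained by combining the probabilities of the taxonomy nodes it joins; the node values and the function names are hypothetical.

```python
from functools import reduce

def t_product(a, b):
    """Product t-norm (the one chosen in the presentation)."""
    return a * b

def t_minimum(a, b):
    """Minimum t-norm, shown for comparison."""
    return min(a, b)

def hyperedge_weight(node_weights, t_norm=t_product):
    """Fold a t-norm over the weights of the nodes joined by a hyperedge.

    1.0 is the neutral element of every t-norm, so it is a safe start value.
    """
    return reduce(t_norm, node_weights, 1.0)

# Hypothetical nodes L: (taxonomy, value) pairs with assumed probabilities.
L = {
    ("month", "august"): 0.5,
    ("stem_canker", "above-sec-node"): 0.6,
}

# One hyperedge connects one node from each taxonomy.
edge = frozenset(L)
E = {edge: hyperedge_weight([L[node] for node in edge])}

print(E[edge])
print(hyperedge_weight([L[node] for node in edge], t_minimum))
```

Both functions are commutative and associative, so the iteration order of the node set does not affect the resulting weight.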
EDB MDB
Aggregated Querying
• We aggregate tuples for aggregated querying based on their uniqueness.
• Two tuples are grouped only when all their attribute values and the corresponding probabilities are the same.
• Example query: find the total number of plants grown in August that have stem canker above-sec-node.
• (44*0.9057) + (25*0.6429) ≈ 56
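The arithmetic of the example query can be reproduced with a short sketch. The counts and probabilities of the two matching tuples come from the slide; the row layout, the third (non-matching) row, and the helper name `expected_count` are assumptions made for illustration.

```python
# Each row is a group of identical tuples: its size ("count") and the
# probability shared by every tuple in the group.
mdb = [
    {"month": "august", "stem_canker": "above-sec-node",
     "count": 44, "probability": 0.9057},
    {"month": "august", "stem_canker": "above-sec-node",
     "count": 25, "probability": 0.6429},
    # Hypothetical row that does not match the query.
    {"month": "july", "stem_canker": "above-sec-node",
     "count": 10, "probability": 0.5},
]

def expected_count(rows, **conditions):
    """Sum count * probability over the rows matching all conditions."""
    return sum(
        row["count"] * row["probability"]
        for row in rows
        if all(row[key] == value for key, value in conditions.items())
    )

total = expected_count(mdb, month="august", stem_canker="above-sec-node")
print(round(total))  # 56
```

The unrounded sum is 39.8508 + 16.0725 = 55.9233, which the slide reports as ≈ 56.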
Experimental Setup
• Census-Income dataset from the UCI Machine Learning Repository.
• 7 dimensions were used in the final experiments.
• The precise database has 191239 records.
• The test dataset has 99762 records.
• Imprecision was randomly inserted into the test dataset to make it imprecise.
Distribution of Imprecision
Imprecision Characteristics
Scalability Test
Extended Database Analysis
Influence of Imprecision
Absolute Percentage Error
Conclusion and Future Work
• In this research we present a framework for efficient querying over imprecise data, with an average accuracy of ≈ 94%.
• We intend to extend this research to use ontologies in place of taxonomies.
• We also intend to use Associative Weight Mining to assign weights to hyperedges.
Questions?
References
• [BDRV07]: Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50
• [AC10]: Rafal A. Angryk, Jacek Czerniak: Heuristic Algorithm for Interpretation of Multi-Valued Attributes in Similarity-based Fuzzy Relational Databases. International Journal of Approximate Reasoning 51: 895-911 (2010)