DOLAP 2004
A New OLAP Aggregation Based on the AHC Technique
R. Ben Messaoud, O. Boussaid, S. Rabaséda
Laboratoire ERIC, Université de Lyon 2
5, avenue Pierre-Mendès-France, 69676 Bron Cedex, France
http://eric.univ-lyon2.fr
Complex data
• Definition: data are considered complex if they are:
  • Multi-format: information can be carried by different kinds of data (numeric, symbolic, texts, images, sounds, videos …)
  • Multi-structure: structured, unstructured or semi-structured (relational databases, XML documents …)
  • Multi-source: data come from different sources (distributed databases, web …)
  • Multi-modal: the same information can be described differently (data in different languages …)
  • Multi-version: data are updated over time (temporal databases, periodical inventory …)
Ben Messaoud et al.
General context
[Diagram labels: Complex data, MDBMS, OLAP, Data mining, OpAC]
• Complex data
  • Huge volumes of complex data
  • Warehousing complex data …
  • OLAP facts as complex objects
• Analyzing complex data
  • Current OLAP tools are not suited to processing complex data
  • Data mining can process complex data such as images, texts, videos …
• Coupling OLAP and data mining
  • Analyze complex data on-line
  • New operator, OpAC: Operator of Aggregation by Clustering (AHC)
Outline
0. Complex data and general context
1. Related work: Coupling OLAP and data mining
2. Objectives of the proposed operator
3. Formalization of the operator
4. Implementation and demonstration
5. Conclusion and future works
Related work
[Diagram labels: First approach, Second approach, Third approach, OLAP, Data mining, DBMS]
• Three approaches for coupling OLAP and data mining
  • First approach: extending the query languages of decision support systems
  • Second approach: adapting the multidimensional environment to classical data mining techniques
  • Third approach: adapting data mining methods to multidimensional data
Related work
[Diagram labels: OLAP, Data mining, OpAC]
• These works showed that:
  • Associating data mining with OLAP is a promising way to support rich analysis tasks
  • Data mining can extend the analysis power of OLAP
• Use data mining to enhance OLAP tools so that they can process complex data
• OpAC: a new OLAP operator based on a data mining technique
Objectives
Classic OLAP aggregation vs. OpAC aggregation
• Classic OLAP:
  • Summarizes numerical data into fewer values
  • Computes additive measures (Sum, Average, Max, Min …)
Example: Sales cube

State level (rolled up):
  State            Sales    Count
  Washington       $2520    120
  California       $2410    129

City level (drilled down):
  State            City             Sales    Count
  Washington       Bellingham       $700     32
  Washington       Bremerton        $400     20
  Washington       Olympia          $850     44
  Washington       Redmond          $250     9
  Washington       Seattle          $320     15
  California       Berkeley         $820     41
  California       Beverly Hills    $910     50
  California       Los Angeles      $680     38
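
The roll-up in the Sales cube example can be reproduced with a few lines of Python. This is only a minimal sketch: the figures are copied from the slide, while the names `FACTS`, `roll_up` and `STATE_TOTALS` are illustrative and not part of any OLAP tool.

```python
from collections import defaultdict

# City-level facts copied from the Sales cube example:
# (state, city, sales in dollars, count)
FACTS = [
    ("Washington", "Bellingham", 700, 32),
    ("Washington", "Bremerton", 400, 20),
    ("Washington", "Olympia", 850, 44),
    ("Washington", "Redmond", 250, 9),
    ("Washington", "Seattle", 320, 15),
    ("California", "Berkeley", 820, 41),
    ("California", "Beverly Hills", 910, 50),
    ("California", "Los Angeles", 680, 38),
]


def roll_up(facts):
    """Aggregate city-level facts to the state level by summing both measures."""
    totals = defaultdict(lambda: {"sales": 0, "count": 0})
    for state, _city, sales, count in facts:
        totals[state]["sales"] += sales
        totals[state]["count"] += count
    return dict(totals)


STATE_TOTALS = roll_up(FACTS)
for state, measures in STATE_TOTALS.items():
    print(state, measures["sales"], measures["count"])
```

The sum works here because sales and counts are additive; the same function has no meaningful equivalent for images or texts, which is the gap OpAC addresses.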
Objectives
Classic OLAP aggregation vs. OpAC aggregation
• OpAC aggregation:
  • What about aggregating complex objects?
  • How can images, texts or videos be aggregated with classic OLAP tools?
  • Complex objects are not additive OLAP measures …
Example: Images cube

  Image            Size      ASM
  Orange coral     3560px    0.016
  Nebraska, USA    2340px    0.021
  Toco toucan      4434px    0.014
  Maldives         3260px    0.012
  Aggregate        ?         ?
Objectives
• How to aggregate complex objects?
  • By using a data mining technique: AHC (Agglomerative Hierarchical Clustering)
  • AHC aggregates data: similar individuals are merged into clusters
  • AHC is hierarchical: it produces nested partitions, from fine to coarse
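
To make the two properties of AHC concrete, here is a minimal sketch in plain Python. It is not the OpAC implementation: the points are hypothetical, and the Euclidean distance and average linkage are assumptions chosen for simplicity. The function returns every partition built along the way, which shows the hierarchical aspect.

```python
import math


def euclidean(a, b):
    """Dissimilarity between two individuals (assumed metric)."""
    return math.dist(a, b)


def average_linkage(c1, c2, points, dist):
    """Aggregation criterion (assumed): mean pairwise distance between two clusters."""
    total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def ahc(points, dist=euclidean, linkage=average_linkage):
    """Return the nested partitions, from all singletons down to one cluster."""
    clusters = [(i,) for i in range(len(points))]
    partitions = [list(clusters)]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage value.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], points, dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        partitions.append(list(clusters))
    return partitions


# Hypothetical individuals described by two numeric variables.
POINTS = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
PARTITIONS = ahc(POINTS)
for level, partition in enumerate(PARTITIONS):
    print(level, partition)
```

Each successive partition is obtained from the previous one by merging exactly two clusters, so any level of the list can be read as one candidate aggregation of the individuals.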
Objectives
[Figure: Images cube with two further dimensions, Entropy and Homogeneity, each with the modalities Very high, High, Medium, Low and Very low; highlighted cells: L1Normalized for high homogeneity, L1Normalized for low entropy]
Formalization
• $D_i$: the $i$-th dimension of a data cube $C$
• $h_{ij}$: the $j$-th hierarchical level of the dimension $D_i$
• $g_{ijt}$: the $t$-th modality of $h_{ij}$
• The set of individuals: $\Omega \subset \{\, g_{ijt} \mid g_{ijt} \in h_{ij} \,\}$
• The set of variables: $\Sigma \subset \{\, X \mid X(g_{ijt}) = \text{measure of } g_{srv} \text{ crossed with } g_{ijt} \,\}$, where $g_{srv} \in h_{sr}$, $s \neq i$, and $r$ is unique for each $s$
  • The dimension retained for individuals cannot generate variables
  • Only one hierarchical level of a dimension is allowed to generate variables
Formalization
• Evaluation tools
  • Minimize the intra-cluster distances
  • Maximize the inter-cluster distances
• Inter- and intra-cluster inertia:
  $I_{intra}(k) = \sum_{i=1}^{k} I(A_i)$
  $I_{inter}(k) = \sum_{i=1}^{k} P(A_i)\, d(G(A_i), G(\Omega))$
  • $A_1, A_2, \dots, A_k$ is a partition of $\Omega$
  • $P(A_i)$ is the weight of $A_i$
  • $G(A_i)$ is the gravity center of $A_i$
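
The two inertia indicators can be computed as follows. This is a sketch under stated assumptions, since the slide does not define every term: each individual gets the same weight 1/n, so P(A_i) = |A_i|/n; d is taken to be the squared Euclidean distance; and I(A_i) is taken to be the weighted sum of distances from the members of A_i to its gravity center. The function names are illustrative.

```python
def centroid(points):
    """Gravity center of a non-empty list of equally weighted points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance (assumed choice for d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intra_inertia(partition):
    """Sum over clusters of I(A_i), with every individual weighted 1/n."""
    n = sum(len(cluster) for cluster in partition)
    total = 0.0
    for cluster in partition:
        g = centroid(cluster)
        total += sum(sq_dist(x, g) for x in cluster) / n
    return total


def inter_inertia(partition):
    """Sum over clusters of P(A_i) * d(G(A_i), G(Omega)), with P(A_i) = |A_i| / n."""
    everything = [x for cluster in partition for x in cluster]
    n = len(everything)
    g_all = centroid(everything)
    return sum(len(c) / n * sq_dist(centroid(c), g_all) for c in partition)


# Hypothetical partition of four individuals into two clusters.
PARTITION = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
print(intra_inertia(PARTITION), inter_inertia(PARTITION))
```

Under these assumptions the two quantities add up to the total inertia of the data, so a partition that lowers the intra-cluster inertia raises the inter-cluster inertia by the same amount.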
Formalization
[Plot: inter-cluster and intra-cluster curves; axis ticks from 0 to 500 and from 7 down to 1]
[Figure: Images cube with Entropy and Homogeneity dimensions (modalities: Very high, High, Medium, Low, Very low)]
• Individuals:
  • Modalities from the dimension of images
• Variables:
  • L1Normalized values of images for all possible modalities of the entropy dimension
  • L1Normalized values of images for all possible modalities of the homogeneity dimension
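
The choice of individuals and variables above amounts to building a table with one row per image modality and one column per modality of the entropy and homogeneity dimensions. The sketch below illustrates this; the facts, the image modalities ("Animals", "Landscapes", "Cities") and the use of a sum to aggregate cells are all assumptions made for the example.

```python
from collections import defaultdict

LEVELS = ["Very high", "High", "Medium", "Low", "Very low"]

# Hypothetical cube facts:
# (image modality, entropy level, homogeneity level, L1Normalized value)
FACTS = [
    ("Animals", "High", "Low", 0.30),
    ("Animals", "High", "Medium", 0.10),
    ("Animals", "Low", "High", 0.05),
    ("Landscapes", "Low", "High", 0.40),
    ("Landscapes", "Medium", "High", 0.20),
    ("Cities", "Very high", "Very low", 0.25),
]


def build_table(facts):
    """Return (rows, columns): one row per individual, one column per variable."""
    cells = defaultdict(lambda: defaultdict(float))
    for image, entropy, homogeneity, value in facts:
        # Each fact contributes to one entropy variable and one homogeneity variable.
        cells[image]["entropy=" + entropy] += value
        cells[image]["homogeneity=" + homogeneity] += value
    columns = ["entropy=" + level for level in LEVELS]
    columns += ["homogeneity=" + level for level in LEVELS]
    rows = {image: [row[c] for c in columns] for image, row in cells.items()}
    return rows, columns


ROWS, COLUMNS = build_table(FACTS)
for image, values in ROWS.items():
    print(image, values)
```

Each row is a numeric vector, so the individuals can be handed to a clustering step even though the underlying objects are images.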
Formalization
Results:
• Exploits the cube's facts describing images to construct groups of similar complex objects
• Highlights significant groups of objects by means of a clustering technique
• Clusters (aggregates) are defined from both the dimensions and the measures of a data cube
• Implementation of a prototype
Implementation
Prototype:
• Data loading module:
  • Connects to a data cube on Analysis Services of MS SQL Server
  • Uses MDX queries to import information about the cube's structure
  • Extracts the data selected by the user
• Parameter setting interface:
  • Helps the user extract individuals and variables from the cube
  • Selects modalities and measures
  • Defines the clustering problem
• Clustering module:
  • Lets the user define clustering parameters such as the dissimilarity metric and the aggregation criterion
  • Constructs the AHC
  • Plots the results of the AHC as a dendrogram
Implementation
Images dataset:
• 3000 images collected from the web:
  • Semantic annotation: description, subject and theme
  • Texture descriptors such as:
    • ENT: Entropy
    • CON: Contrast
    • L1Normalized: Medium Color Characteristic
    • …
  • Three color channels: RGB
Implementation
Demonstration:
Conclusion
• OpAC is a possible way to perform on-line analysis over complex data
• OpAC aggregates complex objects
• Aggregates (clusters) are defined from both the dimensions and the measures of a data cube
• Prototype available at: http://bdd.univ-lyon2.fr/?page=logiciel&id=5
Future works
• The current evaluation tool may have some limitations
  • Use other indicators to evaluate the quality of partitions
  • Help the user find the best number of clusters
• Exploit the aggregates generated by OpAC to reorganize the cube's dimensions
  • Get a new cube with remarkable regions
• Use other data mining techniques to enhance the power of OLAP with explanation and prediction capabilities
The End