210 likes | 560 Views
Efficient OLAP Query Processing for Distributed Data Warehouses. Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark Michael H. Böhlen, NDB, Aalborg University, Denmark Theodore Johnson, AT&T Labs-Research, USA Laks V. S. Lakshmanan, University of British Columbia, Canada
E N D
Efficient OLAP Query Processing for Distributed Data Warehouses Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark Michael H. Böhlen, NDB, Aalborg University, Denmark Theodore Johnson, AT&T Labs-Research, USA Laks V. S. Lakshmanan, University of British Columbia, Canada Divesh Srivastava, AT&T Labs-Research, USA Michael O. Akinde EDBT’2002 -- March 24-28, Prague
Motivation • Analysis of network data • Collect, correlate, and analyze data across the network • Huge amounts of decentrally collected data • Complex OLAP operations • Performed using ad-hoc Perl scripts • Experiment: OLAP technology • Pro: Improves specification, performance • Con: Expensive (or data loss) when centralized • Existing centralized OLAP tools are inadequate
DWs close to the data collection points Network Administrators & Control Systems Source Coordinator Source Query Coordinator Source DW DW DW Solution:Distributed Datawarehouse • Local DW at each collection point (e.g., router) • Compute queries across multiple DWs • A technology is needed fordistributed processing of complex OLAP queries Application Application
Complex OLAP Queries • Examples: • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic? • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum? • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol • Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation
Skalla • Translates OLAP queries expressed with extended algebra, into distributed query evaluation plans • Salient Features: • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations,etc.) • Only partial results are shipped between the sites and the coordinator -- never subsets of the detail data • No site to site communication
Extended Algebra: GMDJ • Algebraic OLAP operator [Chatziantoniou et al., 2001] • Salient feature: Splits grouping and aggregation • MD(B, R, l,) • B is the base-values table (the “groups”) • R is the detail table (fact data) • l is the list of aggregate functions • : possibly complex condition over B and R describing what fact data is to be aggregated • Result: The table B extended with the aggregates in l
Extended Algebra: Example • For each IP address, what fraction of the total number of flows is due to web traffic? MD ( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)), Flows, (Cnt2), (IPT.key = Flows.key and Flows.Source = WEB )) • Result of inner GMDJ: (IPT, Cnt1) • Result of outer GMDJ: (IPT, Cnt1, Cnt2) • Sequences of GMDJs instead of multiple aggregate-join expressions
Administers local GMDJ queries Coordinator Coordinator Source Query Engine Query Engine Mediator Mediator Source • Computes distributed query plans • Synchronizes sub-results Source DW DW DW Skalla Architecture & Evaluation Skalla Site Wrapper Application Site Wrapper Application Site Wrapper • Skalla Evaluation Rounds: • Computation of GMDJ at the local DWs • Synchronize sub-results at coordinator
Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates DW DW Skalla Evaluation: Example • For each IP address, what fraction of the total number of flows is due to web traffic?
Skalla Evaluation: Features • Each round of computation in the distributed query evaluation computes a single GMDJ. • Features of the Evaluation: • Semantics of the query plans ensure that the amount of data shipped by the algorithm is dependent on the number of groups and aggregate functions and independent of the size of the fact relation in the database! • The algorithm permits for a wide variety of optimizations
Optimizations: Group Reduction • During processing, we only ship data that has actually been changed • Example: • Query: For each IP address, what fraction of the total number of flows is due to web traffic? • Each local DW receives a base-values table containing all source data • Coordinator has a copy of base-values table • Local DWs ship only those tuples back that have actually been changed
DW DW Group Reduction: Example Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates • For each IP address, what fraction of the total number of flows is due to web traffic?
Optimizations: Synchronization Reduction • It is possible to detect cases where no synchronization is required between passes. • Example: • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW • Query: For each IP address, what fraction of the total number of flows is due to web traffic? • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system
DW DW Synch Reduction: Example (1) • For each IP address, what fraction of the total number of flows is due to web traffic? Coordinator Build Groups Synchronize Synchronize result S1 • Distribute • Compute • Aggregates S2 • Distribute • Compute • Aggregates
DW DW Synch Reduction: Example (2) • For each IP address, what fraction of the total number of flows is due to web traffic? Coordinator Build Groups Synchronize result S1 • Distribute • Compute • Aggregates S2
Conclusions • We develop a framework for evaluating complex OLAP queries on a distributed data warehouse • Efficient query plans that minimize data transfer over the network • Further work: • Additional developments of the architecture • Cost-based query optimization