Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Joinin Database Federation A Master’s Thesis Proposal by Di Wang Advisor: Prof. Murali Mani Dec 4, 2008

Outline • Introduction – Query Optimization in Database Federations • Architecture and Problem Definition • Proposed Work • Schedule

Introduction: Need for data integration • Various systems -> full picture • Mergers -> access both resources with a common interface • Business partners -> combine data Multiple Access Methods Multiple Data Schemas

Introduction to Database Federation • Database Federation is one approach to data integration • Key performance advantage: efficiently combine data from multiple sources in a single statement • The data sources are federated into a unified middleware, called mediator.

Key Components of Database Federation Query • Research Issues: • containment algorithms for conjunctive queries, • schema mapping, • capability-based optimization Query Rewriter . . . . . . Cost-based optimization -- Closely related to the optimization techniques developed for the distributed database systems Cost-Based Optimizer

The problem

Plan 1 SortMerge on M1 HashJoin on M3 SortMerge on M2 Plan 2 Plan 3 Things that make us unhappy M3.R3 M3.R3 M3.R3 NestLoop on M2 NestLoop on M1 SortMerge on M1 M1.R1 M1.R1 M1.R1 M2.R2 M2.R2 M2.R2 Optimizer Estimated Condition: Available buffer sizes of sites; CPU utility of sites; Network traffics … Statistics: physical designs … M1 M3 M2 Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3) Need to take run-time conditions into account at optimization time.

Existing Solution - Parametric Query Optimization • Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992. • Key idea: To identify several execution plans, each one of which is optimal for a subset of ALL possible values of the run-time parameters • E.g. Two parameters: Buffer size B = [2, 151] Kind of indexes I = {no_index, clustered_Btree, non_clustered_BTree} P – possible vectors of values of parameters P = cross product B × I |P| = 150*3 = 450 The optimization problem: pP , to find the plan s0in that plan space S that satisfies the condition: is static parameters, c( ) is the cost function

Existing Solution - Parametric Query Optimization (Cont.) • Efficient exploration algorithm – Randomized Algorithm • Justification for using parametric query optimization Relative cost Buffer size • Problems of the implementation in distributed database • Site selection + algebraic transformation + physical method selection • Much more combinations of run-time parameters

Existing Solution – Two-Phase Algorithm • W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS,1991. • Developed for a parallel database based on a share-memory multiprocessor Phase 1: find the optimal sequential plan assuming the entire buffer pool is available Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors

Existing Solution – Two-Phase Algorithm (Cont.) • Benefits: • Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2 • Capability of dealing with compile-time unknown parameters • Problems for applying in database federations: • Communication cost was not considered • Exhaustive search in phase 2 is still expensive for large scale of data sources

Proposed Work

Important Observation • many national-scale or global-scale data federations are built on the networks which consist of both broad, LAN paths and narrow, long-haul paths. • many highly-integrated systems have to access data through a great deal of databases that belong to multiple different organizations.

Cluster-and-Conquer • consider all data resources in the database federation as a set of several clusters of sites • design two layers of mediators to schedule the query plan cooperatively: • Global Mediator + Cluster Mediator Cluster 4 Cluster 1 Cluster 5 Cluster 6 Cluster 2 Cluster12 Cluster 7 Cluster10 Cluster9 Cluster 11 Cluster 8 Cluster13 Global Mediator

Architecture • Consider run-time conditions & static physical designs • Find a intra-cluster optimal plan • Every cluster mediator functions independently and potentially in parallel • schedules the optimal plan found by the optimizer in a distributed and parallelized way • assigns each sub-plan to the corresponding cluster • System-R style algorithm • performs at compiling time • considers all the tables as being stored in the clustered fashion • decide inter-cluster operations

Cost Model and Optimization Goal • Cost Model • Optimization Goal • to find the distributed join schedule plan with minimum cost.

Problem Definition • Run-time parameters: • Available buffer size • CPU utilization • Parallelism: • Partitioned parallelism • Pipelined parallelism • Reasons: input data partition is not often feasible ; in bushy plans it is common to have two operations that do not each other’s output • Independent parallelism

Optimization Algorithm E.g. SELECT * FROM S1.t1, S2.t2,S5.t7, S1.t2, S6.t5, S2.t3 WHERE S1.t1.CustomerID = S2. t2. CustomerID AND S2.t2. SupplierID = S5.t7.SupplierID AND S5.t7.ItemID = S6.t5. ItemID AND S6.t5.Country = S1.t2.Country AND S1.t2.Year = S2.t3.Year Global Mediator Clustered view Physical design info: B(R), T(R), V(R.attr), …… Rule 1: only determine inter-cluster operations Rule 2: plans that join two relations in distinct clusters are eliminated

Optimization Algorithm (Cont.) Run-time conditions: Available_buffer(S1), CPU_utility(S1), …… Sub-plan Cluster 1 Mediator Physical design info: B(R), T(R), V(R.attr), …… • Search space: • Algebraic transform • Physical method selection – Available_buffer • Site selection – CPU_utility • (fine grain operator scheduling)

Theoretical Analysis • In global mediator • In cluster mediator • Compare to related works

Experiment Design

That is what I want to do for my Master Thesis … Thanks 

Efficient Query Optimization for Distributed Join in Database Federation