1 / 22

Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Join in Database Federation. A Master’s Thesis Proposal by Di Wang Advisor: Prof. Murali Mani Dec 4, 2008. Outline . Introduction – Query Optimization in Database Federations Architecture and Problem Definition Proposed Work Schedule.

ursula
Download Presentation

Efficient Query Optimization for Distributed Join in Database Federation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient Query Optimization for Distributed Joinin Database Federation A Master’s Thesis Proposal by Di Wang Advisor: Prof. Murali Mani Dec 4, 2008

  2. Outline • Introduction – Query Optimization in Database Federations • Architecture and Problem Definition • Proposed Work • Schedule

  3. Introduction: Need for data integration • Various systems -> full picture • Mergers -> access both resources with a common interface • Business partners -> combine data Multiple Access Methods Multiple Data Schemas

  4. Introduction to Database Federation • Database Federation is one approach to data integration • Key performance advantage: efficiently combine data from multiple sources in a single statement • The data sources are federated into a unified middleware, called mediator.

  5. Key Components of Database Federation Query • Research Issues: • containment algorithms for conjunctive queries, • schema mapping, • capability-based optimization Query Rewriter . . . . . . Cost-based optimization -- Closely related to the optimization techniques developed for the distributed database systems Cost-Based Optimizer

  6. The problem

  7. Plan 1 SortMerge on M1 HashJoin on M3 SortMerge on M2 Plan 2 Plan 3 Things that make us unhappy M3.R3 M3.R3 M3.R3 NestLoop on M2 NestLoop on M1 SortMerge on M1 M1.R1 M1.R1 M1.R1 M2.R2 M2.R2 M2.R2 Optimizer Estimated Condition: Available buffer sizes of sites; CPU utility of sites; Network traffics … Statistics: physical designs … M1 M3 M2 Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3) Need to take run-time conditions into account at optimization time.

  8. Existing Solution - Parametric Query Optimization • Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992. • Key idea: To identify several execution plans, each one of which is optimal for a subset of ALL possible values of the run-time parameters • E.g. Two parameters: Buffer size B = [2, 151] Kind of indexes I = {no_index, clustered_Btree, non_clustered_BTree} P – possible vectors of values of parameters P = cross product B × I |P| = 150*3 = 450 The optimization problem: pP , to find the plan s0in that plan space S that satisfies the condition: is static parameters, c( ) is the cost function

  9. Existing Solution - Parametric Query Optimization (Cont.) • Efficient exploration algorithm – Randomized Algorithm • Justification for using parametric query optimization Relative cost Buffer size • Problems of the implementation in distributed database • Site selection + algebraic transformation + physical method selection • Much more combinations of run-time parameters

  10. Existing Solution – Two-Phase Algorithm • W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS,1991. • Developed for a parallel database based on a share-memory multiprocessor Phase 1: find the optimal sequential plan assuming the entire buffer pool is available Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors

  11. Existing Solution – Two-Phase Algorithm (Cont.) • Benefits: • Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2 • Capability of dealing with compile-time unknown parameters • Problems for applying in database federations: • Communication cost was not considered • Exhaustive search in phase 2 is still expensive for large scale of data sources

  12. Proposed Work

  13. Important Observation • many national-scale or global-scale data federations are built on the networks which consist of both broad, LAN paths and narrow, long-haul paths. • many highly-integrated systems have to access data through a great deal of databases that belong to multiple different organizations.

  14. Cluster-and-Conquer • consider all data resources in the database federation as a set of several clusters of sites • design two layers of mediators to schedule the query plan cooperatively: • Global Mediator + Cluster Mediator Cluster 4 Cluster 1 Cluster 5 Cluster 6 Cluster 2 Cluster12 Cluster 7 Cluster10 Cluster9 Cluster 11 Cluster 8 Cluster13 Global Mediator

  15. Architecture • Consider run-time conditions & static physical designs • Find a intra-cluster optimal plan • Every cluster mediator functions independently and potentially in parallel • schedules the optimal plan found by the optimizer in a distributed and parallelized way • assigns each sub-plan to the corresponding cluster • System-R style algorithm • performs at compiling time • considers all the tables as being stored in the clustered fashion • decide inter-cluster operations

  16. Cost Model and Optimization Goal • Cost Model • Optimization Goal • to find the distributed join schedule plan with minimum cost.

  17. Problem Definition • Run-time parameters: • Available buffer size • CPU utilization • Parallelism: • Partitioned parallelism • Pipelined parallelism • Reasons: input data partition is not often feasible ; in bushy plans it is common to have two operations that do not each other’s output • Independent parallelism

  18. Optimization Algorithm E.g. SELECT * FROM S1.t1, S2.t2,S5.t7, S1.t2, S6.t5, S2.t3 WHERE S1.t1.CustomerID = S2. t2. CustomerID AND S2.t2. SupplierID = S5.t7.SupplierID AND S5.t7.ItemID = S6.t5. ItemID AND S6.t5.Country = S1.t2.Country AND S1.t2.Year = S2.t3.Year Global Mediator Clustered view Physical design info: B(R), T(R), V(R.attr), …… Rule 1: only determine inter-cluster operations Rule 2: plans that join two relations in distinct clusters are eliminated

  19. Optimization Algorithm (Cont.) Run-time conditions: Available_buffer(S1), CPU_utility(S1), …… Sub-plan Cluster 1 Mediator Physical design info: B(R), T(R), V(R.attr), …… • Search space: • Algebraic transform • Physical method selection – Available_buffer • Site selection – CPU_utility • (fine grain operator scheduling)

  20. Theoretical Analysis • In global mediator • In cluster mediator • Compare to related works

  21. Experiment Design

  22. That is what I want to do for my Master Thesis … Thanks 

More Related