1k likes | 1.14k Views
The Sum is Greater Than the Parts Global Query Optimization in Federated Systems. Tony Young M.Math Candidate CS 848 - Fall 2004. Outline. Introduction Motivation Issues System Overview Optimization Semijoin Algorithm Reduction Algorithm The Garlic Approach Conclusion My Project.
E N D
The Sum is GreaterThan the PartsGlobal Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004
Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project
Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project
Introduction • References: • L. M. Haas, E. T. Lin, and M. A. Roth. Data integration through database federation. IBM Systems Journal, 41(4):578–596, 2002. • David K. Hsiao. Federated databases and systems: part i — a tutorial on their data sharing. The VLDB Journal, 1(1):127–180, 1992. • David K. Hsiao. Federated databases and systems: part ii — a tutorial on their resource consolidation. The VLDB Journal, 1(2):285–310, 1992. • Hongjun Lu, Beng-Chin Ooi, and Cheng-Hian Goh. On global multidatabase query optimization. SIGMOD Rec., 21(4):6–11, 1992.
Introduction • References: • Neil Coburn and Per-Ake Larson. Multidatabase services: issues and architectural design. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 57–66, Toronto, Ontario, Canada, 1992. IBM Press. • Qiang Zhu. Query optimization in multidatabase systems. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 111–127. IBM Press, 1992.
Introduction • References: • H. Lu, B. C. Ooi, and C. H. Goh. Multidatabase query optimization: Issues and solutions. In Proceedings of Third International Workshop on Research Issues in Data Engineering: Interoperability in Multidatabase Systems, pages 137–143, 1993.
Introduction • What is a federated database system (FDBS)? • Also referred to as a multidatabase system (MDBS) • FDBS’s combine multiple heterogeneous data sources into one global view • Users think that the data all resides in one place
Introduction • The notion of an FDBS has been around for many years • First semi-commercial product was Mermaid (later became InterViso) in 1984 • Flurry of papers began to surface in late 1980’s - early 1990’s • What motivated the development of these systems?
Motivation • Replacement of Data Processing With Databases • Organizations were moving away from traditional data processing techniques (such as storing information in flat files • The rise in popularity of DBMS's can be attributed to powerful data mining applications as well as ease of data access
Motivation • Proliferation of Heterogeneous Databases Within an Organization • It is not uncommon for different departments within an organization to make use of their own database servers • Departments often do not coordinate to ensure that a corporation is using a homogeneous DBMS to store data • There is no guarantee that the schema individual departments use to store data will be homogeneous
Motivation • Data Sharing Within Organizations • Many organizations seek to share data between different departments • Finance department may require information regarding projects in progress in the marketing department • Such information sharing is difficult without the guarantee of schema, data model or access language homogeneity • Finance may use SQL on a relational database and marketing may use Xquery on an XML database
Motivation • Different Rates of Technology Adoption • Different departments will adopt technology at different rates • IT will adopt quickly to store inventory information, call tickets, etc. and are tech savvy • HR may adopt slowly as they use many paper forms and are not so tech savvy • Older systems are in place, and newer systems to come online use different products
Motivation • Geographic Separation of Teams • Different teams may be broken up across geographic locations • Different sites will hold teams working on different projects • Each site may have their own IT staff and make their own purchasing and installation decisions • Often there is no coordination between sites
Motivation • Mergers and Acquisitions • When companies join forces, their IT systems must be joined as well • Old applications will depend on the old software, and users might be reluctant to learn an entirely new system • If we can merge the two system so that each user can make use of their old applications and old access language, transitions might be easier
Issues • Several issues affect query optimization in FDBS’s
Issues • Site Autonomy • Data - Local database administrators have direct and complete control over the schemas • This information cannot be modified in any way • Design - Local database administrators decide when and how to replicate and fragment data • Communication - Each site decides locally whether or not to communicate with the FDBS • Execution - Each site can determine how, when and whether to execute global queries, as well as how queries are prioritized
Issues • Local Parameters • Local cost parameters for individual sites are not always available to the FDBS • The FDBS often doesn't know what indices are available for relations at local sites • Can’t predict what access methods will be used by local sites • No idea what page size and disk latency are • … etc.
Issues • Translation • Queries must be translated to and from the local schema, query language, and data model on-the-fly • This requires additional query processing time
Issues • Heterogeneous Capabilities • Not all local sites have the same capabilities. • Some sites may not implement any ranking operations • This means that intermediate results might have to be shifted to sites that can provide these capabilities, further increasing processing time
Issues • Additional Costs • Cost based optimization needs to take into consideration some additional factors such as: • Transmission speeds • Network loads • Local site configurations • As with local parameters, this information is not always available to the FDBS
Issues • Overriding theme - the FDBS is just another application as far as the local sites are concerned • We can make no assumptions about the local sites • We can assume no global control over the local sites • We have no hooks into the local sites to directly access information
Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project
System Overview • References: • Qiang Zhu. Query optimization in multidatabase systems. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 111–127. IBM Press, 1992. • M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1999. • John Grant, Witold Litwin, Nick Roussopoulos, and Timos Sellis. Query languages for relational multidatabases. The VLDB Journal, 2(2):153–172, 1993.
System Overview • Two main approaches to a federated system: • Multidatabase Language Approach • Users must learn a special access language • Users must use one standardized data model • Users must know the sites they are contacting and how data is organized at those sites • Users must enter their authentication information each time they use a site • Users must…
System Overview • Main idea: • User specifies the sites, relations and columns used in the query by their name at the local site • Queries are still submitted to a middleware, but user must know where things are stored
System Overview • Two main approaches to federated system: • Global Schema Approach • Global DBA implements wrappers/agents to convert access language and data model before sending to source • Global DBA generates an integrated global schema • Global DBA stores authentication for individual users at each local site and FDBS handles login • Global DBA…
System Overview • Main idea: • User specifies the relations and columns used in the query by their global name • Queries are submitted to a middleware that does conversion and subquery generation
System Overview • The difference is how users perceive the system • MDBL: As far as the user is concerned, the data is stored at separate sites and their query must explicitly use those sites • GS: As far as the user is concerned, the data is stored in the middleware and they can access it directly
Outline • Introduction • Motivation • Issues • System Overview • Optimization • Semijoin Algorithm • Reduction Algorithm • The Garlic Approach • Conclusion • My Project
Optimization • There are many many optimization algorithms for federated systems • 2-Phase • Statistical Sampling • Adaptive • Probing Query-based • …etc. • We will look at three • Semijoin and Reduction: From Mermaid - state of the art for many years and pioneer in the field • Garlic: From IBM - incorporated into a shipping product (DB2 Information Integrator)
Semijoin Optimization Algorithm • References: • David Brill, Marjorie Templeton, and Clement T. Yu. Distributed query processing strategies in mermaid, a frontend to data management systems. In Proceedings of the First International Conference on Data Engineering, pages 211–218. IEEE Computer Society, 1984.
Semijion Optimization Algorithm • The semijoin algorithm was proposed in Mermaid (1984) • Assumes that the cost of data transfer through a network outweighs local site CPU overhead • Seeks to reduce the size of relations required for a query at local sites before transferring results back to the Controller • Four steps
Step 1: Site Selection • A set of sites that will be used to perform a query must first be chosen • Requires finding a set of minimal size that includes one copy of each local, partitioned and replicated relation • i.e. each site holding a data fragment must be in the set, but only one replica of a relation must be in the set • Some sites may hold more than one relation required by the query • Allows us to further reduce the size of the site set
Step 1: Site Selection • The system statistics that can be used to optimize this selection, such as link speeds and system loads, remains an open problem
Step 2: Local Reduction • In parallel at each local site in the chosen site set, reduce each relation by performing selections and projections • Parameters used to perform these operations are taken from select, where and join conditions in the original query • It might be possible to optimize the order in which site reduction queries are performed by exploiting network traffic and speed, CPU load at local sites, etc. • i.e. submit queries to slow sites first and hope they don't increase the overall execution time too much
Step 3: Global Reduction • Find and execute an efficient sequence of semijoins that will reduce the set of records to be transmitted • Mermaid uses a hill-climbing algorithm to determine this set • Once the semijoins are performed, the smallest amount of data required to answer the query is ready for transport • Some other algorithms to determine the optimal semijoin sequence should be investigated as this one is slow!
Step 4: Assembly • Transfer the data to one central query site and generate the result set. Return the result set to the user • May perform joins at local sites or wait until we get to the FDBS