The State of the Art in Distributed Query Processing

Explore the state of the art in distributed query processing, covering client-server, peer-to-peer, and middleware architectures. Learn about query shipping, data shipping, hybrid approaches, optimizations, and distributed query plans.



Presentation Transcript


  1. The State of the Art in Distributed Query Processing by Donald Kossmann Presented by Chris Gianfrancesco

  2. Introduction • Distributed database technology is becoming an increasingly attractive enhancement to many database systems • Cost and scalability • Software integration • Legacy systems • New applications • Market forces

  3. Introduction • Topics covered in this paper • Basics of distributed query processing • Client-server distributed DB models • Heterogeneous distributed DB models • Data placement techniques • Other distributed architectures

  4. Client-Server Database Systems • Relationships between distributed nodes take a client-server form • Client: makes requests of the servers, usually the source of queries • Server: responds to client requests, usually the source of data • System architectures: peer-to-peer, strict client-server, middleware/multitier

  5. Architectures: Peer-to-Peer • All nodes are equivalent • Each can be either a client or server on demand (can store data and/or make requests) • Ex: SHORE system • [Diagram: three peer nodes, each acting as server or client]

  6. Architectures: Strict Client-Server • Client or server status is pre-defined and can never change • Clients supply queries, servers supply data • Most common architecture in commercial DBMSs • [Diagram: client (query source) connected to server (data source)]

  7. Architectures: Middleware/Multitier • Multiple levels of client-server interaction • Nodes act as clients to those below them and servers to those above • Ex: SAP R/3, web servers with DB backends • [Diagram: Node 1 is a client to Node 2; Node 2 is a server to Node 1 and a client to Node 3; Node 3 is a server to Node 2]

  8. Architectures: Evaluation • Peer-to-Peer: simplest setup; equal load sharing • Strict Client-Server: specialization; administration for servers only • Middleware/Multitier: functionality integration; scalability

  9. Client-Server Query Processing • Queries initiated at clients, data stored at servers • Where do we execute the query? • Query shipping: move the query down to the data • Data shipping: move the data up to the query • Hybrid shipping: combination of both

  10. Query Shipping • SQL query code is sent down to the server • Server parses and evaluates query, returns result • Used in DB2, Oracle, MS SQL Server

  11. Data Shipping • Client parses query and requests data from server • Server provides data, then client executes query • Data can be cached at client (main memory or disk)
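The contrast between query shipping and data shipping can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the remote server, and the `emp` table, its rows, the predicate, and the `client_cache` dictionary are all hypothetical. Each function returns the answer together with the number of rows that crossed the "network".

```python
import sqlite3

# Assumption: an in-memory SQLite database plays the role of the remote server.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE emp (id INTEGER, dept TEXT, salary INTEGER)")
server.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, "db", 90), (2, "db", 70), (3, "os", 80), (4, "net", 60)],
)

def query_shipping(conn):
    """Send the SQL text to the server; only result rows come back."""
    rows = conn.execute(
        "SELECT id FROM emp WHERE dept = 'db' AND salary > 75"
    ).fetchall()
    return [r[0] for r in rows], len(rows)  # (answer, rows transferred)

# Hypothetical client-side cache of base tables, keyed by table name.
client_cache = {}

def data_shipping(conn):
    """Fetch the base table (unless cached) and evaluate the predicate at the client."""
    transferred = 0
    if "emp" not in client_cache:
        client_cache["emp"] = conn.execute(
            "SELECT id, dept, salary FROM emp"
        ).fetchall()
        transferred = len(client_cache["emp"])
    answer = [i for (i, dept, sal) in client_cache["emp"]
              if dept == "db" and sal > 75]
    return answer, transferred
```

On this toy data, query shipping moves only the qualifying row, while data shipping moves the whole table on the first call and nothing on later calls that hit the cache.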

  12. Hybrid Shipping • Mix-and-match data shipping and query shipping • Query parts can be executed at any level according to query plan • Data is cached when beneficial

  13. Evaluation • Query Shipping: reliant on server performance; scales poorly with increasing client load • Data Shipping: good scalability; high communication costs • Hybrid: potential to outperform other options; more complex optimizations

  14. Hybrid Shipping Observations • Some observations of optimal performance using hybrid shipping • Prefer not to use the client cache if network transfer cost < client access cost • Ship cached data down to the server if it is in client main memory and execution is at the server • For multiple small updates, maintain them at the client and post to the server only when necessary
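The last observation, holding small updates at the client and posting them only when necessary, might look roughly like the sketch below. The class name `ClientUpdateBuffer` and the `post_to_server` callback are hypothetical, and deciding when a flush is "necessary" is left to the caller.

```python
class ClientUpdateBuffer:
    """Collects small updates at the client and posts them in one batch."""

    def __init__(self, post_to_server):
        # post_to_server is an assumed callback that ships a dict of
        # {key: new_value} to the server in a single message.
        self._post = post_to_server
        self._pending = {}

    def update(self, key, value):
        # A later update to the same key overwrites the earlier one,
        # so only the final value is ever sent.
        self._pending[key] = value

    def flush(self):
        # Post only if there is something to send.
        if self._pending:
            self._post(dict(self._pending))
            self._pending.clear()
```

Three calls to `update` followed by one `flush` result in a single message to the server rather than three.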

  15. Query Optimization • Query plans must also specify where the query pieces are executed • Data shipping: all execution done at client • Query shipping: all execution done at server • Hybrid: choice can be made for each operator • Results display to user is always at client
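To make the difference in search space concrete, the rules on this slide can be written as a small function that lists the sites each policy allows per operator. The policy strings and operator names are assumptions made for this sketch; the rules themselves follow the bullets above.

```python
from itertools import product

def allowed_sites(policy, operator):
    """Return the sites where an operator may run under a shipping policy."""
    if operator == "display":
        return ["client"]  # results are always displayed at the client
    if policy == "data_shipping":
        return ["client"]
    if policy == "query_shipping":
        return ["server"]
    if policy == "hybrid_shipping":
        return ["client", "server"]
    raise ValueError(f"unknown policy: {policy}")

def site_assignments(policy, operators):
    """Enumerate every (operator, site) assignment the policy permits."""
    choices = [allowed_sites(policy, op) for op in operators]
    return [tuple(zip(operators, combo)) for combo in product(*choices)]
```

For a plan with `scan`, `select`, and `display`, data shipping and query shipping each allow exactly one assignment, while hybrid shipping allows four. That growth is the extra work the optimizer takes on in exchange for flexibility.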

  16. Distributed Query Plans • Each operator is annotated with a logical site of execution – plans are shareable • client means an operator is executed at the client where the query is issued • server means: • for scan operators, execute at a location that has the necessary data • for updates, execute at all locations with the relevant data
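Because the annotations are logical, a plan only becomes tied to machines when it is bound for a particular client. A minimal sketch of that binding step follows, assuming a hypothetical `catalog` that maps each table to the nodes storing a copy of it. The `Op` class and `bind` function are illustrative names, and picking the first copy for a scan is an arbitrary choice.

```python
from dataclasses import dataclass

@dataclass
class Op:
    kind: str        # "scan", "update", "display", ...
    site: str        # logical annotation: "client" or "server"
    table: str = ""  # table touched by a scan or update

def bind(op, client_node, catalog):
    """Resolve a logical site annotation to a list of physical nodes."""
    if op.site == "client":
        return [client_node]      # the client that issued the query
    holders = catalog[op.table]   # nodes that store the table
    if op.kind == "update":
        return list(holders)      # every copy must be updated
    if op.kind == "scan":
        return [holders[0]]       # any one copy suffices (arbitrary pick)
    raise ValueError(f"no server binding rule for {op.kind!r} in this sketch")
```

The same annotated plan can be bound for different clients simply by passing a different `client_node`, which is what makes such plans shareable.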

  17. Query Optimization: Where? • Should optimization occur at the client or the server? • At client: less load on servers, better scalability • At server: more information about system statistics, especially server loads • Potential solution: primary parsing and query rewriting at client, further optimization at server

  18. Query Optimization: Statistics • Even when optimization is done at a server, that server does not usually have full knowledge of the system • System can either: • Guess the status of other servers – less accuracy, less cost • Ask other servers their status – fully accurate, additional communication costs

  19. Query Optimization: When? • Tradeoff of accuracy vs. cost • Traditional-style (optimize once, store plan): no support for changing DB conditions; no optimization cost incurred at query execution • Plan sets (optimize for possible scenarios): generate a few query plans for different conditions; choose among them based on runtime statistics • On-the-fly (observe intermediate results): re-optimize the query if they differ from expectations
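The plan-set idea can be sketched as a list of precompiled plans, each guarded by the condition it was optimized for. Everything concrete here is an assumption for illustration: the statistics keys `server_load` and `tables_cached`, the 0.5 threshold, and the plan descriptions.

```python
# Each entry pairs a runtime condition with the plan precompiled for it.
PLAN_SET = [
    (lambda s: s["server_load"] < 0.5, "all joins at server"),
    (lambda s: s["tables_cached"], "joins at client over cached tables"),
]
FALLBACK_PLAN = "joins at client, fetch tables from server"

def choose_plan(stats):
    """Pick the first precompiled plan whose condition holds at runtime."""
    for applies, plan in PLAN_SET:
        if applies(stats):
            return plan
    return FALLBACK_PLAN
```

Optimization cost is paid once, when the set is built. At runtime the only work is evaluating a few cheap conditions against current statistics.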

  20. Query Optimization: Two-Step • Compile-time: generate join order, etc. • Runtime: perform site selection • Reasonable cost at each end • Responds well to changing server loads • Fully utilizes client data caching
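A rough sketch of the runtime half of two-step optimization is shown below. The join order is taken as given from compile time, and only the sites are chosen when the query runs. The site-selection heuristic (scan cached tables at the client, move joins to the client when the server is busy) and the `load_threshold` parameter are assumptions made for this example, not the rule from the paper.

```python
def select_sites(join_order, server_load, cached_tables, load_threshold=0.8):
    """Step 2 (runtime): assign a site to each operator of a fixed join order."""
    sites = {}
    # Scan at the client when the table is in the client cache.
    for table in join_order:
        sites[f"scan({table})"] = "client" if table in cached_tables else "server"
    # Assumed heuristic: offload joins to the client when the server is busy.
    join_site = "client" if server_load > load_threshold else "server"
    for i in range(1, len(join_order)):
        sites[f"join{i}"] = join_site
    return sites

# Step 1 (compile time) would have produced this order: ((A join B) join C).
JOIN_ORDER = ["A", "B", "C"]
```

Calling `select_sites(JOIN_ORDER, 0.9, {"A"})` and `select_sites(JOIN_ORDER, 0.1, set())` yields different site assignments for the same stored join order, which is how the approach reacts to server load and cache contents without re-running the full optimizer.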

  21. Two-Step Optimization: Downside • Optimal plan is generated traditional-style • Site selection is performed • True optimal plan was missed • Optimal was missed because first optimization step was done with no knowledge of the system

  22. Query Execution Techniques • Standard fare: row blocking, multithread when possible • Issues: transactions with both updates and retrieval queries using hybrid shipping • We want to wait to propagate updates for efficiency’s sake • Other option: perform query before update and temporarily pad results
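Row blocking means sending rows in batches rather than one per message. A minimal sketch, where the function name `row_blocks` and the block size are illustrative choices:

```python
from itertools import islice

def row_blocks(rows, block_size):
    """Group an iterable of rows into lists of at most block_size rows."""
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block
```

With a block size of 3, seven rows travel in three messages instead of seven, which amortizes the per-message overhead.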

  23. Questions? • Comments?
