
Scoped and Approximate Queries in a Relational Grid Information Service

This paper discusses powerful but expensive queries in a relational grid information service and introduces solutions such as scoped queries, approximate queries, scoped approximate queries, and nondeterministic queries. The performance evaluation of these solutions is also presented.




Presentation Transcript


  1. Scoped and Approximate Queries in a Relational Grid Information Service Dong Lu, Peter A. Dinda, Jason A. Skicewicz Prescience Lab, Dept. of Computer Science, Northwestern University, Evanston, IL 60201

  2. Outline • Introduction and motivation • Powerful queries, but expensive to execute • Trade off between result size and query time • Our solutions: Scoped query, Approximate query, Scoped Approximate query • Nondeterministic query (SC Talk on Tuesday) • Performance Evaluation

  3. What is RGIS? • GIS: A Grid Information Service stores information about the resources and services in a distributed computing environment and answers queries about them. • RGIS: A Grid Information Service based on the relational data model.

  4. Why RGIS? • RGIS can answer complex compositional queries • Relational algebra (SQL) • Joins • Difficult in a hierarchical model (directory service) • Other reasons • Indexes separate from data model • Schema evolution • Transactional insert/update/delete • Consistency

  5. RGIS Model of a Grid • Annotated network topology graph • Annotation examples • Hosts: memory, disk, OS, NICs, etc. • Router/Switch: backplane bandwidth, ports • Link: latency and bandwidth • Highly dynamic data in streams, not DB • Virtualization, Futures, Leases • Virtual machines [Figure: layered object model. Layer labels: Software, Network, Data link, Physical. Object labels: module, endpoint, router, iplink, host, maclink, macswitch, connectorswitch, connectorlink]

  6. The RGIS Design (Per Site)

  7. Challenge/Trade off • Complex queries to a relational database can take a long time: hours, days, or even weeks when we want seconds. • Typically, the returned result set is unnecessarily big, because the query gets back all results. • We need mechanisms to trade off query time against the size of the result set.

  8. Challenge/Trade off [Figure: all results, approximate results, nondeterministic results, scoped results]

  9. Example: Cluster Finder • Find N hosts connected to the same router, with total memory of at least N*512 MB, all running Linux, where the bisection bandwidth of the cluster is no less than 100 Mbit/s. [Figure: routers, IP links, hosts, cluster]

  10. Original SQL for 2 Host Cluster Finder
      SELECT [scoped-approx] h1.distip, h2.distip
      FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
      WHERE h1.mem_mb + h2.mem_mb >= 1024
        AND h1.os = 'linux' AND h2.os = 'linux'
        AND ((l1.src = r.distip AND l2.src = r.distip AND l1.dest = h1.distip AND l2.dest = h2.distip)
          OR (l1.dest = r.distip AND l2.dest = r.distip AND l1.src = h1.distip AND l2.src = h2.distip))
        AND h1.distip <> h2.distip
        AND l1.bw_mbs >= 100 AND l2.bw_mbs >= 100
      [SCOPED BY r.distip = X]
      WITHIN 100 seconds;

  11. Original SQL for Cluster Finder • Looking for an N-node cluster takes a (2*N+1)-way join (N hosts, N IP links, and 1 router), which is not scalable. [Figure: routers, IP links, hosts; Cluster 1 and Cluster 2]
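
The growth of the join can be made concrete with a small generator. This is a minimal sketch, not the authors' query rewriter: the function name `cluster_finder_sql` and its parameters are hypothetical, and the pairwise `<>` conditions for N > 2 are an assumed generalization of the two-host query on slide 10.

```python
def cluster_finder_sql(n, mem_per_host_mb=512, min_bw=100):
    """Build an N-host cluster finder query modeled on the 2-host SQL.

    Returns (sql, number_of_joined_tables). Hypothetical helper: the
    generalization beyond two hosts is an assumption, not from the slides.
    """
    hosts = [f"h{i}" for i in range(1, n + 1)]
    links = [f"l{i}" for i in range(1, n + 1)]
    # N host aliases + N link aliases + 1 router alias = 2*N + 1 tables.
    tables = ([f"hosts {h}" for h in hosts]
              + [f"iplinks {l}" for l in links]
              + ["routers r"])
    # Links may be stored router->host or host->router, as on slide 10.
    down = " AND ".join(f"{l}.src = r.distip AND {l}.dest = {h}.distip"
                        for h, l in zip(hosts, links))
    up = " AND ".join(f"{l}.dest = r.distip AND {l}.src = {h}.distip"
                      for h, l in zip(hosts, links))
    conditions = [
        " + ".join(f"{h}.mem_mb" for h in hosts) + f" >= {n * mem_per_host_mb}",
        *[f"{h}.os = 'linux'" for h in hosts],
        f"(({down}) OR ({up}))",
        # Assumed generalization: all hosts must be pairwise distinct.
        *[f"{a}.distip <> {b}.distip"
          for i, a in enumerate(hosts) for b in hosts[i + 1:]],
        *[f"{l}.bw_mbs >= {min_bw}" for l in links],
    ]
    sql = ("SELECT " + ", ".join(f"{h}.distip" for h in hosts)
           + " FROM " + ", ".join(tables)
           + " WHERE " + " AND ".join(conditions))
    return sql, len(tables)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, "hosts ->", cluster_finder_sql(n)[1], "way join")
```

Each additional host adds one `hosts` alias and one `iplinks` alias to the FROM clause, which is where the 2*N+1 count comes from.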

  12. Scoped Cluster Finder • Query the hosts around a random router. [Figure: routers, IP links, hosts]

  13. Scoped Cluster Finder
      SELECT H1.DISTIP, H2.DISTIP
      FROM HOSTS H1, HOSTS H2, IPLINKS L1, IPLINKS L2, ROUTERS R
      WHERE H1.MEM_MB + H2.MEM_MB >= 1024
        AND H1.OS = 'LINUX' AND H2.OS = 'LINUX'
        AND ((L1.SRC = R.DISTIP AND L2.SRC = R.DISTIP AND L1.DEST = H1.DISTIP AND L2.DEST = H2.DISTIP)
          OR (L1.DEST = R.DISTIP AND L2.DEST = R.DISTIP AND L1.SRC = H1.DISTIP AND L2.SRC = H2.DISTIP))
        AND H1.DISTIP <> H2.DISTIP
        AND L1.BW_MBS >= 100 AND L2.BW_MBS >= 100
        AND R.DISTIP = X;
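
The scoped query on slide 13 can be tried on a toy database. The sketch below uses SQLite rather than the Oracle setup from the talk; the column types, the four sample hosts, and the helper names `build_toy_db` and `scoped_cluster_finder` are all assumptions made for illustration. Retrying with another router when one yields nothing mirrors the retry described for the scoped approximate query on slide 16.

```python
import random
import sqlite3

# The slide's scoped query, with the router X passed as a bound parameter.
SCOPED_SQL = """
SELECT h1.distip, h2.distip
FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
WHERE h1.mem_mb + h2.mem_mb >= 1024
  AND h1.os = 'linux' AND h2.os = 'linux'
  AND ((l1.src = r.distip AND l2.src = r.distip
        AND l1.dest = h1.distip AND l2.dest = h2.distip)
    OR (l1.dest = r.distip AND l2.dest = r.distip
        AND l1.src = h1.distip AND l2.src = h2.distip))
  AND h1.distip <> h2.distip
  AND l1.bw_mbs >= 100 AND l2.bw_mbs >= 100
  AND r.distip = ?
"""


def build_toy_db():
    """Create an in-memory grid: two routers with two hosts each (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hosts (distip TEXT, mem_mb INTEGER, os TEXT);
        CREATE TABLE routers (distip TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('h1', 512, 'linux'), ('h2', 768, 'linux'),
                                 ('h3', 256, 'linux'), ('h4', 1024, 'windows');
        INSERT INTO iplinks VALUES ('r1', 'h1', 100), ('r1', 'h2', 1000),
                                   ('r2', 'h3', 100), ('r2', 'h4', 100);
    """)
    return conn


def scoped_cluster_finder(conn, rng):
    """Try routers in random order; return the first router with any result."""
    routers = [row[0] for row in conn.execute("SELECT distip FROM routers")]
    rng.shuffle(routers)
    for router in routers:
        rows = conn.execute(SCOPED_SQL, (router,)).fetchall()
        if rows:
            return router, rows
    return None, []


if __name__ == "__main__":
    print(scoped_cluster_finder(build_toy_db(), random.Random(0)))
```

Fixing `r.distip` does not remove any joins, but it restricts every join to the rows around one router, which is why the scoped query still degrades as the cluster size grows.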

  14. Approximate Cluster Finder • When searching for N hosts with total memory N*512 MB, we can approximate the query with “search for N hosts, each having at least 512 MB of memory”. • This reduces the number of joins or avoids them entirely. • However, this won’t find, say, N/2 hosts with 256 MB and N/2 hosts with 768 MB.
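
The gap between the exact and the approximate condition can be checked directly. This is a minimal sketch; the function names `meets_exact` and `meets_approx` and the sample memory lists are illustrative and not taken from the talk.

```python
def meets_exact(mem_mb, per_host_mb=512):
    """Original condition: total memory of the N hosts is at least N * per_host_mb."""
    return sum(mem_mb) >= len(mem_mb) * per_host_mb


def meets_approx(mem_mb, per_host_mb=512):
    """Approximate condition: every host individually has at least per_host_mb."""
    return all(m >= per_host_mb for m in mem_mb)


# N = 4: half the hosts have 256 MB and half have 768 MB.
mixed = [256, 768, 256, 768]
# N = 4: every host has at least 512 MB.
uniform = [512, 1024, 512, 768]

if __name__ == "__main__":
    for name, mems in (("mixed", mixed), ("uniform", uniform)):
        print(name, "exact:", meets_exact(mems), "approx:", meets_approx(mems))
```

Any set of hosts that passes the approximate condition also passes the exact one, so the approximation can only lose results and never adds wrong ones.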

  15. Approximate Cluster Finder
      SELECT R.DISTIP, H1.DISTIP
      FROM HOSTS H1, IPLINKS L1, ROUTERS R
      WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX' AND L1.BW_MBS >= 100
        AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
          OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
        AND R.DISTIP IN (
          SELECT R.DISTIP
          FROM HOSTS H1, IPLINKS L1, ROUTERS R
          WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX' AND L1.BW_MBS >= 100
            AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
              OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
          GROUP BY R.DISTIP
          HAVING COUNT(*) >= 2)
      ORDER BY R.DISTIP;
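
The approximate query can be exercised the same way. The sketch below again assumes SQLite, a made-up four-host grid, and hypothetical helper names; it also assumes that the constant 2 in `HAVING COUNT(*) >= 2` stands for the cluster size N, so it is passed as a parameter. The subquery uses its own aliases (`hs`, `ls`, `rs`) purely for readability.

```python
import sqlite3

# Slide 15's query with the cluster size bound to the HAVING clause.
APPROX_SQL = """
SELECT r.distip, h1.distip
FROM hosts h1, iplinks l1, routers r
WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
  AND ((l1.src = r.distip AND l1.dest = h1.distip)
    OR (l1.dest = r.distip AND l1.src = h1.distip))
  AND r.distip IN (
      SELECT rs.distip
      FROM hosts hs, iplinks ls, routers rs
      WHERE hs.mem_mb >= 512 AND hs.os = 'linux' AND ls.bw_mbs >= 100
        AND ((ls.src = rs.distip AND ls.dest = hs.distip)
          OR (ls.dest = rs.distip AND ls.src = hs.distip))
      GROUP BY rs.distip
      HAVING COUNT(*) >= ?)
ORDER BY r.distip
"""


def build_toy_db():
    """Create an in-memory grid: two routers with two hosts each (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hosts (distip TEXT, mem_mb INTEGER, os TEXT);
        CREATE TABLE routers (distip TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('h1', 512, 'linux'), ('h2', 768, 'linux'),
                                 ('h3', 256, 'linux'), ('h4', 1024, 'windows');
        INSERT INTO iplinks VALUES ('r1', 'h1', 100), ('r1', 'h2', 1000),
                                   ('r2', 'h3', 100), ('r2', 'h4', 100);
    """)
    return conn


def approximate_cluster_finder(conn, n):
    """Map each router with at least n qualifying hosts to those hosts."""
    candidates = {}
    for router, host in conn.execute(APPROX_SQL, (n,)):
        candidates.setdefault(router, []).append(host)
    return {router: sorted(hosts) for router, hosts in candidates.items()}


if __name__ == "__main__":
    print(approximate_cluster_finder(build_toy_db(), 2))
```

The number of joined tables stays at three inside and outside the subquery no matter what N is; only the threshold in the HAVING clause changes.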

  16. Scoped Approximate Cluster Finder • Combine the approximate query with the scoped query. • Scope to one randomly chosen router at a time; if no results are found, choose another random router and repeat the query. • Approximate the N-host join for 512*N MB of total memory with a search for N hosts, each with >= 512 MB. • Always a three-way join, regardless of the size of the cluster being searched for, and thus very scalable. • May need to search multiple routers.

  17. Scoped Approximate Cluster Finder
      SELECT H1.DISTIP
      FROM HOSTS H1, IPLINKS L1, ROUTERS R
      WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX' AND L1.BW_MBS >= 100
        AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
          OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
        AND R.DISTIP = X
        AND ROWNUM <= 2
      • The scoped approximate cluster finder has a fixed number of joins.
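
The retry loop around the scoped approximate query can be sketched as follows. Several things here are assumptions: SQLite stands in for Oracle, so `LIMIT ?` replaces the Oracle-specific `ROWNUM <= 2`; a router counts as a hit only when it returns exactly N hosts; and the toy data and helper names are invented for the example.

```python
import random
import sqlite3

# Slide 17's query; LIMIT is SQLite's stand-in for Oracle's ROWNUM bound.
SCOPED_APPROX_SQL = """
SELECT h1.distip
FROM hosts h1, iplinks l1, routers r
WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
  AND ((l1.src = r.distip AND l1.dest = h1.distip)
    OR (l1.dest = r.distip AND l1.src = h1.distip))
  AND r.distip = ?
LIMIT ?
"""


def build_toy_db():
    """Create an in-memory grid: two routers with two hosts each (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hosts (distip TEXT, mem_mb INTEGER, os TEXT);
        CREATE TABLE routers (distip TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('h1', 512, 'linux'), ('h2', 768, 'linux'),
                                 ('h3', 256, 'linux'), ('h4', 1024, 'windows');
        INSERT INTO iplinks VALUES ('r1', 'h1', 100), ('r1', 'h2', 1000),
                                   ('r2', 'h3', 100), ('r2', 'h4', 100);
    """)
    return conn


def scoped_approx_cluster_finder(conn, n, rng):
    """Visit routers in random order until one has n qualifying hosts."""
    routers = [row[0] for row in conn.execute("SELECT distip FROM routers")]
    rng.shuffle(routers)
    for router in routers:
        hosts = [row[0] for row in conn.execute(SCOPED_APPROX_SQL, (router, n))]
        if len(hosts) == n:  # assumption: fewer than n rows counts as "no result"
            return router, hosts
    return None, []


if __name__ == "__main__":
    print(scoped_approx_cluster_finder(build_toy_db(), 2, random.Random(1)))
```

Every attempt is the same three-way join with a different router bound to `X`, so the cost of one attempt does not depend on the cluster size; the cluster size only influences how many routers may have to be visited.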

  18. Time bounded queries • The query rewriter starts the query as a child process. • The parent kills the child process if no results are returned within the deadline.
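
The parent/child deadline scheme can be illustrated with Python's standard `subprocess` module. This is only a sketch of the idea, not the RGIS query rewriter: the helper name `run_with_deadline` is hypothetical, and the two child commands are placeholders that stand in for a fast and a slow database query.

```python
import subprocess
import sys


def run_with_deadline(argv, deadline_s):
    """Run argv as a child process; return its stdout, or None on timeout.

    subprocess.run kills the child and waits for it when the timeout
    expires, then raises TimeoutExpired in the parent.
    """
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=deadline_s)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# Placeholder children standing in for a fast and a slow query.
fast_query = [sys.executable, "-c", "print('h1 h2')"]
slow_query = [sys.executable, "-c", "import time; time.sleep(30)"]

if __name__ == "__main__":
    print(run_with_deadline(fast_query, deadline_s=20))
    print(run_with_deadline(slow_query, deadline_s=0.5))
```

Running the query in a separate process means the parent can give up cleanly at the deadline without relying on the database client to support cancellation.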

  19. Limitations of Scoped and Approximate Queries • The returned results are a subset of the original query's results, and a rewritten query may report no results even though the original query would return some after running for a long time. • Not all queries can be rewritten as scoped or approximate queries. • It is hard to automate scoped and approximate query rewriting.

  20. Performance Evaluation • Need to populate the database with a large amount of data. • Computational grids are still in early stages. • No large data sets available. • Use Smith MDS data for memory • We generate synthetic grids that are representative of the Internet. • Can generate very large grids

  21. GridG Generated Synthetic Grids • Three-level network: WAN, MAN, LAN. Nodes on the WAN and MAN are routers, while nodes on the LAN are hosts. • Links: IP links annotated with bandwidth and latency. • Hosts: annotated with memory size, architecture, number of processors, CPU clock rate, disk size, etc. • The user can control all the distributions and the size of the network.

  22. GridG: Synthesizing Realistic Computational Grids SC talk on Tuesday! http://www.cs.northwestern.edu/~urgis/GridG

  23. Experimental Setup • Dell PowerEdge 4400: dual Xeon 1 GHz processors, 2 GB memory, 240 GB RAID 5 storage system. • Oracle 9i Enterprise Edition, Red Hat Linux 7.1. • Each test is repeated either 25 or 100 times, and we report the average value.

  24. Performance of Various Query Techniques with Cluster Finder (time to run query, in seconds)
      Cluster size | Standard | Scoped | Approx | Scoped Approx
      2            | 21.44    | 2.27   | 7.62   | 1.16
      4            | >7200    | 2047.9 | 7.48   | 1.32
      8            | >9000    | >3600  | 7.46   | 1.43
      16           | N/A      | >3600  | 7.51   | 1.45
      32           | N/A      | >3600  | 7.65   | 5.96
      64           | N/A      | >3600  | >120   | 9.58
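
Only the cluster-size-2 row has a finite time for all four techniques, so it is the one row where speedups over the standard query can be computed directly. The snippet below is simple arithmetic on the slide's numbers; the variable names are illustrative.

```python
# Times in seconds for cluster size 2, copied from the table on slide 24.
standard_s = 21.44
other_techniques_s = {"scoped": 2.27, "approx": 7.62, "scoped_approx": 1.16}

# Speedup of each technique relative to the standard query.
speedup = {name: standard_s / t for name, t in other_techniques_s.items()}

if __name__ == "__main__":
    for name, factor in speedup.items():
        print(f"{name}: {factor:.2f}x faster than standard")
```

For larger clusters the standard and scoped columns are lower bounds (timeouts) or N/A, so any ratio computed from them would only be a lower bound as well.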

  25. Performance of Scoped Approximate Queries • Cluster Finder (our running example): Find N hosts, each running Linux, with total memory at least N*512 MB, all connected to the same router, where the bisection bandwidth is at least 100 Mbit/s. • Non-network query: Find N hosts with total memory at least N*512 MB. No joins needed at all.

  26. Performance of Scoped Approximate Queries (2) • Scalability with database size. • Scalability with the complexity of queries. • Scalability with concurrent users and update load.

  27. Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder)

  28. Performance of Scoped Approximate Query (101K hosts, Cluster Finder)

  29. Performance of Scoped Approximate Query (980K hosts, Cluster Finder)

  30. Performance of Scoped Approximate Query (9.8K hosts, Non-network query)

  31. Performance of Scoped Approximate Query (101K hosts, Non-network query)

  32. Performance of Scoped Approximate Query (980K hosts, Non-network query)

  33. Scalability with multiple concurrent users and background load • Other research has shown that GIS servers are updated frequently while serving requests. • GIS servers serve multiple concurrent users. • We evaluate scoped approximate queries under concurrent users and update load. • Concurrent users: execute queries repeatedly. • Update load: execute transactional updates on randomly selected hosts as fast as possible, about 200 updates/second.

  34. Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder, with concurrent users, looking for 64 nodes)

  35. Performance of Scoped Approximate Query (9.8K hosts, Non-network query, with concurrent users, looking for 64 nodes)

  36. Conclusions • We described and evaluated two query techniques that trade off query time against the size of the result set: scoped and approximate queries. • Combining scoped and approximate queries can dramatically reduce response time and server load.

  37. For more information • GridG and related paper: http://www.cs.northwestern.edu/~urgis/GridG “Synthesizing Realistic Computational Grids”, in Proceedings of SC03. • RGIS and related paper: http://www.cs.northwestern.edu/~urgis/ “Nondeterministic Queries in a Relational Grid Information Service”, in Proceedings of SC03.
