Scoped and Approximate Queries in a Relational Grid Information Service

Dong Lu, Peter A. Dinda, Jason A. Skicewicz
Prescience Lab, Dept. of Computer Science, Northwestern University, Evanston, IL 60201

Outline • Introduction and motivation • Powerful queries, but expensive to execute • Trade off between result size and query time • Our solutions: Scoped query, Approximate query, Scoped Approximate query • Nondeterministic query (SC Talk on Tuesday) • Performance Evaluation

What is RGIS? • GIS: A Grid Information Service stores information about the resources and services in a distributed computing environment and answers queries about it. • RGIS: A Grid Information Service based on the relational data model.

Why RGIS? • RGIS can answer complex compositional queries • Relational algebra (SQL) • Joins • Difficult in a hierarchical model (directory service) • Other reasons • Indexes separate from data model • Schema evolution • Transactional insert/update/delete • Consistency

RGIS Model of a Grid • Annotated network topology graph • Annotation examples • Hosts: memory, disk, OS, NICs, etc. • Router/Switch: backplane bandwidth, ports • Link: latency and bandwidth • Highly dynamic data in streams, not DB • Virtualization, Futures, Leases • Virtual machines
[Diagram layers: Software (module, endpoint); Network (router, iplink, host); Data link (maclink, macswitch); Physical (connectorswitch, connectorlink)]

The RGIS Design (Per Site)

Challenge/Trade off • Complex queries to a relational database can take a long time: hours, days, or even weeks, when we want seconds. • Typically, the returned result set is unnecessarily large because all results are returned. • We need mechanisms to trade off query time against the size of the result set.

Challenge/Trade off [Figure labels: All results, Approximate results, Nondeterministic results, Scoped results]

Example: Cluster Finder • Find N hosts connected to the same router, with total memory of at least N*512 MB, all running Linux, such that the bisection bandwidth of the cluster is no less than 100 Mbits/sec. [Diagram labels: Routers, IP links, Hosts, Cluster]

Original SQL for 2 Host Cluster Finder
SELECT [scoped-approx] h1.distip, h2.distip
FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
WHERE h1.mem_mb+h2.mem_mb>=1024 and h1.os='linux' and h2.os='linux'
  and ((l1.src=r.distip and l2.src=r.distip and l1.dest=h1.distip and l2.dest=h2.distip)
    or (l1.dest=r.distip and l2.dest=r.distip and l1.src=h1.distip and l2.src=h2.distip))
  and h1.distip<>h2.distip
  and l1.bw_mbs >= 100 and l2.bw_mbs >= 100
[SCOPED BY r.distip=X]
WITHIN 100 seconds;

Original SQL for Cluster Finder • Finding an N-node cluster requires a (2*N+1)-way join, which does not scale. [Diagram labels: Routers, IP links, Hosts, Cluster 1, Cluster 2]
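The join count above can be checked by generating the exact query for an arbitrary N. The following is a minimal sketch, not the RGIS query rewriter: the function name `cluster_finder_sql` is hypothetical, and it assumes every link is stored as router -> host so that the OR over link direction can be dropped.

```python
from itertools import combinations


def cluster_finder_sql(n, mem_per_host_mb=512, min_bw=100):
    """Return (sql, table_count) for the exact N-host Cluster Finder.

    Simplifying assumption (not from the slides): every link is stored as
    router -> host, so the OR over link direction is omitted.
    """
    hosts = [f"h{i}" for i in range(1, n + 1)]
    links = [f"l{i}" for i in range(1, n + 1)]
    # N host aliases + N link aliases + 1 router alias = 2*N + 1 tables.
    tables = ([f"hosts {h}" for h in hosts]
              + [f"iplinks {l}" for l in links]
              + ["routers r"])

    # Total-memory condition over all N hosts.
    conds = [" + ".join(f"{h}.mem_mb" for h in hosts)
             + f" >= {n * mem_per_host_mb}"]
    # Per-host conditions: OS, attachment to the shared router, bandwidth.
    for h, l in zip(hosts, links):
        conds += [f"{h}.os = 'linux'",
                  f"{l}.src = r.distip",
                  f"{l}.dest = {h}.distip",
                  f"{l}.bw_mbs >= {min_bw}"]
    # All hosts must be pairwise distinct.
    conds += [f"{a}.distip <> {b}.distip" for a, b in combinations(hosts, 2)]

    sql = ("SELECT " + ", ".join(f"{h}.distip" for h in hosts)
           + "\nFROM " + ", ".join(tables)
           + "\nWHERE " + "\n  AND ".join(conds))
    return sql, len(tables)
```

Calling `cluster_finder_sql(n)[1]` for n = 2, 4, 8, 16 gives join widths of 5, 9, 17, and 33 tables, which is why the exact query becomes impractical as the cluster size grows.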

Scoped Cluster Finder • Query the hosts around a random router. [Diagram labels: Routers, IP links, Hosts]

Scoped Cluster Finder (scoped SQL)
SELECT H1.DISTIP, H2.DISTIP
FROM HOSTS H1, HOSTS H2, IPLINKS L1, IPLINKS L2, ROUTERS R
WHERE H1.MEM_MB+H2.MEM_MB>=1024 AND H1.OS='LINUX' AND H2.OS='LINUX'
  AND ((L1.SRC=R.DISTIP AND L2.SRC=R.DISTIP AND L1.DEST=H1.DISTIP AND L2.DEST=H2.DISTIP)
    OR (L1.DEST=R.DISTIP AND L2.DEST=R.DISTIP AND L1.SRC=H1.DISTIP AND L2.SRC=H2.DISTIP))
  AND H1.DISTIP<>H2.DISTIP
  AND L1.BW_MBS >= 100 AND L2.BW_MBS >= 100
  AND R.DISTIP = X;
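To make the effect of scoping concrete, here is a small sketch that runs the 2-host query with and without the router restriction. It uses Python's built-in SQLite rather than Oracle, and the toy grid (two routers, four hosts), the helper names, and the column types are all assumptions made for illustration; only the table and column names come from the slides.

```python
import sqlite3

# The 2-host Cluster Finder from the slides, without the scoping predicate.
BASE_QUERY = """
SELECT h1.distip, h2.distip
FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
WHERE h1.mem_mb + h2.mem_mb >= 1024
  AND h1.os = 'linux' AND h2.os = 'linux'
  AND ((l1.src = r.distip AND l2.src = r.distip
        AND l1.dest = h1.distip AND l2.dest = h2.distip)
    OR (l1.dest = r.distip AND l2.dest = r.distip
        AND l1.src = h1.distip AND l2.src = h2.distip))
  AND h1.distip <> h2.distip
  AND l1.bw_mbs >= 100 AND l2.bw_mbs >= 100
"""


def make_toy_grid():
    """Hypothetical toy data: router r1 with hosts a, b; router r2 with c, d."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE routers (distip TEXT PRIMARY KEY);
        CREATE TABLE hosts (distip TEXT PRIMARY KEY, mem_mb INTEGER, os TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('a', 512, 'linux'), ('b', 512, 'linux'),
                                 ('c', 1024, 'linux'), ('d', 256, 'linux');
        INSERT INTO iplinks VALUES ('r1', 'a', 100), ('r1', 'b', 100),
                                   ('c', 'r2', 100), ('d', 'r2', 100);
    """)
    return db


def full_query(db):
    """Unscoped query: considers every router in the database."""
    return db.execute(BASE_QUERY).fetchall()


def scoped_query(db, router):
    """Scoped query: the same join, restricted to a single router X."""
    return db.execute(BASE_QUERY + " AND r.distip = ?", (router,)).fetchall()
```

On this toy grid the scoped query for `'r1'` returns only the pairs formed from hosts `a` and `b`, a subset of what the unscoped query returns. The join is still five tables wide; scoping only shrinks the set of candidate rows.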

Approximate Cluster Finder • When searching for N hosts with total memory of N*512 MB, we can approximate the query with “search for N hosts, each having at least 512 MB of memory”. • This reduces the number of joins or avoids them entirely. • However, this won’t find, say, N/2 hosts with 256 MB and N/2 hosts with 768 MB.

Approximate Cluster Finder (approximate SQL)
SELECT R.DISTIP, H1.DISTIP
FROM HOSTS H1, IPLINKS L1, ROUTERS R
WHERE H1.MEM_MB>=512 AND H1.OS='LINUX' AND L1.BW_MBS >= 100
  AND ((L1.SRC=R.DISTIP AND L1.DEST=H1.DISTIP) OR (L1.DEST = R.DISTIP AND L1.SRC=H1.DISTIP))
  AND R.DISTIP IN (
    SELECT R.DISTIP
    FROM HOSTS H1, IPLINKS L1, ROUTERS R
    WHERE H1.MEM_MB>=512 AND H1.OS='LINUX' AND L1.BW_MBS>=100
      AND ((L1.SRC=R.DISTIP AND L1.DEST=H1.DISTIP) OR (L1.DEST = R.DISTIP AND L1.SRC=H1.DISTIP))
    GROUP BY R.DISTIP
    HAVING COUNT(*) >= 2)
ORDER BY R.DISTIP;
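The trade-off of the approximation can be seen on a small example. The sketch below runs the approximate query in SQLite on a hypothetical toy grid (the data, helper names, and the parameterised cluster size are assumptions, not part of the slides). Router `r2` has hosts with 1024 MB and 256 MB, which together satisfy the exact total-memory condition for N = 2, yet the approximate query does not report them.

```python
import sqlite3

# Approximate Cluster Finder: each host must have >= 512 MB on its own.
APPROX_QUERY = """
SELECT r.distip, h1.distip
FROM hosts h1, iplinks l1, routers r
WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
  AND ((l1.src = r.distip AND l1.dest = h1.distip)
    OR (l1.dest = r.distip AND l1.src = h1.distip))
  AND r.distip IN (
      SELECT r.distip
      FROM hosts h1, iplinks l1, routers r
      WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
        AND ((l1.src = r.distip AND l1.dest = h1.distip)
          OR (l1.dest = r.distip AND l1.src = h1.distip))
      GROUP BY r.distip
      HAVING COUNT(*) >= ?)
ORDER BY r.distip
"""


def make_toy_grid():
    """Hypothetical toy data: router r1 with hosts a, b; router r2 with c, d."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE routers (distip TEXT PRIMARY KEY);
        CREATE TABLE hosts (distip TEXT PRIMARY KEY, mem_mb INTEGER, os TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('a', 512, 'linux'), ('b', 512, 'linux'),
                                 ('c', 1024, 'linux'), ('d', 256, 'linux');
        INSERT INTO iplinks VALUES ('r1', 'a', 100), ('r1', 'b', 100),
                                   ('c', 'r2', 100), ('d', 'r2', 100);
    """)
    return db


def approximate_clusters(db, n):
    """Return (router, host) rows for routers with at least n qualifying hosts."""
    return db.execute(APPROX_QUERY, (n,)).fetchall()


def total_memory(db, host_ips):
    """Sum of mem_mb over the given hosts, for checking the exact condition."""
    marks = ", ".join("?" for _ in host_ips)
    row = db.execute(
        f"SELECT SUM(mem_mb) FROM hosts WHERE distip IN ({marks})",
        list(host_ips)).fetchone()
    return row[0]
```

Here `approximate_clusters(db, 2)` reports only router `r1`, while `total_memory(db, ['c', 'd'])` shows that the pair under `r2` would have met the exact 2*512 MB requirement. The approximate query never returns a cluster that violates the total-memory condition, but it can miss clusters that satisfy it.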

Scoped Approximate Cluster Finder • Combine the approximate query with the scoped query. • Scope to one randomly chosen router at a time; if no results are found, choose another random router and repeat the query. • Approximate the N-host join for 512*N MB of total memory with a search for N hosts, each with >= 512 MB. • Always a three-way join, regardless of the size of the cluster being searched for, and thus very scalable. • May need to search multiple routers.

Scoped Approximate Cluster Finder (scoped approximate SQL)
SELECT H1.DISTIP
FROM HOSTS H1, IPLINKS L1, ROUTERS R
WHERE H1.MEM_MB>=512 AND H1.OS='LINUX' AND L1.BW_MBS >= 100
  AND ((L1.SRC=R.DISTIP AND L1.DEST=H1.DISTIP) OR (L1.DEST = R.DISTIP AND L1.SRC=H1.DISTIP))
  AND R.DISTIP=X AND ROWNUM <=2
The scoped approximate cluster finder has a fixed number of joins.
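The retry-over-random-routers procedure can be sketched as a short loop around the three-way join. This is an illustration under stated assumptions, not the RGIS implementation: it uses SQLite, so `LIMIT` stands in for Oracle's `ROWNUM`; routers are visited in a shuffled order (random choice without repetition); and the toy data and the name `find_cluster` are hypothetical.

```python
import random
import sqlite3

# Three-way join scoped to one router; LIMIT replaces Oracle's ROWNUM here.
SCOPED_APPROX = """
SELECT h1.distip
FROM hosts h1, iplinks l1, routers r
WHERE h1.mem_mb >= 512 AND h1.os = 'linux' AND l1.bw_mbs >= 100
  AND ((l1.src = r.distip AND l1.dest = h1.distip)
    OR (l1.dest = r.distip AND l1.src = h1.distip))
  AND r.distip = ?
LIMIT ?
"""


def make_toy_grid():
    """Hypothetical toy data: router r1 with hosts a, b; router r2 with c, d."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE routers (distip TEXT PRIMARY KEY);
        CREATE TABLE hosts (distip TEXT PRIMARY KEY, mem_mb INTEGER, os TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
        INSERT INTO routers VALUES ('r1'), ('r2');
        INSERT INTO hosts VALUES ('a', 512, 'linux'), ('b', 512, 'linux'),
                                 ('c', 1024, 'linux'), ('d', 256, 'linux');
        INSERT INTO iplinks VALUES ('r1', 'a', 100), ('r1', 'b', 100),
                                   ('c', 'r2', 100), ('d', 'r2', 100);
    """)
    return db


def find_cluster(db, n, rng=None):
    """Try routers in random order; return (router, hosts) or None."""
    rng = rng or random.Random()
    routers = [row[0] for row in db.execute("SELECT distip FROM routers")]
    rng.shuffle(routers)
    for router in routers:
        hosts = [row[0] for row in db.execute(SCOPED_APPROX, (router, n))]
        # The row limit caps the result at n; fewer rows means this router
        # does not have enough qualifying hosts, so move on to the next one.
        if len(hosts) == n:
            return router, hosts
    return None
```

The number of joined tables stays at three no matter how large `n` is; only the row limit changes. The cost moves into the outer loop, which may have to visit several routers before one with enough qualifying hosts is found.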

Time bounded queries • The query rewriter starts the query as a child process. • The parent kills the child process if no results are returned within the deadline.
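The parent/child deadline pattern can be sketched with Python's standard `subprocess` module. This is only an illustration of the mechanism: the slides do not say how RGIS launches or kills the child, the child here is a stand-in Python snippet rather than a database query, and `run_with_deadline` is a hypothetical name.

```python
import subprocess
import sys


def run_with_deadline(child_code, deadline_s):
    """Run child_code in a child Python process.

    Returns the child's stdout, or None if the deadline passes first.
    The child stands in for the process that executes the query.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True, text=True, timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return None
    return done.stdout
```

For example, `run_with_deadline("print('a b')", 30)` returns the child's output, while a child that sleeps for a minute under a half-second deadline yields `None`, corresponding to a query that reports no results within the time bound.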

Limitations of Scoped and Approximate queries • The returned results are a subset of the original query's results, and it is possible to report no results when the original query would return results after running for a long time. • Not all queries can be written as Scoped or Approximate queries. • It is hard to automate the Scoped and Approximate query rewriting.

Performance Evaluation • Need to populate the database with a large amount of data. • Computational grids are still in their early stages. • No large data sets are available. • Use Smith MDS data for memory • We generate synthetic grids that are representative of the Internet. • Can generate very large grids

GridG Generated Synthetic Grids • Three-level network: WAN, MAN, LAN. Nodes on WAN, MAN are routers, while nodes on LAN are hosts. • Links: IP links annotated with bandwidth and latency. • Hosts: annotated with memory size, architecture, number of processors, CPU clock rate, disk size, etc. • User can control all the distributions and the size of network.

GridG: Synthesizing Realistic Computational Grids SC talk on Tuesday! http://www.cs.northwestern.edu/~urgis/GridG

Experimental Setup • Dell PowerEdge 4400: dual Xeon 1 GHz processors, 2 GB memory, 240 GB RAID 5 storage system. • Oracle 9i Enterprise Edition, Red Hat Linux 7.1. • Each test is repeated either 25 or 100 times, and we report the average value.

Performance of Various Query Techniques with Cluster Finder (time to run query, in seconds)
Cluster size | Standard | Scoped | Approx | Scoped Approx
2 | 21.44 | 2.27 | 7.62 | 1.16
4 | >7200 | 2047.9 | 7.48 | 1.32
8 | >9000 | >3600 | 7.46 | 1.43
16 | N/A | >3600 | 7.51 | 1.45
32 | N/A | >3600 | 7.65 | 5.96
64 | N/A | >3600 | >120 | 9.58

Performance of Scoped Approximate Queries • Cluster Finder: Find N hosts, each running Linux, with total memory of at least N*512 MB, all connected to the same router, with a bisection bandwidth of at least 100 Mbits/sec. • Our running example • Non-network query: Find N hosts with total memory of at least N*512 MB. • No joins needed at all

Performance of Scoped Approximate Queries (2) • Scalability with database size. • Scalability with the complexity of queries. • Scalability with concurrent users and update load.

Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder)

Performance of Scoped Approximate Query (101K hosts, Cluster Finder)

Performance of Scoped Approximate Query (980K hosts, Cluster Finder)

Performance of Scoped Approximate Query (9.8K hosts, Non-network query)

Performance of Scoped Approximate Query (101K hosts, Non-network query)

Performance of Scoped Approximate Query (980K hosts, Non-network query)

Scalability with multiple concurrent users and background load • Other research has shown that GIS servers must handle frequent updates while serving requests. • GIS servers serve multiple concurrent users. • Evaluate scoped approximate queries with concurrent users and update load. • Concurrent users: execute queries repeatedly. • Update load: execute transactional updates on randomly selected hosts as fast as possible. • About 200 updates/second

Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder, with Concurrent Users, looking for 64 nodes)

Performance of Scoped Approximate Query (9.8K hosts, Non-network query, with Concurrent Users, looking for 64 nodes)

Conclusions • Described and evaluated two query techniques that trade off query time against the size of the result set: scoped and approximate queries. • Combining scoped and approximate queries can dramatically reduce response time and server load.

For more information • GridG and related paper: http://www.cs.northwestern.edu/~urgis/GridG “Synthesizing Realistic Computational Grids”, in Proceedings of SC03. • RGIS and related paper: http://www.cs.northwestern.edu/~urgis/ “Nondeterministic Queries in a Relational Grid Information Service”, in Proceedings of SC03.
