Service-Based Distributed Query Processing on the Grid
M. Nedim Alpdemir
Department of Computer Science, University of Manchester
Service-based approaches facilitate virtualisation of resources, which leads to a convenient cooperation model for distributed systems (e.g. the Grid).
Context
[Figure: chart with axes Data Complexity and Computational Complexity, positioning the Classical Web, the Classical Grid, the Semantic Web, Web Services and OGSA.]
Web Services Are Not Enough
• Web Services lack facilities for:
  • Computational resource description.
  • Computational resource discovery.
  • Application staging.
• Grid Services combine:
  • Web Services for service description and invocation.
  • Grid middleware for computational resource description and utilisation.
Open Grid Services Architecture (OGSA)
• OGSA services are described using WSDL.
• OGSA service instances are:
  • Created dynamically by factories.
  • Identified through Grid Service Handles.
  • Self-describing through Service Data Elements.
  • Stateful, with soft-state lifetime management.
Current status:
• Globus 3 beta release in June 2003: www.globus.org.
• Supports service instances and access to other Globus services.
• Core database services from the OGSA-DAI project are tracking Globus releases.
Grid Database Service (GDS)
• Databases are made available on the Grid through integration with other Grid services and the provision of standard interfaces.
• Build upon OGSA to deliver high-level data management functionality for the Grid.
• Seek to provide two classes of components:
  • Data access components
  • Data integration components
Distributed Query Processing
• DQP involves a single query referencing data stored at multiple sites.
• The locations of the data may be transparent to the author of the query.

    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = 'GO:0005942'
    and p.proteinId = t.proteinId

J. Smith, A. Gounaris, P. Watson, N. Paton, A. Fernandes, R. Sakellariou, Distributed Query Processing on the Grid, 3rd Int. Workshop on Grid Computing, Springer-Verlag, 279-290, 2002.
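To make the example query concrete, here is a minimal Python sketch that evaluates the same join over two in-memory lists standing in for the two sites. The sample rows and the `blast` stub are hypothetical; in the setting described here Blast would be a remote analysis service, and the tables would live at different sites.

```python
# Hypothetical sample rows standing in for data held at two remote sites.
protein = [
    {"proteinId": "P1", "sequence": "MKV"},
    {"proteinId": "P2", "sequence": "GAT"},
]
protein_term = [
    {"proteinId": "P1", "termId": "GO:0005942"},
    {"proteinId": "P2", "termId": "GO:0008372"},
]


def blast(sequence):
    """Stub for the Blast analysis service (assumption: returns a string)."""
    return f"blast({sequence})"


def run_query(proteins, terms, term_id):
    """Nested-loop evaluation of:
    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = term_id and p.proteinId = t.proteinId
    """
    return [
        (p["proteinId"], blast(p["sequence"]))
        for p in proteins
        for t in terms
        if t["termId"] == term_id and p["proteinId"] == t["proteinId"]
    ]


result = run_query(protein, protein_term, "GO:0005942")
```

A DQP system produces the same answer, but decides for itself where the scans, the join and the Blast calls run.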
Mutual Benefit
• The Grid needs DQP:
  • Declarative, high-level resource integration with implicit parallelism.
  • DQP-based solutions should in principle run faster than those manually coded.
• DQP needs the Grid:
  • Systematic access to remote data and computational resources.
  • Dynamic resource discovery and allocation.
A Service-Based DQP Architecture
[Figure: architecture diagram; recoverable label: "GDQ PortType: Operations and Messages".]
Service-based DQP framework
• Service-based in two orthogonal senses:
  • Supports querying over data storage and analysis resources made available as services.
  • Construction of distributed query plans and their execution over the Grid are factored out as services.
• Uses the emerging standard for GDSs to provide consistent access to database metadata and to interact with databases on the Grid.
• A query may refer to both database (GDS) and computational services.
Extends OGSA & OGSA-DAI …
• By adding a new port type and two new services (and their corresponding factories):
• Grid Distributed Query (GDQ) Port Type
  • importSchema operation:
      GDQSR importSchema(GDQDataSourceList GDSL)
  • GDSL: a document containing:
    • The list of data sources. Each item on this list should contain the handle of a GDS Factory, along with an instance creation document for that factory.
    • And/or a set of WSDL URLs for the analysis services to be used.
Continued …
• Grid Distributed Query Service (GDQS)
  • Wraps an existing query compiler/optimiser system, which compiles, optimises, partitions and schedules distributed query execution plans.
  • Obtains and maintains the metadata and computational resource information required for the above.
• Grid Query Evaluator Service (GQES)
  • Each GQES instance is an execution node and is dynamically created by the GDQS on the node on which it is scheduled to run.
  • A GQES is in charge of the partition of the query execution plan assigned to it by the GDQS and is responsible for dispatching its partial results to other GQESs.
Setting up a GDQS
• The set-up strategy depends on the lifetime model of the GDQS and GDSs.
• A GDQS instance is created per client.
  • But it can serve multiple queries.
  • This model avoids the complexity of multi-user interactions while ensuring that the set-up cost is not high.
• The set-up phase involves:
  • Importing schemas of participating data sources.
  • Importing WSDL documents of participating analysis services.
  • Collecting computational resource metadata (implicit).
Issues in Initialisation
• Q: When is a GDQS bound to a particular GDS?
  A: When the schema of the GDS is imported.
• Q: What is the lifespan of a GDS used by a GDQS?
  A: The GDS is kept alive until the GDQS expires.
• Q: Are GDSs shared by multiple GDQSs?
  A: No.
• Q: When is a GQES created?
  A: When a query is about to be evaluated that needs it.
• Q: What is the lifespan of a GQES?
  A: It lasts only as long as a single query.
• Q: Is a GQES shared among several queries or GDQSs?
  A: No.
An example of a data source import list

<GDQDataSourceList>
  <importedDataSource>
    <GDSFactoryHandle>
      http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactory
    </GDSFactoryHandle>
    <GDSCreateDocument>
      <gridDataServiceFactoryCreate>
        <dataResourceName> myDataResource </dataResourceName>
      </gridDataServiceFactoryCreate>
    </GDSCreateDocument>
  </importedDataSource>
  <importedService>
    <wsdlURL>
      http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL
    </wsdlURL>
  </importedService>
</GDQDataSourceList>
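A document of this shape can be assembled programmatically before it is passed to importSchema. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `build_source_list` and its argument layout are hypothetical, and only the element names are taken from the example document.

```python
import xml.etree.ElementTree as ET


def build_source_list(gds_sources, wsdl_urls):
    """Build a GDQDataSourceList element.

    gds_sources: list of (factory_handle, data_resource_name) pairs.
    wsdl_urls:   list of WSDL URLs for analysis services.
    (Hypothetical helper; element names follow the example document.)
    """
    root = ET.Element("GDQDataSourceList")
    for handle, resource in gds_sources:
        src = ET.SubElement(root, "importedDataSource")
        ET.SubElement(src, "GDSFactoryHandle").text = handle
        create_doc = ET.SubElement(src, "GDSCreateDocument")
        create = ET.SubElement(create_doc, "gridDataServiceFactoryCreate")
        ET.SubElement(create, "dataResourceName").text = resource
    for url in wsdl_urls:
        svc = ET.SubElement(root, "importedService")
        ET.SubElement(svc, "wsdlURL").text = url
    return root


doc = build_source_list(
    [("http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactory",
      "myDataResource")],
    ["http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL"],
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Building the document from lists keeps the set of imported sources task-specific: a different task simply passes a different pair of lists.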
An example of a Query Document

<request name="myRequest">
  <oqlQueryStatement name="myStat">
    <dataResource="myGenomeDB">
    <expression>
      select p.proteinId, Blast(p.sequence)
      from proteins p, proteinTerms t
      where t.termId = 'GO:0005942'
      and p.proteinId = t.proteinId
    </expression>
  </oqlQueryStatement>
  <deliverToGDT name="delivery">
    <fromLocal name="myStat">
    <toGDT streamId="otherrequestasynch/d1" mode="full">
      http://ogsadai.org.uk/GDTService/my/GDT/GSH
    </toGDT>
  </deliverToGDT>
</request>
Query Compilation
[Figure: compilation pipeline. Labels in order: OQL Parser, Logical Optimiser, Physical Optimiser (single-node optimiser); Partitioner, Scheduler (multi-node optimiser); Evaluator.]
Logical Optimisation
• Plan is expressed using a logical algebra.
• Heuristic-based application of equivalence laws.
• Multiple equivalent plans generated.
[Figure: logical plan for the example query, root first]
    reduce
      op_call (Blast)
        join (proteinId)
          reduce
            scan (protein)
          reduce
            scan termID=… (proteinTerm)
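As an illustration of applying an equivalence law heuristically, the sketch below pushes a selection beneath a join so that it sits directly on the input it filters. This is a textbook rewrite, not necessarily one of the laws used by the optimiser described here; the tuple-based plan representation and the `select` operator are assumptions made for the example.

```python
# Plans are nested tuples: (operator, argument, *children).
# Assumption: a "select" argument is a (table_name, condition) pair.

def tables_of(plan):
    """Return the set of table names scanned anywhere beneath this plan."""
    op, arg, *children = plan
    if op == "scan":
        return {arg}
    return set().union(*(tables_of(c) for c in children))


def push_select_down(plan):
    """Rewrite select(join(l, r)) into a join with the selection applied
    to whichever input consists solely of the predicate's table."""
    op, arg, *children = plan
    children = [push_select_down(c) for c in children]
    if op == "select" and children[0][0] == "join":
        table, _condition = arg
        _, key, left, right = children[0]
        if tables_of(left) == {table}:
            return ("join", key, push_select_down(("select", arg, left)), right)
        if tables_of(right) == {table}:
            return ("join", key, left, push_select_down(("select", arg, right)))
    return (op, arg, *children)


predicate = ("proteinTerm", "termId = 'GO:0005942'")
before = ("select", predicate,
          ("join", "proteinId", ("scan", "protein"), ("scan", "proteinTerm")))
after = push_select_down(before)
```

After the rewrite the join sees only the proteinTerm rows that satisfy the predicate, which is why such rules are applied eagerly rather than costed.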
Physical Optimisation
• Plan is expressed using a physical algebra.
• Logical operators replaced with physical operators.
• Cost-based ranking of plans.
[Figure: physical plan for the example query, root first]
    reduce
      op_call (Blast)
        hash_join (proteinId)
          reduce
            table_scan (protein)
          reduce
            index_scan termID=… (proteinTerm)
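The step from a logical join to a physical hash_join can be pictured with a minimal in-memory hash join. This is a generic sketch of the technique in Python, not the evaluator's implementation; the row format (dictionaries) and the sample data are assumptions.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Build a hash table on one input, then stream the other input past it,
    yielding one merged row per matching pair."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    for row in probe_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample rows.
terms = [{"proteinId": "P1", "termId": "GO:0005942"}]
proteins = [
    {"proteinId": "P1", "sequence": "MKV"},
    {"proteinId": "P2", "sequence": "GAT"},
]
joined = list(hash_join(terms, proteins, "proteinId"))
```

Building on the smaller (already filtered) input keeps the hash table small, which is the kind of choice a cost-based ranking of plans is meant to capture.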
Partitioning
• Plan is expressed in a parallel algebra.
• Parallel algebra = physical algebra + exchange.
• Exchange operators are placed where data movement may be required.
[Figure: partitioned plan for the example query, root first]
    reduce
      op_call (Blast)
        exchange
          hash_join (proteinId)
            exchange
              reduce
                table_scan (protein)
            exchange
              reduce
                index_scan termID=… (proteinTerm)
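The placement of exchange operators can be sketched as a tree rewrite. The rule used below (wrap every hash_join and each of its inputs in an exchange) is an assumption chosen because it reproduces the example plan; the actual partitioner's placement rules are not given in the slides.

```python
# Plans are nested tuples: (operator, argument, *children).

def add_exchanges(plan):
    """Insert exchange operators above each hash_join and above each of
    its inputs, marking where data may have to move between nodes."""
    op, arg, *children = plan
    children = [add_exchanges(c) for c in children]
    if op == "hash_join":
        children = [("exchange", None, c) for c in children]
        return ("exchange", None, (op, arg, *children))
    return (op, arg, *children)


def count_ops(plan, name):
    """Count the operators with the given name in a plan."""
    op, _arg, *children = plan
    return (op == name) + sum(count_ops(c, name) for c in children)


physical = (
    "reduce", None,
    ("op_call", "Blast",
     ("hash_join", "proteinId",
      ("reduce", None, ("table_scan", "protein")),
      ("reduce", None, ("index_scan", "proteinTerm")))),
)
parallel = add_exchanges(physical)
```

Each exchange is a potential partition boundary, so a plan with three exchanges can be cut into at most four partitions.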
Scheduling
• Partitions are allocated to Grid nodes; partitions may be merged during scheduling.
• Expressed by decorating the parallel algebra expression.
• Heuristic algorithm considers memory use and network costs.
[Figure: scheduled plan, root first. Node labels shown on the slide: 3,4; 1; 1; 2.]
    reduce
      op_call (Blast)
        exchange
          hash_join (proteinId)
            exchange
              reduce
                table_scan (protein)
            exchange
              reduce
                table_scan termID=S92 (proteinTerm)
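A scheduling heuristic that weighs memory use against network cost might look like the greedy sketch below. It is an illustrative stand-in, not the scheduler's actual algorithm: the memory figures, node names, and the rule "prefer the node that already holds the data, otherwise the node with the most free memory" are all assumptions.

```python
def schedule(partitions, nodes):
    """Greedily assign partitions to nodes.

    partitions: list of dicts with 'name', 'memory' and optional 'data_node'.
    nodes:      dict mapping node name to free memory.
    Returns a dict mapping partition name to node name.
    """
    free = dict(nodes)
    placement = {}
    # Place the most memory-hungry partitions first.
    for part in sorted(partitions, key=lambda p: p["memory"], reverse=True):
        local = part.get("data_node")
        if local in free and free[local] >= part["memory"]:
            chosen = local                    # avoids shipping the data
        else:
            chosen = max(free, key=free.get)  # node with most free memory
            if free[chosen] < part["memory"]:
                raise ValueError(f"no node can hold {part['name']}")
        free[chosen] -= part["memory"]
        placement[part["name"]] = chosen
    return placement


# Hypothetical partitions and nodes.
partitions = [
    {"name": "scan_protein", "memory": 40, "data_node": "n1"},
    {"name": "scan_terms", "memory": 20, "data_node": "n2"},
    {"name": "join", "memory": 80},
    {"name": "blast", "memory": 50},
]
placement = schedule(partitions, {"n1": 100, "n2": 100, "n3": 200})
```

In this run the join and Blast partitions end up on the same node, a simple analogue of partitions being merged during scheduling.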
Query Evaluation
• Query installation:
  • GQESs created for partitions as required.
  • Partitions sent to GQESs.
• Query evaluation:
  • Partitions evaluated using iterator model.
  • Pipelined and partitioned parallelism.
  • Results conveyed to client.
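The iterator model and the pipelining it enables can be sketched with Python generators, where each operator pulls rows from its child one at a time. The operator names, the sample rows and the `scanned` log are assumptions for illustration; the evaluator itself is not written this way.

```python
scanned = []  # records which rows the scan has produced so far


def table_scan(rows):
    """Leaf operator: yield stored rows one at a time."""
    for row in rows:
        scanned.append(row["proteinId"])
        yield row


def select(child, predicate):
    """Pass on only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, fields):
    """Keep only the named fields of each row."""
    for row in child:
        yield {f: row[f] for f in fields}


# Hypothetical sample rows.
rows = [
    {"proteinId": "P1", "termId": "GO:0005942"},
    {"proteinId": "P2", "termId": "GO:0008372"},
    {"proteinId": "P3", "termId": "GO:0005942"},
]
plan = project(
    select(table_scan(rows), lambda r: r["termId"] == "GO:0005942"),
    ["proteinId"],
)

first = next(plan)                   # pulls just enough input for one result
scanned_after_first = list(scanned)
rest = list(plan)                    # drains the remaining results
```

Because the first result is available after scanning a single row, downstream operators (or another evaluator across an exchange) can start work before the scan finishes.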
An example of a query sub-plan passed to a GQES

<Partitions>
  ...
  <Partition isRoot="0">
    <evaluatorURI>
      http://mach1.cs.man.ac.uk:8080/ogsa/services/ogsadai/GQESFactory/GQES1
    </evaluatorURI>
    ...
    <Operator operatorID="2" operatorType="SEQ_SCAN">
      <SEQ_SCAN>
        <tupleType>
          <type> string </type> <name> proteinTerms.GOproteinID </name>
          <type> string </type> <name> proteinTerms.term </name>
        </tupleType>
        <inputOperator> <OperatorID></OperatorID> </inputOperator>
        <DataResourceName> proteinTerms </DataResourceName>
        <GDSHandle>
          http://mach1.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactoryP2R1/GDS1
        </GDSHandle>
        <predicateExpr>
          <predicate>
            <comparativeOperator>EQ</comparativeOperator>
            <leftOperand name="proteinTerms.term" type="tuplefield"/>
            <rightOperand name="GO:0008372" type="string"/>
          </predicate>
        </predicateExpr>
      </SEQ_SCAN>
    </Operator>
    ...
  </Partition>
  ...
</Partitions>

<Partition isRoot="0">
  <evaluatorURI>
    http://mach1.cs.man.ac.uk:8080/ogsa/services/ogsadai/GQESFactory/GQES1
  </evaluatorURI>
  ...
  <Operator operatorID="2" operatorType="SEQ_SCAN">
    <SEQ_SCAN>
      <tupleType>
        <type> string </type> <name> proteinTerms.GOproteinID </name>
        <type> string </type> <name> proteinTerms.term </name>
      </tupleType>
      <inputOperator> <OperatorID></OperatorID> </inputOperator>
      <DataResourceName>proteinTermsDataResource</DataResourceName>
      <GDSHandle>
        http://mach1.cs.man.ac.uk:8080/…/GridDataServiceFactoryP2R1/GDS1
      </GDSHandle>
      <predicateExpr>
        <predicate>
          <comparativeOperator>EQ</comparativeOperator>
          <leftOperand name="proteinTerms.term" type="tuplefield"/>
          <rightOperand name="GO:0008372" type="string"/>
        </predicate>
      </predicateExpr>
    </SEQ_SCAN>
  </Operator>
  ...
</Partition>
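To show how a receiver might read such a sub-plan, the sketch below parses a trimmed, well-formed fragment of the partition document with Python's standard `xml.etree.ElementTree`. The fragment omits the elided ("...") parts, and the `describe_scan` helper is hypothetical; how a GQES actually consumes the document is not described in the slides.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the sub-plan; elided parts are left out.
FRAGMENT = """
<Partition isRoot="0">
  <Operator operatorID="2" operatorType="SEQ_SCAN">
    <SEQ_SCAN>
      <DataResourceName> proteinTerms </DataResourceName>
      <predicateExpr>
        <predicate>
          <comparativeOperator>EQ</comparativeOperator>
          <leftOperand name="proteinTerms.term" type="tuplefield"/>
          <rightOperand name="GO:0008372" type="string"/>
        </predicate>
      </predicateExpr>
    </SEQ_SCAN>
  </Operator>
</Partition>
"""


def describe_scan(xml_text):
    """Extract the operator type, data resource and predicate of the
    first operator in a partition (hypothetical helper)."""
    part = ET.fromstring(xml_text.strip())
    op = part.find("Operator")
    body = op.find(op.get("operatorType"))  # e.g. the <SEQ_SCAN> element
    pred = body.find("predicateExpr/predicate")
    return {
        "operator": op.get("operatorType"),
        "resource": body.findtext("DataResourceName").strip(),
        "predicate": (
            pred.find("leftOperand").get("name"),
            pred.findtext("comparativeOperator"),
            pred.find("rightOperand").get("name"),
        ),
    }


info = describe_scan(FRAGMENT)
```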
Summary
• DQP on the Grid provides:
  • The normal benefits of DQP.
  • Some added benefits from a Grid setting.
• The Grid specifically enables:
  • Runtime computational resource discovery.
  • Dynamic creation of remote evaluators.
  • Authentication/Transport services.
  • Access to non-database services.
Features of Our GDQS
• Low cost of entry:
  • Imports source descriptions through GDSs.
  • Imports service descriptions as WSDL.
• Throw-away GDQS:
  • Import sources on a task-specific basis.
  • Discard GDQS when task completed.
• Builds on parallel database technology:
  • Implicit parallelism.
  • Pipelined + partitioned parallel evaluation.
• Public release in July 2003.
The SB-DQP Team
• Manchester:
  • Nedim Alpdemir
  • Anastasios Gounaris
  • Alvaro Fernandes
  • Norman Paton
  • Rizos Sakellariou
• Newcastle:
  • Arijit Mukherjee
  • Jim Smith
  • Paul Watson