RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms EC-Web 2009

Introduction
• The application of Semantic Web technologies in an Electronic Commerce environment implies a need for good support tools
• Fast query engines are needed for efficient querying of large amounts of data, usually represented using the Resource Description Framework (RDF)
• Problem: optimizing query paths (the order in which different parts of a query are evaluated)
• Two-phase optimization (2PO) has already been proposed (Stuckenschmidt et al. 2005) in a Semantic Web context, but a genetic algorithm (GA) appears to be a feasible alternative

RDF and Query Paths (1)
• An RDF model is a collection of facts declared using RDF
• Facts are triples in the form of a node-arc-node link consisting of a subject, a predicate, and an object
• RDF sources can be queried using SPARQL
• We consider a subset of SPARQL queries: chain queries, where a query path is followed by performing joins between its subpaths of length 1

    PREFIX c: <http://www.daml.org/2001/09/countries/fips#>
    PREFIX o: <http://www.daml.org/2003/09/factbook/factbook-ont#>
    SELECT ?partner
    WHERE { c:SouthAfrica o:importPartner ?impPartner .
            ?impPartner o:country ?partner .
            ?partner o:border ?border .
            ?border o:country ?neighbour .
            ?neighbour o:internationalDispute ?dispute .
    }
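
To make the chain property concrete, the example query can be written down as an ordered list of triple patterns. This is only an illustrative sketch: the names `patterns`, `is_chain`, and `join_variables` are hypothetical and not part of the presentation.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# The five triple patterns of the example query, in path order.
patterns: List[Triple] = [
    ("c:SouthAfrica", "o:importPartner", "?impPartner"),
    ("?impPartner", "o:country", "?partner"),
    ("?partner", "o:border", "?border"),
    ("?border", "o:country", "?neighbour"),
    ("?neighbour", "o:internationalDispute", "?dispute"),
]


def is_chain(patterns: List[Triple]) -> bool:
    """True if the object of each pattern is the subject of the next one."""
    return all(a[2] == b[0] for a, b in zip(patterns, patterns[1:]))


def join_variables(patterns: List[Triple]) -> List[str]:
    """The shared terms on which consecutive subpaths of length 1 are joined."""
    return [a[2] for a in patterns[:-1]]


print(is_chain(patterns))        # the example query forms a chain
print(join_variables(patterns))  # one join per pair of adjacent patterns
```

A chain of five patterns requires four joins; the optimization problem discussed on the following slides is the order in which those joins are performed.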

RDF and Query Paths (2)
[Figures: bushy query tree; right-deep query tree]

RDF Query Path Optimization (1)
• Challenge: determine the right order in which the joins should be computed, thereby optimizing the overall response time
• Consider a solution space consisting of query paths
• Solutions are associated with data transmission and processing costs
• Data processing costs are the sum of all join costs, which are influenced by the cardinalities of each operand and the join method used
• Neighbouring solutions in the solution space can be identified using transformation rules introduced by Ioannidis and Kang (1990)
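
The statement that processing cost is the sum of all join costs can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the cost model of the presentation: the cardinalities in `CARD`, the fixed `SELECTIVITY`, and the nested-loop style join cost (product of operand cardinalities) are all hypothetical.

```python
from typing import Tuple, Union

# A query tree is either a leaf (index of a subpath of length 1)
# or a pair (left subtree, right subtree) representing a join.
Tree = Union[int, Tuple["Tree", "Tree"]]

CARD = {1: 1000, 2: 50, 3: 2000, 4: 10}  # hypothetical operand cardinalities
SELECTIVITY = 0.01                       # hypothetical join selectivity


def evaluate(tree: Tree) -> Tuple[float, float]:
    """Return (result cardinality, accumulated join cost) for a query tree."""
    if isinstance(tree, int):
        return float(CARD[tree]), 0.0
    left, right = tree
    left_card, left_cost = evaluate(left)
    right_card, right_cost = evaluate(right)
    join_cost = left_card * right_card  # assumed nested-loop style cost
    result_card = SELECTIVITY * left_card * right_card
    return result_card, left_cost + right_cost + join_cost


right_deep: Tree = (1, (2, (3, 4)))
bushy: Tree = ((1, 2), (3, 4))

for name, tree in (("right-deep", right_deep), ("bushy", bushy)):
    card, cost = evaluate(tree)
    print(f"{name}: total join cost {cost:.0f}")
```

With these made-up numbers the two tree shapes yield different total costs for the same query, which is why the join order is worth optimizing.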

RDF Query Path Optimization (2)
• Stuckenschmidt et al. (2005) propose to use 2PO for RDF chain query optimization:
  • Using Iterative Improvement (II), local optima are found by walking through the solution space (from random starting points), while only taking steps that improve solution quality
  • The best local optimum thus found is used as the starting point for Simulated Annealing (SA); a walk through the solution space is performed, where moves not yielding improvement are accepted with a declining probability
• We propose to optimize RDF chain queries using a GA, RCQ-GA
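
The two phases described above can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the implementation of Stuckenschmidt et al.: the adjacent-swap neighbourhood stands in for the Ioannidis and Kang transformation rules, `toy_cost` is a hypothetical cost function, and all parameter values (restarts, temperature, cooling rate) are arbitrary.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Solution = Tuple[int, ...]


def neighbours(s: Solution) -> List[Solution]:
    """Toy neighbourhood: swap two adjacent positions (an assumption)."""
    result = []
    for i in range(len(s) - 1):
        t = list(s)
        t[i], t[i + 1] = t[i + 1], t[i]
        result.append(tuple(t))
    return result


def toy_cost(s: Solution) -> float:
    """Hypothetical cost: values placed late in the order are penalised."""
    return float(sum((i + 1) * v for i, v in enumerate(s)))


def iterative_improvement(start: Solution, cost: Callable[[Solution], float],
                          rng: random.Random) -> Solution:
    """Phase 1: only accept improving moves until a local optimum is reached."""
    current = start
    while True:
        better = [n for n in neighbours(current) if cost(n) < cost(current)]
        if not better:
            return current
        current = rng.choice(better)


def simulated_annealing(start: Solution, cost: Callable[[Solution], float],
                        rng: random.Random, temperature: float = 1.0,
                        cooling: float = 0.9, steps: int = 20,
                        t_min: float = 1e-3) -> Solution:
    """Phase 2: accept worsening moves with a probability that declines."""
    current = best = start
    while temperature > t_min:
        for _ in range(steps):
            candidate = rng.choice(neighbours(current))
            delta = cost(candidate) - cost(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current = candidate
                if cost(current) < cost(best):
                    best = current
        temperature *= cooling
    return best


def two_phase_optimization(items: Sequence[int],
                           cost: Callable[[Solution], float],
                           restarts: int = 5, seed: int = 0) -> Solution:
    rng = random.Random(seed)
    local_optima = []
    for _ in range(restarts):
        start = list(items)
        rng.shuffle(start)  # random starting point
        local_optima.append(iterative_improvement(tuple(start), cost, rng))
    best_local = min(local_optima, key=cost)
    return simulated_annealing(best_local, cost, rng)


result = two_phase_optimization(range(1, 7), toy_cost)
print(result, toy_cost(result))
```

The toy landscape has a single optimum, so II alone already finds it here; in a real query path solution space with many local optima, the SA phase is what allows the search to leave the best local optimum found by II.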

RDF Query Path Optimization (3)
• In a GA, a population of chromosomes (solutions) is exposed to evolution: selection, crossovers, and mutations
• A GA generally finds good solutions faster than 2PO, but tends to spend a lot of time refining these already good results before it terminates
• We adopt the BushyGenetic (BG) algorithm proposed by Steinbrunn et al. (1997) for traditional query path optimization, but stimulate quicker convergence through elitist selection, fitness-based selection, a decreased population size, and tighter stopping conditions
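
The ingredients named above (elitist selection, fitness-based selection, a small population, and an early stopping condition) can be sketched as follows. This is not RCQ-GA or BG itself: the chromosome is a stand-in list of integers with a per-position upper bound, the cost function passed in the usage line is a toy, and `evolve`, `bounds`, `patience`, and all parameter values are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Chromosome = List[int]


def random_chromosome(bounds: List[int], rng: random.Random) -> Chromosome:
    """Gene k is drawn uniformly from 1..bounds[k]."""
    return [rng.randint(1, b) for b in bounds]


def crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    """One-point crossover; per-position bounds stay satisfied."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(c: Chromosome, bounds: List[int], rate: float,
           rng: random.Random) -> Chromosome:
    """Resample each gene within its bound with probability `rate`."""
    return [rng.randint(1, b) if rng.random() < rate else g
            for g, b in zip(c, bounds)]


def evolve(bounds: List[int], cost: Callable[[Chromosome], float],
           pop_size: int = 20, elite: int = 2, generations: int = 30,
           mutation_rate: float = 0.1, patience: int = 5,
           seed: int = 0) -> Tuple[Chromosome, List[float]]:
    rng = random.Random(seed)
    population = [random_chromosome(bounds, rng) for _ in range(pop_size)]
    history: List[float] = []  # best cost seen in each generation
    for _ in range(generations):
        population.sort(key=cost)
        history.append(cost(population[0]))
        # Stopping condition: no improvement over the last `patience` generations.
        if len(history) > patience and history[-1] == history[-1 - patience]:
            break
        # Fitness-based selection: cheaper chromosomes get a larger weight
        # (assumes non-negative costs).
        weights = [1.0 / (1.0 + cost(c)) for c in population]
        # Elitist selection: the best chromosomes survive unchanged.
        next_gen = [list(c) for c in population[:elite]]
        while len(next_gen) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            child = mutate(crossover(p1, p2, rng), bounds, mutation_rate, rng)
            next_gen.append(child)
        population = next_gen
    best = min(population, key=cost)
    history.append(cost(best))
    return best, history


best, history = evolve([5, 4, 3, 2, 1], cost=sum)
print(best, history)
```

Because the elite is copied unchanged into the next generation, the best cost per generation never gets worse, and the patience check is one simple way to stop the GA from spending a long time on already good results.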

RDF Query Path Optimization (4)
• Solutions are encoded using an efficient ordinal number encoding scheme, facilitating easy crossover and mutation operations
• The algorithm iteratively joins two concepts in an ordered list of concepts
• The result of a join is stored at the position of the first of the two joined concepts
• Example:
  • (c1, c2, c3, c4): join 3 and 4
  • (c1, c2, c3c4): join 1 and 2
  • (c1c2, c3c4): join 1 and 2
  • (c1c2c3c4)
• Encoding: ((3,4),(1,2),(1,2))
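
The decoding procedure in the example can be sketched in a few lines. The function name `decode` and the nested-tuple representation of the resulting query tree are assumptions made for illustration; positions are 1-based, as on the slide.

```python
from typing import List, Sequence, Tuple


def decode(concepts: Sequence[str], encoding: Sequence[Tuple[int, int]]):
    """Turn an ordinal number encoding into a nested-tuple query tree."""
    items: List[object] = list(concepts)
    for i, j in encoding:
        if not 1 <= i < j <= len(items):
            raise ValueError(f"invalid join ({i},{j}) for {len(items)} items")
        # Join the two entries and store the result at the first position.
        items[i - 1] = (items[i - 1], items[j - 1])
        del items[j - 1]
    if len(items) != 1:
        raise ValueError("encoding does not join all concepts")
    return items[0]


tree = decode(["c1", "c2", "c3", "c4"], [(3, 4), (1, 2), (1, 2)])
print(tree)  # (('c1', 'c2'), ('c3', 'c4'))
```

The encoding from the slide decodes to a bushy tree. Each gene only refers to positions in the list as it exists at that step, which is what makes position-wise crossover and mutation straightforward.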

Performance (1)
• We benchmark execution times and solution quality of BG and our adaptation to RDF query environments, RCQ-GA, against those of 2PO
• The effects of a time limit (1 second) on 2PO and RCQ-GA are also assessed
• The entire solution space is considered (i.e., bushy query trees are valid options)
• Each algorithm is tested on chain queries varying in length from 2 to 20 predicates
• Each experiment is iterated 100 times
• For now, we focus on a single source: an RDF version of the CIA World Factbook

Performance (2)
[Figure: relative deviation of average execution times from 2PO average]

Performance (3)
[Figure: relative deviation of average solution costs from 2PO average]

Performance (4)
[Figure: relative deviation of coefficients of variation of solution costs from 2PO average]
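
The slides do not spell out how the plotted measures are computed. A plausible reading, assuming the standard definitions, is a relative deviation of (value - reference) / reference and a coefficient of variation of standard deviation / mean. The sketch below uses the sample standard deviation, which is also an assumption, and the input numbers are placeholders rather than results from the experiments.

```python
from statistics import mean, stdev
from typing import Sequence


def relative_deviation(value: float, reference: float) -> float:
    """Deviation of `value` from `reference`, as a fraction of the reference."""
    return (value - reference) / reference


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(samples) / mean(samples)


# Placeholder measurements (NOT taken from the presentation).
costs_2po = [100.0, 110.0, 90.0, 105.0]
costs_ga = [95.0, 96.0, 94.0, 95.0]

print(relative_deviation(mean(costs_ga), mean(costs_2po)))
print(relative_deviation(coefficient_of_variation(costs_ga),
                         coefficient_of_variation(costs_2po)))
```

Under this reading, a negative value means the compared algorithm is below the 2PO average on that measure (faster, cheaper, or more consistent, depending on the figure).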

Conclusions
• In optimizing the query path for chain queries in a single-source RDF query execution environment, the performance of a GA compared to 2PO is positively correlated with the complexity of the solution space and the restrictiveness of the environment
• An appropriately configured GA can outperform 2PO in solution quality, execution time needed, and consistency of solution quality

Future Work
• Optimize parameters (e.g., using meta-algorithms)
• Evaluate performance in a distributed setting
• Experiment with other algorithms, such as ant colony optimization or particle swarm optimization

Questions?
• Feel free to contact:
  Alexander Hogenboom
  Erasmus School of Economics
  Erasmus University Rotterdam
  P.O. Box 1738, 3000 DR, The Netherlands
  hogenboom@ese.eur.nl