SPARQL Query Optimization. Lesley Charles lxc090120@utdallas.edu November 23, 2009.
Query Optimization
• Query optimization is the process of devising a query execution plan with the minimum response time.
• The response time is minimized by reducing the number of blocks that must be read from or written to memory to complete the query.
• Query optimization is especially vital where numerous transactions are made every second.
SPARQL
• SPARQL matches multiple triple patterns against the data and extracts the data that satisfies them.
• Two factors play an important role in the response time of a SPARQL query:
  • the order in which the triple patterns are evaluated,
  • whether each triple pattern is actually necessary.
• SPARQL optimization also depends on the platform on which it is implemented.
Types of Optimization
• There are two types of query optimization:
  • logical optimization,
  • physical optimization.
• Logical optimization generates the order in which the triple patterns are processed so as to minimize the response time.
• Physical optimization determines how each operation is carried out.
Logical Optimization
• The aim of logical optimization is to find the execution plan expected to return the result set fastest, without actually executing the query or any part of it.
• The technique is selectivity-based optimization of Basic Graph Patterns: we estimate the selectivity of each triple pattern in the graph pattern, find the pattern with the minimum selectivity, and decide the execution plan on that basis.
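Selectivity
• The selectivity of a triple pattern is the fraction of triples in the dataset that match the pattern. It guides the choice of execution plan.
• Consider the following query:
  ?x NS:type NS:animal .
  ?x NS:species "zebra" .
• Changing the order in which the two patterns are evaluated can save a lot of time: if far fewer triples match the species pattern than the type pattern, evaluating it first leaves fewer intermediate results to join.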
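To make the definition concrete, here is a minimal Python sketch that counts matching triples in a small in-memory dataset. The dataset, the helper names (`is_var`, `matches`, `selectivity`) and the tuple encoding of patterns are all assumptions made for illustration; none of them come from the slides, and repeated variables within one pattern are not handled.

```python
# Toy dataset of (subject, predicate, object) triples. All values are invented.
TRIPLES = [
    ("a1", "NS:type", "NS:animal"),
    ("a2", "NS:type", "NS:animal"),
    ("a3", "NS:type", "NS:animal"),
    ("p1", "NS:type", "NS:plant"),
    ("a1", "NS:species", "zebra"),
    ("a2", "NS:species", "lion"),
    ("a3", "NS:species", "lion"),
    ("p1", "NS:species", "oak"),
]


def is_var(term):
    """A term is a variable (unbound) if it starts with '?'."""
    return term.startswith("?")


def matches(pattern, triple):
    """A triple matches when every bound component of the pattern is equal."""
    return all(is_var(p) or p == t for p, t in zip(pattern, triple))


def selectivity(pattern, triples):
    """Fraction of triples in the dataset that match the pattern."""
    matching = sum(1 for triple in triples if matches(pattern, triple))
    return matching / len(triples)


type_pattern = ("?x", "NS:type", "NS:animal")
species_pattern = ("?x", "NS:species", "zebra")

sel_type = selectivity(type_pattern, TRIPLES)        # 3 of 8 triples match
sel_species = selectivity(species_pattern, TRIPLES)  # 1 of 8 triples matches
```

In this toy data the species pattern is the more selective one, so evaluating it first would leave a single candidate binding for `?x` to check against the type pattern.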
Architecture
• The triple patterns are treated as the nodes of a directed graph, where each directed edge denotes a pair of triple patterns.
• The node with the minimum selectivity is visited first and added to the execution plan.
• Each remaining node is then checked against two conditions before it is added to the final execution plan:
  • whether it has the minimum selectivity,
  • whether it has already been visited.
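The ordering procedure can be sketched as a greedy loop. This is a simplified illustration rather than the exact algorithm of the referenced paper: it assumes that selectivity estimates are already available, that two patterns are connected when they share a variable, and that connected patterns are preferred so that joins stay bound. The function names and the numbers are hypothetical.

```python
def variables(pattern):
    """Variables (terms starting with '?') used by a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def order_patterns(selectivities):
    """Greedy ordering: repeatedly pick the unvisited pattern with the
    minimum estimated selectivity, preferring patterns that share a
    variable with one already in the plan."""
    remaining = set(selectivities)  # patterns not visited yet
    plan = []
    bound_vars = set()
    while remaining:
        connected = {p for p in remaining if variables(p) & bound_vars}
        candidates = connected or remaining  # first step: nothing is bound
        # Ties are broken by the pattern itself to keep the result stable.
        best = min(candidates, key=lambda p: (selectivities[p], p))
        plan.append(best)
        remaining.remove(best)
        bound_vars |= variables(best)
    return plan


# Hypothetical selectivity estimates for four triple patterns.
estimates = {
    ("?x", "NS:type", "NS:animal"): 0.375,
    ("?x", "NS:species", "zebra"): 0.125,
    ("?x", "NS:livesIn", "?z"): 0.5,
    ("?z", "NS:type", "NS:country"): 0.2,
}
plan = order_patterns(estimates)
```

With these numbers the plan starts with the species pattern. The country pattern, although fairly selective, is deferred until `?z` has been bound by the `NS:livesIn` pattern.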
Heuristics
• The optimizer can use various heuristics to estimate the selectivity of graph patterns.
• These heuristics fall into two classes:
  • heuristics without pre-computed statistics,
  • heuristics with pre-computed statistics.
• The estimate also depends on whether the subject, the predicate or the object is the most selective component.
Heuristics without pre-computed statistics
• These heuristics do not require any statistical data.
• Variable Counting: the selectivity of a triple pattern is computed from the type and number of its unbound components, using the ranking sel(S) < sel(O) < sel(P), i.e. subjects are treated as more selective than objects, and objects as more selective than predicates.
• Variable Counting Predicates: the selectivity of bound joins is set to 1.0 by default.
• Graph Statistics Handler: allows exact size information to be looked up for any triple pattern component, and the selectivity is determined from that size information. However, it does not support joins of any kind.
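One possible reading of Variable Counting is sketched below. The slides give only the ranking sel(S) < sel(O) < sel(P); the penalty values and the rule that fewer unbound components always ranks first are assumptions chosen for illustration, not figures from the referenced paper.

```python
# Hypothetical penalties: leaving the subject unbound costs the most,
# because a bound subject is assumed to be the most selective component.
UNBOUND_PENALTY = {"subject": 3, "object": 2, "predicate": 1}


def variable_counting_rank(pattern):
    """Return a sort key for a triple pattern; a lower key means the
    pattern is assumed to be more selective. No statistics are used."""
    subject, predicate, obj = pattern
    components = (("subject", subject), ("predicate", predicate), ("object", obj))
    unbound = [name for name, term in components if term.startswith("?")]
    # First compare the number of unbound components, then their type.
    return (len(unbound), sum(UNBOUND_PENALTY[name] for name in unbound))


patterns = [
    ("?s", "NS:type", "?o"),        # only the predicate is bound
    ("?s", "?p", "zebra"),          # only the object is bound
    ("a1", "?p", "?o"),             # only the subject is bound
    ("?s", "NS:species", "zebra"),  # predicate and object are bound
]
ranked = sorted(patterns, key=variable_counting_rank)
```

Among the patterns with a single bound component, the one with the bound subject is ranked first and the one with the bound predicate last, which mirrors the ranking on the slide.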
Heuristics with pre-computed statistics
• These heuristics are more accurate, but they require pre-computed statistics about the RDF data.
• Probabilistic Framework (PF): a standalone framework for estimating the selectivity of RDF graph patterns.
• Probabilistic Framework Join: differs from PF in that it includes the selectivity of the more selective triple pattern when estimating the selectivity of joined triple patterns.
• PFN: another variation of PF, which does not limit the lower bound of the selectivity estimate.
Summary Statistics
• It is traditional practice to keep metadata, i.e. data about the data, in order to calculate cardinalities, build indices, etc.
• This metadata can be used to create summary statistics that make it easier to estimate the result-set size of any query.
• Summary statistics also help in calculating the selectivity of a particular component of a triple pattern.
• Histograms can be used to represent the distribution of the data.
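As a rough illustration of what such summary statistics might contain, the sketch below makes one pass over a toy list of triples. The dictionary keys, the function name and the choice to count distinct subjects as "resources" are assumptions; a real store would typically maintain these figures incrementally.

```python
from collections import Counter


def summary_statistics(triples):
    """Collect simple metadata about a list of (s, p, o) triples."""
    predicate_counts = Counter(p for _, p, _ in triples)
    return {
        "total_triples": len(triples),
        # Assumption: "resources" are approximated by distinct subjects.
        "distinct_subjects": len({s for s, _, _ in triples}),
        "triples_per_predicate": dict(predicate_counts),
    }


# Invented example data.
toy_triples = [
    ("a1", "NS:type", "NS:animal"),
    ("a2", "NS:type", "NS:animal"),
    ("a1", "NS:species", "zebra"),
    ("a2", "NS:species", "lion"),
    ("a2", "NS:age", "7"),
]
stats = summary_statistics(toy_triples)
```

The per-predicate counts are the kind of figure a selectivity estimate for a bound predicate can be derived from.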
Triple Patterns and Joined Triple Patterns
• For each triple pattern we consider the subject, the predicate and the object, and whether each of them is a bound or an unbound component.
• The size of a bound object can be approximated by means of equal-width histograms: for each distinct predicate we compute a histogram that represents the corresponding object-value distribution.
• For joined triple patterns we consider the join and decide whether both triple patterns contain components of the same class.
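The per-predicate histogram can be sketched as follows. The sketch assumes numeric object values, which the slides do not specify, and the function names, the `NS:age` predicate and the sample values are hypothetical.

```python
def build_equal_width_histogram(values, num_classes):
    """Return (low, high, width, counts) for a list of numeric values."""
    low, high = min(values), max(values)
    width = (high - low) / num_classes
    if width == 0:
        width = 1.0  # all values identical: everything falls into class 0
    counts = [0] * num_classes
    for value in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - low) / width), num_classes - 1)
        counts[index] += 1
    return low, high, width, counts


def class_count(histogram, value):
    """Number of stored values in the class into which `value` falls."""
    low, high, width, counts = histogram
    if value < low or value > high:
        return 0  # outside the observed range
    index = min(int((value - low) / width), len(counts) - 1)
    return counts[index]


# One histogram per distinct predicate (invented object values).
object_values = {"NS:age": [1, 2, 2, 3, 5, 8, 9, 10]}
histograms = {
    predicate: build_equal_width_histogram(values, 3)
    for predicate, values in object_values.items()
}
approx_size = class_count(histograms["NS:age"], 2)  # size of the class containing 2
```

Because every value in a class shares one count, the approximation for a bound object is only as fine as the class width.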
Selectivity Estimation
• The selectivity is the ratio of the estimated number of triples matching a pattern to the total number of triples in the dataset.
• sel(t) = sel(s) * sel(p) * sel(o)
• sel(s) = 1/R, where R is the number of resources.
• sel(p) = Tp/T, where T is the total number of triples and Tp is the number of triples matching predicate p.
• sel(o) = hc(p,oc)/Tp, where (p,oc) is the class of the histogram for predicate p into which object o falls, and hc(p,oc) is the count recorded for that class.
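A small worked example of the formulas above, written as a Python sketch. It assumes that all three components of the pattern are bound, and the input numbers are invented purely to show the arithmetic; the function and parameter names are hypothetical.

```python
def estimate_selectivity(num_resources, total_triples, predicate_triples, class_count):
    """sel(t) = sel(s) * sel(p) * sel(o) for a fully bound triple pattern.

    num_resources     -- R, the number of resources
    total_triples     -- T, the total number of triples
    predicate_triples -- Tp, the triples matching predicate p
    class_count       -- hc(p, oc), the count of the histogram class of o
    """
    sel_s = 1.0 / num_resources
    sel_p = predicate_triples / total_triples
    sel_o = class_count / predicate_triples
    return sel_s * sel_p * sel_o


# Invented statistics: R = 1000, T = 10000, Tp = 500, hc(p, oc) = 50.
estimate = estimate_selectivity(1000, 10000, 500, 50)
estimated_matches = estimate * 10000  # expected number of matching triples
```

Here sel(s) = 0.001, sel(p) = 0.05 and sel(o) = 0.1, so the estimate is about 0.000005, i.e. roughly 0.05 matching triples out of 10000.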
Limitations – Logical Optimization
• Scalability is the basic limitation of selectivity-based query optimization: computing selectivities is not feasible for a dataset containing millions of triples.
• The major limitation of logical optimization concerns special modifiers such as OPTIONAL, UNION and FILTER. These modifiers affect the selectivity and undermine the algorithm presented for logical optimization.
• Another important issue is the type of ontology used and the pattern in which the data is stored. These cannot be generalized, as they vary with each dataset.
Physical Optimization
• Physical optimization is a customized solution for a specific ontology or framework: we decide how each query is implemented based on the ontology and the data.
• Usually the queries are rewritten, sometimes eliminating certain triple patterns while still obtaining the same result:
  ?x NS:type NS:animal .
  ?x NS:species "zebra" .
• For each query, we analyse the triple patterns to work out what data is required, and then find the most beneficial way of extracting that data from the data store.
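The rewriting idea can be illustrated with the zebra query. The sketch below assumes an ontology in which every resource with an NS:species value is also typed as NS:animal; under that assumption, which is not stated in the slides, the type pattern is redundant and can be dropped. The data and the helper name `subjects_matching` are hypothetical.

```python
# Invented data in which every resource with a species is typed as an animal.
TRIPLES = [
    ("a1", "NS:type", "NS:animal"),
    ("a2", "NS:type", "NS:animal"),
    ("a3", "NS:type", "NS:animal"),
    ("a1", "NS:species", "zebra"),
    ("a2", "NS:species", "lion"),
    ("a3", "NS:species", "zebra"),
]


def subjects_matching(predicate, obj, triples):
    """Bindings of ?x for the pattern: ?x <predicate> <obj>."""
    return {s for s, p, o in triples if p == predicate and o == obj}


# Original query: both patterns are evaluated and joined on ?x.
original = (
    subjects_matching("NS:type", "NS:animal", TRIPLES)
    & subjects_matching("NS:species", "zebra", TRIPLES)
)

# Rewritten query: the type pattern is eliminated.
rewritten = subjects_matching("NS:species", "zebra", TRIPLES)
```

Both versions return the same bindings for this data. The rewrite would stop being safe as soon as the data contained a resource with a species value but no NS:animal type, which is why this kind of optimization is tied to a specific ontology.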
Thank you!
Reference
M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, D. Reynolds, "SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation", in WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 595-606, New York, NY, USA, 2008. ACM.