Capability-Sensitive Query Processing on Internet Sources Hector Garcia-Molina Wilburt Labio Ramana Yerneni Presented by Bimbi Koduru Date : 03/29/2007
Brief Overview • On the Internet, the limited query-processing capabilities of sources make answering even the simplest queries challenging. • To solve this, a scheme was developed called GenCompact. • GenCompact is used for generating capability-sensitive plans for queries on Internet sources.
Overview Contd., • Advantages of the plans GenCompact generates over those generated by existing query-processing systems: • Sources are guaranteed to support the query plans. • The plans take advantage of the source capabilities. • The plans are more efficient since a larger space of plans is examined.
Introduction • Processing queries over sources with a wide range of query-processing capabilities poses some interesting challenges. • Examples: • A bookstore (think BarnesAndNoble!) • A car shopping guide (the autobytel website)
Example 1 • Consider BarnesAndNoble. • The source won’t allow a single query for books by two authors on the same topic. • A good plan is to break up the query into two, one per author. • At the end, we take the union of the results of the two queries to obtain the answer.
Example 2 • Suppose we want to look up information on midsize or compact sedans, such as Toyotas under $20k and BMWs under $40k. • The query condition in this case is: • (style = “sedan” ^ (size = “compact” v size = “midsize”) ^ ((make = “Toyota” ^ price <= 20000) v (make = “BMW” ^ price <= 40000))).
Example 2 Contd., • This query condition cannot be supported directly by the web source. • Breaking the condition up into two source queries is feasible: • (style = “sedan” ^ make = “Toyota” ^ price <= 20000 ^ (size = “compact” v size = “midsize”)) • (style = “sedan” ^ make = “BMW” ^ price <= 40000 ^ (size = “compact” v size = “midsize”)) • The union of the two results answers the original query.
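A quick way to convince yourself that the two-query plan returns the same answer as the original condition is to evaluate both over a small set of cars. The sketch below does that in Python; the car records in `CARS` are made-up test data, and the function names are illustrative rather than anything from the talk.

```python
from itertools import product

def target(car):
    """Original condition from Example 2."""
    return (car["style"] == "sedan"
            and car["size"] in ("compact", "midsize")
            and ((car["make"] == "Toyota" and car["price"] <= 20000)
                 or (car["make"] == "BMW" and car["price"] <= 40000)))

def toyota_query(car):
    """First source query of the two-query plan."""
    return (car["style"] == "sedan" and car["make"] == "Toyota"
            and car["price"] <= 20000
            and car["size"] in ("compact", "midsize"))

def bmw_query(car):
    """Second source query of the two-query plan."""
    return (car["style"] == "sedan" and car["make"] == "BMW"
            and car["price"] <= 40000
            and car["size"] in ("compact", "midsize"))

# Hypothetical catalogue: every combination of a few attribute values.
CARS = [dict(style=s, size=z, make=m, price=p)
        for s, z, m, p in product(["sedan", "coupe"],
                                  ["compact", "midsize", "fullsize"],
                                  ["Toyota", "BMW", "Ford"],
                                  [15000, 30000, 50000])]

direct = [c for c in CARS if target(c)]
union = [c for c in CARS if toyota_query(c) or bmw_query(c)]
print(direct == union, len(direct))  # prints: True 6
```

The check is only a spot test on sample data; the equivalence itself follows from distributing the conjunction over the make/price disjunction.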
Explanation • In this example, both DNF-based and CNF-based query-processing systems produce less desirable plans. • A DNF system transforms the query into one with four terms. In this case our two-query plan is preferable. • A CNF system transforms the query into one with six clauses. Here, a CNF system may do far more work than necessary.
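The term and clause counts can be checked mechanically. The following sketch represents the Example 2 condition as nested tuples and converts it to DNF and CNF by plain distribution; the tuple encoding and the helper names `dnf` and `cnf` are assumptions made for this illustration.

```python
from itertools import product

# Condition tree: a string is an atomic condition,
# a tuple is (connector, child, child, ...).
QUERY = ("and",
         "style=sedan",
         ("or", "size=compact", "size=midsize"),
         ("or",
          ("and", "make=Toyota", "price<=20000"),
          ("and", "make=BMW", "price<=40000")))

def dnf(node):
    """Return the DNF as a set of terms; each term is a frozenset of atoms."""
    if isinstance(node, str):
        return {frozenset([node])}
    op, *children = node
    parts = [dnf(c) for c in children]
    if op == "or":
        return set().union(*parts)            # OR: collect all terms
    # AND: pick one term from each child and merge them
    return {frozenset().union(*combo) for combo in product(*parts)}

def cnf(node):
    """Return the CNF as a set of clauses; each clause is a frozenset of atoms."""
    if isinstance(node, str):
        return {frozenset([node])}
    op, *children = node
    parts = [cnf(c) for c in children]
    if op == "and":
        return set().union(*parts)            # AND: collect all clauses
    # OR: pick one clause from each child and merge them
    return {frozenset().union(*combo) for combo in product(*parts)}

print(len(dnf(QUERY)), len(cnf(QUERY)))  # prints: 4 6
```

Four DNF terms means four source queries where two suffice, which is the gap the two-query plan closes.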
Difficulties caused by the limited query capabilities of Internet sources • It is difficult to generate plans that are known a priori to be feasible. • The size of the plan space for even moderately complex queries can be very large. • In some cases, existing systems choose infeasible plans when feasible plans exist. • In other cases, they choose inefficient plans when much more efficient feasible plans exist.
Different query processing systems • Very few query-processing systems take source capabilities into account. • Conventional systems such as • System R • DB2 • NonStop SQL assume relational source capabilities without limitations.
Query processing systems Contd., • Relatively new systems like • Information Manifold • TSIMMIS • Garlic • DISCO have addressed issues surrounding limited source capabilities.
Notation • Efficient feasible plans are generated for target queries of the form $\pi_A(\sigma_C(R))$: a selection with condition C on relation R, followed by a projection onto the attributes A. • The condition expression C is represented by a condition tree (CT). • Leaf Nodes – Atomic Conditions • Non-leaf Nodes – Boolean connectors
Notation • Given a condition expression C, we denote the set of attributes in C as Attr(C). • An alternative notation for $\pi_A(\sigma_C(R))$ is SP(C, A, R). • In the case of a node n of some CT, SP(n, A, R) is shorthand for SP(Cond(n), A, R).
Source Capabilities • Internet sources have a wide variety of query-processing limitations. • Condition-Attribute Restrictions • Condition-Expression-Size Restrictions • Condition-Expression-Structure Restrictions.
SSDL • Simple Source-Description Language (SSDL) is a powerful language that describes the wide range of query capabilities. • SSDL is based on context-free grammars (CFGs). • Using SSDL, standard parsing technology can be used to check for the supportability of a source query very efficiently.
SSDL Example • Consider R( make, model, year, color, price ) • The query capabilities of R can be described in SSDL as follows:
SSDL Example : Rules • CFG rules describe the condition expressions R can evaluate. • Rule (2) – R can evaluate conditions like (make = “BMW” ^ price < 40000) • Rule (3) – R can evaluate conditions like (make = “BMW” ^ color = “Red”) • The last two rules indicate the attributes that can be exported by R.
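To make the idea of a supportability check concrete, here is a deliberately simplified Python sketch. It is not SSDL and does no CFG parsing: it matches a conjunction against two fixed templates modelled on the conditions that rules (2) and (3) are said to accept. The `TEMPLATES` table, the exported attribute sets, and the `check` signature are all assumptions for illustration.

```python
# Hypothetical stand-in for a source description of R(make, model, year, color, price).
# Each entry: (sequence of (attribute, operator) pairs, attributes assumed exportable).
TEMPLATES = [
    ((("make", "="), ("price", "<")), {"make", "model", "year", "color"}),
    ((("make", "="), ("color", "=")), {"make", "model", "year"}),
]

def check(conjuncts):
    """Return the attributes the source can export for this conjunction,
    or an empty set if no template accepts it.

    `conjuncts` is a list of (attribute, operator, value) triples, in order.
    """
    shape = tuple((attr, op) for attr, op, _ in conjuncts)
    exported = set()
    for pattern, attrs in TEMPLATES:
        if shape == pattern:      # order matters, as it would in a grammar
            exported |= attrs
    return exported

print(bool(check([("make", "=", "BMW"), ("price", "<", 40000)])))  # prints: True
print(bool(check([("color", "=", "Red")])))                        # prints: False
```

A real SSDL description would replace the template table with grammar rules and the equality test with a parser, but the contract is the same: a condition goes in, a (possibly empty) set of exportable attributes comes out.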
A modular scheme - GenModular • GenModular is used for generating efficient feasible query plans for target queries. • GenModular considers various rewritings of the target query condition and chooses the least expensive among the plans. • GenModular identifies parts of the condition that can be answered by the source and pieces together the source queries.
GenModular • Four Modules work together in GenModular.
Rewrite Module • It produces a set of equivalent rewritings of the target-query condition. • Starts from the condition tree (CT) for the target query and generates the CTs for the rewritings. • Rewrite module uses a set of rules that are also input to the module. • Examples – Commutative, Associative and Distributive transformations of condition expressions.
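As an illustration of one such rule, the sketch below applies the distributive transformation once at the root of a condition tree. The nested-tuple encoding and the function name `distribute` are assumptions for this example, not the module's actual interface.

```python
def distribute(node):
    """Rewrite a ^ (b v c) into (a ^ b) v (a ^ c), once, at the root.

    A string is an atomic condition; a tuple is (connector, child, ...).
    The tree is returned unchanged if the rule does not apply.
    """
    if isinstance(node, str) or node[0] != "and":
        return node
    children = list(node[1:])
    for i, child in enumerate(children):
        if not isinstance(child, str) and child[0] == "or":
            rest = children[:i] + children[i + 1:]
            # One AND branch per alternative of the OR child.
            return ("or", *[("and", *rest, alt) for alt in child[1:]])
    return node

tree = ("and", "a", ("or", "b", "c"))
print(distribute(tree))
# prints: ('or', ('and', 'a', 'b'), ('and', 'a', 'c'))
```

Applying rules like this repeatedly, together with commutative and associative ones, is what produces the set of equivalent CTs.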
Mark Module • Determines which parts of each CT produced by the rewrite module can be evaluated by the source. • Example Description: • Each node n in the CT has a field n.export that records the set of attributes that can be exported by the source when asked to evaluate Cond(n). • By using the Check function described earlier, the mark module computes the export fields of all the nodes in the CT.
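The marking step can be pictured as a bottom-up walk that calls Check on every node. The sketch below assumes a small `Node` class, a `cond` helper that renders Cond(n) as text, and a dictionary-backed `check` function; all three are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                   # atomic condition, or "and" / "or"
    children: list = field(default_factory=list)
    export: set = field(default_factory=set)     # filled in by mark()

def cond(n):
    """Render Cond(n) as a string in the slides' ^ / v notation."""
    if not n.children:
        return n.label
    sep = " ^ " if n.label == "and" else " v "
    return "(" + sep.join(cond(c) for c in n.children) + ")"

def mark(n, check):
    """Set n.export for n and every node below it."""
    for child in n.children:
        mark(child, check)
    n.export = set(check(cond(n)))

# Hypothetical source: which conditions it accepts and what it exports for each.
SUPPORTED = {
    "make = BMW": {"model", "price"},
    "(make = BMW ^ price < 40000)": {"model"},
}

def check(condition):
    return SUPPORTED.get(condition, set())

root = Node("and", [Node("make = BMW"), Node("price < 40000")])
mark(root, check)
print(sorted(root.export), sorted(root.children[1].export))
# prints: ['model'] []
```

An empty export set marks a node the source cannot evaluate on its own, which is the information the generate module needs.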
Generate Module • Generate module uses an algorithm called Exhaustive Plan Generator (EPG) that computes the feasible plans for a given CT. • Generate module produces the set of feasible plans by repeatedly invoking EPG on each of the CTs passed on by the mark module.
Exhaustive Plan Generator • EPG generates a plan for n by combining the plans for all the children of n. • EPG also explores the possibility of downloading the relevant portion of the source contents and evaluating the condition expression corresponding to n at the mediator.
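The two bullets above can be sketched as a recursive enumeration. This is a loose illustration rather than the EPG algorithm itself: plans are plain strings, children of an AND node are combined by intersection and children of an OR node by union, and exported attributes are ignored. The names `epg`, `render`, `supported`, and `can_download` are assumptions.

```python
from itertools import product

def render(node):
    """Render a condition tree (string atom or (connector, child, ...))."""
    if isinstance(node, str):
        return node
    sep = " ^ " if node[0] == "and" else " v "
    return "(" + sep.join(render(c) for c in node[1:]) + ")"

def epg(node, supported, can_download):
    """Return every plan (as a string) found for `node`."""
    plans = []
    text = render(node)
    if text in supported:
        plans.append(f"SQ[{text}]")                  # ask the source directly
    if can_download:
        plans.append(f"filter({text}, download)")    # evaluate at the mediator
    if not isinstance(node, str):
        op, *children = node
        child_plans = [epg(c, supported, can_download) for c in children]
        combiner = "intersect" if op == "and" else "union"
        # One combined plan per choice of a plan for each child;
        # if any child has no plan, product() yields nothing.
        for combo in product(*child_plans):
            plans.append(f"{combiner}(" + ", ".join(combo) + ")")
    return plans

query = ("or", "author = A", "author = B")
print(epg(query, {"author = A", "author = B"}, can_download=False))
# prints: ['union(SQ[author = A], SQ[author = B])']
```

With a source that accepts single-author conditions but not their disjunction, the only plan found is the union of two source queries, matching Example 1.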
GenCompact • GenCompact generates the same plans in a much more efficient manner than GenModular. • The major disadvantage of GenModular is that it is very inefficient in generating feasible plans for target queries.
GenCompact vs. GenModular • GenCompact improves upon GenModular with two techniques. • Intelligent Plan Generation • Pruning techniques
Rewrite Module • GenCompact employs a rewrite module to generate a set of CTs equivalent to the CT representing the target-query condition. • However, GenCompact can work with far fewer CTs than GenModular by firing fewer rewrite rules, without compromising the optimality of the plans being generated.
Cost Model • Given a plan for the target query that uses a set of source queries SQ, the cost of the plan is: • where, • k1 and k2 are constants that depend on the source referred to by the target query.
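As a sketch of how a cost of this kind could be computed, assume each source query in SQ pays a fixed overhead k1 plus k2 for every result it returns. That specific form is an assumption made for illustration; only SQ, k1, and k2 come from the slide, and `plan_cost` is a hypothetical helper.

```python
def plan_cost(result_sizes, k1, k2):
    """Cost of a plan under the ASSUMED model: sum over source queries of
    (k1 + k2 * number of results returned by that query).

    `result_sizes` holds one result count per source query in SQ.
    """
    return sum(k1 + k2 * size for size in result_sizes)

# Two source queries returning 10 and 20 results, with k1 = 5 and k2 = 2:
print(plan_cost([10, 20], k1=5, k2=2))  # prints: 70
```

Under a model of this shape, a plan with fewer source queries and smaller results is cheaper, which is the kind of comparison pruning rules rely on.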
Pruning rules • Based on the cost model, the following pruning rules can be formulated: • PR1: Prune impure plans when pure plans exist. • PR2: Prune locally sub-optimal plans. • PR3: Prune dominated plans. • These rules are used in the plan-generation module.
Plan Generation Module • Plan-generation module takes each CT produced by the rewrite module and generates a single query plan for the CT. • The associativity and copy rules are not used in the rewrite module. • To compensate, the plan generation module has to do more work on each CT it receives from the rewrite module.
Conclusion • GenCompact produces excellent feasible plans for queries over Internet sources with limited capabilities. • GenCompact is a flexible scheme and can be adapted to different source-capability-description languages and cost models.