540 likes | 613 Views
CS848 Presentation. Heng YU (Henry) h3yu@hopper.uwaterloo.ca. Paper to present. Answering queries using views: A survey by A. Y. Halevy VLDB Journal 10: pp. 270-294 2001. Outline. Introduction with examples Formal problem definitions Conditions of view usability
E N D
CS848 Presentation Heng YU (Henry) h3yu@hopper.uwaterloo.ca
Paper to present Answering queries using views: A survey by A. Y. Halevy VLDB Journal 10: pp. 270-294 2001
Outline • Introduction with examples • Formal problem definitions • Conditions of view usability • Using materialized views in query optimization • Answering queries using views in data integration • Theoretical results • Extensions • Conclusion and challenges
Problems (informal) Given a query Q and a set of views V1, .., Vnover a database schema, • Is it possible to answer Qusing only the answers to V1, .., Vn? • What is the maximal set of tuples in the answer of Q that we can get from V1, .., Vn? • If we can access both the views and the database relations, what is the cheapest query execution plan for answering Q?
Fields of applications • Query optimization • Physical data independence • Data integration • More: e.g. semantic cache
Example: a university schema Prof(name, area) Course(c-number, title) Teaches(prof, c-number, quarter) Registered(student, c-number, quarter) Major(student, dept) Works(prof, dept) Advises(prof, student) Keys: Prof(name) Courses(c-cumber) graduate course c-cumber ≥ 400 Ph.D. course c-cumber ≥ 500
Query Optimization Suppose we have a view for graduate course registration info: create view Graduate as select Registered.student, Course.title, Course.c-cnumber, Registered.quarter from Registered.course where Registered.c-number = Course.c-number and Course.c-number ≥ 400
Query optimization (cont.) Want to query students registering in Ph.D. level courses taught by a professor who in interested in DB area: select Registered.student, Course.title from Teaches, Prof, Regestered, Course where Prof.name = Teaches.prof and Teaches.c-number = Register.c-number and Teachers.quarter = Registered.quarter and Registered.c-number = Course.c-number and Course.c-number ≥ 500 and Prof.area = ‘DB’
Queryselect Registered.student, Course.titlefrom Teaches, Prof, Registered, Coursewhere Prof.name = Teaches.prof and Teaches.c-number = Register.c-number and Teachers.quarter = Registered.quarter and Registered.c-number = Course.c-number and Course.c-number ≥ 500 and Prof.area = ‘DB’ View create view Graduate as select Registered.student, Course.title, Course.c-cnumber, Registered.quarter from Registered. Course where Registered.c-number = Course.c-number and Course.c-number ≥ 400
Query optimization (cont.) Result of query rewriting select Graduate.student, Graduate.title from Teachers, Prof, Graduate where Prof.name = Teachers.prof and Teaches.c-number = Graduate.c-cumber and Teaches.quarter = Graduate.quarter and Graduate.c-number ≥ 500 and Prof.area = ‘DB’
Maintaining physical data independence • Relational database systems rely on 1-1 mapping between relations and files. • In object-oriented and semistructured databases, logical model is more redundant and does not reflect optimal physical design. • Physical storage can be described as views over the logical model. e.g. GMAP (Tsatalos et al. 96)
Maintaining physical data independence (cont.) GMAP (generalized multi-level access paths) def.gmap G1 as b+-tree by given Student.name select Department where Student.major Department. def.gmap G2 as b+-tree by given Student.name select Course.c-number where Student registered Course def.gmap G3 as b+-tree by given Course.c-number select Department where Student.registered Course and Student major Department
Maintaining physical data independence (cont.) Query: select Student.name, Department where Student registered Course and Student major Department and Course.c-number ≥ 500 Plans: • PStudent.name, Department (SCourse.c-number≥500 (JStudent.name(G1, G2))) • JCourse.c-number (SCourse.c-number≥50(G3), G2)
Data integration • Providing a uniform query interface to a multitude of autonomous heterogeneous data sources. • Giving users a mediated schema. • Local as View: specifying data source descriptions as a view over the mediated schema.
Data integration (cont.) Example: Prof(name, area) Course(c-number, title, univ) Teaches(prof, c-number, quarter, univ) Register(student, c-number, quarter) Major(student, dept) Works(prof, dept) Advises(prof, student)
Data integration (cont.) Suppose we have only 2 views available: create view DB-courses as select Course.title, Teaches.prof, Course.c-number, Course.univ from Teaches, Course where Teaches.c-number = Course.c-number and Teaches.univ = Course.univ and Course.title = “Database Systems” create view UW-phd-courses as select Course.title, Teaches.prof, Course.c-number,Course.univ from Teaches, Course where Teaches.c-number = Course.c-number and Course.univ = ‘UW’ and Teaches.univ = ‘UW’ and Course.c-number ≥ 500
Data integration (cont.) • Query who teaches database courses in UW: select prof from DB-courses where univ = ‘UW’ • Query all graduate courses in UW: select title, c-number from DB-courses where univ = ‘UW’ and c-number ≥ 400 UNION select title, c-number from UW-phd-courses
Í Í Í Í Query containment and equivalence Definition A query Q1 is said to be contained in a query Q2, denoted by Q1 Q2, if for all database instances D, the set of tuples computed for Q1 is a subset of those computed for Q2, i.e., Q1(D) Q2(D) ; The two queries are equivalent if Q1 Q2 and Q2 Q1 .
Equivalent rewritings Definition Let Q be a query, V = {V1, V2, …, Vm } be a set of view definitions. The query Q’ is an equivalent rewriting of Q using V if: • Q’ refers only to the views in V; • Q’ is equivalent to Q.A query Q1 is said to be contained in a query Q2,
Í Í Maximally-contained rewritings Definition Let Q be a query, V = {V1, V2, …, Vm } be a set of view definitions, and L be a query language. The query Q’ is maximally-contained rewriting of Q’ w.r.t. L if: • Q’ is a query in L that refers only to the views in V; • Q’ is contained in Q; • there is no rewriting Q1 L, such that Q’ Q1 Q, and Q1 is not equivalent to Q’.
Certain Answers • Problem: finding all the answers to a query given a set of views. • Not equivalent to maximally-contained rewriting because Maximal containment relies on languages. • Formalized by certain answers (Abiteboul et.al. 98) • A tuple α is a certain answer of Q w.r.t. a set of view definitions {Vi} and their extensions {vi}, if α is inQ(D) for any possible database instance D such that Vi (D) = vi (CWA) or Vi (D) vi (OWA) .
View usability conditions For SPJ views to be usable in an equivalent rewriting of a SPJ query Q under bag semantics: • There is a mapping ψfrom occurrences of tables mentioned in the from clause of V to those mentioned in the from clause of Q, mapping every table name to itself. For bag semantics, ψmust be 1-1. • V must either apply the join and selection predicates in Q on the attributes on the attributes of the tables in the domain of ψ, or must apply to them a logical weaker selection, and select the attributes on which predicate still need to applied. • V must not project out any attributes of the tables in the domain of ψthat are needed in the selection of Q.
Queries with grouping and aggregation Example: View: create view V as select c-number, year, Max(evaluation) as maxeval, Count(*) as offerings from Teaches where c-number ≥ 400 group by c-number, year Query: select year, Count(*), Max(evaluation) from Teaches where c-number ≥ 500 group by year
Queries with grouping and aggregation (cont.) The query can be rewritten to: select year, sum(offering), Max(evaluation) From V where c-number ≥ 500 group by year Comment: • More limitations if grouping and aggregation are concerned. • Grouping in view must be finer than that in query. • Aggregations in query must be recoverable from the output fields and aggregations in the view.
Main approches • Using datalog query representation for both Q and V. • Algorithms: • Bucket algorithm (Levy et al. 96) • Inverse rules algorithm (Qian et al. 96 ) • MiniCon algorithm (Pottinger et al. 00)
Bucket algorithm • Create a bucket for each non-comparison subgoal g in Q: For each subgoal g’ in V, if there is a unifier θ for g and g’and the view, and after unification, • the comparison predicates in Q and V are simultaneously satisiable; • if a variable appears in head(Q) and subgoal g in the query, the corresponding variable in g’ alsoappears in head(V) in V, add θ(head(V)) into the bucket of g. • Find a set of conjunctive query rewritings, and each produces a conjunctive query including one conjunct from each bucket. It is a conjunctive rewriting if either • The conjunctive is contained in Q, or • It is possible to add atoms of comparison predicates such that the resulting conjunction is contained in Q.
Bucket algorithm example V1(student, c-number, quarter, title):- Registered(student, c-number, quarter), Course(c-number, title), c-number ≥ 500, quarter ≥ Aut98. V2(student, prof, c-number, quarter):- Registered(student, c-number, quarter), Teaches(prof, c-number, quarter) V3(student, c-number):- Registered(student, c-number, quarter), quarter ≤ Aut94. V4(prof, c-number, title, quarter):- Registered(student, c-number, quarter), Course(c-number, title), Teaches(prof, c-number, quarter), quarter ≤ Aut97.
Bucket algorithm example (cont.) Query: Q(S,C,P) :- Teaches(P,C,Q), Registered(S,C,Q), Course(C,T), C ≥ 300, Q ≥ Aut95. Bucket:
Bucket algorithm example (cont.) Result of rewriting: q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’) q’(S,C,P) :- V4(P,C,T’,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’) q’(S,C,P) :- V2(S,P,C,Q), V4(P,C,T’,Q) The second query is empty, so the result is the union of the first and the third conjunctive queries.
Bucket algorithm comments • Advantage • Prune significant number of query rewritings. • Return maximally-contained rewriting when the query does not have comparison. • Disadvantage • Cartesian product of buckets is still large • Testing query containment is costly and -complete.
Inverse-rules algorithm • Construct a set of rules that invert the view definitions. • Idea: each tuple in the head of view definition query is a witness of tuples in relations corresponding to subgoals in the body. • Assign one skelom function symbol for each existential variable in the view definition.
Inverse-rules algorithm example Example: View definition: V3(dept, c-number) :- Major(student, dept), Registered(student, c-number) Inverse rules: Major(f1(dept, X), dept) :- V3(dept, X) Registered(f1(Y, c-number), c-number) :- V3(Y, c-number)
Inverse-rule algorithm example (cont.) Query: q(dept) :- Major(student, dept), Registered(student, 444) V3 has tuples: {(CS, 444), (EE, 444), (CS 333)} Applying inverse rules: Registered: {(f1(CS, 444), CS), (f1(EE, 444), EE), (f1(CS, 333), CS)} Major: {(f1(CS, 444), 444), (f1(EE, 444), 444), (f1(CS, 333), 333)} Answer: {EE, CS}
Inverse-rule algorithm comments • Advantage • Simplicity and modularity • Return maximally-contained rewriting • Disadvantage • Keep more non-contributive views than bucket algorithm • Require recomputing the relations from the views. The reason to use precomputed materialized views is lost.
MiniCon algorithm • Improvement on bucket algorithm. • Aim to eliminate more views that are useless to the query. • When we find a unification between a subgoal g’ in V and a subgoal g in Q, all other subgoals that join with g in Q are examined. V must either have the join attribute in its head, or contain the corresponding joined subgoals in the body. • For each view, compute a MiniCon consisting all subgoals in the query the view contributes.
MiniCon example Example: q(D) :- Major(S, D), Registered(S, 444, Q), Advises(P, S) V1(dept) :- Major(student, dept), Registered(student, 444, quarter). V2(prof,dept,area) :- Advises(prof, student), Prof(name, area). V3(dep,c-number) :- Major(student, dept), Registered(student, c-number, quarter) Advises(prof, student) MiniCon(V1) = Φ, MiniCon(V2) = Φ, MiniCon(V3) = {Major, Registered, Advises}
Completeness • Question: given a query Q and a set of views V, will the algorithm find an equivalent rewriting of Q using V, when there one exists? • When a CQ has no comparison predicates and has n subgoals, there exists an equivalent conjunctive rewriting of Q using V only if there is a rewriting with at most n subgoals. The complexity is NP-hard. (Levy et al. 1995)
Recursive rewriting • Goal: when we apply maximally-contained rewriting, we can also get the set of all certain answers. • Recursive query rewriting is necessary when: • The query is recursive. • Database relations have functional dependencies. • There exist access pattern limitations on the views. • Views have unions. • Additional semantic information about class hierarchies on objects is expressed in DL.
Recursive rewriting (example with fd) Relation: schedule(Airline, Flight_no, Date, Pilot, Aircraft) FDs: Pilot -> Airline, Aircraft->Airline View: V(D,P,C) :- schedule(A, N, D, P, C) Query: Q(P) :- schedule(A, N, D, ‘mike’, C), schedule(A, N’, D’, P, C’) Rewriting: relevantPilot(‘mike’) relevantAircraft(C) :- v(D, ‘mike’, C) relevantAircraft(C) :- v(D, P, C), relevantPilot(P) relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C), v(D1, P1, C), v(D2, P, C)
Finding certain answers • Open-world assumption: polynomial in most practical cases. NP-hard (in the size of view extensions) if unions are allowed in view definition or inequality predicates are allowed in query languages. • Close-world assumption: co-NP-hard even if both views and queries are CQs and have no comparison predicates. c.f. GAV: polynomial • In cases views can contain incorrect tuples : • assume no comparison predicates in views or query • If all views are complete or all views may have incorrect tuples: ploynomial in view ext. size • o.w.: co-NP-hard