350 likes | 527 Views
Answering Queries using views: A survey. The problem…. We have a schema A and a set of materialized views V1, V2,…, Vn We are given a query Q over schema A Can we answer Q using another query Q’ referring only to (one or more) views? Depending on the specific context:
E N D
The problem… • We have a schema A and a set of materialized views V1, V2,…, Vn • We are given a query Q over schema A • Can we answer Q using another query Q’ referring only to (one or more) views? • Depending on the specific context: • Schema A can be either virtual or materialized • Q’ may be required to be an equivalent rewriting or a maximally contained rewriting of Q
Equivalent and Containment rewritings • Query Containment: Q1 is contained in Q2 (Q1Q2), if for all db instances D, Q1(D)Q2(D) • Query Containment: Q1 is equivalent to Q2 if Q1Q2 and Q2 Q1 Let A a schema and V={V1, V2, …, Vm} a set of views over A and Q a query over A • Q’ is a equivalent rewriting of Q, if • Q’ refers only to views in V • Q’ is equivalent to Q • Q’ is a maximally contained rewriting of Q, if • Q’ refers only to views in V • Q’ Q • There is no other rewriting Q1, such as Q’ Q1 Q
When the problem arises? • Data integration • The sources are described as views over the mediated schema • Data warehouse • The schema of the warehouse can be defined as views over the schemata of the sources • Query Optimization • Computing the query using previously materialized views • semantic caching • Database design • Independence of the physical view of the data and logical view • Rewriting of a query over the logical schema into a query over the physical view
A running example… • Suppose we have the following schema Student(st_id, st_name, dp_id) Professor(pr_id, pr_name) Department(dp_id, dp_description) Course(cr_id, cr_title, cr_semester) Prof_Dep(pr_id, dp_id, university) Student_course(st_id, cr_id) Prof_course(pr_id, cr_id)
Data Integration Mediator Student(st_id, st_name, dp_id) Professor(pr_id, pr_name) Department(dp_id, dp_description) Course(cr_id, cr_title, cr_semester, cr_field) Prof_Dep(pr_id, dp_id, university) Student_course(st_id, cr_id) Prof_course(pr_id, cr_id) Create View AuebProfs As Select cr_title, pr_name From Course, Prof_course, Professor Where Course.cr_id = Prof_course.cr_id and Prof_course.pr_id = Professor.pr_id and Course.univ = “AUEB” Create View DBProfs As Select cr_title, pr_name From Course, Prof_course, Professor Where Course.cr_id = Prof_course.cr_id and Prof_course.pr_id = Professor.pr_id and Course.cr_field = “DB” Source 1 Professors Of Databases Source 1 Professors Of AUEB
Data Integration Mediator Student(st_id, st_name, dp_id) Professor(pr_id, pr_name) Department(dp_id, dp_description) Course(cr_id, cr_title, cr_semester, cr_field) Prof_Dep(pr_id, dp_id, university) Student_course(st_id, cr_id) Prof_course(pr_id, cr_id) Query Q: Select pr_name From Professor Query Q’: Select pr_name From DBProfs Union Select pr_name From AUEBProfs Source 1 Professors Of Databases Source 1 Professors Of AUEB Q’ is a maximally-contained Rewriting of Q (professors of Algorithms in the university of Piraeus cannot be retrieved…)
Data Integration Mediator Query P: Select pr_name From Course, Prof_course, Professor Where Course.cr_id = Prof_course.cr_id and Prof_course.pr_id = Professor.pr_id and Course.cr_field = “DB” And Course.univ = “AUEB” Student(st_id, st_name, dp_id) Professor(pr_id, pr_name) Department(dp_id, dp_description) Course(cr_id, cr_title, cr_semester, cr_field) Prof_Dep(pr_id, dp_id, university) Student_course(st_id, cr_id) Prof_course(pr_id, cr_id) Query P’: Select pr_name From DBProfs, AUEBProfs Where DBProfs.pr_name = AUEBProfs.pr_name Source 1 Professors Of Databases Source 1 Professors Of AUEB P’ is an Equivalent Rewriting of P
Query Optimization • Suppose that we have a database with the schema of our running example • Suppose that we have the following materialized views • create view StudentCoursesView As Select st_name, cr_id from Student, Stud_course where Student.st_id = Stud_course.st_id • create view ProfCoursesView As Select pr_name, pr_id from Professor, Prof_course where Professor.pr_id = Prof_course.pr_id
Query Optimization • Suppose we are given the query Select st_name, prof_name from Student, Stud_course, Prof_course, Professor where Student.st_id = Stud_course.cr_id and Stud_course.cr_id = Prof_course.cr_id and Prof_course.pr_id = Professor.pr_id
Query Optimization • We can exploit the materialized views in order to eliminate joins • There are many possible combinations • One could be: Select st_name, pr_name From StudentCoursesView, ProfessorCoursesView Where StudentCoursesView.cr_id = ProfessorCoursesView.cr_id
Query Optimization • But: • Our aim here is not just to find an equivalent rewriting • We have to find all rewritings and evaluate their costs in order to select the cheapest plan • In some cases views may not help due to loss of indexes • Therefore, we need to make a cost-based rewriting • Similar case is the semantic caching • Client-server architecture • Query results can be temporary maintained as relations in the client side to be exploited for the evaluation of future queries
A common question: When a view V is useful for a query Q? • There must be a mapping μ from the relations of V to the relations of Q • If so, if R1 is both in V and Q then: • If R1 participates in a join in Q with relation R2, then • Either R1 and R2 must be joined in the same attributes with the same conditions in V • Or R1 and R2 must be joined in V in the same attributes but with weaker conditions than those in Q, provided that the join attributes are projected in V • Or only R1 is in V and the attributes of R1 participating in the join in Q, are projected in V • If Q applies a selection predicate in an attribute A of R1, then V: • Either V must apply the same selection predicate in attribute A of R1 • Or V must apply a weaker selection predicate, provided that attribute A is projected in V • Or does not apply a predicate in attribute A of R1 at all and attribute A is projected in V • V must project at least all attributes projected in Q
Examples Q: Select st_name, prof_name from Student, Stud_course, Prof_course, Professor where Student.st_id = Stud_course.st_id and Stud_course.cr_id = Prof_course.cr_id and Prof_course.pr_id = Professor.pr_id and Student.st_year > 2000 V1: Select st_name from Student where st_year>2000 • V1: Select st_name, cr_name • from Student, Stud_course, Course • where Student.st_id = Stud_course.st_id • and Stud_course.cr_id = Course.cr_id V1: Select st_name, st_id from Student where st_year> 2000 • V1: Select st_name, st_year, cr_name, • cr_id • from Student, Stud_course, Course • where Student.st_id = Stud_course.st_id • and Stud_course.cr_id = Course.cr_id V1: Select st_name, st_id from Student where st_year >1996 V1: Select st_name, st_id, ma from Student where am >1996
Algorithms for finding query rewritings using views • In the query optimization context • A variation of the Selinger Algorithm • Cost-based equivalent rewritings • In the data integration context • The bucket algorithm • The inverse-rules algorithm • The MiniCon algorithm Maximally-contained rewritings
Query Optimization – the Selinger Algorithm • Recall: the Selinger Algorithm chooses the best join order • A bottom-up dynamic programming algorithm • Each iteration explores all join combinations of certain length • For each combination of joins, it keeps only the cheapest plan, taking into account costs estimated in the previous iteration
A variation of the Selinger Algorithm • The Selinger Algorithm can be modified in order to choose the cheapest equivalent rewriting • We start with views that seem useful for a certain query • We continue in a bottom-up fashion exploring all join combinations • For each combination we prune plans estimated to be more expensive or less contributing than others
A variation of the Selinger Algorithm • First Iteration: • Find all views that are relevant (useful) to the query • Choose the best access path for each view and evaluate its cost • Second Iteration: • Find all pair combinations • For each pair find all possible Joins in all possible attributes that lead to useful plans • Plans that are complete (i.e. are sufficient to answer the query) are distinguished as final candidates • For the rest plans do the following: • Compare pairs of plans • For each pair (p, p’), if p’ is cheaper and has greater or equal contribution (covers more relations), then p is pruned • Third Iteration: • Find all possible pairs of plans derived from the previous iteration and the plans derived from the first iteration • …
A variation of the Selinger Algorithm Query: R1 R2 R3 R4 Views: V1=R1 R2 , V2=R2 R3 R4 V3=R2 R3, V4=R3 R4, (V1-V3)-V2 (V1-V3)-V4 V1-V2 V1-V3 V1-V4 V2-V3 V2-V4 V3-V4 V1 V2 V3 V4
Algorithms in the Data Integration Context • We will discuss 2 algorithms • The bucket algorithm • The inverse-rules algorithm • But first a short introduction to conjunctive queries
Conjunctive Queries • Conjunctive queries are able to express select-project-join queries • h(X,…) :- a(Y,…), b(Z,…), … • Head and subgoals are atoms • An atom consists of a predicate and a tuple of arguments: a(Y, Z, ‘aueb’, W, 2006) • Predicates stand for relations body subgoals head constants variables
Conjunctive Queries • Subgoals can also be arithmetic comparisons h(X,…) :- a(Y,…), b(Z,…), Z>3 • A variable that appears in the head is said to be distinguished ; otherwise nondistinguished. h(X,Y):-a(X,Z),b(Z,Y) X,Y: distinguished, Z:nondestinguished • project-select-join SQL queries as conjunctive queries: Select pr_name, dep_name From Professor, Department Where Professor.pr_dep = Department.dep_id and Department.dep_year>1996 and Professor.field = “DB” q(pr_name, dep_name) :- Professor(pr_name, dep_id, “DB”), Department(dep_id, dep_name, dep_year), dep_year>1996.
Conjunctive Queries • The problem of rewriting queries using views in the notation of conjunctive queries: • He have • a set of Database Relations • a set of views declared as conjunctive queries over the Database Relations • a query q expressed as a conjunctive query over the Database Relations • We want to find: • A set of conjunctive queries over the view relations, the disjunction of which being a maximally contained rewriting of q
example Relations: R(S,C,Q), CO(C,T), TE(P,C,Q) Views: V1(S, C, Q, T) :- R(S, C, Q), CO(C, T), C>500, Q>97 V2(S, P, C, Q) :- R(S, C, Q), TE(P, C, Q) V3(S, C) :- R(S, C, Q), Q<94 V4(P, C, T, Q) :- R(S, C, Q), CO(C, T), TE(P, C, Q), Q<97 Query: Q(S, C, P) :- TE(P,C,Q), R(S,C,Q), CO(C,T), C>300, Q>95
The Bucket Algorithm • Create a ‘bucket’ for each subgoal of the query q • Fill each bucket with views that ‘cover’ the subgoal • A view V covers a subgoal p(A,B) of q if there is a subgoal in V with the same predicate p(X,Y) and: • there is a mapping from the variables A, B of the subgoal of q to the variables X, Y of the subgoal of the view • For variables of the subgoal of q appearing in the head of q, their mapping variables in the subgoal of the view must also appear in the head of the view • For each variable of the subgoal of q that also appear in arithmetic conditions C in q, if its mapping variable in the subgoal of the view also appear in arithmetic conditions C’ in the view, then C and C’ should not be mutually exclusive • A solution must include q view from each bucket
The Bucket Algorithm V1(S, C, Q, T) :- R(S, C, Q), CO(C, T), C>500, Q>97 V2(S, P, C, Q) :- R(S, C, Q), TE(P, C, Q) V3(S, C) :- R(S, C, Q), Q<94 V4(P, C, T, Q) :- R(S, C, Q), CO(C, T), TE(P, C, Q), Q<97 Q(St, Cn, Pr) :- TE(Pr,Cn,Qu), R(St,Cn,Qu), CO(Cn,Tl), Cn>300, Qu>95 V4(P,C,T’, Q) V2(S,P’,C,Q) V4(S’, C, T, Q’) V2(S’,P,C, Q) V1(S,C,Q,T’) V1(S’, C, Q’, T) CO TE R
The Bucket Algorithm • There are 8 combinations we have to check • For each, we equaling appropriate variables to enforce joins • We check for condition confligs • If there are duplicate subgoals we try to eliminate them V4(P,C,T’, Q) V2(S,P’,C,Q) V4(S’, C, T, Q’) V2(S’,P,C, Q) V1(S,C,Q,T’) V1(S’, C, Q’, T) TE R CO q’(S,C,P):-V2(S’,P,C,Q), V1(S,C,Q,T’), V1(S’,C,Q’,T) The first solution q’(S,C,P):-V2(S’,P,C,Q), V1(S,C,Q,T’)
The inverse-rules algorithm • Key idea: invert the view definitions to give the global predicates (not only those appearing in the query) in terms of views • For each nondistinguished variable X, pick a new function symbol f Example: v(X,Y) :- p(X,Z), p(Z,Y) Replace Z by f(X,Y): v(X,Y) :- p(X,f(X,Y)), p(f(X,Y),Y)
The inverse-rules algorithm • Replace a view definition by rules with: • A subgoal as the head, and • The view itself as the only subgoal of the body. • Example: v(X,Y) :- p(X,f(X,Y)), p(f(X,Y),Y) becomes: p(X,f(X,Y)) :- v(X,Y) p(f(X,Y),Y) :- v(X,Y)
The inverse-rules algorithm - example View: v(D,C):-m(S,D), r(S, C) Query: q(D) :- m(S,D), r(S, 444) After applying the inverse-rules, we have: m(f1(D,C), D) :- v(D, C) r(f1(D,C), C) :- v(D,C)
The inverse-rules algorithm - example • Supose that v has the following tuples: {(CS, 444), (EE, 444), (CS, 333)} • Applying the inverse-rules on these tuples we have: m: { (f1(CS, 444), CS), (f1(EE, 444),EE), (f1(CS, 333), CS)} r: { (f1(CS, 444), 444), (f1(EE, 444),444), (f1(CS, 333), 333)}
The inverse-rules algorithm - example • We can now apply the query in this instance q: {(CS),(EE)}
The MiniCon algorithm • A variation of the bucket algorithm • It takes into account the non-distinguished variables as well, which usually imply joins • If a view contains a ‘target’ subgoal, which participates in a join in the query, then the ‘join’ variables should appear in the head of the view (unless the join takes place also within the view) • It excludes much more views from participating in buckets (here called MCDs) • There are fewer combination to check at the end