550 likes | 660 Views
Searching and Integrating Information on the Web. Seminar 2: Data Integration Professor Chen Li UC Irvine. Motivation. Biblio sever. Legacy database. Plain text files. Support seamless access to autonomous and heterogeneous information sources. Comparison Shopping. Applications.
E N D
Searching and Integrating Information on the Web Seminar 2: Data Integration Professor Chen Li UC Irvine
Motivation Biblio sever Legacy database Plain text files Support seamless access to autonomous and heterogeneous information sources. Seminar 2
Comparison Shopping Applications • Comparison shopping Lowest price of the DVD: “The Matrix”? • Supply-chain management Buyer 1 Supplier 1 Buyer 2 Supplier 2 Integrator … … Supplier M Buyer M Seminar 2
Mediation architecture Mediator Wrapper Wrapper Wrapper Source 1 Source 2 Source n TSIMMIS (Stanford), Garlic (IBM), Infomaster (Stanford), Disco (INRIA), Information Manifold (AT&T), Hermes(UMD), Tukwila (UW), InfoSleuth (MCC), … Seminar 2
Challenges • Sources are heterogeneous: • Different data models: relational, object-oriented, XML, … • Different schemas and representations: “Keanu Reeves” or “Reeves, Keanu” or “Reeves, K.” etc. • Describe source contents • Use source data to answer queries • Sources have limited query capabilities • Data quality • … Seminar 2
Outline • Basics: theories of conjunctive queries • Global-as-view (GAV) approach to data integration • Local-as-view (LAV) approach to data integration Seminar 2
Basics: conjunctive queries • Reading: Ashok K. Chandra and Philip M. Merlin, “Optimal implementation of conjunctive queries in relational data bases,” STOC, 77-90, 1977. • Fundamental for data integration • Source content description • Query description • Plan formulation Seminar 2
Conjunctive Queries (CQ’s) • Most common form of query; equivalent to select-project-join (SPJ) queries • Useful for data integration • Form: q(X) :- p1(X1),p2(X2),…,pn(Xn) • Head q(X) represents the query answers • Body p1(X1),p2(X2),…,pn(Xn) represents the query conditions • Each pi(Xi) is called a subgoal • Shared variables represent join conditions • Constants represent “Attribute=const” selection conditions • A relation can appear in multiple predicates (subgoals) Seminar 2
Conjunctive Queries: example • student(name,courseNum), course(number,instructor) SELECT name FROM student, course WHERE student.courseNum=course.number AND instructor=‘Li’; Equal to: ans(SN) :- student(SN, CN), course(CN,’Li’) • Predicates student and course correspond to relations names • Two subgoals: student(SN, CN) and course(CN,’Li’) • Variables: SN, CN. Constant: ‘Li’ • Shared variable, CN, corresponds to “student.courseNum=course.number” • Variable SN in the head: the answer to the query Seminar 2
Answer to a CQ • For a CQ Q on database D, the answer Q(D) is set of heads of Q if we: • Substitute constants for variables in the body of Q in all possible ways • Require all subgoals to be true • Example: ans(SN) :- student(SN, CN), course(CN,’Li’) • Tuples are also called “EDB” (external database) facts: student(Jack, 184), student(Tom,215), …, course(184,Li), course(215,Li), … • Answer “Jack”: SNJack,CN184 • Answer “Tom”: SNTom,CN215 • Answer “Jack”: SNJack,CN215 (duplicate eliminated) Course Student Seminar 2
Query containment • For two queries Q1 and Q2, we say Q1 is contained in Q2, denoted Q1Q2, if any database D, we have Q1(D)Q2(D). • We say Q1 and Q2 are equivalent, denoted Q1Q2, if Q1(D)Q2(D) and Q1(D) Q2(D). • Example: Q1: ans(SN) :- student(SN, CN), course(CN,’Li’) Q2: ans(SN) :- student(SN, CN), course(CN,INS) We have: Q1(D) Q2(D). Seminar 2
Another example Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y) Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y) • We have: Q2 Q1 • Proof: • For any DB D, suppose p(x,y) is in Q2(D). Then there is a w such that r(x,w), b(w,w), and r(w,y) are in D. • For Q1, consider the substitution: X x, W w, Z w, Y y. • Thus the head of Q1 becomes p(x,y), meaning that p(x,y) is also in Q1(D). • In general, how to test containment of CQ’s? • Containment mappings • Canonical databases Seminar 2
Containment mappings • Mapping from variables of CQ Q2 to variables of CQ Q1, such that: • Head of Q2 becomes head of Q1 • Each subgoal of Q2 becomes some subgoal of Q2 • It is not necessary that every subgoal of Q1 is the target of some subgoal of Q2. • Example: Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y) Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y) • Containment mapping from Q1 to Q2: X X, Y Y, W W, Z W • No containment mapping from Q2 to Q1: • For b(W,W) in Q2, its only possible target in Q1 is b(W,Z) • However, we cannot have a mapping WW and WZ, since each variable cannot be mapped to two different variables Seminar 2
Example of containment mappings • Example: C1: p(X) :- a(X,Y), a(Y,Z), a(Z,W) C2: p(X) :- a(X,Y), a(Y,X) • Containment mapping from C1 to C2: X X, Y Y, Z X, W Y • No containment mapping from C2 to C1. Proof: • For the two heads, the mapping must have X X • For a(X,Y) in C2, its target in C1 can only be a(X,Y) (since XX). Thus YY. • However, for a(Y,X) in C2, its target, which must be a(Y,X), does not exist in C1. Seminar 2
Theorem of Containment Mappings • Theorem: Q1 Q2 iff there is a containment mapping from Q2 to Q1. • Notice: the direction is the “opposite” • Proof (“If”): • Suppose is a containment mapping from Q2 to Q1 • For any DB D, let tuple t is in Q1(D) • t is produced by a substitution on the variables of Q1 that makes all Q1’s subgoals facts in D. • Therefore, is a substitution for variables of Q2 that produces t • Thus each t in Q1(D) must be in Q2(D) Q1: p(X) :- G1, G2, … Gk Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y) Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y) Q2: p(X) :- H1, H2, … Hj Seminar 2
Proof (only if) • Key idea: frozen CQ • Use a unique constant to replace a variable • Frozen Q is a DB consisting of all the subgoals of Q, with the chosen constants substituted for variables • This DB is called a “canonical database” of the query. • Example: • Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y) • Frozen Q1: Xreplaced by constant x0, W by constant w0, Z by z0, Y by y0 • Result: DB with {r(x0, w0), b(w0, z0), r(z0, y0)} Seminar 2
Proof (only if) -- cont • Let Q1 Q2. Let D be the frozen Q1. Let be the substitution from those constants to the variables in Q1. • Since we chose a unique constant for each variable, this substitution exists. • Since Q1 Q2 the “frozen” head of Q1 must be in Q2(D). Thus there is a substitution from Q2 to D. • We can show that is a containment mapping from Q2 to Q1 • The head of Q2 is mapped to the head of Q1. • Each subgoal in Q2 is mapped to a subgoal in Q2. Q1: p(X) :- G1, G2, … Gk Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y) Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y) Q2: p(X) :- H1, H2, … Hj Seminar 2
Testing query containment • To test Q1 Q2.: • Get a canonical DB D of Q1. • Compute Q2(D) • If Q2(D) contains the frozen head of Q1, then Q1 Q2. otherwise not. • Testing containment between CQ’s is NP-complete. • Some polynomial-time algorithms exist in special cases. Seminar 2
Extending CQ’s • CQ’s with built-in predicates: • We can add more conditions to variables in a CQ. • Example: student(name, GPA, courseNum), course(number,instructor,year) ans(SN) :- student(SN, G, CN), course(CN,’Li’), G>=3.5 ans(SN) :- student(SN, G, CN), course(CN,’Li’, Y), G>=3.5, Y < 2002 • More results on CQ’s with built-in predicates • Datalog queries: • a (possibly infinite) set of CQ’s with (possibly) recursion • Example: r(Parent, Child) • Query: finding all ancestors of Tom ancestor(P,C) :- r(P, C) ancestor(P,C) :- ancestor(P,X), r(X, C) result(P) :- ancestor(P, ‘tom’) Seminar 2
Further Reading • Jeff Ullman, “Principles of Database and Knowledge Systems,” Computer Science Press, 1988, Volume 2. Seminar 2
Outline • Basics: theories of conjunctive queries • Global-as-view (GAV) approach to data integration • Local-as-view (LAV) approach to data integration Seminar 2
GAV approach to data integration • Readings: • Jeffrey Ullman, Information Integration Using Logical Views, ICDT 1997. • Ramana Yerneni, Chen Li, Hector Garcia-Molina, and Jeffrey Ullman, Computing Capabilities of Mediators, SIGMOD 1999. Seminar 2
Global-as-view Approach med(Dealer,City,Make,Year) = R S Mediator R1(Dealer,City) R2(Dealer, Make, Year) • Mediator exports views defined on source relations med(Dealer,City,Make,Year) = R1 R2 • A query is posted on mediator views: SELECT * FROM med WHERE Year = ‘2001’; ans(D,C,M) :- med(D,C,M,‘2001’) • Mediator expands query to source queries: SELECT * FROM R1, R2 WHERE Year = ‘2001’; ans(D,C,M,Y) :- R1(D,C), R2(D,M,2001) Seminar 2
GAV Approach (cont) • Project: TSIMMIS at Stanford • Advantages: • User queries easy to define • Plan generation is straightforward • Disadvantages: • Not all source information is exported: • What if users want to get dealers that may not the city information? • Those dealers are not “visible.” • Not easily scalable: every time a new source is added, mediator views need to be changed • Research issues • Efficient query execution? • Deal with limited source capabilities? Seminar 2
Limited source capabilities • Complete scans of relations not possible • Reasons: • Legacy databases or structured files: limited interfaces • Security/Privacy • Performance concerns • Example 1: legacy databases with restrictive interfaces title author Given an author, return the books. Ullman DBMS TeX Knuth … … Seminar 2
Another example: Web search forms www.imdb.com Seminar 2
Problems • How to describe source restrictions? • How to compute mediator restrictions from sources? • How to answer queries efficiently given these restrictions? • How to compute as many answers as possible to a query? • … Seminar 2
Describe source capabilities: using attribute adornments. f: free b: bound u: unspecified c[S]: chosen from a list S of constants, e.g., “state” o[S]: optional; if chosen, must be from a list S of constants A search form is represented as multiple templates: (Title, Author, ISBN, Format, Subject) b f u u u 1 f b u u u 1 u u u o[] o[] 2 u u b u u 3 1 2 3 Seminar 2
Computing mediator restrictions • Motivation: do not want users to be frustrated by submitting a query that cannot be answerable by the mediator • Example: • Source 1: book(author, title, price) • Capability: “bff” • I.e., we must provide a title, and can get author and price info • Source 2: review(title, reviewer, rate) • Capability: “bff” • I.e., we must provide a book title, and can get other info • Mediator view: MedView(A,T,P,RV,RT) :- book(A,T,P),review(T,RV,RT) • Query on the mediator view: • Ans(RT) :- MedView(A, ‘db’, P, RV, RT). • I.e., “find the review rates of DB books” • But the mediator cannot answer this query, since we do not know the authors. • We want to tell the user beforehand what queries can be answered Seminar 2
Solutions: Compute mediator capabilities Need algorithms that do the following: • Given • Source relations with restrictions. • Mediator views defined on source relations: • Union • Join • Selection • Projection • Main idea of the algorithms • compute restrictions on mediator views • minimize number of view templates Seminar 2
“Union” views • Assumption: • MedView :- V1V2 • We want to get all tuples from two sources that satisfy a query condition • No mediator post-processing power • Table to compute view adornments • E.g., “f, o[s3] o[s3]” • “c[s2], o[s3] c[s2s3]” • Invalid combination: “b,u -” V2 V1 Seminar 2
“Union” views with postprocessing • Mediator can postprocess results from a source, and check if the results satisfy certain conditions • Thus some entries are more “relaxing” • Essentially: “o” can be treated as “f”, and “u” can be treated as “f” • E.g., “f, o[s3] f” instead of “o[s3]” • “c[s2], o[s3] c[s2]” instead of “c[s2s3]” • “b,u b” instead of “invalid combination” V2 V1 Seminar 2
“Join” views with passing bindings • Assumption: • MedView :- V1 JOIN V2 • The mediator can pass bindings from V1 to V2 • So the join order matters V2 V1 Seminar 2
Other views • Union • Join • Selection • Projection • Multiple views Seminar 2
Concise template description • Some adornments subsume other adornments • E.g.: “f” subsumes “b”, since every query supported by “b” is also supported by “f” • Adornment graph: “subsumption” relationships • Use the graph to “compress” templates: experiments shrank 26 8 templates f n1 n2 Adornment n1 is at least as restrictive as adornment n2 b o n1 n2 Adornment n1 is at least as restrictive as adornment n2, if the constant set of n1 is a subset of that of n2 u c Adornment graph Seminar 2
Outline • Basics: theories of conjunctive queries • Global-as-view (GAV) approach to data integration • Local-as-view (LAV) approach to data integration Seminar 2
Local-as-view (LAV) approach Mediator sources • There are global predicates, e.g., “car,” “person,” “book,” etc. • They can been seen as mediator views • The content of each source is described using these global predicates • A query to the mediator is also defined on the global predicates • The mediator finds a way to answer the query using the source contents Seminar 2
Example Mediator S1(Dealer,City) S2(Dealer,Make,Year) • Global predicates: Loc(Dealer,City),Sell(Dealer,Make,Year) • Source content defined on global predicates: S1(Dealer,City) :- Loc(Dealer,City); S2(Dear,Make,Year) :- Sell(Dear,Make,Year) In general, each definition could be more complicated, rather than direct copies. • Queries defined on global predicates. Q: ans(D,M,Y) :- Loc(D,’irvine’), Sell(D,M,Y) • Users do not know source views. • The mediator decides how to use source views to answer queries. • “Answering queries using views”: ans(D,M,Y) :- S1(D,’irvine’), S2(D,M,Y) Seminar 2
Another LAV Example • Mediator predicates: car(C), sell(Car, Dealer), loc(dealer, city) • Views: • v1(x) :- car(x) • v2(x) :- car(x), sell(x, d) • v3(x,d) :- sell(x, d), loc(d, ’la’) • v4(x) :- sell(x, d), loc(d, ’la’) • Query: q(x) :- car(x), sell(x, d), loc(d, ’la’) Seminar 2
OWA CWA W1 W2 All car tuples W1 = W2 = Open-world assumption (OWA) and Close-world assumption (CWA) W1(Make, Dealer) :- car(Make, Dealer) W2(Make, Dealer) :- car(Make, Dealer) • W1 and W2 have some car tuples. • E.g.: W1 and W2 are from two different web sites. • W1 and W2 have all car tuples. • E.g.: W1 and W2 are computed from the same car table in a database. Seminar 2
Projects using the LAV approach • Projects: Information Manifold, Infomaster, Tukwila, … • Advantages: • Scalable: new sources easy to add without modifying the mediator views • All we need to do is to define the new source using the existing mediator views (predicates) • Disadvantages: • Hard to decide how to answer a query using views Seminar 2
Reading • Alon Halevy, Answering Queries Using Views: A Survey. Seminar 2
Answering queries using views Mediator Query V(D,C,M,Y) :- Loc(D,C),Sell(D,M,Y) • Source views can be complicated: SPJs, arithmetic comparisons,… • Not easy to decide how to answer a query using source views Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y). Rewriting: ans(D,M) :- V(D,‘irvine’, M,Y) • “Equivalent rewriting”: compute the “same” answer as the query • A rewriting can join multiple source views • This problem exists in many other applications: • data warehousing • web caching • query optimizations Seminar 2
Arithmetic comparisons Mediator V(D,C,M,Y):- Loc(D,C),Sell(D,M,Y),Y<1970 • Comparisons can make the problem even trickier • Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y). Rewriting: ans(D,M) :- V(D,‘irvine’, M,Y) Contained rewriting: only retrieve cars before 1970. • Query: ans(D,M) :- Loc(D, 'irvine'), Sell(D,M,Y), Y < 1960 Rewriting: ans(D,M) :- V(D,‘irvine’,M,Y), Y < 1960 Seminar 2
Dropping attributes in views Mediator Drop “Year” in the view: V(D,C,M):- Loc(D,C),Sell(D,M,Y),Y<1970 • A variable in a CQ is called: • “distinguished”: if it appears in the query’s head • “nondistinguished”: otherwise • The problem becomes even harder when we have nondistinguished variables. • Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y), Y<1960 No rewriting! Since we do not have “Year” information. • Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y), Y<1980 Contained rewriting: ans(D,M) :- V(D, ‘irvine’, M) Seminar 2
Problems Query Source views • How to answer a query using views? • We will focus on the case where both the query and views are simply conjunctive. Seminar 2
Query Expansion • For each query P on views, we can expandP using the view definitions, and get a new query, denoted as Pexp, on the base tables. • Pexp can be considered to be the “real” meaning of the query. • Example: • View: V(D,C,M) :- Loc(D,C), Sell(D,M,Y) • A query P using V:ans(D,M) :- V(D,’la’,M) • Expansion:ans(D,M) :- Loc(D,’la’), Sell(D,M,Y) Query P: ans() :- v1(), v2(), …, vk() Expansion Pexp: ans():- p1,1(),…,p1,i1(),…, pk,1(),…,pk,ik() Seminar 2
Rewritings • Given a query Q and a set of views V: • A conjunctive query P is called a “rewriting” of Q using V if P only uses views in V, and P computes a partial answer of Q. That is: Pexp Q. A rewriting is also called a “contained rewriting” (CR). • A conjunctive query P is called an “equivalent rewriting” (ER) of Q using V if P only uses views in V, and P computes the exact answer of Q. That is: Pexp Q. • A query P is called a “maximally-contained rewriting” of Q using V if P is a union of CRs of Q using V, and for any CR P1of Q, the answer to P contains the answer to query P1, that is, P1exp Pexp. • See earlier slides for examples • Notice that all these definitions depend on the language of the rewriting considered. Here we consider “conjunctive queries.” Seminar 2
Focus: MiniCon algorithm • MiniCon Algorithm: Rachel Pottinger and Alon Levy, “A scalable algorithm for answering queries using views,” VLDB 2000. • See also: The Shared-variable-bucket algorithm by Prasenjit Mitra: "An Algorithm for Answering Queries Efficiently Using Views"; in Proceedings of the Australasian Database Conference, Jan 2001. • Formulation: • Input: a conjunctive query Q and a set V of conjunctive views • Output: an maximally-contained rewriting (MCR) of Q using V • Main idea: • For each query subgoal and for each view • Check if the view can be used to “answer” the query subgoal, and if so, in what “form” • Some “shared” variables are treated carefully • Combine views to answer all query subgoals • Reduced to a set-cover problem Seminar 2
Example • Query: q(x) :- car(x), sell(x, d), loc(d, ’la’) • Views: • v1(x) :- car(x) • v2(x) :- car(x), sell(x, d) • v3(x,d) :- sell(x, d), loc(d, ’la’) • v4(x) :- sell(x, d), loc(d, ’la’) Seminar 2