810 likes | 1.03k Views
Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008 Matteo Palmonari Query Processing in Data Integration .
E N D
Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008Matteo PalmonariQuery Processing in Data Integration Materiale organizzato da presentazioni di: L. Tanca, S. Costantini, J. Sun & L. Zhao, M. Van Der Wielen, Ir. R. Vdovjak, Kambhampati & Knoblock, S. Bergamaschi
An introduction to data integration Prof. Letizia Tanca Politecnico di Milano Technologies for Information Systems
addall.com ? Information Integration barnes&noble.com A1books.com amazon.com half.com An Online Shopper’s Information Integration Problem El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” “One-World” Mediation
What is a Data Integration System? A system providing: • Uniform (same query interface to all sources) • Access to (queries; eventually updates too) • Multiple (we want many, but 2 is hard too) • Autonomous (DBA doesn’t report to you) • Heterogeneous (data models are different) • Structured (or at least semi-structured) • Data Sources (not only databases).
Virtual Integration Architecture User queries Mediated schema Reformulator Mediator: Optimizer Data source catalog Execution engine wrapper wrapper wrapper Data source Data source Data source Sources can be: relational, hierarchical (IMS), structured files, web sites.
Query Model in Virtual Integration • User formulates query in terms of his/her ontology on the mediated (or “global”) schema • System reformulates queries in terms of sub-queries for each source (“local” schema) • Structure of the query model should be more intuitive for the user
Reformulation Problem • Given: • A query Q posed over the mediated schema • Descriptions of the data sources • Find: • A query Q’ over the data source relations, such that: • Q’ provides only correct answers to Q, and • Q’ provides all possible answers from to Q given the sources.
Mapping between the global logical schema and the single source schemata (logicalview definition) • Two basic approaches • GAV (Global As View) • LAV (Local As View) • Can be used also in case of different data models • In that case a model transformation is required
GAV (Global As View) • Up to now we supposed that the global schema be derived from the integration process of the data source schemata • Thus the global schema is expressed in terms of the data source schemata • Such approach is called the Global As View approach
The other possible ways… LAV (Local As View) • The global schema has been designed independently of the data source schemata • The relationship (mapping) between sources and global schema is obtained by defining each data source as a view over the global schema GLAV (Global and Local As View) • The relationship (mapping) between sources and global schema is obtained by defining a set of views, some over the global schema and some over the data sources
Mapping between data sources and global schema • Global schema G • Source schemata S • Mapping M between sources and global schema: a set of assertions qS qG qG qS Intuitively, the first assertion specifies that the concept represented by a view (query) qS over a source schema S corresponds to the concept specified by qG over the global schema. Viceversa for the second assertion.
Mapping between data sources and mediated schema • A data integration system is a triple (G, S, M) • The query to the integrated system are posed in terms of G and specify which data of the virtual database we are interested in • The problem is understanding which real data (in the data sources) correspond to those virtual data
GAV • A GAV mapping is a set of assertions, one for each element g of G g qS That is, the mapping specifies g as a query qS over the data sources. This means that the mapping tells us exactly how the element g is computed. • OK for stable data sources • Difficult to extend with a new data source
GAV example SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Category(ID, Name, Description) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) SOURCE 2 Product(Code, Name, Size, Color, Description, Type, Price, Q.ty) Tipe(TypeCode, Name, Description) n.b.: WE DO NOT CARE ABOUT DATA TYPES…
SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) SOURCE 2 Product(Code, Name, Size, Color, Description, Type, Price, Q.ty) GLOBAL SCHEMA CREATE VIEW GLOB-PROD AS SELECT Code AS PCode, VersionCode as VCode, Version.NAme AS Name, Size, Color, Version.Description as Description, CatID, Version.Price, Stock FROM SOURCE1.Product, SOURCE1.Version WHERE Code = ProductCode UNION SELECT Code AS PCode, null as VCode, Name, Size, Color, Description,Type as CatID, Price, Q.ty AS Stock FROM SOURCE2.Product
GAV • Suppose now we introduce a new source • The simple view we have just created is to be modified • In the simplest case we only need to add a union with a new SELECT-FROM-WHERE clause • This is not true in general, view definitions may be much more complex
GAV • Quality depends on how well we have compiled the sources into the global schema through the mapping • Whenever a source changes or a new one is added, the global schema needs to be reconsidered • Query processing can be based on some sort of unfolding • Example: one already seen
LAV The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema. Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.
LAV A mapping LAV is a set of assertions, one for each element s of each source S s qG Thus the content of each source is characterized in terms of a viewqG over the global schema • OK if the global schema is stable, e.g. based on a domain ontology or an enterprise model • It favours extensibility • Query processing much more complex
LAV • Quality depends on how well we have characterized the sources • High modularity and extensibility (if the global schema is well designed, when a source changes or is added, only its definition is to be updated) • Query processing needs reasoning
SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) SOURCE 2 Product(Code, Name, Size, Color, Description, Type, Price, Q.ty) GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock) In this case we have to express the sources as views over the global schema
GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock) SOURCE 2 Product (Code, Name, Size, Color, Description, Type, Price, Q.ty) CREATE VIEW SOURCE2.Product AS SELECT PCode AS Code, Name, Size, Color, Description, CatID as Type, Price, Stock AS Q.ty FROM GLOB-PROD
GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock) SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) CREATE VIEW SOURCE1.Product AS SELECT Pcode AS Code, ?Name?, ?Description?,?Warnings ?, ?Notes?, CatID FROM GLOB-PROD CREATE VIEW SOURCE1. Version AS SELECT Pcode AS ProductCode, VCode as VersionCode, Size, Color, Name, ? Description ? , Stock, Price) FROM GLOB-PROD N.B.: Some information is lacking: either we don’t know where description has to be taken from, or there is no correspondent value (e.g.Warnings, Notes, etc..). A difficult job
The GLAV approach • Is a combination of LAV and GAV. Part of the mappings are of LAV type, part of GAV type. • It combines the advantages of the two approaches.
“View” Refresher Virtual vs Materialized Differences minor for data aggregation… Approaches for relating source & Mediator Schemas • Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations • Local-as-view (LAV): express the source relations as views over the mediated schema. • Can be combined…? Let’s compare them in a movie Database integration scenario.. Information Integration on the Web (MA-1)
Global-as-View Express mediator schema relations as views over source relations Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1[S1(title,dir,year,genre)] union select * from S2[S2(title, dir,year,genre)] union[S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title Information Integration on the Web (MA-1)
Global-as-View Express mediator schema relations as views over source relations Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union[S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title Mediator schema relations are Virtual views on source relations Information Integration on the Web (MA-1)
Global-as-View: Example 2 Express mediator schema relations as views over source relations Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2 Null values Information Integration on the Web (MA-1)
Global-as-View: Example 2 Express mediator schema relations as views over source relations Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create View Movie AS select NULL, NULL, NULL, genre from S4 Create View Schedule AS select cinema, NULL, NULL from S4. But what if we want to find which cinemas are playing comedies? “Lossy Mediation” Information Integration on the Web (MA-1)
Create Source S1 AS select * from Movie Create Source S3 AS select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy” S1(title,dir,year,genre) S3(title,dir) S5(title,dir,year), year >1960 Sources are “materialized views” of mediator schema Local-as-View: example 1 Express source schema relations as views over mediator relations Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Information Integration on the Web (MA-1)
Create Source S1 AS select * from Movie Create Source S3 AS select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy” Creat Source S4 AS select cinema, genre from Movie m, Schedule s where m.title=s.title Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Express source schema relations as views over mediator relations S1(title,dir,year,genre) S3(title,dir) S5(title,dir,year), year >1960 S4(Cinema,Genre) Information Integration on the Web (MA-1)
Local-as-View: Example 2 Express source schema relations as views over mediator relations Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create Source S4 select cinema, genre from Movie m, Schedule s where m.title=s.title Now if we want to find which cinemas are playing comedies, there is hope! Information Integration on the Web (MA-1)
GAV vs. LAV Source S4: S4(cinema, genre) Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Lossy mediation Information Integration on the Web (MA-1)
Not modular Addition of new sources changes the mediated schema Can be awkward to write mediated schema without loss of information Query reformulation easy reduces to view unfolding (polynomial) Can build hierarchies of mediated schemas Best when Few, stable, data sources well-known to the mediator (e.g. corporate integration) Garlic, TSIMMIS, HERMES, MOMIS Modular--adding new sources is easy Very flexible--power of the entire query language available to describe sources Reformulation is hard Involves answering queries only using views (can be intractable—see below) Best when Many, relatively unknown data sources possibility of addition/deletion of sources Information Manifold, InfoMaster, Emerac, Havasu GAV vs. LAV Information Integration on the Web (MA-1)
Views: Sound & Completeness • The terms sound, exact and complete are used to express to what degree the extent of a view corresponds to its definition. • A view defined over some data source is sound if it provides a subset of the available data in the data source that corresponds to the definition. • It delivers only, but not necessarily all answers to its definition. The answers it does deliver might be incomplete, but they are correct (sound). • If a view is complete, it provides a superset of the available data in the data source that corresponds to the definition. It delivers all answers to its definition, and maybe more. • Since the set of answers might contain more than the answers corresponding to the views definition, but does contain all answers corresponding to the views definition, the term complete is being used. • A view is exact if it provides all and only data corresponding to the definition.
Query answering in GAV Query answering in the Global-as-View happens through query unfolding. Unfolding queries to the local sources is relatively easy compared to the Local as-View approach in which queries posted to the global schema have to be rewritten before being able to answer the query.
Query answering in GAV • Because each global entity is defined as a view over the local schemas, the total number of GAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the global schema. • the extent of a global schema defined by GAV-rules is assumed to be exact.
Query answering in LAV The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema. Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.
Query answering in LAV • Each entity in the local schemas is defined as a view over the global schema. Therefore, the total number of LAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the local schemas. • Processing queries in the Local-as-View approach is difficult because the only knowledge of the global-schema available is through the views representing the sources. Such view only provides partial information about the data. Since the mapping associated to each source as a view over the global schema it is not clear how to use the sources in order to answer queries expressed over the global schema. Therefore extracting information from an integration system using the Local-as-View approach is a complex task because one has to answer queries with incomplete information. • Views defined by LAV-rules might be sound or complete, in terms of 3.1.
Introduction • Traditional Integration Formalisms • Global as View (GAV) • Mediated schema as views over data sources • Local as View (LAV) • Data sources as views of mediated schema GAV: T :- S1, S2, S3 LAV: S1 T Med. Schema T S1 S2 S3
Schema Conjunctive queries For example: Give me students together with their advisors who took courses given by their advisors after the 1st trimester 2003. Some DB Theory
SQL SELECT Advises.prof, Advises.student FROM Registered, Teaches, Advises WHERE Registered.c-number = Teaches.c-number and Registered.trimester=Teaches.trimester and Advises.prof=Teaches.prof and Advises.student=Registered.student and Registered.trimester > “2003\1” DATALOG Q(prof, student) :- Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1” DB Theory head body
Problem Definition • GAV-like Definition – Definition in Datalog 9DC : SkilledPerson(PID, “Doctor”): - H :Doctor(SID, h, l, s, e) 9DC : SkilledPerson(PID, “EMT”) : - H : EMT(SID, h, vid, s, e) 9DC : SkilledPerson(PID, “EMT”) : - FS : Schedule(PID, vid), FS : FirstResponse(vid, s, l, d), FS : Skills(PID, “medical”)
Problem Definition • LAV-like Definition – Inclusion Definition LH : CritBed(bed, hosp, room, PID, status) H : CritBed(bed, hosp, room), H : Patient(PID, bed, status) LH : EmergBed(bed, hosp, room, PID, status) H : EmergBed(bed, hosp, room), H : Patient(PID, bed, status)
Deductive Databases • Tables viewed as predicates. • Ops on tables expressed as “datalog” rules • (Horn clauses, without function symbols) Enames(Name) :- Employe(Name, SSN) [Projection] Wealthy-Employee(Name):- Employee(Name,SSN), Salary(SSN,Money),Money> 10 [Selection] Ed(Name, Dname):- Employee(Name, SSN), E_Dependents(SSN, Dname) [Join] ERelated(Name,Dname) :- Ed(Name,Dname) ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2) [Recursion]
More datalog terminology A datalog program is a set of datalog rules. A program with one rule is a conjunctive query. We distinguish EDB predicates and IDB predicates • EDB’s are stored in the database, • appear only in rule bodies • IDB’s are intensionally defined, • appear in both bodies and heads.
Approaches to Specifying Schema Descriptions • Global-as-View: express the mediated schema relations as a set of views over the data source relations • Local-as-View: express the source relations as views over the mediated schema.
A query Q’ is contained in Q if for all databases D, the set of tuples computed for Q’ is a subset of those computed for Q, i.e., Q’ Q iff D Q’(D) Q(D) A query Q’ is equivalent to Q iff Q’ Q and Q Q’ Example: Q’(prof, student) :- Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\2” is contained in Q(prof, student) :- Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1” DB Theory, query containment
Create a canonical database D that is the “frozen” body of Q’ Compute Q(D) If Q(D) contains the frozen head of Q’ then Q’ Q otherwise not Example: Q’: p3(x,y) :- arc(x,z),arc(z,w),arc(w,y) Q: path(x,y) :- arc(x,y) path(x,y) :- path(x,z), path(z,y) 1. Freeze Q’ with say 0,1,2,3 for x,z,w,y respectively: Q’(0,3) :- arc(0,1), arc(1,2), arc(2,3) D={arc(0,1), arc(1,2), arc(2,3)} 2. Compute Q(D): Q(D)={(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)} 3. The frozen head of Q’ = (0,3) is in the Q(D) Q’Q Query containment check
Problem: Given a user query Q and view definitions V={V1..Vn} Q’ is an Equivalent Rewriting of Q using V iff Q’ refers only to views in V, Q’ is equivalent to Q Q’ is a Maximally-contained Rewriting of Q using V iff Q’ refers only to views in V, Q’ Q, there is no such rewriting Q1 that Q’ Q1 Q, Q1 Q Query reformulation using views