Integration: what it takes to put data together
Ir. Richard Vdovjak, MTD
richardv@win.tue.nl
The Grand Challenge
The Integration System should:
• Provide flexible, homogeneous, and transparent access to several (possibly) distributed, autonomous, and heterogeneous sources.
• Answer queries expressed in terms of the global schema by means of the sources' data.
• Preserve the sources' data semantics.
• Automate the integration process as much as possible.
Why is it so difficult? Those Sources!
• Number of sources / size of the problem
  • the Web is large! New optimization techniques are needed (pipelining, distributed processing)
• Location of the sources / source discovery
  • does a source that supposedly fulfills my info needs even exist?
  • where is it located? (the Semantic Web helps: "google" for SW agents)
• Heterogeneity of the sources
  • syntactic heterogeneity (HTML, XML, RDF, RDBS, ORDBS, etc.)
  • semantic heterogeneity (class Painter =? Creator)
  • designation heterogeneity ("John Smith" =? http://foo-ns/Smith)
Those Sources! 2
• Autonomy of the sources
  • sources may change their data and schemas (especially on the Internet), and they will not notify the integration system about these changes
• Volatility of the sources
  • sources may come and go; the system should be flexible enough to recover from these changes with minimal effort (use flexible ways to map sources to the system, use adaptive query answering techniques)
• Different source capabilities
  • some sources are more intelligent than others, e.g. they allow join-like queries to be performed on them
  • some sources are more restrictive than others, e.g. they require certain input (filling out some forms) before they provide results
Related Fields
• Distributed Databases
  • sources are homogeneous
  • sources are not (so) autonomous
  • similarities at the optimization and execution level
• Information Retrieval
  • keywords only, no semantics
  • IR techniques help in addressing the designation problem
• Data Mining
  • discovering "hidden" properties of data
Approaches to Data Integration
Materialized views:
• global (materialized) schema
• all data pre-fetched into a central repository (this should be automated)
• "easy" querying, similar to querying in VV
• after the data is "in house", the system is independent from the sources
• difficult to keep up-to-date
• typical for intranet applications (e.g. enterprise data integration)
Virtual views:
• global (mediated) schema
• data retrieved on demand (no central repository)
• a query against the global schema is translated into sub-queries and shipped to the sources
• always up-to-date
• typical for most WWW applications, e.g. shopbots and various internet portals
Some DB Theory
Schema, conjunctive queries. For example: give me students together with their advisors who took courses given by their advisors after the 1st trimester of 2003.
DB Theory
SQL:
SELECT Advises.prof, Advises.student
FROM Registered, Teaches, Advises
WHERE Registered.c-number = Teaches.c-number
  AND Registered.trimester = Teaches.trimester
  AND Advises.prof = Teaches.prof
  AND Advises.student = Registered.student
  AND Registered.trimester > "2003/1"
DATALOG (head :- body):
Q(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/1"
DB Theory, query containment
A query Q' is contained in Q if for all databases D, the set of tuples computed for Q' is a subset of those computed for Q, i.e., Q' ⊆ Q iff ∀D: Q'(D) ⊆ Q(D).
A query Q' is equivalent to Q iff Q' ⊆ Q and Q ⊆ Q'.
Example:
Q'(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/2"
is contained in
Q(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/1"
Query containment check
1. Create a canonical database D that is the "frozen" body of Q'.
2. Compute Q(D).
3. If Q(D) contains the frozen head of Q', then Q' ⊆ Q; otherwise not.
Example:
Q': p3(x,y) :- arc(x,z), arc(z,w), arc(w,y)
Q:  path(x,y) :- arc(x,y)
    path(x,y) :- path(x,z), path(z,y)
1. Freeze Q' with, say, 0,1,2,3 for x,z,w,y respectively: Q'(0,3) :- arc(0,1), arc(1,2), arc(2,3), so D = {arc(0,1), arc(1,2), arc(2,3)}
2. Compute Q(D): Q(D) = {(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)}
3. The frozen head of Q' = (0,3) is in Q(D), hence Q' ⊆ Q
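The three steps above can be sketched in a few lines of Python. This is a minimal illustration for the slide's example only: the atom encoding as `(relation, args)` tuples and the names `freeze` and `eval_path` are my own, and the recursive `path` query is evaluated by a naive transitive-closure fixpoint rather than a general Datalog engine.

```python
def freeze(body):
    """Step 1: map each variable of Q' to a distinct constant ("freezing")
    and build the canonical database D from the frozen body atoms."""
    mapping, db = {}, set()
    for rel, args in body:
        frozen = tuple(mapping.setdefault(a, len(mapping)) for a in args)
        db.add((rel, frozen))
    return mapping, db

def eval_path(db):
    """Step 2 for this example's Q: compute the transitive closure of 'arc',
    i.e. the set of tuples in path(D), by naive fixpoint iteration."""
    paths = {args for rel, args in db if rel == "arc"}
    changed = True
    while changed:
        new = {(x, w) for (x, y) in paths for (z, w) in paths if y == z}
        changed = not new <= paths
        paths |= new
    return paths

# Q': p3(x,y) :- arc(x,z), arc(z,w), arc(w,y)
body = [("arc", ("x", "z")), ("arc", ("z", "w")), ("arc", ("w", "y"))]
mapping, D = freeze(body)                    # D = {arc(0,1), arc(1,2), arc(2,3)}
frozen_head = (mapping["x"], mapping["y"])   # frozen head of Q' = (0, 3)

# Step 3: the frozen head is in Q(D), so Q' is contained in Q.
print(frozen_head in eval_path(D))
```

Running the sketch reproduces the slide's conclusion: the frozen head (0, 3) is derived by Q over the canonical database, so Q' ⊆ Q.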
Query reformulation using views
Problem: given a user query Q and view definitions V = {V1..Vn}:
Q' is an Equivalent Rewriting of Q using V iff Q' refers only to views in V and Q' is equivalent to Q.
Q' is a Maximally-contained Rewriting of Q using V iff Q' refers only to views in V, Q' ⊆ Q, and there is no rewriting Q1 such that Q' ⊆ Q1 ⊆ Q and Q1 is not equivalent to Q'.
Bored?
• How about some motivation?
• Why do we bother?
• Is it really worth it?
• Is anybody out there interested in integration?
How Big is Big?
Estimated annual integration costs + data quality costs worldwide: $1 Trillion/year!1,2
...and this is just the beginning. "The Grand Challenge is now becoming mission critical."
1 Dr. M. Brodie, The Grand Challenge of Information Technology, invited talk, CAiSE 2002
2 A trillion has 12 or 18 zeros, depending on which side of the Atlantic you are :)
Approaches to map/describe the Sources
• Global As View (GAV): the mediated schema is defined as a view over the source schemas.
• Local As View (LAV): sources are described as views over the mediated schema.
• Ontological Approach: an explicit integration model instance is created that adheres to an integration model ontology.
Global As View (GAV)
Each relation X of the mediated schema is defined as a view over source relations Y and Z.
Mapping example:
Faculty(name, area) -> Person(id, name), Expert(id, area)
Teaches(name, c-number, trim) -> Person(id, name), Gives(id, c-number, trimester)
Query answering means unfolding the mappings/rules.
Query Answering = Query Unfolding
Query example: who from the area of "IS" was teaching in the trimester "2003/2"?
Q(name) :- Faculty(name, "IS"), Teaches(name, c, "2003/2")
unfolds to
Q(name) :- Person(id, name), Expert(id, "IS"), Gives(id, c, "2003/2")
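GAV unfolding can be sketched as a simple substitution: replace each global-schema atom by the body of its mapping rule, renaming the rule's head variables to the atom's actual arguments. The dictionary encoding below is illustrative only, and, like the slide, it reuses the variable `id` across rules; strictly, each unfolded rule should introduce fresh existential variables.

```python
# GAV rules, transcribed from the mapping example above:
# head variables of each global relation, and the source atoms of its rule.
GAV_HEADS = {
    "Faculty": ("name", "area"),
    "Teaches": ("name", "c-number", "trimester"),
}
GAV_BODIES = {
    "Faculty": [("Person", ("id", "name")), ("Expert", ("id", "area"))],
    "Teaches": [("Person", ("id", "name")), ("Gives", ("id", "c-number", "trimester"))],
}

def unfold(query_body):
    """Replace each global-schema atom by its rule's source atoms,
    substituting the rule's head variables with the atom's arguments."""
    result = []
    for rel, args in query_body:
        subst = dict(zip(GAV_HEADS[rel], args))
        for src_rel, src_args in GAV_BODIES[rel]:
            result.append((src_rel, tuple(subst.get(a, a) for a in src_args)))
    return result

# Q(name) :- Faculty(name, "IS"), Teaches(name, c, "2003/2")
body = [("Faculty", ("name", '"IS"')), ("Teaches", ("name", "c", '"2003/2"'))]
print(unfold(body))
```

The output reproduces the slide's unfolded query: Person(id, name), Expert(id, "IS"), Person(id, name), Gives(id, c, "2003/2").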
Local As View (LAV)
Each data source is described by one or more views wrt the mediated schema. The views might not be complete, i.e., they may contain only a subset of the tuples satisfying their definition.
Local As View (LAV)
Mapping example:
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, trimester)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
User query example:
Q(s,c,p) :- Teaches(p,c,q), Registered(s,c,q), Course(c,t), c > "300", q > "1999/2"
Query Answering = Query Rewriting
• Goal: reformulate a user query (against the mediated schema) into a query that refers directly to the available sources/views.
• I.e., find a maximally-contained rewriting of the user query in the views describing the sources.
• How?
The Bucket Algorithm 1
The main idea: the number of rewritings that need to be considered can be reduced by considering each sub-goal of the query in isolation.
input = set of views
output = maximally-contained rewriting composed from (some of) the views
Informally, a view can be useful for a query if the set of relations it mentions overlaps with that of the query, and it selects some of the attributes selected by the query; in addition, the constraints of the view must be equivalent to or weaker than those of the query.
The Bucket Algorithm 2
Step 1: Create a bucket for each sub-goal of Q and fill it with views/sources that are relevant for that particular sub-goal. A view V is put in the bucket of the relation/sub-goal g if it contains this relation AND the constraints in V are compatible with (equivalent to or weaker than) those in Q.
Step 2: Combine source relations from each bucket into conjunctive queries (each consisting of one conjunct from every bucket). Check containment wrt Q; if contained, add the rewriting to the answer. The result is a union of conjunctive rewritings.
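Step 1 can be sketched as follows. The relation sets, exported attributes, and the incompatibility of V3 are transcribed by hand from the example that follows; a real implementation would derive them from the view definitions and check constraint compatibility automatically. Note the export check: V4 does not export student, so it cannot serve the Registered sub-goal even though it mentions the relation.

```python
# Which mediated-schema relations each view mentions, and which
# attributes its head exports (transcribed from the LAV example).
VIEW_RELATIONS = {
    "V1": {"Registered", "Course"},
    "V2": {"Registered", "Teaches"},
    "V3": {"Registered"},
    "V4": {"Registered", "Course", "Teaches"},
}
VIEW_EXPORTS = {
    "V1": {"student", "c-number", "trimester", "title"},
    "V2": {"student", "prof", "c-number", "trimester"},
    "V3": {"student", "c-number"},
    "V4": {"prof", "c-number", "title", "trimester"},
}
INCOMPATIBLE = {"V3"}  # trimester < "1998/1" contradicts q > "1999/2"

# sub-goal -> (its relation, head variables of Q the view must export)
SUBGOALS = {
    "Teaches(p,c,q)":    ("Teaches",    {"prof", "c-number"}),
    "Registered(s,c,q)": ("Registered", {"student", "c-number"}),
    "Course(c,t)":       ("Course",     {"c-number"}),
}

def fill_buckets():
    """One bucket per sub-goal: views that mention the sub-goal's relation,
    export the needed head variables, and have compatible constraints."""
    return {goal: sorted(v for v in VIEW_RELATIONS
                         if rel in VIEW_RELATIONS[v]
                         and need <= VIEW_EXPORTS[v]
                         and v not in INCOMPATIBLE)
            for goal, (rel, need) in SUBGOALS.items()}

print(fill_buckets())
```

This reproduces the buckets of the worked example: {V2, V4} for Teaches, {V1, V2} for Registered, and {V1, V4} for Course.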
The Bucket Algorithm, Example: Step 1
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, quarter)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
Query constraints: c > "300", q > "1999/2"
Sub-goals: Teaches(p,c,q), Registered(s,c,q), Course(c,t)
Buckets:
Teaches(p,c,q): V2(s',p,c,q), V4(p,c,t,q)
Registered(s,c,q): V1(s,c,q,t'), V2(s,p',c,q)
Course(c,t): V1(s',c,q',t), V4(p',c,t,q')
Variable mappings: {p -> prof, c -> c-number, q -> trimester}, {s -> student, c -> c-number, q -> trimester}, {c -> c-number, t -> title}
The Bucket Algorithm, Example: Step 2
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, quarter)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
Query constraints: c > "300", q > "1999/2"
Sub-goals: Teaches(p,c,q), Registered(s,c,q), Course(c,t)
Buckets: Teaches: V2(s',p,c,q), V4(p,c,t,q); Registered: V1(s,c,q,t'), V2(s,p',c,q); Course: V1(s',c,q',t), V4(p',c,t,q')
Q'(s,c,p) :- V2(s',p,c,q), V1(s,c,q,t'), V1(s',c,q',t)  =  V2(s,p,c,q), V1(s,c,q,t')
Q''(s,c,p) :- V2(s,p,c,q), V4(p,c,t',q)
Q' ⊆ Q and Q'' ⊆ Q, hence Q' ∪ Q'' ⊆ Q
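Step 2 enumerates candidate rewritings by taking one entry from every bucket, i.e. the Cartesian product of the buckets, and then keeps only the candidates that pass the containment check against Q. A minimal sketch of the enumeration, with the containment test elided and the bucket contents hard-coded from Step 1:

```python
from itertools import product

# Buckets from Step 1 of the worked example (view names only).
buckets = {
    "Teaches":    ["V2", "V4"],
    "Registered": ["V1", "V2"],
    "Course":     ["V1", "V4"],
}

# Each candidate rewriting picks exactly one view per bucket;
# a full implementation would now test each candidate for containment in Q.
candidates = [dict(zip(buckets, combo)) for combo in product(*buckets.values())]
print(len(candidates))
```

With two views per bucket this yields 2 × 2 × 2 = 8 candidate conjunctive rewritings, of which only Q' and Q'' survive the containment check in the example.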
GAV versus LAV
GAV: adding/deleting a source means changing the global schema (which is defined by all its rewriting rules); "easy" query reformulation.
LAV: adding/deleting a source means just adding/deleting views; the global schema stays intact; "difficult" query reformulation.
The road doesn't end here; there are still problems to tackle
• Performance issues
  • the size of the problem, remember?
  • network behavior (delay, bursting data)
• Source discovery
• Schema discovery
• Approximate answers
  • a less precise answer is often better than no (precise) answer at all
• Choosing the right sources wrt user demands
  • not everybody has the same evaluation criterion for an answer, especially when multimedia are involved (e.g. most "vivid", highest resolution, fastest, most reliable, etc.)
A Light at the End of the Tunnel?
• Pipelining, parallelism, adaptivity
• Ontologies, Semantic Web, agents
• Emerging standards, e.g. RDF(S)
• Combination of IR techniques and data integration, use of BBN
• Programmable mediators
Programmable Mediators
The main idea: sources are clustered (based on the information they offer), and the order in which they are consulted within a cluster is not "hard-wired" but flexible, based on their annotations.
• An RDFS-based, application-independent integration meta-ontology (IMO) and its application-dependent specialization
• Sources are mapped and annotated as RDF instantiations of the IMO
Programmable Mediators 2
Sources are mapped by articulations to the mediated schema. Each such mapping can be annotated with decorations that rate it for a concrete ordering criterion (or criteria). When the mediator encounters "equivalent" sources, it chooses those with the better "score" on the query-dependent criterion.
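The selection step can be illustrated with a toy sketch. The data structure, source names, and criteria below are entirely hypothetical; the point is only that each articulation carries per-criterion decorations and the mediator picks the best-scoring source for the criterion the query cares about.

```python
# Hypothetical articulations: each maps a source to the mediated schema
# and carries decorations = {criterion: score} (higher is better).
articulations = [
    {"source": "A", "decorations": {"speed": 0.9, "resolution": 0.4}},
    {"source": "B", "decorations": {"speed": 0.5, "resolution": 0.8}},
]

def pick(criterion):
    """Among 'equivalent' sources, prefer the one whose decoration
    scores best on the query-dependent criterion."""
    return max(articulations, key=lambda a: a["decorations"][criterion])["source"]

print(pick("speed"))
print(pick("resolution"))
```

A query that values speed would be routed to source A, while one that values resolution would go to source B; the ordering is data-driven rather than hard-wired.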
The End • Thank you for your attention! • Those who fell asleep: WAKE UP! It’s time to go home :)