INFORMATION INTEGRATION Shengyu Li CS-257 ID-211
Outline • Basic • Capability-Based Optimization • Optimizing Mediator Queries
Basic Movie: Attributes: Appear at the tops of the columns. Describe the meaning of entries in the column below. Example: title, year, length, type
Basic Movie: Schemas: The name of the relation + the set of attributes Example: Movies (title, year, length, type)
Basic Movie: Tuples: The rows of a relation, other than the header row containing the attribute names. A tuple has one component for each attribute of the relation. Example: Tuple 1 has 4 components: Gone With the Wind, 1939, 231, drama, for the attributes title, year, length, and type.
Basic Movie: Projection: The projection operator produces from a relation R a new relation that has only some of R’s columns. Example: π_{title, year, length}(Movies) The resulting relation is:
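A minimal Python sketch of projection, assuming a relation is stored as a list of dictionaries keyed by attribute name. The helper name `project` and the second row are illustrative assumptions, not part of the slides.

```python
def project(relation, attributes):
    """Keep only the named attributes of each tuple, dropping duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:  # relations are sets, so duplicates are removed
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


movies = [
    {"title": "Gone With the Wind", "year": 1939, "length": 231, "type": "drama"},
    # Illustrative extra row (assumption, not from the slides).
    {"title": "Star Wars", "year": 1977, "length": 124, "type": "sciFi"},
]

projected = project(movies, ["title", "year", "length"])
```

Each row of `projected` keeps title, year, and length and no longer has a `type` component.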
Basic • Datalog Rules and Queries: • A relational atom called the head, followed by • The symbol <-, which we often read “if”, followed by • A body consisting of one or more atoms, called subgoals, which may be either relational or arithmetic. Subgoals are connected by AND, and any subgoal may optionally be preceded by the logical operator NOT. • Ex: LongMovie(title, year) <- Movies(title, year, length, type) AND length > 100
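The LongMovie rule can be read as a filter followed by a projection. A hedged Python sketch, assuming Movies is a set of 4-tuples; the function name `long_movie` and the second tuple are illustrative assumptions.

```python
def long_movie(movies):
    """LongMovie(title, year) <- Movies(title, year, length, type) AND length > 100"""
    return {
        (title, year)                       # head: LongMovie(title, year)
        for (title, year, length, _type) in movies   # relational subgoal
        if length > 100                     # arithmetic subgoal
    }


movies = {
    ("Gone With the Wind", 1939, 231, "drama"),
    ("Wayne's World", 1992, 95, "comedy"),  # illustrative tuple (assumption)
}

result = long_movie(movies)
```

Only tuples that satisfy every subgoal in the body contribute to the head relation.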
Capability-Based Optimization • Limited Source Capabilities • Web-based Interfaces • The top 20 sellers? • SELECT * FROM Books
Capability-Based Optimization • Why? • Legacy Sources • Archaic/unique system • Security • “tell me about all your books” • Medical database • Indexes on large databases • Books database infeasible queries
Capability-Based Optimization • Source Capabilities Notation – adornments: • Sequences of codes, one per attribute of the relation, that state what a query may or must specify for that attribute • f (free): the attribute may be specified or not, as we choose • b (bound): a value must be specified for the attribute • u (unspecified): a value may not be specified for the attribute
Capability-Based Optimization • c[S] (choice from set S): a value must be specified, and that value must be one of the values in the finite set S. • o[S] (optional, from set S): we either do not specify a value, or we specify one of the values in the finite set S. • A prime on a code (for example, f') indicates that the attribute is not part of the output of the query.
Capability-Based Optimization • Capabilities Specification: • A set of adornments • to query the source successfully, the query must match one of the adornments in its capabilities specification • For f (free) or o[S], queries with different sets of attributes may match that adornment.
Capability-Based Optimization • Example: Cars (serialNo, model, color, autoTrans, navi) Dealer 1 might allow this data to be queried in two ways: • The user specifies a serial number. All the information about the car with that serial number is produced as output. • The user specifies a model and color, and perhaps whether or not automatic transmission and a navigation system are wanted. All five attributes are printed for all matching cars.
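A hedged sketch of checking a query against a capabilities specification. The encoding of Dealer 1's two query forms as `buuuu` and `ubbo[yes,no]o[yes,no]` is one plausible reading of the description above, not something the slides state. Primes are ignored here because they affect only the output, and the serial number and color values are made up.

```python
# Each code is (kind, allowed_set); allowed_set is used only by "c" and "o".
B, F, U = ("b", None), ("f", None), ("u", None)
YES_NO = ("o", frozenset({"yes", "no"}))


def matches(adornment, values):
    """values holds one entry per attribute: a constant, or None if unspecified."""
    if len(adornment) != len(values):
        return False
    for (kind, allowed), value in zip(adornment, values):
        specified = value is not None
        if kind == "b" and not specified:
            return False                    # bound: must be specified
        if kind == "u" and specified:
            return False                    # unspecified: may not be specified
        if kind == "c" and (not specified or value not in allowed):
            return False                    # choice: required, and must be in S
        if kind == "o" and specified and value not in allowed:
            return False                    # optional: if given, must be in S
        # kind == "f": anything is accepted
    return True


def can_answer(specification, values):
    """A query is answerable if it matches at least one adornment."""
    return any(matches(adornment, values) for adornment in specification)


# Cars(serialNo, model, color, autoTrans, navi); assumed encoding for Dealer 1.
dealer1 = [
    [B, U, U, U, U],
    [U, B, B, YES_NO, YES_NO],
]

by_serial = can_answer(dealer1, ["12345", None, None, None, None])
by_model_color = can_answer(dealer1, [None, "Gobi", "blue", None, "yes"])
by_color_only = can_answer(dealer1, [None, None, "blue", None, None])
```

The third query fails because no adornment lets the user give a color without also giving a model.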
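Capability-Based Optimization • Capability-Based Query-Plan Selection • A capability-based query optimizer: • considers which source queries would help to answer the mediator query • treats those queries as asked and answered, which gives bindings for some more attributes • uses the new bindings to make further source queries possible • This process repeats until either: • Feasible plan: enough source queries have been asked to answer the mediator query • Impossible query: no more valid forms of source query can be constructed, and the mediator query still cannot be answered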
Capability-Based Optimization • Capability-Based Query-Plan Selection • The simplest form of mediator query for which we need to apply this strategy is a join of relations, each of which is available, with certain adornments, at one or more sources. • For such a query, the search strategy is to try to get tuples for each relation in the join, by providing enough argument bindings that some source allows a query about that relation to be asked and answered.
Capability-Based Optimization • Capability-Based Query-Plan Selection • Example: • Autos (serial, model, color) • Options (serial, option) • Adornments: • Autos: ubf • Options: two adornments, bu and uc[autoTrans, navi] • Query: “find the serial numbers and colors of Gobi models with a navigation system”
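One feasible plan for this query first asks Autos for all Gobi cars (allowed by ubf), then asks Options about each returned serial number (allowed by bu). The sketch below simulates that plan. The source contents, the model name "Atlas", and the function names are all made-up assumptions for illustration.

```python
# Made-up source contents (assumptions, not from the slides).
AUTOS = [("s1", "Gobi", "red"), ("s2", "Gobi", "blue"), ("s3", "Atlas", "red")]
OPTIONS = [("s1", "navi"), ("s2", "autoTrans"), ("s3", "navi")]


def query_autos(model, color=None):
    """Adornment ubf: serial may not be given, model must be, color is free."""
    if model is None:
        raise ValueError("Autos requires a binding for model")
    return [t for t in AUTOS
            if t[1] == model and (color is None or t[2] == color)]


def query_options_by_serial(serial):
    """Adornment bu: serial must be given, option may not be."""
    if serial is None:
        raise ValueError("Options (bu) requires a binding for serial")
    return [t for t in OPTIONS if t[0] == serial]


def gobi_with_navi():
    """Plan: Autos by model first, then Options once per serial found."""
    answers = []
    for serial, _model, color in query_autos("Gobi"):
        options = query_options_by_serial(serial)
        # The mediator filters on the option itself, since bu forbids passing it.
        if any(option == "navi" for _serial, option in options):
            answers.append((serial, color))
    return answers


plan_result = gobi_with_navi()
```

An alternative plan would start from Options with the option bound to navi (adornment uc[autoTrans, navi]) and join the returned serial numbers with the Gobi cars at the mediator.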
Capability-Based Optimization • Adding Cost-Based Optimization • Cost-based optimization requires the mediator to know the cost of the queries involved. • Since the sources are usually independent of the mediator, these costs are difficult to estimate.
Optimizing Mediator Queries • Chain algorithm • a greedy algorithm • sends a sequence of requests to its sources • always finds a way to answer the query, provided at least one solution exists • The class of queries that can be handled • involve joins of relations that come from the sources • followed by an optional selection • can be expressed as Datalog rules • To describe a relational algebra
Optimizing Mediator Queries • Simplified Adornment Notation • At the mediator, use only b (bound) and f (free) adornments • A c[S] adornment at a source can be used as soon as we know all possible values of interest for that attribute • Treat o[S] and u as free
Optimizing Mediator Queries • Example: Autos^buu (serial, model, color) Options^uc[autoTrans, navi] (serial, option) • “find the serial numbers and colors of Gobi models with a navigation system” • Answer(s, c) <- Autos^fbf(s, “Gobi”, c) AND Options^fb(s, “navi”)
Optimizing Mediator Queries • Obtaining Answers for Subgoals • Suppose we have a subgoal R^x1x2…xn (a1, a2, …, an), where: • each xi is b or f • R is a relation that can be queried at some source • y1y2…yn is one of the adornments for R at its source • It is possible to obtain a relation for the subgoal provided that, for each i = 1, 2, …, n: • If yi is b or of the form c[S], then xi = b • If xi = f, then yi is not output restricted (i.e., not primed) • We then say that the adornment on the subgoal matches the adornment at the source.
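Optimizing Mediator Queries • Example: • Subgoal: R^bbff(p, q, r, s) • Adornments for R at its sources are: • α1 = fc[S1]uo[S2]: matches, provided q is a member of S1 • α2 = c[S3]bfc[S4]: does not match, because the fourth attribute requires a binding (c[S4]) but s is free in the subgoal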
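The two matching conditions can be checked mechanically. A hedged sketch, assuming a source adornment is a list of `(code, primed)` pairs; the function name is illustrative. Membership of the bound value in S for a c[S] code is left to the caller, as in the remark about q and S1.

```python
def subgoal_matches(subgoal, source):
    """subgoal: string over {'b', 'f'}; source: list of (code, primed) pairs,
    where code is one of 'b', 'f', 'u', 'c', 'o'."""
    if len(subgoal) != len(source):
        return False
    for x, (code, primed) in zip(subgoal, source):
        if code in ("b", "c") and x != "b":
            return False   # the source demands a binding the subgoal lacks
        if x == "f" and primed:
            return False   # a free variable needs the attribute in the output
    return True


# alpha1 = f c[S1] u o[S2], alpha2 = c[S3] b f c[S4]; no attribute is primed.
alpha1 = [("f", False), ("c", False), ("u", False), ("o", False)]
alpha2 = [("c", False), ("b", False), ("f", False), ("c", False)]

alpha1_ok = subgoal_matches("bbff", alpha1)
alpha2_ok = subgoal_matches("bbff", alpha2)
```

Here `alpha1_ok` is true and `alpha2_ok` is false, because the last position of α2 needs a bound value while the subgoal leaves it free.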
Optimizing Mediator Queries • The Chain Algorithm • Greedy approach to select an order in which we obtain relations for each of the subgoals of a Datalog rule. • Not guaranteed to provide the most efficient solution, but it will provide a solution whenever one exists. • In practice, it is very likely to obtain the most efficient solution.
Optimizing Mediator Queries • The Chain Algorithm maintains two kinds of information: • An adornment is maintained for each subgoal. Initially, the adornment for a subgoal has b in a position if and only if the mediator query provides a constant binding for the corresponding argument of that subgoal, as for instance: • Answer(s, c) <- Autos^fbf(s, “Gobi”, c) AND Options^fb(s, “navi”)
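A rough sketch of the greedy loop, under stated assumptions: source adornments are reduced to b and f (so buu for Autos becomes bff and uc[autoTrans, navi] for Options becomes fb), primes are ignored, and a set of bound variables is tracked alongside the subgoal adornments. That set, the `Const` wrapper, and the function names are assumptions of this sketch and are not taken from the slides.

```python
from collections import namedtuple

Const = namedtuple("Const", "value")  # marks a constant argument in a subgoal


def adornment(args, bound_vars):
    """'b' for constants and already-bound variables, 'f' otherwise."""
    return "".join(
        "b" if isinstance(a, Const) or a in bound_vars else "f" for a in args
    )


def matches(subgoal_adornment, source_adornment):
    """Every position the source requires bound must be bound in the subgoal."""
    return all(
        x == "b"
        for x, y in zip(subgoal_adornment, source_adornment)
        if y == "b"
    )


def chain(subgoals, sources):
    """Greedily pick any resolvable subgoal; return the order, or None if stuck."""
    bound_vars = set()
    remaining = list(subgoals)
    order = []
    while remaining:
        for goal in remaining:
            relation, args = goal
            current = adornment(args, bound_vars)
            if any(matches(current, src) for src in sources.get(relation, [])):
                order.append((relation, current))
                # Resolving the subgoal binds all of its variables.
                bound_vars.update(a for a in args if not isinstance(a, Const))
                remaining.remove(goal)
                break
        else:
            return None  # no remaining subgoal matches any source adornment
    return order


sources = {"Autos": ["bff"], "Options": ["fb"]}
query = [
    ("Autos", ["s", Const("Gobi"), "c"]),
    ("Options", ["s", Const("navi")]),
]
order = chain(query, sources)
```

For this query the loop resolves Options first, with adornment fb, which binds s. Autos then has adornment bbf and matches the source's requirement that serial be bound.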
Optimizing Mediator Queries • Consider the mediator query • Q: Answer(c) <- R^bf(1, a) AND S^ff(a, b) AND T^ff(b, c) • There are three sources that provide answers to queries about R, S, and T, respectively: