Optimizing Recursive Information Gathering Plans
Eric Lambrecht, Subbarao Kambhampati, Senthil Gnanaprakasam
Arizona State University, Tempe, USA
rakaposhi.eas.asu.edu/yochan.html

Information Gathering
[Figure: user, Gatherer, wrappers, and sources (db, <html>, cgi)]

EMERAC Query Planning System
• Build query plan using source inversion [Duschka (with Genesereth & Levy) 97]
• Optimization steps:
  • Logical optimizations: redundancy removal
  • Execution optimizations: source call ordering
• Execute query plan

Organization
• Optimization challenges in EMERAC
• Building Source Complete Plans: Review
• Logical optimization
  • Minimization of recursive IG plans by removing redundant source calls
• Execution optimization
  • Ordering source calls to minimize both access and tuple-transfer costs
• Implementation and Results
• Contributions

Modeling Information Gathering
Information sources:
• relational
• answer ‘select’ queries (possibly a restricted set of query patterns)
• autonomous
World model:
• relational
Query on the world model:
• Reformulate the query as calls on information sources. Optimize. Execute.
“Local as View” model

Modeling Sources
Sources are related to the world model by describing them as views over it:
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies($X, Y) -> title-time(X, Y), title-actor(X, Z)   [$X: required binding]
query(X, Y) :- title-time(X, Y)

Optimization challenges in EMERAC
Traditional:
• Each relation is exported in toto by a single database
• All sources are assumed to be fully relational
Information Gathering:
• Multiple sources export partial and overlapping portions of a relation
  • Need to minimize plans to remove redundancy
• Sources are rarely fully relational
  • Only limited types of queries allowed
    • Wrapped web-pages
    • Form-interfaced databases
  • Certain forms of join computation may be precluded
  • Need to model query capabilities

[Continued] Optimization challenges in EMERAC
Traditional:
• Tuple-transfer costs are assumed to dominate the query-execution costs
  • Use of “Bound-is-easier” assumption
• Assume availability of full source statistics
  • Selectivity indices, histograms etc.
Information Gathering:
• Access cost & source latencies tend to equal or dominate the transfer cost
  • Need to consider number of source calls
  • Need for considering bushy joins (instead of just left-linear join trees)
• Full statistics are rarely available about internet sources
  • Sources are decentralized and autonomous
  • Difficult to do systematic optimization

Source Access Limitations
• Sources can have a variety of access limitations
• Form-interfaced databases may require certain attributes to be bound
  • Whitepages may require the name of the person
  • To get the numbers of a set of n people, we will have to access the source n times
• and may be unable to handle bindings of other attributes
  • A Whitepages database may not take the address of a person as a bound attribute
  • To get the number of John Doe, who lives on Lemon St, we will have to get the numbers of all John Does, and locally filter out the ones not living on Lemon Street
• Wrapped web-pages cannot select over any attributes

Representing Source Access Limitations
• Use annotations on the attributes of the source relation
  • “$” annotation identifies attributes that must be bound
  • “%” annotation identifies un-selectable attributes
• S($X, %Y, Z)
  • A form-interfaced web-page that requires bindings for X and is able to do selections only on Z
• $ and % annotations help identify feasible binding patterns for sources (b = bound, f = free, - = either)
  • S^b-- are feasible; S^f-- are infeasible
  • S^bbf must be modeled as S^bff filtered locally with the binding on Y

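The annotation scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `feasible_patterns` and `plan_call`, and the encoding of annotations as a list of "$", "%", or "" per attribute, are hypothetical and not part of EMERAC.

```python
from itertools import product

def feasible_patterns(annotations):
    """List the binding patterns a source accepts directly.

    annotations: one entry per attribute; "$" = must be bound,
    "%" = cannot be selected on, "" = no restriction (hypothetical encoding).
    """
    choices = []
    for a in annotations:
        if a == "$":
            choices.append("b")    # must be bound
        elif a == "%":
            choices.append("f")    # must be left free in the call
        else:
            choices.append("fb")   # either free or bound
    return ["".join(p) for p in product(*choices)]

def plan_call(annotations, wanted):
    """Map a desired pattern to (pattern sent to the source, positions to
    filter locally), or None if a required binding is missing."""
    sent, local = [], []
    for i, (a, w) in enumerate(zip(annotations, wanted)):
        if a == "$" and w != "b":
            return None            # infeasible: required binding missing
        if a == "%" and w == "b":
            sent.append("f")       # source cannot select here ...
            local.append(i)        # ... so filter on this attribute locally
        else:
            sent.append(w)
    return "".join(sent), local

S = ["$", "%", ""]                 # S($X, %Y, Z) from the slide
print(feasible_patterns(S))
print(plan_call(S, "bbf"))
```

For S($X, %Y, Z), a requested pattern bbf is sent to the source as bff, with position 1 (Y) left for local filtering, which is the rewriting the slide describes.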
Properties of optimal information gathering plans
• Source-complete: no other plan returns more information using the available sources
• Source-minimal: a plan from which no information source can be removed while still returning the same answer
• Access-cost minimal: a plan which reduces the number of separate accesses to individual sources
• Bandwidth-minimal: a plan that, when executed, transfers the smallest amount of data over the network yet is still source complete

Ensuring properties of optimal information gathering plans
• Build query plan [Source completeness]
• Logical Optimizations [Source-minimality]
• Execution Optimizations [Access cost and bandwidth minimality]
• Execute query plan

Building Source Complete Plans [Duschka, Genesereth 97]
Query:
query(X, Y) :- title-time(X, Y)
Source descriptions:
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies($X, Y) -> title-time(X, Y), title-actor(X, Z)
Source Inversion Rules:
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan

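As a rough illustration of how the inversion rules on this slide can be produced mechanically, the sketch below generates them as strings from a view description. The function `invert_view` and its parameters are hypothetical names chosen for this example; it assumes every existential variable of the view is given its own function symbol through the `skolems` mapping.

```python
def invert_view(source, head_vars, body, bound=(), skolems=None):
    """Return inverse rules (as strings) for one source description.

    source    : source relation name
    head_vars : variables exported by the source
    body      : list of (world_predicate, argument_list)
    bound     : head variables annotated with "$" (must be bound)
    skolems   : function symbol for each variable not exported by the source
    """
    skolems = skolems or {}
    call = f"{source}({', '.join(head_vars)})"
    # Required bindings are supplied through the dom predicate.
    guards = [f"dom({v})" for v in head_vars if v in bound]
    rhs = ", ".join(guards + [call])

    def term(v):
        # Unexported variables become function terms over the head variables.
        if v in head_vars:
            return v
        return f"{skolems[v]}({', '.join(head_vars)})"

    rules = [f"{pred}({', '.join(term(a) for a in args)}) :- {rhs}"
             for pred, args in body]
    # Every value returned by the source can feed later bound calls.
    rules += [f"dom({v}) :- {rhs}" for v in head_vars if v not in bound]
    return rules

view_body = [("title-time", ["X", "Y"]), ("title-actor", ["X", "Z"])]
for rule in invert_view("house-of-movies", ["X", "Y"], view_body,
                        bound={"X"}, skolems={"Z": "f2"}):
    print(rule)
```

The dom(X) guard in the house-of-movies rules, combined with the dom(Y) rules, is what makes the resulting plan recursive.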
Problems with Plans derived from source inversion rules
• Every source that is remotely relevant to the query is made part of the plan
• Many of these sources may be overlapping
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
• If both movie-hut and house-of-movies have the same information:
  • the two sources are not both necessary
  • the recursion is not necessary

Minimizing information gathering plans
Model source overlaps:
• Use LCW statements
Rewrite the source-complete plan:
• Greedily remove rules from the plan with uniform equivalence and LCW statements (= make the plan source-minimal)
  • Uniform containment checks [Sagiv, 88]
• Use heuristics to guide removal and pull out recursion first

LCW Statements
View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
• To check if one rule, r1, with information source predicates contains another rule, r2, see if r1[s ↦ s ∨ l] contains r2[s ↦ s ∧ v]
• Inter-source subsumption relations [Mirror sources] can also be handled
[Etzioni et al 97], [Duschka 97]

Uniform Equivalence
Equivalence:
• Two datalog programs X and Y are equivalent if, for every instance of the extensional predicates, the two programs produce the same output
• Undecidable
Uniform Equivalence:
• X and Y are uniformly equivalent if, for every instance of the extensional and intensional predicates, the two programs produce the same output
• Decidable
• Implies equivalence
[Sagiv 88]

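Testing for Uniform Containment
Does the program
p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)
uniformly contain the rule
p(W, X) :- r(W, X) ?
• Test: assert r(“W”, “X”) and try to derive p(“W”, “X”)
• Here q(“W”, “X”) follows from the second rule and p(“W”, “X”) from the first, so the containment holds
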
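The freeze-and-derive test on this slide can be sketched as follows. This is a minimal illustration, not the EMERAC code: atoms are tuples such as ("p", "X", "Y"), names starting with an uppercase letter are treated as variables, rules are (head, body) pairs, function terms are not handled, and evaluation is a naive bottom-up fixpoint. All function names are hypothetical.

```python
def is_var(term):
    return term[:1].isupper()

def match(atom, fact, env):
    """Extend env so that atom equals fact, or return None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    env = dict(env)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if env.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return env

def derive(program, facts):
    """Naive bottom-up evaluation of a datalog program to a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            envs = [{}]
            for atom in body:
                envs = [e2 for e in envs for f in facts
                        if (e2 := match(atom, f, e)) is not None]
            for env in envs:
                new = (head[0],) + tuple(env.get(t, t) for t in head[1:])
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

def freeze(atom):
    # Turn each variable into a constant, e.g. W becomes "W" (with quotes).
    return (atom[0],) + tuple(f'"{t}"' if is_var(t) else t for t in atom[1:])

def uniformly_contains(program, rule):
    """Assert the frozen body of rule, then try to derive its frozen head."""
    head, body = rule
    return freeze(head) in derive(program, [freeze(a) for a in body])

program = [(("p", "X", "Y"), [("q", "X", "Y")]),
           (("q", "X", "Y"), [("r", "X", "Y")])]
rule = (("p", "W", "X"), [("r", "W", "X")])
print(uniformly_contains(program, rule))
```

The naive evaluation keeps the sketch short; it is not meant to reflect the cost of the check in practice.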
Greedily Minimizing Information Gathering Plans
Remove non-recursive IDB predicates
Sort the rules so those with dom predicates come before those without dom predicates
for each rule r do
  let r be a rule of P that has not yet been considered
  let P̂ be the program obtained by deleting rule r from P
  if P̂[s ↦ s ∨ l] uniformly contains r[s ↦ s ∧ v] then replace P with P̂
Prune unreachable rules.
• Source costs can be used
• Uniform containment check is exponential in the worst case

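The control flow of the greedy loop can be sketched as below. Several things here are assumptions made for illustration: the containment test is passed in as a callback `contains(rest, rule)`, the LCW and view substitutions and the removal of non-recursive IDB predicates are left out, and the demo uses a stand-in test (`subsumed`) that is sound but incomplete: it only recognizes a rule as redundant when another rule has the same head and a subset of its body. The third demo rule is a made-up redundant rule, not one from the slides.

```python
def has_dom(rule):
    head, body = rule
    return any(atom[0] == "dom" for atom in body)

def prune_unreachable(program, query_pred):
    """Keep only rules whose head predicate is reachable from the query."""
    reachable, frontier = set(), [query_pred]
    while frontier:
        pred = frontier.pop()
        if pred in reachable:
            continue
        reachable.add(pred)
        for head, body in program:
            if head[0] == pred:
                frontier.extend(atom[0] for atom in body)
    return [r for r in program if r[0][0] in reachable]

def greedy_minimize(program, query_pred, contains):
    # Rules using dom come first, so recursion is considered for removal first.
    current = sorted(program, key=lambda r: not has_dom(r))
    for rule in list(current):
        rest = [r for r in current if r != rule]
        if contains(rest, rule):   # the rule adds nothing: drop it
            current = rest
    return prune_unreachable(current, query_pred)

def subsumed(rest, rule):
    """Stand-in containment test (hypothetical, sound but incomplete)."""
    head, body = rule
    return any(h == head and set(b) <= set(body) for h, b in rest)

mh = ("movie-hut", "X", "Y")
plan = [
    (("title-time", "X", "Y"), (mh,)),
    (("dom", "Y"), (mh,)),
    (("title-time", "X", "Y"), (("dom", "X"), mh)),   # hypothetical redundant rule
    (("query", "X", "Y"), (("title-time", "X", "Y"),)),
]
for head, body in greedy_minimize(plan, "query", subsumed):
    print(head, ":-", body)
```

Once the dom-guarded rule is removed, the dom rule is no longer reachable from the query, so pruning drops it as well and the recursion disappears.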
Minimization example
Plan:
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
LCW statement:
movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)

EMERAC
• Build query plan [Source completeness]
• Logical Optimizations [Source-minimality]
• Execution Optimizations [Access cost and bandwidth minimality]
• Execute query plan

Issues in ordering source calls
• Execution cost is a function of both the access cost and the tuple-transfer cost (ignoring local processing costs…)
• Tension between access costs & traffic costs
  • E.g. Execute “S1(W,X) & S2(X,Y)” where the query binds W
• Tuple-transfer cost reduction motivates calling sources with the least general binding patterns possible
  • Bound-is-easier (S1 first, and then feed X bindings to S2)
• Access cost reduction motivates calling sources with the most general binding patterns possible
  • Feeding X bindings to S2 will generate many separate accesses, increasing the access cost

Our Approach: Assumptions
• Exact optimization is not worth it…
  • Lack of full source statistics
  • NP-hardness of the optimization problem
    • Join-ordering, which is a special case, is already NP-Complete
• Source access costs dominate tuple-transfer costs by default
  • Reasonable given the large setup and latency costs for internet sources

Our Approach: Overview
• A greedy approach (along the lines of “bound-is-easier” type procedures)
• By default, attempts to access each source with the most general feasible binding pattern
  • Reasonable given the assumption that access costs dominate transfer costs
• The default is overridden if a binding pattern is known to produce too much traffic
  • Binding patterns producing high traffic are stored in a table called HTBP
• Implicitly produces bushy join trees

The HTBP Table
• The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce “high” traffic
• A call to source S with binding pattern B is considered high-traffic producing if HTBP contains S^B’ and B is either equal to or more general than B’
• E.g. Book(Author, Title, ISBN, Subj, Price, Pages)
  • HTBP may contain all binding patterns that do not bind at least one of the first four attributes
  • Book^ffffbb listed explicitly in HTBP
  • Book^fffffb, Book^ffffbf and Book^ffffff would be considered to be implicitly in HTBP
• Advantage: HTBP should be easy to specify even if full source statistics are not available

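The lookup rule above can be written down directly. In this sketch a binding pattern is a string over "b" and "f", and the HTBP table is a dictionary from source name to its explicitly listed patterns; both the encoding and the function names are assumptions made for illustration.

```python
def at_least_as_general(pattern, listed):
    """True if pattern equals listed or binds only a subset of the
    positions that listed binds (i.e. it is more general)."""
    return len(pattern) == len(listed) and all(
        p == "f" or q == "b" for p, q in zip(pattern, listed))

def is_high_traffic(htbp, source, pattern):
    """A call is high-traffic if some listed pattern B' of the source is
    equal to, or less general than, the pattern used in the call."""
    return any(at_least_as_general(pattern, listed)
               for listed in htbp.get(source, ()))

# Book(Author, Title, ISBN, Subj, Price, Pages), with one explicit entry.
htbp = {"Book": ["ffffbb"]}
for pattern in ["ffffbb", "fffffb", "ffffbf", "ffffff", "bfffff"]:
    print(pattern, is_high_traffic(htbp, "Book", pattern))
```

Storing only the least general high-traffic patterns keeps the table small, since every more general pattern is covered implicitly.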
The Algorithm
For each stage i from 1 to m do
  For each unchosen subgoal S
    Pick the most general & feasible BP B of S w.r.t. V & FBP such that B is not in HTBP
    If such a B exists   [Default case: Reduce accesses]
      Push S^B into C[i]. Mark S chosen. Add all variables of S to V
    If no such B exists, but there is a feasible binding pattern for S
      Pick the BP B’ with most bound variables (in terms of #(.))
      Push S^B’ into P[i]
  If no subgoal has been chosen at this level (C[i] is empty), and there are some postponed sources (P[i] is non-empty)   [HTBP case: Reduce transfer costs]
    Choose Sk^B in P[i] with the maximum #(B) value
    Push Sk^B into C[i]
    Add all variables of Sk to V
Return the array C[1…m]

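The staged procedure above might be sketched as follows, under several assumptions that are not in the slides: subgoals are (source, argument-tuple) pairs, FBP is a dictionary of feasible pattern strings per source, "most general" is approximated by "fewest bound positions" with ties broken alphabetically, V is updated as soon as a subgoal is chosen, and the loop stops once every subgoal has been placed. The name `order_source_calls` is hypothetical. The demo uses the DP/SM98 sources from the example slide.

```python
from itertools import product

def more_general_or_equal(pattern, listed):
    return all(p == "f" or q == "b" for p, q in zip(pattern, listed))

def order_source_calls(subgoals, bound_vars, fbp, htbp):
    """Return stages C[1..]; each stage is a list of (source, pattern)."""
    V = set(bound_vars)
    unchosen = list(subgoals)
    stages = []
    while unchosen:
        chosen, postponed = [], []
        for goal in list(unchosen):
            source, args = goal
            # Feasible patterns that bind only variables already in V.
            cands = [p for p in fbp[source]
                     if all(c == "f" or a in V for c, a in zip(p, args))]
            ok = [p for p in cands
                  if not any(more_general_or_equal(p, h)
                             for h in htbp.get(source, ()))]
            if ok:
                # Default case: most general pattern, to reduce accesses.
                best = min(ok, key=lambda p: (p.count("b"), p))
                chosen.append((source, best))
                unchosen.remove(goal)
                V.update(args)
            elif cands:
                # Only high-traffic patterns remain: postpone this subgoal.
                most_bound = max(cands, key=lambda p: (p.count("b"), p))
                postponed.append((goal, most_bound))
        if not chosen:
            if not postponed:
                raise ValueError("no feasible binding pattern left")
            # HTBP case: take the most bound call, to reduce transfer costs.
            goal, pattern = max(postponed, key=lambda gp: gp[1].count("b"))
            chosen.append((goal[0], pattern))
            unchosen.remove(goal)
            V.update(goal[1])
        stages.append(chosen)
    return stages

subgoals = [("DP", ("A", "T", "Y")), ("SM98", ("T", "U"))]
fbp = {"DP": ["".join(p) for p in product("fb", repeat=3)],
       "SM98": ["".join(p) for p in product("fb", repeat=2)]}
for htbp in ({"DP": ["bbb"], "SM98": ["bb"]}, {"DP": ["ffb"]}, {}):
    print(htbp, order_source_calls(subgoals, {"Y"}, fbp, htbp))
```

Under these assumptions the three HTBP settings yield the same choices as the example slide: DP^ffb then SM98^bf, SM98^ff then DP^fbf, and both sources in a single stage with their most general patterns.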
Example
• Sources: DP(A:Author, T:Title, Y:Year), SM98(T:Title, U:URL)
• Query: Q(A, T, U, 1998)
• Plan: Q(A, T, U, 1998) :- DP(A, T, 1998) & SM98(T, U)
(XX marks a candidate crossed out on the slide)
HTBP: {DP^bbb, SM98^bb}
  Step 1. V = {Y}  Cand: DP^fff, DP^ffb, SM98^ff (all XX)  P[1] = {DP^ffb, SM98^ff}  C[1] = DP^ffb
  Step 2. V = {A, T, Y}  Cand: SM98^ff, SM98^bf (both XX)  P[2] = {SM98^bf}  C[2] = SM98^bf
HTBP: {DP^ffb}
  Step 1. V = {Y}  Cand: DP^fff, DP^ffb, SM98^ff (two XX)  C[1] = SM98^ff
  Step 2. V = {Y, U, T}  Cand: DP^fff, DP^ffb, DP^fbf, DP^fbb (three XX)  C[2] = DP^fbf
HTBP: {}
  Step 1. V = {Y}  Cand: DP^fff, DP^ffb, SM98^ff  C[1] = SM98^ff, DP^fff
[Bound-is-easier]

Implementation
The Emerac Information Gatherer
• written in Java
• incorporates rewriting and execution ordering techniques
• executes plans in parallel
• returns partial results during plan execution
• object-oriented design makes it easy to modify

Experiments
• Experimented with simulated sources derived from DBLP data
• Our minimization approach reduces access costs by removing redundant recursive sources
  • Minimization cost is offset by the improvements in execution time
• Our source ordering approach tended to reduce the total cost compared with the bound-is-easier approach whenever there was a significant number of binding patterns not subsumed by HTBP

Contributions
• An approach for minimizing recursive information gathering plans
• An approach for ordering source calls in information gathering plans
  • Attempts at minimizing both access cost and tuple-transfer cost
• Implementation & Evaluation in EMERAC

Current directions
• Integrate minimization & source-call ordering phases
• Model cost-quality tradeoffs
• Handling run-time exceptions
  • unavailability of sources etc.
• Tracking time and solution quality statistics
• Improve the granularity of the HTBP table