Cost-based Query Scrambling for Initial Delays Tolga Urhan Michael J. Franklin Laurent Amsaleg
Introduction: • Problem: Remote data access (e.g. from the Web) in query processing introduces unpredictable delays. • Traditional query scrambling, based on simple heuristics, is susceptible to bad scrambling decisions. • Three different approaches to using query optimization for scrambling are proposed. • The result: more intelligent scrambling decisions.
Query Scrambling Overview: • Goal: To hide the delays encountered when obtaining data from remote sources by performing other useful work. • Consists of two phases: • Rescheduling phase • Operator Synthesis phase • Objective function of optimization based either on • total work • or response time
Goals of this paper: • Focus on the problem of initial delay: • delay in receiving the first tuple from a particular remote source. • Due to • difficulty in establishing a connection to a remote source, • heavy load of the remote source • large amount of work remote source needs to perform (lack of global optimization) • Investigate both the use of response time-based and total work-based optimization for query scrambling
Cost-based Query Scrambling: • Assumption: • Query execution environment consists of a query site and remote data sources. • Processing work occurs at both the query site and the remote sites. • Example of a complex query tree: [Figure: query tree; labels: Communication Link, Query Result, Select, Join, relations A, B, C, D, E]
Iterative Process of Query Scrambling: • (1) Rescheduling: the execution plan of a query is dynamically rescheduled when a delay is detected. • (2) Operator Synthesis: new operators can be created when there are no other operators that can execute. • E.g., when the query stalls waiting for A tuples: (1) retrieve B tuples, execute D join E (2) execute a new join between B and (D join E) join C
Cost-based Rescheduling: • Identify runnable subtrees: subtrees made up entirely of nonblocked operators. • Runnable subtrees can be scheduled out of order by inserting materialization operators at their roots. • Materialization operators: they issue Open, Next, and Close calls to the root of the subtree and save the results in a temporary relation.
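A minimal sketch of how runnable subtrees might be identified, assuming a simple in-memory operator tree. The `Op` class, its `blocked` flag, and the shape of the example tree are hypothetical; the slides do not give the actual data structures or the exact tree from the figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Hypothetical plan node; `blocked` marks an operator waiting on delayed data."""
    name: str
    blocked: bool = False
    children: List["Op"] = field(default_factory=list)


def is_runnable(op: Op) -> bool:
    """A subtree is runnable when its root and all descendants are nonblocked."""
    return not op.blocked and all(is_runnable(c) for c in op.children)


def maximal_runnable_subtrees(root: Op) -> List[Op]:
    """Return roots of runnable subtrees not contained in a larger runnable subtree."""
    if is_runnable(root):
        return [root]
    found: List[Op] = []
    for child in root.children:
        found.extend(maximal_runnable_subtrees(child))
    return found


# Assumed example tree: ((A join B) join C) join (D join E), with A delayed.
a, b, c, d, e = Op("A", blocked=True), Op("B"), Op("C"), Op("D"), Op("E")
plan = Op("join", children=[
    Op("join", children=[Op("join", children=[a, b]), c]),
    Op("D join E", children=[d, e]),
])
runnable_names = [op.name for op in maximal_runnable_subtrees(plan)]
print(runnable_names)
```

In this sketch a materialization operator would be placed above each returned root, so that the subtree can run out of order and its output can be saved to a temporary relation.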
Cost-based Rescheduling: • Selection of runnable subtrees to execute: • Traditional way: the “maximal” one. • Cost-based way: the one with maximal efficiency, (P - MR)/(P + MW) • MW: cost of writing the materialized temporary result to disk • MR: cost of reading the temporary result from disk • P: cost of executing the subtree • efficiency: improvement in response time per unit of scrambling execution • Another iteration of rescheduling is started, until the delayed data has arrived.
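The efficiency metric above can be illustrated with a few lines of Python. The cost figures are invented for illustration only, and the function names are assumptions rather than anything defined in the slides.

```python
from typing import Dict, Tuple


def efficiency(p: float, mr: float, mw: float) -> float:
    """(P - MR) / (P + MW): response-time improvement per unit of scrambling work."""
    return (p - mr) / (p + mw)


def pick_subtree(candidates: Dict[str, Tuple[float, float, float]]) -> str:
    """Pick the runnable subtree with maximal efficiency.

    `candidates` maps a subtree name to its (P, MR, MW) cost estimates.
    """
    return max(candidates, key=lambda name: efficiency(*candidates[name]))


# Made-up cost estimates for two runnable subtrees.
candidates = {
    "scan B": (10.0, 4.0, 5.0),     # (10 - 4) / (10 + 5) = 0.4
    "D join E": (40.0, 5.0, 10.0),  # (40 - 5) / (40 + 10) = 0.7
}
chosen = pick_subtree(candidates)
print(chosen)
```

The numerator is the work saved later (the subtree no longer has to run, but its result must be read back), and the denominator is the work spent now (running the subtree and writing its result).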
Cost-based Operator Synthesis: • Second phase starts when no more progress can be made in phase 1. • Three approaches of optimization strategies: • Pair • (IN) Include Delayed • (ED) Estimated Delay
Pair: total work-based optimizer • Constructs a query plan containing only a single join of two unblocked relations. • Analyzes each pair of unblocked relations that share a join predicate. • Chooses the join with the least total cost to execute. • Materializes the result of the join to disk. • Avoids Cartesian products, and avoids joins whose results take longer to read from disk than to compute from scratch.
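One iteration of the Pair policy could be sketched as follows. This is an illustration under stated assumptions: `join_cost` and `read_cost` are hypothetical cost-estimator callbacks, and the relation names and numbers are invented.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Optional, Set, Tuple


def choose_pair_join(
    unblocked: Set[str],
    predicates: Set[FrozenSet[str]],
    join_cost: Callable[[str, str], float],
    read_cost: Callable[[str, str], float],
) -> Optional[Tuple[str, str]]:
    """Return the cheapest qualified join of two unblocked relations, or None."""
    best: Optional[Tuple[float, Tuple[str, str]]] = None
    for a, b in combinations(sorted(unblocked), 2):
        if frozenset((a, b)) not in predicates:
            continue  # no shared join predicate: this would be a Cartesian product
        cost = join_cost(a, b)
        if read_cost(a, b) > cost:
            continue  # result is slower to read back than to recompute
        if best is None or cost < best[0]:
            best = (cost, (a, b))
    return None if best is None else best[1]


# Invented example: A is delayed; B, C, D, E are available.
predicates = {frozenset(("B", "C")), frozenset(("D", "E"))}
costs = {("B", "C"): 30.0, ("D", "E"): 12.0}
reads = {("B", "C"): 6.0, ("D", "E"): 3.0}
pair = choose_pair_join(
    {"B", "C", "D", "E"},
    predicates,
    join_cost=lambda a, b: costs[(a, b)],
    read_cost=lambda a, b: reads[(a, b)],
)
print(pair)
```

A `None` result corresponds to the case where no qualified join exists and the policy waits for the delayed data.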
Pair continued: • At the end of each join, checks for the arrival of delayed data. If not arrived, do another iteration • If no qualified joins exist, wait for delayed data to arrive • Reconstruction phase: • when all blocked relations become available, need to construct a single query tree • necessary, since Pair policy works only on pairs of relations and does not maintain a complete query plan
IN (Include Delayed): response time-based optimizer • Each iteration generates a complete alternative plan. • Chooses a very long delay duration (relative to response time) to postpone any access to the delayed data. • Chooses the plan with the greatest benefit (potential improvement in response time) whose risk (duration of the optimization step) can be overlapped with the expected delay duration.
IN Continued: • Uses a risk/benefit knob (Rbknob) to prevent the optimizer from choosing high-risk plans for relatively small potential gains over low-risk plans. • Rbknob: ratio of the amount of benefit the optimizer is willing to give up for a given savings in risk. • Increasing Rbknob yields more conservative plans.
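The slides do not give the formula behind Rbknob. One possible reading of "benefit given up per unit of risk saved" is a linear score, benefit minus Rbknob times risk, applied to the plans whose risk fits within the expected delay. The sketch below assumes that reading; the `Plan` type, the scoring rule, and the numbers are all assumptions, not the authors' implementation.

```python
from typing import NamedTuple, Optional, Sequence


class Plan(NamedTuple):
    name: str
    benefit: float  # potential improvement in response time
    risk: float     # duration of the scrambling step


def choose_plan(
    plans: Sequence[Plan], expected_delay: float, rbknob: float
) -> Optional[Plan]:
    """Pick a plan under an assumed linear risk/benefit trade-off.

    With score = benefit - rbknob * risk, a lower-risk plan wins whenever the
    benefit it gives up is less than rbknob times the risk it saves.
    """
    feasible = [p for p in plans if p.risk <= expected_delay]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.benefit - rbknob * p.risk)


safe = Plan("safe", benefit=20.0, risk=5.0)
bold = Plan("bold", benefit=50.0, risk=30.0)

low_knob = choose_plan([safe, bold], expected_delay=100.0, rbknob=0.5)
high_knob = choose_plan([safe, bold], expected_delay=100.0, rbknob=2.0)
print(low_knob.name, high_knob.name)
```

Under this assumed scoring, raising the knob from 0.5 to 2.0 flips the choice from the high-risk plan to the low-risk one, which matches the stated effect that a larger Rbknob gives more conservative plans.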
ED (Estimated Delay): response time-based optimizer • Similar to IN, except that it uses relatively short delay estimates (relative to the response time of the non-delayed plan). • Delay estimates are successively increased when necessary to make more progress. • Motivation: use low-risk plans when delays are short, and high-risk/high-payoff plans for longer delays.
ED Continued: • Execution steps: • Starts by picking an estimated delay value equal to 25% of the original query response time • Repeat iterations until progress is too small • Increase delay value to 50% of response time • Increase to 100% of response time if progress is still too small. • For short delays: scrambling more useful • For long delays: more aggressive.
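The escalation steps above can be sketched as a small driver loop. The callbacks `run_iteration` and `data_arrived`, the `min_progress` threshold, and the behavior after the 100% step are assumptions; the slides only specify the 25%, 50%, and 100% estimates.

```python
from typing import Callable, List

# Delay estimates as fractions of the original query response time (from the slides).
ED_FRACTIONS = (0.25, 0.50, 1.00)


def ed_schedule(
    normal_response_time: float,
    run_iteration: Callable[[float], float],
    data_arrived: Callable[[], bool],
    min_progress: float,
) -> List[float]:
    """Run scrambling iterations, escalating the delay estimate when progress stalls.

    Returns the delay estimate used by each iteration, in order.
    This sketch simply stops after the 100% step; the slides do not say what follows.
    """
    used: List[float] = []
    for fraction in ED_FRACTIONS:
        estimate = fraction * normal_response_time
        while not data_arrived():
            used.append(estimate)
            if run_iteration(estimate) < min_progress:
                break  # progress too small: move to the next, larger estimate
        if data_arrived():
            break
    return used


# Simulated run: four iterations with made-up progress values, then the data arrives.
progress_values = iter([10.0, 1.0, 1.0, 10.0])
calls: List[float] = []


def fake_iteration(estimate: float) -> float:
    calls.append(estimate)
    return next(progress_values)


estimates = ed_schedule(100.0, fake_iteration, lambda: len(calls) >= 4, min_progress=5.0)
print(estimates)
```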
Experimental Setup: • Two-phase randomized query optimizer • Workload based on queries from TPC-D benchmarks • Single query site, six remote data source sites. • Experimental methodology: plots the duration of initial delay of a remote source vs. the response time achieved using scrambling
Lessons learned: • 1. With sufficient memory, all cost-based approaches (Pair, IN, ED) can effectively hide initial delays. • When delayed relations are encountered early in the query execution, a delay as long as the normal response time can be absorbed. • The heuristic-based algorithm performs worse than the original query w/o scrambling, unless there is a substantial amount of extra memory for scrambling.
Lessons learned: • 2. Cost-based scrambling: tradeoff between conservative approaches and aggressive ones. • conservative: safer for short delays • aggressive: bigger savings for long delays • The amount of delay that can be hidden is limited by the normal response time of the query. • As the delay increases beyond the normal response time, the benefit of scrambling as a percentage of total execution time decreases. • So, maybe more conservative plans?
Lessons Learned: • 3. As memory available for scrambling is reduced, scrambling plans are more expensive. • Longer delay duration is needed for scrambling to pay off • Good predictions of delay duration needed to make good scrambling decisions
Lessons Learned: • 4. Aggressiveness of IN and ED policies can be adjusted using Rbknob. • Give up potential gains for long delays in order to reduce risks for short delays • This tradeoff is useful in the absence of accurate predictions of delay duration.
Lessons Learned: • 5. Pair (total work-based optimizer) may perform unnecessary work • lacks a global view of the scrambled plan • unable to pick a slightly sub-optimal plan to generate interesting orders • Therefore, response time should be used as the objective function to generate a complete and reasonable scrambled plan.
Discussion: • How to tune Rbknob? • Query scrambling can be very dangerous, when delay duration is short. How to reduce the risks? • Cross products might be Okay sometimes? • Reality of scenarios studied? • QS treats remote sites as black boxes. Additional processing at data source sites? • Nonblocking join algorithm instead of hash join?
Discussion continued: • Better delay duration estimates? (using probes) • Quality of decisions limited by that of optimizer. • Replicas complicate problems? • Query scrambling decision should take selectivity, size of intermediate results, importance of operators into consideration. • Only addressed the problem of initial delay, ignores bandwidth problem.
Discussion continued: • Checking for arrival of delayed data during a scrambling step?
Related Work: • Approaches that do not adapt to changes in the system parameters during query execution: • Volcano optimizer: • introduces choose-plan operators into a query plan to compensate for the lack of information about system parameters at compile time. • Y. Ioannidis et al.: • generates multiple alternative plans and chooses among them when the query is initialized.
Related Work: • Rdb/VMS • Starts multiple different executions of the same logical operator, chooses the best one, and terminates the others. • MIND heterogeneous database project • divides a query into subqueries, sends each subquery to a site for execution, and composes the results incrementally • resembles Pair. • Reorders left-deep join trees into bushy join trees • Mermaid, InterViso
Conclusion: • Three different approaches (Pair, IN, ED) are investigated to make intelligent query scrambling decisions • In general, use of a response time optimizer has the advantage of being able to construct complete query execution plans that include access to delayed data • Fundamental trade-offs between risk aversion and ability to hide large delays in ED and IN.