
Query Reformulation & Processing in Data Integration



  1. Query Reformulation & Processing in Data Integration

  2. Recall: Data Integration Challenge • Query: find Olympus cameras on sale and their reviews • Example tuples: (Brand, Cameras) contains (Olympus, C-3000); (Cameras, Reviews) contains (C-3000, review-article-1) • Sources: TARGET.COM, WAL-MART.COM, BESTBUY.COM, EPINIONS.COM, DPREVIEW.COM, CONSUMERREPORTS.ORG

  3. Architecture of a Data Integration System • Query: find Olympus cameras on sale and their reviews • Query Reformulator produces logical query plans, e.g.: TARGET.COM + EPINIONS.COM, TARGET.COM + DPREVIEW.COM, TARGET.COM + CONSUMERREPORTS.COM, WAL-MART.COM + EPINIONS.COM, BESTBUY.COM + CONSUMERREPORTS.COM • Query Optimizer produces physical query execution plans • Execution Engine: must execute multiple plans! • Answers = UNION of outputs of all logical query plans

  4. Query Reformulation • Generate good logical query plans quickly • Recent works • bucket algorithm [Levy et al.] • inverse-rules algorithm [Duschka et al.] • MiniCon algorithm [Pottinger et al.] • iDRIPS/STREAMER [Doan et al.] • Will illustrate with the bucket algorithm and STREAMER • Strongly encouraged to read the others

  5. The Bucket Algorithm [Levy et al., VLDB-96] • Collect sources into buckets • sources in a bucket can answer a certain part of the query • Take the cross product of the buckets to form logical query plans • Example: find Olympus cameras on sale and their reviews • Bucket B1: V1 = TARGET, V2 = WAL-MART, V3 = BESTBUY • Bucket B2: V4 = EPINIONS, V5 = DPREVIEW, V6 = CONSUMERREPORTS • Candidate plans: V1V4, V1V5, ..., V3V5, V3V6
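The bucket cross product can be sketched in a few lines of Python (a toy illustration; the source names come from the running example, not real endpoints):

```python
from itertools import product

def bucket_plans(buckets):
    """Cross product of buckets: picking one source per bucket
    yields one candidate logical query plan."""
    return [tuple(combo) for combo in product(*buckets)]

# Buckets from the slide: B1 answers "cameras on sale", B2 answers "reviews"
b1 = ["TARGET", "WAL-MART", "BESTBUY"]
b2 = ["EPINIONS", "DPREVIEW", "CONSUMERREPORTS"]
plans = bucket_plans([b1, b2])   # 3 sources per bucket -> 9 candidate plans
```

With three sources per bucket this yields all nine V_iV_j combinations, matching the V1V4 ... V3V6 enumeration on the slide.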

  6. Streamer Reformulation Algorithm • Goal: order logical query plans • Time to & quality of first answers is important! • executing all plans is expensive or infeasible • plans tend to vary significantly in their utility • coverage, execution time, monetary cost, ... • Solution • find query plans in decreasing order of utility • execute best plans first • abort query execution as soon as • satisfactory answer is found, or • resource limits have been reached

  7. Contributions of Streamer Work • Formally defined plan-ordering problem • does not assume any specific utility measure • models dependencies among plans • Developed three efficient solutions • GREEDY: exploits utility monotonicity • iDRIPS: exploits source similarity • STREAMER: exploits source similarity, plan independence, utility-diminishing returns • work with a broad range of utility measures • find the best plans very fast

  8. Problem Definition • Utility measure • plan coverage: number of new answers returned by a plan • execution time, monetary fee • plan utility depends on the plans previously executed! • Plan-ordering problem • modify the query reformulator so that, given a user query and a utility measure, it outputs • best plan p1 • next best plan p2, assuming p1 has been executed • next best plan p3, assuming p1 & p2 have been executed, ... • focus on finding the first few best plans • Similar to top-k, but with correlation between plans

  9. Recall: Bucket Algorithm [Levy et al., VLDB-96] • Collect sources into buckets • sources in a bucket can answer a certain part of the query • Take the cross product of the buckets to form logical query plans • Example: find Olympus cameras on sale and their reviews • Bucket B1: V1 = TARGET, V2 = WAL-MART, V3 = BESTBUY • Bucket B2: V4 = EPINIONS, V5 = DPREVIEW, V6 = CONSUMERREPORTS • Candidate plans: V1V4, V1V5, ..., V3V5, V3V6

  10. GREEDY Algorithm • Properties • linear run time • broadly applicable: many practical utility measures are monotonic [Yerneni et al., EDBT-98] • Utility monotonicity: replacing a source by a "better" source yields a better plan • e.g., cost(ViVj) = cost(Vi) + cost(Vj) • Finds the best plan by local comparison of sources within each bucket • Removes the best plan & finds the next best plan, ...
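For a decomposable monotone utility like cost(ViVj) = cost(Vi) + cost(Vj), the best plan really is just the locally cheapest source per bucket, and next-best plans can be generated by single-source replacements. A minimal sketch (the cost numbers are hypothetical, and this is an illustration of the idea, not the paper's implementation):

```python
import heapq

def greedy_plans(buckets, cost, k):
    """Enumerate plans in increasing total cost under the monotone
    utility cost(plan) = sum of per-source costs.  The best plan is
    found purely by local comparison (cheapest source per bucket);
    successors are produced by bumping one source to the next-cheapest."""
    ranked = [sorted(b, key=lambda s: cost[s]) for b in buckets]
    total = lambda idx: sum(cost[ranked[i][j]] for i, j in enumerate(idx))
    best = tuple(0 for _ in ranked)          # index 0 = cheapest in each bucket
    heap, seen, out = [(total(best), best)], {best}, []
    while heap and len(out) < k:
        _, idx = heapq.heappop(heap)
        out.append(tuple(ranked[i][j] for i, j in enumerate(idx)))
        for i, j in enumerate(idx):          # next-best candidates: bump one source
            if j + 1 < len(ranked[i]):
                nxt = idx[:i] + (j + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (total(nxt), nxt))
    return out

costs = {"V1": 2, "V2": 3, "V3": 5, "V4": 1, "V5": 4, "V6": 6}
plans = greedy_plans([["V1", "V2", "V3"], ["V4", "V5", "V6"]], costs, 3)
```

Here the best plan is V1V4 (cost 3), then V2V4 (cost 4), then V1V5 (cost 6).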

  11. Source Similarity • Two sources are similar if replacing one by the other changes plan utility very little • Large domains often have many similar sources • similar in monetary fee, access time, coverage, etc. • Key idea: similar sources can be grouped and treated as a single abstract source • Concrete sources: V1 (time = 2, fee = 3), V2 (time = 3, fee = 4), V4 (time = 1, fee = 2) • Concrete plans: V1V4 (time = 3, fee = 5, utility = 0.5); V2V4 (time = 4, fee = 6, utility = 0.7) • Abstract source: V12 (time = [2,3], fee = [3,4]) • Abstract plan: V12V4 (time = [3,4], fee = [5,6], utility = [0.4,0.7])

  12. Grouping Sources to Find Best Plan: DRIPS Algorithm [Haddawy et al., UAI-95] • Source grouping: B1 = {V1, V2, V3} grouped into V12, V123; B2 = {V4, V5, V6} grouped into V56, V456 • Branch & bound search: start from the fully abstract plan V123V456 and refine it step by step (V12V456, V3V456, V1V456, V3V56, ...) • Dominance graph: compare utility intervals of abstract plans (e.g., V3V4 with [0.5, 0.8] vs. an abstract plan with [0.1, 0.7]) and prune any plan whose entire interval lies below another plan's lower bound
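Under the stated assumptions, the branch & bound over grouped sources can be sketched as follows: each group of concrete plans carries a utility interval [min, max], a group whose best case falls below another group's worst case is pruned, and surviving groups are bisected until a single concrete plan remains. The utility numbers are hypothetical, and the bisection refinement is our own simplification:

```python
def interval(group, utility):
    """Utility interval of an abstract plan = [min, max] over its group."""
    us = [utility[p] for p in group]
    return (min(us), max(us))

def drips_best(groups, utility):
    """DRIPS-style sketch: prune dominated groups, refine the rest."""
    while True:
        ivs = {g: interval(g, utility) for g in groups}
        # Prune g if some other group's worst case beats g's best case.
        groups = [g for g in groups
                  if not any(ivs[h][0] > ivs[g][1] for h in groups if h is not g)]
        if len(groups) == 1 and len(groups[0]) == 1:
            return next(iter(groups[0]))
        big = max(groups, key=len)           # refine the largest surviving group
        if len(big) == 1:                    # all singletons but incomparable
            return max((p for g in groups for p in g), key=lambda p: utility[p])
        groups.remove(big)
        items = sorted(big)
        half = len(items) // 2
        groups += [frozenset(items[:half]), frozenset(items[half:])]

util = {"V1V4": 0.5, "V2V4": 0.7, "V3V4": 0.8,
        "V1V5": 0.4, "V2V5": 0.3, "V3V5": 0.6}
best = drips_best([frozenset(util)], util)   # refines down to the best concrete plan
```

Starting from a single group of all six plans, the search ends at V3V4, the plan with the highest utility, without ever comparing every pair of concrete plans.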

  13. Extending DRIPS: iDRIPS & STREAMER • iDRIPS (iterative DRIPS) • applies DRIPS to find the best plan • removes the best plan, re-groups sources • applies DRIPS to find the second-best plan, ... • Observation: iDRIPS may re-establish dominance relations many times • Challenge: recycle dominance relations • Solution: STREAMER • applicable when utility-diminishing returns holds • exploits plan independence

  14. The STREAMER Algorithm • First iteration: establish dominance relations among abstract plans (e.g., V3V4 dominates V1V456, V3V56, V2V456) • Second iteration: reuse those dominance relations instead of re-deriving them • still true if utility-diminishing returns holds and V3V4 is independent of V1V456

  15. Adaptive Query Processing

  16. Motivations for Adaptive Query Processing • Many domains where cost-based query optimization fails: • complex queries in traditional databases: estimation error grows exponentially with # joins [IC91] • querying over the Internet: unpredictable access rates, delays • querying external data sources: limited information available about properties of the source • Idea: monitor real-world conditions, adapt the processing strategy in response

  17. Can We Get RDBMS-Level Optimization for Data Integration, without Statistics? • Multiple remote, autonomous data sources • described and mapped "loosely" • data changes frequently • Queries go through a data integration system built on a mediated schema, a source catalog, and schema mappings • Generally, would like to support the same kinds of queries as in a local setting

  18. What Are the Sources of Inefficiency? • Delays – we “stall” in waiting for I/O • We’ll talk about this on Monday • Bad estimation of intermediate result sizes • The focus of the Kabra and DeWitt paper • No info about source cardinalities • The focus of the eddies paper – and Monday’s paper • The latter two are closely related • Major challenges: • Trading off information acquisition (exploration) vs. use (exploitation) • Extrapolating performance based on what you’ve seen so far

  19. Kabra and DeWitt • Goal: “minimal update” to a traditional optimizer in order to compensate for bad decisions • General approach: • Break the query plan into stages • Instrument it • Allow for re-invocation of optimizer if it’s going awry

  20. Elements of Mid-Query Re-Optimization • Annotated Query Execution Plans • Annotate plan with estimates of size • Runtime Collection of Statistics • Statistics collectors embedded in execution tree • Keep overhead down • Dynamic Resource Re-allocation • Reallocate memory to individual operations • Query Plan Modification • May wish to re-optimize the remainder of query

  21. Annotated Query Plans • We save at each point in the tree the expected: • Sizes and cardinalities • Selectivities of predicates • Estimates of number of groups to be aggregated

  22. Statistics Collectors • Add into tree • Must be collectable in a single pass • Will only help with portions of query “beyond” the current pipeline

  23. Resource Re-Allocation • Based on improved estimates, we can modify the memory allocated to each operation • Results: less I/O, better performance • Only for operations that have not yet begun executing, i.e., not in the pipeline

  24. Plan Modification • Only re-optimize part not begun • Suspend query, save intermediate in temp file • Create new plan for remainder, treating temp as an input

  25. Re-Optimization • When to re-optimize: • calculate the time the current plan should take (using gathered stats) • only consider re-optimization if: • our original estimate was off by at least some factor θ2, and • Topt,estimated < θ1 · Tcur-plan,improved, where θ1 ≈ 5% and the cost of optimization depends on the number of operators, esp. joins • Only modify the plan if the new estimate, including the cost of writing the temp file, is better
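The trigger condition above can be written as a small predicate. This is a sketch: the parameter names (theta1, theta2) and their default values are assumptions for illustration, not constants from the paper:

```python
def should_reoptimize(est_orig, est_improved, t_reopt_estimate,
                      theta2=2.0, theta1=0.05):
    """Consider re-optimization only if (a) the original cost estimate was
    off by at least a factor theta2, and (b) the estimated cost of running
    the optimizer is under theta1 of the improved remaining-cost estimate."""
    off_by = max(est_orig, est_improved) / min(est_orig, est_improved)
    return off_by >= theta2 and t_reopt_estimate < theta1 * est_improved

# Estimate was off 4x and re-optimizing costs well under 5% of the remainder:
trigger = should_reoptimize(est_orig=100, est_improved=400, t_reopt_estimate=10)
```

A separate check (not shown) would still compare the new plan's cost, including writing the temp file, before actually switching.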

  26. Low-Overhead Statistics • Want to find the "most effective" statistics • don't want to gather statistics for "simple" queries • want to limit the algorithm's effect to a given maximum overhead ratio • Factors: • probability of inaccuracy • fraction of the query affected • But how do we know these without having stats?

  27. Inaccuracy Potentials The following heuristics are used: • Inaccuracy potential = low, medium, high • Lower if we have more information on table value distribution • 1+max of inputs for multiple-input selection • Always high for user-defined methods • Always high for non-equijoins • For most other operators, same as worst of inputs
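These heuristics translate almost directly into code. The sketch below encodes the three levels as integers; the operator names and this particular encoding are our own assumptions, not the paper's:

```python
LOW, MEDIUM, HIGH = 0, 1, 2   # ordered inaccuracy-potential levels

def inaccuracy_potential(op, input_potentials):
    """Heuristic inaccuracy potential of an operator's output estimate."""
    if op in ("user_defined_method", "non_equijoin"):
        return HIGH                                   # always high per the heuristics
    if op == "multi_input_selection":
        return min(HIGH, 1 + max(input_potentials))   # 1 + max of inputs, capped
    # Most other operators: same as the worst of their inputs.
    return max(input_potentials, default=LOW)
```

A planner would then instrument the operators with the highest potentials first, breaking ties by the fraction of the plan they affect (next slide).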

  28. More Heuristics • Check fraction of query affected • Check how many other operators use the same statistic • The winner: • Higher inaccuracy potentials first • Then, if a tie, the one affecting the larger portion of the plan

  29. Implementation • On top of Paradise (parallel database that supports ADTs, built on OO framework) • Using System-R optimizer • New SCIA (Stat Collector Insertion Algorithm) and Dynamic Re-Optimization modules

  30. It Works! • Results are 5% worse for simple queries, much better for complex queries • Of course, we would not really collect statistics on simple queries • Data skew made a slight difference - both normal and re-optimized queries performed slightly better

  31. Pros and Cons • Provides significant potential for improvement without adding much overhead • Biased towards exploitation, with very limited information-gathering • A great way to retrofit an existing system • In SIGMOD04, IBM had a paper that did this in DB2 • But fairly limited to traditional DB context • Relies on us knowing the (rough) cardinalities of the sources • Query plans aren’t pipelined, meaning: • If the pipeline is broken too infrequently, MQRO may not help • If the pipeline is broken too frequently, time-to-first-answer is slow

  32. The Opposite Extreme: Eddies • The basic idea: • Query processing consists of sending tuples through a series of operators • Why not treat it like a routing problem? • Rely on “back-pressure” (i.e., queue overflow) to tell us where to send tuples • Part of the ongoing Telegraph project at Berkeley • Large-scale federated, shared-nothing data stream engine • Variations in data transfer rates • Little knowledge of data sources

  33. Telegraph Architecture • Simple "pre-optimizer" to generate the initial plan • creates operators, e.g.: • select: predicate(sourceA) • join: predicate(sourceA, sourceB) • (no support for aggregation, union, etc.) • chooses implementations, e.g., select using an index, join using a hash join • Goal: dataflow-driven scheduling of operations • a tuple comes into the system • adaptively routed through operators in "eddies" • may be combined, discarded, etc.

  34. Can’t Always Re-order Arbitrarily • Need “moment of symmetry” • Some operators have scheduling dependency • e.g. nested loops join: for each tuple in left table for each tuple in right table If tuples meet predicate, output result • Index joins, pipelined hash joins always symmetric • Sometimes have order restrictions e.g. request tuple from one source, ship to another

  35. The Eddy • Represents a set of possible orderings of a subplan • each tuple may "flow" through a different ordering (which may be constrained) • N-ary module consisting of query operators • basic unit of adaptivity • subplan with select, project, join operators
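A toy eddy over selections makes the routing idea concrete. This sketch routes each tuple to the remaining operator with the lowest observed pass rate (most selective first), a crude stand-in for the lottery scheduling / back-pressure used in the real system; the operators and data are hypothetical:

```python
def eddy(tuples, operators):
    """Route each tuple through all operators, one at a time, choosing
    the next operator adaptively from observed selectivities.
    `operators` maps a name to a predicate over a tuple."""
    seen = {name: [1, 1] for name in operators}   # [passed, routed], smoothed
    out = []
    for t in tuples:
        todo, alive = set(operators), True
        while todo and alive:
            # Favor the operator that has been rejecting the most tuples.
            name = min(todo, key=lambda n: seen[n][0] / seen[n][1])
            seen[name][1] += 1
            if operators[name](t):
                seen[name][0] += 1
                todo.remove(name)                 # passed; route onward
            else:
                alive = False                     # rejected; discard tuple
        if alive:
            out.append(t)
    return out

ops = {"x_pos": lambda t: t["x"] > 0, "y_small": lambda t: t["y"] < 10}
data = [{"x": 1, "y": 5}, {"x": -1, "y": 5}, {"x": 2, "y": 20}]
result = eddy(data, ops)
```

Whatever order the router picks, the output is the set of tuples passing all predicates; only the work done per tuple changes, which is exactly the eddy's degree of freedom.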

  36. Eddy Pros and Cons • Mechanism for adaptively re-routing queries • makes the optimizer's task simpler • can do nearly as well as a well-optimized plan in some cases • handles variable costs, variable selectivities • But doesn't really handle joins very well – attempts to address this in follow-up work: • STeMs – break a join into separate data structures; requires re-computation at each step • STAIRs – create intermediate state and shuffle it back and forth

  37. Generalizing Adaptive Query Processing • We’ve seen a range of different adaptive techniques • How do they fit together? • Can we choose points between eddies and mid-query re-optimization?

  38. Types of Adaptive Query Processing • Adaptive scheduling (query scrambling [UF98], dynamic pipeline scheduling [UF01], XJoin [UF00], pipelined hash join [RS86][I+99], ripple join [HH99]) • changes CPU scheduling to improve feedback or reduce delays • Redundant computation (competitive execution [AZ96]) • compares two+ ways of executing the query • Plan partitioning ([S+76][KD98][I+99][M+04]) • breaks the plan into stages; re-optimizes future stages as necessary • Adaptive info passing ([IT05 sub.]) • passes data between parts of an executing plan • Adaptive data partitioning • breaks the data into subsets; uses a different plan for each subset • the only way to reduce overall computation with fine granularity • first (only) implementation has been eddies [AH00][R+03][DH04]

  39. Eddies Combine Adaptive Scheduling and Data Partitioning Decisions • Intuitively, each tuple gets its own query plan • routed to the next operator based on the speed and selectivity of each operator • elegant and simple to implement • But performing a join creates subresults at the next level! • local & greedy choices may result in state that needs to join with all future data! • Consider the long-term effects of decisions before making them: separate CPU scheduling from plan selection

  40. Focusing Purely on Adaptive Data Partitioning • Use adaptively scheduled operators to "fill CPU cycles" • Now a query optimizer problem: choose a plan that minimizes long-term cost (in CPU cycles) • To allow multiple plans, distribute union through join (and select, project, etc.): if R1 = R11 ∪ R12 and R2 = R21 ∪ R22, then R1 ⋈ R2 = (R11 ∪ R12) ⋈ (R21 ∪ R22) = (R11 ⋈ R21) ∪ (R12 ⋈ R22) ∪ (R11 ⋈ R22) ∪ (R12 ⋈ R21) • This generalizes to n joins and other SPJ + GU operators ...
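The union-through-join identity is easy to check on toy relations (lists of dicts joined on a shared key attribute `k`; the data is made up for illustration):

```python
def nat_join(r, s):
    """Toy natural join of two relations on the shared attribute 'k'."""
    return [{**a, **b} for a in r for b in s if a["k"] == b["k"]]

R1 = [{"k": 1, "r": "a"}, {"k": 2, "r": "b"}]
R2 = [{"k": 1, "s": "x"}, {"k": 2, "s": "y"}]
R11, R12 = R1[:1], R1[1:]        # partition R1 = R11 ∪ R12
R21, R22 = R2[:1], R2[1:]        # partition R2 = R21 ∪ R22

whole = nat_join(R1, R2)
# Union of the four pairwise joins over the partitions:
pieces = (nat_join(R11, R21) + nat_join(R12, R22) +
          nat_join(R11, R22) + nat_join(R12, R21))
```

`whole` and `pieces` contain the same tuples, which is what lets each partition run under a different plan while a "stitch-up" phase covers the cross terms.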

  41. Adaptive Data Partitioning: Routing Data across Different Plans • Example: R ⋈ S ⋈ T, where early data (R0, S0, T0) flows through one plan and later data (R1, S1, T1) through another; cross-partition results (e.g., R0S0 ⋈ (T0 ∪ T1), excluding the R0S0T0 and R1S1T1 results already produced) must still be computed • Options for combining across phases: • new results always injected into the old plan • old results into the new plan • wait until the end: a "stitch-up" plan based on the best stats

  42. Special Architectural Features for ADP • Monitoring and re-optimization thread runs alongside execution: • System-R-like optimizer with aggregation support; uses the most current selectivity estimates • periodic monitoring and re-optimization revises selectivity estimates, recomputes expected costs • Query execution with "smart router" operators • Special support for efficient stitch-up plans: • uses intermediate results from previous plans (a special case of answering queries using views [H01]) • join-over-union ("stitch-up-join") operator that excludes certain results

  43. ADP Application 1: Correcting Cost Mis-estimates • Goal: react to plans that are obviously bad • don't spend cycles searching for a slightly better plan • avoid paths that are unlikely to be promising • Monitor/reoptimizer thread watches cardinalities of subresults • re-estimates plan cost, compares to projected costs of alternatives, using several techniques & heuristics (see paper) • our experiments: re-estimate every 1 sec. • "Smart router" operator does the following: • waits for the monitor/reoptimizer to suggest a replacement plan • re-routes source data into the new plan • the new plan's output is unioned with the output of the previous plan; this is fed into any final aggregation operations

  44. Correcting for Unexpected Selectivities (experiments on a Pentium IV 3.06 GHz, Windows XP)

  45. ADP Application 2: Optimizing for Order • Most general ADP approach: • pre-generate plans for the general case and for each "interesting order" • "smart router" sends a tuple to the plan whose ordering constraint that tuple follows • but with multiple joins, MANY plans • Instead: do ADP at the operator level • "complementary join pair" • does its own stitch-up internally • easier to optimize for! • Can also do "partial sorting" at the router (priority queue) • (figure: routers over R and S feeding a merge join and hash joins h(R), h(S))

  46. Exploiting Partial Order in the Data (experiments on a Pentium IV 3.06 GHz, Windows XP; 1024-tuple window)

  47. ADP Over "Windows": Optimizing for Aggregation • Group-by optimization [CS94]: • may be able to "pre-aggregate" some tuples before joining: compute SUM(T.y) GROUP BY T.x, T.joinAttrib below the join with R, then SUM(T.y sums) GROUP BY T.x above it • why it works: aggregates can be applied over union • but once we insert pre-aggregation, we're stuck (and it's not pipelined) • Our solution: "adjustable window pre-aggregation" • change the window size depending on how effectively we can aggregate • also allows data to propagate through the plan: better info for adaptivity, early answers
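The [CS94] pre-aggregation transform can be demonstrated on toy data: partial sums grouped by (T.x, T.joinAttrib) below the join, summed again by T.x above it. The relations and attribute names (`x`, `j`, `y`) are hypothetical stand-ins for the slide's T.x, T.joinAttrib, T.y:

```python
from collections import defaultdict

def pre_aggregate(t_rows):
    """Below the join: SUM(y) GROUP BY (x, j), one partial sum per group."""
    partial = defaultdict(int)
    for row in t_rows:
        partial[(row["x"], row["j"])] += row["y"]
    return [{"x": x, "j": j, "y_sum": v} for (x, j), v in partial.items()]

def final_sums(joined):
    """Above the join: SUM of the partial sums, GROUP BY x."""
    out = defaultdict(int)
    for row in joined:
        out[row["x"]] += row["y_sum"]
    return dict(out)

T = [{"x": "a", "j": 1, "y": 2}, {"x": "a", "j": 1, "y": 3},
     {"x": "b", "j": 2, "y": 5}]
R = [{"j": 1}, {"j": 1}, {"j": 2}]   # j = 1 joins twice, multiplying that sum

joined = [t for t in pre_aggregate(T) for r in R if t["j"] == r["j"]]
result = final_sums(joined)          # same answer as aggregating after a full join
```

Grouping by the join attribute below the join is what keeps the transform correct: duplicate join partners multiply the partial sums exactly as they would multiply the raw tuples.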

  48. Pre-Aggregation Comparison

  49. Adaptive QP in Summary A variety of different techniques, focusing on: • Scheduling • Comparison & competition • Information passing • Data + plan partitioning A field that is still fairly open – missing: • Effective exploration methods • A true theory! • What’s possible? What kinds of queries make sense to adapt? • Guarantees of optimality and convergence (perhaps under certain assumptions)
