Efficiently Ordering Query Plans for Data Integration • AnHai Doan & Alon Halevy • Department of Computer Science & Engineering, University of Washington
Data Integration Challenge • Query: find Olympus cameras on sale and their reviews • Relation (Brand, Cameras), sample tuple (Olympus, C-3000) • Relation (Cameras, Reviews), sample tuple (C-3000, review-article-1) • Sources: TARGET.COM, WAL-MART.COM, BESTBUY.COM, EPINIONS.COM, DPREVIEW.COM, CONSUMERREPORTS.ORG
Architecture of a Data Integration System • Query: find Olympus cameras on sale and their reviews • Query Reformulator: produces logical query plans, e.g. (TARGET.COM, EPINIONS.COM), (TARGET.COM, DPREVIEW.COM), (TARGET.COM, CONSUMERREPORTS.COM), (WAL-MART.COM, EPINIONS.COM), (BESTBUY.COM, CONSUMERREPORTS.COM) • Query Optimizer: produces physical query execution plans • Execution Engine: runs the physical plans • Answers = UNION of outputs of all logical query plans • Must execute multiple plans!
Ordering Query Plans • Time to first answers and the quality of those answers are important! • executing all plans is expensive or infeasible • plans tend to vary significantly in their utility (coverage, execution time, monetary cost, ...) • Solution • find query plans in decreasing order of utility • execute best plans first • abort query execution as soon as a satisfactory answer is found or resource limits have been reached
Our Contributions • Formally defined the plan-ordering problem • does not assume any specific utility measure • models dependencies among plans • Developed three efficient solutions • GREEDY: exploits utility monotonicity • iDRIPS: exploits source similarity • STREAMER: exploits source similarity, plan independence, and utility-diminishing returns • The solutions work with a broad range of utility measures • and find the best plans very fast
Problem Definition • Utility measure • plan coverage: number of new answers returned by a plan • execution time, monetary fee • plan utility depends on plans previously executed! • Plan-ordering problem • modify the query reformulator so that, given a user query and a utility measure, it outputs • best plan p1 • next best plan p2, assuming p1 has been executed • next best plan p3, assuming p1 & p2 have been executed, ... • focus on finding the first few best plans
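The dependence of utility on previously executed plans can be illustrated with coverage. The sketch below is a brute-force reference that scores every plan at every step, so it is not one of the algorithms in this talk; the plan names and answer sets are hypothetical, and knowing each plan's answers in advance is an assumption made only for illustration.

```python
from typing import Dict, List, Set


def order_plans_by_coverage(plan_answers: Dict[str, Set[str]], k: int) -> List[str]:
    """Return up to k plan names; each is the plan with the most NEW answers
    given the plans already chosen (brute force over all remaining plans)."""
    remaining = dict(plan_answers)
    seen: Set[str] = set()
    ordering: List[str] = []
    while remaining and len(ordering) < k:
        # Conditional utility: answers not already returned by earlier plans.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - seen))
        ordering.append(best)
        seen |= remaining.pop(best)
    return ordering


# Hypothetical answer sets for three plans.
plans = {
    "V1V4": {"a", "b", "c"},
    "V1V5": {"a", "b", "d"},
    "V2V4": {"d", "e"},
}
print(order_plans_by_coverage(plans, 3))  # ['V1V4', 'V2V4', 'V1V5']
```

V1V5 returns more answers than V2V4 in isolation, but once V1V4 has been executed it contributes only one new answer, so it drops to third place.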
Current Query Reformulator: the Bucket Algorithm [Levy et al., VLDB-96] • Collect sources into buckets • sources in a bucket can return answers to a certain part of the query • Take cross product of buckets • to form logical query plans • Example: find Olympus cameras on sale and their reviews • Bucket B1: V1 = TARGET, V2 = WAL-MART, V3 = BESTBUY • Bucket B2: V4 = EPINIONS, V5 = DPREVIEW, V6 = CONSUMERREPORT • Plans: V1V4, V1V5, ..., V3V5, V3V6
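The cross-product step can be sketched in a few lines. This is a minimal illustration using the two example buckets from the slide; it only enumerates combinations and leaves out everything else a real reformulator does. The function name `logical_plans` is hypothetical.

```python
from itertools import product

# Buckets from the slide's example.
bucket_b1 = ["V1", "V2", "V3"]  # TARGET, WAL-MART, BESTBUY
bucket_b2 = ["V4", "V5", "V6"]  # EPINIONS, DPREVIEW, CONSUMERREPORT


def logical_plans(buckets):
    """Pick one source from each bucket in every possible way."""
    return ["".join(choice) for choice in product(*buckets)]


plans = logical_plans([bucket_b1, bucket_b2])
print(len(plans))            # 9
print(plans[0], plans[-1])   # V1V4 V3V6
```

The number of plans is the product of the bucket sizes, which is why executing or even scoring every plan quickly becomes expensive as sources are added.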
Our GREEDY Algorithm • Properties • linear run time • broadly applicable • many practical utility measures are monotonic [Yerneni et al., EDBT-98] • Utility monotonicity • holds if replacing a source by a “better” source yields a better plan • e.g., cost(ViVj) = cost(Vi) + cost(Vj) • Finds best plan • by local comparison of sources within each bucket (figure: buckets B1 = {V1, V2, V3} and B2 = {V4, V5, V6}, with plan V1V5 highlighted) • Removes best plan & finds next best plan, ...
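A minimal sketch of the idea for an additive cost, where lower cost means higher utility. The best plan comes from picking the cheapest source in each bucket. How the slide's GREEDY "removes" a plan is not spelled out here, so the sketch assumes one possible approach: splitting the remaining plan space into disjoint subspaces and keeping each subspace's best plan in a heap. The cost values are hypothetical, and the cost is assumed not to change as plans are executed.

```python
import heapq
from itertools import count


def best_in(space, cost):
    """Cheapest source per bucket; optimal when plan cost is additive."""
    return tuple(min(bucket, key=lambda s: (cost[s], s)) for bucket in space)


def plan_cost(plan, cost):
    return sum(cost[s] for s in plan)


def split(space, plan):
    """Partition (space minus plan) into disjoint subspaces.
    Subspace i fixes buckets before i to the plan's sources and
    excludes the plan's source from bucket i."""
    for i, chosen in enumerate(plan):
        rest = [s for s in space[i] if s != chosen]
        if rest:
            yield [[p] for p in plan[:i]] + [rest] + [list(b) for b in space[i + 1:]]


def greedy_order(buckets, cost, k):
    """Return the k cheapest plans, cheapest first, as (plan, cost) pairs."""
    tie = count()  # unique tie-breaker so the heap never compares plans
    first = best_in(buckets, cost)
    heap = [(plan_cost(first, cost), next(tie), first, buckets)]
    out = []
    while heap and len(out) < k:
        c, _, plan, space = heapq.heappop(heap)
        out.append((plan, c))
        for sub in split(space, plan):
            p = best_in(sub, cost)
            heapq.heappush(heap, (plan_cost(p, cost), next(tie), p, sub))
    return out


# Hypothetical per-source costs.
cost = {"V1": 3, "V2": 4, "V3": 8, "V4": 2, "V5": 5, "V6": 9}
buckets = [["V1", "V2", "V3"], ["V4", "V5", "V6"]]
print(greedy_order(buckets, cost, 3))
# [(('V1', 'V4'), 5), (('V2', 'V4'), 6), (('V1', 'V5'), 8)]
```

No plan outside the heap is ever scored, so the first few plans are found without enumerating the full cross product.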
Source Similarity • Two sources are similar • if replacing one by the other changes plan utility very little • Large domains often have many similar sources • similar in monetary fee, access time, coverage, etc. • Key idea • similar sources can be grouped and treated as a single source • Example sources: V1: time = 2, fee = 3; V2: time = 3, fee = 4; V4: time = 1, fee = 2 • Plan V1V4: time = 3, fee = 5, utility(V1V4) = 0.5 • Plan V2V4: time = 4, fee = 6, utility(V2V4) = 0.7 • Abstract source V12: time = [2,3], fee = [3,4] • Abstract plan V12V4: time = [3,4], fee = [5,6], utility(V12V4) = [0.4,0.7]
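The interval bookkeeping in this example can be reproduced with a small interval type. The sketch assumes that time and fee add up across the sources of a plan, which matches the numbers on the slide; the utility function itself is not given, so it is left out. The names `Interval`, `point` and `group` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Sum of two ranges: add the lower bounds and the upper bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)


def point(x):
    """A concrete source has a single value, i.e. a zero-width interval."""
    return Interval(x, x)


def group(*intervals):
    """An abstract source covers the full range of its members."""
    return Interval(min(i.lo for i in intervals), max(i.hi for i in intervals))


time = {"V1": point(2), "V2": point(3), "V4": point(1)}
fee = {"V1": point(3), "V2": point(4), "V4": point(2)}

# Abstract source V12 groups V1 and V2.
time["V12"] = group(time["V1"], time["V2"])
fee["V12"] = group(fee["V1"], fee["V2"])

# Abstract plan V12V4.
print(time["V12"] + time["V4"])  # Interval(lo=3, hi=4)
print(fee["V12"] + fee["V4"])    # Interval(lo=5, hi=6)
```

Every concrete plan represented by V12V4 falls inside these ranges, so a utility bound computed from them holds for all of those plans at once.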
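Grouping Sources to Find Best Plan: DRIPS Algorithm [Haddawy et al., UAI-95] • Source grouping (figure): in bucket B1, V1 and V2 form V12, and V12 and V3 form V123; in bucket B2, V5 and V6 form V56, and V4 and V56 form V456 • Branch & bound search (figure): abstract plan V123 V456 is refined into V12 V456 and V3 V456, then into V1 V456, V2 V456, V3 V4 and V3 V56; utility values shown: [0.5, 0.8], [0.1, 0.7], 0.8, [0.6, 0.7], [0.4, 0.6], [0.1, 0.3] • Dominance graph (figure): over plans V3 V4, V3 V56, V1 V456, V2 V456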
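The refine-and-prune loop can be sketched as follows. This is a simplified illustration in the spirit of DRIPS, not the published algorithm: it assumes a hypothetical additive score per source (so bounds are easy to compute), represents groups as nested tuples, and always refines the abstract plan with the highest upper bound. The scores are invented for the example.

```python
def leaves(node):
    """Concrete sources under a group; a string is a concrete source."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def bounds(plan, score):
    """Utility interval of a (possibly abstract) plan under additive scores."""
    lo = sum(min(score[s] for s in leaves(n)) for n in plan)
    hi = sum(max(score[s] for s in leaves(n)) for n in plan)
    return lo, hi


def drips_best(hierarchies, score):
    """Refine abstract plans and prune dominated ones until one plan is left."""
    frontier = [tuple(hierarchies)]
    while True:
        # A plan is dominated if its upper bound is below another's lower bound.
        best_lo = max(bounds(p, score)[0] for p in frontier)
        frontier = [p for p in frontier if bounds(p, score)[1] >= best_lo]
        abstract = [p for p in frontier if any(not isinstance(n, str) for n in p)]
        if not abstract:
            return max(frontier, key=lambda p: bounds(p, score)[0])
        # Refine the most promising abstract plan at its first abstract source.
        target = max(abstract, key=lambda p: bounds(p, score)[1])
        i = next(j for j, n in enumerate(target) if not isinstance(n, str))
        frontier.remove(target)
        frontier += [target[:i] + (child,) + target[i + 1:] for child in target[i]]


# Grouping from the slide: V123 = {V12, V3}, V456 = {V4, V56}.
b1 = (("V1", "V2"), "V3")
b2 = ("V4", ("V5", "V6"))
# Hypothetical integer scores (higher is better).
score = {"V1": 20, "V2": 30, "V3": 50, "V4": 30, "V5": 10, "V6": 20}
print(drips_best([b1, b2], score))  # ('V3', 'V4')
```

In this run only three refinements are needed: V12 V456 and V3 V56 are discarded as groups, so the concrete plans they contain are never scored individually.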
Extending DRIPS: iDRIPS & STREAMER • iDRIPS (iterative DRIPS) • applies DRIPS to find best plan • removes best plan, re-groups sources • applies DRIPS to find second best plan, ... • Observation • iDRIPS may re-establish dominance relations many times • Challenge: recycle dominance relations • Solution: STREAMER • applicable when utility-diminishing returns holds • exploits plan independence (figure: dominance relations among plans V3 V4, V1 V456, V3 V56, V2 V456)
The STREAMER Algorithm • First iteration (figure): plans V3 V4, V1 V456, V3 V56, V2 V456 • Second iteration (figure): plans V1 V456, V3 V56, V2 V456, with concrete plans V1 V4, V1 V5, V1 V6 and V2 V4, V2 V5, V2 V6 • A dominance relation is still true if utility-diminishing returns holds and V3V4 is independent of V1 V456
Summary & Experiments • Empirical evaluation of iDRIPS and STREAMER • seven non-monotonic utility classes • source grouping worked for five of the classes • both algorithms found first 100 plans very fast • STREAMER outperformed iDRIPS (when it is applicable)
Algorithm | Applicable when | Evaluation
GREEDY | utility monotonicity | O(nm2k2)
iDRIPS | source similarity | empirical
STREAMER | source similarity, utility-diminishing returns, plan independence | empirical
Related Work • Query reformulation algorithms • BUCKET [Levy et al., VLDB-96], INVERSE-RULE [Duschka & Genesereth, PODS-97], MINICON [Pottinger & Levy, VLDB-00] • our solutions generalize to all of these • Ordering query plans • [Levy et al., AAAI-96], [Florescu et al., VLDB-97], [Naumann et al., VLDB-99], [Leser & Naumann, FQAS-00], ... • only considered in restricted settings • Query optimization • many works at all levels • most works optimize cost to get all answers
Conclusions • Ordering query plans is important & difficult • Contributions • formally defined problem • identified interesting problem properties • utility monotonicity • source similarity • plan independence • utility-diminishing returns • developed 3 solutions: GREEDY, iDRIPS, STREAMER • solutions can handle a broad range of utility measures • showed that solutions find best plans very fast