250 likes | 406 Views
Efficient Evaluation of Queries in a Mediator for WebSources. Louiqa Raschid University of Maryland Joint work with Zadorozhny, Vidal, Urhan, Bright. Wide-Area Applications (WAA ). Problem: Scalability of Wrapper/Mediator architectures to WebSources that are accessible via WANs
E N D
Efficient Evaluation of Queries in a Mediator for WebSources Louiqa Raschid University of Maryland Joint work with Zadorozhny, Vidal, Urhan, Bright
Wide-Area Applications (WAA) Problem: Scalability of Wrapper/Mediator architectures to WebSources that are accessible via WANs • Multiple sources and complex computational capabilities (WSI) • Complex queries on multiple sources, e.g., drug lead discovery using biomolecular sources • Dissimilar access costs (end-to-end latencies) and other metrics for each WSI • Dynamic Wide area environment is noisy (introduces unpredictable delays) L. Raschid — University of Maryland
Mediator Web Query Optimizer Catalog Capability-Based Pre-optimizer Extended Randomized Optimizer Web Query Broker ExecutionEngine Wrapper Wrapper WAN Web Source Web Source Detailed Architecture Query WCM WebPT WCM WebPT
Relevant Technologies • Wrapper generation Toolkit: CoopIS 1999, VLDB Journal special issue 2000 • Wrapper Mediator Architecture (Predator ORDBMS platform) ICDE 2000 demo • CBR Tool, Wrapper Cost model (WCM) CoopIS 1999 • Web prediction tool (WebPT) VLDB Journal 2000 special issue, CoopIS 2001 • Web Query Optimizer (WQO) Sigmod 2002, ICDCS 2002 L. Raschid — University of Maryland
Outline of the talk • Motivation and architecture • Example • Two Phase Optimizer (WQO) and Pre-plans • Heuristics used by WQO to choose Pre-plans • Evaluation • Related Work • Future Work L. Raschid — University of Maryland
ACM DL Web Source Schema Paper(FirstAuthor, Title, PaperID, Keywords, PaperPDF) Coauthor(PaperId, CName) Editor(PaperId, EName) Reviewer(PaperId, RName) Limited capabilities - WSIs ior1 Paper:FirstAuthor —› Title, PaperID, Keywords, PaperPDF ior3 Paper: FirstAuthor —› Title, PaperID, Keywords ior4 Paper:PaperID —› PaperPDF ior2 CoAuthor: PaperID —› CName ior5 Editor: PaperID —› EName ior6 Reviewer: {} —› PaperId, RName Dependencies: (ior1—›ior2), (ior3—›ior4),(ior3&ior4—›ior2) ... Composed WSI ior3 & ior4 Atomic WSI ior1 L. Raschid — University of Maryland
2 Phase Optimization for WebSources Objective: Generate safe good plans in a large search space (due to multiple alternate WSIs) • Pre-plan • 2 Phase optimization • Cost based heuristics • Cost based optimization (randomized optimizer) L. Raschid — University of Maryland
CBR Tool and Pre-plan • Pre-plan – partition subgoals on mediator relations • (partial) Ordering between subgoals (WSI dependencies) • Select WSI for each subgoal SELECTP.Title,P.PaperPDF,CoAuthor.CName FROMPaper P, CoAuthor, Editor, Reviewer WHEREP.1stAuthor=“Franklin”&P.PaperId=CoAuthor.PaperId& P.PaperId=Reviewer.PaperId&P.PaperId=Editor.PaperId {{Paper (ior1), Reviewer(ior6)} {CoAuthor(ior2), Editor(ior5)} } Paper(ior1)—› CoAuthor(ior2), Paper(ior1) —› Editor(ior5) {{Paper (ior3;ior4), Reviewer(ior6)} {CoAuthor(ior2), Editor(ior5)} } Paper(ior3;ior4)—› CoAuthor(ior2), Paper(ior3;ior4) —› Editor(ior5) L. Raschid — University of Maryland
2 Phase Optimization for WebSources • WQO uses CBR tool to select WSIs and generate pre-plans • WQO uses cost-based heuristics to select one or more good pre-plans • WQO uses randomized relational optimizer and cost model and chooses safe good plans Objective: Choose a good pre-plan that will lead to a good plan L. Raschid — University of Maryland
Traditional optimizer behavior Query with 5-way join, 3 WSIs per relationDetails of relative costs of WSIs in paper L. Raschid — University of Maryland
WQO behavior using heuristics to explore a few a good pre-plan(s) L. Raschid — University of Maryland
WQO functionality • Metrics of Wrapper Cost Model (WebPT) • Ignore local processing costs (assume optimizer will choose best plan possible for pre-plan) • Choose WSIs and pre-plans to minimize remote costs • Why use heuristics to choose pre-plan? • Impact of cost / delay on heuristics versus impact on a cost model • Impact of noise on heuristics versus cost model • Limitations of heuristics L. Raschid — University of Maryland
Metrics in Wrapper Cost Model • WebSource and Network Costs • Remote Cost at Web Source - TTF • Downloading data from Web Source (extraction cost) • Total cost - TTL • Wrapper Statistics • Number of Pages Accessed • Cardinality of Result • Statistics dependent on value of query binding • WebPT - a tool for learning using query feedback and predicting access cost based on parameters such as Day, Time, Qty, Cardinality, etc. L. Raschid — University of Maryland
WQO Heuristics to reduce remote costs • Reduce number of WSI calls • Favors atomic WSIs • … but composed WSIs may reduce result cardinality hence reduce overall number of calls • Choose WSI with lowest access cost • Favors lower cost WSIs • … but more expensive WSIs may provide more filtering of results and may reduce result cardinality • Reduce result cardinality • More selective WSIs • WSIs with more input bindings L. Raschid — University of Maryland
WQO Heuristics to choose good pre-plans • Top-down versus bottom-up evaluation • Requires knowledge of dependencies • Details in paper {R1 R3 R4 R5} {R2(S21)} UR3 -> R2 {R1 R2(S22) R3 R4 R5} Uø • Choice of atomic versus composed WSIs ior1 ior3 + ior4 • Cost&selectivity measure • Favor both low access costs and low cardinality • E.g., if access costs of 2 WSIs are similar choose WSI with greater selectivity L. Raschid — University of Maryland
Sample Queries for ACM DL Web Source Query Q3: and keyword = “broadcast disks” Query Q1: SELECT Title, PaperPDF,CoAuthorName FROM Paper, Coauthor WHERE AuthorName = “Michael Franklin” and Coauthor.PaperID = Paper.PaperID Query Q2: SELECT Title, PaperPDF,CoAuthorName FROM Paper, Coauthor WHERE AuthorName = “Michael Franklin” and Coauthor.PaperID = Paper.PaperID and Coauthor.CoAuthorName = ”Stan Zdonik”
Plan 3: Q1 bad Q2 good Plan 1: Plan 2: Coauthor Paper Coauthor Paper ior4 costly ior1 ior2 ior2 Paper Paper Paper Coauthor ior3 ior2 ior3 ior4 costly -- DependentJoin Execution Plans for Query Q1 and Q2 Use atomic ior1 Use composed (ior3,ior4) Use composed (ior3,ior4) Q3 good costly L. Raschid — University of Maryland
Sample Queries for ACM DL Web Source Query Q3: and keyword = “broadcast disks” Query Q1: SELECT Title, PaperPDF,CoAuthorName FROM Paper, Coauthor WHERE AuthorName = “Michael Franklin” and Coauthor.PaperID = Paper.PaperID Query Q2: SELECT Title, PaperPDF,CoAuthorName FROM Paper, Coauthor WHERE AuthorName = “Michael Franklin” and Coauthor.PaperID = Paper.PaperID and Coauthor.CoAuthorName = ”Stan Zdonik”
Plan1 Plan1 Plan2 Plan2 Plan3 Plan3 Response Time for Queries Q1 and Q2 Query # L. Raschid — University of Maryland
Plan1 Plan1 Plan2 Plan2 Plan3 Plan3 Quantile plots for Queries Q1 and Q2 L. Raschid — University of Maryland
Summary of WQO Heuristics Objective: reduce the amount of data delivered to the mediator and minimize remote access cost Factors: Cardinality / Selectivity of dependent join (binding attributes) of WSIs Selectivity / cardinality of the remote relations Costly WSIs Good choice of WSIs and pre-plans relies on ability to construct a realistic cost model
Choice of WSIs for Complex Queries L. Raschid — University of Maryland
WQO Limitations: Choice of (A)tomic and (C)omposed capability 4way and 5way join queries 2 remote relations -(A)tomic /(C)omposed AA AC CA CC Qcard100-AAbest plan (ms) worst plan (ms) 2545 200667AA poor heuristic 639214748 CC Qcard1000-AA good heuristic 20547 200847 AA 24229 2147480 CC Qcard1000-CC good heuristic 20265 200534 AA 2399 466704 CC L. Raschid — University of Maryland
Related Work Capability based rewriting Wrapper cost models: Garlic(IBM), DISCO(INRIA), HP Mediator Optimizers: Garlic(IBM), WSQ/DSQ(Stanford), IRO-DB(Versailles) Adaptive operators: Telegraph, Tukwila, XJoin Reactive optimizers: Query scrambling (Maryland), LEC optimizer (Cornell), EC+D Optimizer (Maryland) Cost/Quality trade-off: Nie+Rao (ASU), Naumann+Freitag (Germany) …..
Current implementation status of WQO • Extensions to randomized relational optimizer. • WebPT tool to predict response time from WebSources. • WebWrapper cost model for WebSources. • Cost based heuristics to choose pre-plans. • Integration into a scrambling enabled optimizer. • Study of pre-plan choice (explored search space) on choice of good plan. L. Raschid — University of Maryland