
Adaptive Query Processing: Progress and Challenges

Adaptive Query Processing: Progress and Challenges. Alon Halevy University of Washington [ Gore ] Joint work with Zack Ives, Dan Weld (later: Nimble Technology). Data Integration Systems.





Presentation Transcript


  1. Adaptive Query Processing: Progress and Challenges Alon Halevy University of Washington [Gore] Joint work with Zack Ives, Dan Weld (later: Nimble Technology)

  2. Data Integration Systems Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet: in enterprises, WWW, big science.

  3. Recent Trends in Data Integration Research • Issues such as: architectures, query reformulation, wrapper construction are reasonably well understood (but still good work going on). • Query execution and optimization raise significant challenges. • Problems for traditional query processing model: • Few statistics (autonomous sources) • Unanticipated delays and failures (network-bound sources). • Conclusion (ours): cannot afford to separate optimization from execution. Need to be adaptive. • See IEEE Data Engineering Bulletin, June, 2000.

  4. Outline • Tukwila (version 1): • Interleaving optimization and execution at the core. • The unsolved problem: when to switch? • The complicating new challenges: • XML, want first tuples fast. • Tukwila (version 2): • completely pipelined XML query processing. • Some experiences from Nimble

  5. Tukwila: Version 1 • Key idea: build adaptive features into the core of the system. • Interleave planning and execution (replan when you know more about your data) • Rule-based mechanism for changing behavior. • Adaptive query operators: • Revival of the double-pipelined join. • Collectors (a.k.a. “smart union”). • See details in SIGMOD-99.

  6. Tukwila Data Integration System Novel components: • Event handler • Optimization-execution loop

  7. Handling Execution Events • Adaptive execution via event-condition-action rules • During execution, events are generated: timeout, n tuples read, operator opens/closes, memory overflows, execution fragment completes, … • Events trigger rules: • Test conditions: memory free, tuples read, operator state, operator active, … • Execution actions: re-optimize, reduce memory, activate/deactivate operator, …

  8. Interleaving Planning and Execution Re-optimize if at an unexpected state: • Evaluate at key points, re-optimize the un-executed portion of the plan [Kabra/DeWitt SIGMOD98] • Plan has pipelined units (fragments) • Send back statistics to optimizer. • Maintain optimizer state for later reuse. Example rule: WHEN end_of_fragment(0) IF card(result) > 100,000 THEN re-optimize
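The event-condition-action mechanism of slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Rule`, `EventHandler`, `fire`, and the `card_result` state key are hypothetical and do not come from Tukwila. Only the rule itself (re-optimize when a fragment ends with more than 100,000 result tuples) is taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    event: str                          # event name that triggers the rule
    condition: Callable[[Dict], bool]   # test over the execution state
    action: Callable[[Dict], None]      # executed when the test passes


@dataclass
class EventHandler:
    """Hypothetical event handler: dispatches events to matching rules."""
    rules: List[Rule] = field(default_factory=list)

    def fire(self, event: str, state: Dict) -> int:
        """Run every rule registered for `event`; return how many acted."""
        fired = 0
        for rule in self.rules:
            if rule.event == event and rule.condition(state):
                rule.action(state)
                fired += 1
        return fired


def demo() -> List[str]:
    # Assumed state layout: the result cardinality plus a log of actions.
    state = {"card_result": 250_000, "log": []}
    handler = EventHandler([
        # WHEN end_of_fragment IF card(result) > 100,000 THEN re-optimize
        Rule("end_of_fragment",
             lambda s: s["card_result"] > 100_000,
             lambda s: s["log"].append("re-optimize")),
        # A second, made-up rule showing that other events are ignored here.
        Rule("memory_overflow",
             lambda s: True,
             lambda s: s["log"].append("reduce memory")),
    ])
    handler.fire("end_of_fragment", state)
    return state["log"]
```

Calling `demo()` fires only the first rule, because the event name matches and the cardinality exceeds the threshold. A real engine would also hand the collected statistics back to the optimizer, which this sketch leaves out.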

  9. Adaptive Operators: Double Pipelined Join Hybrid Hash Join: • Partially pipelined: no output until inner read • Asymmetric (inner vs. outer): optimization requires source behavior knowledge Double Pipelined Hash Join (enhancement to [Wilschut PDIS91]: uses multithreading, handles overflow): • Outputs data immediately • Symmetric: requires less source knowledge to optimize
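The symmetric behaviour described on slide 9 can be illustrated with a minimal single-threaded sketch. It assumes tuples from both inputs arrive interleaved as `(side, key, row)` triples; the function name and that input format are hypothetical. The multithreading and memory-overflow handling mentioned on the slide are deliberately omitted.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Tuple


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, Hashable, object]]
) -> Iterator[Tuple[object, object]]:
    """Join two inputs on `key`, emitting matches as soon as they exist.

    `arrivals` yields (side, key, row) in arrival order, side in {"L", "R"}.
    """
    # One hash table per input: every tuple is both inserted and probed.
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, row in arrivals:
        other = "R" if side == "L" else "L"
        tables[side][key].append(row)            # build on own side
        for match in tables[other].get(key, []):  # probe the other side
            # Always emit (left_row, right_row), whichever arrived last.
            yield (row, match) if side == "L" else (match, row)
```

Because neither input is designated as the inner relation, output starts with the first matching pair regardless of which source is faster. The cost is that both hash tables stay in memory.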

  10. Adaptive Operators: Collector Utilize mirrors and overlapping sources to produce results quickly • Dynamically adjust to source speed & availability • Scale to many sources without exceeding net bandwidth • Based on policy expressed via rules. Example rules: WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books); WHEN timeout(NYTimes) DO activate(alt.books)
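The collector policy on slide 10 can be sketched as a fallback table keyed by the source that timed out. This is a sequential simulation under assumptions: `collect`, `SourceTimeout`, `timed_out`, and the fetcher callables are hypothetical names, and a real collector would contact sources concurrently over the network. Only the two rules in `RULES` come from the slide.

```python
from typing import Callable, Dict, List


class SourceTimeout(Exception):
    """Raised by a fetcher to signal that its source did not answer in time."""


# WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books)
# WHEN timeout(NYTimes)     DO activate(alt.books)
RULES: Dict[str, List[str]] = {
    "CustReviews": ["NYTimes", "alt.books"],
    "NYTimes": ["alt.books"],
}


def timed_out() -> List[str]:
    """Stand-in fetcher for a source that never answers."""
    raise SourceTimeout()


def collect(primary: str,
            fetchers: Dict[str, Callable[[], List[str]]],
            on_timeout: Dict[str, List[str]]) -> List[str]:
    """Union the rows of all activated sources, dropping duplicates."""
    pending = [primary]
    activated = {primary}
    seen = set()
    results: List[str] = []
    while pending:
        name = pending.pop(0)
        try:
            rows = fetchers[name]()
        except SourceTimeout:
            # Apply the rule for this source: activate its alternatives once.
            for alt in on_timeout.get(name, []):
                if alt not in activated:
                    activated.add(alt)
                    pending.append(alt)
            continue
        for row in rows:
            if row not in seen:      # overlapping sources may repeat rows
                seen.add(row)
                results.append(row)
    return results
```

With `CustReviews` timing out, both alternatives are activated and their overlapping rows are merged once. When the primary answers, no alternative is contacted.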

  11. Highlights from Version 1 • It worked well (graphs to prove it)! • Unified architecture that encompassed previous techniques: • Choose nodes (Cole & Graefe) • Mid-stream re-optimization (Kabra & DeWitt) • Query scrambling (Urhan, Franklin, Amsaleg) • Optimizer can have global view of different factors affecting adaptive behavior.

  12. The Unsolved Problem • Find interleaving points? When to switch from optimization to execution? • Some straightforward solutions worked reasonably, but student who was supposed to solve the problem graduated prematurely. • Some work on this problem: • Rick Cole (Informix) • Benninghoff & Maier (OGI). • One solution being explored: execute first and break pipeline later as needed. • Another solution: change operator ordering in mid-flight (Eddies, Avnur & Hellerstein).

  13. More Urgent Problems • Users want answers immediately: • Optimize time to first tuple • Give approximate results earlier. • XML emerges as a preferred platform for data integration: • But all existing XML query processors are based on first loading XML into a repository.

  14. Tukwila Version 2 • Able to transform, integrate and query arbitrary XML documents. • Support for output of query results as early as possible: • Streaming model of XML query execution. • Efficient execution over remote sources that are subject to frequent updates. • Philosophy: how can we adapt relational and object-relational execution engines to work with XML?

  15. Tukwila V2 Highlights • The X-scan operator that maps XML data into tuples of subtrees. • Support for efficient memory representation of subtrees (use references to minimize replication). • Special operators for combining and structuring bindings as XML output.

  16. Tukwila V2 Architecture

  17. Example XML File <db> <book publisher="mkp"> <title>Readings in Database Systems</title> <editors> <name>Stonebraker</name> <name>Hellerstein</name> </editors> <isbn>123-456-X</isbn> </book> <company ID="mkp"> <name>Morgan Kaufmann</name> <city>San Mateo</city> <state>CA</state> </company> </db>

  18. XML Data Graph

  19. Example Query WHERE <db> <book publisher=$pID> <title>$t</> </> ELEMENT_AS $b </> IN "books.xml", <db> <publication title=$t> <source ID=$pID>$p</> <price>$pr</> </> </> IN "amazon.xml", $pr < 49.95 CONSTRUCT <book> <name>$t</> <publisher>$p</> </>

  20. Query Execution Plan

  21. X-Scan • The operator at the leaves of the plan. • Given an XML stream and a set of regular expressions, produces a set of bindings. • Supports both tree and graph data. • Uses a set of state machines to traverse the data and match the patterns. • Maintains a list of unseen element IDs, and resolves them upon arrival.
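A rough feel for the streaming idea behind X-scan can be had from the Python sketch below. It is not the Tukwila operator: it stands in for the state machines with a stack of open tags matched against regular expressions over the path, it uses the standard-library `xml.etree.ElementTree.iterparse`, and it ignores graph data (ID references) entirely. The function name, the pattern syntax, and the variable `$n` are assumptions made for illustration. The input is the slide 17 document with the company name element closed as `</name>`.

```python
import io
import re
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List, Tuple

SLIDE_17_XML = (
    '<db><book publisher="mkp"><title>Readings in Database Systems</title>'
    '<editors><name>Stonebraker</name><name>Hellerstein</name></editors>'
    '<isbn>123-456-X</isbn></book>'
    '<company ID="mkp"><name>Morgan Kaufmann</name><city>San Mateo</city>'
    '<state>CA</state></company></db>'
)


def x_scan_sketch(stream, patterns: Dict[str, str]
                  ) -> Iterator[Tuple[str, ET.Element]]:
    """Yield (variable, subtree) bindings while the stream is being parsed."""
    compiled = {var: re.compile(p) for var, p in patterns.items()}
    path: List[str] = []                 # tags of the currently open elements
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            current = "/".join(path)     # e.g. "db/book/title"
            for var, regex in compiled.items():
                if regex.fullmatch(current):
                    yield var, elem      # subtree is complete at "end"
            path.pop()


def demo() -> List[Tuple[str, str]]:
    patterns = {
        "$t": r"db/book/title",
        "$n": r"db/(book/editors|company)/name",
    }
    stream = io.BytesIO(SLIDE_17_XML.encode("utf-8"))
    return [(var, elem.text) for var, elem in x_scan_sketch(stream, patterns)]
```

Bindings are produced as each matching element closes, so a consumer can start work before the rest of the document has arrived. That is the property the slides emphasise for getting first tuples fast.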

  22. X-scan Data Structures

  23. State Machines for X-scan

  24. Other Features of Tukwila V.2 • X-scan: • Can also be made to preserve XML order. • Careful handling of cycles in the XML graph. • Can apply certain selections to the bindings. • Uses much of the code of Tukwila I. • No modifications to traditional operators. • XML output producing operators. • Nest operator.

  25. In the “Pipeline” • Partial answers: no blocking. Produce approximate answers as data is streaming. • Policies for recovering from memory overflow [More Zack]. • Efficient updating of XML documents (and an XML update language) [w/Tatarinov] • Dan Suciu: a modular/composable toolset for manipulating XML. • Automatic generation of data source descriptions (Doan & Domingos)

  26. First 5 Results

  27. Completion Time

  28. Intermediate Conclusions • First scalable XML query processor for networked data. • Work done in relational query processing is very relevant to XML query processing. • We want to avoid decomposing XML data into relational structures.

  29. Some Observations from Nimble • What is Nimble? • Founded in June, 1999 with Dan Weld. • Data integration engine built on an XML platform. • Query language is XML-QL. • Mostly geared to enterprise integration, some advanced web applications. • 70+ person company (and hiring!) • Ships in trucks (first customer is Paccar).

  30. System Architecture [Diagram; labels: XML Query; data sources: XML, Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages; Front-End: Lens Builder™, User Applications, Lens™, File InfoBrowser™, Software Developers Kit, NIMBLE™ APIs; Management Tools; Integration Layer: Nimble Integration Engine™, Metadata Server, Cache, Compiler, Executor, Security Tools, Common XML View; Integration Builder; Concordance Developer; Data Administrator]

  31. The Current State of Enterprise Information • Explosion of intranet and extranet information • 80% of corporate information is unmanaged • By 2004 30X more enterprise data than 1999 • The average company: • maintains 49 distinct enterprise applications • spends 35% of total IT budget on integration-related efforts Source: Gartner, 1999

  32. Design Issues • Query language for XML: tracking the W3C committee. • The algebra: • Needs to handle XML, relational, hierarchical and support it all efficiently! • Need to distinguish physical from logical algebra. • Concordance tables need to be an integral part of the system. Need to think of data cleaning. • Need to deal with down times of data sources (or refusal times). • Need to provide range of options between on-demand querying and pre-materialization.

  33. Non-Technical Issues • SQL not really a standard. • Legacy systems are not necessarily old. • IT managers skeptical of truths. • People are very confused out there. • Need a huge organization to support the effort.
