
A Framework for Optimizing and Parallelizing XQuery

A Framework for Optimizing and Parallelizing XQuery. Xiaogang Li. Motivations. Developing data processing applications is hard - Many data formats exist - Different architectures - Need independence from data format and architecture XML has gained great popularity!


Presentation Transcript


  1. A Framework for Optimizing and Parallelizing XQuery Xiaogang Li

  2. Motivations • Developing data processing applications is hard - Many data formats exist - Different architectures - Need independence from data format and architecture • XML has gained great popularity! - Now the standard language for the internet - Already extensively used as part of Grid/Distributed Computing • High-level declarative languages ease application development - Popularity of Matlab for scientific computations

  3. The Whole Picture (diagram labels: XQuery, XML, HDF5, NetCDF, TEXT, RDMS)

  4. Contributions • Architectural independence - Provide compilation support of XQuery for - Stream processing (VLDB2005) - Parallel processing on clusters (ICS2003, DBPL2003) • Data format independence - Developed techniques to use XML as a logical interface over physical datasets (LCPC 2003) • Performance - Developed a series of optimization techniques for efficient XQuery processing - Developed static analysis techniques to guide compiler optimizations and transformations (XIMP2004, IPDPS2003)

  5. Roadmap • Background - XML, XQuery - Related work • Stream Processing • Virtual XML • Parallelization • Conclusion

  6. eXtensible Markup Language • Specification of a syntax for “encoding” data, with strict rules about how to do so. • A text-based syntax -- written using printable characters (no explicit binary data) • Extensible -- you can define your own tags (essentially data types), within the constraints of the syntax rules • Universal -- the syntax rules ensure that all XML processing software MUST handle a given piece of XML identically. An ideal data exchange format

  7. XML Example (callouts: element tags; attribute of the quantity element)
     <order xmlns="http://w3c.org/Spec/">
       <item>
         <code>"30100026266"</code>
         <desc>Viewsonic E90f Monitor, 0.21mm, DELL Outlet</desc>
         <price>229.99</price>
         <quantity units="gross">2</quantity>
         <deliveryDate date="20Apr2004-12:00h" />
       </item>
       <item>
         <code>"2001234"</code>
         . . . . . .
       </item>
     </order>

  8. XQuery • A declarative language for querying XML - Widely accepted language for querying XML - Declarative: like SQL, easy to use - Powerful: types, user-defined functions, binary expressions - FLWR (for, let, where, return) expressions • Supports XPath as a subset - A query language that selects particular subsets of nodes from an XML document

  9. VMScope - XQuery Code
     unordered (
       for $i in ($x1 to $x2)
       for $j in ($y1 to $y2)
       let $p := document("vmscope.xml")/data/pixel[(x = $i) and (y = $j) and (scale >= $z1)]
       return
         <pixel>
           <latitude> {$i} </latitude>
           <longitude> {$j} </longitude>
           <sum> {accumulate($p)} </sum>
         </pixel>
     )
     define function accumulate ($p) as element {
       if (empty($p)) then $null
       else
         let $max := accumulate(subsequence($p, 2))
         let $q := item-at($p, 1)
         return
           if (($q/scale < $max/scale) or ($max = $null)) then $max else $q
     }

  10. XQuery Example: Apriori Users can write very complex, flexible programs. Recursive functions are the only way to express a reduction

  11. Roadmap • Background - XML, XQuery - Related work • Stream Processing • High-level Abstraction • Parallelization • Conclusion

  12. Query Processing - Related Work • Much of the work focuses on XPath - XPath expressions are regular expressions: easy to analyze • Limited work on optimizing XQuery - Optimizing from a high level using algebra - Translating a query into a tree of operators - Query rewriting based on algebra

  13. Algebra Approach: Limitations • Cannot handle low-level optimizations - loop invariants, common subexpressions … • Hard to capture all features using algebra - Recursive functions, types, aggregations • XQuery is complex; a simple algebra just does not exist

  14. Our Overall Approach • Using compiler technologies for query optimization - Compiler techniques are well developed - Data flow analysis, loop transformation, parallelization • Advanced program analysis, loop transformation, and parallelization techniques can allow efficient execution of XQuery

  15. Roadmap • Background - XML, XQuery - Related work • Stream Processing • Virtual XML • Parallelization • Conclusion

  16. Motivation • Why Streaming Data • Data needs to be analyzed in real time - Stock market, security, climate, network monitoring, telecommunication data management, etc. • Huge amount of data - NASA EOS project: 50 GB per hour • Rapid improvements in networking technologies - 101.13 Gbps at SC2004 bandwidth challenge

  17. Motivation • Why XML - Standard data exchange format for the Internet - Widely adopted in web-based, distributed, and grid computing • Why XQuery - Widely accepted language for querying XML - Easy to use • XQuery is the ideal language for querying XML streams • Can we compile it correctly and efficiently for streaming data?

  18. Challenges • For an arbitrary query, can it be evaluated correctly on unbounded streaming data? - Single traversal of the data is required - Decision should be made by the compiler, not the user • If not, can it be transformed accordingly? • How to generate efficient code for XQuery? - The computations involved are nontrivial - Recursive functions are frequently used - Efficient memory usage is important

  19. Our Solutions • For an arbitrary query, can it be evaluated correctly on unbounded streaming data? - Construct data-flow graph for a query - Static analysis based on data-flow graph • If not, can it be transformed accordingly? - Query transformation techniques based on static analysis • How to generate efficient code for XQuery? - Techniques based on static analysis to minimize memory usage and optimize code - Generating imperative code - Recursive analysis and aggregation rewrite
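
To illustrate what an aggregation rewrite can achieve, the sketch below contrasts a recursively defined reduction with a running aggregate that updates its state once per arriving item. It is not the compiler's actual rewrite, which the slides do not spell out; the class and method names (`AggregationRewriteSketch`, `sumRecursive`, `RunningSum`) are hypothetical.

```java
/** Hypothetical illustration; not the rewrite rules used by the compiler in the slides. */
final class AggregationRewriteSketch {

    /** Recursive reduction in the style of an XQuery user-defined function: needs the whole sequence. */
    static double sumRecursive(double[] items, int from) {
        if (from >= items.length) return 0.0;
        return items[from] + sumRecursive(items, from + 1);
    }

    /** Running aggregate: state is updated once per arriving item, then the item can be dropped. */
    static final class RunningSum {
        private double total;

        void onItem(double item) { total += item; }

        double result() { return total; }
    }

    public static void main(String[] args) {
        double[] items = {1.0, 2.0, 3.5};
        RunningSum running = new RunningSum();
        for (double item : items) running.onItem(item); // one pass, constant memory
        System.out.println(sumRecursive(items, 0) + " " + running.result());
    }
}
```

The two forms agree up to floating-point rounding, because addition is associative and commutative in exact arithmetic; only the second form fits a stream whose length is unknown.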

  20. Query Evaluation Model • Single input stream • Internal computations: linked operators with limited memory • Pipeline operators and blocking operators (diagram labels: Op1, Op2, Op3, Op4)

  21. Pipeline and Blocking Operators • Pipeline Operator: - Each input element produces an output element independently - Selection etc. • Blocking Operator: - Can only generate output after receiving all input elements - Cannot be processed in a single pass - Sort, Join etc. • Progressive Blocking Operator: (1) |output| << |input|: the output can be buffered (2) Associative and commutative operation: the input can be discarded - count(), sum()
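
The three operator kinds can be sketched in Java as follows. This is a minimal illustration, not code from the presentation: the names `OperatorSketch`, `selectPositive`, `RunningAggregate`, and `sortAll` are hypothetical, and stream elements are reduced to plain `double` values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

/** Hypothetical sketch: names are illustrative, not taken from the compiler described in the slides. */
final class OperatorSketch {

    /** Pipeline operator: each input element is handled independently and forwarded at once. */
    static DoubleConsumer selectPositive(DoubleConsumer downstream) {
        return x -> { if (x > 0) downstream.accept(x); };
    }

    /** Progressive blocking operator: keeps a small running state and discards each input. */
    static final class RunningAggregate implements DoubleConsumer {
        private long count;
        private double sum;

        @Override
        public void accept(double x) { count++; sum += x; }

        long count() { return count; }

        double sum() { return sum; }
    }

    /** Blocking operator: must buffer every element before it can produce any output. */
    static List<Double> sortAll(double[] input) {
        List<Double> buffer = new ArrayList<>();
        for (double x : input) buffer.add(x);
        buffer.sort(null); // null comparator means natural ordering
        return buffer;
    }

    public static void main(String[] args) {
        RunningAggregate aggregate = new RunningAggregate();
        DoubleConsumer pipeline = selectPositive(aggregate);
        for (double x : new double[] {3.0, -1.0, 2.0}) pipeline.accept(x); // single pass
        System.out.println(aggregate.count() + " " + aggregate.sum());
    }
}
```

Only `sortAll` needs memory proportional to the input; the selection and the running aggregate both work with constant state, which is what makes them usable on an unbounded stream.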

  22. Single Pass? Check condition 2 in a query (data: pixels with x and y) • Q1: let $i := …/pixel sortby (x) - A blocking operator exists • Q2: let $i := …/pixel [x < count(/pixel)] - A progressive blocking operator is referenced by another pipeline operator (or progressive blocking operator)

  23. Single-Pass? Challenges • Must analyze data dependence - Something like a data dependence graph may be helpful • A query may be flexible and complex - Need a simplified view of the query to make the decision

  24. Overall Framework • Data Flow Graph Construction • High-level Transformation - Horizontal Fusion - Vertical Fusion • Single-Pass Analysis • Low-level Transformation - GNL Generation - Recursion Analysis - Aggregation Rewrite • Stream Code Generation

  25. Stream Data Flow Graph (DFG) • Node: variable - Sequence - Atomic • Edge: dependence relation, v1->v2 if v2 uses v1 - Aggregate dependence - Flow dependence • A DFG is acyclic • Example: S1: stream/pixel[x>0] S2: stream/pixel V1: count() (diagram nodes: S1, S2, v1, i, b)
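
One way to represent such a graph is sketched below. The class layout and every name in it are hypothetical; the wiring of the example (which sequence `count()` is applied to, and `i` iterating over S1) is an assumption, since the slide diagram is not recoverable from the transcript.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a stream data-flow graph; names and wiring are illustrative. */
final class DfgSketch {
    enum NodeKind { SEQUENCE, ATOMIC }
    enum EdgeKind { FLOW, AGGREGATE }

    static final class Node {
        final String name;
        final NodeKind kind;

        Node(String name, NodeKind kind) { this.name = name; this.kind = kind; }
    }

    /** An edge from -> to means that `to` uses `from`. */
    static final class Edge {
        final Node from;
        final Node to;
        final EdgeKind kind;

        Edge(Node from, Node to, EdgeKind kind) { this.from = from; this.to = to; this.kind = kind; }
    }

    final List<Node> nodes = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    Node node(String name, NodeKind kind) {
        Node n = new Node(name, kind);
        nodes.add(n);
        return n;
    }

    void edge(Node from, Node to, EdgeKind kind) { edges.add(new Edge(from, to, kind)); }

    long count(NodeKind kind) { return nodes.stream().filter(n -> n.kind == kind).count(); }

    /** Builds a small graph loosely following the slide's example. */
    static DfgSketch example() {
        DfgSketch g = new DfgSketch();
        Node s1 = g.node("S1: stream/pixel[x>0]", NodeKind.SEQUENCE);
        Node s2 = g.node("S2: stream/pixel", NodeKind.SEQUENCE);
        Node i = g.node("i", NodeKind.ATOMIC);
        Node v1 = g.node("v1: count()", NodeKind.ATOMIC);
        g.edge(s1, i, EdgeKind.FLOW);       // assumed: i is bound to each item of S1
        g.edge(s2, v1, EdgeKind.AGGREGATE); // assumed: v1 aggregates all of S2
        return g;
    }
}
```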

  26. High-level Transformation • Goals 1: Enable single pass evaluation 2: Simplify the DFG for single-pass analysis • Horizontal Fusion and Vertical Fusion - Based on DFG

  27. Horizontal Fusion • Enable single-pass evaluation - Merge sequence nodes with a common prefix • Before: S1: stream/pixel[x>0] S2: stream/pixel/y V1: count() V2: sum() • After: S0: /stream/pixel S1: [x>0] S2: /y V1: count() V2: sum() (diagram nodes: S0, S1, S2, v1, v2, b)
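
The prefix-merging step on the two example paths can be sketched as below. This is a simplified illustration under stated assumptions, not the compiler's algorithm: paths are treated as plain strings, predicates are assumed not to contain `/`, and all names (`HorizontalFusionSketch`, `tokens`, `fuse`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits two path expressions into a shared prefix and two remainders. */
final class HorizontalFusionSketch {

    /** Breaks a path into step names and predicates; assumes predicates contain no '/'. */
    static List<String> tokens(String path) {
        List<String> out = new ArrayList<>();
        for (String step : path.split("/")) {
            if (step.isEmpty()) continue;
            int bracket = step.indexOf('[');
            if (bracket < 0) {
                out.add(step);
            } else {
                if (bracket > 0) out.add(step.substring(0, bracket));
                out.add(step.substring(bracket));
            }
        }
        return out;
    }

    static int commonPrefixLength(List<String> a, List<String> b) {
        int n = 0;
        while (n < a.size() && n < b.size() && a.get(n).equals(b.get(n))) n++;
        return n;
    }

    /** Joins tokens back into path syntax: predicates attach directly, steps get a leading '/'. */
    static String render(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) sb.append(t.startsWith("[") ? t : "/" + t);
        return sb.toString();
    }

    /** Returns {shared prefix, remainder of a, remainder of b}. */
    static String[] fuse(String a, String b) {
        List<String> ta = tokens(a);
        List<String> tb = tokens(b);
        int n = commonPrefixLength(ta, tb);
        return new String[] {
            render(ta.subList(0, n)),
            render(ta.subList(n, ta.size())),
            render(tb.subList(n, tb.size()))
        };
    }

    public static void main(String[] args) {
        String[] fused = fuse("stream/pixel[x>0]", "stream/pixel/y");
        System.out.println("S0: " + fused[0] + "  S1: " + fused[1] + "  S2: " + fused[2]);
    }
}
```

After fusion, the shared prefix is scanned once and both remainders are evaluated against each item it produces, so `count()` and `sum()` can be maintained in the same pass.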

  28. Horizontal Fusion with nested loops • Perform loop unrolling first • Merge sequence nodes accordingly

  29. Before Horizontal Fusion • Requires three scans (diagram labels: Datasets, Output)

  30. After Horizontal Fusion • Requires just one scan (diagram labels: Datasets, Output)

  31. Vertical Fusion • Simplify the DFG and single-pass analysis - Merge a cluster of nodes linked by flow dependence edges (diagram: DFG before and after vertical fusion; nodes S1, S2, S, i, j, b, v)

  32. Single-pass Analysis • Can a query be evaluated on the fly? THEOREM 1. If a DFG contains more than one sequence node after vertical fusion, it cannot be evaluated correctly in a single pass. Reason: with a single input stream, each sequence node requires one traversal

  33. Single-pass Analysis - Continued THEOREM 2. For any two atomic nodes n1 and n2, if (1) n1 and n2 are aggregate dependent on a sequence node and (2) there is a path between them, the query may not be evaluated in a single pass. Reason: A progressive blocking operator is referenced by another progressive blocking operator. Example: count (pixel) where /x>0.01*sum(/pixel/x)

  34. Single-pass Analysis - Continued THEOREM 3. If there is a cycle in a DFG, the corresponding query may not be evaluated correctly using a single pass. Reason: A progressive blocking operator is referenced by a pipeline operator (diagram nodes: S1, S2, i, j, b, v)

  35. Single-pass Analysis • Check conditions corresponding to Theorems 1, 2, and 3 - Stop further processing if any condition is true • Completeness of the analysis - If a query without a blocking operator passes the test, it can be evaluated in a single pass THEOREM 4. If the results of a progressive blocking operator are referred to by a pipeline operator or a progressive blocking operator, then for its DFG, at least one of the three conditions holds true
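
The three checks can be sketched over a small graph structure as follows. This is an illustrative reading of Theorems 1 to 3, not the authors' implementation: the class `SinglePassSketch` and its methods are hypothetical, the graph is assumed to have been vertically fused already, and "a path between them" is read as a directed path in either direction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the three single-pass checks; not the authors' implementation. */
final class SinglePassSketch {
    private final Set<String> sequenceNodes = new HashSet<>();
    private final Map<String, List<String>> successors = new HashMap<>();
    private final List<String[]> aggregateEdges = new ArrayList<>();

    void addSequence(String name) {
        sequenceNodes.add(name);
        successors.putIfAbsent(name, new ArrayList<>());
    }

    void addAtomic(String name) { successors.putIfAbsent(name, new ArrayList<>()); }

    /** Adds from -> to; `aggregate` marks an aggregate dependence, otherwise a flow dependence. */
    void addEdge(String from, String to, boolean aggregate) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        successors.putIfAbsent(to, new ArrayList<>());
        if (aggregate) aggregateEdges.add(new String[] {from, to});
    }

    /** True if `to` can be reached from `from` over one or more edges. */
    boolean reaches(String from, String to) {
        Set<String> seen = new HashSet<>();
        List<String> stack = new ArrayList<>(successors.getOrDefault(from, new ArrayList<>()));
        while (!stack.isEmpty()) {
            String n = stack.remove(stack.size() - 1);
            if (n.equals(to)) return true;
            if (seen.add(n)) stack.addAll(successors.getOrDefault(n, new ArrayList<>()));
        }
        return false;
    }

    /** Theorem 1: more than one sequence node remains after vertical fusion. */
    boolean condition1() { return sequenceNodes.size() > 1; }

    /** Theorem 2: two atomic nodes, each aggregate dependent on a sequence node, are linked by a path. */
    boolean condition2() {
        List<String> targets = new ArrayList<>();
        for (String[] e : aggregateEdges) {
            boolean atomicTarget = !sequenceNodes.contains(e[1]);
            if (sequenceNodes.contains(e[0]) && atomicTarget && !targets.contains(e[1])) targets.add(e[1]);
        }
        for (String a : targets)
            for (String b : targets)
                if (!a.equals(b) && reaches(a, b)) return true;
        return false;
    }

    /** Theorem 3: the graph contains a cycle. */
    boolean condition3() {
        for (String n : successors.keySet()) if (reaches(n, n)) return true;
        return false;
    }

    /** Single-pass evaluation is ruled out as soon as any condition holds. */
    boolean mayBeSinglePass() { return !(condition1() || condition2() || condition3()); }
}
```

For the Theorem 2 example, a graph with aggregate edges from the pixel sequence to both a `sum` node and a `count` node, plus a flow edge from `sum` to `count`, triggers `condition2()`.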

  36. A Review of the High-level Transformation and Analysis (diagram: successive DFGs for an example query; nodes S1, S2, S, v1, v, i, b) Result: cannot be evaluated in a single pass!

  37. Code Generation • Using SAX XML stream parser - XML document is parsed as a stream of events - Event-driven: need to generate code to handle each event • Using Java JDK - Our compiler generates Java source code
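
A hand-written example of the event-driven style is sketched below; it is not output of the compiler described in the slides. It evaluates count(stream/pixel[x>0]) in one pass using the standard JAXP SAX API. The class name `PixelCountHandler`, the helper `countPositive`, and the sample document layout are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: counts pixel elements that have an x value greater than 0. */
final class PixelCountHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inPixel;
    private boolean inX;
    private boolean pixelMatches;
    private long count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("pixel")) {
            inPixel = true;
            pixelMatches = false;
        } else if (inPixel && qName.equals("x")) {
            inX = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inX) text.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inX && qName.equals("x")) {
            inX = false;
            if (Double.parseDouble(text.toString().trim()) > 0) pixelMatches = true;
        } else if (qName.equals("pixel")) {
            inPixel = false;
            if (pixelMatches) count++; // only the running count is kept, not the pixel
        }
    }

    /** Parses the document once; assumes pixels are not nested and x occurs only inside pixel. */
    static long countPositive(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PixelCountHandler handler = new PixelCountHandler();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.count;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<stream><pixel><x>1</x><y>2</y></pixel><pixel><x>-3</x><y>4</y></pixel></stream>";
        System.out.println(countPositive(xml));
    }
}
```

The handler never builds a tree: each event updates a few flags and a counter, so memory use stays constant regardless of the length of the stream.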

  38. Experiment • Query Benchmark - Selected Benchmarks from XMARK - Satellite, Virtual Microscope, Frequent Item • Systems compared with - Galax - Saxon - Qizx/Open

  39. Performance: XMARK Benchmark • >25% faster on small dataset • Scales well on very large datasets

  40. Performance: Real Applications • More than one order of magnitude faster on small dataset • Works well for very large datasets

  41. Summary • Provide a formal approach for query evaluation on XML streams - Query transformation to enable correct execution on streams - Formal methods for single-pass analysis - Strategies for efficient low-level code generation - Experimental results show an advantage over other well-known systems

  42. Roadmap • Background - XML, XQuery - Related work • Stream Processing • Virtual XML • Parallelization • Conclusion

  43. Support High-Level Abstraction • Understanding the physical details is hard, but necessary for performance • Logical Schema: A logical view over the data for the programmer • Physical Schema: Low-level details of physical storage, provided to compilers

  44. System Architecture (diagram labels: External Schema, XML Mapping Service, logical XML schema, physical XML schema, XQuery Sources, Compiler, C++/C)

  45. High-level and low-level XQuery • High-level query: - Query based on the logical schema - Developed by programmers • Low-level query: - Query based on the physical schema - Retrieves data by calling library functions • The high-level query is transformed to a low-level query by our compiler - The user can still modify the low-level query if not satisfied

  46. Mapping to low-level Query • A number of getData functions to retrieve data streams - getData($x) - getData($x,$y) • getData functions are written in XQuery - Allows analysis and transformation • Find the optimal library function to call
      unordered (
        for $i in ($x1 to $x2)
        for $j in ($y1 to $y2)
        let $p := getData($i, $j)
        return
          <pixel>
            <latitude> {$i} </latitude>
            <longitude> {$j} </longitude>
            <sum> {accumulate($p)} </sum>
          </pixel>
      )

  47. Compiler Techniques • Insert getData functions - Compatibility: output should be a superset of the original data stream - Performance: want the smallest superset • Query rewriting based on relational algebra - Reduce to canonical forms - Compare canonical forms

  48. Comparison with Manual - VMScope

  49. Roadmap • Background - XML, XQuery - Related work • Stream Processing • Virtual XML • Parallelization • Conclusion

  50. Generalized Nested Loop (GNL) • An intermediate representation that explicitly defines - iterative structures for retrieving data - aggregation operations to be performed on the qualified data • Example: For $b in student/score, @t = cis: sum = sum + b; count = count + 1 (callouts: index variable, path expr, filter expr, loop body)
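
The four parts of the example (index variable, path expression, filter expression, loop body) map naturally onto an imperative loop. The Java below is a hand-written sketch of that mapping, not the compiler's generated code; the `Score` class, the reading of `@t = cis` as an attribute test, and all method names are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical sketch of a generalized nested loop: iterate, filter, then run the loop body. */
final class GnlSketch {

    /** One item matched by the path expression student/score, with its attribute t. */
    static final class Score {
        final String t;
        final double value;

        Score(String t, double value) { this.t = t; this.value = value; }
    }

    /** Runs `body` on every item of `pathResult` that satisfies `filter`. */
    static <T> void run(Iterable<T> pathResult, Predicate<T> filter, Consumer<T> body) {
        for (T item : pathResult) {
            if (filter.test(item)) body.accept(item);
        }
    }

    /** Returns {sum, count} over scores whose attribute t equals "cis". */
    static double[] sumAndCount(List<Score> scores) {
        double[] acc = new double[2]; // acc[0] = sum, acc[1] = count
        run(scores, s -> s.t.equals("cis"), s -> { acc[0] += s.value; acc[1] += 1; });
        return acc;
    }

    public static void main(String[] args) {
        List<Score> scores = Arrays.asList(
            new Score("cis", 90), new Score("math", 70), new Score("cis", 80));
        double[] result = sumAndCount(scores);
        System.out.println("sum=" + result[0] + " count=" + result[1]);
    }
}
```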
