A Framework for Optimizing and Parallelizing XQuery
Xiaogang Li
Motivations
• Developing data processing applications is hard
  - Many data formats exist
  - Different architectures
  - Need independence from data format and architecture
• XML has gained great popularity
  - Now the standard data exchange format for the Internet
  - Already used extensively in Grid and distributed computing
• High-level declarative languages ease application development
  - Popularity of Matlab for scientific computations
The Whole Picture
[Figure: XQuery and XML on top of the data sources HDF5, NetCDF, TEXT, RDMS, and XML]
Contributions
• Architectural independence
  - Provide compilation support of XQuery for
    - Stream processing (VLDB 2005)
    - Parallel processing on clusters (ICS 2003, DBPL 2003)
• Data format independence
  - Developed techniques to use XML as a logical interface over physical datasets (LCPC 2003)
• Performance
  - Developed a series of optimization techniques for efficient XQuery processing
  - Developed static analysis techniques to guide compiler optimizations and transformations (XIMP 2004, IPDPS 2003)
Roadmap
• Background
  - XML, XQuery
  - Related work
• Stream Processing
• Virtual XML
• Parallelization
• Conclusion
eXtensible Markup Language
• Specification of a syntax for "encoding" data, with strict rules about how to do so
• A text-based syntax: written using printable characters (no explicit binary data)
• Extensible: you can define your own tags (essentially data types) within the constraints of the syntax rules
• Universal: the syntax rules ensure that all XML processing software MUST handle a given piece of XML identically
An ideal data exchange format
XML Example
<order xmlns="http://w3c.org/Spec/">
  <item>
    <code>"30100026266"</code>
    <desc>Viewsonic E90f Monitor, 0.21mm, DELL Outlet</desc>
    <price>229.99</price>
    <quantity units="gross">2</quantity>
    <deliveryDate date="20Apr2004-12:00h" />
  </item>
  <item>
    <code>"2001234"</code>
    . . . . . .
  </item>
</order>
Callouts in the figure: tags; element; attribute of this quantity element (units)
XQuery
• A declarative language for querying XML
  - Widely accepted language for querying XML
  - Declarative: like SQL, easy to use
  - Powerful: types, user-defined functions, binary expressions
  - FLWR (for, let, where, return) expressions
• Supports XPath as a subset
  - XPath is a query language that selects particular subsets of nodes from an XML document
VMScope: XQuery Code
unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := document("vmscope.xml")/data/pixel[(x = $i) and (y = $j) and (scale >= $z1)]
  return
    <pixel>
      <latitude> {$i} </latitude>
      <longitude> {$j} </longitude>
      <sum> {accumulate($p)} </sum>
    </pixel>
)
define function accumulate ($p) as element {
  if (empty($p)) then $null
  else
    let $max := accumulate(subsequence($p, 2))
    let $q := item-at($p, 1)
    return
      if ($max = $null) or ($q/scale >= $max/scale) then $q else $max
}
XQuery Example: Apriori
Users can write very complex, flexible programs.
Recursive functions are the only way to express reductions.
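To make the point about recursive reductions concrete, here is a minimal Python sketch, not the author's code. It assumes the recursive accumulate function in the VMScope example is meant to return the pixel with the largest scale, and it models a pixel as a plain dictionary (a hypothetical representation). The second function is the loop form that an aggregation rewrite would aim for: same result, one pass, constant extra memory.

```python
from typing import Dict, Optional, Sequence

Pixel = Dict[str, float]  # hypothetical record, e.g. {"x": 1, "y": 2, "scale": 3}


def accumulate_recursive(pixels: Sequence[Pixel]) -> Optional[Pixel]:
    """Recursive reduction in the style of the XQuery function."""
    if not pixels:
        return None
    best_of_rest = accumulate_recursive(pixels[1:])
    first = pixels[0]
    # Keep the head unless a later pixel has a strictly larger scale.
    if best_of_rest is not None and first["scale"] < best_of_rest["scale"]:
        return best_of_rest
    return first


def accumulate_iterative(pixels: Sequence[Pixel]) -> Optional[Pixel]:
    """Equivalent single-pass loop: only the current best pixel is kept."""
    best: Optional[Pixel] = None
    for pixel in pixels:
        if best is None or pixel["scale"] > best["scale"]:
            best = pixel
    return best
```

Both versions return the earliest pixel among those with the maximal scale, so the loop can replace the recursion without changing the result.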
Roadmap
• Background
  - XML, XQuery
  - Related work
• Stream Processing
• High-level Abstraction
• Parallelization
• Conclusion
Query Processing: Related Work
• Much of the work focuses on XPath
  - XPath expressions are regular expressions: easy to analyze
• Limited work on optimizing XQuery
  - Optimizing at a high level using an algebra
  - Translating the query into a tree of operators
  - Query rewriting based on the algebra
Algebra Approach: Limitations
• Cannot handle low-level optimizations
  - Loop invariants, common subexpressions, …
• Hard to capture all language features in an algebra
  - Recursive functions, types, aggregations
• XQuery is complex; a simple algebra for it just does not exist
Our Overall Approach
• Use compiler technologies for query optimization
  - Compiler techniques are well developed
  - Data flow analysis, loop transformation, parallelization
Advanced program analysis, loop transformation, and parallelization techniques can enable efficient execution of XQuery
Roadmap
• Background
  - XML, XQuery
  - Related work
• Stream Processing
• Virtual XML
• Parallelization
• Conclusion
Motivation
• Why streaming data?
• Data needs to be analyzed in real time
  - Stock market, security, climate, network monitoring, telecommunication data management, etc.
• Huge amounts of data
  - NASA EOS project: 50 GB per hour
• Rapid improvements in networking technologies
  - 101.13 Gbps at the SC2004 Bandwidth Challenge
Motivation
• Why XML?
  - Standard data exchange format for the Internet
  - Widely adopted in web-based, distributed, and grid computing
• Why XQuery?
  - Widely accepted language for querying XML
  - Easy to use
XQuery is the ideal language for querying XML streams
Can we compile it correctly and efficiently for streaming data?
Challenges
• Can an arbitrary query be evaluated correctly on unbounded streaming data?
  - A single traversal of the data is required
  - The decision should be made by the compiler, not the user
• If not, can it be transformed accordingly?
• How to generate efficient code for XQuery?
  - The computations involved are nontrivial
  - Recursive functions are frequently used
  - Efficient memory usage is important
Our Solutions
• Can an arbitrary query be evaluated correctly on unbounded streaming data?
  - Construct a data-flow graph for the query
  - Static analysis based on the data-flow graph
• If not, can it be transformed accordingly?
  - Query transformation techniques based on static analysis
• How to generate efficient code for XQuery?
  - Techniques based on static analysis to minimize memory usage and optimize code
  - Generating imperative code
  - Recursion analysis and aggregation rewrite
Query Evaluation Model
• Single input stream
• Internal computations
  - Limited memory
  - Linked operators
• Pipeline operators and blocking operators
[Figure: linked operators Op1, Op2, Op3, Op4]
Pipeline and Blocking Operators
• Pipeline operator:
  - Each input element produces an output element independently
  - Selection, etc.
• Blocking operator:
  - Can only generate output after receiving all input elements
  - Cannot be processed in a single pass
  - Sort, join, etc.
• Progressive blocking operator:
  - (1) |output| << |input|: we can buffer the output
  - (2) Associative and commutative operation: we can discard the input
  - count(), sum()
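The three operator classes can be illustrated with a small Python sketch. The function names are hypothetical and chosen only for this illustration; the point is how much state each kind of operator has to hold while the stream goes by.

```python
from typing import Iterable, Iterator, List, Tuple


def select_positive(stream: Iterable[float]) -> Iterator[float]:
    """Pipeline operator: each element is handled on its own, no buffering."""
    for value in stream:
        if value > 0:
            yield value


def sort_all(stream: Iterable[float]) -> List[float]:
    """Blocking operator: no output is possible before the last input arrives."""
    return sorted(stream)  # buffers the whole input


def sum_and_count(stream: Iterable[float]) -> Tuple[float, int]:
    """Progressive blocking operator: the result is only final at the end,
    but each input is folded into a small accumulator and then discarded."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
    return total, count
```

A selection feeding an aggregate, such as `sum_and_count(select_positive(stream))`, still needs only one traversal, while any use of `sort_all` forces the whole input to be buffered.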
Single Pass?
Pixels with x and y
Q1: let $i := …/pixel sortby (x)
  - A blocking operator exists
Q2: let $i := …/pixel[x < count(/pixel)]
  - A progressive blocking operator is referenced by another pipeline operator (or progressive blocking operator)
Check condition 2 in a query
Single-Pass? Challenges
• Must analyze data dependence
  - Something like a data dependence graph may be helpful
• A query may be flexible and complex
  - Need a simplified view of the query to make the decision
Overall Framework
[Figure: compiler stages: Data Flow Graph Construction; High-level Transformation (Horizontal Fusion, Vertical Fusion); Single-Pass Analysis; Low-level Transformation (GNL Generation, Recursion Analysis, Aggregation Rewrite); Stream Code Generation]
Stream Data Flow Graph (DFG)
• Node: variable
  - Sequence
  - Atomic
• Edge: dependence relation; v1 -> v2 if v2 uses v1
  - Aggregate dependence
  - Flow dependence
• A DFG is acyclic
Example: S1: stream/pixel[x>0]; S2: stream/pixel; V1: count()
[Figure: DFG with nodes S1, S2, v1, i, b]
High-level Transformation
• Goals
  - 1: Enable single-pass evaluation
  - 2: Simplify the DFG for single-pass analysis
• Horizontal Fusion and Vertical Fusion
  - Based on the DFG
Horizontal Fusion
• Enable single-pass evaluation
  - Merge sequence nodes with a common prefix
Before: S1: stream/pixel[x>0]; S2: stream/pixel/y; V1: count(); V2: sum()
After: S0: /stream/pixel; S1: [x>0]; S2: /y; V1: count(); V2: sum()
[Figure: DFG before fusion (S1, S2, v1, v2, b) and after fusion (S0, S1, S2, v1, v2, b)]
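The prefix merge on this slide can be sketched in Python. This is a simplified illustration rather than the compiler's algorithm: it treats each path expression as a list of steps and predicates, factors out the longest common prefix as the shared sequence node (S0 in the slide), and leaves each original node with only its remaining steps. The function names and the regular expression used to split paths are assumptions made for this sketch.

```python
import re
from typing import Dict, List, Tuple


def split_steps(path: str) -> List[str]:
    """Split 'stream/pixel[x>0]' into ['stream', 'pixel', '[x>0]']."""
    return re.findall(r"\[[^\]]*\]|[^/\[\]]+", path)


def horizontal_fusion(paths: Dict[str, str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (common prefix, remaining steps per sequence node)."""
    steps = {name: split_steps(path) for name, path in paths.items()}
    prefix: List[str] = []
    # Walk the step lists in lockstep until they first disagree.
    for column in zip(*steps.values()):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    rest = {name: s[len(prefix):] for name, s in steps.items()}
    return prefix, rest


prefix, rest = horizontal_fusion({"S1": "stream/pixel[x>0]", "S2": "stream/pixel/y"})
print(prefix)  # ['stream', 'pixel']
print(rest)    # {'S1': ['[x>0]'], 'S2': ['y']}
```

With the shared prefix scanned once, the count over S1 and the sum over S2 can both be updated during the same traversal of stream/pixel.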
Horizontal Fusion with Nested Loops
• Perform loop unrolling first
• Merge sequence nodes accordingly
Before Horizontal Fusion
[Figure: datasets and output]
Requires three scans
After Horizontal Fusion
[Figure: datasets and output]
Requires just one scan
Vertical Fusion
• Simplify the DFG and the single-pass analysis
  - Merge a cluster of nodes linked by flow dependence edges
[Figure: example DFGs before and after vertical fusion; node labels S1, S2, S, i, j, v, b]
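One simplified reading of vertical fusion is to collapse every group of nodes connected by flow dependence edges into a single node, so that only aggregate dependence edges remain between clusters. The Python sketch below does this with a union-find structure. The edge representation, the function name, and the example graph are assumptions for illustration and are not taken from the slides.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, target, "flow" | "aggregate")


def vertical_fusion(nodes: Iterable[str], edges: Iterable[Edge]):
    """Merge nodes linked by flow edges; return (clusters, fused aggregate edges)."""
    nodes = list(nodes)
    edges = list(edges)
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst, dep in edges:
        if dep == "flow":
            parent[find(dst)] = find(src)

    clusters: Dict[str, List[str]] = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    # Flow edges are now internal to a cluster, so only aggregate edges survive.
    fused: Set[Tuple[str, str]] = {
        (find(src), find(dst)) for src, dst, dep in edges if dep == "aggregate"
    }
    return clusters, fused


clusters, fused = vertical_fusion(
    ["S1", "i", "S2", "v"],
    [("S1", "i", "flow"), ("i", "S2", "flow"), ("S2", "v", "aggregate")],
)
print(clusters)  # {'S1': ['S1', 'i', 'S2'], 'v': ['v']}
print(fused)     # {('S1', 'v')}
```

In this hypothetical example the chain S1, i, S2 becomes one sequence cluster, and the graph handed to the single-pass analysis has just that cluster and the aggregate node v.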
Single-pass Analysis
• Can a query be evaluated on the fly?
THEOREM 1. If a DFG contains more than one sequence node after vertical fusion, it cannot be evaluated correctly in a single pass.
Reason: for a single input stream, each sequence node requires one traversal
Single-pass Analysis (continued)
THEOREM 2. For any two atomic nodes n1 and n2, if (1) n1 and n2 are aggregate dependent on a sequence node and (2) there is a path between them, the query may not be evaluated in a single pass.
Reason: a progressive blocking operator is referenced by another progressive blocking operator
Example: count(pixel) where /x > 0.01*sum(/pixel/x)
Single-pass Analysis (continued)
THEOREM 3. If there is a cycle in a DFG, the corresponding query may not be evaluated correctly using a single pass.
Reason: a progressive blocking operator is referenced by a pipeline operator
[Figure: example DFG with nodes S1, S2, i, j, v, b]
Single-pass Analysis
• Check the conditions corresponding to Theorems 1, 2, and 3
  - Stop further processing if any condition is true
• Completeness of the analysis
  - If a query without a blocking operator passes the test, it can be evaluated in a single pass
THEOREM 4. If the results of a progressive blocking operator are referenced by a pipeline operator or a progressive blocking operator, then for its DFG at least one of the three conditions holds.
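The three checks can be sketched on a toy graph in Python. This is a hedged illustration under stated assumptions: the `DFG` class, its method names, and the way the example queries are encoded as edges are all hypothetical, and the graph is assumed to have already been through vertical fusion. An edge src -> dst means that dst uses src, as in the DFG definition.

```python
from collections import defaultdict
from typing import Dict, List, Set


class DFG:
    """Toy data-flow graph; nodes must be added before their edges."""

    def __init__(self) -> None:
        self.kind: Dict[str, str] = {}  # node -> "sequence" | "atomic"
        self.succ: Dict[str, Set[str]] = defaultdict(set)
        self.aggregates: Set[str] = set()  # atomic nodes aggregating a sequence node

    def add_node(self, name: str, kind: str) -> None:
        self.kind[name] = kind

    def add_edge(self, src: str, dst: str, dep: str) -> None:
        """dep is "flow" or "aggregate"."""
        self.succ[src].add(dst)
        if dep == "aggregate" and self.kind[src] == "sequence" and self.kind[dst] == "atomic":
            self.aggregates.add(dst)

    def reachable(self, start: str) -> Set[str]:
        """Nodes reachable from start through one or more edges."""
        seen: Set[str] = set()
        stack = [start]
        while stack:
            for nxt in self.succ.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


def failed_conditions(g: DFG) -> List[int]:
    """Numbers of the theorems (1, 2, 3) whose condition holds for g."""
    failed = []
    if sum(1 for k in g.kind.values() if k == "sequence") > 1:
        failed.append(1)  # more than one sequence node
    aggs = sorted(g.aggregates)
    if any(b in g.reachable(a) for a in aggs for b in aggs if a != b):
        failed.append(2)  # path between two aggregate-dependent atomic nodes
    if any(n in g.reachable(n) for n in g.kind):
        failed.append(3)  # cycle
    return failed


# count(pixel) where /x > 0.01*sum(/pixel/x): the count depends on the sum.
g = DFG()
g.add_node("S", "sequence")
g.add_node("v_sum", "atomic")
g.add_node("v_count", "atomic")
g.add_edge("S", "v_sum", "aggregate")
g.add_edge("S", "v_count", "aggregate")
g.add_edge("v_sum", "v_count", "flow")
print(failed_conditions(g))  # [2]
```

An empty result means none of the three conditions holds, which by the completeness claim indicates that a query without blocking operators can be evaluated in a single pass.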
A Review of the High-level Transformation and Analysis
[Figure: an example DFG (nodes S1, S2, v1, i, b) shown after each transformation step, with sequence nodes merged into S]
Cannot be evaluated in a single pass!
Code Generation
• Using a SAX XML stream parser
  - The XML document is parsed as a stream of events
  - Event-driven: need to generate code to handle each event
• Using the Java JDK
  - Our compiler generates Java source code
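The event-driven style can be illustrated with Python's standard `xml.sax` module. The compiler described here emits Java, so this is only an analogous hand-written sketch, not generated code. It assumes a document shaped like stream/pixel with x and y children (matching the horizontal fusion example) and computes count(pixel[x>0]) and sum(pixel/y) in one pass. The class and function names are hypothetical.

```python
import xml.sax
from typing import List, Optional, Tuple


class PixelStats(xml.sax.ContentHandler):
    """Updates two aggregates as SAX events arrive; no pixel is buffered."""

    def __init__(self) -> None:
        super().__init__()
        self.count_x_positive = 0
        self.sum_y = 0.0
        self._field: Optional[str] = None  # element whose text is being read
        self._text: List[str] = []

    def startElement(self, name, attrs):
        if name in ("x", "y"):
            self._field = name
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect until the end tag.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            value = float("".join(self._text))
            if name == "x" and value > 0:
                self.count_x_positive += 1
            elif name == "y":
                self.sum_y += value
            self._field = None


def pixel_stats(xml_text: str) -> Tuple[int, float]:
    handler = PixelStats()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count_x_positive, handler.sum_y
```

Memory use stays constant in the number of pixels because each value is folded into an accumulator as soon as its end tag is seen.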
Experiment
• Query benchmarks
  - Selected benchmarks from XMARK
  - Satellite, Virtual Microscope, Frequent Item
• Systems compared against
  - Galax
  - Saxon
  - Qizx/Open
Performance: XMARK Benchmark
More than 25% faster on the small dataset
Scales well on very large datasets
Performance: Real Applications
More than one order of magnitude faster on the small dataset
Works well for very large datasets
Summary
• Provide a formal approach for query evaluation on XML streams
  - Query transformation to enable correct execution on streams
  - Formal methods for single-pass analysis
  - Strategies for efficient low-level code generation
  - Experimental results show an advantage over other well-known systems
Roadmap
• Background
  - XML, XQuery
  - Related work
• Stream Processing
• Virtual XML
• Parallelization
• Conclusion
Support High-Level Abstraction
• Understanding the physical details is hard, but necessary for performance
Logical schema: a logical view over the data, for the programmer
Physical schema: low-level details of physical storage, provided to the compiler
System Architecture
[Figure: components: External Schema, XML Mapping Service, logical XML schema, physical XML schema, XQuery Sources, Compiler, C++/C]
High-level and Low-level XQuery
• High-level query:
  - Query based on the logical schema
  - Developed by programmers
• Low-level query:
  - Query based on the physical schema
  - Retrieves data by calling library functions
• The high-level query is transformed into a low-level query by our compiler
  - The user can still modify the low-level query if not satisfied
Mapping to Low-level Query
• A number of getData functions to retrieve the data stream
  - getData($x)
  - getData($x, $y)
• getData functions are written in XQuery
  - Allows analysis and transformation
• Find the optimal library function to call
unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := getData($i, $j)
  return
    <pixel>
      <latitude> {$i} </latitude>
      <longitude> {$j} </longitude>
      <sum> {accumulate($p)} </sum>
    </pixel>
)
Compiler Techniques
• Insert getData functions
  - Compatibility: the output should be a superset of the original data stream
  - Performance: want the smallest superset
• Query rewriting based on relational algebra
  - Reduce to canonical forms
  - Compare canonical forms
Roadmap
• Background
  - XML, XQuery
  - Related work
• Stream Processing
• Virtual XML
• Parallelization
• Conclusion
Generalized Nested Loop (GNL)
• An intermediate representation that explicitly defines
  - Iterative structures for retrieving data
  - Aggregation operations to be performed on the qualified data
Example:
  for $b in student/score    (index variable: $b; path expression: student/score)
  @t = cis                   (filter expression)
  sum = sum + b              (loop body)
  count = count + 1          (loop body)
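The parts of a GNL named on this slide (path expression, index variable, filter expression, loop body) can be modelled in a few lines of Python. This is a minimal sketch under assumptions: the `GNL` class and its field names are hypothetical, and each score is represented as a dictionary with a `t` attribute and a numeric `value`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass
class GNL:
    path: Callable[[Any], Iterable[Any]]     # path expression: yields index-variable values
    keep: Callable[[Any], bool]              # filter expression
    body: Callable[[Dict[str, Any], Any], None]  # loop body: updates the aggregates

    def run(self, source: Any, state: Dict[str, Any]) -> Dict[str, Any]:
        for item in self.path(source):
            if self.keep(item):
                self.body(state, item)
        return state


def add_score(state: Dict[str, Any], score: Dict[str, Any]) -> None:
    """Loop body from the slide: sum = sum + b; count = count + 1."""
    state["sum"] += score["value"]
    state["count"] += 1


scores = [
    {"t": "cis", "value": 90},
    {"t": "math", "value": 70},
    {"t": "cis", "value": 80},
]
loop = GNL(path=iter, keep=lambda s: s["t"] == "cis", body=add_score)
print(loop.run(scores, {"sum": 0, "count": 0}))  # {'sum': 170, 'count': 2}
```

Keeping the retrieval loop, the filter, and the aggregation as separate pieces is what lets a compiler analyze and transform each of them independently.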