VAMANA: A High Performance, Scalable and Cost Driven XPath Engine • Master Thesis Presentation, 22nd April 2004 • Venkatesh Raghavan • Advisor: Prof. Elke Rundensteiner • Reader: Prof. Micha Hofri
Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions
Motivation Many applications are migrating to native XML databases. • Need for an XML query engine • High Performance • Support queries that emphasize the structural semantics of XML query languages. • Efficient query engine and database management system tailored for XML data. • Scalable • Support large XML documents. • Support all 13 XPath axes. • Cost Based • Schema-independent cost model provides dynamically calculated heuristics. • Intelligent cost-based transformations further improve performance.
Outline • Motivation • Related Work • Relational Solutions • DOM Solutions • Current Index Based Solutions • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions
Relational Solution • Mature data management tools • Query processing • Crash recovery • Concurrency control • Shredding XML documents • XPERANTO [6] and Rainbow [7] • Many mapping algorithms • From XML Schema to Relations: A Cost-based Approach to XML Storage [24] • Workload-based query mapping algorithm
Flip-side • Data model mismatch • Tables vs. XML semi-structured data model • Semantic mismatch • SQL vs. XQuery • Data fragmentation • Increases query execution cost (more joins) • High update cost • Overhead • Relational mapping adds overhead • Managing data • Handling order
DOM Solution • The W3C Document Object Model (DOM) is a language-independent API for accessing the various parts of an XML document. • Traditional top-down tree traversal. • Disadvantages • Very main-memory intensive. • On average 4-5 times the file size [11]. • Most DOM-based engines do not support all XPath axes. • Even if they do: imagine Pixar rendering “Finding Nemo” on a Windows machine. • Requires complex recursive traversal even for a few XPath axes.
Contd. • Galax [9] (developed by Bell Labs and AT&T Labs) • Does not support all XPath axes. • Performs very poorly against large XML documents. • Logical-level optimization that is not cost driven. • Jaxen [22] • Java API that supports different XML APIs (JDOM, DOM, ElectricXML, dom4j). • Can handle documents with file sizes of 10 MB • On an Intel Celeron PC with 512 MB of RAM • IPSI, Pathan, etc.
Current Index Solutions • Apache Xindice [13] • User-defined pattern indexes • Capable of indexing small to medium-size documents (< 5 MB) • Natix [23] • The XML data tree is partitioned into small sub-trees, and each sub-tree is stored in a data page. • ToX [25] (University of Toronto) • The ToX storage engine stores XML documents in either a relational database or an object-oriented database.
Contd. • TIMBER [14] (University of Michigan, University of British Columbia and AT&T Labs) • TAX algebra • Pattern trees • Query execution • Structural joins • Query optimization • Estimates the costs of all promising sets of evaluation plans. • The problem is the exponential growth in possibilities for complex queries. • They claim to select only from an elite set of possible evaluation plans. • Cost estimation • Primary histograms • Expensive to maintain under frequent updates • Counting twigs: frequently occurring correlated sub-path queries.
Outline • Motivation • Related Work • Background for VAMANA Approach • Multi-Axis Storage Structure • Running Example • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions
[Figure: VAMANA architecture. Labels: XPath Expression, XPath Compiler, Default Query Plan, Optimizer (Transformation Library, Cost Estimator), Optimized Query Plan, Query Execution Engine, Resultant Tuples, Axis or Value Based Queries, MASS Storage Structure, Loader, XML Documents.]
Multi-Axis Storage Structure (MASS) • Efficient storage and access structure for XML documents. • XPath axes, node tests, range position predicates. • Provides statistics for costing. • Number of tuples per page. • Count. • Fast Lexicographical keys • Four clusters + value-based index • Clustering of axes • CL1 and CL3: self, child, following-sibling, preceding-sibling, attribute, namespace • CL2 and CL4: self, parent, ancestor, ancestor-or-self, descendant, descendant-or-self, preceding, following
Running Examples E.g. 1: descendant::name/parent::*/self::person/address <person id="person0"> <name>Krishna Merle</name> <emailaddress>mailto:Merle@mitre.org</emailaddress> … E.g. 2: //province[text() = “Vermont” ]/ancestor::person • <person id="person41"> • <name>Muneo Yemenis</name> • ... • <phone>+0 (807) 6372999</phone> • … • <province>Vermont</province> • …
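To make the two running examples concrete, here is a minimal sketch that evaluates them by plain tree traversal in Python. It is not how VAMANA evaluates them; it only shows what each query should return. The miniature document, the nesting of `province` under `address`, and the helper names `example1` and `example2` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document modelled on the slide's fragments.
DOC = """
<people>
  <person id="person0">
    <name>Krishna Merle</name>
    <address><province>Ohio</province></address>
  </person>
  <person id="person41">
    <name>Muneo Yemenis</name>
    <address><province>Vermont</province></address>
  </person>
</people>
"""

root = ET.fromstring(DOC)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def example1(root):
    """descendant::name/parent::*/self::person/address, starting at root."""
    out = []
    for name in root.iter("name"):                    # descendant::name
        parent = parent_of.get(name)                  # parent::*
        if parent is not None and parent.tag == "person":  # self::person
            for address in parent.findall("address"):      # child::address
                if address not in out:
                    out.append(address)
    return out


def example2(root):
    """//province[text() = "Vermont"]/ancestor::person"""
    out = []
    for province in root.iter("province"):
        if province.text == "Vermont":
            node = parent_of.get(province)
            while node is not None:                   # walk the ancestor axis
                if node.tag == "person" and node not in out:
                    out.append(node)
                node = parent_of.get(node)
    return out
```

Both functions walk the whole tree, which is exactly the cost that an index-based engine tries to avoid.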
Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • VAMANA Approach • Operators • Context Node • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions
VAMANA Approach • Data-flow style of querying. • Data flows • Control flows • Pipelined-iterative fashion • Avoids temporary copies of intermediate results whenever possible. • Facilitates the reduction of I/O operations • All tuples for a particular context node are clustered together. • Sequential traversal over node sets. • Hence minimal I/O and key comparisons. • Used in most commercial relational database systems. • Structural joins • Best case: when the maximum number of tuples take part in join pairs. • Worst case: when few tuples form join pairs.
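Operators • An operator is written as op with an id and a cond (e.g. Φ2 child::phone, β3 EQ), where • op is the symbol of the operator type. • cond represents a set of conditions applied by the operator. • id is an identifier that uniquely identifies the operator in a given plan P. • Operator types • Root Operator R • Step Operator Φ • Literal Operator L • Node Base Exist Operator ξ • Binary Operator β • Join Operator J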
Context Node • “..context node is defined as the current node being processed…” – XPath (1.0 & 2.0) • We extend the idea: the context node of any given VAMANA operator (op, id, cond) uniquely defines the position of an XML node in the index structure. The position is obtained from the structural path information encoded in the context node.
Concepts [Figure: query plan for //a/b[/c]. Labels: Context Side, Predicate Side, Context Child, Predicate Child; operators: /b, //a, EXIST, /c.]
Dynamic Context Set [Figure: operators root, /b and //a over the MASS index; sample index entries: a, a, a, a, a, b.]
XPath Compiler • Q1: descendant::name/parent::*/self::person/address • Q2: //province[text() = “Vermont”]/ancestor::person [Figure: a. Query plan for Q1, operator labels: R1, Φ2 child::phone, Φ3 self::person, Φ4 parent::*, Φ5 descendant::name. b. Query plan for Q2, operator labels: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’.]
Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Conclusions
Execution States • An operator can be in one of three states: • INITIAL • Has not yet started fetching tuples. • FETCHING • When the operator has not yet exhausted all nodes from MASS that meet its condition(s). • When the operator is waiting for its context child to return tuples. • When the operator is waiting for the predicate condition to process the nodes. • OUT_OF_NODE • When the operator has exhausted all the nodes from MASS that satisfy the condition(s) specified by the operator. • When the context child has no further tuples to return to the operator.
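The three states can be illustrated with a small pull-based pipeline. This is a sketch under stated assumptions, not VAMANA's actual operator code: `ScanOperator` and `FilterOperator` are hypothetical classes, and a plain Python list stands in for the MASS index.

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    FETCHING = auto()
    OUT_OF_NODE = auto()


class ScanOperator:
    """Hypothetical leaf operator; a list stands in for the MASS index."""

    def __init__(self, nodes):
        self._nodes = nodes
        self._it = None
        self.state = State.INITIAL

    def next(self):
        if self.state is State.OUT_OF_NODE:
            return None
        if self.state is State.INITIAL:
            # First request: open the scan and start fetching.
            self._it = iter(self._nodes)
            self.state = State.FETCHING
        node = next(self._it, None)
        if node is None:
            self.state = State.OUT_OF_NODE
        return node


class FilterOperator:
    """Hypothetical non-leaf operator that pulls from its context child."""

    def __init__(self, context_child, predicate):
        self.context_child = context_child
        self.predicate = predicate
        self.state = State.INITIAL

    def next(self):
        if self.state is State.OUT_OF_NODE:
            return None
        self.state = State.FETCHING
        while True:
            node = self.context_child.next()
            if node is None:
                # The context child has nothing further to return.
                self.state = State.OUT_OF_NODE
                return None
            if self.predicate(node):
                return node


scan = ScanOperator(["a.d.y.a", "a.d.y.b"])
top = FilterOperator(scan, lambda key: key.endswith("b"))
```

Each call to `top.next()` pulls one tuple at a time through the pipeline, so no intermediate result set is materialized.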
Step 1: Setting the context for the leaf operators on the context path. Query: //province[text()=“Vermont”]/ancestor::person [Figure: query plan with R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’; operator states shown as INITIAL, FETCHING, OUT_OF_NODE.]
Step 2: Ask the root node for tuples. [Figure: R1, Φ2 ancestor::person, Φ6 //::province; Φ6 fetches node a.d.y.a.]
[Figure: predicate evaluation for a.d.y.a. Φ5 child::text fetches a.d.y.a.a (“Massachusetts”); β3 EQ compares it with L4 ‘Vermont’.]
[Figure: a.d.y.a does not satisfy β3 EQ, so Φ6 //::province does not pass it on.]
[Figure: Φ6 //::province fetches the next node, a.d.y.b. Φ5 child::text fetches a.d.y.b.a (“Vermont”), which satisfies β3 EQ against L4 ‘Vermont’.]
[Figure: Φ6 //::province passes a.d.y.b to Φ2 ancestor::person, which returns a.d.y to R1.]
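The walk-through relies on keys such as a.d.y.b that encode a node's structural path, so an ancestor can be found from the key alone. The sketch below assumes keys are dot-separated strings exactly as drawn on the slides; the real MASS key encoding is not given here, and the `TAGS` table is a hypothetical stand-in for an index lookup.

```python
def parent_key(key):
    """Key of the parent node, or None when key is the root key."""
    head, sep, _ = key.rpartition(".")
    return head if sep else None


def ancestor_keys(key):
    """All ancestor keys, nearest ancestor first."""
    out = []
    current = parent_key(key)
    while current is not None:
        out.append(current)
        current = parent_key(current)
    return out


def is_descendant(key, of):
    """True if key lies strictly below the node identified by 'of'."""
    return key.startswith(of + ".")


# Hypothetical key -> element name table standing in for an index lookup.
TAGS = {"a": "site", "a.d": "people", "a.d.y": "person", "a.d.y.b": "province"}


def ancestors_named(key, tag):
    """ancestor::tag evaluated purely on keys."""
    return [k for k in ancestor_keys(key) if TAGS.get(k) == tag]
```

With these assumptions, `ancestors_named("a.d.y.b", "person")` gives `["a.d.y"]`, matching the last step of the walk-through.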
Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Query Clean-Up • Cost Model • Transformation • Experimental Evaluation • Conclusions
Optimization [Figure: optimization pipeline. Stages: Clean Up, Cost Estimator, Transformation. Plans: Default Query Plan (P), Query Plan (P′), Query Plan (P″) + Heuristics (L(P″)), Transformed Query Plan (P_t), Optimal Query Plan (P_opt).]
Clean Up [Figure: a. Default Query Plan, operator labels: Φ2 child::phone, Φ3 self::person, Φ4 parent::*, Φ5 descendant::name. b. Cleaned Query Plan, operator labels: Φ2 child::phone, Φ3 self::person, Φ5 descendant::name.]
VAMANA Cost Model • The cost is usually calculated with respect to the root of the XML document or a node specified by the user. • Query costs are obtained from the actual data rather than a data dictionary and thus are always up to date. • Our cost model does not suffer the overhead of parsing the entire document, as is the case for histogram-based costing such as StatiX [14]. • Starting from the leaf operators, we propagate the cost upwards towards the root operator.
VAMANA Cost Model • COUNT(op) • This heuristic is calculated only for step operators (Φ id axis::nodetest). It is the number of XML nodes in the underlying index structure that satisfy the node test of the step operator axis::nodetest. • MASS provides an API to efficiently gather the count of a particular node test in its storage structure. • TC(op) • For a literal operator (L id value), the text count is the number of occurrences of a particular literal value in the index structure.
Contd.. IN(op_i): the maximum number of tuples that the operator op_i will receive in total from its context child. • Case 1: • For a leaf step operator on the context path of the query plan, the total number of tuples received is equal to the number of tuples available in the underlying index structure, i.e., IN(op_i) = COUNT(op_i). • Case 2: • For all non-leaf operators, IN(op_i) = OUT(op_j), where op_j is the context child of op_i. • Case 3: • For all leaf step operators on the predicate path of the query plan, the total number of tuples received is equal to the number of tuples received by its predicate operator.
Contd.. OUT(op_i): the maximum number of tuples that the current operator op_i returns. • Case 1: • A leaf step operator on the context path of the query plan returns all the tuples that occur in the underlying index structure with respect to the context of the leaf operator, i.e., OUT(op_i) = COUNT(op_i). • Case 2: • A literal operator returns the same value every time a request for tuples is received. To facilitate the optimization of literal operators by using a value index, we define its output as OUT(op_i) = TC(op_i). • Case 3: • For binary predicate operators that have value-based equivalence, OUT(op_i) = min(number of tuples from the parent operator, TC of the literal value).
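The IN and OUT cases stated so far can be written down directly. The following is a minimal sketch, assuming the statistics COUNT and TC have already been read from the index; the function names are hypothetical, and the numbers are illustrative values in the style of the running examples.

```python
def leaf_step_cost(count):
    """Leaf step operator on the context path: IN = OUT = COUNT."""
    return count, count


def non_leaf_in(context_child_out):
    """Non-leaf operator: IN equals the OUT of its context child."""
    return context_child_out


def literal_out(text_count):
    """Literal operator: OUT is defined as its text count TC."""
    return text_count


def value_equality_out(tuples_from_parent, text_count):
    """Binary value-equality predicate: OUT = min(parent tuples, TC)."""
    return min(tuples_from_parent, text_count)


# A leaf step such as descendant::name with COUNT = 4825.
name_in, name_out = leaf_step_cost(4825)
# Its parent operator on the context path receives everything it returns.
parent_in = non_leaf_in(name_out)
# An equality predicate fed 1256 tuples, against a literal with TC = 13.
eq_out = value_equality_out(1256, literal_out(13))
```

Costs are computed bottom-up, so each operator only needs the OUT value of the operator below it.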
Contd.. • Non-leaf operators • Context path • Predicate path • Leaf operators • predicate path
Contd.. “Cost determined with respect to context node provider”
Example 1: • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 1256 • Φ3 parent::person: COUNT = 2550, IN = 4825
Example 2: • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825 • Φ6 descendant::name: COUNT = 4825, OUT = 4825
Heuristics • Ratio = IN/OUT • The higher the ratio, the better the selectivity. • δ_i = scale_0..1(IN/OUT) • Inverted index <scaled(IN/OUT), op_i>.
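The ranking by IN/OUT can be sketched as follows. The slide does not say how the ratio is scaled into 0..1, so min-max normalization is an assumption here, as are the function name `selectivity_ranking` and the dictionary layout. The sketch also assumes every OUT is greater than zero.

```python
def selectivity_ranking(stats):
    """Return (scaled ratio, operator) pairs, most selective first.

    stats maps an operator label to its IN and OUT estimates.
    Min-max scaling into 0..1 is an assumption, not the source's formula.
    """
    ratios = {op: s["IN"] / s["OUT"] for op, s in stats.items()}
    lowest, highest = min(ratios.values()), max(ratios.values())
    span = highest - lowest
    scaled = {
        op: (ratio - lowest) / span if span else 0.0
        for op, ratio in ratios.items()
    }
    # The sorted list plays the role of the inverted index <ratio, op>.
    return sorted(((value, op) for op, value in scaled.items()), reverse=True)


ranking = selectivity_ranking({
    "step2 child::address": {"IN": 4825, "OUT": 1256},
    "step3 parent::person": {"IN": 4825, "OUT": 4825},
    "step5 descendant::name": {"IN": 4825, "OUT": 4825},
})
```

With these figures the child::address step has the highest ratio (4825/1256, about 3.8) and therefore heads the ranking.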
Transformation Library • XPath equivalence rules [5] extended for the VAMANA physical algebra. • Rule 1: /descendant::n/parent::m/.. ≡ //m[child::n]/.. • Rule 2: /descendant-or-self::n/child::m/.. ≡ //m[parent::n]/.. • Rule 3: p/following-sibling::n/parent::m ≡ p[following-sibling::n]/parent::m • Rule 4: /child::m/preceding-sibling::n ≡ descendant::n[following-sibling::n] • Binary predicate • Value-based equivalence. • Value-index optimization
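As an illustration of how one such rule might be applied mechanically, here is a sketch of Rule 1 over a list of location steps. The `Step` type, `apply_rule1` and `to_xpath` are hypothetical names, `//m` is modelled as `descendant-or-self::m` from the root, and the rule only fires when neither step carries predicates, to stay clear of positional predicates.

```python
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    axis: str
    test: str
    predicates: Tuple[str, ...] = ()


def apply_rule1(steps: List[Step]) -> List[Step]:
    """Rewrite /descendant::n/parent::m/.. into //m[child::n]/.."""
    if len(steps) >= 2:
        first, second = steps[0], steps[1]
        if (first.axis == "descendant" and second.axis == "parent"
                and not first.predicates and not second.predicates):
            merged = Step("descendant-or-self", second.test,
                          ("child::" + first.test,))
            return [merged] + list(steps[2:])
    # Rule does not match: return the plan unchanged.
    return list(steps)


def to_xpath(steps: List[Step]) -> str:
    """Render the steps as an unabbreviated absolute path."""
    return "/" + "/".join(
        f"{s.axis}::{s.test}" + "".join(f"[{p}]" for p in s.predicates)
        for s in steps
    )


q1 = [Step("descendant", "name"), Step("parent", "person"),
      Step("child", "address")]
rewritten = to_xpath(apply_rule1(q1))
```

The rewritten path keeps person as the context step and turns the name step into an existence predicate.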
[Figure: Q1 query plan annotated with costs. R1 • Φ2 child::address: COUNT = 1256, IN = 4825, OUT = 1256 • Φ3 parent::person: COUNT = 2550, IN = 4825, OUT = 4825 • Φ5 descendant::name: COUNT = 4825, IN = 4825, OUT = 4825.]
[Figure: applying /descendant::n/parent::m/.. ≡ //m[child::n]/.. to Q1. a. Initial Query Plan: R1, Φ2 child::address, Φ3 parent::person, Φ5 descendant::name. b. Transformed Query Plan: R1, Φ2 child::address, Φ3 //::person, ξ6, Φ5 child::name. Each operator is annotated with COUNT, IN and OUT.]
[Figure: applying /descendant-or-self::n/child::m/.. ≡ //m[parent::n]/.. Optimal Query Plan: R1, Φ2 //::address, ξ7, Φ3 parent::person, ξ6, Φ5 child::name. Each operator is annotated with COUNT, IN and OUT.]
Running Example 2: [Figure: Q2 query plan annotated with costs. R1 • Φ2 ancestor::person: COUNT = 2550, IN = 1256, OUT = 1256 • Φ6 //::province: COUNT = 1256, IN = 1256, OUT = 1256 • β3 EQ: IN = 1256, OUT = 13 • L4 ‘Vermont’: TC = 13 • Φ5 child::text: COUNT = 304819, IN = OUT = 304819.]
[Figure: a. Default Query Plan: R1, Φ2 ancestor::person, Φ6 //::province, β3 EQ, Φ5 child::text, L4 ‘Vermont’. b. Transformed Query Plan: R1, Φ2 ancestor::person, Φ6 parent::province, Φ5 value::‘Vermont’.]
Can we produce BAD queries? • No! • Our optimization aims to reduce the number of tuples the parent operator receives. • Hence we try to push the most selective operators downward. • An operator is considered for transformation only if: • It is selective. • There exists an equivalent transformation rule, i.e., its parent is not affected by the transformation. • The number of tuples generated by the operator is reduced. • Whether the transformation process terminates is another question. • To prevent the optimizer from running indefinitely, we forcibly stop the optimization process after a specific number of iterations.
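The guarded loop described above can be sketched generically. This is an assumption-laden illustration rather than VAMANA's optimizer: `optimize` is a hypothetical function, rules are plain callables from plan to plan, and accepting only strictly cheaper candidates is one possible reading of "the number of tuples is reduced". The iteration cap mirrors the forced stop.

```python
def optimize(plan, rules, cost, max_iterations=10):
    """Apply rewrite rules while they lower the cost, up to a fixed cap."""
    best, best_cost = plan, cost(plan)
    for _ in range(max_iterations):
        improved = False
        for rule in rules:
            candidate = rule(best)
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                # Keep a rewrite only if it strictly reduces the estimate.
                best, best_cost, improved = candidate, candidate_cost, True
        if not improved:
            break  # no rule helps any more
    return best


# Toy stand-in: the "plan" is just its tuple estimate, and the single
# rule shaves one tuple off until the estimate reaches 3.
toy_rules = [lambda p: p - 1 if p > 3 else p]
capped = optimize(10, toy_rules, cost=lambda p: p, max_iterations=4)
settled = optimize(10, toy_rules, cost=lambda p: p, max_iterations=100)
```

With a cap of 4 the loop stops at 6 even though further rewrites would help; with a generous cap it stops on its own at 3.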
Outline • Motivation • Related Work • Background for VAMANA Approach • Our Physical Algebra • Query Execution Engine • Query Optimization • Experimental Evaluation • Criteria • Queries • Results • Conclusions
Experimental Evaluation • XMark [17] auction database. • Compared the CPU execution time of the test queries over different XPath engines • Galax [9] • Jaxen [22] • Others • IPSI [10] • Pathan [11] • Xindice [13] • Variable factors • Document size • Factor (100 KB, 1 MB, 5 MB, 10 MB, 20 MB, 30 MB, 40 MB, 50 MB) • Queries