This research explores the benefits of utilizing path summaries in an XML query optimizer for supporting multiple access methods. The study focuses on novel cost-based optimization techniques that leverage path summaries to enhance XML query processing efficiency. The paper investigates Holistic Path Summary Pruning, Access Order Selection, and Stack Algorithms to optimize query performance. By augmenting ToXin trees with system catalog statistics, the study shows how path summaries can improve query processing accuracy and speed. Experimental evaluations and future research directions are also discussed.
Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
University of Toronto
Motivation
• Growing importance of XML query processing
• Plethora of implementations:
  • native XML DBMSs (e.g. Timber, Niagara, BEA/XQRL, Natix, ToX)
  • XQuery systems (e.g. Galax, IPSI-SQ, XSM, MS-XQuery)
  • XPath processors (e.g. XSQ, SPEX, XPush, Xalan, PathStack)
  • publish/subscribe systems (e.g. Y-Filter, IndexFilter, WebFilter, NiagaraCQ)
  • twig query processors (e.g. TwigStack, PRIX, TurboXPath)
• Our contribution:
  • Apply novel cost-based optimization techniques that exploit path summaries to XML query processing
Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
Example XQuery and Pattern Tree
for $x in document("catalog.xml")//item,
    $y in document("parts.xml")//part,
    $z in document("supplier.xml")//supplier
where $x/part_no = $y/part_no
  and $z/supplier_no = $x/supplier_no
  and $z/city = "Toronto"
  and $z/province = "Ontario"
return <result> {$x/part_no} {$x/price} {$y/description} </result>
[Figure: Pattern Tree (PT), or twig query, for the query above]
Example XQuery Processing
[Figure: processing of the query from the previous slide; join labels: $x = $y, $z = $x]
Contributions
• Holistic Path Summary Pruning
• Access Order Selection
• Path Summaries as Catalogs
Outline
• Introduction
• Path Summaries in the Optimizer
• Holistic Path Summary Pruning
  • Experimental Evaluation
• Access Order Selection
  • Experimental Evaluation
• Future Work
ToXin Path Summary
• For each distinct path in the document there is a path in ToXin
  • ToXin is an exact path summary: it reflects the structure of the document [RM01]
• Initially proposed as a back-end: it can answer any pattern query
Example document:
<suppliers>
  <supplier>
    <supplier_no> 1001 </supplier_no>
    <name> Magna </name>
    <city> Toronto </city>
    <province> ON </province>
  </supplier>
  <supplier>
    <supplier_no> 1002 </supplier_no>
    <name> MEC </name>
    <city> Vancouver </city>
    <province> BC </province>
  </supplier>
</suppliers>
[Figure: TT and TI for the example document]
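As a rough illustration of the idea that every distinct path in the document has exactly one path in the summary, the sketch below builds a plain dictionary from label paths to element instances. It is not the ToXin data structure itself; `build_path_summary` and the dictionary layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<suppliers>"
    "<supplier><supplier_no>1001</supplier_no><name>Magna</name>"
    "<city>Toronto</city><province>ON</province></supplier>"
    "<supplier><supplier_no>1002</supplier_no><name>MEC</name>"
    "<city>Vancouver</city><province>BC</province></supplier>"
    "</suppliers>"
)


def build_path_summary(root):
    """Map each distinct root-to-element label path to its element instances.

    Hypothetical helper: the keys play the role of the schema side of the
    summary, the lists play the role of the instance side.
    """
    summary = {}

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        summary.setdefault(path, []).append(elem)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return summary


summary = build_path_summary(ET.fromstring(DOC))
for path, instances in summary.items():
    print(path, len(instances))
```

The eleven elements of the example document collapse into six summary paths, which is why evaluating a pattern on the summary is cheap compared with scanning the document.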
Augmented ToXin Trees
• System catalog: schema + data statistics
• DTD and XML Schema are used for validation; they do not describe the actual schema of the instances
• ToXin is an exact path summary → actual schema
• ToXin augmented with statistics → system catalog
• ToXop statistics:
  • NCARD: number of instances of an element
  • ICARD: number of distinct values of an element
  • Fan-out: average number of sub-element instances for each sub-element
• Augmented ToXin Tree:
  • existing schema (TT) + statistics + node instances (TI)
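The statistics listed above can be gathered in the same single pass that builds the summary. The sketch below is one possible reading, not the ToXop implementation: `collect_statistics` is a hypothetical helper, and fan-out is assumed to mean NCARD of a path divided by NCARD of its parent path. The three-supplier document is made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = (
    "<suppliers>"
    "<supplier><supplier_no>1001</supplier_no><city>Toronto</city></supplier>"
    "<supplier><supplier_no>1002</supplier_no><city>Vancouver</city></supplier>"
    "<supplier><supplier_no>1003</supplier_no><city>Toronto</city></supplier>"
    "</suppliers>"
)


def collect_statistics(root):
    """Return {path: {"NCARD", "ICARD", "fan_out"}} for every distinct path."""
    values = defaultdict(list)  # path -> text value of each instance
    parent_of = {}              # path -> parent path

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        values[path].append((elem.text or "").strip())
        if prefix:
            parent_of[path] = prefix
        for child in elem:
            visit(child, path)

    visit(root, "")

    stats = {}
    for path, vals in values.items():
        ncard = len(vals)                    # number of instances
        icard = len({v for v in vals if v})  # number of distinct non-empty values
        parent = parent_of.get(path)
        # Assumed definition: average number of instances per parent instance.
        fan_out = ncard / len(values[parent]) if parent else None
        stats[path] = {"NCARD": ncard, "ICARD": icard, "fan_out": fan_out}
    return stats


stats = collect_statistics(ET.fromstring(DOC))
print(stats["/suppliers/supplier/city"])
```

For the `city` path this gives three instances with two distinct values, so an equality predicate on `city` can be estimated to keep about NCARD / ICARD instances under a uniform-distribution assumption.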
Holistic Path Summary Pruning
• All path-summary-based query processors perform some path summary pruning specific to the processor
• Idea: separate path pruning from the processor and the encoding
• Holistic Path Summary Pruning (HPSP):
  • evaluate the pattern tree on the actual schema (TT tree)
  • compute the twig query using an appropriate algorithm for the particular element encoding
• TwigStackScan is one possible HPSP-based access method
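The first HPSP step, evaluating the pattern tree on the schema tree, can be sketched as a small recursive matcher. Everything here is illustrative: `TTNode`, `PatternNode`, and `prune` are hypothetical names, the schema fragment is a hand-made XMark-like example, and the result is keyed by pattern label on the assumption that labels in the pattern are distinct.

```python
from dataclasses import dataclass, field


@dataclass
class TTNode:
    node_id: int
    label: str
    children: list = field(default_factory=list)


@dataclass
class PatternNode:
    label: str
    axis: str = "child"  # "child" (/) or "descendant" (//), relative to the parent
    children: list = field(default_factory=list)


def descendants(tt):
    for child in tt.children:
        yield child
        yield from descendants(child)


def candidates(tt, axis):
    return tt.children if axis == "child" else descendants(tt)


def matches(tt, pat):
    """True if the pattern subtree rooted at pat can be embedded at tt."""
    if tt.label != pat.label:
        return False
    return all(
        any(matches(c, pc) for c in candidates(tt, pc.axis)) for pc in pat.children
    )


def collect(tt, pat, result):
    """Record the schema nodes taking part in an embedding of pat at tt."""
    result.setdefault(pat.label, set()).add(tt.node_id)
    for pc in pat.children:
        for c in candidates(tt, pc.axis):
            if matches(c, pc):
                collect(c, pc, result)


def prune(tt_root, pattern):
    """Return {pattern label: ids of schema nodes that can contribute matches}."""
    result = {}
    starts = [tt_root, *descendants(tt_root)] if pattern.axis == "descendant" else [tt_root]
    for tt in starts:
        if matches(tt, pattern):
            collect(tt, pattern, result)
    return result


# Hand-made schema fragment: @id and name occur under both person and item.
tt = TTNode(1, "site", [
    TTNode(2, "people", [TTNode(3, "person", [TTNode(4, "@id"), TTNode(5, "name")])]),
    TTNode(6, "regions", [TTNode(7, "samerica", [
        TTNode(8, "item", [TTNode(9, "@id"), TTNode(10, "name")])])]),
])

# Pattern for //site/people/person[@id]/name
pattern = PatternNode("site", "descendant", [
    PatternNode("people", children=[
        PatternNode("person", children=[PatternNode("@id"), PatternNode("name")])])])

print(prune(tt, pattern))
```

Only schema nodes 4 and 5 survive for `@id` and `name`; the occurrences under `item` are pruned before any element stream is read, independently of which twig algorithm runs afterwards.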
Stack Algorithms
• Stack algorithms: PathStack, TwigStack, TwigStackXB [BSK02]
• Use region algebra encoding:
  • T_element: [DocID, Term, StartPos, EndPos, LevelNum] (elements)
  • T_text: [DocID, Term, TextValue, StartPos, LevelNum] (string values)
• Build a stream (denoted T) for all elements having the same label, e.g. T_author contains all author elements of the document
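The region encoding makes structural relationships checkable by comparing positions only. The sketch below assigns positions with a simple depth-first counter (a simplification: text values are not given positions here) and groups elements into per-label streams; `Region`, `encode`, `is_ancestor`, and `is_parent` are names chosen for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, namedtuple

Region = namedtuple("Region", "doc_id term start_pos end_pos level_num")

DOC = (
    "<suppliers>"
    "<supplier><supplier_no>1001</supplier_no><name>Magna</name>"
    "<city>Toronto</city><province>ON</province></supplier>"
    "<supplier><supplier_no>1002</supplier_no><name>MEC</name>"
    "<city>Vancouver</city><province>BC</province></supplier>"
    "</suppliers>"
)


def encode(root, doc_id=1):
    """Return {label: stream of Region tuples sorted by start position}."""
    streams = defaultdict(list)
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1  # position of the opening tag
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1  # position of the closing tag
        streams[elem.tag].append(Region(doc_id, elem.tag, start, counter, level))

    visit(root, 1)
    for stream in streams.values():
        stream.sort(key=lambda r: r.start_pos)
    return streams


def is_ancestor(a, d):
    """a contains d if d's region lies strictly inside a's region."""
    return a.doc_id == d.doc_id and a.start_pos < d.start_pos and d.end_pos < a.end_pos


def is_parent(a, d):
    return is_ancestor(a, d) and d.level_num == a.level_num + 1


streams = encode(ET.fromstring(DOC))
first_supplier, second_supplier = streams["supplier"]
vancouver = streams["city"][1]
print(is_parent(first_supplier, vancouver), is_parent(second_supplier, vancouver))
```

Because each stream is sorted by start position, a stack-based join can consume the streams in one forward scan while using these containment tests.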
TwigStackScan Access Method
• Extended region algebra encoding:
  • T_element: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID] (elements)
  • T_text: [DocID, Term, TextValue, StartPos, LevelNum, TTnodeID] (string values)
• TwigStackScan = HPSP + TwigStack
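With a TTnodeID stored in every tuple, the schema nodes selected by HPSP can be used to shrink each stream before the twig join sees it. The sketch below shows only that filtering step, under the assumption that pruning has already produced the set of allowed schema node ids; the TwigStack join itself is not shown, and `prune_stream` and the sample tuples are hypothetical.

```python
from collections import namedtuple

Region = namedtuple("Region", "doc_id term start_pos end_pos level_num tt_node_id")


def prune_stream(stream, allowed_tt_nodes):
    """Keep only the tuples whose schema node survived path summary pruning."""
    return [r for r in stream if r.tt_node_id in allowed_tt_nodes]


# Made-up T_name stream: schema node 5 stands for /site/people/person/name,
# schema node 10 for /site/regions/samerica/item/name.
t_name = [
    Region(1, "name", 5, 6, 4, 5),
    Region(1, "name", 15, 16, 5, 10),
    Region(1, "name", 25, 26, 4, 5),
]

pruned = prune_stream(t_name, {5})
print(len(t_name), "->", len(pruned))
```

The join then runs on the shorter streams, which is where a query such as Q7 can gain when the same label occurs under many different paths.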
Experimental Datasets
• DBLP, SWISSPROT: University of Washington XML Repository
  • Both are large (millions of nodes) and shallow
  • DBLP: regular in structure (5 structures that repeat)
  • SWISSPROT: irregular in structure (many one-of-a-kind structures)
• XMARK:
  • simulates an on-line auction site
  • xmlgen scale factors from 0.01 (0.6 MB) to 2.8 (165.9 MB)
  • removed the content of 'Text' elements → 30% reduction in size
TwigStackScan Scale-Up
[Figures: Q7 scale-up with (XMARK) file size; TwigStackScan speedup with (XMARK) file size]
• Q7: //site/people/person[@id = "person0"]/name (1 twig match)
  • @id occurs in person, category, item, open_auction
• Q8: //site/people/person/name (38,760 twig matches)
• When applicable, TwigStackScan yields improvements of one order of magnitude
TwigStackScan vs. TwigStack
• High-selectivity twig queries (Q1, Q4, Q6, Q7): speedup 0.97 to 5.87
• Low-selectivity twig queries (Q8, Q11): speedup 1.43 to 1.78
• Scattered twig matches (Q2, Q3, Q5, Q9) and grouped twig matches (Q10): speedup 8.96 to 75.38
Order Selection in Pattern Trees
• Order Selection: the order in which to evaluate the branches
• Direction Selection: decide how to evaluate a branch, top-down or bottom-up
• Choosing between top-down and bottom-up is computationally very expensive: in the LORE optimizer [McW99], a document with 7 levels yields millions of possible plans
ToXinScan Access Method
• Relational optimizers compute a GOOD plan, not THE BEST plan
• Similarly, we use data statistics and heuristics to compute a good plan
• The access-order selection strategy:
  • Sort the children according to parent selectivity
  • Evaluate the path with the highest selectivity using a bottom-up evaluation
  • Evaluate all other paths, in selectivity order, using a top-down evaluation
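The three-step strategy above can be sketched as a sort followed by a direction assignment. The selectivity estimate used here (NCARD / ICARD for a branch with an equality predicate, NCARD otherwise) is a common uniform-distribution assumption and not necessarily the estimate the authors use; `Branch`, `plan_access_order`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    path: str
    ncard: int                 # number of instances of the branch's element
    icard: int                 # number of distinct values of the element
    has_value_predicate: bool  # True if the query compares the value to a constant

    def estimated_matches(self):
        # Assumed estimate: an equality predicate keeps about NCARD / ICARD
        # instances; without a predicate every instance qualifies.
        if self.has_value_predicate and self.icard:
            return self.ncard / self.icard
        return float(self.ncard)


def plan_access_order(branches):
    """Most selective branch first and bottom-up, the rest top-down in order."""
    ordered = sorted(branches, key=Branch.estimated_matches)
    return [
        (b.path, "bottom-up" if i == 0 else "top-down")
        for i, b in enumerate(ordered)
    ]


# Branches of the $z (supplier) pattern from the example query, with made-up statistics.
branches = [
    Branch("supplier_no", ncard=1000, icard=1000, has_value_predicate=False),
    Branch("province", ncard=1000, icard=10, has_value_predicate=True),
    Branch("city", ncard=1000, icard=50, has_value_predicate=True),
]

plan = plan_access_order(branches)
for path, direction in plan:
    print(path, direction)
```

Starting bottom-up from the branch expected to return the fewest instances keeps the intermediate result small, so the remaining branches only need to be checked top-down for a handful of candidate parents.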
ToXinScan Scale-Up
[Figure: speedup of ToXinScan vs. TwigStack with (XMARK) file size]
• Q8: //site/people/person[@id = "person0"]/name (1 twig match)
• Q9: //site/people/person/name (38,760 twig matches)
• Q10: //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name (8 twig matches)
• Improvements of two orders of magnitude over TwigStack
ToXinScan vs. TwigStack
• High-selectivity twig queries (Q3, Q4, Q5, Q8): speedup 2.16 to 9.32
• Grouped twig matches (Q11, Q12, Q13): speedup 12.97 to 28.80
• Low-selectivity twig queries (Q2, Q9, Q10, Q14) and scattered twig matches (Q1, Q6, Q7): speedup 48.31 to 122.44
ToXinScan vs. Heavier Indexes
• Pattern indexes (such as PRIX [RMo04], ViST [WPF+03]) are the best twig-query processors
• Indexes are expensive to build (three passes over the document) and require extensive space
  • ViST uses O(SH) space, where S is the number of sequences and H is the height of the tree
• Indexes outperform TwigStack by two orders of magnitude
• Good news:
  • using path summaries and the presented optimization strategy, we achieve the same performance improvements as node indexes
  • path summaries are inexpensive to build (one pass over the document)
Future Work
• Generalize based on the strategy derived from the TwigStackScan access method
• Holistic Path Summary Pruning (HPSP) can be used in conjunction with any twig query evaluation method
• Can be used with path summaries other than ToXin
• ToXinScan
• Add a generalized cost model for access methods
• Enhance the XML statistics used
• Propose benchmarks for XML access methods
Thank you for your attention! Attila Barta Mariano P. Consens Alberto O. Mendelzon { atibarta, consens, mendel }@cs.toronto.edu University of Toronto
ToXinScan vs. PRIX
[RMo04] Praveen Rao, Bongki Moon, “PRIX: Indexing and Querying XML Using Prufer Sequences”, Proceedings of the 2004 International Conference on Data Engineering, Boston, MA, 2004
[RMo03] Praveen Rao, Bongki Moon, “PRIX: Indexing and Querying XML Using Prufer Sequences”, Technical Report TR-03-06, Univ. of Arizona, Tucson, 2003
• Good news: node indexes (e.g. PRIX) are computationally expensive to build (three passes over the document), while path summaries are inexpensive to build (one pass over the document)