BLAS: An Efficient XPath Processing System • Chen Y., Davidson S., Zheng Y. • Νίκος Λούτας
Outline • Problem being addressed in the paper • Related work • BLAS • Experimental Results • Evaluation
Problem • The number of disk accesses and joins is the primary bottleneck in evaluating complex queries efficiently.
Motivation • Can we improve XPath processing that uses relational technology? • D-labeling • Processes descendant axis traversal using a single join rather than a transitive closure of joins. • Observation: D-labeling processes / and // in the same way, using joins. • XPRESS: queryable compressed XML files • Reverse arithmetic encoding • A label path is encoded as a distinct interval in [0.0, 1.0) • Handling of path expressions: containment relationships among intervals
Goals • Process / (simple path expressions) more efficiently • Reduce the number of disk accesses and joins • Optimize the join operations
Outline • Problem being addressed in the paper • Related work • BLAS • Experimental Results • Evaluation
Related work • XML storage and query processing • Store XML data naively as a file • The whole file needs to be traversed whenever a query is processed → not efficient for large XML data sets • Store XML using a commercial RDBMS • Indexing, query processing capabilities
Related work (cont’d) • XML storage and query processing • An XML document as a graph → generate a tuple for every edge • Simple, general and automatic generation of the XML query to SQL mapping • An XML query may involve many self-joins • Self-joins can be eliminated by inlining the distinct child information into the parent tuple → complex XML query to SQL mapping • Problem: In all the above approaches, we typically need to rely on auxiliary code in a general-purpose programming language together with SQL to express an XML query
Related work (cont’d) • Indexing • Structural indexes → create a structural summary, extracted from the XML document as a directed graph → queries are evaluated by pruning the search space • Path / tree queries • Indexing for branching path queries → restrict the class of queries indexed to achieve performance benefits • Materialized views
Related work (cont’d) • Labeling • D-labeling • Build minimum label size D-labels • Build a B+ tree over D-labels to support tree queries • Effective for translating XQuery to SQL • XPRESS: an XML data compression technique that uses reverse arithmetic encoding to encode each label path as a distinct interval within [0.0, 1.0). It also supports query evaluation over the compressed document using the containment relationship among the intervals.
Outline • Problem being addressed in the paper • Related work • BLAS • Experimental Results • Evaluation
Bi-LAbeling based System (BLAS) • Based on D-labeling and P-labeling • Processes XPath queries that can be represented as trees • Index generator → stores the D-labels, P-labels, and data values of an XML document • Query engine → RDBMS or twig join
BLAS (cont’d) • Query translator • Decomposes an XPath query into a set of suffix path queries • encodes each suffix path query using P-labeling • generates a corresponding SQL query for each suffix path query • composes the SQL subqueries into a complete SQL query plan using D-labeling
Architecture of BLAS • [Figure: architecture diagram] • Index generator: a SAX parser emits XML events to the P-labeling generator, the D-labeling generator, and the data loader • Storage: P-labelings, D-labelings, data values • Query translator: query decomposition of the XPath query into suffix path queries → subquery generator (based on P-labeling) → subquery composition (based on D-labeling, using the ancestor-descendant relationship between the results of the suffix path queries) • Query engine: evaluates the composed query against storage and returns the query result
BLAS: D-labeling • A D-label of an XML node is a triplet <d1, d2, d3>, such that for any two nodes n and m, n ≠ m: • n.d1 ≤ n.d2 (validation) • m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2 (descendant) • m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3 (child) • n and m have no ancestor-descendant relationship if and only if n.d2 < m.d1 or n.d1 > m.d2 (nonoverlap)
BLAS: D-labeling (cont’d) • Where for a node n: • d1: the position of the start tag of n in the XML document • d2: the position of the end tag of n in the XML document • d3: the level of n in the XML tree
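The D-label definition can be illustrated with a minimal Python sketch. It assumes a simple counter that is incremented at every start and end tag as a stand-in for the "position" in the document; the function names (`d_labels`, `is_descendant`, `is_child`, `no_overlap`) and the sample document are hypothetical, not taken from the paper.

```python
import xml.etree.ElementTree as ET

def d_labels(root):
    """Map each element to (start, end, level).

    Assumption: 'position' is a counter bumped at every start and end tag.
    """
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1              # start tag of node
        start = counter
        for child in node:
            visit(child, level + 1)
        counter += 1              # end tag of node
        labels[node] = (start, counter, level)

    visit(root, 1)
    return labels

def is_descendant(n, m):
    """m is a descendant of n: n.d1 < m.d1 and n.d2 > m.d2."""
    return n[0] < m[0] and n[1] > m[1]

def is_child(n, m):
    """m is a child of n: descendant and exactly one level deeper."""
    return is_descendant(n, m) and n[2] + 1 == m[2]

def no_overlap(n, m):
    """n and m have no ancestor-descendant relationship."""
    return n[1] < m[0] or n[0] > m[1]

# Hypothetical sample document.
root = ET.fromstring(
    "<books><book><title/><section><title/></section></book></books>")
labels = d_labels(root)
book = root[0]
title, section = book[0], book[1]
print(labels[root], labels[book], labels[title], labels[section])
```

Every structural test reduces to integer comparisons on the labels, which is what lets a relational engine evaluate them as join predicates.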
BLAS: D-labeling (cont’d) • Descendant axis query //t1//t2 • Retrieve all the nodes reachable by t1 and by t2 → two lists, l1 and l2 • Test for ancestor-descendant relationships between the nodes in l1 and those in l2 (D-join) • Example: //proteinDatabase//refinfo, where the relations pDB and refinfo store the nodes tagged proteinDatabase and refinfo • SELECT pDB.start, pDB.end, refinfo.start, refinfo.end • FROM pDB, refinfo • WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end
D-labeling scheme • The labeling (start, end, level) can be used to detect ancestor-descendant relationships between nodes in a tree. • [Figure: XML tree of a books document annotated with D-labels: books (1, 20000, 1); book (6, 1200, 2); title (10, 80, 3); section (81, 250, 3); nested section (100, 200, 4); further title, figure, and description nodes carry text values such as “The lord of the rings …”, “Locating middle-earth”, “A hall fit for a king”, and “King Theoden's golden hall”]
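The D-join above can be tried out with a small sketch using Python's built-in `sqlite3` module. The table contents are made-up sample rows, and the columns are renamed to `start_pos` and `end_pos` (an assumption of this sketch) because `END` is a reserved SQL keyword; the helper name `d_join` is hypothetical.

```python
import sqlite3

def d_join(ancestors, descendants):
    """Run the D-join for //proteinDatabase//refinfo on in-memory tables.

    ancestors / descendants: lists of (start_pos, end_pos) pairs.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pDB (start_pos INTEGER, end_pos INTEGER)")
    conn.execute("CREATE TABLE refinfo (start_pos INTEGER, end_pos INTEGER)")
    conn.executemany("INSERT INTO pDB VALUES (?, ?)", ancestors)
    conn.executemany("INSERT INTO refinfo VALUES (?, ?)", descendants)
    rows = conn.execute(
        """
        SELECT pDB.start_pos, pDB.end_pos,
               refinfo.start_pos, refinfo.end_pos
        FROM pDB, refinfo
        WHERE pDB.start_pos < refinfo.start_pos
          AND pDB.end_pos > refinfo.end_pos
        ORDER BY refinfo.start_pos
        """
    ).fetchall()
    conn.close()
    return rows

# Made-up labels: one proteinDatabase node, three refinfo nodes,
# the last of which lies outside the proteinDatabase interval.
result = d_join([(1, 100)], [(10, 20), (30, 40), (150, 160)])
print(result)
```

A single theta-join on the interval bounds answers the descendant step, regardless of how many levels separate the two nodes.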
BLAS: P-labeling • Efficiently processes consecutive child axis steps (suffix path queries) • A P-label for a suffix path P is an interval I_P = <p1, p2>, such that for any two suffix path expressions P, Q: • P.p1 ≤ P.p2 (Validation) • P ⊆ Q if and only if interval I_P is contained in I_Q, i.e. Q.p1 ≤ P.p1 and P.p2 ≤ Q.p2 (Containment) • P ∩ Q = ∅ if and only if I_P and I_Q do not overlap, i.e. P.p1 > Q.p2 or P.p2 < Q.p1 (Nonintersection)
BLAS: P-labeling (cont’d) • For an XML node n whose source path SP(n) has the P-label <p1, p2>, the P-label of the node, denoted n.plabel, is the integer p1 • Find all nodes n such that Q.p1 ≤ SP(n).p1 ≤ Q.p2, i.e. evaluate the suffix path query Q by obtaining the set of XML nodes whose P-labels are contained in the P-label of Q • [[Q]] = {n | Q.p1 ≤ n.plabel ≤ Q.p2}
BLAS: Intuition for P-labels • Assign each node a number, and each suffix path an interval, such that: • For any two suffix paths Q1 and Q2, Q1 is contained in Q2 iff Q1’s interval is contained in Q2’s • A node is contained in the suffix path iff its number is contained in the path interval. • Replaces a sequence of joins by a selection.
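The "selection instead of joins" idea can be sketched in Python as a range lookup over node P-labels kept in sorted order, which is roughly what an index range scan would do. The node ids and P-label values below are made-up illustrations, and `contains` and `select_by_plabel` are hypothetical helper names.

```python
from bisect import bisect_left, bisect_right

def contains(q, p):
    """True if interval p lies inside interval q, i.e. path P ⊆ path Q."""
    return q[0] <= p[0] and p[1] <= q[1]

def select_by_plabel(nodes, q):
    """Return ids of nodes n with q.p1 <= n.plabel <= q.p2.

    nodes: list of (plabel, node_id) pairs sorted by plabel.
    """
    keys = [plabel for plabel, _ in nodes]
    lo = bisect_left(keys, q[0])
    hi = bisect_right(keys, q[1])
    return [node_id for _, node_id in nodes[lo:hi]]

# Made-up data: three section nodes and one title node.
nodes = [(31200, "t1"), (42100, "s1"), (42100, "s2"), (42310, "s3")]
any_section = (40000, 49999)        # assumed interval for //section
full_path = (42100, 42109)          # assumed interval for /books/book/section

print(contains(any_section, full_path))
print(select_by_plabel(nodes, any_section))
print(select_by_plabel(nodes, full_path))
```

A suffix path with any number of child steps is answered by one range predicate on a single integer column.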
BLAS: P-labeling Construction • For paths • For XML trees • Given n distinct tags, assign / a ratio r0 and each tag ti a ratio ri; by default ri = 1 / (n + 1) • Define the domain [0, m-1], with m ≥ (n + 1)^h, where h is the length of the longest path • Construct P-labels for suffix paths • Assign // the interval <0, m-1> • Partition the interval in tag order, proportionally to each tag ti’s ratio ri • Allocate <0, p1 - 1> to suffix paths starting with /, and <pi, p(i+1) - 1> to suffix paths starting with //ti • Partition the subinterval of each path //ti again by tags according to their ratios.
BLAS: Constructing P-label for paths • [Figure: nested partitioning of the domain [0, 10^5). Top level: /, //books, //book, //title, //section, ... with boundaries at 0, 10^4, 2*10^4, 3*10^4, 4*10^4, 5*10^4, ..., 10^5. The interval of //book, from 2*10^4 to 3*10^4, is split into /book, //books/book, //book/book, ... with boundaries at 2*10^4, 2.1*10^4, 2.2*10^4, 2.3*10^4. The interval of //books/book, from 2.1*10^4 to 2.2*10^4, begins with the subinterval for /books/book]
BLAS: P-labeling Construction (cont’d) • m = 10^12 and 99 tags • Each tag is assigned a ratio r = 0.01 • Construct a P-label for the suffix path • P = /ProteinDatabase/ProteinEntry/protein/name
BLAS: Constructing P-label for XML nodes (cont’d) • P-label of an XML node: m, where the P-label for the path from the root is [m, n] • E.g. /books/book/section: [42100, 42110], so a section node reached by this path has P-label 42100 • Evaluating a suffix path query Q → finding all nodes whose P-label is contained in the P-label of Q • [Figure: the books XML tree, with the section node labeled 42100]
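The construction steps can be sketched in Python under simplifying assumptions: uniform ratios 1 / (n + 1), integer interval bounds that are inclusive on both ends, and input that is already a suffix path (only a leading / or //). The function name `p_label`, the tag order, and the padding tags `tag7` to `tag9` are hypothetical; they are chosen so that n = 9 and m = 10^5 reproduce the round numbers of the books example in these slides rather than the 99-tag protein example.

```python
def p_label(path, tags, m):
    """Return the inclusive interval (p1, p2) for a suffix path.

    path: '/t1/.../tk' or '//t1/.../tk'
    tags: ordered list of the n distinct tags
    m:    domain size, assumed to satisfy m >= (n + 1) ** h
    """
    absolute = not path.startswith("//")
    steps = [s for s in path.split("/") if s]
    slots = len(tags) + 1            # slot 0 is reserved for '/'
    lo, hi = 0, m - 1                # '//' owns the whole domain
    # Refine from the last tag backwards: //tk, then //t(k-1)/tk, ...
    for tag in reversed(steps):
        size = (hi - lo + 1) // slots
        i = tags.index(tag) + 1
        lo, hi = lo + i * size, lo + (i + 1) * size - 1
    if absolute:
        # A leading '/' takes the first slot of the remaining interval.
        size = (hi - lo + 1) // slots
        hi = lo + size - 1
    return lo, hi

# Assumed tag order; tag7..tag9 are padding so that n + 1 = 10.
tags = ["books", "book", "title", "section", "figure",
        "description", "tag7", "tag8", "tag9"]
m = 10 ** 5

print(p_label("//book", tags, m))
print(p_label("//books/book", tags, m))
print(p_label("/books/book/section", tags, m))
```

With inclusive integer bounds the sketch yields <42100, 42109> for /books/book/section, so its lower bound agrees with the node P-label 42100 used in the books example. Because the last tag is refined first, every path ending in the same suffix falls inside that suffix's interval.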
BLAS: Query Language • XPath queries containing /, //, *, and predicates (branches) → tree queries • The evaluation of a path expression P returns the set of nodes [[P]] in an XML tree T that are reachable by P starting from the root of T • The source path SP(n) of a node n in an XML tree T is the unique simple path P from the root to n. • A path expression P is contained in a path expression Q, written P ⊆ Q, if and only if for any XML tree T, [[P]] ⊆ [[Q]] • Path expressions P and Q are non-overlapping, written P ∩ Q = ∅, if and only if for any XML tree T, [[P]] ∩ [[Q]] = ∅
BLAS: Query Translator • Split • Steps: • Descendant axis elimination • Branch elimination • DFS traversal • p//q → p and //q • D-elimination → D-join
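The descendant axis elimination step (p//q → p and //q) can be sketched in Python for branch-free path expressions. The function name `split_descendant_axis` and the sample paths are hypothetical, and predicates are deliberately not handled here.

```python
def split_descendant_axis(path):
    """Cut a branch-free path at every inner '//'.

    '/a/b//c/d//e' becomes ['/a/b', '//c/d', '//e']; each piece is a
    suffix path, and consecutive pieces are reconnected by a D-join.
    """
    pieces = []
    start = 0
    cut = path.find("//", 1)          # skip a leading '//'
    while cut != -1:
        pieces.append(path[start:cut])
        start = cut
        cut = path.find("//", cut + 2)
    pieces.append(path[start:])
    return pieces

print(split_descendant_axis("/a/b//c/d//e"))
print(split_descendant_axis("//book/section"))
```

Each resulting piece contains only child steps after its first axis, so it can be answered by a P-label selection, and only the cuts cost a join.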
BLAS: Query Translator: (I) Decomposition • Q: //book[//title]/section/figure • [Figure: the query tree of Q, with nodes book, title, section, and figure, decomposed step by step into subqueries]
BLAS: Query Translator: (II) Selection on P-labels • Q: //book[//title]/section/figure • [Figure: the decomposed query tree, with each suffix path subquery evaluated by a selection on P-labels]
BLAS: Query Translator: (III) Join on D-labels • Q: //book[//title]/section/figure • [Figure: the decomposed query tree, with the subquery results combined by joins on D-labels]
BLAS: Query Translator - Push-up • Used when schema information is absent • Descendant axis elimination • Push-up branch elimination • p[q1…qn]/r → p, p/q1, …, p/qn, p/r
BLAS: Query Translator - Unfold • Used when schema information is present • Both non-recursive and recursive schemas • Replace D-joins with a process that first performs selections on P-labels and then unions the results → very efficient • Selections using an index are cheap • The union is very simple since there are no duplicates • Subqueries are all simple path queries, which can be implemented as a select operation with equality predicates • Reduces the number of disk accesses
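The push-up rule p[q1…qn]/r → p, p/q1, …, p/qn, p/r can be sketched in Python for one simplified case: a single group of predicates written as p[q1][q2].../r, where every branch uses only child steps. The function name `push_up`, the regular expression, and the `author` tag are assumptions of this sketch; nested predicates and branches starting with // are not handled.

```python
import re

def push_up(query):
    """Decompose p[q1]...[qn]/r into p, p/q1, ..., p/qn, p/r."""
    match = re.fullmatch(
        r"([^\[\]]+)((?:\[[^\[\]]+\])+)(/[^\[\]]+)?", query)
    if match is None:
        return [query]                # no branch to eliminate
    p, predicates, r = match.groups()
    branches = re.findall(r"\[([^\[\]]+)\]", predicates)
    subqueries = [p] + [f"{p}/{q}" for q in branches]
    if r:
        subqueries.append(p + r)
    return subqueries

print(push_up("/books/book[title][author]/section"))
print(push_up("/books/book"))
```

Pushing the prefix p into every subquery makes each one a longer, more selective suffix path, so the P-label selections return fewer nodes before the D-joins run.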
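The unfold idea can be sketched in Python for a branch-free query over a non-recursive schema: enumerate every root-to-node path the schema allows, keep those that match the query, and treat each surviving path as a simple path subquery whose results are unioned. The schema dictionary, the function names (`schema_paths`, `query_to_regex`, `unfold`), and the regex-based matching are assumptions of this sketch, not the paper's algorithm.

```python
import re

def schema_paths(schema, root):
    """Yield every root-to-node simple path of a non-recursive schema."""
    stack = [(root, "/" + root)]
    while stack:
        tag, path = stack.pop()
        yield path
        for child in schema.get(tag, []):
            stack.append((child, f"{path}/{child}"))

def query_to_regex(query):
    """Translate /, // and * steps into a regex over simple paths."""
    parts = []
    for axis, name in re.findall(r"(//|/)([^/]+)", query):
        if axis == "//":
            parts.append(r"(?:/[^/]+)*")      # any number of steps
        parts.append("/" + ("[^/]+" if name == "*" else re.escape(name)))
    return "".join(parts)

def unfold(query, schema, root):
    """Return the simple paths whose union answers the query."""
    pattern = re.compile(query_to_regex(query))
    return sorted(p for p in schema_paths(schema, root)
                  if pattern.fullmatch(p))

# Hypothetical non-recursive schema: parent tag -> allowed child tags.
schema = {
    "books": ["book"],
    "book": ["title", "section"],
    "section": ["title", "figure"],
    "figure": ["title", "description"],
}
print(unfold("//book//title", schema, "books"))
```

Each returned path maps to exactly one P-label interval, so the whole query becomes a union of equality selections with no D-join.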
BLAS: Comparison with D-labeling • [Figure: query plans for the same query over book, section, title, and figure under BLAS and under D-labeling] • BLAS: fewer joins, fewer disk accesses
Outline • Problem being addressed in the paper • Related work • BLAS • Experimental Results • Evaluation
Experiment Setup • Data sets • Query sets • Suffix path queries • Path queries • XPath queries • Benchmark queries • Query Engine: TwigStack Join
Query Execution Time • [Chart: query execution time per query] • Query names: A: Auction, P: Protein, S: Shakespeare • 1: suffix path query, 2: path query, 3: XPath query
Number of data elements visited • [Chart: number of data elements visited per query] • Query names: A: Auction, P: Protein, S: Shakespeare • 1: suffix path query, 2: path query, 3: XPath query
Scalability BLAS
Outline • Problem being addressed in the paper • Related work • BLAS • Experimental Results • Evaluation
Contributions • P-labeling scheme is proposed to evaluate suffix path queries efficiently. • BLAS combines P-labeling and D-labeling to evaluate XPath queries. • BLAS is more efficient than state-of-the-art work because the queries translated from XPath queries require: • fewer disk accesses • fewer joins • Experiments show the effectiveness of BLAS
Evaluation • Successful effort • Trade-off between additional cost and execution time • BLAS vs. RDBMS?