Optimizing Cursor Movement in Holistic Twig Joins
Marcus Fontoura, Vanja Josifovski, Eugene Shekita (IBM Almaden Research Center), Beverly Yang (Stanford)
CIKM’2005
Motivation
for $a in //article[year = "2005" or keyword = "XML"]
for $s in $a/section
return $s/title
• In an index-based method, 7 tag and text elements need to be verified to process this query
• Running time is dominated by the I/O for manipulating these cursors
• Twig join algorithms are not optimized for I/O and do not exploit the query's extraction points
[Figure: query tree with nodes article, AND, OR, section, year, keyword, title, "2005", "XML"]
Our Contributions
• TwigOptimal, a new holistic twig join algorithm that supports a large fraction of XQuery (including AND/OR branches)
• Description of how extraction points improve query performance
• Experimental evaluation that shows how TwigOptimal outperforms current algorithms
Agenda
• Background
• TwigOptimal algorithm
• Experimental results
• Conclusions
XML Indexing
• Begin/End/Level (BEL) encoding
  • Begin: preorder position of tag/text
  • End: preorder position of last descendant
  • Level: depth
• Containment: X contains Y iff X.begin < Y.begin <= X.end (assuming well-formed)
[Figure: example tree with nodes R, A1, B1, B2, B3, C1, C2, D1, each annotated with a (begin, end, level) triple: (0,7,0) for the root R, then (1,5,1), (2,2,2), (3,5,2), (4,4,3), (5,5,3), (6,7,1), (7,7,2)]
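As a minimal sketch of the encoding above, the following Python assigns (begin, end, level) triples by a preorder walk and applies the containment test. The `Node` class, the function names, and the exact tree shape are assumptions made for illustration; the shape is one assignment of labels that reproduces the triples listed on the slide, and the original figure may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def encode(root):
    """Return {label: (begin, end, level)} computed by a preorder walk."""
    bel = {}
    next_pos = 0

    def visit(node, level):
        nonlocal next_pos
        begin = next_pos
        next_pos += 1
        for child in node.children:
            visit(child, level + 1)
        # After the subtree is visited, the last position handed out is `end`.
        bel[node.label] = (begin, next_pos - 1, level)

    visit(root, 0)
    return bel


def contains(x, y):
    """X contains Y iff X.begin < Y.begin <= X.end."""
    return x[0] < y[0] <= x[1]


# Hypothetical tree shape, chosen to match the triples on the slide.
tree = Node("R", [
    Node("A1", [Node("B1"), Node("B2", [Node("C1"), Node("D1")])]),
    Node("B3", [Node("C2")]),
])
bel = encode(tree)
```

With this shape, `contains(bel["A1"], bel["D1"])` holds while `contains(bel["B3"], bel["C1"])` does not, using nothing but integer comparisons on the triples.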
Basic Access Path
• Inverted lists
  • Posting: <Token, Location>
  • Token = <term/tag>
  • Location = <DocumentID, Position>
• Supported method on cursor:
  • CB.forwardTo(Position p)
[Figure: the example tree (R, A1, B1, B2, B3, C1, C2, D1) and its inverted lists, B: B1 B2 B3 and C: C1 C2]
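The access path above can be pictured as a forward-only cursor over a sorted posting list. The sketch below is an in-memory stand-in, not the prototype's code: the `Cursor` class and its `forward_to` method are hypothetical names, and the assumed semantics (stop at the first posting whose location is greater than or equal to the target, never move backward) are a guess at what the slide's cursor method does.

```python
import bisect


class Cursor:
    """Forward-only cursor over one token's postings (hypothetical API)."""

    def __init__(self, locations):
        # Each location is a (document_id, position) tuple; tuples sort
        # by document first, then by position.
        self.locations = sorted(locations)
        self.idx = 0

    def at_end(self):
        return self.idx >= len(self.locations)

    def current(self):
        return None if self.at_end() else self.locations[self.idx]

    def forward_to(self, target):
        """Move to the first location >= target; never moves backward."""
        self.idx = bisect.bisect_left(self.locations, target, lo=self.idx)
        return self.current()


cb = Cursor([(1, 2), (1, 3), (1, 6)])   # e.g. postings for tag B
first = cb.forward_to((1, 4))           # skips (1, 2) and (1, 3)
second = cb.forward_to((1, 1))          # target is behind: cursor stays put
third = cb.forward_to((2, 0))           # no posting left in this list
```

In a disk-based index each `forward_to` call would correspond to a seek in the underlying B-tree, which is why the number of such calls matters for I/O.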
Joins in XML
• Structural (Containment) Joins
• Twig Joins
[Figure: example join patterns A || B, B || C, B || D, A || B || C, and the twig with A above B and B above C and D]
LocateExtension
• "Extension" (w.r.t. query node q): a solution for the subquery rooted at q
• Input: q
• Result: the cursors of all descendants of q point to an extension for q
[Figure: twig query with A above B and B above C and D, and cursors over the elements A, B1, B3, C1, C2, D1, D2, X1, X2]
LocateExtension
While (not end(q) && not hasExtension(q)) {
    (p, c) = PickBrokenEdge(q);
    ZigZagJoin(p, c);
}
[Figure: the same twig query and cursors as on the previous slide]
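The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the slide's implementation: `QNode`, `broken`, and the helper names are hypothetical, postings are (begin, end) pairs within a single document, an edge counts as broken when the parent's current element does not contain the child's, and the zigzag step advances one posting at a time where a real index would seek.

```python
class QNode:
    """Query node with a cursor over its sorted (begin, end) postings."""

    def __init__(self, name, postings, children=()):
        self.name = name
        self.postings = sorted(postings)
        self.children = list(children)
        self.idx = 0

    def at_end(self):
        return self.idx >= len(self.postings)

    def cur(self):
        return self.postings[self.idx]


def subtree(q):
    yield q
    for c in q.children:
        yield from subtree(c)


def edges(q):
    for c in q.children:
        yield (q, c)
        yield from edges(c)


def broken(p, c):
    """An edge is broken if p's element does not contain c's element."""
    (pb, pe), (cb, _) = p.cur(), c.cur()
    return not (pb < cb <= pe)


def pick_broken_edge(q):
    return next(((p, c) for p, c in edges(q) if broken(p, c)), None)


def zigzag_join(p, c):
    """Advance p and c alternately until p contains c or a list ends."""
    while not p.at_end() and not c.at_end() and broken(p, c):
        if c.cur()[0] <= p.cur()[0]:
            c.idx += 1   # c starts before p, so it cannot lie inside p
        else:
            p.idx += 1   # p ends before c starts, so p cannot contain c


def locate_extension(q):
    """Return True once every edge below q is aligned, False at end."""
    while not any(n.at_end() for n in subtree(q)):
        edge = pick_broken_edge(q)
        if edge is None:
            return True
        zigzag_join(*edge)
    return False


# Hypothetical data for the twig with A above B and B above C and D.
d = QNode("D", [(14, 14)])
c = QNode("C", [(3, 3), (12, 12)])
b = QNode("B", [(2, 5), (10, 18)], [c, d])
a = QNode("A", [(1, 20)], [b])
found = locate_extension(a)
```

On this data the first B element contains a C but no D, so the loop repairs the (B, D) edge by advancing B, which in turn breaks (B, C) and forces C forward until all three edges are aligned.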
TwigOptimal Algorithm
• Tests whether the cursor with the minimal location has an extension
• If not, tries to virtually move cursors until they form an extension
• Moves cursors physically only if no more virtual moves are possible
• A virtual move just sets the begin value of the cursor, so no I/O is involved:
    Cq.begin = new begin value for Cq;
    Cq.virtual = true; // indicates that the cursor is virtual
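To make the distinction between virtual and physical moves concrete, here is a small sketch of a cursor that supports both. Everything in it is an assumption for illustration: the `VCursor` class, the `seeks` counter standing in for I/O, and the choice that a physical move seeks to the first real posting at or after the current begin value.

```python
import bisect

END = float("inf")  # begin value used once the list is exhausted


class VCursor:
    """Cursor whose begin value can be raised without touching the index."""

    def __init__(self, begins):
        self.begins = sorted(begins)   # stand-in for an on-disk list
        self.idx = 0
        self.begin = self.begins[0] if self.begins else END
        self.virtual = False
        self.seeks = 0                 # simulated index accesses

    def virtual_move(self, new_begin):
        """Raise the lower bound only; no posting is read."""
        if new_begin > self.begin:
            self.begin = new_begin
            self.virtual = True

    def physical_move(self):
        """Seek to the first real posting with begin >= self.begin."""
        self.idx = bisect.bisect_left(self.begins, self.begin, lo=self.idx)
        self.seeks += 1
        self.begin = self.begins[self.idx] if self.idx < len(self.begins) else END
        self.virtual = False


cq = VCursor([2, 5, 9, 40])
cq.virtual_move(7)      # begin = 7, no I/O
cq.virtual_move(30)     # begin = 30, still no I/O
seeks_before = cq.seeks
cq.physical_move()      # one seek lands on the posting at 40
```

The two virtual moves collapse into a single seek, which is the saving the slide describes: intermediate positions are never materialized.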
Checking Extension
• We have an extension for cursor q if:
  • All cursors underneath q are properly aligned
  • All cursors underneath q have physical locations
[Figure: twig query with A above B and B above C and D, over elements A, B1, B3, C1, C2, D1, D2, X1, X2; in the state shown the check returns false]
Checking Extension (continued)
[Figure: the same query and elements in a different cursor state; the check returns true]
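The two conditions of the extension check can be written as a short recursive test. The sketch below assumes a hypothetical `Cur` record holding the cursor's current (begin, end) and a `virtual` flag, and it reads "properly aligned" as the containment condition between each parent and child; like the slide, it inspects only the cursors underneath q.

```python
from dataclasses import dataclass, field


@dataclass
class Cur:
    begin: int
    end: int
    virtual: bool = False
    children: list = field(default_factory=list)


def has_extension(q):
    """True if every cursor below q is physical and contained in its parent."""
    for c in q.children:
        aligned = q.begin < c.begin <= q.end
        if c.virtual or not aligned or not has_extension(c):
            return False
    return True


# Hypothetical state: D's cursor is still virtual, so there is no extension.
d_cur = Cur(14, 14, virtual=True)
c_cur = Cur(12, 12)
b_cur = Cur(10, 18, children=[c_cur, d_cur])
a_cur = Cur(1, 20, children=[b_cur])
before = has_extension(a_cur)

d_cur.virtual = False   # pretend a physical move confirmed the position
after = has_extension(a_cur)
```

The check itself reads only cursor state, so it costs no I/O; the physical move that turns a virtual cursor into a physical one is where the I/O is spent.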
Moving Cursors
• Two passes over the query tree
  • Bottom-up: move each parent cursor forward so it contains its child cursors
  • Top-down: move the child cursors forward so they are contained by their parents
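One possible reading of the two passes, using only virtual moves, is sketched below. The movement rules are assumptions derived from the containment condition rather than taken from the slides: bottom-up, a physical parent whose element ends before some child begins is pushed just past its own end; top-down, a child is pushed to just after its parent's begin. The `Cur` record and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cur:
    name: str
    begin: int
    end: Optional[int]            # None once the cursor is virtual
    virtual: bool = False
    children: list = field(default_factory=list)


def virtual_move(c, new_begin):
    """Raise c's begin without I/O; the end is unknown afterwards."""
    if new_begin > c.begin:
        c.begin, c.end, c.virtual = new_begin, None, True


def bottom_up(q):
    for c in q.children:
        bottom_up(c)
    # Assumed rule: if a child begins after q's element ends, q's current
    # element and everything nested in it cannot contain that child.
    if q.end is not None and any(c.begin > q.end for c in q.children):
        virtual_move(q, q.end + 1)


def top_down(q):
    for c in q.children:
        # Assumed rule: a contained child must begin after its parent.
        virtual_move(c, q.begin + 1)
        top_down(c)


def move_cursors(root):
    bottom_up(root)
    top_down(root)


# Hypothetical state for Query = //x[.//y and .//z].
y = Cur("y", 3, 3)
z = Cur("z", 25, 25)
x = Cur("x", 1, 10, children=[y, z])
move_cursors(x)
```

Here z lies beyond x's element, so x is moved virtually to 11 and y follows to 12, while z stays physical; no posting has been read, and a later physical move can seek each list directly to its new lower bound.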
Move Cursors Example
• Query = //x[.//y and .//z]
[Figure: cursors over x1, x2; y1 to y5; z1, z2, with seven numbered moves marked as physical or virtual]
Comparing with TSGeneric+
• Query = //w//x//y//z
[Figure: cursors over w1, w2; x1 to x50; y1 to y100; z1, z2. Legend: current cursor position, physical move, virtual move]
Comparing with TSGeneric+ (continued)
• Query = //w//x//y//z
[Figure: the same lists. Legend: current cursor position, physical move]
Extraction Points Optimization
• If neither q nor its descendants in the query are extraction points, we can virtually move these cursors within q's parent
[Figure: query with A above B and C, over elements A1, A2, B1, B2, B3, C1, C99, C100]
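The optimization can be illustrated with a small sketch. It rests on an interpretation that is not spelled out on the slide: once a subquery without extraction points has matched once inside its parent element, its remaining matches inside that element add nothing to the result, so its cursors can be moved virtually past the parent's end. The `QNode` record, the `extraction_point` flag, and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    name: str
    begin: int                       # current cursor begin position
    extraction_point: bool = False   # True if the query returns this node
    virtual: bool = False
    children: list = field(default_factory=list)


def needs_all_matches(q):
    """True if q or any of its descendants is an extraction point."""
    return q.extraction_point or any(needs_all_matches(c) for c in q.children)


def skip_rest_of_parent(q, parent_end):
    """After one match of q in its parent, skip q's other matches there."""
    if needs_all_matches(q):
        return False
    stack = [q]
    while stack:
        n = stack.pop()
        if n.begin <= parent_end:
            n.begin, n.virtual = parent_end + 1, True
        stack.extend(n.children)
    return True


# Hypothetical state: parent element A1 ends at position 200, B is an
# extraction point, and C is only an existence test.
b = QNode("B", begin=2, extraction_point=True)
c = QNode("C", begin=5)
skipped_c = skip_rest_of_parent(c, parent_end=200)
skipped_b = skip_rest_of_parent(b, parent_end=200)
```

In this reading C's cursor jumps straight past A1 after its first match, while B must still be enumerated element by element because each B is part of the output.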
Prototype
• Implemented over Berkeley DB B-tree
• Inverted lists
  • Posting: <Token, Location>
  • Token = <term/tag>
  • Location = <DocumentID, Position>
  • Position is BEL
Data Sets
• XMark
  • 10 documents of size ~100MB each
• Synthetic
  • 4 tags: W, X, Y, Z
  • Uncorrelated, no self-nesting
  • Same frequency
Conclusion
• The TwigOptimal algorithm outperforms existing twig join algorithms by more than 40%, especially for larger queries
• Optimized for I/O, which is the performance bottleneck
• The extraction points optimization improves performance