Efficient Processing of Partially Specified Twig Queries
Junfeng Zhou, Renmin University of China
Outline • Introduction • Preliminary • PTwigStack • Conclusion
Introduction(1)
• XML has been used extensively as a standard for information representation and exchange
• More and more data is stored and exchanged in XML format
• Effective and efficient querying of XML data is indispensable
Introduction(2)
[Figure: XML document tree. bibliography(1) has child bib(2) and further bib subtrees shown as "bib(…)"; bib(2) has children year(3) = "1999", book(4) and article(7); book(4) has title(5) = "XML" and author(6) = "Bob"; article(7) has title(8) = "XML", author(9) = "Joe" and author(10) = "Mary". Query Q: book with title and author.]
• Using a standard query language (XPath or XQuery)
• How can we write a proper query when:
  • the structure or schema is not fully available, or
  • information must be extracted from different data sources with different structures
Introduction (4)
[Figure: XML document tree with bibliography(1), bib(2), year(3) = "1999", book(4) with title(5) = "XML" and author(6) = "Bob", and article(7) with title(8) = "XML", author(9) = "Joe" and author(10) = "Mary"]
• Using a keyword-based query
• For example [1]:
  • Find the title and author of the publications
  • The answer is: (5,6), (8,9,10)
[1] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In Proceedings of VLDB 2004, pages 72-83, 2004.
Introduction (5)
[Figure: the document tree with nodes 6 and 8 removed: book(4) keeps only title(5) = "XML"; article(7) keeps author(9) = "Joe" and author(10) = "Mary"]
• Using a keyword-based query
• What if nodes 6 and 8 are removed from the document?
  • Find the title and author of the publications
  • The answer is: (5,9,10), a meaningless result
  • Correct answer: (5,NULL), (NULL,9,10)
Introduction (6)
• Using a Partially Specified Twig Query (PSTQ) [2]
  • Can provide users the most flexibility
• But
  • No existing method can process a PSTQ efficiently
[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006
Introduction(7)
• Objective
  • A concise but effective way to specify more flexible semantic constraints in a twig query
  • An efficient approach to process a PSTQ holistically, without deriving twig queries and processing them one by one
    • Scan once: each stream whose elements' tag appears in the twig pattern is scanned only once.
    • No redundant output: none of the intermediate path solutions is useless.
    • Bounded space complexity: the space required by the algorithm is bounded by a factor that is independent of the size of the source document.
Outline
• Introduction
• Preliminary
  • Holistic Twig Join
  • Partially Specified Twig Query
• PTwigStack
• Conclusion
Preliminary: Holistic Twig Join [3]
[Figure: twig query Q with nodes A, B, C, and an XML document with root R and elements a1, a2, b1, b2, c1]
• Query Processing
  • Output useful path solutions
  • Merge all path solutions to get the final results
• Data Structure
  • Each query node is associated with a stack and an element stream
• Benefits
  • No useless path solutions
[3] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. Technical Report, Columbia University, March 2002.
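The data structures named on this slide can be sketched as follows. This is a minimal illustration, not the layout used in [3]: it assumes each element carries a (start, end) region label from a depth-first traversal, and the `Element` type, the helper names and the sample document are all hypothetical.

```python
from collections import namedtuple

# Assumed region encoding: start/end positions from a depth-first traversal.
Element = namedtuple("Element", ["tag", "name", "start", "end"])

def is_ancestor(a, d):
    """True if the region of a properly contains the region of d."""
    return a.start < d.start and d.end < a.end

def build_streams(elements):
    """Group elements by tag; each stream is ordered by start position."""
    streams = {}
    for e in sorted(elements, key=lambda e: e.start):
        streams.setdefault(e.tag, []).append(e)
    return streams

# Hypothetical document: a1 contains b1 and c1; a2 contains b2.
doc = [
    Element("A", "a1", 1, 8),
    Element("B", "b1", 2, 3),
    Element("C", "c1", 4, 5),
    Element("A", "a2", 9, 12),
    Element("B", "b2", 10, 11),
]
streams = build_streams(doc)

# One (initially empty) stack per query node, as on the slide.
stacks = {tag: [] for tag in ("A", "B", "C")}
```

With this encoding, an ancestor/descendant test is two integer comparisons, which is what lets a stack-based join decide containment while reading each stream once.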
Preliminary: Partially Specified Twig Query [2]
[Figure: example PSTQ Q1]
• Q1 consists of two partial paths (PP), p1 and p2
  • In p1, Y is a descendant of W
  • In p2, W and A are on the same path
  • p1 shares W with p2
  • "*" means p2 is the output path
• Compared with a twig query:
  • Some nodes are only required to be on the same path as other nodes, without a precedence relationship being specified
• Compared with a keyword-based query:
  • Each part of the query can be a path expression, not just a keyword
• Benefits of using a PSTQ:
  • Users can specify a query with whatever partial knowledge they have
[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006
Preliminary: Partially Specified Twig Query
[Figure: PSTQ Q over nodes A, B, C, the derived twig queries Q1 to Q4, and an XML document with elements a1, b1, c1]
• Query processing of a PSTQ: a naïve method
  • Derive twig queries
  • Process each twig query
• Problems of the naïve method
  • Processing cost is too high
  • Redundant results must be eliminated
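To illustrate why deriving queries becomes expensive, here is a small sketch restricted to one partial path. It assumes all listed nodes lie on a single path and map to distinct elements, so every derived query is a linear ancestor-to-descendant ordering; the function name and the `must_precede` parameter are hypothetical, and branching twigs such as those in the figure are not covered.

```python
from itertools import permutations

def derive_path_queries(nodes, must_precede=()):
    """Enumerate every ancestor-to-descendant ordering of `nodes` that
    respects the given (ancestor, descendant) constraints."""
    queries = []
    for order in permutations(nodes):
        pos = {n: i for i, n in enumerate(order)}
        if all(pos[a] < pos[d] for a, d in must_precede):
            queries.append("//".join(order))
    return queries

two = derive_path_queries(["A", "B"])
three_free = derive_path_queries(["A", "B", "C"])
three_constrained = derive_path_queries(["A", "B", "C"], [("A", "C")])
```

With no precedence constraints, k nodes on one path give k! orderings, each of which the naïve method would evaluate as a separate query.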
Outline • Introduction • Preliminary • PTwigStack • Conclusion
PTwigStack: PSTQ Expression
[Figure: PSTQ Q over nodes A, B, C and the derived twig queries Q1 to Q7]
• Extending XPath by adding an operator
  • “ ” denotes the relationship of being on the same path [the operator symbol is missing in the source]
  • A B is equivalent to A//B or B//A
  • A B C ?
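The semantics stated above (A and B on the same path means A//B or B//A) can be checked on element labels as in the sketch below. It assumes (start, end) region labels given as plain tuples; the function names are hypothetical and the operator itself is not parsed here.

```python
def covers(a, b):
    """True if region a = (start, end) properly contains region b."""
    return a[0] < b[0] and b[1] < a[1]

def same_path(a, b):
    """Two elements are on the same path iff one is an ancestor of the other."""
    return covers(a, b) or covers(b, a)

# Hypothetical regions: outer contains inner; sibling is disjoint from inner.
outer, inner, sibling = (1, 10), (2, 5), (6, 9)
```

Unlike `covers`, `same_path` is symmetric, which reflects the fact that the operator leaves the precedence between the two nodes open.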
PTwigStack
[Figure: PSTQ Q, the derived twig queries Q1 to Q4, and a document with elements a1, a2, b1, b2, b3, c1]
• Objective
  • Scan once
  • No redundant output
  • Bounded space complexity
• Problems
  • Which query node should be processed first? According to a special order in the given query.
  • Which element should be processed first? The element with a solution extension.
  • How can we guarantee that no useless path solutions are produced? An element that cannot participate in answers is not pushed onto a stack.
PTwigStack
[Figure: PSTQ Q, the derived twig queries Q1 to Q4, and a document with elements a1, a2, b1, b2, b3, c1]
• Problems (1)
  • Which query node should be processed first?
    • Depth-first order
    • A, B, C
PTwigStack
[Figure: PSTQ Q, the derived twig queries Q1 to Q4, and a document with elements a1, a2, b1, b2, b3, c1]
• Problems (2)
  • Which element should be processed first?
    • The element with a Partial Solution Extension (PSE)
• Partial Solution Extension
  • A query node q has a PSE iff q satisfies one of the following conditions:
    • If q is a leaf node, Cq is not NULL.
    • If q is a non-leaf node, for each q' ∈ children(q):
      • If q//q', then Cq is an ancestor of Cq'.
      • If q q' (q and q' are on the same path) and q' has a PSE, then Cq covers Cq', or is covered by Cq', or Cq.end < Cq'.start.
      • If q q' and q' has no PSE, let p be a descendant of q' that has a PSE; then Cq.start < Cp.start.
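A literal reading of the first three rules can be sketched as below. Several things are assumptions rather than the authors' code: the (start, end) region labels, the `QNode` type, the edge names "desc" and "same", and the choice to treat a NULL cursor on a non-leaf node as "no PSE". The fourth rule (q' without a PSE) is deliberately not modelled and raises an error.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # (start, end), an assumed region encoding

@dataclass
class QNode:
    tag: str
    cur: Optional[Region] = None  # C_q: head of q's stream, None at end of stream
    # Children as (edge, node); "desc" is q//q', "same" is the same-path relation.
    children: List[Tuple[str, "QNode"]] = field(default_factory=list)

def covers(a: Region, b: Region) -> bool:
    """True if region a properly contains region b."""
    return a[0] < b[0] and b[1] < a[1]

def has_pse(q: QNode) -> bool:
    """Rules 1 to 3 of the slide, read literally."""
    if q.cur is None:
        return False  # rule 1 for leaves; assumed for non-leaf nodes as well
    for edge, child in q.children:
        if edge == "desc":
            # Rule 2: C_q must be an ancestor of C_q'.
            if child.cur is None or not covers(q.cur, child.cur):
                return False
        elif edge == "same":
            if not has_pse(child):
                raise NotImplementedError("rule 4 is not modelled in this sketch")
            c = child.cur
            # Rule 3: C_q covers C_q', is covered by it, or ends before it starts.
            if not (covers(q.cur, c) or covers(c, q.cur) or q.cur[1] < c[0]):
                return False
        else:
            raise ValueError(f"unknown edge type: {edge}")
    return True
```

A leaf node has no children, so the loop is skipped and the result depends only on whether its cursor is set.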
PTwigStack
• Feature of Partial Solution Extension
  • If E has a PSE, E must have a solution extension in some twig query derived from the given PSTQ, which means CE may participate in the final results.
• Usage of Partial Solution Extension
  • Guiding the execution of PTwigStack
PTwigStack
[Figure: three example documents built from elements a1, b1, c1; query nodes A, B, C; additional labels a0, b1]
• Problems (3)
  • How can we guarantee that no useless path solutions are produced?
    • Prevent useless elements from being pushed onto a stack
  • What is a useless element?
    • One that cannot satisfy the query requirements together with the top elements of the correlated stacks or the head element of each element stream
PTwigStack
• Data Structure
  • Stack
    • Each query node is associated with a stack that compactly represents temporary results
  • Tag index
    • Each query node is associated with an element stream
PTwigStack
PTwigStack(root)
  // the first stage
  while not end(root)
    q = getNext(root)
    clean all stacks related to q and output the relevant path solutions
    if Cq can be pushed onto stack Sq
      Push(Sq, Cq)
    process other elements Cq' iteratively, where q' is a child of q in the query and Cq'.start < Cq.start
    output all possible path solutions
    Advance(Cq)
  // the second stage
  MergeAllPathSolution()
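The stack-cleaning step can be illustrated with the rule commonly used in stack-based structural joins: entries whose region ends before the next element starts can no longer be its ancestors, so they are popped. This is a sketch under that assumption; the function name and tuple layout are hypothetical, and the output of path solutions during cleaning is not modelled.

```python
def clean_stack(stack, next_start):
    """Pop entries (start, end) that end before `next_start`.

    Returns the popped entries, top of stack first."""
    popped = []
    while stack and stack[-1][1] < next_start:
        popped.append(stack.pop())
    return popped

# Hypothetical stack of nested regions, bottom to top.
stack_a = [(1, 20), (2, 5), (3, 4)]
removed = clean_stack(stack_a, next_start=6)
```

Because entries on one stack are nested, popping stops at the first entry that still spans the incoming element.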
PTwigStack: running example
[Figure, animated over seven steps: PSTQ Q with the derived twig queries Q1 to Q4, the stacks for query nodes A, B and C, and the document elements b1, c1, a1, a3, a2, c2, b2. Every step repeats the PTwigStack(root) pseudocode next to the columns "Output", "Output" and "Final Result".]
Labels visible at each step (the split between stack contents and the three columns is not recoverable):
• Step 1: none
• Step 2: c1
• Steps 3 and 4: a1, b1
• Step 5: a1, a1c2, b1, c2
• Step 6: a1, a1c2, a1b2, b2, b1
• Step 7: a1, a1c2, a1b2, a1b1, a1b1c2, a1b2c2, b1
PTwigStack
• Properties:
  • Each element is scanned only once
  • Each element in a stack must participate in at least one final result
  • No "eliminating operation" is needed for redundant results
  • Space is bounded by |Q|×L, where L is the length of the longest path in the XML source document and |Q| is the number of nodes in the given query Q
Outline • Introduction • Preliminary • PTwigStack • Conclusion
Conclusion
• We propose a concise but effective way to express the semantics of being on the same path by extending XPath
• We propose a new concept, Partial Solution Extension, to guide the execution of getNext
• We propose a new holistic join method to process a PSTQ whose root node is specified
Future Work
• The above method cannot be applied directly to a query whose root node is not specified, e.g.
  • #[//A]//B
  • #[//A//B]//C
  • #[//A B]//C
• Possible solutions
  • Implementing a special algorithm to process a PSTQ without a specified root node (using Dewey codes)
  • Using ORA-SS [4] to construct a twig query with more semantic constraints (using range codes)
[4] Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. Technical Report TR21/00, Department of Computer Science, National University of Singapore, December 2000.
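For context on the Dewey codes mentioned above: a Dewey label records the path of child positions from the root, so an ancestor test becomes a prefix test. The sketch below only illustrates that property; the tuple representation and function names are assumptions, and it says nothing about how the proposed algorithm would use the labels.

```python
def dewey_is_ancestor(a, d):
    """a is a proper ancestor of d iff a's label is a proper prefix of d's."""
    return len(a) < len(d) and d[:len(a)] == a

def dewey_same_path(a, b):
    """Two nodes are on the same path iff one label is a proper prefix of the other."""
    return dewey_is_ancestor(a, b) or dewey_is_ancestor(b, a)

# Hypothetical labels: root = (1,), its second child = (1, 2), and so on.
root, child, grandchild, other = (1,), (1, 2), (1, 2, 3), (1, 3)
```

One consequence is that the labels of all ancestors of a node can be read directly from the node's own label, without consulting other elements.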
Thank You ! Q & A