Sequence Indexing Schemes Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601
Introduction • Graph indexes • Precise • Support path queries (only a few methods support twig queries) • Sequence indexing schemes • Top-down or bottom-up • Represent XML documents and XML queries as structure-encoded sequences • Support path and twig queries
ViST – Virtual Suffix Tree • Top-down sequence index • Represents XML documents and XML queries as structure-encoded sequences • Querying XML data is equivalent to finding subsequence matches • Avoids expensive join operations • Provides a unified index on both content and structure • Supports dynamic index updates • Built on B+Trees, which are supported in DBMSs
DTD of purchase records <!ELEMENT purchases (purchase*)> <!ELEMENT purchase (seller, buyer)> <!ATTLIST seller ID ID location CDATA name CDATA> <!ELEMENT seller (item*)> <!ATTLIST buyer ID ID location CDATA name CDATA> <!ELEMENT item (item*)> <!ATTLIST item name CDATA manufacturer CDATA>
Preorder Sequence of XML • Use capital letters to represent names of elements/attributes • Use a hash function h() to encode attribute values into integers • v1 = h(“dell”) • v2 = h(“ibm”) • Preorder sequence of the XML purchase record example • PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8 • Isomorphic trees may produce different preorder sequences • The DTD schema embodies a linear order of all elements/attributes • Without a DTD, use lexicographical order
Structure-Encoded Sequence Definition: A Structure-Encoded Sequence, derived from a prefix traversal of a semi-structured XML document, is a sequence of (symbol, prefix) pairs: D = (a1,p1), (a2,p2),…, (an,pn), where ai represents a node in the XML document tree (of which a1, … ,an is the preorder sequence), and pi is the path from the root node to node ai.
Structure-Encoded Sequence D= (P,ϵ),(S,P),(N,PS),(v1,PSN),(I,PS),(M,PSI),(v2,PSIM),(N,PSI), (v3,PSIN),(I,PSI),(M,PSII),(v4,PSIIM),(I,PS),(N,PSI),(v5,PSIN), (L,PS),(v6,PSL),(B,P),(L,PB),(v7,PBL),(N,PB),(v8,PBN)
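The definition above can be sketched in a few lines of Python. This is only an illustration: the nested `(symbol, children)` tuple representation and the function name `structure_encoded_sequence` are assumptions made here, prefixes are built by plain string concatenation (which relies on single-letter element symbols, as in the slides), and the empty string stands for ϵ.

```python
from typing import List, Tuple

# Assumed tree representation for this sketch: (symbol, [children]).
Node = Tuple[str, list]

def structure_encoded_sequence(root: Node) -> List[Tuple[str, str]]:
    """Return the (symbol, prefix) pairs produced by a preorder traversal."""
    out: List[Tuple[str, str]] = []

    def visit(node: Node, prefix: str) -> None:
        symbol, children = node
        out.append((symbol, prefix))      # emit the node with its root path
        for child in children:
            visit(child, prefix + symbol)  # children see this node in their prefix

    visit(root, "")                        # "" plays the role of the empty prefix
    return out

# A small fragment of the purchase record: a seller with a name and a
# location, and a buyer with a location.
fragment = ("P", [
    ("S", [("N", [("v1", [])]), ("L", [("v6", [])])]),
    ("B", [("L", [("v7", [])])]),
])

for pair in structure_encoded_sequence(fragment):
    print(pair)
```

Attribute values such as v1 are leaves here, so they never appear inside a prefix.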
XML Queries in Path Expression and Sequence Form (query: path expression → structure-encoded sequence) • Q1: /Purchase/Seller/Item/Manufacturer → (P, ϵ)(S,P)(I,PS)(M,PSI) • Q2: /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7] → (P,ϵ)(S,P)(L,PS)(v5,PSL)(B,P)(L,PB)(v7,PBL) • Q3: /Purchase/*[Loc = v5] → (P, ϵ)(L, P*)(v5,P*L) • Q4: /Purchase//Item[Manufacturer = v3] → (P, ϵ)(I,P//)(M, P//I)(v3,P//IM)
Querying XML through Structure-Encoded Sequence Matching • Querying XML is equivalent to finding (non-contiguous) subsequence matches • Most structural XML queries can be performed through direct subsequence matching • Exception: a branch has multiple identical child nodes • Q5 = /A[B/C]/B/D • Two different sequences • (A, ϵ)(B,A)(C,AB)(B,A)(D,AB) • (A, ϵ)(B,A)(D,AB)(B,A)(C,AB) • Find matches for each sequence separately and union the results • False matches are possible if the indexed documents contain branches with identical child nodes; in that case we issue multiple queries and compute the set difference of their results • If the query contains a large number of identical child nodes under a branch, we can disassemble the query tree into multiple trees and use join operations to combine their results
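As a rough illustration of non-contiguous subsequence matching, the sketch below checks a query sequence against the purchase-record sequence D shown earlier. It is a naive linear scan, not the indexed ViST search, and it handles only exact (symbol, prefix) pairs; wildcard prefixes such as P* or P// are deliberately left out. The name `is_subsequence` is an assumption of this sketch.

```python
def is_subsequence(query, doc):
    """True if the pairs of `query` occur in `doc` in order, gaps allowed."""
    remaining = iter(doc)
    # `pair in remaining` advances the iterator until the pair is found,
    # so each query pair must appear after the previous match.
    return all(pair in remaining for pair in query)

# D from the purchase record example ("" stands for the empty prefix).
D = [
    ("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("I", "PS"),
    ("M", "PSI"), ("v2", "PSIM"), ("N", "PSI"), ("v3", "PSIN"),
    ("I", "PSI"), ("M", "PSII"), ("v4", "PSIIM"), ("I", "PS"),
    ("N", "PSI"), ("v5", "PSIN"), ("L", "PS"), ("v6", "PSL"),
    ("B", "P"), ("L", "PB"), ("v7", "PBL"), ("N", "PB"), ("v8", "PBN"),
]

# Q1: /Purchase/Seller/Item/Manufacturer
Q1 = [("P", ""), ("S", "P"), ("I", "PS"), ("M", "PSI")]
# Q2: /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7]
Q2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v5", "PSL"),
      ("B", "P"), ("L", "PB"), ("v7", "PBL")]

print(is_subsequence(Q1, D))  # True: the seller has an item with a manufacturer
print(is_subsequence(Q2, D))  # False: the seller location in D is v6, not v5
```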
Algorithms • Naïve algorithm • RIST – Relationships Indexed Suffix Tree • ViST – Virtual Suffix Tree
Naïve algorithm: Suffix-Tree-Like structure • Doc1 : (P, ϵ)( S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL) • Doc2 : (P, ϵ)(B, P)(L, PB)(v2, PBL) • Q1 : (P, ϵ)(B, P)(L,PB)(v2, PBL) • Q2 : (P, ϵ)(L, P*)(v2,P*L)
D-Ancestorship and S-Ancestorship • D-Ancestorship • Ancestor-descendant relationships in the original XML tree • Element (S,P) is a D-Ancestor of (L,PS) • S-Ancestorship • Ancestor-descendant relationships in the suffix tree • Element (v1, PSN) is an S-Ancestor of (L, PS)
RIST – Indexing Construction • S-Ancestorship requires additional information • Label each suffix tree node x with a pair <nx, sizex> • nx is the prefix traversal order of x in the suffix tree • sizex is the total number of descendants of x in the suffix tree • Given x labeled <nx, sizex> and y labeled <ny, sizey>, x is an S-Ancestor of y if ny ∈ (nx, nx + sizex] • Construct the B+Trees: • Insert all suffix tree nodes into the D-Ancestorship B+Tree using (Symbol, Prefix) as keys • Index all nodes x inserted with the same (Symbol, Prefix) in an S-Ancestorship B+Tree, using the nx values of their labels as keys
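The labeling rule can be tried out on the two documents from the naive-algorithm slide. The sketch below builds a plain in-memory trie, assigns <n, size> labels in preorder, and tests S-Ancestorship with the interval check. The class and function names are assumptions of this sketch, and the two B+Trees are not modeled.

```python
class TrieNode:
    def __init__(self, key=None):
        self.key = key        # (symbol, prefix) pair; None for the root
        self.children = {}    # key -> TrieNode, in insertion order
        self.n = 0            # preorder number
        self.size = 0         # number of descendants

def insert(root, sequence):
    """Insert a sequence and return the trie nodes along its path."""
    path, node = [], root
    for key in sequence:
        if key not in node.children:
            node.children[key] = TrieNode(key)
        node = node.children[key]
        path.append(node)
    return path

def label(node, counter=0):
    """Assign <n, size> in preorder; return the next unused number."""
    node.n = counter
    next_free = counter + 1
    for child in node.children.values():
        next_free = label(child, next_free)
    node.size = next_free - node.n - 1
    return next_free

def is_s_ancestor(x, y):
    """x is an S-Ancestor of y if n_y lies in (n_x, n_x + size_x]."""
    return x.n < y.n <= x.n + x.size

doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]

root = TrieNode()
path1 = insert(root, doc1)
path2 = insert(root, doc2)
label(root)

v1_node, l_ps_node = path1[3], path1[4]
print(is_s_ancestor(v1_node, l_ps_node))   # (v1, PSN) vs (L, PS)
print(is_s_ancestor(path1[1], path2[1]))   # (S, P) vs (B, P)
```

In this run (v1, PSN) is an S-Ancestor of (L, PS) even though it is not its ancestor in the XML tree, which is the distinction the previous slide draws.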
ViST – Virtual Suffix Tree • Dynamic Virtual suffix tree labeling • Semantic and statistical clues • Dynamic scope allocation without clues
Dynamic scope allocation • Assume node x has λ child nodes. We allocate 1/λ of the remaining scope to x’s first child, 1/λ of what then remains to the second child, and so on • Figure: dynamic scope allocation with λ=2
subScope(parent, e): create a sub-scope within the parent scope for e
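One possible reading of this allocation rule is sketched below. It is not taken from the ViST implementation: the `Scope` class, the `next_free` bookkeeping, the integer rounding, and the `sub_scope` signature are all assumptions, and scope exhaustion is only reported, not handled.

```python
class Scope:
    def __init__(self, n, size, element=None):
        self.n = n                # label of the node that owns the scope
        self.size = size          # numbers n+1 .. n+size are reserved for descendants
        self.element = element    # the element e this scope was created for
        self.next_free = n + 1    # first number not yet given to a child

def sub_scope(parent, e, lam=2):
    """Create a sub-scope for e using 1/lam of the parent's remaining scope."""
    remaining = parent.n + parent.size - parent.next_free + 1
    if remaining < 1:
        raise OverflowError("parent scope exhausted")
    share = max(1, remaining // lam)      # numbers handed to this child
    # The child uses one number for its own label and keeps the rest
    # of its share for its own descendants.
    child = Scope(parent.next_free, share - 1, e)
    parent.next_free += share
    return child

root = Scope(0, 1024)
first = sub_scope(root, "a")
second = sub_scope(root, "b")
third = sub_scope(root, "c")
for s in (first, second, third):
    print(s.element, s.n, s.size)
```

With λ=2 the three children receive 1/2, 1/4 and 1/8 of the root scope, so later siblings can always be added without relabeling earlier ones, at the price of shrinking scopes.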
Insertion index • Doc1 = (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL) • Doc2 = (P,ϵ)(S,P)(L,PS)(v2,PSL)
EXPERIMENTS - Sample queries (query: path expression, dataset) • Q1: /inproceedings/title (DBLP) • Q2: /book/author[text=‘David’] (DBLP) • Q3: /*/author[text=‘David’] (DBLP) • Q4: //author[text=‘David’] (DBLP) • Q5: /book[key=‘books/bc/MaierW88’]/author (DBLP) • Q6: /site//item[location=‘US’]/mail/date[text=‘12/15/1999’] (XMARK) • Q7: /site//person/*/city[text=‘Pocatello’] (XMARK) • Q8: //closed_auction[*[person=‘person1’]]/date[text=‘12/15/1999’] (XMARK)
Comparing indexing methods (time in seconds)
Index structure • DBLP (301 MB of data) • XMARK (52MB of data)
Conclusion • structure-encoded sequences • Sequence matching • Avoid expensive join operations • Top-down scope allocation method • Index structure – B+Tree
PRIX: PRufer sequences for Indexing Xml • Rao & Moon (2006) proposed a new method for indexing XML documents using sequences • It uses the same idea as in ViST index: • The XML tree is transformed into a sequence and saved in the database • Each query is also transformed into a sequence • The answer of the query is acquired by performing subsequence matching
Motivation: Twig Queries and Wildcards • Like ViST, PRIX aims to answer twig queries efficiently, as well as queries containing wildcards (‘*’ for any element and ‘//’ for descendant-or-self) • Twig query example, XPath: P/Q[T]/S • Query with wildcards example, XPath: P//Q/S
Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N²) space to index the document • Example: <A> <B> <C> <D> <E> </E> </D> </C> </B> </A> gives D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD) • An element at height k appears k times in D (E once, A five times)
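The quadratic growth can be checked by counting symbols for a chain of nested elements like the one above. The sketch assumes every prefix is stored in full, which is the premise of the worst-case argument; the function names are hypothetical, and integers stand in for element names.

```python
def chain_sequence(n):
    """Structure-encoded sequence of n elements nested in a single chain."""
    # Element k has the k elements above it as its prefix.
    return [(k, tuple(range(k))) for k in range(n)]

def stored_symbols(sequence):
    """Count one symbol per element plus one per prefix entry."""
    return sum(1 + len(prefix) for _, prefix in sequence)

for n in (5, 100, 1000):
    print(n, stored_symbols(chain_sequence(n)), n * (n + 1) // 2)
```

For the five-element example the count is 15, and in general it is N(N+1)/2.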
Motivation: Problems in ViST Index • False positives • In many cases, query processing in ViST results in false alarms • Doc1 = (P, ε) (Q, P) (T, PQ) (S, PQ) (R, P) (U, PR) (T, PR) • Doc2 = (P, ε) (Q, P) (T, PQ) (Q, P) (S, PQ) • XPath: P/Q[T]/S • Q = (P, ε) (Q, P) (T, PQ) (S, PQ) • Q is a subsequence of both documents, but in Doc2 the T and S nodes lie under different Q nodes, so Doc2 is a false alarm
Motivation: Problems in ViST Index • False negatives • Correctly answering a twig query depends on the order in which the branches are created • Doc = (P, ε) (F, P) (T, PF) (N, P) (G, PN) • XPath: P[N]/F • Q = (P, ε) (N, P) (F, P) • Q is not a subsequence of Doc because (F, P) precedes (N, P) in Doc, so the matching document is missed
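The false-positive and false-negative examples above can be replayed with a naive subsequence test. This is a sketch only: `is_subsequence` is a linear scan written for this illustration rather than the ViST index lookup, and the empty string stands for ε.

```python
def is_subsequence(query, doc):
    """True if the pairs of `query` occur in `doc` in order, gaps allowed."""
    remaining = iter(doc)
    return all(pair in remaining for pair in query)

# False positive: XPath P/Q[T]/S
doc1 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ"),
        ("R", "P"), ("U", "PR"), ("T", "PR")]
doc2 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("Q", "P"), ("S", "PQ")]
twig = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ")]

print(is_subsequence(twig, doc1))  # True, and Doc1 really contains the twig
print(is_subsequence(twig, doc2))  # True, although T and S hang under different Q nodes

# False negative: XPath P[N]/F
doc = [("P", ""), ("F", "P"), ("T", "PF"), ("N", "P"), ("G", "PN")]
query = [("P", ""), ("N", "P"), ("F", "P")]
reordered = [("P", ""), ("F", "P"), ("N", "P")]

print(is_subsequence(query, doc))      # False, although the document satisfies P[N]/F
print(is_subsequence(reordered, doc))  # True once the branches follow the document order
```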
Indexing and Querying in PRIX Indexing: • The first step takes an XML document as input and converts it into a sequence • This is achieved using Prufer sequences • The sequence is saved in the database in a way equivalent to the one used in ViST • It is a virtual trie implemented as B+Trees
Indexing and Querying in PRIX Querying: • XPath queries are also transformed to trees and then to Prufer sequences • The query sequence is looked up in the document sequences and all matching subsequences are retrieved • After this initial filtering, three refinement phases follow
Indexing XML Documents • The first step is to transform the XML document into the equivalent XML tree • Notice that both elements and text values are represented as nodes (the same holds for attributes) • The tree is not saved in the database • Example document: <A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A>
Indexing XML Documents • Then the Prufer sequence is created from the XML tree • A Prufer sequence is a construction proposed by Prufer (1918) that establishes a one-to-one correspondence between a labeled tree and a sequence • Node labels of the example tree (number, tag): 1,B 2,D 3,C 4,F 5,E 6,C 7,B 8,A • Resulting sequence: 8, 3, 7, 6, 6, 7, 8
Indexing XML Documents • Prufer sequences can only be created from trees with numerical labeling, with each node having a unique number • Since the XML tree contains string labels (the names of elements etc.), we add an additional numeric label to each node • We use the post-order traversal to number the nodes • The Prufer sequence can be extracted for any labeling of the tree, but post-order numbering has some properties that make the querying process easier
Indexing XML Documents • Initial labeling (post-order number, tag): 1,B 2,D 3,C 4,F 5,E 6,C 7,B 8,A
Indexing XML Documents Finding the Prufer Sequence • The algorithm to find the Prufer sequence is the following: • Find the leaf with the smallest number and delete it • Add the label of its parent to the sequence • Repeat until only one node is left • In the PRIX index, two sequences are held: • The actual Prufer sequence holding the numbers of the nodes, called the Numbered Prufer Sequence (NPS) • The corresponding sequence holding the actual labels of the nodes of the XML tree, called the Labeled Prufer Sequence (LPS)
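The steps above can be sketched as follows for the example document <A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A>. The nested-tuple tree representation, the function names, and the use of a heap to find the smallest leaf are choices made for this sketch, not details taken from PRIX.

```python
import heapq

def postorder_number(tree):
    """Number nodes 1..n in post-order; return (labels, parent) dicts."""
    labels, parent = {}, {}

    def visit(node):
        tag, children = node
        child_ids = [visit(c) for c in children]   # number the subtrees first
        my_id = len(labels) + 1
        labels[my_id] = tag
        for c in child_ids:
            parent[c] = my_id
        return my_id

    visit(tree)
    return labels, parent

def prufer_sequences(labels, parent):
    """Delete the smallest leaf repeatedly, recording its parent each time."""
    child_count = {i: 0 for i in labels}
    for p in parent.values():
        child_count[p] += 1
    leaves = [i for i in labels if child_count[i] == 0]
    heapq.heapify(leaves)
    nps = []
    while len(nps) < len(labels) - 1:      # stop when only one node is left
        leaf = heapq.heappop(leaves)
        p = parent[leaf]
        nps.append(p)
        child_count[p] -= 1
        if child_count[p] == 0:            # the parent has become a leaf
            heapq.heappush(leaves, p)
    lps = [labels[i] for i in nps]
    return nps, lps

tree = ("A", [
    ("B", []),
    ("B", [("C", [("D", [])]), ("C", [("F", []), ("E", [])])]),
])
labels, parent = postorder_number(tree)
nps, lps = prufer_sequences(labels, parent)
print(nps)  # [8, 3, 7, 6, 6, 7, 8]
print(lps)  # ['A', 'C', 'B', 'C', 'C', 'B', 'A']
```

With post-order numbering the i-th entry of the NPS is the parent of node i, which is one of the properties that simplifies later query processing.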
Indexing XML Documents Finding the Prufer Sequence (step 1) • The smallest leaf is 1,B; delete it and add its parent 8,A to the sequences • NPS: 8 • LPS: A
Indexing XML Documents Finding the Prufer Sequence (step 2) • The smallest remaining leaf is 2,D; delete it and add its parent 3,C to the sequences • NPS: 8, 3 • LPS: A, C