Sequence Indexing Schemes

Sequence Indexing Schemes Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601

Introduction • Graph indexes • Precise • Path queries (twig queries in only a few methods) • Sequence indexing schemes • Top-down or bottom-up • XML documents and XML queries are represented as structure-encoded sequences • Path and twig queries

Top-Down Sequence Indexes: ViST

ViST – Virtual Suffix Tree • Top-down Sequence Indexes • Represents XML documents and XML queries as structure-encoded sequences • Querying XML data is equivalent to finding subsequence matches • Avoids expensive join operations • Provides a unified index on both content and structure • Supports dynamic index updates • Built on B+Trees, which are supported in DBMSs

DTD of purchase records <!ELEMENT purchases (purchase*)> <!ELEMENT purchase (seller, buyer)> <!ATTLIST seller ID ID location CDATA name CDATA> <!ELEMENT seller (item*)> <!ATTLIST buyer ID ID location CDATA name CDATA> <!ELEMENT item (item*)> <!ATTLIST item name CDATA manufacturer CDATA>

A Single Purchase Record

Preorder Sequence of XML • Use capital letters to represent names of elements/attributes • Use a hash function h() to encode attribute values into integers • v1 = h(“dell”) • v2 = h(“ibm”) • Preorder sequence of the XML purchase record example • PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8 • Isomorphic trees may produce different preorder sequences • The DTD schema embodies a linear order of all elements/attributes • Without a DTD, use lexicographical order

Structure-Encoded Sequence Definition: A Structure-Encoded Sequence, derived from a prefix traversal of a semi-structured XML document, is a sequence of (symbol, prefix) pairs: D = (a_1,p_1), (a_2,p_2), …, (a_n,p_n), where a_i represents a node in the XML document tree (of which a_1, …, a_n is the preorder sequence), and p_i is the path from the root node to node a_i.

Structure-Encoded Sequence D= (P,ϵ),(S,P),(N,PS),(v1,PSN),(I,PS),(M,PSI),(v2,PSIM),(N,PSI), (v3,PSIN),(I,PSI),(M,PSII),(v4,PSIIM),(I,PS),(N,PSI),(v5,PSIN), (L,PS),(v6,PSL),(B,P),(L,PB),(v7,PBL),(N,PB),(v8,PBN)
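The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nested-tuple tree layout and the name `structure_encode` are assumptions made for the example, and the empty string stands in for ϵ. The tree literal is reconstructed from the sequence D shown on this slide.

```python
def structure_encode(node, prefix=""):
    """Return the (symbol, prefix) pairs of a tree in preorder.

    A node is a (symbol, children) tuple; this layout is an assumption
    made for the sketch. The empty string plays the role of epsilon.
    """
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        # The prefix of a child is the root-to-parent path.
        pairs.extend(structure_encode(child, prefix + symbol))
    return pairs


# Purchase record tree, reconstructed from the sequence D on the slide.
purchase = ("P", [
    ("S", [
        ("N", [("v1", [])]),
        ("I", [
            ("M", [("v2", [])]),
            ("N", [("v3", [])]),
            ("I", [("M", [("v4", [])])]),
        ]),
        ("I", [("N", [("v5", [])])]),
        ("L", [("v6", [])]),
    ]),
    ("B", [
        ("L", [("v7", [])]),
        ("N", [("v8", [])]),
    ]),
])

D = structure_encode(purchase)
```

Each pair in `D` corresponds to one element of the sequence above, in the same order.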

XML Queries in Graph Form

XML Queries in Path Expression and Sequence Form • Query: Path Expression, Structure-Encoded Sequence • Q1 : /Purchase/Seller/Item/Manufacturer, (P, ϵ)(S,P)(I,PS)(M,PSI) • Q2 : /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7], (P,ϵ)(S,P)(L,PS)(v5,PSL)(B,P)(L,PB)(v7,PBL) • Q3 : /Purchase/*[Loc= v5], (P, ϵ)(L, P*)(v5,P*L) • Q4 : /Purchase//Item[Manufacturer = v3], (P, ϵ)(I,P//)(M, P//I)(v3,P//IM)

Querying XML through Structure-Encoded Sequence Matching • Querying XML is equivalent to finding (non-contiguous) subsequence matches • Most structural XML queries can be performed through direct subsequence matching • Exception: a branch has multiple identical child nodes • Q5=/A[B/C]/B/D • Two different sequences • (A, ϵ)(B,A)(C,AB)(B,A)(D,AB) • (A, ϵ)(B,A)(D,AB)(B,A)(C,AB) • Find matches for each sequence separately and union their results • We may find false matches if the indexed documents contain branches with identical child nodes; in that case we ask multiple queries and compute the set difference of their results • If the query contains a large number of identical child nodes under a branch, we can choose to disassemble the tree into multiple trees and use join operations to combine their results
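The Q5 case can be illustrated with a small Python sketch of non-contiguous subsequence matching. It handles exact (symbol, prefix) pairs only, with no wildcard support, and the document `doc` is a hypothetical example invented for the illustration; the empty string stands in for ϵ.

```python
def is_subsequence(query, doc):
    """True if query occurs in doc as a (possibly non-contiguous) subsequence."""
    pos = 0
    for pair in doc:
        # Greedily consume the next query element when it shows up.
        if pos < len(query) and pair == query[pos]:
            pos += 1
    return pos == len(query)


# The two orderings of Q5 = /A[B/C]/B/D from the slide.
q5_a = [("A", ""), ("B", "A"), ("C", "AB"), ("B", "A"), ("D", "AB")]
q5_b = [("A", ""), ("B", "A"), ("D", "AB"), ("B", "A"), ("C", "AB")]

# Hypothetical document in which the B/D branch precedes the B/C branch.
doc = [("A", ""), ("B", "A"), ("D", "AB"), ("E", "ABD"), ("B", "A"), ("C", "AB")]

match_a = is_subsequence(q5_a, doc)
match_b = is_subsequence(q5_b, doc)
# Union of the two separate searches.
matches_q5 = match_a or match_b
```

Only the second ordering matches this document, which is why both sequences have to be searched and their results combined.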

Algorithms • Naïve algorithm • RIST – Relationships Indexed Suffix Tree • ViST – Virtual Suffix Tree

Naïve algorithm: Suffix-Tree-Like structure • Doc1 : (P, ϵ)( S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL) • Doc2 : (P, ϵ)(B, P)(L, PB)(v2, PBL) • Q1 : (P, ϵ)(B, P)(L,PB)(v2, PBL) • Q2 : (P, ϵ)(L, P*)(v2,P*L)

D-Ancestorship and S-Ancestorship • D-Ancestorship • Ancestor-descendant relationships in the original XML tree • Element (S,P) is a D-Ancestor of (L,PS) • S-Ancestorship • Ancestor-descendant relationships in the suffix tree • Element (v1, PSN) is an S-Ancestor of (L, PS)

Naïve search: A naïve algorithm based on suffix trees

RIST – Indexing Construction • S-Ancestorship requires additional information • Label each suffix tree node x by a pair <n_x, size_x> • n_x is the prefix traversal order of x in the suffix tree • size_x is the total number of descendants of x in the suffix tree • For nodes x and y labeled <n_x, size_x> and <n_y, size_y>, x is an S-Ancestor of y if n_y ∈ (n_x, n_x + size_x] • Construct the B+Trees: • Insert all suffix tree nodes into the D-Ancestorship B+Tree using (Symbol, Prefix) as keys • For all nodes x inserted with the same (Symbol, Prefix), we index them in an S-Ancestorship B+Tree, using the n_x values of their labels as keys
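The labeling rule can be sketched as follows. This is an illustrative in-memory trie, not the B+Tree layout RIST actually uses; the class `TrieNode`, the helper names, and the choice to give the virtual root the number 0 are assumptions made for the example. Doc1 and Doc2 are the two documents from the naïve algorithm slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # (symbol, prefix) -> TrieNode, in insertion order
        self.n = None       # preorder number in the suffix-tree-like structure
        self.size = 0       # number of descendants


def insert(root, sequence):
    node = root
    for pair in sequence:
        node = node.children.setdefault(pair, TrieNode())
    return node


def label(node, counter=0):
    """Assign <n, size> labels in preorder; return the next unused number."""
    node.n = counter
    nxt = counter + 1
    for child in node.children.values():
        nxt = label(child, nxt)
    node.size = nxt - node.n - 1
    return nxt


def find(root, sequence):
    node = root
    for pair in sequence:
        node = node.children[pair]
    return node


def is_s_ancestor(x, y):
    # x is an S-Ancestor of y iff n_y lies in (n_x, n_x + size_x].
    return x.n < y.n <= x.n + x.size


doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]

root = TrieNode()  # virtual root, assumed to take number 0
insert(root, doc1)
insert(root, doc2)
label(root)

v1_node = find(root, doc1[:4])  # (v1, PSN)
l_node = find(root, doc1[:5])   # (L, PS) below (v1, PSN)
b_node = find(root, doc2[:2])   # (B, P)
```

With these labels, `is_s_ancestor(v1_node, l_node)` holds, matching the S-Ancestorship example given earlier, while `(v1, PSN)` and `(B, P)` lie in different branches and are unrelated.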

The RIST index structure

Search: non-contiguous subsequence matching using B+Tree

ViST – Virtual Suffix Tree • Dynamic Virtual suffix tree labeling • Semantic and statistical clues • Dynamic scope allocation without clues

Dynamic scope allocation • Number of child nodes of x is λ. We allocate 1/λ of the remaining scope to x’s first child • Figure: Dynamic scope allocation with λ=2

Dynamic Scope of a Suffix Tree Node

subScope(parent, e): create a sub-scope within the parent scope for e
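A rough Python sketch of the allocation idea follows. It is not the subScope routine from the slide, whose body is not reproduced in this transcript: the dictionary layout, the `used` bookkeeping field, the integer division, and the underflow error are all assumptions made for the illustration. Each new child receives 1/λ of the parent scope that is still unallocated.

```python
def sub_scope(parent, lam=2):
    """Carve a child scope out of the unallocated part of the parent scope.

    A scope is a dict {"n": label, "size": room for descendants,
    "used": amount already handed to children}. Hypothetical layout.
    """
    remaining = parent["size"] - parent["used"]
    block = remaining // lam          # 1/lambda of what is left
    if block < 1:
        raise OverflowError("scope underflow: no room left for a new child")
    child = {
        "n": parent["n"] + 1 + parent["used"],
        "size": block - 1,            # one number is the child's own label
        "used": 0,
    }
    parent["used"] += block
    return child


root_scope = {"n": 0, "size": 16, "used": 0}
first = sub_scope(root_scope)       # gets half of the 16 available numbers
second = sub_scope(root_scope)      # gets half of the remaining 8
grandchild = sub_scope(first)       # allocated inside the first child's scope
```

Every child scope stays inside the interval (n, n + size] of its parent, so the S-Ancestorship test on labels keeps working without relabeling when new nodes are inserted. When the remaining scope is exhausted, this sketch simply raises an error.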

Insertion index • Doc1 = (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL) • Doc2 = (P,ϵ)(S,P)(L,PS)(v2,PSL)

Index an XML document

EXPERIMENTS - Sample queries (query, path expression, dataset) • Q1: /inproceedings/title (DBLP) • Q2: /book/author[text=‘David’] (DBLP) • Q3: /*/author[text=‘David’] (DBLP) • Q4: //author[text=‘David’] (DBLP) • Q5: /book[key=‘books/bc/MaierW88’]/author (DBLP) • Q6: /site//item[location=‘US’]/mail/date[text=‘12/15/1999’] (XMARK) • Q7: /site//person/*/city[text=‘Pocatello’] (XMARK) • Q8: //closed_auction[*[person=‘person1’]]/date[text=‘12/15/1999’] (XMARK)

Comparing indexing methods time in seconds

Index structure • DBLP (301 MB of data) • XMARK (52MB of data)

Conclusion • structure-encoded sequences • Sequence matching • Avoid expensive join operations • Top-down scope allocation method • Index structure – B+Tree

PRIX: Prufer Sequences for Indexing XML

PRIX: PRufer sequences for Indexing Xml • Rao & Moon (2006) proposed a new method for indexing XML documents using sequences • It uses the same idea as in ViST index: • The XML tree is transformed into a sequence and saved in the database • Each query is also transformed into a sequence • The answer of the query is acquired by performing subsequence matching

Motivation: Twig Queries and Wildcards • Like ViST, PRIX tries to efficiently answer twig queries as well as queries containing wildcards (‘*’ for any element and ‘//’ for descendant-or-self) • Twig query, XPath: P/Q[T]/S • Query with wildcards, XPath: P//Q/S

Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N²) space to index the document • Example: <A> <B> <C> <D> <E> </E> </D> </C> </B> </A> • D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD) • An element at depth k carries a prefix of k symbols
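The quadratic growth can be checked with a short Python sketch that encodes a single chain of nested elements and counts the stored symbols. The helper names are assumptions made for the example, and the empty string stands in for ε.

```python
def chain_sequence(labels):
    """Structure-encoded sequence of a single-path tree (each node has one child)."""
    return [(sym, "".join(labels[:i])) for i, sym in enumerate(labels)]


def stored_symbols(sequence):
    """Count every symbol that has to be stored: each node plus its prefix."""
    return sum(1 + len(prefix) for _, prefix in sequence)


chain = chain_sequence(list("ABCDE"))
total_5 = stored_symbols(chain)                           # 1 + 2 + 3 + 4 + 5
total_100 = stored_symbols(chain_sequence(["X"] * 100))   # 100 * 101 / 2
```

For a chain of N nodes the count is N(N+1)/2, which is where the O(N²) worst case comes from.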

Motivation: Problems in ViST Index • False positives • In many cases, query processing in ViST results in false alarms • Doc1 = (P, ε)(Q, P)(T, PQ)(S, PQ)(R, P)(U, PR)(T, PR) • Doc2 = (P, ε)(Q, P)(T, PQ)(Q, P)(S, PQ) • XPath: P/Q[T]/S, Q = (P, ε)(Q, P)(T, PQ)(S, PQ) • Q is a subsequence of both documents, but in Doc2 the nodes T and S lie under different Q nodes, so Doc2 is a false alarm

Motivation: Problems in ViST Index • False negatives • Correctly answering a twig query depends on the order in which the branches are created • Doc = (P, ε)(F, P)(T, PF)(N, P)(G, PN) • XPath: P[N]/F, Q = (P, ε)(N, P)(F, P) • Q is not a subsequence of Doc because (N, P) comes after (F, P) in the document, so the match is missed
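Both failure modes can be reproduced with plain subsequence matching on the sequences from these slides. The sketch below is a simplification that compares exact (symbol, prefix) pairs and ignores the index structure; `is_subsequence` is a helper name chosen for the example, and the empty string stands in for ε.

```python
def is_subsequence(query, doc):
    """Non-contiguous subsequence test; `pair in it` advances the iterator."""
    it = iter(doc)
    return all(pair in it for pair in query)


# False alarm: Doc2 has T and S under two different Q nodes,
# yet the twig query P/Q[T]/S still matches as a subsequence.
doc2 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("Q", "P"), ("S", "PQ")]
twig_query = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ")]
false_alarm = is_subsequence(twig_query, doc2)

# False negative: the document contains both an N and an F child of P,
# but F precedes N in the document sequence, so P[N]/F is not found.
doc = [("P", ""), ("F", "P"), ("T", "PF"), ("N", "P"), ("G", "PN")]
branch_query = [("P", ""), ("N", "P"), ("F", "P")]
missed = not is_subsequence(branch_query, doc)
```

Here `false_alarm` and `missed` are both true, which matches the two problems described above.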

PRIX Architecture

Indexing and Querying in PRIX • Indexing: • The first step takes an XML document as input and converts it into a sequence • This is achieved using Prufer Sequences • The sequence is saved in the database in a way equivalent to the one used in ViST • It is a Virtual Trie implemented as B+ Trees

Indexing and Querying in PRIX • Querying: • Queries are also transformed to trees and then to Prufer Sequences • The query sequence is looked up in the document sequence and all matching subsequences are retrieved • After this initial filtering, three refinement phases follow

Indexing XML Documents • The first step is to transform the XML document into the equivalent XML tree • Notice that both elements and text values are represented as nodes (the same holds for attributes) • The tree is not saved in the database • Example: <A> <B/> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A>

Indexing XML Documents • Then the Prufer Sequence is created from the XML tree • A Prufer Sequence, proposed by Prufer (1918), establishes a one-to-one correspondence between a labeled tree and a sequence • Example tree nodes (number, label): 1,B 2,D 3,C 4,F 5,E 6,C 7,B 8,A • Prufer sequence: 8, 3, 7, 6, 6, 7, 8

Indexing XML Documents • Prufer Sequences can only be created from trees with numerical labeling, with each node having a unique number • Since the XML tree contains string labels (the names of elements etc.), we add an additional label to each node • We use the post-order traversal to number the nodes • The Prufer sequence can be extracted for any labeling of the tree, but post-order numbering has some properties that make the querying process easier
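Post-order numbering can be sketched as below. The nested-tuple tree layout and the function name are assumptions made for the example; the tree is the running example of these slides (root A with two B children, the second B holding two C subtrees).

```python
def postorder_number(tree):
    """Return (number, label) pairs, numbering nodes 1..n in post-order."""
    numbered = []

    def visit(node):
        label, children = node
        for child in children:
            visit(child)
        # A node is numbered only after all of its children.
        numbered.append((len(numbered) + 1, label))

    visit(tree)
    return numbered


example_tree = ("A", [
    ("B", []),
    ("B", [
        ("C", [("D", [])]),
        ("C", [("F", []), ("E", [])]),
    ]),
])

numbering = postorder_number(example_tree)
```

One convenient consequence of this numbering is that every node receives a larger number than all of its descendants, so the root is always numbered last.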

Indexing XML Documents • Initial labeling (figure): element names A, B, B, C, C, D, F, E • After post-order numbering: 1,B 2,D 3,C 4,F 5,E 6,C 7,B 8,A

Indexing XML Documents Finding the Prufer Sequence • The algorithm to find the Prufer sequence is the following: • Find the leaf with the smallest value and delete it. • Add the label of its parent to the sequence • Repeat until only one node is left • In PRIX index, two sequences are held: • The actual Prufer Sequence holding the numbers of the labels called Numbered Prufer Sequence: NPS • The corresponding sequence holding the actual labels of the nodes of the XML Tree called Labeled Prufer Sequence: LPS
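The steps above can be sketched in Python as follows. This is an illustrative version that follows the slide's variant (repeat until one node is left, giving n-1 entries, whereas the classical Prufer code stops at two nodes); the parent map and label map encode the example tree with its post-order numbers, and the function name is an assumption.

```python
def prufer_sequences(parent, labels):
    """Build the Numbered (NPS) and Labeled (LPS) Prufer sequences.

    parent maps each non-root node number to its parent's number;
    labels maps every node number to its element name.
    """
    remaining = set(labels)
    child_count = {v: 0 for v in labels}
    for p in parent.values():
        child_count[p] += 1

    nps, lps = [], []
    while len(remaining) > 1:
        # Smallest-numbered node that currently has no children.
        leaf = min(v for v in remaining if child_count[v] == 0)
        p = parent[leaf]
        nps.append(p)
        lps.append(labels[p])
        remaining.remove(leaf)
        child_count[p] -= 1
    return nps, lps


# Example tree with post-order numbers.
parent = {1: 8, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 8}
labels = {1: "B", 2: "D", 3: "C", 4: "F", 5: "E", 6: "C", 7: "B", 8: "A"}

nps, lps = prufer_sequences(parent, labels)
```

With post-order numbering the leaves are deleted in the order 1, 2, …, n-1, so the i-th entry of the NPS is simply the parent of node i.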

Indexing XML Documents Finding the Prufer Sequence • Step 1: delete the smallest leaf (1,B) and add its parent (8,A) to the sequence • NPS : 8 • LPS : A

Indexing XML Documents Finding the Prufer Sequence • Step 2: delete the smallest remaining leaf (2,D) and add its parent (3,C) to the sequence • NPS : 8, 3 • LPS : A, C
