Covering Indexes for Branching Path Queries
Raghav Kaushik, Philip Bohannon, Jeffrey F Naughton and Henry F Korth
Abdullah Mueen
XML as Graph Data
• Leaf nodes with attributes are suppressed
• Non-tree edges model IDREF relationships in the document
[Figure: XML document drawn as a graph; node annotations: oid, label(3)]
Branching Path Expression
ROOT/metro/neighborhoods/neighborhood[/business=>cinema-hall]/cultural=>museum
Example (1)
//hotel[/star][<=business\neighborhood[/cultural=>museum[\art]]]
Covering Index
• A covering index can answer any query from a given set of queries without consulting the original document.
• The goal of this paper is to find a covering index for branching path queries.
k-bisimilarity
Two nodes u and v are called k-bisimilar (u ≈k v) if label(u) = label(v) and every incoming label path of length ≤ k to u matches at least one incoming label path of length ≤ k to v, and vice versa.
[Figure: data graph G with root R (node 0) and nodes 1 to 9 labeled A, B, C, D]
• ≈k defines an equivalence relation over the set of nodes in G
• The algorithm for computing k-bisimulation will be shown later
• 2,4 are 0-bisimilar
• 5,7 are 1-bisimilar
• 8,9 are 2-bisimilar
• 6,8 are 1-bisimilar
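As a rough illustration of the definition above, the following Python sketch computes k-bisimilarity classes by repeated signature refinement. It uses the common recursive formulation (same label, and matching parent classes at level k-1), which is assumed here to agree with the path-based wording of the slide. The function name `k_bisimulation`, the dictionary layout, and the small example graph are hypothetical; the graph is not the one drawn on the slide.

```python
def k_bisimulation(labels, parents, k):
    """Return the k-bisimilarity classes as a sorted list of sorted node lists.

    labels  : dict mapping node id -> label
    parents : dict mapping node id -> set of parent node ids
    """
    # Level 0: two nodes are 0-bisimilar exactly when their labels agree.
    block = dict(labels)
    for _ in range(k):
        # A node's signature is its current block plus the set of its parents' blocks.
        sig = {n: (block[n], frozenset(block[p] for p in parents.get(n, ())))
               for n in labels}
        # Renumber the distinct signatures to obtain the next, finer partition.
        ids = {}
        block = {n: ids.setdefault(sig[n], len(ids)) for n in labels}
    classes = {}
    for n, b in block.items():
        classes.setdefault(b, []).append(n)
    return sorted(sorted(c) for c in classes.values())


# Hypothetical example graph: R -> A -> C -> D and R -> B -> C -> D.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "C", 5: "D", 6: "D"}
parents = {1: {0}, 2: {0}, 3: {1}, 4: {2}, 5: {3}, 6: {4}}

for k in range(3):
    print(k, k_bisimulation(labels, parents, k))
```

In this example the two D nodes stay in one class up to k = 1 and separate at k = 2, because their incoming label paths first differ at length 2.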
1-index: Covering Index for Simple Path Expressions
[Figure: data graph G and the partitions A(0), A(1), A(2), A(3) = 1-index, each obtained from the previous one by a SuccStable step; A(0) groups the nodes by label as {1}, {2,4}, {3,5,7}, {6,8,9}]
Inverse edges
[Figure: two drawings of the data graph G, each carrying one of the captions below]
• 5,7 are not 1-bisimilar
• 5,7 are 1-bisimilar
The F&B index
• Repeat while the partition changes:
  • Reverse all edges
  • Compute the forward bisimilarity partition
  • Reverse all edges again
  • Compute the backward bisimilarity partition
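The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: instead of physically reversing edges it keeps separate `children` and `parents` maps, it takes "forward" to mean stability with respect to outgoing edges and "backward" with respect to incoming edges, and the names `refine_until_stable` and `fb_partition` as well as the example graphs are hypothetical.

```python
def refine_until_stable(block, nodes, neighbours):
    """Split blocks until all nodes of a block see the same set of neighbour blocks."""
    while True:
        sig = {n: (block[n], frozenset(block[m] for m in neighbours[n]))
               for n in nodes}
        ids = {}
        refined = {n: ids.setdefault(sig[n], len(ids)) for n in nodes}
        # Refinement only ever splits blocks, so an equal count means no change.
        if len(ids) == len(set(block.values())):
            return refined
        block = refined


def fb_partition(labels, edges):
    """Return the classes of a partition stable in both edge directions."""
    nodes = list(labels)
    parents = {n: set() for n in nodes}
    children = {n: set() for n in nodes}
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)
    block = dict(labels)  # start from the label grouping
    while True:
        before = len(set(block.values()))
        block = refine_until_stable(block, nodes, children)  # forward step
        block = refine_until_stable(block, nodes, parents)   # backward step
        if len(set(block.values())) == before:
            break  # neither step split anything: the partition is stable
    classes = {}
    for n, b in block.items():
        classes.setdefault(b, []).append(n)
    return sorted(sorted(c) for c in classes.values())


# Hypothetical graph 1: only one of the two B nodes has a C child.
labels1 = {0: "R", 1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
edges1 = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5)]

# Hypothetical graph 2: both B nodes have a C child.
labels2 = dict(labels1)
labels2[6] = "C"
edges2 = edges1 + [(4, 6)]

print(fb_partition(labels1, edges1))
print(fb_partition(labels2, edges2))
```

In the first graph the two A nodes and the two B nodes have identical incoming paths, yet they end up in separate classes because their outgoing paths differ; in the second graph the symmetric structure keeps them together.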
Forward Bisimulation
[Figure: four drawings of the data graph G (nodes 0 to 9, labels R, A, B, C, D)]
Backward Bisimulation
[Figure: three drawings of the data graph G (nodes 0 to 9, labels R, A, B, C, D)]
Properties of F&B index
• The F&B index over a data graph G covers all branching path expressions.
• The F&B index is the smallest index that covers all branching path queries.
• In practice, the F&B index is large for most real documents.
1. Tags to be indexed
• Some tags are not used in queries, e.g. bold, emph.
• We specify the set of tags to be indexed.
• In a 100MB document, the F&B index on all tags has 436,000 nodes, while ignoring formatting tags it has 18,000 nodes.
2. IDREF edges to be indexed
• IDREF edges are not followed by the // operation.
• IDREF edges are named explicitly in the path expression with the => operator.
• We specify the set of IDREF edges to be indexed.
• For the 100MB document, the F&B index has 1.3 million nodes with all IDREF edges, while it has 18,000 nodes without any IDREF edges and formatting tags.
3. Exploiting Local Similarity
• Long queries are infrequent and rarely of interest.
• If we restrict the length of the possible queries, we can get a much smaller index than the F&B index.
• We specify the length of the local path by using k-bisimilarity instead of bisimilarity while computing the F&B index.
4. Restricting Tree Depth
• Deeply nested conditions are less likely to occur.
• We specify the maximum depth of the conditional path expression by the tree depth (defined next).
Tree depth
//museums/history/museum[/featured and <=cultural\neighborhood[/cultural=>museum[\art]]]
Definition of an Index
• A set of tags T
• Sets of IDREF edges for the two directions, reffwd and refbwd
• Two parameters, kbwd and kfwd, to restrict the length of the path queries
• One parameter, td, to restrict the depth of nested conditional expressions
The BPCI index
• Remove all tags not in T, provided that the removal does not cut out a tag in T.
• Start with the label grouping as the current partition P.
• For i = 0, while i ≤ td:
  • Reverse all edges in G; retain IDREF edges only in reffwd.
  • P ← forward kfwd-bisimilar partition of P; increment i.
  • Reverse all edges in G again; retain IDREF edges only in refbwd.
  • P ← backward kbwd-bisimilar partition of P; increment i.
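The steps above can be transcribed fairly literally into Python. The sketch below is hedged in several ways: it assumes the graph has already been restricted to the tag set T, it reads "k-bisimilar partition of P" as k rounds of signature refinement starting from P, it assumes IDREF edges in reffwd count as outgoing edges in the forward step and those in refbwd as incoming edges in the backward step, and it uses neighbour maps rather than reversing edges. The names `refine_k` and `bpci_partition`, the argument layout, and the example data are all hypothetical.

```python
def refine_k(block, nodes, neighbours, k):
    """Apply k rounds of signature refinement with respect to `neighbours`."""
    for _ in range(k):
        sig = {n: (block[n], frozenset(block[m] for m in neighbours[n]))
               for n in nodes}
        ids = {}
        block = {n: ids.setdefault(sig[n], len(ids)) for n in nodes}
    return block


def bpci_partition(labels, tree_edges, idref_edges, ref_fwd, ref_bwd,
                   k_fwd, k_bwd, td):
    """labels      : node -> tag (assumed already restricted to T)
    tree_edges  : iterable of (parent, child)
    idref_edges : iterable of (source, target, name)
    ref_fwd / ref_bwd : sets of IDREF names retained in each direction
    """
    nodes = list(labels)
    children = {n: set() for n in nodes}  # used by the forward step
    parents = {n: set() for n in nodes}   # used by the backward step
    for u, v in tree_edges:
        children[u].add(v)
        parents[v].add(u)
    for u, v, name in idref_edges:
        if name in ref_fwd:
            children[u].add(v)
        if name in ref_bwd:
            parents[v].add(u)
    block = dict(labels)  # label grouping as the initial partition P
    i = 0
    while i <= td:
        block = refine_k(block, nodes, children, k_fwd)  # forward k_fwd step
        i += 1
        block = refine_k(block, nodes, parents, k_bwd)   # backward k_bwd step
        i += 1
    classes = {}
    for n, b in block.items():
        classes.setdefault(b, []).append(n)
    return sorted(sorted(c) for c in classes.values())


# Hypothetical example: two A subtrees, only one of which reaches a C node.
labels = {0: "R", 1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
tree_edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5)]

print(bpci_partition(labels, tree_edges, [], set(), set(), 1, 1, 0))
print(bpci_partition(labels, tree_edges, [], set(), set(), 1, 1, 2))
```

With td = 0 the two A nodes still share a class, since one forward round only separates the B nodes; raising td to 2 adds a second forward round that separates the A nodes as well, which illustrates how the parameters trade index size against the queries covered.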
Variations of BPCI
Testing if an Index covers a Query
• Build the query graph
• Check if all tags and IDREF edges in the query are in T and in (refbwd ∪ reffwd)
• Check if the tree depth of the query is less than td of the index
• Check if all paths in the query with even tree depth have length < kbwd
• Check if all paths in the query with odd tree depth have length < kfwd
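The checklist above can be sketched as a small predicate. The sketch assumes the query graph has already been built and summarised as a set of tags, a set of IDREF edge names, and one (tree depth, length) pair per path, and it takes the tree depth of the query to be the largest depth among its paths. The classes `Index` and `Query`, the function `covers`, and the sample values are hypothetical; the strict inequalities follow the slide as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    tags: frozenset      # T
    ref_fwd: frozenset   # IDREF edges indexed in the forward direction
    ref_bwd: frozenset   # IDREF edges indexed in the backward direction
    k_fwd: int
    k_bwd: int
    td: int


@dataclass(frozen=True)
class Query:
    tags: frozenset
    idrefs: frozenset
    paths: tuple         # (tree_depth, length) pairs, one per path in the query graph


def covers(index, query):
    """Return True if every check from the list above passes."""
    if not query.tags <= index.tags:
        return False
    if not query.idrefs <= (index.ref_bwd | index.ref_fwd):
        return False
    # Assumption: the query's tree depth is the maximum depth of its paths.
    depth = max((d for d, _ in query.paths), default=0)
    if not depth < index.td:
        return False
    for d, length in query.paths:
        # Even tree depth is bounded by k_bwd, odd tree depth by k_fwd.
        bound = index.k_bwd if d % 2 == 0 else index.k_fwd
        if not length < bound:
            return False
    return True


index = Index(tags=frozenset({"a", "b", "c"}), ref_fwd=frozenset({"r"}),
              ref_bwd=frozenset(), k_fwd=2, k_bwd=3, td=2)
q_ok = Query(frozenset({"a", "b"}), frozenset({"r"}), ((0, 2), (1, 1)))
q_too_long = Query(frozenset({"a"}), frozenset(), ((0, 3),))
q_bad_tag = Query(frozenset({"a", "d"}), frozenset(), ((0, 1),))

print(covers(index, q_ok), covers(index, q_too_long), covers(index, q_bad_tag))
```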
Results on the XMark benchmark
• Iall is the F&B index
• Ialmost-all is F&B with kfwd = 1
• Ispecific is built on the query
Result
Conclusion
• BPCI is a covering index for branching path queries.
• By setting the parameters appropriately, we can get indexes that cover a wide range of queries, suited to various applications.
• Extensions: updating and bulk loading; integration with value indexes.