This paper presents a new method for efficiently evaluating sparse graph reachability queries using core labeling. It introduces the concept of core trees and provides algorithms for generating core trees and compressing transitive closures. The methodology is applicable to various domains including XML data processing and gene-regulatory networks.
Core Labeling: A New Way to Compress Transitive Closure Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9
Outline • Motivation • Tree labeling • Main algorithm - Core tree - Graph labeling: Core-I - Graph labeling: Core-II • Conclusion
Motivation
• Efficient method to evaluate sparse graph reachability queries
Given a sparse directed graph G, check whether a node v is reachable from another node u through a path in G.
• Applications
XML data processing, gene-regulatory networks, and metabolic networks.
XML documents are often represented as trees. However, an XML document may contain IDREF/ID references that turn it into a directed but sparse graph: a tree structure plus a few reference links.
For a metabolic network, graph reachability models whether two genes interact with each other or whether two proteins participate in a common pathway. Many such graphs are sparse.
Motivation
• A simple method
- store the transitive closure G* of G as a boolean matrix M: O(n^2) space
[Figure: an example graph G on the nodes a, b, c, d, e and its transitive closure G*, each shown with its 5 x 5 boolean matrix M.]
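As a point of comparison for the labeling schemes that follow, the matrix method can be sketched in a few lines. The slides do not say how the closure is computed; the sketch below assumes Warshall's algorithm, and the 5-node graph is hypothetical rather than the one in the figure.

```python
def transitive_closure(n, edges):
    """Return an n x n boolean reachability matrix (Warshall's algorithm)."""
    m = [[False] * n for _ in range(n)]
    for u, v in edges:
        m[u][v] = True
    # After round k, m[i][j] is True iff j is reachable from i
    # using only intermediate nodes numbered 0..k.
    for k in range(n):
        for i in range(n):
            if m[i][k]:
                for j in range(n):
                    if m[k][j]:
                        m[i][j] = True
    return m


# Hypothetical graph: 4 -> 2 -> 0 and 3 -> 0.
closure = transitive_closure(5, [(4, 2), (2, 0), (3, 0)])
print(closure[4][0])  # node 0 is reachable from node 4 through node 2
```

A query is a single matrix lookup, but the matrix always takes n^2 entries, which is the cost the core labeling tries to avoid on sparse graphs.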
Tree labeling
• Tree encoding
Let G be a sparse graph. We first find a spanning tree T of G. Each node v in T is assigned an interval [start, end), where start is v’s preorder number and end - 1 is the largest preorder number among all the nodes in T[v]. Another node u labeled [start’, end’) is then a descendant of v (with respect to T) iff start’ ∈ [start, end).
Example (spanning tree T with its intervals):
a [0, 12)
  b [1, 5)
    c [2, 4)
      k [3, 4)
    d [4, 5)
  r [5, 9)
    e [6, 9)
      f [7, 8)
      g [8, 9)
  h [9, 12)
    i [10, 11)
    j [11, 12)
Let v and u be two nodes in T, labeled [a, b) and [a’, b’), respectively. If a ∈ [a’, b’), v is a descendant of u. In this case, we say that [a, b) is subsumed by [a’, b’), and we must also have b ≤ b’. If v and u are not on the same path in T, we have either a’ ≥ b or a ≥ b’. In the former case, we say that [a, b) is smaller than [a’, b’), denoted [a, b) ≺ [a’, b’). In the latter case, [a’, b’) is smaller than [a, b).
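The interval assignment can be sketched with one preorder walk. The function and variable names below are illustrative, and the child lists encode the example tree as read off its intervals; the recursion is only meant for small trees.

```python
def label_tree(children, root):
    """Assign each node the interval [start, end) from a preorder walk."""
    labels = {}
    next_number = 0

    def visit(v):
        nonlocal next_number
        start = next_number          # preorder number of v
        next_number += 1
        for child in children.get(v, []):
            visit(child)
        # next_number - 1 is now the largest preorder number in T[v].
        labels[v] = (start, next_number)

    visit(root)
    return labels


def is_descendant(labels, u, v):
    """True iff u lies in T[v], i.e. start(u) is inside v's interval."""
    start, end = labels[v]
    return start <= labels[u][0] < end


tree = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
        "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
labels = label_tree(tree, "a")
print(labels["e"], is_descendant(labels, "g", "r"))
```

Note that under this test a node counts as a descendant of itself, since its own start value lies in its own interval.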
Tree labeling
• Tree encoding
Interval sequences (label space):
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
k: [3, 4)
g: [2, 4)[8, 9)
c: [2, 4)
i: [10, 11)
j: [11, 12)
b: [1, 5)
r: [2, 4)[4, 5)[5, 9)
[Figure: the spanning tree T with the tree interval of each node.]
Main Algorithm
• Core tree (core of G)
Let T be a spanning tree of G. We denote by E’ the set of all the non-tree edges and by V’ the set of all the end points of the non-tree edges. Then V’ = Vstart ∪ Vend, where Vstart contains all the start nodes of the non-tree edges and Vend all the end nodes of the non-tree edges.
Definition 1. (anti-subsuming subset) A subset S ⊆ Vstart is called an anti-subsuming subset iff |S| > 1 and no two nodes in S are related by an ancestor-descendant relationship with respect to T.
Example. Vstart = {d, f, g, h}, Vend = {c, k, e, d, g}.
Anti-subsuming subsets: {d, f}, {d, g}, {d, h}, {f, g}, {f, h}, {g, h}, {d, f, g}, {d, f, h}, {d, g, h}, {f, g, h}, {d, f, g, h}
[Figure: the spanning tree T together with the non-tree edges of G.]
Main Algorithm
• Core tree (core of G)
Definition 2. (critical node) A node v in a spanning tree T of G is critical if v ∈ Vstart or there exists an anti-subsuming subset S = {v1, v2, ..., vk} for k ≥ 2 such that v is the lowest common ancestor of v1, v2, ..., vk. We denote by Vcritical the set of all critical nodes.
In the example graph, node e is the lowest common ancestor of {f, g}, and node a is the lowest common ancestor of {d, f, g, h}, so e and a are critical nodes. In addition, each v ∈ Vstart is a critical node. All the critical nodes of G with respect to T are therefore {d, f, g, h, e, a}.
Main Algorithm
• Core tree (core of G)
Definition 3. (core of G) Let G = (V, E) be a directed graph and let T be a spanning tree of G. The core of G with respect to T is a tree with node set Vcritical, in which there is an edge from u to v (u, v ∈ Vcritical) iff there is a path p from u to v in T and p contains no other critical nodes. The core of G with respect to T is denoted Gcore = (Vcore, Ecore).
Example. In Gcore, the root a has the children d, e and h, and e has the children f and g. The interval sequences of the core nodes are:
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
g: [2, 4)[8, 9)
[Figure: the spanning tree T next to Gcore.]
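Definitions 2 and 3 can be checked directly on the example. The sketch below is not the core-generation algorithm of the slides; it relies on the observation that a node outside Vstart is the lowest common ancestor of an anti-subsuming subset exactly when at least two of its child subtrees contain a node of Vstart. All names are illustrative.

```python
def core_tree(children, root, v_start):
    """Return (critical nodes, parent map of the core) per Definitions 2 and 3."""
    critical = set()

    def mark(v):
        # Count the child subtrees of v that contain a start node.
        hits = sum(1 for c in children.get(v, []) if mark(c))
        if v in v_start or hits >= 2:
            critical.add(v)
        return v in v_start or hits >= 1   # does T[v] contain a start node?

    mark(root)

    parent = {}

    def link(v, nearest):
        # nearest = closest critical proper ancestor of v in T, if any.
        if v in critical:
            if nearest is not None:
                parent[v] = nearest
            nearest = v
        for c in children.get(v, []):
            link(c, nearest)

    link(root, None)
    return critical, parent


tree = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
        "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
critical, core_parent = core_tree(tree, "a", {"d", "f", "g", "h"})
print(sorted(critical), core_parent)
```

On the example tree this marks a, d, e, f, g and h as critical and attaches d, e and h to a, and f and g to e.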
Main Algorithm
• Core generation
Algorithm core-generation(T)
1. Mark every node in T that belongs to Vstart.
2. Let v be the first marked node encountered during the bottom-up search of T. Create the first node for v in Gcore.
3. Let u be the currently encountered node in T. Let u’ be the node in T for which a node in Gcore was created just before u is met. Do (4) or (5), depending on whether u is a marked node or not.
4. If u is a marked node, then do the following.
(a) If u’ is not a child (descendant) of u, create a link from u to u’, called a left-sibling link and denoted left-sibling(u) = u’.
(b) If u’ is a child (descendant) of u, first create a link from u’ to u, called a parent link and denoted parent(u’) = u. Then go along the left-sibling chain starting from u’ until we meet a node u’’ that is not a child (descendant) of u. For each encountered node w except u’’, set parent(w) ← u. Set left-sibling(u) ← u’’. Remove left-sibling(w) for each child w of u.
5. If u is a non-marked node, then do the following.
(c) If u’ is not a child (descendant) of u, no node is created.
(d) If u’ is a child (descendant) of u, go along the left-sibling chain starting from u’ until we meet a node u’’ that is not a child (descendant) of u. If the number of nodes encountered during the chain navigation (not including u’’) is more than 1, create a new node for u in Gcore and do the same operations as in (4.b). Otherwise, no node is created.
Main Algorithm
• Core tree (core of G)
[Figure: navigation along a left-sibling chain from u’ to the first node u’’ that is not a child of u, followed by panels (a) to (f) tracing the core generation on the example tree.]
Main Algorithm
• Graph labeling: Core-I
Definition 4. Let Vcore = {v1, ..., vg} be the node set of Gcore. The core label for G is a set {L(v1), ..., L(vg)}, where each L(vl) (l = 1, ..., g) is an interval sequence associated with vl, satisfying the following two properties:
(1) Let L(vl) = [al1, bl1), ..., [alr, blr) for some r. Then, for any i, j ∈ {1, ..., r}, alj ≥ bli if i < j. That is, [ali, bli) ≺ [alj, blj) for i < j. (In this sense, the intervals in L(vl) are considered to be sorted.)
(2) Let [a, b) be the interval associated with a descendant of vl with respect to G. There exists an interval [ali, bli) (1 ≤ i ≤ r) in L(vl) such that a ∈ [ali, bli).
Definition 5. (link graph) Let G = (V, E) be a directed graph and let T be a spanning tree of G. The link graph of G with respect to T is a graph, denoted Glink, with the node set V’ (the end points of all the non-tree edges) and the edge set E’ ∪ E’’, where (v, u) ∈ E’’ iff v ∈ Vend, u ∈ Vstart, and there exists a path from v to u in T.
Main Algorithm
• Graph labeling: Core-I
Gcom = Gcore ∪ Glink
The nodes of Gcom are handled in reverse topological order. Interval sequences of the nodes of Gcom:
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
k: [3, 4)
g: [2, 4)[8, 9)
c: [2, 4)
[Figure: Glink over the nodes e, h, g, c, d, f, k, and Gcom with the tree interval of each node.]
Main Algorithm - Generation of interval sequences
1. Scan the nodes of Gcom in reverse topological order.
2. For each node v, the interval sequence L(v) is stored in a linked list Av. Initially, Av contains only one interval, which is generated by labeling T.
3. Let v1, ..., vk be the children of v (in Gcom). Merge Av with each Avl for the child node vl (l = 1, ..., k) as follows. Assume Av = p1 p2 ... pg and Avl = q1 q2 ... qh, and that both Av and Avl are increasingly ordered. (As we will see, any interval sequence generated by the following algorithm has this property: it contains only intervals that are not on the same path in T. Initially, Av contains only one interval, so it is trivially sorted.)
Main Algorithm - Generation of interval sequences
4. We step through both Av and Avl from left to right. Let pi = [ai, bi) and qj = [aj, bj) be the intervals encountered. We perform the following checks.
(i) If ai ≥ bj, insert qj into Av after pi-1 and before pi, and move to qj+1.
(ii) If ai ∈ [aj, bj), remove pi from Av and move to pi+1. (* pi is subsumed by qj. *)
(iii) If aj ∈ [ai, bi), ignore qj and move to qj+1. (* qj is subsumed by pi, but it should not be removed from Avl. *)
(iv) If aj ≥ bi, ignore pi and move to pi+1.
(v) If ai = aj and bi = bj, ignore both pi and qj, and move to pi+1 and qj+1.
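Main Algorithm - Generation of interval sequences
Example. Merging A2 = [2, 4)[8, 9) into A1 = [3, 4)[4, 5)[7, 8), with p scanning A1 and q scanning A2:
1. A1: [3, 4)[4, 5)[7, 8), p = [3, 4), q = [2, 4): p is subsumed by q and is removed (case ii).
2. A1: [4, 5)[7, 8), p = [4, 5), q = [2, 4): q is inserted before p (case i).
3. A1: [2, 4)[4, 5)[7, 8), q = [8, 9): p moves past [4, 5) and [7, 8) (case iv) until p = nil.
4. The remaining q is appended. A1: [2, 4)[4, 5)[7, 8)[8, 9). A2 is unchanged: [2, 4)[8, 9).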
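The merge step can be sketched as follows. This is a minimal version under two assumptions that the slides do not spell out: the result is built as a new Python list instead of editing a linked list in place, and whatever remains of either sequence when the other is exhausted is appended. It also assumes that any two intervals are either nested or disjoint, which holds for tree intervals. The name merge is illustrative.

```python
def merge(av, avl):
    """Merge the sorted interval sequence avl into av; neither input is modified."""
    result = []
    i = j = 0
    while i < len(av) and j < len(avl):
        (ai, bi), (aj, bj) = av[i], avl[j]
        if (ai, bi) == (aj, bj):      # case (v): same interval, keep one copy
            result.append(av[i])
            i += 1
            j += 1
        elif ai >= bj:                # case (i): q lies before p, insert q
            result.append(avl[j])
            j += 1
        elif aj >= bi:                # case (iv): p lies before q, keep p
            result.append(av[i])
            i += 1
        elif aj <= ai < bj:           # case (ii): p is subsumed by q, drop p
            i += 1
        else:                         # case (iii): q is subsumed by p, skip q
            j += 1
    result.extend(av[i:])             # leftover intervals of av
    result.extend(avl[j:])            # leftover intervals of avl
    return result


a1 = merge([(3, 4), (4, 5), (7, 8)], [(2, 4), (8, 9)])
label_e = merge([(6, 9)], a1)
print(a1, label_e)
```

The first call reproduces the trace of the example; the second call merges the result into the tree interval [6, 9) of node e, which absorbs [7, 8) and [8, 9) and gives the sequence [2, 4)[4, 5)[6, 9) listed for e.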
Main Algorithm - Core labels
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
d: [3, 4)[4, 5)
f: [3, 4)[4, 5)[7, 8)
g: [2, 4)[8, 9)
[Figure: Gcore with the core label of each node.]
Main Algorithm - Non-tree labeling
Let Vcore = {v1, ..., vj}. We store the core label of G as a list: s1 = L(v1), ..., sj = L(vj). Then, we define a function f: Vcore → {1, ..., j} such that for each v ∈ Vcore, f(v) = i iff si = L(v). Based on the above concepts, we define Core-I below.
f(a) = 1   s1: L(a) = [0, 12)
f(h) = 2   s2: L(h) = [2, 4)[4, 5)[6, 9)[9, 12)
f(e) = 3   s3: L(e) = [2, 4)[4, 5)[6, 9)
f(f) = 4   s4: L(f) = [3, 4)[4, 5)[7, 8)
f(d) = 5   s5: L(d) = [3, 4)[4, 5)
f(g) = 6   s6: L(g) = [2, 4)[8, 9)
Main Algorithm - Non-tree labeling
Each node v in V is associated with two nodes, v- and v*:
v- : a critical node in T[v] that is closest to v.
v* : the lowest ancestor of v (in T) that has an incoming non-tree edge.
Example. r- = e, r* does not exist. e- = e, e* = e.
Main Algorithm - Non-tree labeling
Definition (Core-I) Let v be a node in G. The non-tree label of v is a pair <d, t>, where
- d = i if v- exists and f(v-) = i. If v- does not exist, let d be the special symbol “-”.
- t = [x, y) if v* exists and [x, y) is the interval of v*. If v* does not exist, let t be “-”.
Example (node, tree interval, non-tree label):
a  [0, 12)   <1, ->
b  [1, 5)    <5, ->
c  [2, 4)    <-, [2, 4)>
k  [3, 4)    <-, [3, 4)>
d  [4, 5)    <5, [4, 5)>
r  [5, 9)    <3, ->
e  [6, 9)    <3, [6, 9)>
f  [7, 8)    <4, [6, 9)>
g  [8, 9)    <6, [8, 9)>
h  [9, 12)   <2, ->
i  [10, 11)  <-, ->
j  [11, 12)  <-, ->
Main Algorithm - Non-tree labeling
Proposition. Assume that u and v are two nodes in G, labeled ([a1, b1), <x1, y1>) and ([a2, b2), <x2, y2>), respectively. Node v is reachable from u iff one of the following conditions holds:
(i) [a2, b2) is subsumed by [a1, b1), or
(ii) there exists an interval [a, b) in sx1 such that for y2 = [a’, b’) we have a’ ∈ [a, b) (i.e., y2 is subsumed by [a, b)).
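The two conditions translate into a short query routine. The sketch below hard-codes the tree intervals, non-tree labels and core labels of the running example; the names reachable and covers are illustrative, None stands for the special symbol, and the binary search over the sorted start points is one way to obtain a logarithmic lookup.

```python
from bisect import bisect_right

# s1..s6: the core labels of a, h, e, f, d, g.
core_labels = {
    1: [(0, 12)],
    2: [(2, 4), (4, 5), (6, 9), (9, 12)],
    3: [(2, 4), (4, 5), (6, 9)],
    4: [(3, 4), (4, 5), (7, 8)],
    5: [(3, 4), (4, 5)],
    6: [(2, 4), (8, 9)],
}
core_starts = {i: [lo for lo, _ in seq] for i, seq in core_labels.items()}

# node -> (tree interval, d, t) with the non-tree label <d, t>.
labels = {
    "a": ((0, 12), 1, None),   "b": ((1, 5), 5, None),
    "c": ((2, 4), None, (2, 4)), "k": ((3, 4), None, (3, 4)),
    "d": ((4, 5), 5, (4, 5)),  "r": ((5, 9), 3, None),
    "e": ((6, 9), 3, (6, 9)),  "f": ((7, 8), 4, (6, 9)),
    "g": ((8, 9), 6, (8, 9)),  "h": ((9, 12), 2, None),
    "i": ((10, 11), None, None), "j": ((11, 12), None, None),
}


def covers(index, point):
    """True iff some interval of the sorted sequence s_index contains point."""
    k = bisect_right(core_starts[index], point) - 1
    return k >= 0 and point < core_labels[index][k][1]


def reachable(u, v):
    (a1, b1), x1, _ = labels[u]
    (a2, _), _, y2 = labels[v]
    if a1 <= a2 < b1:                 # condition (i): tree descendant
        return True
    if x1 is None or y2 is None:      # u- or v* does not exist
        return False
    return covers(x1, y2[0])          # condition (ii)


print(reachable("r", "c"), reachable("h", "b"))
```

For example, c is reported reachable from r because the start point 2 of c's t-component lies in the interval [2, 4) of s3, while b is not reachable from h because b has no t-component and does not lie in the subtree of h.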
Main Algorithm
• Graph labeling: Core-II
We can store the core label of G as a d × g boolean matrix M, where d is the number of the end nodes of all non-tree edges and g is the number of the nodes in Gcore. Let u1, u2, ..., ud be all the end nodes of the non-tree edges and let v1, v2, ..., vg be all the nodes in Gcore. Assign each ui an index, denoted index(ui) (i.e., u1, u2, ..., ud are assigned contiguous integers, starting from 0). Assign each vj an index, denoted index’(vj). An entry M[index(ui), index’(vj)] is set to 1 if there exists an interval [a’, b’) in L(vj) such that for ui’s interval [a, b) we have a ∈ [a’, b’); otherwise, it is set to 0.
Example.
index(c) = 0, index(k) = 1, index(d) = 2, index(e) = 3, index(g) = 4
index’(a) = 0, index’(h) = 1, index’(e) = 2, index’(f) = 3, index’(d) = 4, index’(g) = 5
The matrix, shown with one row per core node (index’) and one column per end node (index):
     0 1 2 3 4
0    1 1 1 1 1
1    1 1 1 1 1
2    1 1 1 1 1
3    0 1 1 0 0
4    0 1 1 0 0
5    1 1 0 0 1
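Filling the matrix follows directly from the rule for its entries. The sketch below uses the end-node intervals and core labels of the running example and a plain linear scan per entry; the function name build_matrix and the use of dictionary order for the two index assignments are illustrative choices.

```python
# Tree intervals of the end nodes of the non-tree edges, in index order.
end_nodes = {"c": (2, 4), "k": (3, 4), "d": (4, 5), "e": (6, 9), "g": (8, 9)}

# Core labels of the nodes of Gcore, in index' order.
core_labels = {
    "a": [(0, 12)],
    "h": [(2, 4), (4, 5), (6, 9), (9, 12)],
    "e": [(2, 4), (4, 5), (6, 9)],
    "f": [(3, 4), (4, 5), (7, 8)],
    "d": [(3, 4), (4, 5)],
    "g": [(2, 4), (8, 9)],
}


def build_matrix(end_nodes, core_labels):
    """Return (M, index, index_prime) with M[index[u]][index_prime[v]] in {0, 1}."""
    index = {u: i for i, u in enumerate(end_nodes)}
    index_prime = {v: j for j, v in enumerate(core_labels)}
    m = [[0] * len(core_labels) for _ in end_nodes]
    for u, (a, _) in end_nodes.items():
        for v, seq in core_labels.items():
            # 1 iff the start point of u's interval falls inside an interval of L(v).
            if any(lo <= a < hi for lo, hi in seq):
                m[index[u]][index_prime[v]] = 1
    return m, index, index_prime


M, index, index_prime = build_matrix(end_nodes, core_labels)
print(M[index["e"]])
```

Each row of M here corresponds to one end node, matching the d × g orientation of the definition; a query then needs only one lookup, which is where the O(1) query time of Core-II comes from.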
Conclusion
A new algorithm for graph reachability
- Core tree
- Graph labeling: Core-I
  query time: O(log(min{b, s}))
  labeling time: O(n + e + t·min{b, s})
  space overhead: O(n + s·min{b, s})
- Graph labeling: Core-II
  query time: O(1)
  labeling time: O(n + e + t·min{b, s} + d·s log(min{b, s}))
  space overhead: O(n + d·s)
Evaluation of Twig Pattern Queries Based on Ordered Tree matching Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9
Outline
• Motivation
• Algorithm for tree pattern query evaluation based on ordered tree matching
- Tree encoding
- Algorithm description
• Index-based algorithm
• Conclusion
Motivation
• XPath evaluation against XML documents
- XPath expressions:
a[b[c and .//d]]/b[c and e//d]
book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’]
- XML document:
<document> <book> <title> Art of Programming </title> <author> <fn>Donald Knuth</fn> … …
[Figure: the twig pattern of the first expression (nodes a, b, b, c, d, c, e, d) and the document tree book(title, author(fn, ln)) with the values Art of Programming, Donald and Knuth.]
Motivation
• XPath evaluation against XML documents
- Evaluation based on unordered tree matching
Definition. An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: For each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: If (u, v) is a /-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a //-edge in Q, then f(v) is a descendant of f(u) in T.
[Figure: an XPath expression drawn as a twig pattern Q with the nodes a, b, c, and a document tree T with the nodes a, b, c, d, e, f, g.]
Motivation
• XPath evaluation against XML documents
- Evaluation based on ordered tree matching
XPath expression:
a[b[c/following-sibling:: .//d]]/following-sibling::b[c/following-sibling:: e//d]
Motivation
• XPath evaluation against XML documents
- Evaluation based on ordered tree matching
Definition. An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: For each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: If (u, v) is a /-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a //-edge in Q, then f(v) is a descendant of f(u) in T.
(iii) Preserve sibling order: For any two nodes v1 ∈ Q and v2 ∈ Q, if v1 is to the left of v2, then f(v1) is to the left of f(v2) in T.
[Figure: a document tree T with the nodes v1, ..., v6 and a twig pattern Q with the nodes q1, q2, q3.]
Algorithm for tree pattern query evaluation
• Tree encoding
Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as a(v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
(i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
(ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
(iii) from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
Algorithm for tree pattern query evaluation
• Tree encoding
Example. T:
v6 (1, 1, 9, 1)
  v4 (1, 2, 7, 2)
    v1 (1, 3, 3, 3)
    v3 (1, 4, 6, 3)
      v2 (1, 5, 5, 4)
  v5 (1, 8, 8, 2)
[Figure: the document tree T with node labels a, b and c.]
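The three relationships of the encoding can be checked mechanically on the quadruples of the example. The sketch below uses an illustrative Region tuple and function names that are not part of the slides.

```python
from collections import namedtuple

# (DocId, LeftPos, RightPos, LevelNum)
Region = namedtuple("Region", "doc left right level")


def is_ancestor(v1, v2):
    """v1 is an ancestor of v2: same document and v1's range encloses v2's."""
    return v1.doc == v2.doc and v1.left < v2.left and v1.right > v2.right


def is_parent(v1, v2):
    """v1 is the parent of v2: an ancestor exactly one level above."""
    return is_ancestor(v1, v2) and v2.level == v1.level + 1


def is_left_of(v1, v2):
    """v1 is to the left of v2: v1 ends before v2 starts."""
    return v1.doc == v2.doc and v1.right < v2.left


v1 = Region(1, 3, 3, 3)
v2 = Region(1, 5, 5, 4)
v3 = Region(1, 4, 6, 3)
v4 = Region(1, 2, 7, 2)
v5 = Region(1, 8, 8, 2)
v6 = Region(1, 1, 9, 1)

print(is_ancestor(v6, v2), is_parent(v3, v2), is_left_of(v1, v3))
```

Each test needs only a constant number of comparisons, so structural relationships can be decided from the quadruples alone without walking the tree.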
Algorithm for tree pattern query evaluation
• Main algorithm
1. First, we number both T and Q in postorder, so the nodes in both trees are referenced by their postorder numbers.
2. We access the nodes in T and the nodes in Q in the order of their postorder numbers. Each time we meet a node i in Q, we associate it with an array Ai of length |T|, indexed from 0 to |T| - 1. The Ai’s are manipulated as follows.
[Figure: T with the nodes v1, ..., v6 and Q with the nodes q1, q2, q3, each numbered in postorder.]
Algorithm for tree pattern query evaluation
(i) We set a virtual node for T, numbered 0, which is considered to be to the left of any node in T.
(ii) If we find that Q[i] can be embedded in T[j], we set Ai[j1], ..., Ai[jk] (0 ≤ k ≤ j - 1) to j, where each jl (0 ≤ l ≤ k) is a node to the left of j, to record the fact that j is the closest node to the right of jl such that T[j] embeds Q[i].
[Figure: T with the virtual node v0, Q, and two snapshots of the array A2 (indexed 0 to 5).]
Algorithm for tree pattern query evaluation
(iii) If some time later we find another node p such that Q[i] can be embedded in T[p], we set Ai[p1], ..., Ai[pq] to p, where each ps (1 ≤ s ≤ q) is to the left of p but to the right of jk.
• For all the other nodes j’ such that T[j’] embeds Q[i], we set values for the entries in Ai in the same way as in (ii) and (iii).
3. During the process, when we meet i in Q and j in T, we do the following. Let i1, ..., ik be the child nodes of i in Q. We first check Ai1, starting from Ai1[l], where l = min{desc(j)} - 1 and desc(j) represents all the descendants of j. We begin the search from min{desc(j)} - 1 because it is the closest node to the left of the descendant of j that has the least postorder number. Let Ai1[l] = j’. If (i, i1) is a /-edge, we check whether (j, j’) is a /-edge. Otherwise, we only check whether j’ is a descendant of j. If this is not the case, we check Ai1[j’]. This process continues until one of the following conditions is satisfied:
(i) Ai1 is exhausted (we cannot find a descendant j’’ of j such that T[j’’] contains Q[i1]); or
(ii) we find a j’’ satisfying the parent-child or ancestor-descendant relationship, depending on whether (i, i1) is a /-edge or a //-edge. Then, we check Ai2[j’’].
Algorithm for tree pattern query evaluation
• If Ai1 is exhausted (case (i)), Q[i1] cannot be embedded in any subtree rooted at a child node (for a /-edge) or a descendant (for a //-edge) of j. This indicates that Q[i1] cannot be embedded into T[j], and thus T[j] cannot embed Q[i]. We continue by checking i against the next node in T.
• In case (ii), we check Ai2, starting from Ai2[j’’]. For all the other Ail’s (l = 3, ..., k), we do the same checks. If for each il (l = 1, ..., k) we can find a j’ such that T[j’] embeds Q[il], then T[j] embeds Q[i] and we set some new values in Ai as described in (2).
[Figure: Q with node i and its children i1, i2, ..., ik; T with node j, its descendants j’ and j’’, and the position l; the arrays Ai1 and Ai2 with the entries j’ and j’’.]
Algorithm for tree pattern query evaluation
Example.
[Figure: panels (a) to (f) tracing the arrays A1, A2 and A3 (each indexed 0 to 5) while Q (nodes q1, q2, q3) is matched against T (nodes v0, v1, ..., v6).]
The time complexity of the algorithm is O(|T||Q|).
Index-based algorithm
• XB-tree
An XB-tree is a variant of the B+-tree over a sequence of quadruples.
The quadruples of T, sorted by RightPos values (with node labels):
(1, 3, 3, 3) c, (1, 5, 5, 4) b, (1, 4, 6, 3) c, (1, 2, 7, 2) b, (1, 8, 8, 2) c, (1, 1, 9, 1) a
Pages (entries shown as LeftPos, RightPos):
P1 (root): [3, 5] [2, 7] [1, 9]
P2: (3, 3) (5, 5)
P3: (4, 6) (2, 7)
P4: (8, 8) (1, 9)
[Figure: the document tree T with its quadruples, and the XB-tree with the P.parent and P.parentIndex pointers from the pages P2, P3, P4 to the root page P1.]
Index-based algorithm
• Searching an XB-tree
- b = (P, i) indicates that the ith entry in the page P is currently accessed.
- advance(b) (going up from a page to its parent): If b = (P, i) does not point to the last entry of P, i ← i + 1. Otherwise, b ← (P.parent, P.parentIndex).
- drilldown(b) (going down from a page to one of its children): If b = (P, i) and P is not a leaf page, b ← (P’, 1), where P’ is the ith child page of P.
- Initially, b ← (rootPage, 1), pointing to the first entry in the root page. We finish a traversal of the XB-tree when b = (rootPage, last), where last points to the last entry in the root page, and we advance it (in this case, we set b to nil).
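The cursor operations can be sketched literally from the description above. The Page class, its field names and the page contents are illustrative; the three leaf pages are assumed to sit at the entries 1, 2 and 3 of the root, and None plays the role of nil.

```python
class Page:
    """A page with 1-based entries; children is empty for a leaf page."""

    def __init__(self, entries, children=None):
        self.entries = entries
        self.children = children or []
        self.parent = None
        self.parent_index = None
        for k, child in enumerate(self.children, start=1):
            child.parent = self        # P.parent
            child.parent_index = k     # P.parentIndex


def advance(b):
    """Move to the next entry, or up to the parent entry after the last one."""
    page, i = b
    if i < len(page.entries):
        return (page, i + 1)
    if page.parent is None:            # last entry of the root page: traversal done
        return None
    return (page.parent, page.parent_index)


def drilldown(b):
    """Move to the first entry of the ith child page; a no-op on a leaf page."""
    page, i = b
    if not page.children:
        return b
    return (page.children[i - 1], 1)


p2 = Page([(3, 3), (5, 5)])
p3 = Page([(4, 6), (2, 7)])
p4 = Page([(8, 8), (1, 9)])
root = Page([(3, 5), (2, 7), (1, 9)], [p2, p3, p4])

b = drilldown((root, 1))   # (p2, 1)
b = advance(b)             # (p2, 2)
b = advance(b)             # back up to (root, 1)
print(b == (root, 1), advance((root, 3)))
```

Advancing past the last entry of a leaf page returns to the parent entry that led to it, so the caller has to advance once more to continue with the next subtree.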
Index-based algorithm
• Searching an XB-tree
Assume that i in Q is the node currently encountered. By searching the XB-tree, we find a node j of T with label(i) = label(j) for which it is possible that T[j] embeds Q[i].
- L(i): the most recently found node such that Q[i] can be embedded into T[L(i)].
Procedure search(XB, i)
1. Let i1, ..., ik be the children of i. Assume that L(ik) = v. l ← v.LeftPos; r ← v.RightPos. If i is a leaf node, then l ← ∞, r ← 0.
2. Assume that b = (P, c). Let j be the entry pointed to by b. We perform the following checks.
(i) If P is a leaf page, label(j) = label(i), j.LeftPos < l and j.RightPos > r, then advance(b) and return j.
(ii) If P is an internal page, j.LeftPos < l and j.RightPos > r, then drilldown(b).
(iii) If j.RightPos < r, then advance(b). If b = nil, return nil.
3. Repeat (2) until the whole XB-tree is traversed (i.e., when b = nil) or a node j is found (i.e., the condition in (2)-(i) is satisfied).
Conclusion
• Algorithm for evaluating tree pattern queries based on ordered tree matching
- time complexity: O(|T||Q|)
- space complexity: O(|T||Q|)
• The algorithm can be integrated into an index environment by using XB-trees.