Approximate Labelled Subtree Homeomorphism
Based on:
“Approximate Labelled Subtree Homeomorphism”, R. Y. Pinter, O. Rokhlenko, D. Tsur, M. Ziv-Ukelson
“Alignment of Metabolic Pathways”, R. Y. Pinter, O. Rokhlenko, E. Yeger-Lotem, M. Ziv-Ukelson
The general idea
• Biological problem
• Converting it into a computer science problem
• Finding the solution
• Reverting back to biological terms
[Figure: signal transduction pathway. Labels: Ag stimuli, IL-12, IL-12R, Thnp, Th1, TNF-a, Stat 4, IFN-γ, T-Bet, IL-2, proliferation.]
Why pathways?
• Metabolic and regulatory pathways have biological importance.
• These pathways are evolutionarily conserved.
What do we want to do?
• Compare a metabolic pathway of one organism against the same pathway in other organisms.
• Compare a metabolic pathway against other metabolic pathways in the same organism.
The subtree homeomorphism problem:
Given a pattern tree P and a text tree T, find a subtree of T that is homeomorphic to P, that is, isomorphic to P once degree-2 nodes are allowed to be deleted from the text tree, or decide that no such subtree exists.
Graph homeomorphism
[Figure: a text tree and a pattern tree with colored nodes. The comparison takes into account both labels (similarity) and topology.]
Back to 2nd semester…
• An unrooted tree is an undirected, acyclic, connected graph T = (V_T, E_T).
• A rooted tree is a triple T_r = (V_T, E_T, r), where (V_T, E_T) is an unrooted tree and r is a vertex in V_T called the root. The root implies a direction for all the edges in the graph.
• A multi-source tree is an acyclic, directed graph whose underlying undirected graph is a tree.
Back to 2nd semester…
A tree is said to be ordered if the relative order of the subtrees at each node is fixed. Otherwise the tree is unordered.
[Figure: a problem case for “ordered” trees.]
What are we allowed to do?
• We take into account both label similarity and topology.
• We are permitted to delete vertices from the text tree.
• We are NOT permitted to delete vertices from the pattern tree.
Some definitions:
• Let Δ denote a predefined node-to-node similarity score table.
• Let D denote a predefined score for deleting a node from a tree (usually a penalty).
• A mapping M from T1 to T2 is a partial one-to-one map from the nodes of T1 to the nodes of T2 that preserves the ancestor relations of the nodes.
Our problem:
Let M be a mapping from T1 to T2. The Labelled Subtree Homeomorphic Similarity Score of M[T1,T2] is:
LSH(M[T1,T2]) = D · (|T1| − |T2|) + ∑_{(u,v)∈M} Δ[u,v]
Given two undirected labeled trees P and T, we want to find a mapping M and a subtree t of T such that LSH(M[t,P]) is maximal.
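The score formula above can be evaluated directly once a mapping is fixed. The following is a minimal sketch, not the authors' code: the function name `lsh_score`, the dictionary representation of Δ, and the example values are all assumptions made for illustration.

```python
def lsh_score(mapping, size_t1, size_t2, delta, deletion_score):
    """LSH(M[T1,T2]) = D * (|T1| - |T2|) + sum of delta[(u, v)] over mapped pairs.

    mapping        -- list of (u, v) pairs, u in T1 and v in T2 (hypothetical format)
    delta          -- dict holding the node-to-node similarity table
    deletion_score -- the score D for one deleted node (usually negative)
    """
    deleted = size_t1 - size_t2  # nodes of T1 that are deleted
    return deletion_score * deleted + sum(delta[(u, v)] for u, v in mapping)


# Hypothetical example: T1 has 3 nodes, T2 has 2 nodes, so one node is deleted.
delta = {("a", "x"): 5, ("c", "y"): 3}
score = lsh_score([("a", "x"), ("c", "y")], size_t1=3, size_t2=2,
                  delta=delta, deletion_score=-1)
print(score)
```

With these made-up numbers the single deletion costs 1 and the two matched pairs contribute 5 + 3.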
Scoring
[Figure: a text tree and a pattern tree with candidate mappings, scored 2, 5 and 2.]
Dynamic programming
[Figure: pattern P with node u and children x1, x2; text T with node v and children y1, y2, y3.]
RScore[v,u] is the maximum of two terms:
• The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximal assignment over G. This term is only computed if c(u) ≤ c(v) (otherwise it is -∞).
• The weight RScore[yj,u] for the comparison of u with the best-scoring child yj of v, updated with the penalty for deleting v.
Here c(u) is the number of children of u.
RScore[v,u] - example
[Figure: pattern with nodes u, w and text with nodes a, b, v, shown with a score matrix; deletion = -1.]
max{5, 10 - 1} = 9
max{3, 9 - 1} = 8
The assignment problem
Let G = (V = X ∪ Y, E) be a bipartite graph with a weight w(x,y) on every edge. The assignment problem is to compute a matching M (a list of monogamous pairs) such that:
• The size of M is maximal among all matchings.
• Among all matchings of maximal size, the sum of the weights is maximal.
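For very small instances the definition above can be checked by exhaustive search. This is only an illustrative sketch under stated assumptions: the graph is complete bipartite with |X| ≤ |Y| (so every maximum-size matching covers all of X), weights are given as a matrix, and the name `best_assignment` is made up. It is exponential and is not the method used in the slides.

```python
from itertools import permutations


def best_assignment(weights):
    """Exhaustively solve the assignment problem on a complete bipartite graph.

    weights[i][j] is w(x_i, y_j); assumes len(X) <= len(Y).
    Returns (total weight, list of (i, j) pairs).
    """
    k = len(weights)
    l = len(weights[0]) if k else 0
    if k > l:
        raise ValueError("this sketch assumes |X| <= |Y|")
    best_total, best_pairs = float("-inf"), []
    # Every injective choice of k distinct columns is a maximum-size matching.
    for cols in permutations(range(l), k):
        total = sum(weights[i][cols[i]] for i in range(k))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


# Hypothetical weights for X = {x0, x1} and Y = {y0, y1, y2}.
total, pairs = best_assignment([[4, 1, 3], [2, 0, 5]])
print(total, pairs)
```

Here x0 is paired with y0 and x1 with y2, which no other choice of two distinct columns beats.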
Solving the assignment problem
• Reduction from the assignment problem to the min cost max flow problem: among all the maximal flows, we choose the cheapest.
• We construct G’, which contains G(V,E) with the following changes:
• Two more vertices: s and t.
• Edges from s to every node of X and from every node of Y to t, with w(s,x) = 0 and w(y,t) = 0.
• The cost of every other edge in E is -w(x,y).
• The capacity of all edges is 1.
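The reduction can be sketched in code as follows. This is a hedged illustration, not the algorithm analysed in the slides: it uses plain successive shortest paths with a queue-based Bellman-Ford search, which is simpler but slower than the Fredman-Tarjan or Gabow-Tarjan bounds. The function name, the node numbering and the weight-matrix input are assumptions.

```python
from collections import deque


def assignment_by_min_cost_flow(weights):
    """Maximum-size, maximum-weight matching via min cost max flow.

    weights[i][j] = w(x_i, y_j) on a complete bipartite graph.
    Node ids (an assumption of this sketch): s = 0, x_i = 1 + i,
    y_j = 1 + k + j, t = 1 + k + l.
    """
    k = len(weights)
    l = len(weights[0]) if k else 0
    s, t = 0, 1 + k + l
    graph = [[] for _ in range(t + 1)]  # adjacency lists of edge indices
    to, cap, cost = [], [], []          # edge e ^ 1 is the reverse of edge e

    def add_edge(u, v, c):
        graph[u].append(len(to)); to.append(v); cap.append(1); cost.append(c)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-c)

    for i in range(k):
        add_edge(s, 1 + i, 0)                            # w(s, x) = 0
        for j in range(l):
            add_edge(1 + i, 1 + k + j, -weights[i][j])   # cost is -w(x, y)
    for j in range(l):
        add_edge(1 + k + j, t, 0)                        # w(y, t) = 0

    total_cost = 0
    while True:
        # Cheapest s-t path in the residual graph (handles negative costs).
        dist = [float("inf")] * (t + 1)
        prev_edge = [-1] * (t + 1)
        in_queue = [False] * (t + 1)
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                    dist[to[e]] = dist[u] + cost[e]
                    prev_edge[to[e]] = e
                    if not in_queue[to[e]]:
                        in_queue[to[e]] = True
                        queue.append(to[e])
        if dist[t] == float("inf"):
            break  # no augmenting path left: the flow is maximal
        v = t
        while v != s:  # push one unit of flow (all capacities are 1)
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]
        total_cost += dist[t]

    # Saturated forward edges x_i -> y_j form the matching.
    pairs = [(i, to[e] - 1 - k) for i in range(k) for e in graph[1 + i]
             if e % 2 == 0 and cap[e] == 0]
    return -total_cost, pairs


result = assignment_by_min_cost_flow([[4, 1, 3], [2, 0, 5]])
print(result)
```

Negating the edge weights turns "heaviest matching" into "cheapest flow", and the unit capacities force each node to be used at most once.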
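From assignment to matching
[Figure: node v with children y1, y2, y3 and node u with children x1, x2, next to the corresponding flow network with source s, the x nodes, the y nodes and sink t.]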
Time complexity analysis
• Edmonds and Karp’s algorithm: O(E V log V)
• Fredman and Tarjan: O(V E + V² log V) (independent of the edge costs)
• Gabow and Tarjan: O(√V · E log(VC)), where the input costs are integers in the range [-C,…,C] (the similarity assumption)
What did we have so far?
• Motivation
• “Advanced” homeomorphism: labels and topology
• Scoring and deletion
• Dynamic programming
• Matching
• Questions?
The algorithm for rooted unordered trees:
• Input: rooted trees T = (V_T, E_T, r) and P = (V_P, E_P, r’).
• Output: the root of the subtree t of T that is homeomorphic to P and has the highest similarity score to P.
Dynamic programming
for each node u of P in postorder do
    for each node v of T in postorder do
        if u is a leaf then
            if v is a leaf then
                RScore(v,u) = Δ[v,u]    (node-to-node score)
            else
                RScore(v,u) = ComputeScores(v,u)
            end if
        else
            if Level(u) > Level(v) then
                RScore(v,u) = -∞    (would require deleting from the pattern)
            else
                RScore(v,u) = ComputeScores(v,u)
            end if
        end if
    end for
end for
Procedure ComputeScores(v,u)
Let k denote the out-degree of node u and l denote the out-degree of node v.
if k > l then
    AssignmentScore(G) = -∞
else
    Construct a bipartite graph G with node bipartition X and Y such that:
    X is the set of children {x1,…,xk} of u,
    Y is the set of children {y1,…,yl} of v,
    node xi ∈ X is connected to node yj ∈ Y via an edge whose weight w(xi,yj) is set to RScore(yj,xi).
    AssignmentScore(G) = max_M ∑_{(i,j)∈M} RScore(yj,xi)
end if
Find, among all children of v, the child whose ALSH score with u is highest, and let BestChild(v,u) denote that score:
BestChild(v,u) = max_{j=1..l} RScore(yj,u)
return max{Δ[v,u] + AssignmentScore(G), BestChild(v,u) + δ}
Here δ is the deletion penalty (the score D defined earlier).
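The dynamic program and ComputeScores can be combined into one short runnable sketch. Everything about the representation is an assumption: trees are dictionaries from a node to its list of children, Δ is a dictionary keyed by (text node, pattern node), and the assignment is solved by trying all permutations instead of a flow algorithm, so this only suits tiny trees. Memoized recursion replaces the explicit postorder loops, and the Level test is not needed because an impossible pair evaluates to -∞ anyway.

```python
from functools import lru_cache
from itertools import permutations

NEG_INF = float("-inf")


def rooted_alsh(text_children, pattern_children, pattern_root, delta,
                deletion_penalty):
    """Return (best score, text node v) maximizing RScore(v, pattern_root)."""

    @lru_cache(maxsize=None)
    def rscore(v, u):
        xs = pattern_children.get(u, [])  # children of pattern node u
        ys = text_children.get(v, [])     # children of text node v
        # Term 1: match u with v and assign each child of u to a distinct
        # child of v (brute force over injective assignments).
        if len(xs) > len(ys):
            assignment = NEG_INF
        else:
            assignment = max(
                sum(rscore(y, x) for x, y in zip(xs, chosen))
                for chosen in permutations(ys, len(xs))
            )
        matched = delta[(v, u)] + assignment
        # Term 2: delete v and continue with its best-scoring child.
        best_child = max((rscore(y, u) for y in ys), default=NEG_INF)
        return max(matched, best_child + deletion_penalty)

    nodes = set(text_children) | {y for ys in text_children.values() for y in ys}
    return max((rscore(v, pattern_root), v) for v in nodes)


# Hypothetical example: text node t1 must be deleted to align p1 with t2.
text = {"t0": ["t1", "t3"], "t1": ["t2"]}
pattern = {"p0": ["p1", "p2"]}
text_label = {"t0": "A", "t1": "X", "t2": "B", "t3": "C"}
pattern_label = {"p0": "A", "p1": "B", "p2": "C"}
delta = {(v, u): 3 if text_label[v] == pattern_label[u] else 0
         for v in text_label for u in pattern_label}
result = rooted_alsh(text, pattern, "p0", delta, deletion_penalty=-1)
print(result)
```

In this made-up instance three label matches score 3 each and the deletion of t1 costs 1.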
Time complexity analysis
Observation 1:
∑_{u=1}^{m} c(u) = m − 1
∑_{v=1}^{n} c(v) = n − 1
Here m is the number of vertices in the pattern and n is the number of vertices in the text.
Time complexity analysis
The weighted assignment is computed once for each pair v ∈ T, u ∈ P.
The bipartite graph has c(v) + c(u) nodes and c(v)·c(u) edges.
Based on Fredman and Tarjan, the time complexity is:
O(∑_{u=1}^{m} ∑_{v=1}^{n} (c(u)² c(v) + c(u) c(v) log c(v)))
= O(∑_{u=1}^{m} (c(u)² n + c(u) n log n))    (Observation 1)
= O(m² n + m n log n)    (Observation 1)
Unrooted unordered trees:
• The problem: each vertex in both the text tree and the pattern tree can be the root.
• The naïve solution: choose an arbitrary node r of T to get a rooted tree T^r. Next, for each u ∈ P, compute the rooted ALSH between P^u and T^r.
• Time complexity: O(m³ n + m² n log n)
2nd try:
• Select an arbitrary node r in T as the root.
• For each internal node in T (in postorder) and for each node in P, compute an “improved” matching problem.
[Figure: pattern node u and text node v.]
General idea for keeping the time complexity
• Find the best match between the neighbors {x1,…,x_d(u)} of u ∈ P and the children {y1,…,y_c(v)} of v ∈ T.
• After computing the best match and removing a node xi (which acts as the parent of u), the optimal matching between {x1,…,x_d(u)} \ {xi} and {y1,…,y_c(v)} can be found in O(d(u) c(v) + c(v) log c(v)).
• The total time complexity for computing all assignments between v and u is O(d(u)² c(v) + d(u) c(v) log c(v)).
Time complexity
Observation 2: the sum of vertex degrees in an unrooted tree P is
∑_{u=1}^{m} d(u) = 2m − 2
(We studied this in Combinatorics.)
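Both counting observations are easy to confirm on a concrete tree. The sketch below assumes a rooted tree stored as a dictionary from a node to its list of children; the helper name `check_observations` and the sample tree are made up for illustration.

```python
def check_observations(children, root):
    """Return (m, sum of child counts, sum of undirected degrees) for a tree."""
    nodes = {root} | {c for cs in children.values() for c in cs}
    m = len(nodes)
    # Observation 1: every node except the root is the child of exactly one node.
    child_sum = sum(len(children.get(u, [])) for u in nodes)
    # Observation 2: the undirected degree is the child count, plus one for
    # the parent edge of every non-root node.
    degree_sum = sum(len(children.get(u, [])) + (u != root) for u in nodes)
    return m, child_sum, degree_sum


# Hypothetical 4-node tree: a -> b, c and c -> d.
m, child_sum, degree_sum = check_observations({"a": ["b", "c"], "c": ["d"]}, "a")
print(m, child_sum, degree_sum)
```

The child counts add up to m − 1 because each edge is counted once, and the degrees add up to 2m − 2 because each edge is counted from both ends.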
Time complexity, continued…
O(∑_{u=1}^{m} ∑_{v=1}^{n} (d(u)² c(v) + d(u) c(v) log c(v)))
= O(∑_{u=1}^{m} (d(u)² n + d(u) n log n))    (Observation 1)
= O(m² n + m n log n)    (Observation 2)
Up the tree…
For each vertex v ∈ T, u ∈ P and xi ∈ neighbors(u), UScore[v,u,xi] is the maximal LSH between the subtree p_u^{xi} of P and a corresponding homeomorphic subtree of t_v^r (the subtree of T rooted at v when T is rooted at r), if one exists. Otherwise, UScore[v,u,xi] is set to -∞.
Here p_u^{xi} is the subtree of P whose root is u and whose root’s parent is xi.
UScore[v,u,xi] is the maximum of two terms:
• The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximal assignment over Gi. This term is only computed if d(u) − 1 ≤ c(v) (otherwise it is -∞).
• The weight UScore[yj,u,xi] for the comparison of u with the best-scoring child yj of v, updated with the penalty for deleting v.
Here d(u) is the degree of u.
And if ‘u’ is the root…
• We have to compute an additional entry UScore[v,u,Φ].
• This entry represents the fact that u might be the root of P.
• The root of P will be the node u for which UScore[v,u,Φ] is maximal.
Multi-source graphs
• DAG = Directed Acyclic Graph.
• A multi-source tree is a DAG whose underlying structure is an unrooted, unordered tree.
Multi-source graph - example
[Figure: a pattern with nodes u and r’ and a text with nodes v and r, whose edge directions conflict, so UScore[v,u,r’] = -∞.]
Multi-source graphs & alignment
• We use the algorithm for unrooted unordered trees.
• We filter out subtree alignments that map together edges of conflicting direction.
• We split the bipartite graph G = (X ∪ Y, E) into two graphs: one corresponds to matching the incoming-edge neighbors of u and v, and the other to matching their outgoing-edge neighbors.
Solving ALSH for ordered rooted trees
• The assignment becomes a maximum weighted matching problem on ordered bipartite graphs, where no edges are allowed to cross.
• Given a pattern string X, a source string Y, and a character-to-character similarity table Δ[Σ_X, Σ_Y], find among all |X|-sized subsequences of Y the subsequence Q that is most similar to X, that is, the one for which the sum ∑_{i=1}^{|X|} Δ[Q_i, X_i] is maximized.
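String alignment
[Figure: dynamic-programming table with the pattern children x1, x2 against the text children y1, y2, y3, of size (k+1) × (l+1); boundary entries are 0 and -∞, and a diagonal step adds Δ. Callouts: “This is NOT the deletion penalty”; “We can’t delete nodes from the pattern tree”.]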
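The non-crossing matching for ordered trees, stated above as a subsequence problem, fits a small dynamic-programming table. The sketch below is an assumption-laden illustration: the name `ordered_alignment`, the dictionary form of Δ keyed by (text item, pattern item), and the example strings are made up. Skipping a text item is free at this level and skipping a pattern item is impossible, which is why the first column stays at -∞.

```python
NEG_INF = float("-inf")


def ordered_alignment(xs, ys, delta):
    """Best sum of delta[(y, x)] aligning all of xs, in order, into ys."""
    k, l = len(xs), len(ys)
    # score[i][j]: best score for aligning x_1..x_i inside y_1..y_j.
    score = [[NEG_INF] * (l + 1) for _ in range(k + 1)]
    score[0] = [0] * (l + 1)  # an empty pattern prefix aligns for free
    for i in range(1, k + 1):
        for j in range(1, l + 1):
            skip_y = score[i][j - 1]  # leave y_j unmatched
            match = score[i - 1][j - 1] + delta[(ys[j - 1], xs[i - 1])]
            score[i][j] = max(skip_y, match)  # pattern items are never skipped
    return score[k][l]


# Hypothetical similarity table: 2 for equal characters, -1 otherwise.
alphabet = "abc"
delta = {(y, x): 2 if x == y else -1 for y in alphabet for x in alphabet}
best = ordered_alignment("ab", "acb", delta)
print(best)
```

The table has (k+1)(l+1) cells with constant work per cell, which matches an O(c(u)·c(v)) cost per node pair.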
Time complexity for rooted ordered trees
For each node pair (v ∈ T, u ∈ P), the time complexity of the assignment is O(c(u) c(v)) (dynamic programming).
∑_{u=1}^{m} ∑_{v=1}^{n} O(c(v) c(u)) = ∑_{v=1}^{n} O(m c(v)) = O(m n)    (Observation 1)