Efficient Discovery of Conserved Patterns Using a Pattern Graph
Inge Jonassen
Pattern Discovery
Arwa Zabian, 02/11/2014

Goal
The goal is to use a pattern graph to discover conserved patterns in a set of related protein sequences.

Pratt
• A tool that lets the user search for patterns conserved in a set of protein sequences.
• The user must specify what kind of patterns to search for and how many sequences a pattern must match in order to be reported.
• The tool implements an algorithm proposed by Jonassen in 1995 for discovering patterns of the PROSITE type, allowing both ambiguous pattern positions and variable-length gaps.
• Pratt searches for patterns matching at least a specified number of the given sequences, then ranks the discovered patterns by a scoring function, highest score first.

PROSITE
A database of protein families containing more than 1,100 entries. For each family it gives a pattern or a profile that can be used to identify new members of the family. The results indicate that Pratt is able to discover useful patterns for some protein families.

Efficient Discovery of Conserved Patterns Using a Pattern Graph
In 1996 Jonassen proposed an alternative approach for finding patterns common to at least k out of n given sequences, introducing the concept of a pattern graph. The approach:
• assumes that patterns have a predetermined form;
• defines the transformation operations that generate one pattern from another;
• represents a given sequence $S = s_1 s_2 \dots s_l$ of length $l$ as a graph;
• constructs the graph;
• uses depth-first search (DFS) to find all the patterns that can be derived from a path in the graph;
• selects, among all these patterns, the most significant ones, i.e. those with the highest score.

Terminology and definitions (1)
The algorithm finds the most interesting patterns matching at least some minimum number of sequences from a given set. The sequences are strings over an alphabet $\Sigma$, the residue alphabet of the sequences (for protein sequences, the amino acids).
Definition (a class of patterns):
• A pattern $P$ in the class $C$ has the form
  $P = A_1 - x(i_1, j_1) - A_2 - x(i_2, j_2) - \dots - x(i_{p-1}, j_{p-1}) - A_p$   (1)
  where $A_1, A_2, \dots, A_p$ are called the pattern components of $P$.
• A pattern component can be an identity component (a single symbol) or an ambiguous component (a set of symbols).
• The $x(i_k, j_k)$ are called wildcard regions, where $i_k \le j_k$ are non-negative integers.
• A wildcard region can be fixed ($i_k = j_k$) or flexible ($i_k < j_k$); its flexibility is defined as $j_k - i_k$.

Example:
$P$ = A-[DE]-x(3)-G-x(3,4)-L
$A_1$ = A, $i_1 = j_1 = 0$, $A_2$ = {D, E}, $i_2 = j_2 = i_3 = 3$, $A_3$ = G, $j_3 = 4$, $A_4$ = L.
The length of the pattern is 6 and $p = 4$.
For a pattern $P'$ to match $P$, each pattern component of $P'$ must match the corresponding pattern component of $P$.
For this pattern: P = 4, L = 6, W = 4, F = 1, N = 1, FP = 2.
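To make the matching of such a pattern against a sequence concrete, here is a minimal Python sketch that translates a pattern of form (1) into a regular expression. The list-based representation and the function name `pattern_to_regex` are assumptions made for illustration; they are not part of Pratt.

```python
import re

def pattern_to_regex(components, gaps):
    """Translate a pattern of form (1) into a regular expression.

    components: list of strings, each holding the residues allowed at that position
    gaps: list of (i, j) wildcard bounds between consecutive components
    (this representation is an illustrative assumption)
    """
    parts = []
    for k, comp in enumerate(components):
        # identity component -> literal, ambiguous component -> character class
        parts.append(comp if len(comp) == 1 else "[" + comp + "]")
        if k < len(gaps):
            i, j = gaps[k]
            if (i, j) != (0, 0):
                parts.append(".{%d,%d}" % (i, j))
    return "".join(parts)

# P = A-[DE]-x(3)-G-x(3,4)-L
components = ["A", "DE", "G", "L"]
gaps = [(0, 0), (3, 3), (3, 4)]
regex = pattern_to_regex(components, gaps)
print(regex)
print(bool(re.search(regex, "MMADKKKGQQQQLW")))
```

A sequence matches the pattern if the regular expression is found anywhere in it, which is why `re.search` is used rather than `re.fullmatch`.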

Definitions (2)
We define the class of patterns $C$ that the algorithm will discover through a set of bounds $b = (A, P, L, W, F, N, FP)$, where:
• $A \subseteq 2^{\Sigma}$ is the set of allowed pattern components
• $P$ is the maximum number of components
• $L$ is the maximum length of a pattern
• $W$ is the maximum length of a wildcard region
• $F$ is the maximum flexibility of a wildcard region
• $N$ is the maximum number of flexible wildcard regions
• $FP$ is the maximum product of flexibilities, $\prod_{k=1}^{p-1} (j_k - i_k + 1)$
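As a rough check of these quantities, the following Python sketch computes P, W, F, N and FP for a single pattern and compares them with a set of bounds. The representation (a list of components plus a list of `(i, j)` wildcard bounds), the function names and the bound values are assumptions for illustration only; the length L is left out because its exact definition is not spelled out here.

```python
from math import prod

def pattern_stats(components, gaps):
    """Quantities of one pattern that the bounds P, W, F, N, FP constrain."""
    flex = [j - i for i, j in gaps]
    return {
        "P": len(components),                      # number of components
        "W": max((j for _, j in gaps), default=0), # longest wildcard region
        "F": max(flex, default=0),                 # largest flexibility
        "N": sum(1 for f in flex if f > 0),        # number of flexible regions
        "FP": prod(f + 1 for f in flex),           # product of (j_k - i_k + 1)
    }

def within_bounds(stats, bounds):
    """True if every computed quantity is at most the corresponding bound."""
    return all(stats[name] <= bounds[name] for name in stats)

# A-[DE]-x(3)-G-x(3,4)-L
stats = pattern_stats(["A", "DE", "G", "L"], [(0, 0), (3, 3), (3, 4)])
print(stats)

# Hypothetical bound values, chosen only for this example.
print(within_bounds(stats, {"P": 10, "W": 5, "F": 2, "N": 2, "FP": 10}))
```

For the pattern A-[DE]-x(3)-G-x(3,4)-L this gives P = 4, W = 4, F = 1, N = 1 and FP = 2.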

Generalisation of patterns
Definition: a pattern $P'$ is said to be a generalisation of another pattern $P$ if every sequence matching $P$ also matches $P'$.
The concept of generalisation:
• Given a class of patterns $C$, we define a family of transformation operators $\Rightarrow^{i}_{C}$, $i \in \{1, 2, 3\}$, each of which can be applied to a pattern $P$ in $C$ to produce another pattern $P'$ in $C$.
• These operators are defined as follows:
1- $P \Rightarrow^{1}_{C} P'$: $P'$ is generated from $P$ by the first transformation operator if $P'$ can be obtained by deleting a pattern component $c$ from $P$. Formally, one of the following holds:
   - $P = c - x(i, j) - P'$
   - $P = P' - x(i, j) - c$
   - $P = P_1 - x(i_1, j_1) - c - x(i_2, j_2) - P_2$ and $P' = P_1 - x(i_1 + i_2 + 1, j_1 + j_2 + 1) - P_2$
   for $P, P', P_1, P_2 \in C$.

Generalisation of patterns (2)
2- $P \Rightarrow^{2}_{C} P'$: by substituting a component $c$ in $P$ with a less restrictive one $c'$:
   $P = P_1 - x(i_1, j_1) - c - x(i_2, j_2) - P_2$
   $P' = P_1 - x(i_1, j_1) - c' - x(i_2, j_2) - P_2$
   with $P_1, P_2 \in C$ and $c \subset c' \in A$.
3- $P \Rightarrow^{3}_{C} P'$: by allowing more flexibility in a wildcard region of $P$:
   $P = P_1 - x(i, j) - P_2$
   $P' = P_1 - x(i', j') - P_2$
   where $i' \le i$, $j' \ge j$ and $(i', j') \ne (i, j)$, for some $P_1, P_2, P' \in C$.
More generally, $P \Rightarrow_{C} P'$ if and only if $P \Rightarrow^{i}_{C} P'$ for some $i \in \{1, 2, 3\}$.
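The three operators can be sketched in Python as functions on a simple pattern representation: a list of components (strings of allowed residues) and a list of `(i, j)` wildcard bounds between them. The representation and the function names are assumptions made for illustration, and the check that the result stays inside the class $C$ is left out.

```python
def delete_component(components, gaps, idx):
    """Operator 1: delete the component at position idx."""
    comps = components[:idx] + components[idx + 1:]
    if idx == 0:
        return comps, gaps[1:]
    if idx == len(components) - 1:
        return comps, gaps[:-1]
    # inner component: merge the two neighbouring wildcard regions
    (i1, j1), (i2, j2) = gaps[idx - 1], gaps[idx]
    merged = (i1 + i2 + 1, j1 + j2 + 1)
    return comps, gaps[:idx - 1] + [merged] + gaps[idx + 1:]

def widen_component(components, gaps, idx, wider):
    """Operator 2: replace a component by a strictly less restrictive one."""
    if not set(components[idx]) < set(wider):
        raise ValueError("new component must strictly contain the old one")
    return components[:idx] + [wider] + components[idx + 1:], list(gaps)

def relax_gap(components, gaps, idx, new_gap):
    """Operator 3: allow more flexibility in one wildcard region."""
    (i, j), (ni, nj) = gaps[idx], new_gap
    if not (ni <= i and nj >= j and (ni, nj) != (i, j)):
        raise ValueError("new wildcard region must strictly contain the old one")
    return list(components), gaps[:idx] + [new_gap] + gaps[idx + 1:]

# A-B-C-D  ->  [AB]-B-C-D  ->  [AB]-B-x-D  ->  [AB]-B-x(1,3)-D
pattern = (["A", "B", "C", "D"], [(0, 0), (0, 0), (0, 0)])
pattern = widen_component(*pattern, 0, "AB")
pattern = delete_component(*pattern, 2)
pattern = relax_gap(*pattern, 1, (1, 3))
print(pattern)
```

Each function returns a new pattern and leaves its input untouched, so a chain of generalisations can be built step by step.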

Example: the pattern A-B-C-D can be generalised to [AB]-B-x(1,3)-D:
A-B-C-D $\Rightarrow^{2}_{C}$ [AB]-B-C-D $\Rightarrow^{1}_{C}$ [AB]-B-x-D $\Rightarrow^{3}_{C}$ [AB]-B-x(1,3)-D
Pattern scoring function:
• The score of a pattern $P$ of the form (1) is
  $I(P) = \sum_{i=1}^{p} I'(A_i) - c \cdot \sum_{k=1}^{p-1} (j_k - i_k)$
  where $c$ is a constant and $I'(A_i)$ is the information content of the pattern component $A_i$.
• Patterns that contain more information score higher and are ranked at the top.
• Pratt uses this function to rank all the patterns it discovers.
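As a worked instance of the scoring function, consider the pattern A-[DE]-x(3)-G-x(3,4)-L. Only the flexibility penalty can be evaluated without further assumptions, since the values $I'(A_i)$ are not given here; the sketch below therefore leaves them symbolic.

```latex
\begin{align*}
P &= \text{A-[DE]-x(3)-G-x(3,4)-L}, \qquad
  (i_1, j_1) = (0, 0),\ (i_2, j_2) = (3, 3),\ (i_3, j_3) = (3, 4) \\
\sum_{k=1}^{3} (j_k - i_k) &= 0 + 0 + 1 = 1 \\
I(P) &= I'(\mathrm{A}) + I'([\mathrm{DE}]) + I'(\mathrm{G}) + I'(\mathrm{L}) - c
\end{align*}
```

Only the flexible wildcard region x(3,4) contributes to the penalty, so the score is the summed information content of the four components minus $c$.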

Pattern graph
• A pattern graph is a directed graph $G = (V, E)$ in which the nodes $V$ represent pattern components and the edges $E$ represent wildcard regions.
• $a(u)$ is the label of a node $u \in V$; each edge $e \in E$ is labelled with the minimum $d(e)$ and the maximum $d'(e)$ number of residues that can match the wildcard region.

Example
• For the pattern $P$ = A-B-x(0,2)-C-x(3)-D we can construct a graph with nodes $u, v, w, x$ labelled $a(u)$ = A, $a(v)$ = B, $a(w)$ = C, $a(x)$ = D.
• A path $p = u_1, u_2, \dots, u_n$ in $G$ defines a pattern $P(p)$; for instance, the path $u, w, x$ defines the pattern $P(p)$ = A-x(0,1)-C-x(3)-D.

Definition (3)
• We define $\Phi(G, C)$ as the set of all patterns that are $C$-generalisations of the patterns in $C$ defined by the paths in $G$:
  $\Phi(G, C) = \bigcup_{p \text{ path in } G,\ P(p) \in C} \{ P' \mid P(p) \Rightarrow^{*}_{C} P' \}$
• We define $\Phi'(G, C)$ as the set of patterns in $C$ that can be derived from a pattern defined by a path in $G$ using only the restricted transformation operators, i.e. with $P \Rightarrow_{C} P'$ if and only if $P \Rightarrow^{i}_{C} P'$ for some $i \in \{2, 3\}$:
  $\Phi'(G, C) = \bigcup_{p \text{ path in } G,\ P(p) \in C} \{ P' \mid P(p) \Rightarrow^{*}_{C} P' \text{ using only } i \in \{2, 3\} \}$
• The goal is to find $\Phi'(G, C)$ and to prove that $\Phi'(G, C) = \Phi(G, C)$.

Constructing a pattern graph
• Input:
  - a set of sequences $S = s_1, s_2, \dots, s_n$, where $s_i = s_{i1} s_{i2} \dots s_{ij}$
  - bounds $b$ specifying a class $C$
  - the minimum number of sequences $k < n$ that a pattern should match
• Output:
  1- a pattern graph $G$
  2- $\Phi'(G, C)$
  3- the highest scoring patterns matching at least $k$ sequences
  4- pruning of the highest scoring patterns

Constructing a pattern graph from a sequence
Given a set of bounds $b$ defining a class of patterns $C$ and a sequence $s = s_1 s_2 \dots s_l$, the algorithm works in two phases. The first phase defines the nodes, starting from the root, and the second phase defines the edges.
Phase 1:
- $G$ contains one node $u_i$ for each character $s_i$ in $s$ that is a pattern component in $A$
- each node $u_i$ is labelled with $s_i \in A$: $a(u_i) = s_i$
Phase 2:
- for each node $u_i$, make an edge to every node $u_j$ with $i < j \le \min(i + W + 1, l)$
- label the edge $(u_i, u_j)$ with $(j - i - 1, j - i - 1)$
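The two phases can be sketched in a few lines of Python. The dictionary-based graph representation and the name `build_sequence_graph` are assumptions for illustration; positions are numbered from 1 as in the slide, and only identity components (single characters) are considered.

```python
def build_sequence_graph(seq, alphabet, W):
    """Build a pattern graph from one sequence.

    Returns (labels, edges): labels maps a 1-based position i to its
    component a(u_i); edges maps (i, j) to the fixed wildcard bounds.
    """
    l = len(seq)
    # Phase 1: one node per character that is an allowed pattern component
    labels = {i: ch for i, ch in enumerate(seq, start=1) if ch in alphabet}
    # Phase 2: edges to every later node with i < j <= min(i + W + 1, l)
    edges = {}
    for i in labels:
        for j in range(i + 1, min(i + W + 1, l) + 1):
            if j in labels:
                edges[(i, j)] = (j - i - 1, j - i - 1)
    return labels, edges

labels, edges = build_sequence_graph("ABCDEFG", set("ABCDEFG"), W=2)
print(labels[1], edges[(1, 2)], edges[(1, 3)], edges[(1, 4)])
print(len(edges))
```

With W = 2, the node for A gets edges to B, C and D, labelled (0, 0), (1, 1) and (2, 2); every edge is a fixed wildcard region because the graph comes from a single sequence.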

Example
$S$ = ABCDEFG
Algorithm properties:
- $\Phi'(G, C)$ contains all patterns in $C$ matching $S$
- each pattern in $\Phi'(G, C)$ matches $S$
- $\Phi'(G, C) = \Phi(G, C)$

Constructing a pattern graph from a multiple alignment
• The goal is to construct a pattern graph $G$ with $\Phi'(G, C) = \Phi(G, C)$.
• Input: an alignment of the sequences $S = M_1, \dots, M_m$, where $l$ is the length of the alignment.
  - A sequence is $M_i = M_{i1} \dots M_{i l_i}$, where $M_{ij}$ is the $j$th character of the sequence $M_i$.
  - The alignment columns are numbered from left to right.
  - Column $i$ is represented by a vector $c_{i1}, \dots, c_{im}$, where $c_{ij} = k$ if the $i$th column contains the $k$th character of sequence $M_j$, or 0 if the $i$th column contains a gap in that sequence.
  - The graph is constructed from all the ungapped columns.

Constructing a pattern graph from a multiple alignment (2)
• The algorithm works in two steps: the first defines the nodes of the graph and the second defines the edges.
Step 1:
- for each ungapped column $c_i$ in the alignment, make a node $u_i$
- the set of symbols present in that column determines the allowable pattern components
- label $u_i$ with the smallest set $a \in A$ that contains those symbols
Step 2:
- each pair of nodes $u_i, u_j$ with $i < j$ corresponds to a pair of columns $i, j$ in the alignment
- $d$ and $d'$ are the minimum and the maximum, over the sequences, of the number of sequence symbols between columns $i$ and $j$
- label each edge $(u_i, u_j)$ with $(d, d')$
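The two steps can be sketched as follows in Python. The alignment is assumed to be a list of equal-length strings with "-" as the gap character, columns are 0-based, and `allowed` stands for the set $A$; the function name, the representation and the choice to skip a column when no allowed set covers it are assumptions for illustration.

```python
def build_alignment_graph(rows, allowed):
    """Build a pattern graph from a gapped multiple alignment.

    rows: aligned sequences (equal length, '-' marks a gap)
    allowed: list of frozensets, the allowed pattern components A
    Returns (labels, edges) keyed by 0-based column index.
    """
    labels = {}
    # Step 1: one node per ungapped column, labelled with the smallest
    # allowed component containing every symbol in that column
    for c in range(len(rows[0])):
        column = {row[c] for row in rows}
        if "-" in column:
            continue
        candidates = [a for a in allowed if column <= a]
        if candidates:
            labels[c] = min(candidates, key=len)
    # Step 2: label each edge with the min and max number of residues
    # that the sequences have strictly between the two columns
    edges = {}
    cols = sorted(labels)
    for pos, i in enumerate(cols):
        for j in cols[pos + 1:]:
            counts = [sum(ch != "-" for ch in row[i + 1:j]) for row in rows]
            edges[(i, j)] = (min(counts), max(counts))
    return labels, edges

rows = ["AC-DE",
        "ACKDE",
        "GC-DE"]
allowed = [frozenset(s) for s in ("A", "C", "D", "E", "G", "AG")]
labels, edges = build_alignment_graph(rows, allowed)
print(sorted(labels[0]), edges[(1, 3)])
```

In this toy alignment the gapped third column yields no node, and the edge between the C and D columns is labelled (0, 1) because one sequence has a residue there and the others have a gap.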

Example: pattern graph constructed from a multiple alignment (figure not reproduced)

Simple depth-first search using the graph
• So far we have constructed the graph. The next step is to find the set of conserved patterns in $\Phi'(G, C)$ using DFS.
• This means constructing a search tree that is rooted in the empty pattern and contains all the $k$-patterns in $\Phi'(G, C)$ at depth $k$.
• Definition: a $k$-pattern is a pattern defined by a $k$-path in $G$ and the $C$-generalisation operations (of types 2 and 3) applied to it.
• Input: a conserved $k$-pattern $P$ and the $k$-path $p$ in $G$ from which $P$ was derived.
• Output: all the simple extensions of $P$ that are in $C$ and can be derived from an extension of the path $p$; each generated pattern is checked to see whether it is conserved.

How an extension of $P$ is generated
• Let $p = v_1, v_2, \dots, v_k$ be the path, and let there be edges from $v_k$ to $w_1, \dots, w_l$.
• Each path $p_l = v_1, \dots, v_k, w_l$ defines a pattern $P(p_l)$ that is a simple extension of $P$, where
  $P(p_l) = P - x(d(v_k w_l), d'(v_k w_l)) - a(w_l)$, or equivalently $P(p_l) = P - x(i_l, j_l) - A_l$.
• From each pattern $P(p_l)$ we can generate further simple extensions by applying the operator of type 2 to $A_l$ and the operator of type 3 to $x(i_l, j_l)$.
• Example: let $G$ be a graph, and assume $F = 1$ and $A$ = {{A}, {B}, {C}, {D}, {E}, {F}, {G}}. Assume that the pattern $P$ = A-x-C-D was derived from the path $p$ = A, C, D.

Example (continued)
• The path $p$ can be extended along any of the edges.
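The generation of simple extensions can be sketched in Python as follows. The graph is assumed to be given as `labels` (node to component) and `edges` (node pair to wildcard bounds); the function name, the representation and the choice to cap wildcard regions at a maximum length `W` are assumptions for illustration, and the remaining bounds of the class $C$ are not checked.

```python
def simple_extensions(labels, edges, last, allowed, F, W):
    """List the simple extensions of a path that ends in node `last`.

    Each extension is (node, component, (i, j)): the appended node, a
    component obtained with operator 2, and a wildcard region obtained
    with operator 3, with flexibility at most F and length at most W.
    """
    out = []
    for (u, w), (d, d2) in sorted(edges.items()):
        if u != last:
            continue
        # operator 2: any allowed component containing the node label
        comps = [a for a in allowed if labels[w] <= a]
        # operator 3: any region x(i, j) with i <= d, j >= d2, j - i <= F
        regions = [(i, j)
                   for i in range(0, d + 1)
                   for j in range(d2, W + 1)
                   if j - i <= F]
        out.extend((w, a, g) for a in comps for g in regions)
    return out

# Graph fragment for S = ABCDEFG around the node D (position 4), W = 2.
labels = {i: frozenset(ch) for i, ch in enumerate("ABCDEFG", start=1)}
edges = {(4, 5): (0, 0), (4, 6): (1, 1), (4, 7): (2, 2)}
allowed = [frozenset(ch) for ch in "ABCDEFG"]
for node, comp, region in simple_extensions(labels, edges, 4, allowed, F=1, W=2):
    print(node, sorted(comp), region)
```

With $F = 1$ and only singleton components, the path A, C, D gets seven candidate extensions here: two through E, three through F and two through G.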

Simple depth-first search using the graph (2)
• Running the search procedure recursively generates all the patterns in $\Phi'(G, C)$; each one is then checked to see whether it is conserved.
• Pruning the search:
- finding the highest scoring patterns means that, for every node $u$ in $G$, we need to find the most expressive conserved pattern derived from a path starting in $u$
- the search is handled in three cases:
1- neither flexibility nor ambiguity is allowed
2- no flexibility, but ambiguity is allowed
3- the general case
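A minimal Python sketch of the recursive search, restricted to the simplest case (fixed wildcard regions, no ambiguity), is given below. It walks the paths of the pattern graph of one seed sequence and keeps every pattern that matches at least `k` of the sequences. The function name, the use of regular expressions for matching and the cut-off `max_components` are assumptions for illustration, not the Pratt implementation.

```python
import re

def conserved_patterns(seed, seqs, k, W, max_components):
    """Depth-first search over paths in the pattern graph of `seed`."""
    found = set()

    def support(regex):
        # number of sequences that contain a match of the pattern
        return sum(1 for s in seqs if re.search(regex, s))

    def dfs(last, regex, depth):
        if support(regex) < k:
            return  # an extension is more specific, so it cannot match more
        found.add(regex)
        if depth == max_components:
            return
        # follow every edge (last, j) of the sequence pattern graph
        for j in range(last + 1, min(last + W + 1, len(seed) - 1) + 1):
            gap = "." * (j - last - 1)
            dfs(j, regex + gap + re.escape(seed[j]), depth + 1)

    for start in range(len(seed)):
        dfs(start, re.escape(seed[start]), 1)
    return sorted(found)

seqs = ["ABCDE", "AXCDE", "ABCXE"]
print(conserved_patterns(seqs[0], seqs, k=3, W=1, max_components=3))
```

The early return when a pattern is not conserved is what keeps the search tree small: no extension of a non-conserved pattern is ever generated.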

Pruning the search (1): neither flexibility nor ambiguity is allowed
• $A = \{ \{a\} \mid a \in \Sigma \}$ and $F = 0$. In this case a pattern is defined directly by a path, and the longest path gives the highest scoring pattern.
• Property: in a graph $G = (V, E)$, if a node $u$ has edges to $v$ and $w$ where $u < v < w$, then there is also an edge from $v$ to $w$.
• Defining an ordering relation $<_1$, we can order the child nodes of a given node $x$ so that if $x_i <_1 x_j$, then for the patterns $P_{x_i}$ and $P_{x_j}$ we have $w_i < w_j$.
• Result: there is no need to explore all the subtrees $x_{i+1}, \dots, x_l$ to find the highest scoring pattern.

Pruning the search (2): no flexibility, but ambiguity is allowed
• If $x'$ is a child of node $x$ in the search tree, corresponding to the path $p_i = v_1, \dots, v_k, w_i$, the pattern derived from this path is $P_x - x(d(v_k w_i)) - A$. Let $\mathrm{Ind}(x') = \mathrm{index}(w_i)$ and $\mathrm{Amb}(x') = |A|$.
• We define a partial order $<_2$ on the children of $x$: $x' <_2 x''$ if $\mathrm{Ind}(x') < \mathrm{Ind}(x'')$, or if $\mathrm{Ind}(x') = \mathrm{Ind}(x'')$ and $\mathrm{Amb}(x') < \mathrm{Amb}(x'')$.
• Two nodes $x', x''$ with $(\mathrm{Ind}(x'), \mathrm{Amb}(x')) = (\mathrm{Ind}(x''), \mathrm{Amb}(x''))$ are ordered arbitrarily.
• If the pattern of a child $x'$ matches the same number of segments as $P_x$, none of the children of $x$ after $x'$ is analysed, because they cannot give a higher scoring pattern.

Pruning the search (3): the general case
• Each child $x'$ of $x$ defines a pattern $P' = P - x(i, j) - a$.
• The child is determined by the node $w_i$ appended to the path, a component $a \in A$ such that $a(w_i) \subseteq a$, and the flexibility of the wildcard region defined by the edge $v_k w_i$.
• Given $\mathrm{Ind}(x')$, $\mathrm{Amb}(x')$ and $F(x') = j - i$, we define a partial order $<_3$ on the children of $x$: $x' <_3 x''$ if
  - $\mathrm{Ind}(x') < \mathrm{Ind}(x'')$, or
  - $\mathrm{Ind}(x') = \mathrm{Ind}(x'')$ and $\mathrm{Amb}(x') < \mathrm{Amb}(x'')$, or
  - $(\mathrm{Ind}(x'), \mathrm{Amb}(x')) = (\mathrm{Ind}(x''), \mathrm{Amb}(x''))$ and $F(x') < F(x'')$.
• Two nodes $x', x''$ for which $(\mathrm{Ind}(x'), \mathrm{Amb}(x'), F(x')) = (\mathrm{Ind}(x''), \mathrm{Amb}(x''), F(x''))$ are ordered arbitrarily.

Pruning the search (3): the general case (continued)
• Let $P'$ be the pattern corresponding to a child $x'$ of $x$. If the extension $P'$ matches at least a certain proportion of the segments matched by $P$, the other children of $x$ are not analysed.
• The rationale: if $P$ is a real conserved pattern and the extension $P'$ matches at least $k$ segments, we would expect only a small proportion of the segments matching $P$ to extend to segments matching $P'$.
• When that proportion is reached, $P'$ is taken as a conserved pattern and no additional extension of $P$ needs to be explored.
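The ordering of the children and the early stop can be sketched together in Python. Each child is assumed to be a dictionary with hypothetical keys `ind`, `component`, `gap` and `matches`, and the proportion `threshold` is a free parameter; none of these names comes from Pratt.

```python
def order_key(child):
    """Sort key (Ind, Amb, F) that realises the order on the children."""
    i, j = child["gap"]
    return (child["ind"], len(child["component"]), j - i)

def children_to_explore(children, parent_matches, threshold):
    """Return the children in order, stopping after the first one whose
    pattern keeps at least `threshold` of the parent's matched segments."""
    kept = []
    for child in sorted(children, key=order_key):
        kept.append(child)
        if child["matches"] >= threshold * parent_matches:
            break  # later children are not analysed
    return kept

children = [
    {"name": "c1", "ind": 5, "component": "E", "gap": (0, 0), "matches": 4},
    {"name": "c2", "ind": 5, "component": "DE", "gap": (0, 0), "matches": 9},
    {"name": "c3", "ind": 6, "component": "F", "gap": (1, 1), "matches": 10},
    {"name": "c4", "ind": 5, "component": "E", "gap": (0, 1), "matches": 5},
]
explored = children_to_explore(children, parent_matches=10, threshold=0.8)
print([c["name"] for c in explored])
```

Because Python's `sorted` is stable, children with identical (Ind, Amb, F) keep their input order, which is one acceptable way of ordering them arbitrarily.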

Complexity analysis
• Time complexity: the algorithm searches for all patterns conserved in at least $k$ of $n$ sequences with average length $l$, where the class of patterns $C$ is given by a set of bounds $b = (A, P, L, W, F, N, FP)$. The time needed to analyse a pattern graph $G = (V, E)$ constructed from the $n - k + 1$ shortest sequences is $O(|E| \cdot P \cdot N)$, where $P = O(L)$ and $L = O(n \cdot l)$ is the total length of all sequences. The worst-case time complexity is exponential in the maximum pattern length $P$, which is the maximum depth of the search tree.
• Space complexity: the space needed to store the graph is $O\big((|E| \cdot g^2 / 8 + |V|) \cdot (W + 1 + N \cdot P)\big)$ bytes, where $g$ is the maximum number of generalisations of a pattern component.
