Colloque T.A.G – LAPTH Annecy – 8-10 novembre 2006 Analysis of biological sequences using Markov Chains and Hidden Markov Models Bernard PRUM, La genopole – Evry – France prum@genopole.cnrs.fr
Why Markov Models? A biological sequence: X = (X1, X2, … , Xn), where Xk ∈ A = { t , c , a , g } or {A C D E F G H I K L M N P Q R S T V W Y}. A very common tool for analyzing these sequences is the Markov Model (MM): P(Xk = v | Xj , j < k) = P(Xk = v | Xk–1), u, v ∈ A, denoted by π(u , v) if Xk–1 = u
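A minimal sketch (not the speaker's code) of how π(u, v) can be estimated from a sequence by counting adjacent letter pairs; the function name and the handling of unseen contexts are illustrative choices.

```python
from collections import defaultdict

def estimate_transitions(seq, alphabet="tcag"):
    """Maximum-likelihood estimate of pi(u, v) for a first-order Markov model."""
    counts = {u: defaultdict(float) for u in alphabet}
    for u, v in zip(seq, seq[1:]):            # all adjacent pairs (X_{k-1}, X_k)
        counts[u][v] += 1
    pi = {}
    for u in alphabet:
        total = sum(counts[u].values())
        pi[u] = {v: (counts[u][v] / total if total else 0.0) for v in alphabet}
    return pi

# e.g. pi = estimate_transitions("atgcgctggtggatc"); pi["g"]["c"] is the estimate of P(c | g)
```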
Why MM ? – 2 Example: RecBCD. In E. coli, a complex called RecBCD protects the cell against viruses by degrading incoming DNA. To avoid the destruction of the cell's own genome, a password gctggtgg (called chi) occurs along the genome: when RecBCD bumps into a chi site, it stops its degradation. For this to be efficient, the number of occurrences of chi must be much higher than the number predicted by a Markov model.
Parsimonious Markov Models When we model a sequence in order to find exceptional motifs or for annotation, we have to estimate the parameters of the model, and the more parameters we have, the worse the estimation. In a Markov Model of order m, there are 4^m predictors (the m-words), hence 3 × 4^m parameters. In the M2 model, there are 16 predictors and 48 parameters; in M5, there are 1024 predictors and 3072 parameters.
PMM – 2 A first restriction consists in taking into account only as much of the past as needed: we use a long past when the sequence shows that this is necessary, and a short past when the sequence allows the economy. These models are called VLMC = Variable Length Markov Chains. In this VLMC, there are 12 predictors: aa ca ga ta ac cc gc tc g at tt [gc]t, hence 36 parameters. Notation: [gc] denotes « g or c »; [act] denotes « a or c or t »
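As an illustrative sketch (helper names are hypothetical, not from the talk), a VLMC predicts the next letter from the unique predictor that matches the end of the past; a predictor such as [gc]t is encoded here as a tuple of letter-sets, oldest position first.

```python
def find_predictor(past, predictors):
    """Return the predictor whose context matches the end of `past`, or None."""
    for pred in predictors:
        k = len(pred)
        if len(past) >= k and all(past[-k + i] in pred[i] for i in range(k)):
            return pred
    return None

# The 12 predictors of the VLMC on the slide (illustrative encoding):
PREDICTORS = [tuple({c} for c in w) for w in
              ("aa", "ca", "ga", "ta", "ac", "cc", "gc", "tc", "at", "tt")]
PREDICTORS += [({"g", "c"}, {"t"}), ({"g"},)]   # [gc]t and the single-letter context g
```

Because these 12 contexts partition all possible pasts, exactly one predictor matches at each position of the sequence.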
PMM – 3 But it is not obvious that, for the prediction of Xk, Xk–p becomes less and less informative as p increases. • As an example (*), let us consider the 'jumper' model • P(Xt = v | past) = P(Xt = v | Xt–2) • (the dependence 'jumps' over Xt–1) • it corresponds to this tree (4 predictors, 12 parameters). (*) this model is not as academic as it seems: for example in a coding region (periodic model depending on the phase), the 2nd position in a codon strongly depends on the 2nd position in the previous codon (cf. hydrophobicity)
PMM – 4 These models are called PMM = Parsimonious Markov Models More general (?) example : In this PMM there are 8 predictors (24 parameters) : a[ac] c[ac] g[ac] t[ac] g at [cg]t tt
PMM – 5 More precisely: in the tree of predictors (*), below any node all the partitions of A = { t , c , a , g } may appear. Hence there are 15 possibilities below each node (15 is the number of partitions of a 4-element set). (*): the different predictors appear in this tree as the paths from the leaves to the root.
PMM – 6 A Parsimonious Markov Model (PMM) is defined by • such a dependence tree t • for each leaf (= for each predictor) a law on A, e.g. P(Xt = u | Xt–1 = [tc]), P(Xt = u | Xt–1 = a, Xt–2 = c), P(Xt = u | Xt–1 = a, Xt–2 = [tag]), P(Xt = u | Xt–1 = g)
PMM – 7 We will only work with finite-order PMM: the longest predictor contains, say, m letters (the depth of the tree is m). Obviously a PMM of order m is an MM of order m. Note: the number of PMMs increases very quickly with m: in the 4-letter alphabet and for m = 5 there are some 10^85 trees. ––––––– Notations: • t denotes a tree of predictors • W its set of predictors (the leaves of t) • for w ∈ W, θw,u = P(Xt = u | w)
Statistics on PMM For a fixed tree t, the likelihood is obviously L(θ) = … ∏w ∏u θw,u^N(wu) (the dots correspond to the first letters in the sequence; we will not care about them today). This leads to the classical MLE θ̂w,u = N(wu) / N(w+), where N(w+) = ∑v N(wv). The difficulty arises when we want to choose the tree: a problem of model choice (among, for example, 10^85 models).
Statistics – 2 Therefore we adopt a Bayesian approach. A priori law: • on the tree, let us choose the uniform law (it can be changed) • on the transition parameters, it is natural to choose a Dirichlet law, which is conjugate: if, for w ∈ W, the prior is P(θw,•) ∝ ∏u θw,u^a(w,u), then, a posteriori, P(θw,• | X) ∝ ∏u θw,u^(a(w,u) + N(wu)). The MAP estimator of θw,u remains the same as before, except that N(wu) has to be changed into N'(wu) = N(wu) + a(w,u).
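A minimal sketch of this estimation step, assuming counts N(wu) have already been collected per predictor w; the pseudocounts a(w, u) and the example numbers are illustrative.

```python
def map_estimate(counts, pseudo=None):
    """MAP (or MLE when pseudo is None): theta_{w,u} = N'(wu) / N'(w+)."""
    theta = {}
    for w, nw in counts.items():
        nprime = {u: n + (pseudo[w][u] if pseudo else 0.0) for u, n in nw.items()}
        total = sum(nprime.values())
        theta[w] = {u: nprime[u] / total for u in nprime}
    return theta

# counts = {"[gc]t": {"a": 12, "c": 30, "g": 7, "t": 21}, ...}     # illustrative numbers
# theta = map_estimate(counts, pseudo={"[gc]t": {"a": 1, "c": 1, "g": 1, "t": 1}, ...})
```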
Statistics – 3 The use of the Bayes formula then gives as a posterior law on the trees ln P(t | X) = ∑ S(w), where the sum is taken over all the predictors w in the tree t, and S(w) = ∑u ln Γ(N(wu)) – ln Γ(∑u N(wu)) (Γ is such that Γ(k+1) = k!, k ∈ ℕ). Writing the posterior law in this way shows that P(t | X) may be maximized in a recursive way.
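A hedged sketch of this score: S(w) is computed with the log-gamma function, and the (unnormalized log-) posterior of a tree is the sum of S(w) over its predictors. In practice the pseudocount-corrected counts N'(wu) keep the arguments of Γ strictly positive.

```python
import math

def S(counts_w):
    """S(w) = sum_u ln Gamma(N(wu)) - ln Gamma(sum_u N(wu)); counts must be > 0."""
    return (sum(math.lgamma(n) for n in counts_w.values())
            - math.lgamma(sum(counts_w.values())))

def log_posterior(tree_counts):
    """ln P(t | X), up to an additive constant: sum of S(w) over the predictors of t."""
    return sum(S(counts_w) for counts_w in tree_counts.values())
```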
Application to real genomes We fitted MM and PMM for the orders m = 3, 4 and 5 • on the set of the 224 complete bacterial genomes published to date • on their coding regions (CDS). To compare the adequacy of these models, we computed the BIC criterion for each model M: BIC(M) = 2 L(M) – nb_param(M) · ln n, where L(M) is the maximized log-likelihood. "The higher BIC, the better the model"
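A one-line sketch of the slide's BIC convention (higher is better); the argument names are illustrative.

```python
import math

def bic(log_likelihood, n_params, n):
    """BIC(M) = 2 L(M) - nb_param(M) * ln n, with L the maximized log-likelihood."""
    return 2.0 * log_likelihood - n_params * math.log(n)

# e.g. compare bic(loglik_pmm, 24, n) with bic(loglik_mm, 48, n) for the models above
```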
Application – 2 This picture plots BIC(PMM) – BIC(MM) against the size of the bacterial genome. For all the bacteria, the PMM fits better than the classical MM.
Approach using FDA Recent results (Gregory Nuel) concern the use of «Finite (Deterministic) Automata» in the statistics of words or patterns. To a word, we may associate an FDA. Example 1: on {a, b}, w = aaab; the states correspond to the prefixes of w: ε, a, aa, aaa, aaab. This can be generalized if "one" word w is replaced by a motif (a finite family of words) or even a language.
Approach using FDA – 2 This automaton is especially dedicated to the study of the word w (the motif, ...): if we "run" a sequence on this graph, the automaton counts the occurrences of w (the motif, ...). It turns out to be VERY efficient: "wordcount", a program in EMBOSS, needs 4352 seconds to count the occurrences of all 12-words in E. coli; Nuel's program achieves this task in 9.86 seconds. The PROSITE motif [LIVMF]GE.[GAS][LIVM].(5-11)R[STAQ]A.[LIVAM].[STACV] (some 10^12 words) is treated by an FDA of 329 (30) states in M0, 1393 (78) states in M1.
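A minimal sketch of the underlying idea (not Nuel's implementation): for a single word w, take as states the prefixes of w; each transition keeps the longest suffix of the text read so far that is still a prefix of w, so one pass over the sequence counts all (overlapping) occurrences.

```python
def build_word_automaton(w, alphabet="tcag"):
    """delta[q][a] -> next state; state q means 'the last q letters read are w[:q]'."""
    delta = []
    for q in range(len(w) + 1):
        row = {}
        for a in alphabet:
            s = w[:q] + a
            while s and not w.startswith(s):   # fall back to the longest usable suffix
                s = s[1:]
            row[a] = len(s)
        delta.append(row)
    return delta

def count_occurrences(seq, w):
    delta, q, hits = build_word_automaton(w), 0, 0
    for a in seq:
        q = delta[q][a]
        if q == len(w):                        # accepting state: an occurrence ends here
            hits += 1
    return hits

# count_occurrences("gctggtggctggtgg", "gctggtgg") -> 2 (overlapping occurrences of chi)
```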
Approach using FDA – 3 If the sequence X is a Markov chain, we then have another MC running on this graph. Even for "rather complicated motifs", this allows one to obtain the law of "all" statistics of words: - the exact law of the first occurrence of a motif (taking into account the "starting point"), - the exact law of the number of occurrences of the motif, - in particular the expectation and variance of these laws, opening the possibility of Gaussian, Poissonian, ... approximations (and an exhaustive study of the quality of these approximations), - the law of a motif M conditional on the number of occurrences of another motif M'.
2nd Part: Hidden Markov Models
Hidden Markov Models An important criticism against Markov modelling is its stationarity: a well-known theorem says that, under weak conditions, P(Xk = u) → µ(u) (when k → ∞), and the rate of convergence is exponential. But biological sequences are not homogeneous: there are g+c-rich segments / g+c-poor segments (isochores), and one may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts. Is it possible to take advantage of this problem and to develop a tool for the analysis of heterogeneity? => annotation
HMM – 2 Suppose that d states alternate along the sequence: Sk = 1, Sk = 2, Sk = 1, Sk = 2, Sk = 1, … and that in each state we have a MC: if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u ; v); if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u ; v); and (more technical than biological - see HSMM) P(Sk = y | Sk–1 = x) = π0(x ; y). Our objectives: • Estimate the parameters π1, π2, π0 • Allocate a state {1, 2} to each position
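A small simulation sketch of this two-state model (parameter names follow the slide; the starting state and the first letter are chosen arbitrarily, an assumption of this sketch).

```python
import random

def simulate_hmm_m1(n, pi0, pi1, pi2, alphabet="tcag", seed=0):
    """Simulate (S, X): S is the hidden 2-state chain (pi0); X is an M1 chain whose
    transition matrix is pi1 while S = 1 and pi2 while S = 2 (dict-of-dict matrices)."""
    rng = random.Random(seed)

    def draw(dist):
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    s, x = 1, rng.choice(alphabet)              # arbitrary start
    states, seq = [s], [x]
    for _ in range(n - 1):
        s = draw(pi0[s])                        # hidden move   S_{k-1} -> S_k
        x = draw(pi1[x] if s == 1 else pi2[x])  # emission step X_{k-1} -> X_k
        states.append(s)
        seq.append(x)
    return states, "".join(seq)
```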
HMM – 3 Use the likelihood! L(θ) = ∑ µ0(S1) µS1(X1) ∏k π0(Sk–1, Sk) πSk(Xk–1, Xk) (n terms, the length of the sequence), where the sum is taken over all possibilities S1 S2 … Sn; there are d^n terms (d = number of states). Despair!!! 2^10 000 ≈ 10^3 000
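The explicit sum is intractable, but it factorizes along the chain: the forward recursion computes the same likelihood with O(n · d^2) operations instead of d^n terms. A hedged sketch for the two-state model above (the law of X1 is taken uniform here, an assumption of the sketch, and the recursion is rescaled at each step to avoid underflow).

```python
import math

def forward_loglik(seq, pi0, emit, mu0):
    """ln L(theta) by the forward recursion; emit[s] is the M1 transition matrix
    used while in state s, mu0 the initial law of S_1."""
    states = list(mu0)
    alpha = {s: mu0[s] * 0.25 for s in states}       # P(X_1, S_1 = s), uniform law of X_1
    p1 = sum(alpha.values())                         # P(X_1)
    loglik = math.log(p1)
    alpha = {s: a / p1 for s, a in alpha.items()}
    for prev, cur in zip(seq, seq[1:]):
        alpha = {s: sum(alpha[r] * pi0[r][s] for r in states) * emit[s][prev][cur]
                 for s in states}
        scale = sum(alpha.values())                  # P(X_k | X_1 ... X_{k-1})
        loglik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
    return loglik

# e.g. forward_loglik(seq, pi0, emit={1: pi1, 2: pi2}, mu0={1: 0.5, 2: 0.5})
```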
Searching nucleosome positions In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds round "beads" forming a collar (≈ 10 nm). Each bead is called a nucleosome. Its core is a complex involving 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is roughly equal to the weight of the DNA.
Curvature within curvature The DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each made of about 10 nucleotides. Total = 146 nt within each nucleosome. Depending on the position ("in" vs "out"), the curvature satisfies different constraints.
Nuc and "no-nuc" states Trifonov (99) as well as Rando (05) underline that there are 'no' nucleosomes in the gene promoters (accessibility). They introduce, "before" the nucleosome, a "no-nucleosome" state (hidden states 1, 2, …, 70 for the nucleosome core, plus a no-nuc / spacer state). Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999); Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)
Bendability Following an idea (Baldi, Lavery, ...), we introduce an index of bendability; it depends on the succession of 2, 3, 4, ... nucleotides (di-, tri-, tetra-nucleotides). (Figure: the duplex gatgat / ctacta and its bending angle θ.)
PNUC table There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...). We used PNUC-3, e.g. PNUC(cga) = 8.3 and PNUC(tcg) = 8.3 (*).
Rows: 1st letter; columns: 2nd letter; one block per 3rd letter.
              2nd letter:   a     t     g     c
3rd letter = a
  1st = a                  0.0   2.8   3.3   5.2
  1st = t                  7.3   7.3  10.0  10.0
  1st = g                  3.0   6.4   6.2   7.5
  1st = c                  3.3   2.2   8.3   5.4
3rd letter = t
  1st = a                  0.7   0.7   5.8   5.8
  1st = t                  2.8   0.0   5.2   3.3
  1st = g                  5.3   3.7   5.4   7.5
  1st = c                  6.7   5.2   5.4   5.4
3rd letter = g
  1st = a                  5.2   6.7   5.4   5.4
  1st = t                  2.2   3.3   5.4   8.3
  1st = g                  5.4   6.5   6.0   7.5
  1st = c                  4.2   4.2   4.7   4.7
3rd letter = c
  1st = a                  3.7   5.3   7.5   5.4
  1st = t                  6.4   3.0   7.5   6.2
  1st = g                  5.6   5.6   8.2   8.2
  1st = c                  6.5   5.4   7.5   6.0
(*) Goodsell, Dickerson, NAR 22 (1994)
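As a hedged sketch of how such a table can be used (an assumed scanning scheme, not necessarily the authors' exact procedure): average the PNUC value of every trinucleotide in a sliding window. Only a handful of entries from the table above are reproduced, and the window length is an arbitrary choice.

```python
# A few PNUC-3 entries read off the table above (Goodsell & Dickerson, NAR 22, 1994);
# the remaining 58 trinucleotides should be filled in the same way.
PNUC = {"cga": 8.3, "tcg": 8.3, "aaa": 0.0, "ttt": 0.0, "tga": 10.0, "tca": 10.0}

def bendability_profile(seq, window=30, table=PNUC):
    """Average trinucleotide bendability in a sliding window (missing entries count as 0)."""
    tri = [table.get(seq[i:i + 3], 0.0) for i in range(len(seq) - 2)]
    return [sum(tri[i:i + window]) / window for i in range(len(tri) - window + 1)]
```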
Sometimes it works: scan of K3 of yeast
What about positions ? We represent (*) parts of the chromosome K3 of yeast. The green curve ("proba" of the no-nuc state) increases between genes (promoters). The red curve ("proba" of the nucleosome state) appears periodically in genes. (*) using the software MuGeN, by Mark Hoebeke
Acknowledgements Labo «Statistique et Génome» Labo MIG – INRA Christopha AMBROISE Philippe BESSIÈRE Maurice BAUDRY François RODOLPHE Etienne BIRMELE Sophie SCHBATH Cécile COT Élisabeth de TURCKHEIM Emmanuelle DELLA-CHIESA Mark HOEBEKE Mickael GUEDJ François KÉPÈS Labo AGRO Sophie LEBRE Jean-Noël BACRO Catherine MATIAS Jean-Jacques DAUDIN Vincent MIELE Stéphane ROBIN Florence MURI-MAJOUBE Grégory NUEL Franck PICARD Lab’ Rouen Hugues RICHARD Dominique CELLIER Anne-Sophie TOCQUET Sabine MERCIER Nicolas VERGNE Sec : Michèle ILBERT