Markov Models and HMM in Genome Analysis

Journée Statistique et Santé Paris 5 – 5 mai 2006 Markov Models and HMMin Genome Analysis Bernard PRUM La genopole – Evry – France prum@genopole.cnrs.fr

Why Markov Models ? A biological sequence : X = (X1, X2, … , Xn) where XkA = { t , c , a , g } or {A C D E F G H I K L M N P Q R S T V W Y} A very common tool for analyzing these sequences is the Markov Model (MM) P(Xk = v | Xj , j < k) = P(Xk = v | Xk – 1) u, v A denoted by π(u , v) if Xk – 1 = u

A complex, called Rec BCD, protects the cell against viruses Why MM ? – 2 Exemple : Rec BCD E. coli viruses chi own bacteria genome To avoid the destruction of the genome of the cell, along the genome exists a passwordgctggtgg (it is called chi). When rec BCD bumps into the chi, it stops its destruction. In order to be efficient the number of occurrences of the chi is much higher that the number predicted in a Markov model.

Why MM ? – 3 The « exceptional character» of a word depends on • the frequency of each letter  model M0 (Bernoulli) Exemple HIV1 t 2164 c 1773 a 3410 g 2370

Why MM ? – 3 The « exceptional character» of a word depends on • the frequency of each letter  model M0 (Bernoulli), • the frequency of 2-words tt , tc ta, tg, … gg  model M1 (Markov) Exemple HIV1 t c a g t 548 342 684 590 2164 c 470 413 795 95 1773 a 713 561 1112 1024 3410 g 432 457 820 661 2370

Results MM

Hidden Markov Models An important criticism against Markov modelization is its stationarity: a well known theorem says that, under weak conditions, P(Xk = u)  µ(u) (when k  ∞) (and the rate of convergence is exponential.) But biological sequences are not homogeneous. There are g+c rich segments / g+c poor segments (isochores). One may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts. Is it possible to take avantage of this problem and to develop a tool for the analysis of heterogeneity ? => annotation

HMM – 2 Suppose that d states alternate along the sequence Sk = 1 Sk = 2Sk = 1Sk = 2 Sk = 1 And in each state we have a MC : if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u ; v) if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u ; v) and (more technical than biological - see HSMM) P(Sk = y | Sk–1 = x) = π0(u ; v) • Estimate the parameters π1, π2, π0 • Allocate a state {1, 2} to each position Our objectives

HMM – 3 ¡ Use the likelihood !! L(q) = ∑ µ0(S1) µS (X1) .... 1 ...∏ π0(Sk-1,Sk) πk (Xk-1,Xk) n terms (length of the sequence) over all possibilities S1S2...Sn ; there are sn terms Désespoir !!! 210 000 = 103 000

back to HMM Step n° 1 We do not know what is S= 1 and what is Sk = 2 We make an arbitrary allocation of these states : Sk = 1 Sk = 2Sk = 1Sk = 2 Sk = 1 “Knowing” the states, it is obvious to compute the parameters exple : π1(c,g) = % of the c which are followed by a g in green parts. π0(•,•) = % of • which are followed by a •

Step n° 2 Baum & Churchill formula Define the predictive probability ak(v) = P(Sk = v | X1k-1) the filtragee probability bk(v) = P(Sk = v | X1k) the estimated probability jk(v) = P(Sk = v | X1n) Bayes formula Forward (k = 2 to n) ak(v) = ∑ubk-1(u) π0 (u, v) ak-1(u) πu (Xk-2, Xk-1) bk-1(u) = ∑w ak-1(w) πw (Xk-2, Xk-1) Backward (k = n to 2) jk(v) jk-1(v) = bk-1(u) ∑v π0 (u, v) bk(v)

and Sk ? Bad idea : Sk = arg maxjk(v) (*) First good idea : keep the distribution jk(v) [EM] Second good idea : draw Sk according to jk(v) [SEM] (*) except, may be at the end of the algorithm (freezing)

Annotation

SHMM •• • • ( ) New defect : transition between states is described using a Markov model 1 - p p q 1 - q π0 = Consequence : the length of segments of ‘a given colour’ are r.v. ~ Expo(–)

SHMM Does not correspond to the reality ! ! Histograms of ‘biological segments‘ (after smoothing) look more like density g(h) h It is easy to make the probability of leaving the state depend on h to get the suitable law : g(h) p(h) = 1 - G(h)

Searching nucleosomes What are nucleosomes ? In eukaryotes (only), an important part of the chromosomes forms chromatine, a state where the double helix winds round “beads“ forming a collar : | | | 10 nm Each bead is called a nucleosome. Its core is a complex involving 8 proteins (an octamer) called histone(H2A, H2B, H3, H4). DNA winds twice this core and is locked by an other histone (H1). The total weight of the histones is ± equal to the weight of the DNA.

Nucleosomes Curvature within curvature Back to one nucleosome : the DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made with about 10 nucleotides. Total = 146 nt within each nucleosome. Depending on the position (“in”vs “out”) the curvature satisfies different constraints

Bendability Following an idea (Baldi, Lavery,...) we introduce an indice of bendability ; it depends on succession of 2, 3, 4, ...di-nucleotides. g a t t g a c t a c t a q

PNUC table 2nd letter a t g c a 0.0 2.8 3,3 5.2 a t 7,3 7,3 10,0 10,0 a g 3,0 6,4 6,2 7,5 a c 3,3 2,2 8,3 5,4 a a 0.7 0.7 5,8 5,8 t t 2.8 0.0 5.2 3,3 t g 5.3 3.7 5.4 7,5 t c 6,7 5.2 5.4 5.4 t 1rst letter 3rd letter a 5.2 6,7 5.4 5.4 g t 2,2 3,3 5.4 8,3 g g 5.4 6,5 6,0 7,5 g c 4,2 4,2 4,7 4,7 g a 3.7 5.3 7,5 5.4 c t 6,4 3,0 7,5 6,2 c g 5.6 5.6 8.2 8.2 c c 6,5 5.4 7,5 6,0 c There exist various tables which indicate the bendability of di-, tri or even tetra-nucleotides (PNUC, DNase, ...) We used PNUC-3 : PNUC(cga) = 8,3 PNUC(tcg) = 8,3 (*) Goodsell, Dickerson, NAR 22 (1994)

HMM for nucleosomes Better : consider a different state for ”each” position in the nucleosome (say 146 states) 1 2 3 . . . 145 146 and the repetition of r (say 4) identical states for the between-nucleosome regions (= spacer). These brother states give to the law of the length of the b.n. regions a Gamma form which is not geometrical ! !

A no-nuc state Trifonov (99) as well as Rando (05) underline that there are ‘no‘ nucleosome in the gene promotors (accessibility) The introduce “before“ nucleosome a “no-nucleosme” state. 1 2 . . . 70 nucleosome core no-nuc spacer Ioshikhes, Trifonov, Zhang Proc. Natl Acad. Sc. 96 (1999) Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)

Acknowledgements Labo «Statistique et Génome» Labo MIG – INRA Maurice BAUDRY Philippe BESSIÈRE Etienne BIRMELE François RODOLPHE Cécile COT Sophie SCHBATH Mark HOEBEKE Élisabeth de TURCKHEIM Mickael GUEDJ François KÉPÈS Sophie LEBRE Labo AGRO Catherine MATIAS Jean-Noël BACRO Vincent MIELE Jean-Jacques DAUDIN Florence MURI-MAJOUBE Stéphane ROBIN Grégory NUEL Hugues RICHARD Anne-Sophie TOCQUET Lab’ RouenDominique CELLIER Nicolas VERGNE Dominique CELLIER Sec : Michèle ILBERT Sabine MERCIER

Markov Models and HMM in Genome Analysis

Markov Models and HMM in Genome Analysis

Presentation Transcript

Hidden Markov Models

Markov Models and Simulations

INTRODUCTION TO HIDDEN MARKOV MODELS HMM

Hidden Markov Models HMM in Sequence Analysis

Markov Models

Probability and Markov Models

Hidden Markov Model (HMM) - Tutorial

Markov Models

Markov models and applications

Markov Models

Markov Models

Markov Models

Hidden Markov Model (HMM) Tagging

Combining Phylogenetic and Hidden Markov Models in Biosequence Analysis

Analysis of biological sequences using Markov Chains and Hidden Markov Models

Markov Chains and Hidden Markov Models

HMM-X Initial models

Hidden Markov Models

Markov Models and Applications

Markov Models

Markov Models and HMM in Genome Analysis