Uncovering Sequence Mysteries With Hidden Markov Models
Cédric Notredame
Our Scope
-Look once under the hood
-Understand the principle of HMMs
-Understand HOW HMMs are used in biology
Outline
-Reminder of Bayesian probabilities
-HMMs and Markov chains
-Application to gene prediction
-Application to Tm prediction
-Application to domain/protein family prediction
-Future applications
Conditional Probabilities And Bayes Theorem
I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times. Bayes
What is a Probabilistic Model?
Dice = probabilistic model
-Each possible outcome has a probability (1/6)
Biological questions:
-What kind of dice would generate coding DNA?
-What kind would generate non-coding DNA?
Which Parameters?
Dice = probabilistic model
Parameters: the probability of each outcome
-A priori estimation: 1/6 for each number
OR
-Through observation: measure frequencies on a large number of events
Which Parameters?
Model: intra/extra-cellular proteins
Parameters: the probability of each outcome
1- Make a set of Inside proteins using annotation
2- Make a set of Outside proteins using annotation
3- COUNT frequencies on the two sets
Model accuracy depends on the training set
Maximum Likelihood Models
Model: intra/extra-cellular proteins
1- Make a training set
2- Count frequencies
Model accuracy depends on the training set
Maximum likelihood model: the model MAXIMISES the probability of the data
Maximum Likelihood Models
Model: intra/extra-cellular proteins
Maximum likelihood model:
-The model MAXIMISES the probability of the data: P(Data | Model) is maximised
-AND the data MAXIMISES the probability of the model: P(Model | Data) is maximised
(| means GIVEN!)
Maximum Likelihood Models
Model: intra/extra-cellular proteins
Data: 11121112221212122121112221112121112211111
Maximum likelihood model: P(Coin | Data) < P(Dice | Data)
Conditional Probabilities
The probability that something happens IF something else ALSO happens
P(Win Lottery | Participation)
Conditional Probability
The probability that something happens IF something else ALSO happens
Dice 1: P(6 | Dice 1) = 1/6
Dice 2: P(6 | Dice 2) = 1/2 -> Loaded!
Joint Probability
The probability that something happens AND something else ALSO happens
P(6 | D1) = 1/6, P(6 | D2) = 1/2
P(6, D2) = P(6 | D2) * P(D2) = 1/2 * 1/100
The comma means AND
Joint Probability
Question: what is the probability of rolling a 6, given that the loaded dice (DL) is used 1% of the time?
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17
(vs 1/6 ≈ 0.167 for the fair dice alone)
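The computation above can be checked in a couple of lines of Python; a minimal sketch where the variable names are my own, and the numbers are the slide's:

```python
# Law of total probability: a 6 can come from the fair dice (DF)
# or the loaded dice (DL), which is used 1% of the time.
p_df, p_dl = 0.99, 0.01          # P(DF), P(DL)
p6_df, p6_dl = 1 / 6, 1 / 2      # P(6 | DF), P(6 | DL)

p6 = p6_df * p_df + p6_dl * p_dl
print(round(p6, 3))  # 0.17
```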
Joint Probability
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17 (vs 1/6 ≈ 0.167 for the fair dice alone)
Unsuspected heterogeneity in the training set leads to inaccurate parameter estimation.
Bayes Theorem
P(Xi | Y) = P(Y | Xi) * P(Xi) / Σi [ P(Y | Xi) * P(Xi) ]
X: Model or Data or any event
Y: Model or Data or any event
Bayes Theorem
P(X | Y) = P(Y | X) * P(X) / [ P(Y | X) * P(X) + P(Y | X̄) * P(X̄) ]
         = P(Y | X) * P(X) / [ P(Y, X) + P(Y, X̄) ]
         = P(Y | X) * P(X) / P(Y)
X: Model or Data or any event
Y: Model or Data or any event
(X̄ is the complement of X: X_total = X + X̄)
Bayes Theorem
P(X | Y) = P(Y | X) * P(X) / P(Y) = P(X, Y) / P(Y)
-P(Y | X) * P(X) = P(X, Y): proba of observing Y AND X simultaneously
-P(X | Y): proba of observing X IF Y is fulfilled
-'Remove' P(Y) to get P(X | Y)
X: Model or Data or any event
Y: Model or Data or any event
Using Bayes Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
We will use Bayes theorem to test our belief: if the dice was loaded (model), what would be the probability of this model given the data (three 6s in a row)?
Using Bayes Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
Occasionally Dishonest Casino…
P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2
Using Bayes Theorem
Question: the dice gave three 6s in a row (6^3). IS IT LOADED?!
P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2
With Y: 6^3 and X: D2, apply P(X | Y) = P(Y | X) * P(X) / P(Y):
P(D2 | 6^3) = P(6^3 | D2) * P(D2) / [ P(6^3 | D1) * P(D1) + P(6^3 | D2) * P(D2) ] = 0.21
Probably NOT
Posterior Probability
P(D2 | 6^3) = P(6^3 | D2) * P(D2) / [ P(6^3 | D1) * P(D1) + P(6^3 | D2) * P(D2) ] = 0.21
0.21 is a posterior probability: it was estimated AFTER the data was obtained.
P(6^3 | D2) is the likelihood of the hypothesis.
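The posterior above can be reproduced numerically; a minimal sketch using the slide's priors and likelihoods (variable names are my own):

```python
# Posterior P(D2 | three 6s) via Bayes theorem.
p_d1, p_d2 = 0.99, 0.01        # priors: P(D1) fair, P(D2) loaded
p6_d1, p6_d2 = 1 / 6, 1 / 2    # P(6 | D1), P(6 | D2)

lik_d1 = p6_d1 ** 3            # likelihood P(6^3 | D1)
lik_d2 = p6_d2 ** 3            # likelihood P(6^3 | D2)
posterior = lik_d2 * p_d2 / (lik_d1 * p_d1 + lik_d2 * p_d2)
print(round(posterior, 2))  # 0.21
```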
Debunking Headlines
Headline: 50% of crimes are committed by migrants.
Question: are 50% of migrants criminals?
P(Migrant) = 0.1, P(Criminal) = 0.0001, P(M | C) = 0.5
P(C | M) = P(M | C) * P(C) / P(M) = 0.5 * 0.0001 / 0.1 = 0.0005
NO: only 0.05% of migrants are criminals (NOT 50%!)
Debunking Headlines
Headline: 50% of gene promoters contain TATA.
Question: is TATA a good gene predictor?
P(TATA) = 0.1, P(Promoter) = 0.0001, P(T | P) = 0.5
P(P | T) = P(T | P) * P(P) / P(T) = 0.5 * 0.0001 / 0.1 = 0.0005
NO
Bayes Theorem
TATA = high Sensitivity / low Specificity
Bayes theorem reveals the trade-off between Sensitivity (finding ALL the genes) and Specificity (finding ONLY genes).
What is a Markov Chain?
Simple chain: one dice
-Each roll is the same
-A roll does not depend on the previous one
Markov chain: two dices
-You only roll ONE dice at a time: the fair OR the loaded
-The dice you roll only depends on the previous roll
What is a Markov Chain?
Biological sequences tend to behave like Markov chains.
Question/Example: is it possible to tell whether my sequence is a CpG island?
What is a Markov Chain?
Question: identify CpG island sequences
Old-fashioned solution:
-Slide a window of arbitrary size (the Captain's height…)
-Measure the % of CpG
-Plot it against the sequence
-Decide
[Figure: sliding-window average of CpG content plotted along the sequence]
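A minimal Python sketch of this sliding-window approach (the helper name, window size, and toy sequence are my own choices):

```python
# Fraction of CG dinucleotides inside a window slid along the sequence.
def cg_fraction(seq, window=10):
    scores = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        # count CG dinucleotides among the window's (window - 1) pairs
        cg = sum(chunk[j:j + 2] == "CG" for j in range(window - 1))
        scores.append(cg / (window - 1))   # one value per window position
    return scores

# CG-poor flanks, CG-rich middle: the scores peak over the island.
print(cg_fraction("ATATATATATCGCGCGCGCGATATATATAT"))
```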
What is a Markov Chain?
Question: identify CpG island sequences
Bayesian solution:
-Build a CpG Markov chain
-Run the sequence through the chain
-Compute the likelihood for the chain to produce the sequence
Transition Probabilities
States: A, C, G, T
Probability of a transition from G to C:
A_GC = P(x_i = C | x_i-1 = G)
P(sequence) = P(x_L, x_L-1, x_L-2, …, x_1)
Remember: P(X, Y) = P(X | Y) * P(Y)
In the Markov chain, x_L only depends on x_L-1:
P(sequence) = P(x_L | x_L-1) * P(x_L-1 | x_L-2) * … * P(x_1)
With A_xi-1 xi = P(x_i | x_i-1):
P(sequence) = P(x_1) * Π (i=2..L) A_xi-1 xi
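As a sketch, this product formula can be evaluated directly; the transition matrix A and the initial distribution below are illustrative values, not trained parameters:

```python
# P(sequence) = P(x1) * prod_{i=2..L} A[x_{i-1}][x_i]
A = {  # A[prev][cur] = P(cur | prev); toy values, each row sums to 1
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
}
start = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # P(x1)

def chain_probability(seq):
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(chain_probability("GCGC"))  # 0.25 * 0.3 * 0.3 * 0.3
```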
Arbitrary Beginning (B) and End states can be added to the chain (states: B, A, C, G, T).
By convention, only the Beginning state is added.
Adding an End state (E) with a transition probability τ defines length probabilities:
P(all the sequences of length L) = τ * (1-τ)^(L-1)
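A quick numeric check of this geometric length distribution (τ = 0.1 is an arbitrary choice of mine):

```python
# With an End state reached with probability tau after each symbol,
# P(length = L) = tau * (1 - tau)**(L - 1): a geometric distribution.
tau = 0.1

def length_probability(L):
    return tau * (1 - tau) ** (L - 1)

# Summing over (almost) all lengths gives ~1, as the next slide claims.
total = sum(length_probability(L) for L in range(1, 2000))
print(round(total, 6))  # 1.0
```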
The transitions are probabilities: the sum of the probabilities of all possible sequences of all possible lengths is 1.
Using Markov Chains To Predict
What is a Prediction?
Given a sequence, we want to know the probability that this sequence is a CpG island.
1- We need a training set: CpG+ sequences and CpG- sequences
2- We measure the transition frequencies and treat them like probabilities
What is a Prediction?
Is my sequence a CpG island?
2- Measure the transition frequencies and treat them like probabilities.
Transition GC: a G followed by a C, e.g. GCCGCTGCGCGA
A+_GC = N+_GC / Σ_X N+_GX
(the ratio between the number of GC transitions and all the transitions G->X, counted on the CpG+ set)
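Counting and normalising the transitions can be sketched as below; the function name is my own, and the single-sequence "training set" is just the slide's example string:

```python
# Maximum-likelihood estimate: count prev->cur transitions, then
# normalise each row by the total number of transitions leaving prev.
from collections import defaultdict

def train_transitions(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())           # all transitions prev -> X
        model[prev] = {cur: n / total for cur, n in row.items()}
    return model

plus_model = train_transitions(["GCCGCTGCGCGA"])
print(plus_model["G"])  # {'C': 0.8, 'A': 0.2}
```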
What is a Prediction?
Is my sequence a CpG island?
2- Measure the transition frequencies and treat them like probabilities.
Trained transition tables (rows: from, columns: to; each row sums to 1):

CpG+ (M+)    A     C     G     T
      A    0.18  0.27  0.42  0.12
      C    0.17  0.36  0.27  0.18
      G    0.16  0.33  0.37  0.12
      T    0.08  0.35  0.38  0.18

CpG- (M-)    A     C     G     T
      A    0.30  0.21  0.28  0.21
      C    0.32  0.30  0.08  0.30
      G    0.25  0.25  0.30  0.20
      T    0.17  0.24  0.29  0.29
What is a Prediction?
Is my sequence a CpG island?
3- Evaluate the probability for each of these models (M+ and M-, tables as on the previous slide) to generate our sequence:
P(seq | M+) = Π (i=1..L) A+_xi-1 xi
P(seq | M-) = Π (i=1..L) A-_xi-1 xi
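A standard way to compare the two models is the log-odds score log2(P(seq | M+) / P(seq | M-)): a positive score favours the CpG+ model. The sketch below uses the slide's (rounded) transition tables; the function and variable names are my own:

```python
# Log-odds classification: score(seq) = log2( P(seq|M+) / P(seq|M-) ),
# with P(seq|M) the product of the model's transition probabilities.
from math import log2

bases = "ACGT"
plus_rows = [[0.18, 0.27, 0.42, 0.12],   # from A
             [0.17, 0.36, 0.27, 0.18],   # from C
             [0.16, 0.33, 0.37, 0.12],   # from G
             [0.08, 0.35, 0.38, 0.18]]   # from T
minus_rows = [[0.30, 0.21, 0.28, 0.21],
              [0.32, 0.30, 0.08, 0.30],
              [0.25, 0.25, 0.30, 0.20],
              [0.17, 0.24, 0.29, 0.29]]
A_plus = {p: dict(zip(bases, row)) for p, row in zip(bases, plus_rows)}
A_minus = {p: dict(zip(bases, row)) for p, row in zip(bases, minus_rows)}

def log_odds(seq):
    return sum(log2(A_plus[p][c] / A_minus[p][c]) for p, c in zip(seq, seq[1:]))

print(log_odds("CGCGCG") > 0)  # True: CG-rich, looks like a CpG island
print(log_odds("ATATAT") > 0)  # False
```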