Uncovering Sequence Mysteries With Hidden Markov Models
Cédric Notredame
Our Scope
-Look once under the hood
-Understand the principle of HMMs
-Understand HOW HMMs are used in biology
Outline
-Reminder of Bayesian probabilities
-HMMs and Markov chains
-Application to gene prediction
-Application to Tm prediction
-Application to domain/protein family prediction
-Future applications
Conditional Probabilities And Bayes Theorem
I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times. Bayes
What is a Probabilistic Model?
Dice = probabilistic model
-Each possible outcome has a probability (1/6)
Biological questions:
-What kind of dice would generate coding DNA?
-What kind would generate non-coding DNA?
Which Parameters?
Dice = probabilistic model
Parameters: the probability of each outcome
-A priori estimation: 1/6 for each number
OR
-Through observation: measure frequencies on a large number of events
Which Parameters?
Model: intra/extra-cellular proteins
Parameters: the probability of each outcome
1- Make a set of Inside proteins using annotation
2- Make a set of Outside proteins using annotation
3- COUNT frequencies on the two sets
Model accuracy depends on the training set
Maximum Likelihood Models
Model: intra/extra-cellular proteins
1- Make a training set
2- Count frequencies
Model accuracy depends on the training set
Maximum likelihood model: the model MAXIMISES the probability of the data
Maximum Likelihood Models
Model: intra/extra-cellular proteins
Maximum likelihood model:
-The model MAXIMISES the probability of the data: P(Data | Model) is maximised
-AND the data MAXIMISES the probability of the model: P(Model | Data) is maximised
(| means GIVEN!)
Maximum Likelihood Models
Model: intra/extra-cellular proteins
Data: 11121112221212122121112221112121112211111
Maximum likelihood model: P(Coin | Data) < P(Dice | Data)
Conditional Probabilities
The probability that something happens IF something else ALSO happens
P(Win Lottery | Participation)
Conditional Probability
The probability that something happens IF something else ALSO happens
Dice 1: P(6 | Dice 1) = 1/6
Dice 2: P(6 | Dice 2) = 1/2 -> Loaded!
Joint Probability
The probability that something happens AND something else ALSO happens
P(6 | D1) = 1/6, P(6 | D2) = 1/2
P(6, D2) = P(6 | D2) * P(D2) = 1/2 * 1/100
The comma means AND
Joint Probability
Question: what is the probability of rolling a 6, given that the loaded dice (DL) is used 1% of the time?
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17
(vs 1/6 ≈ 0.167 for the fair dice alone)
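The computation above can be checked in a couple of lines of Python; a minimal sketch where the variable names are my own, and the numbers are the slide's:

```python
# Law of total probability: a 6 can come from the fair dice (DF)
# or the loaded dice (DL), which is used 1% of the time.
p_df, p_dl = 0.99, 0.01          # P(DF), P(DL)
p6_df, p6_dl = 1 / 6, 1 / 2      # P(6 | DF), P(6 | DL)

p6 = p6_df * p_df + p6_dl * p_dl
print(round(p6, 3))  # 0.17
```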
Joint Probability
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17 (vs 1/6 ≈ 0.167 for the fair dice alone)
Unsuspected heterogeneity in the training set leads to inaccurate parameter estimation.
Bayes Theorem
P(Xi | Y) = P(Y | Xi) * P(Xi) / Σi [ P(Y | Xi) * P(Xi) ]
X: Model or Data or any event
Y: Model or Data or any event
Bayes Theorem
P(X | Y) = P(Y | X) * P(X) / [ P(Y | X) * P(X) + P(Y | X̄) * P(X̄) ]
         = P(Y | X) * P(X) / [ P(Y, X) + P(Y, X̄) ]
         = P(Y | X) * P(X) / P(Y)
X: Model or Data or any event
Y: Model or Data or any event
(X̄ is the complement of X: X_total = X + X̄)
Bayes Theorem
P(X | Y) = P(Y | X) * P(X) / P(Y) = P(X, Y) / P(Y)
-P(Y | X) * P(X) = P(X, Y): proba of observing Y AND X simultaneously
-P(X | Y): proba of observing X IF Y is fulfilled
-'Remove' P(Y) to get P(X | Y)
X: Model or Data or any event
Y: Model or Data or any event
Using Bayes Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
We will use Bayes theorem to test our belief: if the dice was loaded (model), what would be the probability of this model given the data (three 6s in a row)?
Using Bayes Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
Occasionally Dishonest Casino…
P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2
Using Bayes Theorem
Question: the dice gave three 6s in a row (6^3). IS IT LOADED?!
P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2
With Y: 6^3 and X: D2, apply P(X | Y) = P(Y | X) * P(X) / P(Y):
P(D2 | 6^3) = P(6^3 | D2) * P(D2) / [ P(6^3 | D1) * P(D1) + P(6^3 | D2) * P(D2) ] = 0.21
Probably NOT
Posterior Probability
P(D2 | 6^3) = P(6^3 | D2) * P(D2) / [ P(6^3 | D1) * P(D1) + P(6^3 | D2) * P(D2) ] = 0.21
0.21 is a posterior probability: it was estimated AFTER the data was obtained.
P(6^3 | D2) is the likelihood of the hypothesis.
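The posterior above can be reproduced numerically; a minimal sketch using the slide's priors and likelihoods (variable names are my own):

```python
# Posterior P(D2 | three 6s) via Bayes theorem.
p_d1, p_d2 = 0.99, 0.01        # priors: P(D1) fair, P(D2) loaded
p6_d1, p6_d2 = 1 / 6, 1 / 2    # P(6 | D1), P(6 | D2)

lik_d1 = p6_d1 ** 3            # likelihood P(6^3 | D1)
lik_d2 = p6_d2 ** 3            # likelihood P(6^3 | D2)
posterior = lik_d2 * p_d2 / (lik_d1 * p_d1 + lik_d2 * p_d2)
print(round(posterior, 2))  # 0.21
```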
Debunking Headlines
Headline: 50% of crimes are committed by migrants.
Question: are 50% of migrants criminals?
P(Migrant) = 0.1, P(Criminal) = 0.0001, P(M | C) = 0.5
P(C | M) = P(M | C) * P(C) / P(M) = 0.5 * 0.0001 / 0.1 = 0.0005
NO: only 0.05% of migrants are criminals (NOT 50%!)
Debunking Headlines
Headline: 50% of gene promoters contain TATA.
Question: is TATA a good gene predictor?
P(TATA) = 0.1, P(Promoter) = 0.0001, P(T | P) = 0.5
P(P | T) = P(T | P) * P(P) / P(T) = 0.5 * 0.0001 / 0.1 = 0.0005
NO
Bayes Theorem
TATA = high Sensitivity / low Specificity
Bayes theorem reveals the trade-off between Sensitivity (finding ALL the genes) and Specificity (finding ONLY genes).
What is a Markov Chain?
Simple chain: one dice
-Each roll is the same
-A roll does not depend on the previous one
Markov chain: two dices
-You only roll ONE dice at a time: the fair OR the loaded
-The dice you roll only depends on the previous roll
What is a Markov Chain?
Biological sequences tend to behave like Markov chains.
Question/Example: is it possible to tell whether my sequence is a CpG island?
What is a Markov Chain?
Question: identify CpG island sequences
Old-fashioned solution:
-Slide a window of arbitrary size (the Captain's height…)
-Measure the % of CpG
-Plot it against the sequence
-Decide
[Figure: sliding-window average of CpG content plotted along the sequence]
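A minimal Python sketch of this sliding-window approach (the helper name, window size, and toy sequence are my own choices):

```python
# Fraction of CG dinucleotides inside a window slid along the sequence.
def cg_fraction(seq, window=10):
    scores = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        # count CG dinucleotides among the window's (window - 1) pairs
        cg = sum(chunk[j:j + 2] == "CG" for j in range(window - 1))
        scores.append(cg / (window - 1))   # one value per window position
    return scores

# CG-poor flanks, CG-rich middle: the scores peak over the island.
print(cg_fraction("ATATATATATCGCGCGCGCGATATATATAT"))
```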
What is a Markov Chain?
Question: identify CpG island sequences
Bayesian solution:
-Build a CpG Markov chain
-Run the sequence through the chain
-Compute the likelihood for the chain to produce the sequence
Transition Probabilities
States: A, C, G, T
Probability of a transition from G to C:
A_GC = P(x_i = C | x_i-1 = G)
P(sequence) = P(x_L, x_L-1, x_L-2, …, x_1)
Remember: P(X, Y) = P(X | Y) * P(Y)
In the Markov chain, x_L only depends on x_L-1:
P(sequence) = P(x_L | x_L-1) * P(x_L-1 | x_L-2) * … * P(x_1)
With A_xi-1 xi = P(x_i | x_i-1):
P(sequence) = P(x_1) * Π (i=2..L) A_xi-1 xi
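As a sketch, this product formula can be evaluated directly; the transition matrix A and the initial distribution below are illustrative values, not trained parameters:

```python
# P(sequence) = P(x1) * prod_{i=2..L} A[x_{i-1}][x_i]
A = {  # A[prev][cur] = P(cur | prev); toy values, each row sums to 1
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
}
start = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # P(x1)

def chain_probability(seq):
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(chain_probability("GCGC"))  # 0.25 * 0.3 * 0.3 * 0.3
```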
Arbitrary Beginning (B) and End states can be added to the chain (states: B, A, C, G, T).
By convention, only the Beginning state is added.
Adding an End state (E) with a transition probability τ defines length probabilities:
P(all the sequences of length L) = τ * (1-τ)^(L-1)
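A quick numeric check of this geometric length distribution (τ = 0.1 is an arbitrary choice of mine):

```python
# With an End state reached with probability tau after each symbol,
# P(length = L) = tau * (1 - tau)**(L - 1): a geometric distribution.
tau = 0.1

def length_probability(L):
    return tau * (1 - tau) ** (L - 1)

# Summing over (almost) all lengths gives ~1, as the next slide claims.
total = sum(length_probability(L) for L in range(1, 2000))
print(round(total, 6))  # 1.0
```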
The transitions are probabilities: the sum of the probabilities of all possible sequences of all possible lengths is 1.
Using Markov Chains To Predict
What is a Prediction?
Given a sequence, we want to know the probability that this sequence is a CpG island.
1- We need a training set: CpG+ sequences and CpG- sequences
2- We measure the transition frequencies and treat them like probabilities
What is a Prediction?
Is my sequence a CpG island?
2- Measure the transition frequencies and treat them like probabilities.
Transition GC: a G followed by a C, e.g. GCCGCTGCGCGA
A+_GC = N+_GC / Σ_X N+_GX
(the ratio between the number of GC transitions and all the transitions G->X, counted on the CpG+ set)
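Counting and normalising the transitions can be sketched as below; the function name is my own, and the single-sequence "training set" is just the slide's example string:

```python
# Maximum-likelihood estimate: count prev->cur transitions, then
# normalise each row by the total number of transitions leaving prev.
from collections import defaultdict

def train_transitions(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())           # all transitions prev -> X
        model[prev] = {cur: n / total for cur, n in row.items()}
    return model

plus_model = train_transitions(["GCCGCTGCGCGA"])
print(plus_model["G"])  # {'C': 0.8, 'A': 0.2}
```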
What is a Prediction?
Is my sequence a CpG island?
2- Measure the transition frequencies and treat them like probabilities.
Trained transition tables (rows: from, columns: to; each row sums to 1):

CpG+ (M+)    A     C     G     T
      A    0.18  0.27  0.42  0.12
      C    0.17  0.36  0.27  0.18
      G    0.16  0.33  0.37  0.12
      T    0.08  0.35  0.38  0.18

CpG- (M-)    A     C     G     T
      A    0.30  0.21  0.28  0.21
      C    0.32  0.30  0.08  0.30
      G    0.25  0.25  0.30  0.20
      T    0.17  0.24  0.29  0.29
What is a Prediction?
Is my sequence a CpG island?
3- Evaluate the probability for each of these models (M+ and M-, tables as on the previous slide) to generate our sequence:
P(seq | M+) = Π (i=1..L) A+_xi-1 xi
P(seq | M-) = Π (i=1..L) A-_xi-1 xi
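A standard way to compare the two models is the log-odds score log2(P(seq | M+) / P(seq | M-)): a positive score favours the CpG+ model. The sketch below uses the slide's (rounded) transition tables; the function and variable names are my own:

```python
# Log-odds classification: score(seq) = log2( P(seq|M+) / P(seq|M-) ),
# with P(seq|M) the product of the model's transition probabilities.
from math import log2

bases = "ACGT"
plus_rows = [[0.18, 0.27, 0.42, 0.12],   # from A
             [0.17, 0.36, 0.27, 0.18],   # from C
             [0.16, 0.33, 0.37, 0.12],   # from G
             [0.08, 0.35, 0.38, 0.18]]   # from T
minus_rows = [[0.30, 0.21, 0.28, 0.21],
              [0.32, 0.30, 0.08, 0.30],
              [0.25, 0.25, 0.30, 0.20],
              [0.17, 0.24, 0.29, 0.29]]
A_plus = {p: dict(zip(bases, row)) for p, row in zip(bases, plus_rows)}
A_minus = {p: dict(zip(bases, row)) for p, row in zip(bases, minus_rows)}

def log_odds(seq):
    return sum(log2(A_plus[p][c] / A_minus[p][c]) for p, c in zip(seq, seq[1:]))

print(log_odds("CGCGCG") > 0)  # True: CG-rich, looks like a CpG island
print(log_odds("ATATAT") > 0)  # False
```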