BCB 444/544 Lecture 17 Finish HMMs Protein Motifs & Domain Prediction #17_Oct01 BCB 444/544 F07 ISU Dobbs #17- Protein Motifs & Domain Prediction
Required Reading (before lecture)
Mon Oct 1 - Lecture 17: Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
Wed Oct 3 - Lecture 18: Protein Structure: The Basics (Note change in lecture schedule!)
• Chp 12 - pp 173-186
Thurs Oct 4 - Lab 6: Protein Structure: Databases & Visualization
Fri Oct 5 - Lecture 19: Protein Structure: Classification & Comparison
• Chp 13 - pp 187-199
Assignments & Announcements
• HW544Extra #1 - Due: Task 1.1 - Mon Oct 1 (today) by noon; Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
• HomeWork #3 - posted online; Due: Mon Oct 8 by 5 PM
BCB 544 - Extra Required Reading
Mon Sept 24 - BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
A few Online Resources for: Cell & Molecular Biology
• NCBI Science Primer: What is a cell?
  http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
• NCBI Science Primer: What is a genome?
  http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
• BioTech’s Life Science Dictionary
  http://biotech.icmb.utexas.edu/search/dict-search.html
• NCBI Bookshelf
  http://www.ncbi.nlm.nih.gov/sites/entrez?db=books
Statistics References
• Statistical Inference (Hardcover), George Casella, Roger L. Berger
• StatWeb: A Guide to Basic Statistics for Biologists
  http://www.dur.ac.uk/stat.web/
• Basic Statistics: http://www.statsoft.com/textbook/stbasic.html
  (correlations, tests, frequencies, etc.)
• Electronic Statistics Textbook: StatSoft
  http://www.statsoft.com/textbook/stathome.html
  (from basic statistics to ANOVA to discriminant analysis, clustering, regression, data mining, machine learning, etc.)
Extra Credit Questions #2-#6:
• What is the size of the dystrophin gene (in kb)? Is it still the largest known human protein?
• What is the largest protein encoded in the human genome (i.e., longest single polypeptide chain)?
• What is the largest protein complex for which a structure is known (for any organism)?
• What is the most abundant protein (naturally occurring) on earth?
• Which state in the US has the largest number of mobile genetic elements (transposons) in its living population?
For 1 pt total (0.2 pt each): Answer all questions correctly & submit to terrible@iastate.edu
For 2 pts total: Prepare a PPT slide with all correct answers & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
Choose one option - you can't earn 3 pts!
Partial credit for incorrect answers? Only if they are truly amusing!
Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH=7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute):
7a. How many rounds of meiosis will occur during our 50 minute class period?
7b. How many total sperm will be produced by our BCB 444/544 class during that class period?
8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH=5)
For 0.6 pts total (0.2 pt each): Answer all questions correctly & submit to terrible@iastate.edu
For 1 pt total: Prepare a PPT slide with all correct answers & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
Choose one option - you can't earn more than 1 pt for this!
Partial credit for incorrect answers? Only if they are truly amusing!
Answers?
Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST
TODAY:
• Profiles
• Markov Models & Hidden Markov Models
Statistical Models for Representing Biological Sequences
3 types of probabilistic models, all of which:
• Are based on an MSA
• Capture both observed frequencies & predicted frequencies of unobserved characters
In order of increasing "sensitivity":
• PSSM - scoring table derived from an ungapped MSA; stores log-odds scores (derived from frequencies) for each amino acid at each position of a protein sequence
• Profile - a PSSM with gaps: based on a gapped MSA, with penalties for insertions & deletions
• HMM - hidden Markov model - more complex mathematical model (than PSSMs or profiles) because it also differentiates between insertions and deletions
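A minimal sketch of how a PSSM could be built from an ungapped MSA, to make the "log-odds score per position" idea concrete. The function names, the DNA alphabet, the uniform background, and the pseudocount of 1 are all assumptions chosen for illustration; they are not taken from the lecture or from any particular tool.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Return one {symbol: log2-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):  # iterate over columns of the ungapped MSA
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for symbol in alphabet:
            # the pseudocount gives unobserved characters a non-zero frequency
            freq = (counts[symbol] + pseudocount) / total
            scores[symbol] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_sequence(pssm, seq):
    """Sum the position-specific scores for a sequence of the same length."""
    return sum(column[symbol] for column, symbol in zip(pssm, seq))

alignment = ["ACG", "ACG", "ACT"]  # toy ungapped alignment
pssm = build_pssm(alignment)
print(score_sequence(pssm, "ACG"), score_sequence(pssm, "TTT"))
```

A sequence that resembles the alignment scores above zero; an unrelated one scores below zero. Handling gaps (profiles) or distinguishing insertions from deletions (HMMs) needs more machinery than this table provides.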
Sequence Motifs (Patterns)
Types of representations:
• √ Consensus Sequences
• √ Sequence Logos
• √ PSSMs - Position-Specific Scoring Matrices
• √ Profiles
• HMMs - Hidden Markov Models
HMM example: CpG Islands
[Table: nucleotide frequencies in human genome]
CpG Islands
(Written CpG to distinguish from a C≡G base pair)
• CpG dinucleotides are rarer than would be expected from independent probabilities of C and G (given the background frequencies in human genome)
• High CpG frequency is sometimes biologically significant; e.g., sometimes associated with promoter regions (“start sites” for genes)
• CpG island - a region where CpG dinucleotides are much more abundant than elsewhere
How can we represent or model CpG islands?
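Before building an HMM, the "rarer than expected" statement can be checked directly with an observed/expected CpG ratio. The sketch below uses a commonly used form of that ratio (CpG count times length, divided by C count times G count); the function name and the toy sequences are illustrative assumptions, not material from the lecture.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    return cpg_count * n / (c_count * g_count)

# Toy examples: a CpG-rich stretch versus a sequence with no CpG at all
print(cpg_obs_exp("CGCGCG"))
print(cpg_obs_exp("AATT"))
```

A ratio near 1 means CpG occurs about as often as independent C and G frequencies predict. Values well below 1 correspond to the depletion described above, and CpG islands stand out as windows where the ratio is much higher.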
Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• Observed variables
• Hidden variables
• Emitted symbols
• Emission probabilities
• Transition probabilities
• Graphical representation to illustrate relationships among these
Different Types of Markov Models
Zero-order Markov Model: probability of current state is independent of previous state(s)
e.g., random sequence, each residue with equal frequency
First-order MM: probability of current state is determined by the previous state
e.g., frequencies of two linked residues (dimers) occurring simultaneously
Second-order MM: probability of current state is determined by the previous two states
e.g., frequencies of three linked residues (trimers) occurring simultaneously, as in a codon
Higher orders? Also possible, later…
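To make the first-order case concrete, the sketch below estimates transition probabilities from dinucleotide counts in a single sequence. The function name and the toy sequence are hypothetical; a real estimate would use far more data and usually pseudocounts.

```python
from collections import Counter, defaultdict

def first_order_transitions(seq):
    """Estimate P(current | previous) from adjacent-pair counts."""
    counts = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):  # every adjacent pair (dimer)
        counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(row.values()) for cur, n in row.items()}
        for prev, row in counts.items()
    }

probs = first_order_transitions("ACGCA")
# Pairs seen: AC, CG, GC, CA, so C is followed by G half the time and by A half the time
print(probs["C"])
```

A second-order model would condition on the previous two symbols instead, which amounts to counting trimers rather than dimers.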
But, What is a Markov Model?
Markov Model (or Markov chain)
= mathematical model used to describe a sequence of events that occur one after another in a chain
= a process that moves in one direction from one state to the next with a certain transition probability
For biological sequences:
• each letter = state
• states are linked together by transition probabilities
So, What is a hidden Markov Model?
Hidden Markov Model (HMM)
• a more sophisticated model in which some of the states are hidden
• some "unobserved" factors influence the state transition probabilities
• a MM which combines 2 or more Markov chains:
  • only 1 chain is made up of observed states
  • other chains are made up of unobserved or "hidden" states
HMMs for Biological Sequences?
• HMMs originally developed for speech recognition
• Now widely used in bioinformatics
• Many applications (motif/domain detection, sequence alignment, phylogenetics)
• HMMs are "machine learning" algorithms - must be "trained" to obtain optimal statistical parameters
• For biological sequences:
  • each character of a sequence is considered a state in a Markov process
Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• States - composed of a number of elements or "symbols" (e.g., A,C,G,T)
• Observed variables - sequence (or outcome) we can "see"
• Hidden variables - insertions/deletions/transition probabilities that can't be "seen"
• Emission probability - probability value associated with each "symbol" in each state
• Transition probability - probability of going from one state to another
• Special graphical representation used to illustrate relationships
HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction
The Occasionally Dishonest Casino
A casino uses a fair die most of the time, but occasionally switches to a "loaded" one
• Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
• These are emission probabilities
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
• Transitions between states obey a Markov process: a linear chain of events linked by probability values such that the occurrence of one event (state) depends on the occurrence of previous event(s) or state(s)
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
The Occasionally Dishonest Casino
• Known:
  • Structure of the model
  • Transition probabilities
• Hidden: What casino actually did
  • FFFFFLLLLLLLFFFF...
• Observable: Series of die tosses
  • 3415256664666153...
• What we must infer:
  • When was a fair die used?
  • When was a loaded one used?
  • Answer is a sequence: FFFFFFFLLLLLLFFF...
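The split between hidden and observable can be sketched by simulating the casino. The emission values and the two switching probabilities are those on the slides; the stay probabilities (0.99 and 0.8) are their complements. Starting with the fair die, the seed, and all names are assumptions made for illustration.

```python
import random

# Transition and emission probabilities of the casino model
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}

def simulate(n, seed=0, start="F"):
    """Return (hidden state path, observed rolls) for n die tosses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    state, path, rolls = start, [], []
    for _ in range(n):
        path.append(state)
        faces = list(EMIT[state])
        rolls.append(rng.choices(faces, weights=[EMIT[state][f] for f in faces])[0])
        nxt = list(TRANS[state])
        state = rng.choices(nxt, weights=[TRANS[state][s] for s in nxt])[0]
    return "".join(path), "".join(rolls)

path, rolls = simulate(60)
print(path)   # hidden: what the casino actually did
print(rolls)  # observable: the series of die tosses
```

Only the second printed line would be available to a player; the inference problem is to recover something close to the first line from it.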
HMM: Making the Inference
• Model assigns a probability to each explanation for the observation, e.g.:
  P(326|FFL) = P(3|F) · P(F→F) · P(2|F) · P(F→L) · P(6|L) = 1/6 · 0.99 · 1/6 · 0.01 · ½
• Maximum Likelihood: Determine which explanation is most likely
  • Find path most likely to have produced observed sequence
• Total Probability: Determine probability that observed sequence was produced by HMM
  • Consider all paths that could have produced observed sequence
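The product above can be checked numerically. This is a small sketch with hypothetical names; like the slide, it omits a start probability unless one is passed in explicitly.

```python
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}

def path_probability(rolls, path, start=None):
    """Multiply emission and transition terms along one candidate path."""
    p = 1.0 if start is None else start[path[0]]  # start term is optional (assumption)
    for i, (roll, state) in enumerate(zip(rolls, path)):
        if i > 0:
            p *= TRANS[path[i - 1]][state]  # transition into the current state
        p *= EMIT[state][roll]              # emission of the observed roll
    return p

p_ffl = path_probability("326", "FFL")  # 1/6 * 0.99 * 1/6 * 0.01 * 1/2
p_fff = path_probability("326", "FFF")
print(p_ffl, p_fff)
```

Comparing the two values shows why a single 6 is weak evidence for a switch: the all-fair explanation is more probable than FFL for these three rolls.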
HMM Notation
• $x$ = sequence of symbols emitted by model
• $x_i$ = symbol emitted at time $i$
• $\pi$ = path, a sequence of states
• $i$-th state in $\pi$ is $\pi_i$
• $a_{kr}$ = transition probability, for making a transition from state $k$ to state $r$
• $e_k(b)$ = emission probability, that symbol $b$ is emitted when in state $k$
Calculating Different Paths to an Observed Sequence
[Figure: alternative state paths for an observed sequence, with transition probabilities and emission probabilities labeled]
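Identifying the Most Probable Path?
The most likely path $\pi^*$ satisfies: $\pi^* = \arg\max_{\pi} P(x, \pi)$
To find $\pi^*$, consider all possible ways the last "symbol" of $x$ could have been emitted
Let $v_k(i)$ = probability of the most probable path that ends in state $k$ after emitting $x_1 \ldots x_i$
Then $v_k(i) = e_k(x_i) \max_r \left[ v_r(i-1)\, a_{rk} \right]$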
Calculate optimal path?
Construct a matrix of probability values for every state at every residue
How: one way = Viterbi Algorithm
• Initialization (i = 0): $v_B(0) = 1$, $v_k(0) = 0$ for $k \neq B$ (B = begin state)
• Recursion (i = 1, . . . , L): For each state k: $v_k(i) = e_k(x_i) \max_r \left[ v_r(i-1)\, a_{rk} \right]$
• Termination: $P(x, \pi^*) = \max_k v_k(L)$
To find $\pi^*$, use trace-back, as in dynamic programming
Viterbi for Most Probable Path: Example
Observed rolls: x = 6, 2, 6 (B = begin state, F = fair, L = loaded)

State | i = 0 | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6)·max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6)·max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L | 0 | (1/2)(1/2) = 1/4 | (1/10)·max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2)·max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
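The example can be reproduced with a short Viterbi sketch. It assumes the rolls are 6, 2, 6 (the order implied by the emission terms in the example) and that the begin state moves to F or L with probability 1/2 each, as the first-column values suggest. Function and variable names are hypothetical.

```python
START = {"F": 0.5, "L": 0.5}  # assumption: begin state enters F or L equally often
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}

def viterbi(rolls):
    """Return (most probable state path, its probability)."""
    states = list(START)
    v = [{k: START[k] * EMIT[k][rolls[0]] for k in states}]  # first column
    back = []                                                # trace-back pointers
    for x in rolls[1:]:
        col, ptr = {}, {}
        for k in states:
            # best predecessor r maximises v_r(i-1) * a_rk
            best = max(states, key=lambda r: v[-1][r] * TRANS[r][k])
            ptr[k] = best
            col[k] = EMIT[k][x] * v[-1][best] * TRANS[best][k]
        v.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: v[-1][k])  # termination: best final state
    path = [last]
    for ptr in reversed(back):                  # trace back to the first position
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[-1][last]

best_path, best_prob = viterbi("626")
print(best_path, best_prob)
```

For long sequences the products underflow quickly, so practical implementations usually work with sums of log probabilities instead; this sketch keeps plain probabilities so the numbers can be compared with the hand calculation.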
Total Probability
Several different paths can result in observation $x$
Probability that our model will emit $x$ (the total probability) is: $P(x) = \sum_{\pi} P(x, \pi)$
Total Probability: Example
Observed rolls: x = 6, 2, 6 (B = begin state, F = fair, L = loaded)

State | i = 0 | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6)·sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6)·sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L | 0 | (1/2)(1/2) = 1/4 | (1/10)·sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2)·sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144

Total probability = 0.004313 + 0.008144 ≈ 0.012
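Replacing max with sum in the same recursion gives the total probability (the forward algorithm). The sketch below again assumes rolls 6, 2, 6 and a begin state that enters F or L with probability 1/2 each; names are hypothetical.

```python
START = {"F": 0.5, "L": 0.5}  # assumption: begin state enters F or L equally often
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}

def forward(rolls):
    """Return P(x): the sum over all state paths of the joint probability."""
    f = {k: START[k] * EMIT[k][rolls[0]] for k in START}  # first column
    for x in rolls[1:]:
        # each new cell sums over every possible predecessor state r
        f = {k: EMIT[k][x] * sum(f[r] * TRANS[r][k] for r in f) for k in f}
    return sum(f.values())

total = forward("626")
print(round(total, 6))
```

The result is larger than the Viterbi probability of any single path because it accumulates the contributions of all eight possible paths through F and L.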
Viterbi gets it right more often than not
An HMM for CpG Islands?
Emission probabilities are 0 or 1, e.g., $e_{G-}(G) = 1$, $e_{G-}(T) = 0$
See Durbin et al., Biological Sequence Analysis, Cambridge, 1998
Estimating the Probabilities or “Training” the HMM
• Viterbi training
  • Derive probable paths for training data using Viterbi algorithm
  • Re-estimate transition probabilities based on Viterbi path
  • Iterate until paths stop changing
• Other algorithms can be used
  • e.g., Baum-Welch training, which uses the "forward" (and "backward") algorithm
  • (see text - or see Wikipedia re: HMMs)
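The re-estimation step of Viterbi training can be sketched as simple counting along a decoded path. The slide mentions transition probabilities only; re-estimating emissions the same way, the pseudocount option, and all names here are assumptions added for illustration. A full training loop would alternate this step with Viterbi decoding until the paths stop changing.

```python
from collections import Counter

def reestimate(rolls, path, states="FL", faces="123456", pseudocount=1.0):
    """Re-estimate transition and emission probabilities from one labeled path."""
    t_counts = Counter(zip(path, path[1:]))  # (from state, to state) pairs
    e_counts = Counter(zip(path, rolls))     # (state, emitted face) pairs
    trans, emit = {}, {}
    for k in states:
        t_total = sum(t_counts[(k, r)] + pseudocount for r in states)
        e_total = sum(e_counts[(k, b)] + pseudocount for b in faces)
        trans[k] = {r: (t_counts[(k, r)] + pseudocount) / t_total for r in states}
        emit[k] = {b: (e_counts[(k, b)] + pseudocount) / e_total for b in faces}
    return trans, emit

# Toy decoded path: one fair roll followed by three loaded rolls
trans, emit = reestimate("1666", "FLLL", pseudocount=0.0)
print(trans["L"], emit["L"]["6"])
```

With so little data the estimates are extreme (every observed probability is 0 or 1), which is the situation pseudocounts are meant to soften.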
Profile HMMs
• Used to model a family of related sequences (or motif or domain)
• Derived from an MSA of family members
• Transition & emission probabilities are position-specific
• Set parameters of model so that total probability peaks at members of family
• Sequences can be tested for family membership using Viterbi algorithm to evaluate match against profile
An HMM can represent a MSA
Pfam: Protein Families
http://pfam.sanger.ac.uk/
• “A comprehensive collection of protein domains and families, with a range of well-established uses including genome annotation.”
• Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman (2006) Nucleic Acids Res Database Issue 34:D247-D251
• Each family is represented by:
  • 2 MSAs
  • 2 Hidden Markov Models (profile-HMMs)
• cf. Superfamily - from Lab 5
  • similar collection of curated MSAs & HMMs, focuses on superfamily level
Chp 7 - Protein Motifs & Domain Prediction
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 7 Protein Motifs and Domain Prediction
• Identification of Motifs & Domains in MSAs
• Motif & Domain Databases Using Regular Expressions
• Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos
Motifs & Domains
• Motif - short conserved sequence pattern
  • Associated with distinct function in protein or DNA
  • Avg = 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
  • Avg = 100 residues (range from 40-700 in proteins)
  • e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs
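Short motifs such as the zinc finger are often written as regular-expression-like patterns. The sketch below converts a pattern in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The pattern is written in the style of a C2H2 zinc finger signature and should be treated as an illustrative assumption, as should the function name and the made-up test sequence.

```python
import re

def prosite_to_regex(pattern):
    """Translate PROSITE-style syntax into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}   -> [^P] (forbidden residues)
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4} (repeat counts)
        element = element.replace("x", ".")                     # x     -> any residue
        element = element.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
        parts.append(element)
    return "".join(parts)

# Illustrative C2H2-zinc-finger-style pattern (assumption, not from the lecture)
zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")

# Made-up sequence containing one occurrence of the pattern
sequence = "MK" + "CAAC" + "AAA" + "F" + "A" * 8 + "HAAAH" + "GG"
match = re.search(zinc_finger, sequence)
print(zinc_finger, match.start() if match else None)
```

A regular expression either matches or it does not, which is the main limitation compared with PSSMs, profiles, and HMMs: those assign a graded score and can tolerate residues that were never seen in the training alignment.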