
BCB 444/544



Presentation Transcript


  1. BCB 444/544 Lecture 17 Finish HMMs Protein Motifs & Domain Prediction #17_Oct01 BCB 444/544 F07 ISU Dobbs #17- Protein Motifs & Domain Prediction

  2. Required Reading (before lecture) Mon Oct 1 - Lecture 17 Protein Motifs & Domain Prediction • Chp 7 - pp 85-96 Wed Oct 3 - Lecture 18 Protein Structure: The Basics (Note change in lecture schedule!) • Chp 12 - pp 173-186 Thurs Oct 4 - Lab 6 Protein Structure: Databases & Visualization Fri Oct 5 - Lecture 19 Protein Structure: Classification & Comparison • Chp 13 - pp 187-199

  3. Assignments & Announcements • HW544Extra #1 - Due: Task 1.1 - Mon Oct 1 (today) by noon Task 1.2 & Task 2 - Mon Oct 8 by 5 PM • HomeWork #3 - posted online Due: Mon Oct 8 by 5 PM

  4. BCB 544 - Extra Required Reading Mon Sept 24 BCB 544 Extra Required Reading Assignment: • Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172. • http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113 • PDF available on class website - under Required Reading Link

  5. A few Online Resources for: Cell & Molecular Biology • NCBI Science Primer: What is a cell? • http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html • NCBI Science Primer: What is a genome? • http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html • BioTech’s Life Science Dictionary • http://biotech.icmb.utexas.edu/search/dict-search.html • NCBI Bookshelf • http://www.ncbi.nlm.nih.gov/sites/entrez?db=books

  6. Statistics References Statistical Inference (Hardcover) George Casella, Roger L. Berger StatWeb: A Guide to Basic Statistics for Biologists http://www.dur.ac.uk/stat.web/ Basic Statistics: http://www.statsoft.com/textbook/stbasic.html (correlations, tests, frequencies, etc.) Electronic Statistics Textbook: StatSoft http://www.statsoft.com/textbook/stathome.html (from basic statistics to ANOVA to discriminant analysis, clustering, regression, data mining, machine learning, etc.)

  7. Extra Credit Questions #2-#6: • What is the size of the dystrophin gene (in kb)? Is it still the largest known human protein? • What is the largest protein encoded in the human genome (i.e., longest single polypeptide chain)? • What is the largest protein complex for which a structure is known (for any organism)? • What is the most abundant protein (naturally occurring) on earth? • Which state in the US has the largest number of mobile genetic elements (transposons) in its living population? • For 1 pt total (0.2 pt each): Answer all questions correctly • & submit to terrible@iastate.edu • For 2 pts total: Prepare a PPT slide with all correct answers • & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1 • Choose one option - you can't earn 3 pts! • Partial credit for incorrect answers? Only if they are truly amusing!

  8. Extra Credit Questions #7 & #8: Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH=7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute): 7a. How many rounds of meiosis will occur during our 50 minute class period? 7b. How many total sperm will be produced by our BCB 444/544 class during that class period? 8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH=5) • For 0.6 pts total (0.2 pt each): Answer all questions correctly • & submit to terrible@iastate.edu • For 1 pt total: Prepare a PPT slide with all correct answers • & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1 • Choose one option - you can't earn more than 1 pt for this! • Partial credit for incorrect answers? Only if they are truly amusing!

  9. Answers?

  10. Chp 6 - Profiles & Hidden Markov Models SECTION II SEQUENCE ALIGNMENT Xiong: Chp 6 Profiles & HMMs • √Position Specific Scoring Matrices (PSSMs) • √PSI-BLAST TODAY: • Profiles • Markov Models & Hidden Markov Models

  11. Statistical Models for Representing Biological Sequences 3 types of probabilistic models, all of which: • Are based on an MSA • Capture both observed frequencies & predicted frequencies of unobserved characters In order of "sensitivity": • PSSM - scoring table derived from an ungapped MSA; stores frequencies (log odds scores) for each amino acid in each position of a protein sequence • Profile - a PSSM with gaps: based on a gapped MSA, with penalties for insertions & deletions • HMM - hidden Markov Model - more complex mathematical model (than PSSMs or Profiles) because it also differentiates between insertions and deletions
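
A minimal sketch of the PSSM idea above, assuming a toy ungapped DNA alignment, a uniform background frequency, and a pseudocount of 1 per symbol (the pseudocount is what gives unobserved characters a non-zero predicted frequency). The names `build_pssm`, `score`, and `toy_msa` are illustrative, not from the lecture.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):  # iterate over columns of the ungapped MSA
        counts = Counter(column)
        scores = {}
        for symbol in alphabet:
            # Pseudocounts keep unobserved symbols from getting probability 0.
            freq = (counts[symbol] + pseudocount) / (n_seqs + pseudocount * len(alphabet))
            scores[symbol] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    """Sum the position-specific log-odds scores for a sequence of matching length."""
    return sum(column[symbol] for column, symbol in zip(pssm, sequence))

toy_msa = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]  # hypothetical aligned sites
pssm = build_pssm(toy_msa)
print(round(score(pssm, "TATAAT"), 2), round(score(pssm, "GCGCGC"), 2))
```

A sequence resembling the alignment scores well above zero, while an unrelated one scores below zero; real protein PSSMs use 20 symbols and non-uniform background frequencies.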

  12. Sequence Motifs (Patterns) Types of representations: • √ Consensus Sequences • √ Sequence Logos • √ PSSMs - Position-Specific Scoring Matrices • √ Profiles • HMMs - Hidden Markov Models

  13. HMM example: CpG Islands [figure: nucleotide frequencies in the human genome]

  14. CpG Islands (written CpG to distinguish from a C≡G base pair) • CpG dinucleotides are rarer than would be expected from independent probabilities of C and G (given the background frequencies in the human genome) • High CpG frequency is sometimes biologically significant; e.g., sometimes associated with promoter regions (“start sites” for genes) • CpG island - a region where CpG dinucleotides are much more abundant than elsewhere How can we represent or model CpG islands?
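
One way to make "rarer than expected from independent probabilities" concrete is an observed/expected ratio: the number of CpG dinucleotides divided by the number expected if C and G occurred independently, (#C × #G) / N. The sketch below uses that common convention; the function name `cpg_obs_exp` and the test strings are illustrative assumptions.

```python
def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected under independence."""
    seq = seq.upper()
    n = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        return 0.0
    expected = n_c * n_g / n  # expected CpG count if C and G were independent
    return n_cpg / expected

print(cpg_obs_exp("CGCGCGCG"))  # CpG-rich toy sequence
print(cpg_obs_exp("CCCCGGGG"))  # same composition, only one CpG
```

A ratio well below 1 corresponds to the CpG depletion described above; a window with a ratio near or above 1 is a candidate CpG island.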

  15. Hidden Markov Models - HMMs Goal: Find most likely explanation for observed variables Components: • Observed variables • Hidden variables • Emitted symbols • Emission probabilities • Transition probabilities • Graphical representation to illustrate relationships among these

  16. Different Types of Markov Models Zero-order Markov Model: probability of current state is independent of previous state(s) e.g., random sequence, each residue with equal frequency First-order MM: probability of current state is determined by the previous state e.g., frequencies of two linked residues (dimers) occurring simultaneously Second-order MM: describes situation in which probability of current state is determined by the previous two states e.g., frequencies of three linked residues (trimers) occurring simultaneously, as in a codon Higher orders? Also possible, later…
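
The difference between a zero-order and a first-order model can be sketched by estimating each from the same sequence: the zero-order model keeps one frequency per residue, while the first-order model keeps one distribution per preceding residue. The function names and the toy sequence are illustrative assumptions.

```python
from collections import Counter, defaultdict

def zero_order(seq):
    """P(current residue), independent of what came before."""
    counts = Counter(seq)
    return {residue: n / len(seq) for residue, n in counts.items()}

def first_order(seq):
    """P(current residue | previous residue), estimated from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        pair_counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(counts.values()) for cur, n in counts.items()}
        for prev, counts in pair_counts.items()
    }

toy = "ACGCGT"  # hypothetical sequence
print(zero_order(toy))
print(first_order(toy))
```

A second-order model would condition on the previous two residues in the same way, using triples instead of pairs.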

  17. But, What is a Markov Model? Markov Model (or Markov chain) = mathematical model used to describe a sequence of events that occur one after another in a chain = a process that moves in one direction from one state to the next with a certain transition probability For biological sequences: • each letter = state • linked together by transition probabilities

  18. So, What is a hidden Markov Model? Hidden Markov Model (HMM) • a more sophisticated model in which some of the states are hidden • some "unobserved" factors influence the state transition probabilities • MM which combines 2 or more Markov chains: • only 1 chain is made up of observed states • other chains are made up of unobserved or "hidden" states

  19. HMMs for Biological Sequences? • HMMs originally developed for speech recognition • Now widely used in bioinformatics • Many applications (motif/domain detection, sequence alignment, phylogenetics) • HMMs are "machine learning" algorithms - must be "trained" to obtain optimal statistical parameters • For biological sequences: • each character of a sequence is considered a state in a Markov process

  20. Hidden Markov Models - HMMs Goal: Find most likely explanation for observed variables Components: • States - composed of a number of elements or "symbols" (e.g., A,C,G,T) • Observed variables - sequence (or outcome) we can "see" • Hidden variables - insertions/deletions/transition probabilities that can't be "seen" • Emission probability - probability value associated with each "symbol" in each state • Transition probability - probability of going from one state to another • Special graphical representation used to illustrate relationships

  21. HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction

  22. The Occasionally Dishonest Casino A casino uses a fair die most of the time, but occasionally switches to a "loaded" one • Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6 • Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = 1/2 • These are emission probabilities Transition probabilities • Prob(Fair → Loaded) = 0.01 • Prob(Loaded → Fair) = 0.2 • Transitions between states obey a Markov process: a linear chain of events linked by probability values such that the occurrence of one event (state) depends on the occurrence of previous event(s) or state(s)

  23. An HMM for Occasionally Dishonest Casino Transition probabilities • Prob(Fair → Loaded) = 0.01 • Prob(Loaded → Fair) = 0.2

  24. The Occasionally Dishonest Casino • Known: • Structure of the model • Transition probabilities • Hidden: What casino actually did • FFFFFLLLLLLLFFFF... • Observable: Series of die tosses • 3415256664666153... • What we must infer: • When was a fair die used? • When was a loaded one used? • Answer is a sequence: FFFFFFFLLLLLLFFF...

  25. HMM: Making the Inference • Model assigns a probability to each explanation for the observation, e.g.: P(326|FFL) = P(3|F) · P(F→F) · P(2|F) · P(F→L) · P(6|L) = 1/6 · 0.99 · 1/6 · 0.01 · 1/2 • Maximum Likelihood: Determine which explanation is most likely • Find path most likely to have produced observed sequence • Total Probability: Determine probability that observed sequence was produced by HMM • Consider all paths that could have produced observed sequence
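
The product on this slide can be checked directly. The sketch below encodes the casino's emission and transition probabilities from slide 22 and multiplies them along a given path; like the slide's product, it omits any start probability. The names `EMIT`, `TRANS`, and `path_probability` are illustrative.

```python
from itertools import product

# Emission and transition probabilities of the occasionally dishonest casino.
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 1 / 10 for face in "12345"}, "6": 1 / 2},
}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}

def path_probability(rolls, path):
    """Multiply emissions and transitions along one path (no start probability)."""
    prob = EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return prob

print(path_probability("326", "FFL"))  # the explanation worked on the slide

# Maximum likelihood by brute force: score every possible explanation.
all_paths = ["".join(p) for p in product("FL", repeat=3)]
best = max(all_paths, key=lambda p: path_probability("326", p))
print(best)
```

Enumerating all paths only works for very short sequences, since the number of paths doubles with every roll; that is the motivation for the dynamic programming approach on the following slides.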

  26. HMM Notation • $x$ = sequence of symbols emitted by model • $x_i$ = symbol emitted at time $i$ • $\pi$ = path, a sequence of states • $i$-th state in $\pi$ is $\pi_i$ • $a_{kr}$ = transition probability, for making a transition from state $k$ to state $r$ • $e_k(b)$ = emission probability, that symbol $b$ is emitted when in state $k$

  27. Calculating Different Paths to an Observed Sequence [figure: probability of a path, with its transition probability and emission probability terms labeled]
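
In the notation of slide 26, the probability of a sequence $x$ of length $L$ together with one path $\pi$ is conventionally written as a product of alternating transition and emission terms. The form below is a sketch of the standard textbook expression (as in Durbin et al.); the begin state 0, with $a_{0k}$ the probability of starting in state $k$, and the absence of an explicit end state are assumptions rather than something the slide states.

```latex
P(x, \pi) \;=\; a_{0\pi_1}\, e_{\pi_1}(x_1) \prod_{i=2}^{L} a_{\pi_{i-1}\pi_i}\, e_{\pi_i}(x_i)
```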

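  28. Identifying the Most Probable Path? The most likely path $\pi^*$ satisfies: $\pi^* = \arg\max_{\pi} P(x, \pi)$ To find $\pi^*$, consider all possible ways the last "symbol" of $x$ could have been emitted Let $v_k(i)$ = probability of the most probable path $\pi_1, \ldots, \pi_i$ that emits $x_1, \ldots, x_i$ and ends in state $\pi_i = k$ Then $v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$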

  29. Calculate optimal path? Construct a matrix of probability values for every state at every residue How: one way = Viterbi Algorithm • Initialization (i = 0): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$ • Recursion (i = 1, . . . , L): For each state k: $v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$ • Termination: $P(x, \pi^*) = \max_k v_k(L)$ To find $\pi^*$, use trace-back, as in dynamic programming

  30. Viterbi for Most Probable Path: Example. Observed rolls x = 6, 2, 6. Entries are $v_k(i)$ for states B (begin), F (fair), L (loaded):
      i = 0 (before any roll): B = 1 • F = 0 • L = 0
      i = 1 (x = 6): B = 0 • F = (1/6)(1/2) = 1/12 • L = (1/2)(1/2) = 1/4
      i = 2 (x = 2): B = 0 • F = (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 • L = (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02
      i = 3 (x = 6): B = 0 • F = (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875 • L = (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
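
A minimal sketch of the Viterbi recursion for the casino model, run on the rolls 6, 2, 6 from this example. It assumes the begin state moves to Fair or Loaded with probability 1/2 each, which is what the first column of the example implies; the names `START`, `viterbi`, and `matrix` are illustrative.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumption: equal start probabilities from the begin state
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 1 / 10 for face in "12345"}, "6": 1 / 2},
}

def viterbi(rolls):
    """Return (most probable path, its probability, list of columns v_k(i))."""
    v = [{k: START[k] * EMIT[k][rolls[0]] for k in STATES}]
    backpointers = []
    for roll in rolls[1:]:
        column, pointers = {}, {}
        for k in STATES:
            # Best predecessor r maximizes v_r(i-1) * a_rk.
            best_prev = max(STATES, key=lambda r: v[-1][r] * TRANS[r][k])
            column[k] = EMIT[k][roll] * v[-1][best_prev] * TRANS[best_prev][k]
            pointers[k] = best_prev
        v.append(column)
        backpointers.append(pointers)
    # Termination: pick the best final state, then trace back.
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), v[-1][last], v

best_path, best_prob, matrix = viterbi("626")
print(best_path, best_prob)  # most probable path is LLL, probability about 0.008
```

For long sequences the products underflow, so practical implementations usually work with sums of log probabilities instead.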

  31. Total Probability Several different paths can result in observation $x$ Probability that our model will emit $x$ is: $P(x) = \sum_{\pi} P(x, \pi)$ (Total Probability)

  32. Total Probability: Example. Observed rolls x = 6, 2, 6. Same matrix as for Viterbi, with max replaced by sum; states B (begin), F (fair), L (loaded):
      i = 0 (before any roll): B = 1 • F = 0 • L = 0
      i = 1 (x = 6): B = 0 • F = (1/6)(1/2) = 1/12 • L = (1/2)(1/2) = 1/4
      i = 2 (x = 2): B = 0 • F = (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 • L = (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083
      i = 3 (x = 6): B = 0 • F = (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313 • L = (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
      Total probability = 0.004313 + 0.008144 = 0.012
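
The same calculation can be sketched in code as the forward algorithm: identical to Viterbi except that each cell sums over predecessors instead of taking the maximum. The sketch assumes start probabilities of 1/2 for each die, as the first column of the example implies, and cross-checks the result by summing over all eight paths; `forward` and `brute_force_total` are illustrative names.

```python
from itertools import product

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumption: equal start probabilities from the begin state
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 1 / 10 for face in "12345"}, "6": 1 / 2},
}

def forward(rolls):
    """Total probability of the rolls, summed over all state paths."""
    f = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    for roll in rolls[1:]:
        # Each new cell sums over every possible previous state.
        f = {k: EMIT[k][roll] * sum(f[r] * TRANS[r][k] for r in STATES) for k in STATES}
    return sum(f.values())

def brute_force_total(rolls):
    """Same quantity by explicit enumeration of every path (exponential cost)."""
    total = 0.0
    for path in product(STATES, repeat=len(rolls)):
        prob = START[path[0]] * EMIT[path[0]][rolls[0]]
        for i in range(1, len(rolls)):
            prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
        total += prob
    return total

print(forward("626"), brute_force_total("626"))  # both about 0.0125
```

The forward recursion needs a number of operations proportional to the sequence length, whereas the enumeration doubles with every additional roll.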

  33. Viterbi gets it right more often than not

  34. An HMM for CpG Islands? Emission probabilities are 0 or 1, e.g., $e_{G^-}(G) = 1$, $e_{G^-}(T) = 0$ See Durbin et al., Biological Sequence Analysis, Cambridge, 1998

  35. Estimating the Probabilities or “Training” the HMM • Viterbi training • Derive probable paths for training data using Viterbi algorithm • Re-estimate transition probabilities based on Viterbi path • Iterate until paths stop changing • Other algorithms can be used • e.g., "forward" algorithm • (see text - or see Wikipedia re: HMMs)
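
The re-estimation step of Viterbi training amounts to counting along a decoded path. The sketch below shows that single step, using the example path and rolls from slide 24 as if they were a Viterbi path; estimating emissions the same way is an added assumption (the slide mentions only transitions), and a full trainer would alternate this step with Viterbi decoding until the paths stop changing.

```python
from collections import Counter, defaultdict

def normalize(counts):
    """Turn nested counts into conditional probabilities."""
    return {
        key: {item: n / sum(inner.values()) for item, n in inner.items()}
        for key, inner in counts.items()
    }

def estimate_transitions(path):
    """Relative frequency of each state-to-state move along the path."""
    counts = defaultdict(Counter)
    for prev, cur in zip(path, path[1:]):
        counts[prev][cur] += 1
    return normalize(counts)

def estimate_emissions(path, symbols):
    """Relative frequency of each symbol within each state."""
    counts = defaultdict(Counter)
    for state, symbol in zip(path, symbols):
        counts[state][symbol] += 1
    return normalize(counts)

path = "FFFFFLLLLLLLFFFF"   # treated here as a decoded path
rolls = "3415256664666153"
trans = estimate_transitions(path)
emit = estimate_emissions(path, rolls)
print(trans["F"]["L"], trans["L"]["L"], emit["L"]["6"])
```

With so little data the estimates are rough; in practice pseudocounts are usually added so that unseen transitions or symbols do not get probability zero.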

  36. Profile HMMs • Used to model a family of related sequences (or motif or domain) • Derived from an MSA of family members • Transition & emission probabilities are position-specific • Set parameters of model so that total probability peaks at members of family • Sequences can be tested for family membership using Viterbi algorithm to evaluate match against profile

  37. An HMM can represent an MSA

  38. Pfam: Protein Families http://pfam.sanger.ac.uk/ • “A comprehensive collection of protein domains and families, with a range of well-established uses including genome annotation.” • Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman (2006) Nucleic Acids Res Database Issue 34:D247-D251 • Each family is represented by: • 2 MSAs • 2 Hidden Markov Models (profile-HMMs) • cf. Superfamily - from Lab 5 • similar collection of curated MSAs & HMMs, focuses on superfamily level

  39. Chp 7 - Protein Motifs & Domain Prediction SECTION II SEQUENCE ALIGNMENT Xiong: Chp 7 Protein Motifs and Domain Prediction • Identification of Motifs & Domains in MSAs • Motif & Domain Databases Using Regular Expressions • Motif & Domain Databases Using Statistical Models • Protein Family Databases • Motif Discovery in Unaligned Sequences • √Sequence Logos

  40. Motifs & Domains • Motif - short conserved sequence pattern • Associated with distinct function in protein or DNA • Avg = 10 residues (usually 6-20 residues) • e.g., zinc finger motif - in protein • e.g., TATA box - in DNA • Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit • Avg = 100 residues (range from 40-700 in proteins) • e.g., kinase domain or transmembrane domain - in protein • Domains may (or may not) include motifs
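
Short motifs like these are often stored as regular expressions. The sketch below translates a PROSITE-style pattern into Python regex syntax and searches a toy sequence. The pattern is modeled on the C2H2 zinc finger signature and should be treated as an assumption here, as should the helper name `prosite_to_regex` and the made-up sequence; the converter handles only the few syntax elements this pattern uses.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    regex = pattern.replace("-", "")                   # drop position separators
    regex = regex.replace("x", ".")                    # x = any residue
    regex = regex.replace("(", "{").replace(")", "}")  # x(2,4) -> .{2,4}
    return regex

# Assumed pattern, modeled on the C2H2 zinc finger signature.
ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
motif = re.compile(prosite_to_regex(ZINC_FINGER))

# Hypothetical sequence built to contain exactly one match.
toy_protein = "MK" + "CAAC" + "AAA" + "F" + "A" * 8 + "H" + "AAA" + "H" + "GG"
hit = motif.search(toy_protein)
print(motif.pattern, hit.start(), hit.end())
```

A regular expression either matches or does not, which is why the statistical models from the earlier slides (PSSMs, profiles, HMMs) are more sensitive for divergent family members.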
