BCB 444/544 Lecture 18 More details: HMMs Protein Motifs & Domain Prediction Maybe: Protein Structure - The Basics #18_Oct03 BCB 444/544 F07 ISU Dobbs #18- Protein Motifs & Domains
Required Reading (before lecture) √ Mon Oct 1 - Lecture 17 Protein Motifs & Domain Prediction • Chp 7 - pp 85-96 Wed Oct 3 - Lecture 18 Protein Structure: The Basics (Note change in lecture schedule!) • Chp 12 - pp 173-186 Thurs Oct 4 - Lab 6 Protein Structure: Databases & Visualization Fri Oct 5 - Lecture 19 Protein Structure: Classification & Comparison • Chp 13 - pp 187-199
Assignments & Announcements • HW544Extra #1 - √Due: Task 1.1 - Mon Oct 1 (today) by noon Task 1.2 & Task 2 - Mon Oct 8 by 5 PM • HomeWork #3 - posted online Due: Mon Oct 8 by 5 PM
BCB 544 - Extra Required Reading Mon Sept 24 BCB 544 Extra Required Reading Assignment: • Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172. • http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113 • PDF available on class website - under Required Reading Link
A few Online Resources for: Cell & Molecular Biology • NCBI Science Primer: What is a cell? • http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html • NCBI Science Primer: What is a genome? • http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html • BioTech’s Life Science Dictionary • http://biotech.icmb.utexas.edu/search/dict-search.html • NCBI Bookshelf • http://www.ncbi.nlm.nih.gov/sites/entrez?db=books
Statistics References Statistical Inference (Hardcover) George Casella, Roger L. Berger StatWeb: A Guide to Basic Statistics for Biologists http://www.dur.ac.uk/stat.web/ Basic Statistics: http://www.statsoft.com/textbook/stbasic.html (correlations, tests, frequencies, etc.) Electronic Statistics Textbook: StatSoft http://www.statsoft.com/textbook/stathome.html (from basic statistics to ANOVA to discriminant analysis, clustering, regression, data mining, machine learning, etc.)
Extra Credit Questions #2-#6: • What is the size of the dystrophin gene (in kb)? Is it still the largest known human protein? • What is the largest protein encoded in the human genome (i.e., longest single polypeptide chain)? • What is the largest protein complex for which a structure is known (for any organism)? • What is the most abundant protein (naturally occurring) on earth? • Which state in the US has the largest number of mobile genetic elements (transposons) in its living population? • For 1 pt total (0.2 pt each): Answer all questions correctly • & submit by to terrible@iastate.edu • For 2 pts total: Prepare a PPT slide with all correct answers • & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1 • Choose one option - you can't earn 3 pts! • Partial credit for incorrect answers? Only if they are truly amusing!
Extra Credit Questions #7 & #8: Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH=7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute): 7a. How many rounds of meiosis will occur during our 50 minute class period? 7b. How many total sperm will be produced by our BCB 444/544 class during that class period? 8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH=5) • For 0.6 pts total (0.2 pt each): Answer all questions correctly • & submit by to terrible@iastate.edu • For 1 pt total: Prepare a PPT slide with all correct answers • & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1 • Choose one option - you can't earn more than 1 pt for this! • Partial credit for incorrect answers? Only if they are truly amusing!
Answers?
Chp 6 - Profiles & Hidden Markov Models SECTION II SEQUENCE ALIGNMENT Xiong: Chp 6 Profiles & HMMs • Position Specific Scoring Matrices (PSSMs) • PSI-BLAST • Profiles • Markov Models & Hidden Markov Models
Statistical Models for Representing Biological Sequences 3 types of probabilistic models, all of which: • Are based on an MSA • Capture both observed frequencies & predicted frequencies of unobserved characters In order of increasing "sensitivity": • PSSM - scoring table derived from an ungapped MSA; stores frequencies (as log-odds scores) for each amino acid at each position of a protein sequence • Profile - a PSSM with gaps: based on a gapped MSA, with penalties for insertions & deletions • HMM - hidden Markov model - more complex mathematical model (than PSSMs or Profiles) because it also differentiates between insertions and deletions
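A minimal sketch of how a PSSM could be derived from an ungapped MSA, assuming a uniform background distribution and a simple add-one pseudocount (both are illustrative choices, not taken from the lecture). The function names build_pssm and score and the toy DNA alignment are hypothetical.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumptions (not from the lecture): uniform background frequencies
    unless given, and an add-`pseudocount` correction so that residues
    never observed in a column still get a finite score.
    """
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        scores = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount) / (n_seqs + pseudocount * len(alphabet))
            scores[a] = math.log2(freq / background[a])
        pssm.append(scores)
    return pssm

def score(pssm, seq):
    """Sum the position-specific log-odds scores for an ungapped sequence."""
    return sum(column[res] for column, res in zip(pssm, seq))

# Toy ungapped alignment (hypothetical data).
msa = ["ACGT", "ACGA", "ACCT", "TCGT"]
pssm = build_pssm(msa, "ACGT")
consensus_score = score(pssm, "ACGT")
unrelated_score = score(pssm, "TGAC")
```

A sequence resembling the alignment scores higher than an unrelated one; a profile or HMM would extend this idea with position-specific handling of gaps.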
HMMs for Biological Sequences? • HMMs originally developed for speech recognition • Now widely used in bioinformatics • Many applications (motif/domain detection, sequence alignment, phylogenetics) • HMMs are "machine learning" algorithms - must be "trained" to obtain optimal statistical parameters • For biological sequences: • each character of a sequence is considered a state in a Markov process
But, What is a Markov Model? Markov Model (or Markov chain) = mathematical model used to describe a sequence of events that occur one after another in a chain = a process that moves in one direction from one state to the next with a certain transition probability For biological sequences: • each letter = state • linked together by transition probabilities
Different Types of Markov Models Zero-order Markov Model: probability of current state is independent of previous state(s) e.g., random sequence, each residue with equal frequency First-order MM: probability of current state is determined by the previous state e.g., frequencies of two linked residues (dimers) occurring simultaneously Second-order MM: probability of current state is determined by the previous two states e.g., frequencies of three linked residues (trimers) occurring simultaneously, as in a codon Higher orders? Also possible, later…
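To make the first-order case concrete, here is a small sketch that estimates transition probabilities from dimer counts in a training sequence and then scores a new sequence. The training string, the uniform initial distribution, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def train_first_order(seq):
    """Estimate P(current | previous) from dimer counts in `seq`."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(seq, seq[1:]):
        counts[prev][cur] += 1
    trans = {}
    for prev, row in counts.items():
        total = sum(row.values())
        trans[prev] = {cur: n / total for cur, n in row.items()}
    return trans

def sequence_probability(seq, trans, initial):
    """P(seq) = P(first letter) * product of transition probabilities."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans.get(prev, {}).get(cur, 0.0)  # unseen dimer -> probability 0
    return p

# Hypothetical training sequence; initial letter drawn uniformly (zero-order).
trans = train_first_order("ACGCGCGTACGCG")
initial = {base: 0.25 for base in "ACGT"}
p_cgcg = sequence_probability("CGCG", trans, initial)  # 0.25 * 1.0 * 0.75 * 1.0
```

A second-order model would condition on the previous two letters instead, so the counts would be taken over trimers.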
So, What is a hidden Markov Model? Hidden Markov Model (HMM) • a more sophisticated model in which some of the states are hidden • some "unobserved" factors influence the state transition probabilities • a MM that combines 2 or more Markov chains: • only 1 chain is made up of observed states • other chains are made up of unobserved or "hidden" states
Hidden Markov Models - HMMs Goal: Find most likely explanation for observed variables Components: • States - composed of a number of elements or "symbols" (e.g., A,C,G,T) • Observed variables - sequence (or outcome) we can "see" • Hidden variables - insertions/deletions/transition probabilities that can't be "seen" • Emission probability - probability value associated with each "symbol" in each state • Transition probability - probability of going from one state to another • Special graphical representation used to illustrate relationships
An HMM for CpG Islands? Emission probabilities are 0 or 1, e.g., e_{G-}(G) = 1, e_{G-}(T) = 0 See Durbin et al., Biological Sequence Analysis, Cambridge, 1998
HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction This is a new slide
An HMM for Occasionally Dishonest Casino Transition probabilities • Prob(Fair → Loaded) = 0.01 • Prob(Loaded → Fair) = 0.2
Calculating Different Paths to an Observed Sequence (This slide has been changed.) [Figure labels: transition probability, emission probability] Calculations such as those shown in the figure are used to fill a matrix with probability values for every state at every position
Calculating the Most Probable Path*, using Viterbi algorithm (using traceback as in DP) (This slide has been changed.)
Observed rolls x = 6, 2, 6; the first column is the start, followed by one column per roll (B = begin, F = fair, L = loaded)
B: 1 | 0 | 0 | 0
F: 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L: 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
* Path within HMM that matches query sequence with highest probability
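The Viterbi recursion for the casino example can be sketched in a few lines. The transition probabilities and the 1/2 start probabilities come from the slides; the loaded-die emissions (1/2 for a six, 1/10 otherwise) are an assumption consistent with the factors used in the matrix, and the function and variable names are hypothetical.

```python
def viterbi(obs, states, start, trans, emit):
    """Return (probability of best path, best path, full Viterbi matrix)."""
    # v[t][s] = probability of the best path that ends in state s at position t
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for x in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[-1][r] * trans[r][s])
            col[s] = emit[s][x] * v[-1][prev] * trans[prev][s]
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Traceback from the most probable final state, as in dynamic programming.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[-1][last], path, v

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
emit = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    # Assumed loaded die: a six half of the time, other faces 1/10 each.
    "L": {roll: (0.5 if roll == 6 else 0.1) for roll in range(1, 7)},
}

best_prob, best_path, matrix = viterbi([6, 2, 6], states, start, trans, emit)
# best_prob is 0.008 (up to rounding) and best_path is ['L', 'L', 'L']
```

For long sequences the products underflow, so practical implementations usually work with sums of log probabilities instead.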
Calculating the Total Probability: (This slide has been changed.)
Observed rolls x = 6, 2, 6; the first column is the start, followed by one column per roll (B = begin, F = fair, L = loaded)
B: 1 | 0 | 0 | 0
F: 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L: 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
Total probability = 0 + 0.004313 + 0.008144 ≈ 0.012
Note: This is not the same as the matrix on the previous slide! Here, each entry is a sum (not a max) over the previous column, and the total probability is the sum of the last column
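The same casino model can be used to sketch the total-probability ("forward") calculation, which replaces the max in Viterbi with a sum. As before, the loaded-die emissions (1/2 for a six, 1/10 otherwise) are an assumption consistent with the factors in the matrix, and the names are hypothetical.

```python
def forward(obs, states, start, trans, emit):
    """Return (total probability of obs, final forward column)."""
    # f[s] = probability of emitting the rolls so far and ending in state s
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {
            s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(f.values()), f

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
emit = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    # Assumed loaded die: a six half of the time, other faces 1/10 each.
    "L": {roll: (0.5 if roll == 6 else 0.1) for roll in range(1, 7)},
}

total, last_column = forward([6, 2, 6], states, start, trans, emit)
# total is about 0.012457: last_column["F"] ~ 0.004313, last_column["L"] ~ 0.008144
```

Because it sums over all paths, the total is always at least as large as the probability of the single best path found by Viterbi.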
Estimating the Probabilities or “Training” the HMM This slide has been changed • Calculate frequencies in each column of MSA built from set of related sequences • Use frequency values to fill the emission and transition probabilities in the model (use two matrices for this) • Viterbi training • Derive probable paths for training data using Viterbi algorithm • Re-estimate transition probabilities based on Viterbi path • Iterate until paths stop changing • Other algorithms can be used • e.g., "forward" & "backward" algorithms • (see text - or see Wikipedia re: HMMs)
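The re-estimation step of Viterbi training amounts to counting along a state path and normalizing. The sketch below shows only that counting step for a single labeled path; the rolls, the path, and the function name estimate_parameters are hypothetical, and a full training loop would alternate this step with a Viterbi decoding until the paths stop changing.

```python
from collections import Counter

def estimate_parameters(observations, path):
    """Estimate transition and emission probabilities from one state path.

    Returns two dicts keyed by (state, next_state) and (state, symbol).
    """
    emit_counts = Counter(zip(path, observations))
    trans_counts = Counter(zip(path, path[1:]))

    # Row totals: how often each state emits, and how often it is left.
    emit_totals, trans_totals = Counter(), Counter()
    for (state, _), n in emit_counts.items():
        emit_totals[state] += n
    for (state, _), n in trans_counts.items():
        trans_totals[state] += n

    trans = {key: n / trans_totals[key[0]] for key, n in trans_counts.items()}
    emit = {key: n / emit_totals[key[0]] for key, n in emit_counts.items()}
    return trans, emit

# Hypothetical rolls with a decoded path (F = fair, L = loaded).
rolls = [1, 6, 6, 6, 2, 3]
path = ["F", "L", "L", "L", "F", "F"]
est_trans, est_emit = estimate_parameters(rolls, path)
# est_trans[("L", "L")] is 2/3 and est_emit[("L", 6)] is 1.0
```

With so little data the estimates are extreme (the loaded state never emits anything but a six), which is exactly the over-fitting problem that pseudocounts address.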
Profile HMMs • Used to model a family of related sequences (or motif or domain) • Derived from a MSA of family members • Transition & emission probabilities are position-specific • Set parameters of model so that total probability peaks at members of family • Sequences can be tested for family membership using Viterbi algorithm to evaluate match against profile
Profile HMM represents a gapped MSA (This slide has been changed.) Character in alignment can be in one of 3 states: • Match - observed • Insert - hidden • Delete - hidden [Figure labels: Hidden chains, Observed chain]
Example: Pfam: Protein Families http://pfam.sanger.ac.uk/ • “A comprehensive collection of protein domains and families, with a range of well-established uses including genome annotation.” • Pfam: clans, web tools and services: R.D. Finn, …A. Bateman (2006) Nucleic Acids Res Database Issue 34:D247-D5 • Each family is represented by: • 2 MSAs • 2 Hidden Markov Models (profile-HMMs) • cf. Superfamily - from Lab 5 • similar collection of curated MSAs & HMMs, focuses on superfamily level
A few more Details re: Profiles & HMMs • Smoothing or "Regularization" - method used to avoid "over-fitting" • Common problem in machine learning (data-driven) approaches • Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters • Result? Miss members of family not yet sampled (too many false negative hits) • Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set • Treated as 'real' values in calculating probabilities • Improve predictive power of profiles & HMMs • Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment • To "correct" problems in an observed alignment based on limited number of sequences
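A short sketch of the effect of pseudocounts on a single alignment column, assuming the simplest scheme (the same constant added for every amino acid). The column, the constant 1.0, and the function name are illustrative assumptions; Dirichlet mixtures replace the constant with position-dependent priors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, alphabet=AMINO_ACIDS, pseudocount=0.0):
    """Residue probabilities for one MSA column, with optional pseudocounts."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Hypothetical column from an alignment of only five sequences.
column = "LLLIV"
raw = column_probabilities(column)
smoothed = column_probabilities(column, pseudocount=1.0)

# Without smoothing, methionine was never seen, so its probability is 0 and a
# family member carrying M at this position would be rejected outright.
# With one pseudocount per residue it gets (0 + 1) / (5 + 20) = 0.04.
```

The observed residues still dominate after smoothing (L drops from 0.6 to 0.16 but remains the most probable), and the correction shrinks as more sequences are added to the alignment.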
Applications (of PSSMs, Profiles, HMMs) • HMMer - for building & using HMMs • developed by Sean Eddy's group • Not a web-based server; must download the software • 9 related programs • but check out the site - it's fun! • Psi-BLAST - you've heard enough about this! • Uses Profiles (not actually PSSMs) - iteratively • In previous lab: used SuperFam (HMMs) • http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ • Prosite - includes patterns (regular expressions) & profiles for motifs & domains • http://ca.expasy.org/prosite • Pfam (MSAs & HMMs) • http://pfam.sanger.ac.uk/ (new URL) • Many others
Chp 7 - Protein Motifs & Domain Prediction SECTION II SEQUENCE ALIGNMENT Xiong: Chp 7 Protein Motifs and Domain Prediction • Identification of Motifs & Domains in MSAs • Motif & Domain Databases Using Regular Expressions • Motif & Domain Databases Using Statistical Models • Protein Family Databases • Motif Discovery in Unaligned Sequences • √Sequence Logos
Motifs & Domains • Motif - short conserved sequence pattern • Associated with distinct function in protein or DNA • Avg = 10 residues (usually 6-20 residues) • e.g., zinc finger motif - in protein • e.g., TATA box - in DNA • Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit • Avg = 100 residues (range from 40-700 in proteins) • e.g., kinase domain or transmembrane domain - in protein • Domains may (or may not) include motifs
2 Approaches for Representing "Consensus" Information in Motifs & Domains • Regular expression - reduce information from MSA • e.g., protein phosphorylation site motif: [S,T]-X-[R,K] • Symbols represent specific or unspecified residues, spaces, etc. • 2 mechanisms for matching: • Exact • "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches" • Statistical model - includes probability information derived from MSA • e.g., PSSM, Profile or HMM
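As an illustration of exact matching with a regular expression, the phosphorylation site motif [S,T]-X-[R,K] can be translated into Python's re syntax as `[ST].[RK]`. The test sequence and the function name find_motif are hypothetical; the lookahead is used only so that overlapping hits are reported.

```python
import re

# [S,T]-X-[R,K]: S or T, then any residue, then R or K.
# Wrapping the pattern in a lookahead lets finditer report overlapping sites.
MOTIF = re.compile(r"(?=([ST].[RK]))")

def find_motif(seq):
    """Return (1-based start position, matched residues) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(seq)]

hits = find_motif("MASTRKLSPKAA")
# hits == [(3, 'STR'), (4, 'TRK'), (8, 'SPK')]
```

Every hit is weighted equally here, which is the limitation of regular expressions noted in the lecture: no probability information from the MSA is used, unlike a PSSM, profile, or HMM.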
Motif & Domain Databases Based on regular expressions: • Prosite (Interpro) • Emotif Limitation: these don't take probability info into account Based on statistical models: • PRINTS • BLOCKS • ProDom • Pfam • SMART • CDART • Reverse PSI-BLAST • READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each • TAKE HOME LESSON: Always try several methods! (not just one!)
Chp 12 - Protein Structure Basics SECTION V STRUCTURAL BIOINFORMATICS Xiong: Chp 12 Protein Structure Basics • Introduction to the Protein DataBank - PDB • NEXT lecture!