BCB 444/544 Lecture 16 Profiles & Hidden Markov Models (HMMs) #16_Sept28 BCB 444/544 F07 ISU Dobbs #16 - Profiles & HMMs
Required Reading (before lecture)
√ Mon & Wed Sept 24 & 26 - Lectures 14 & 15: Review: Nucleus, Chromosomes, Genes, RNAs, Proteins
• Surprise lecture: No assigned reading
√ Fri Sept 28 - Lecture 16: Profiles & Hidden Markov Models
• Chp 6 - pp 79-84
• Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315 http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Thurs Sept 27 - Lab 4 & Mon Oct 1 - Lecture 17: Protein Families, Domains, and Motifs
• Chp 7 - pp 85-96
Assignments & Announcements
Fri Sept 28
• Exam 1 - Graded & returned in class - Really!
• HW#2 - Graded & returned in class - Really!
• Answer KEYs posted on website
• Grades posted on WebCT
• HomeWork #3 - posted online. Due: Mon Oct 8 by 5 PM
• HW544Extra #1 - posted online. Due: Task 1.1 - Mon Oct 1 by noon; Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
BCB 544 - Extra Required Reading
Mon Sept 24
BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link
Extra Credit Questions #2-6:
• What is the size of the dystrophin gene (in kb)? Is it still the largest known human protein?
• What is the largest protein encoded in human genome (i.e., longest single polypeptide chain)?
• What is the largest protein complex for which a structure is known (for any organism)?
• What is the most abundant protein (naturally occurring) on earth?
• Which state in the US has the largest number of mobile genetic elements (transposons) in its living population?
Scoring:
• For 1 pt total (0.2 pt each): Answer all questions correctly & submit by to terrible@iastate.edu
• For 2 pts total: Prepare a PPT slide with all correct answers & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn 3 pts!
• Partial credit for incorrect answers? Only if they are truly amusing!
Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH=7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute):
• 7a. How many rounds of meiosis will occur during our 50 minute class period?
• 7b. How many total sperm will be produced by our BCB 444/544 class during that class period?
• 8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH=5)
Scoring:
• For 0.6 pts total (0.2 pt each): Answer all questions correctly & submit by to terrible@iastate.edu
• For 1 pt total: Prepare a PPT slide with all correct answers & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn more than 1 pt for this!
• Partial credit for incorrect answers? Only if they are truly amusing!
Information flow in the cell?
• DNA -> RNA -> protein:
  • Replication = DNA to DNA - by DNA polymerase
  • Transcription = DNA to RNA - by RNA polymerase
  • Translation = RNA to protein - by ribosomes
• Exceptions/Complications:
  • DNA rearrangements (by mobile genetic elements, recombination)
  • Reverse transcription (RNA -> DNA, by reverse transcriptase)
  • Post-transcriptional modifications:
    • RNA splicing (removal of introns, by spliceosome)
    • RNA editing (addition/removal of nucleotides - usually U's)
  • Post-translational modifications:
    • Protein processing
Modeling Metabolic Pathways?
see MetNet: http://metnet.vrac.iastate.edu/MetNet_overview.htm
Chromosomes & Genes
Genes in chromatin are not just “beads on a string”; they are packaged in complex structures that we don't yet fully understand
Gene regulation
• Transcriptional regulation is primarily mediated by proteins that bind cis-acting elements or DNA sequence signals associated with genes:
  • DNA level (sequence-specific) regulatory signals
    • Promoters, terminators
    • Enhancers, repressors, silencers
  • Chromatin level (global) regulation
    • Heterochromatin (inactive)
    • e.g., X-inactivation in female mammals
• In eukaryotes, genes are often regulated at other levels:
  • Post-transcriptional (RNA transport, splicing, stability)
  • Post-translational (protein localization, folding, stability)
Promoter = DNA sequences required for initiation of transcription; contain TF binding sites, usually "close" to start site
• Transcription factors (TFs) - proteins that regulate transcription
• (In eukaryotes) RNA polymerase binds by recognizing a complex of TFs bound at the promoter
• First, TFs must bind TF binding sites (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
Figure labels: ~200 bp; Pre-mRNA
Enhancers & repressors = DNA sequences that regulate initiation of transcription; contain TF binding sites, can be far from start site!
• Enhancers "enhance" transcription
• Repressors or silencers "repress" transcription
• Enhancer binding proteins (TFs) interact with RNAP
• Repressor binding proteins (TFs) block transcription
RNAP = RNA polymerase II
Figure labels: Promoter, Enhancer, Gene, Repressor; 10-50,000 bp
Transcription factors (TFs) & their binding sites (TFBSs)
• Transcription factors - trans-acting factors - proteins that either activate or repress transcription, usually by binding DNA (via a DNA binding domain) & interacting with RNA polymerase (via a "trans-activating" domain) to affect rate of transcription initiation
• Promoters, enhancers, and repressors - all contain binding sites for transcription factors
  • Promoters - usually located close to start site; vs
  • Enhancers/Silencers/Repressor sequences - can be close or very far away: located upstream, downstream or even within the coding sequence of genes !!
"Non-coding" DNA?
Many genes encode RNA that is not translated
4 Major Classes of RNA:
• mRNA = messenger RNA
• tRNA = transfer RNA
• rRNA = ribosomal RNA
• "Other" - Lots of these, diverse structures & functions:
  • "Natural" RNAs:
    • siRNA, miRNA, piRNA, snRNA, snoRNA, …
    • ribozymes
  • Artificial RNAs:
    • RNAi
    • antisense RNA
RNA Sequence, Structure & Function
• RNAs can have complex 3D structures (like proteins) & have many important functions in cellular processes
  • Ribosomes contain RNAs & proteins
  • Ribozymes are RNA enzymes capable of RNA cleavage
• RNA molecules are believed to be precursors to DNA-based life
  • Form complementary base pairs and replicate (like DNA)
  • Perform enzymatic functions (like proteins)
Protein Sequence, Structure & Function
• Amino acid sequence determines protein structure
  • But some proteins need help folding ("chaperones") in vivo
• Protein structure determines function
  • But level, timing & location of expression are important
  • Interactions with other proteins, DNA, RNA, & small ligands are also very important!!
• We don't know the "folding code" that determines how proteins fold!
• We don't know the "recognition code" that determines how proteins find and interact with correct partners!
A few Online Resources for: Cell & Molecular Biology
• NCBI Science Primer: What is a cell?
• NCBI Science Primer: What is a genome?
• BioTech’s Life Science Dictionary
• NCBI bookshelf
Chp 6 - Profiles & Hidden Markov Models
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST
TODAY:
• Profiles
• Markov Models & Hidden Markov Models
Algorithms & Software for MSA? #3 (NOT covered on Exam 1)
Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
  • Others: T-Coffee, DbClustal - see text: can be better than Clustal
  • Match closely-related sequences first using a guide tree
• Partial order alignments (POA)
  • Doesn't rely on guide tree; adds sequences in order given
• PRALINE
  • Preprocesses input sequences by building profiles for each
• Iterative methods
  • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (e.g., PRRN)
• Block-based Alignment
  • Multiple re-building attempts to find best alignment (e.g., DIALIGN2 & Match-Box)
• Local alignments
  • Profiles, Blocks, Patterns - more on these soon!
Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)
Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
Figure labels: TATA box = transcriptional promoter element; Sequence Logo
Patterns can also be represented as Sequence Logos
Sequence Logo
Sequence Logos: for Promoter elements (TF Binding Sites)
• Example was created from a set of TATA binding sites from the TRANSFAC database.
  • http://www.gene-regulation.com/pub/databases.html
• Logo was created by WebLogo.
  • http://weblogo.berkeley.edu/logo.cgi
• Can see the TATA box quite easily.
Sequence Logos - for RNA Splicing Sites
Human intron donor and acceptor sites
http://www-lmmb.ncifcrf.gov/~toms/gallery/SequenceLogoSculpture.gif
PSSM vs Profile
• Position-Specific Scoring Matrix (PSSM): from ungapped MSA
• Profile: from MSA, including gaps
PSI-BLAST Pseudocode:
    Convert query to PSSM (or a Profile)
    do {
        BLAST database with PSSM
        Stop if no new homologs are found
        Add new homologs to PSSM
    }
    Print current set of homologs
Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (which can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs; other authors refer to the pattern used by PSI-BLAST as a PSSM.
What is a PSSM? Position-Specific Scoring Matrix
A PSSM is:
• a representation of a motif
• an n by m matrix, where n is size of alphabet & m is length of sequence
• a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position
Figure labels: 20 letter alphabet; 8 residue sequence; “K” at position 3 gets a score of 2
Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA
Also sometimes called: Position Weight Matrix (PWM)
Note: Assumes positions are independent
PSSM Entries = Log-Odds Scores
• Estimate probability of observing each residue under the foreground model (i.e., the PSSM): $P(A \mid M)$, the probability of A given M, where M is the PSSM model; estimated from the observed frequency of residue “A”
• Divide by background probability of observing each residue: $P(A \mid B)$, the probability of A given B, where B is the background model
• Take log so that scores can be added (rather than multiplied):
  $\text{score}(A) = \log \dfrac{P(A \mid M)}{P(A \mid B)}$
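The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular tool: the toy alignment, the uniform background, the pseudocount of 1, the base-2 logarithm, and the function names `build_pssm` and `score` are all assumptions made for the example.

```python
import math
from collections import Counter

def build_pssm(alignment, background, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    alphabet = sorted(background)
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for residue in alphabet:
            # Foreground estimate P(residue | M), smoothed by a pseudocount
            # (the pseudocount is an assumption, it avoids log(0)).
            p_fg = (counts[residue] + pseudocount) / (
                n_seqs + pseudocount * len(alphabet))
            # Log-odds against the background model B.
            scores[residue] = math.log2(p_fg / background[residue])
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    """Add per-position log-odds scores (positions assumed independent)."""
    return sum(col[res] for col, res in zip(pssm, sequence))

# Toy ungapped alignment and uniform background (both made up).
alignment = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

pssm = build_pssm(alignment, background)
print(round(score(pssm, "TATAAA"), 3), round(score(pssm, "GCGCGC"), 3))
```

A sequence resembling the alignment gets a positive total score, while one built from residues never seen in the alignment scores negative, which is the sense in which log-odds scores separate foreground from background.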
Statistics References
• Statistical Inference (Hardcover), George Casella & Roger L. Berger
• StatWeb: A Guide to Basic Statistics for Biologists: http://www.dur.ac.uk/stat.web/
• Basic Statistics (correlations, tests, frequencies, etc.): http://www.statsoft.com/textbook/stbasic.html
• Electronic Statistics Textbook, StatSoft (from basic statistics to ANOVA to discriminant analysis, clustering, regression, data mining, machine learning, etc.): http://www.statsoft.com/textbook/stathome.html
Sequence Profiles
Goal: to characterize sequences belonging to a class (structural or functional) & determine whether a query sequence also belongs to that class
• DNA or RNA sequences
• Protein sequences
• Idea is to provide a "model" of the class against which we can test the new sequence
Protein Sequence Profiles & PSSMs
• Profile - a table that lists frequencies of each amino acid in each position of a protein sequence
• PSSM - a special type of Profile - with no gaps
• Frequencies are calculated from a MSA containing a domain of interest
• Can be used to generate a consensus sequence
• Derived scoring scheme can be used to align a new sequence to the profile
• Profile can be used in database searches (PSI-BLAST) to find new sequences that match the profile
• Profiles can also be used to compute MSAs heuristically (e.g., progressive alignment)
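As a rough illustration of the first few bullets, the sketch below tabulates per-column frequencies from a small gapped MSA and reads off a consensus. The alignment is invented, the gap character `-` is simply counted as one more symbol, and the function names are hypothetical.

```python
from collections import Counter

def column_frequencies(msa):
    """Per-column symbol frequencies; '-' (gap) is counted as a symbol."""
    n = len(msa)
    return [{sym: count / n for sym, count in Counter(col).items()}
            for col in zip(*msa)]

def consensus(profile):
    """Most frequent symbol in each column (ties broken alphabetically)."""
    return "".join(min(col, key=lambda s: (-col[s], s)) for col in profile)

# Toy protein MSA with one gapped column (made up for illustration).
msa = ["MKV-LA", "MKVGLA", "MRV-LS", "MKI-LA"]

profile = column_frequencies(msa)
print(profile[1])          # frequencies of K and R in column 2
print(consensus(profile))
```

Treating the gap as an ordinary symbol is the simplest possible choice; real profile methods typically handle gaps with position-specific penalties instead.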
PSI-BLAST Limitations for generating patterns or "motifs"
• With PSSMs, can't have insertions and deletions
• With Profiles, essentially 'add extra columns' to PSSM to allow for gaps
• Better approach (for defining domains)?
  • Profile HMM: elaborated version of a profile
  • Intuitively, a profile that models gaps
Sequence Motifs (Patterns)
Types of representations?
• √ Consensus Sequence
• √ Sequence Logo - "enhanced" consensus sequence, in which symbol size reflects information entropy
  • Information entropy??? In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data. - Wikipedia
  • Check out this interesting website: Tom Schneider, NCIF
  • http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• √ PSSM - Position-Specific Scoring Matrix
• √ Profiles HMMs - Hidden Markov Models
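To make the entropy idea concrete, here is a small sketch that computes the Shannon entropy of each alignment column and the corresponding information content (maximum entropy minus observed entropy), the quantity commonly used to scale letter stacks in a sequence logo. The binding sites are invented, the DNA alphabet size of 4 is an assumption, and no small-sample correction is applied.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Bits of information: log2(alphabet size) minus Shannon entropy."""
    n = len(column)
    entropy = -sum((count / n) * math.log2(count / n)
                   for count in Counter(column).values())
    return math.log2(alphabet_size) - entropy

# Toy set of aligned sites (made up for illustration).
sites = ["TATAAA", "TATAAT", "TATATA", "TTTAAA"]

for position, column in enumerate(zip(*sites), start=1):
    print(position, round(information_content(column), 3))
```

A perfectly conserved DNA column carries 2 bits, and a column with all four bases equally frequent carries 0 bits, which is why conserved positions stand tall in a logo.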
HMMs: an example
Nucleotide frequencies in human genome
CpG Islands
(Written CpG to distinguish from a C≡G base pair)
• CpG dinucleotides are rarer than would be expected from independent probabilities of C and G (given the background frequencies in human genome)
• High CpG frequency is sometimes biologically significant; e.g., sometimes associated with promoter regions (“start sites” for genes)
• CpG island - a region where CpG dinucleotides are much more abundant than elsewhere
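The "rarer than expected" comparison in the first bullet can be sketched as an observed/expected ratio. The formula used here (expected CpG count = number of C × number of G / sequence length) is one common convention and is an assumption of this example, as are the function name and the test sequences.

```python
def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected if C and G
    occurred independently at their observed frequencies."""
    seq = seq.upper()
    n = len(seq)
    n_c, n_g = seq.count("C"), seq.count("G")
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    expected = n_c * n_g / n if n else 0.0
    return observed / expected if expected else 0.0

# Same base composition, very different CpG content (made-up sequences).
print(cpg_obs_exp("CGCGCGCG"))
print(cpg_obs_exp("CCCCGGGG"))
```

A ratio well below 1 indicates CpG depletion, and a ratio near or above 1 over a stretch of sequence is the kind of signal a CpG island shows.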
Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• Observed variables
• Hidden variables
• Emitted symbols
• Emission probabilities
• Transition probabilities
• Graphical representation to illustrate relationships among these
The Occasionally Dishonest Casino
A casino uses a fair die most of the time, but occasionally switches to a "loaded" one
• Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
• These are emission probabilities
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
• Transitions between states obey a Markov process (more on Markov chains/models/processes a bit later)
An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
Emission probabilities
• Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
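One way to write this HMM down is as two lookup tables, which can then be used to simulate the casino. The sketch below is illustrative only: the self-transition values 0.99 and 0.8 are just the complements of the probabilities above, while starting with the fair die, the seed, and the names `TRANSITIONS`, `EMISSIONS`, and `simulate` are assumptions.

```python
import random

# Transition probabilities; each row must sum to 1, so
# P(F -> F) = 1 - 0.01 and P(L -> L) = 1 - 0.2.
TRANSITIONS = {"F": {"F": 0.99, "L": 0.01},
               "L": {"F": 0.2, "L": 0.8}}

# Emission probabilities for the fair (F) and loaded (L) die.
EMISSIONS = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}

def simulate(n_rolls, start="F", seed=0):
    """Return (hidden path, observed rolls) as two strings.
    Starting with the fair die is an assumption of this sketch."""
    rng = random.Random(seed)
    state, path, rolls = start, [], []
    for _ in range(n_rolls):
        path.append(state)
        faces = list(EMISSIONS[state])
        weights = [EMISSIONS[state][f] for f in faces]
        rolls.append(rng.choices(faces, weights=weights)[0])
        targets = list(TRANSITIONS[state])
        state = rng.choices(
            targets, weights=[TRANSITIONS[state][t] for t in targets])[0]
    return "".join(path), "".join(str(r) for r in rolls)

path, rolls = simulate(60)
print(path)
print(rolls)
```

The first printed line is what the casino actually did (hidden), and the second is all an observer gets to see.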
The Occasionally Dishonest Casino
• Known:
  • Structure of the model
  • Transition probabilities
• Hidden: What casino actually did
  • FFFFFLLLLLLLFFFF...
• Observable: Series of die tosses
  • 3415256664666153...
• What we must infer:
  • When was a fair die used?
  • When was a loaded one used?
  • Answer is a sequence: FFFFFFFLLLLLLFFF...
HMM: Making the Inference
• Model assigns a probability to each explanation for the observation, e.g.:
  P(326|FFL) = P(3|F) · P(F→F) · P(2|F) · P(F→L) · P(6|L) = 1/6 · 0.99 · 1/6 · 0.01 · ½ = 0.0001375
• Maximum Likelihood: Determine which explanation is most likely
  • Find path most likely to have produced observed sequence
• Total Probability: Determine probability that observed sequence was produced by HMM
  • Consider all paths that could have produced the observed sequence
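The product for rolls 3, 2, 6 along path FFL can be checked with exact fractions. In this sketch the function and variable names are hypothetical, and, as in the expression on the slide, no term is included for choosing the first die.

```python
from fractions import Fraction

# Transition probabilities between Fair (F) and Loaded (L).
A = {("F", "F"): Fraction(99, 100), ("F", "L"): Fraction(1, 100),
     ("L", "F"): Fraction(1, 5), ("L", "L"): Fraction(4, 5)}

def emit(state, face):
    """Emission probability of a die face in the given state."""
    if state == "F":
        return Fraction(1, 6)
    return Fraction(1, 2) if face == 6 else Fraction(1, 10)

def path_probability(rolls, path):
    """Multiply emissions and transitions along one explanation (path).
    No start-state term is included, matching the slide's expression."""
    prob = emit(path[0], rolls[0])
    for i in range(1, len(rolls)):
        prob *= A[(path[i - 1], path[i])] * emit(path[i], rolls[i])
    return prob

p = path_probability([3, 2, 6], "FFL")
print(p, float(p))
```

Calling `path_probability` with other paths, such as "FFF" or "LLL", gives the competing explanations that the maximum likelihood and total probability calculations compare or sum.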
HMM Notation
• $x$ = sequence of symbols emitted by model
• $x_i$ = symbol emitted at time $i$
• $\pi$ = path, a sequence of states
• $i$-th state in $\pi$ is $\pi_i$
• $a_{kr}$ = probability of making a transition from state $k$ to state $r$
• $e_k(b)$ = probability that symbol $b$ is emitted when in state $k$
Calculating Different Paths to an Observed Sequence
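Identifying the Most Probable Path
The most likely path $\pi^*$ satisfies:
$\pi^* = \arg\max_{\pi} P(x, \pi)$
To find $\pi^*$, consider all possible ways the last "symbol" of $x$ could have been emitted
Let $v_k(i)$ = probability of the most likely path $\pi_1 \ldots \pi_i$ that emits $x_1 \ldots x_i$ and ends in state $k$ (i.e., $\pi_i = k$)
Then $v_k(i) = e_k(x_i) \, \max_r \left( v_r(i-1) \, a_{rk} \right)$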
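Viterbi Algorithm
• Initialization ($i = 0$): $v_B(0) = 1$, and $v_k(0) = 0$ for every other state $k$ (B = begin state)
• Recursion ($i = 1, \ldots, L$): For each state $k$: $v_k(i) = e_k(x_i) \, \max_r \left( v_r(i-1) \, a_{rk} \right)$
• Termination: $P(x, \pi^*) = \max_k v_k(L)$
To find $\pi^*$, use trace-back, as in dynamic programming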
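The initialization, recursion, termination, and trace-back steps can be sketched as follows for the dishonest casino. This is a minimal illustration rather than a reference implementation: the begin state is folded into a `START` table that picks either die with probability 1/2 (an assumption of this sketch), the rolls 6, 2, 6 are chosen for illustration, and plain probabilities are used instead of logs, which is only safe for short sequences.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed begin-state transitions
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}

def viterbi(x, states, start, trans, emit):
    """Return (probability of the best path, best path as a string)."""
    # First column: leave the begin state and emit x[0].
    v = [{k: start[k] * emit[k][x[0]] for k in states}]
    back = []
    # Recursion: best predecessor r for each state k at each position.
    for symbol in x[1:]:
        prev, column, pointers = v[-1], {}, {}
        for k in states:
            best = max(states, key=lambda r: prev[r] * trans[r][k])
            column[k] = emit[k][symbol] * prev[best] * trans[best][k]
            pointers[k] = best
        v.append(column)
        back.append(pointers)
    # Termination and trace-back.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return v[-1][last], "".join(reversed(path))

prob, path = viterbi([6, 2, 6], STATES, START, TRANS, EMIT)
print(round(prob, 6), path)
```

For long roll sequences the products underflow, so a practical version would add log probabilities instead of multiplying.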
Viterbi: Example (x = 6, 2, 6)
State | i = 0 | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L | 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
Viterbi gets it right more often than not
An HMM for CpG islands
Emission probabilities are 0 or 1, e.g., $e_{G^-}(G) = 1$, $e_{G^-}(T) = 0$
See Durbin et al., Biological Sequence Analysis, Cambridge, 1998
Total Probability
Many different paths can result in observation $x$
Probability that our model will emit $x$ is the Total Probability:
$P(x) = \sum_{\pi} P(x, \pi)$
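For a very short sequence the sum over all paths can be carried out literally, which makes the definition concrete. The sketch below enumerates every path for the casino model; the rolls 6, 2, 6 and the begin state that picks either die with probability 1/2 are assumptions of this example, and the function names are hypothetical. The number of paths grows exponentially with sequence length, so this is only a check, not a practical algorithm.

```python
from itertools import product

START = {"F": 0.5, "L": 0.5}  # assumed begin-state transitions
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}

def emit(state, face):
    """Emission probability of a die face in the given state."""
    if state == "F":
        return 1 / 6
    return 1 / 2 if face == 6 else 1 / 10

def joint(x, path):
    """P(x, path): emissions and transitions multiplied along one path."""
    p = START[path[0]] * emit(path[0], x[0])
    for i in range(1, len(x)):
        p *= TRANS[path[i - 1]][path[i]] * emit(path[i], x[i])
    return p

def total_probability(x):
    """Sum P(x, path) over every possible state path (2**len(x) paths)."""
    return sum(joint(x, path) for path in product("FL", repeat=len(x)))

x = [6, 2, 6]
print(round(total_probability(x), 6))
print("".join(max(product("FL", repeat=len(x)), key=lambda p: joint(x, p))))
```

The second line printed is the single most probable path, found by the same enumeration, which shows how the maximum and the sum are two different questions asked of the same set of paths.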
Total Probability: Example (x = 6, 2, 6)
State | i = 0 | x1 = 6 | x2 = 2 | x3 = 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L | 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
Total probability = sum of the last column over states F and L = 0.004313 + 0.008144 = 0.012457
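The same column-by-column calculation, with the max of Viterbi replaced by a sum, is usually called the forward algorithm. A minimal sketch follows; it assumes rolls 6, 2, 6 and a begin state that picks either die with probability 1/2, and the function name `forward` and table names are hypothetical.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed begin-state transitions
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2},
}

def forward(x, states, start, trans, emit):
    """Return (total probability of x, last column of forward values)."""
    # First column: leave the begin state and emit x[0].
    f = {k: start[k] * emit[k][x[0]] for k in states}
    # Each later column sums over all predecessor states r.
    for symbol in x[1:]:
        f = {k: emit[k][symbol] * sum(f[r] * trans[r][k] for r in states)
             for k in states}
    return sum(f.values()), f

total, last_column = forward([6, 2, 6], STATES, START, TRANS, EMIT)
print(round(last_column["F"], 6), round(last_column["L"], 6), round(total, 6))
```

Unlike summing over every path explicitly, this takes time proportional to the sequence length times the square of the number of states.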