470 likes | 540 Views
Reminder. Elements of an HMM. An HMM is characterized by the following: N, the number of states in the model. M, the number of distinct observation symbols per state. the state transition probability distribution where
E N D
Elements of an HMM An HMM is characterized by the following: • N, the number of states in the model. • M, the number of distinct observation symbols per state. • the state transition probability distribution where • the observation symbol probability distribution in state qj, , where bj(k) is the probability that the k-th observation symbol pops up at time t, given that the model is in state Ej. • the initial state distribution
Three Basic Problems for HMMs • Given the observation sequence O = O1O2O3…Ot, and a model m = (A, B, p), how do we efficiently compute P(O | m)? –Forward or backward algorithm • Given the observation sequence O and a model m, how do we choose a corresponding state sequence Q = q1q2q3…qt which is optimal in some meaningful sense? - Viterbi • How do we adjust the model parameters to maximize P(O | m)? – Baum Welch
HMM Applications Multiple Sequence alignment Protein families and motifs Gene Finding
Multiple alignments from unaligned sequences • Start with a model of random probabilities • Or a reasonable guess if it is available • Build a model from this alignment (Viterbi) • Use the alignment to improve the probabilities (BW) • May lead to a slightly different alignment • Stop when alignment fails to change iterate
Some preliminary remarks • Sequence alignment is useful for discovering functional, structural, and evolutionary information in biological research. • Different metrics (or notions of distance) could be defined to compare sequences. • Mathematician Peter Sellers (1974) showed that if a sequence alignment is formulated in terms of distances instead of similarity, a biologically more appealing interpretation of gaps is possible. • The latter is an evolution-motivated definition, relying on the concept of ancestry.
Multiple Sequence Alignment • The MSA of a set of sequences may be viewed as an evolutionary history of the sequences. • HMMs often provide a MSA as good as, if not better than, other methods. • No sequence ordering is required. • Insertion/deletion penalties are not needed. • The aligned sequences are used as the training data, to train the parameters of the model. • For each sequence, the Viterbi algorithm is then used to determine a path most likely to have produced that sequence. • This in turn can be used to realign the sequences
Multiple Sequence Alignment Consider the following Markov chain underlying a HMM, with three types of states: “match”; “insert”; “delete”
Modeling Protein Families • The states of our HMM will be divided into match states, insert states and delete states. • It is useful to include an initial state and a final one, and we assume that no match or delete state is visited more than once. • The alphabet M consists of twenty amino acids together with one dummy symbol representing “delete”. Delete states output only. • Each insert and match state has its own distribution over the 20 amino acids, and does not emit the symbol .
Multiple Sequence Alignment There are two extreme situations depending on the HMM parameters: • The emission probabilities for the match & insert states are uniform over the 20 amino acids - the model produces random sequences • Each state emits one specific amino acid with prob 1 &mimi+1 with probability 1 - the model produces the same sequence always
Multiple Sequence Alignment • Between the two extremes consider a “family” of somewhat similar sequences: • A “tight” family of very similar sequences • A “loose” family with little similarity • Similarity may be confined to certain areas of the sequences – if some match states emit a few amino acids, while other match states emit all amino acids uniformly/randomly • Allowing gap penalties and substitution probabilities to vary along the sequences reflects biological reality better.
Multiple Sequence Alignment • The model is chosen to have length equal to the average length of a sequence in the training set, and all parameters are initialized by using uniform distributions. • Start with “training”, or estimating, the parameters of the model using a set of training sequences from the protein family, using BW • Next, compute the path of states most likely to have produced each sequence • Amino acids are aligned if both are produced by the same match state in their paths
Example • Consider: CAEFDDH, CDAEFPDDH • Suppose the model has length 10, and the most likely paths for the two sequences are: m0m1m2m3m4d5d6m7m8m9m10 and m0m1i1m2m3m4d5 m6m7m8m9m10
Example The alignment induced is found by aligning positions generated by the same match state: m0 m1 m2 m3 m4 d5 d6 m7m8m9m10 C A E F D D H C D A E F P D D H m0 m1 i1 m2 m3m4 d5 m6m7m8m9m10
Example This leads to the following alignment: C– AEF–DDH CDAEFPDDH
The basic dogma DNA sequence contains genes which are transcribed and spliced into mRNA which is translated into protein. Every 3 bases of mRNA = 1 amino acid
What is a gene ? In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequence) are removed, leaving exons (expressed sequence), and yet other regions remain untranslated. The translated sequence comes in triples called codons, beginning and ending with a unique start (ATG) and one of three stop (TAA, TAG, TGA) codons.
What is a gene? • There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcription start sites, polyA sites,branching sites, and so on.
Summary In higher organisms, genes contain alternating regions of exons, which form the mature mRNA, and introns, which are spliced out. Exon 1 Exon 2 Exon 3 Transcription and splicing introns exons Exon 1 Exon 2 Exon 3 Translation Protein
In more detail (color ~state)
Some facts about human genes • Comprise about 3% of the genome • Average gene length: ~ 8,000 bp • Average of 5-6 exons/gene • Average exon length: ~200 bp
Some facts about human genes • Average intron length: ~2,000 bp • ~8% genes have a single exon • Some exons can be as small as 1 or 3 bp. • HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising 3% of a 67,000 bp gene
Gene finding • Given: A (potentially very long) string S over the alphabet {A,G,C,T} • Find: Intervals of that string which correspond to genes, and their intron/exon structure. ACAGATAGATGCAGACGAGTGACAGTGACACAGATAGATGCAGACGAGTGACAGTGACACAGATAGATGCAGACGAGTGACAGTGACCAGATAGATGCAGACGAGTGACAGTGACACAGATAGATGCAGACGAGTGACAGTGACACAGATAGATGCAGACGAGTGACAGTGACCAGATAGATGCAGACGAGTGACAGTGAACAGATAGATGCAGACGAGTGACAGTGACACAGATAGATGCAGACGAGTGACAGTGACACAGATAGATGCAGACGAGTGACAGTGAC exons introns
Gene Finding Challenges • Need the correct reading frame • Introns can interrupt an exon in mid-codon • There is no hard and fast rule for identifying donor and acceptor splice sites • Signals are very weak
The idea behind a GHMM genefinder • States represent standard gene features: intergenic region, exon, intron, perhaps more (promotor, 5’UTR, 3’UTR, Poly-A,..). • Observations embody state-dependent base composition, dependence, and signal features. • In a GHMM, duration must be included as well. Finally, reading frames and both strands must be dealt with.
Gene model B = gene start S = translation start D = donor A = accceptor T = translation stop E = gene end
Why HMMs might be a good fit for Gene Finding • Classification: Classifying observations within a sequence • Order: A DNA sequence is a set of ordered observations • Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here: • Success measure: # of complete exons correctly labeled • Training data: Available from various genome annotation projects
Two kinds of Cells • Prokaryotes – no nucleus (bacteria) • Their genomes are circular – have no introns • Eukaryotes – have nucleus (animal,plants) • Linear genomes with multiple chromosomes in pairs. When pairing up, they look like Middle: centromere Top: p-arm Bottom: q-arm
Formalization of the gene prediction problem • Given a sequence of letters of {A,C,G,T}, label each position with labels {I, T, P, G}, where I means intergenic, G means internal codons, T means start of a gene, P means stop codon. • Example: • ..TAGTCATGCATATTGAACTTGCATCGCCAGTTGCACATATTUGATTCTTA.. • ..IIIII T G G G G G G G G G G G P IIIIII..
Problem • The output letter of the HMM at one state only depends on the state itself. However, it should also depends on the previous output letter(s). • A more complex HMM • Replace Pr(output | current_state) by Pr(output | current_state, previous_output)
GeneScan Generalized HMM (GHMM) Each state may output a string of symbols (according to some probability distribution).
GENESCAN (Burge & Karlin) 6 2 0 0 1 A G G A C A G G T A C G G C T G T C A T C A C T T A G A C C T C A C C C T G T G G A G C C A C A C C 6 2 0 5 1 C T A G G G T T G G C C A A T C T A C T C C C A G G A G C A G G G A G G G C A G G A G C C A G G G C 6 2 1 0 1 T G G G C A T A A A A G T C A G G G C A G A G C C A T C T A T T G C T T A C A T T T G C T T C T G A 6 2 1 5 1 C A C A A C T G T G T T C A C T A G C A A C C T C A A A C A G A C A C C A T G G T G C A C C T G A C E E E 6 2 2 0 1 T C C T G A G G A G A A G T C T G C C G T T A C T G C C C T G T G G G G C A A G G T G A A C G T G G 0 1 2 6 2 2 5 1 A T G A A G T T G G T G G T G A G G C C C T G G G C A G G T T G G T A T C A A G G T T A C A A G A C 6 2 3 0 1 A G G T T T A A G G A G A C C A A T A G A A A C T G G G C A T G T G G A G A C A G A G A A G A C T C 6 2 3 5 1 T T G G G T T T C T G A T A G G C A C T G A C T C T C T C T G C C T A T T G G T C T A T T T T C C C 6 2 4 0 1 A C C C T T A G G C T G C T G G T G G T C T A C C C T T G G A C C C A G A G G T T C T T T G A G T C I I I 0 1 2 6 2 4 5 1 C T T T G G G G A T C T G T C C A C T C C T G A T G C T G T T A T G G G C A A C C C T A A G G T G A 6 2 5 0 1 A G G C T C A T G G C A A G A A A G T G C T C G G T G C C T T T A G T G A T G G C C T G G C T C A C 6 2 5 5 1 C T G G A C A A C C T C A A G G G C A C C T T T G C C A C A C T G A G T G A G C T G C A C T G T G A 6 2 6 0 1 C A A G C T G C A C G T G G A T C C T G A G A A C T T C A G G G T G A G T C T A T G G G A C C C T T E E i t 6 2 6 5 1 G A T G T T T T C T T T C C C C T T C T T T T C T A T G G T T A A G T T C A T G T C A T A G G A A G 6 2 7 0 1 G G G A G A A G T A A C A G G G T A C A G T T T A G A A T G G G A A A C A G A C G A A T G A T T G C 6 2 7 5 1 A T C A G T G T G G A A G T C T C A G G A T C G T T T T A G T T T C T T T T A T T T G C T G T T C A 6 2 8 0 1 T A A C A A T T G T T T T C T T T T G T T T A A T T C T T G C T T T C T T T T T T T T T C T T C T C 6 2 8 5 1 C G C A A T T T T T A C T A T T A T A C T T A A T G C C T T A A C A T T G T G T A T A A C A A A A G E 5'UTR s 3'UTR 6 2 9 0 1 G A A A T A T C T C T G A G A T A C A T T A A G T A A C T T A A A A A A A A A C T T T A C A C A G T 6 2 9 5 1 C T G C C T A G T A C A T T A C T A T T T G G A A T A T A T G T G T G C T T A T T T G C A T A T T C 6 3 0 0 1 A T A A T C T C C C T A C T T T A T T T T C T T T T A T T T T T A A T T G A T A C A T A A T C A T T 6 3 0 5 1 A T A C A T A T T T A T G G G T T A A A G T G T A A T G T T T T A A T A T G T G T A C A C A T A T T poly-A 6 3 1 0 1 G A C C A A A T C A G G G T A A T T T T G C A T T T G T A A T T T T A A A A A A T G C T T T C T T C promoter 6 3 1 5 1 T T T T A A T A T A C T T T T T T G T T T A T C T T A T T T C T A A T A C T T T C C C T A A T C T C 6 3 2 0 1 T T T C T T T C A G G G C A A T A A T G A T A C A A T G T A T C A T G C C T C T T T G C A C C A T T 6 3 2 5 1 C T A A A G A A T A A C A G T G A T A A T T T C T G G G T T A A G G C A A T A G C A A T A T T T C T 6 3 3 0 1 G C A T A T A A A T A T T T C T G C A T A T A A A T T G T A A C T G A T G T A A G A G G T T T C A T Forward (+) strand Forward (+) strand intergenic 6 3 3 5 1 A T T G C T A A T A G C A G C T A C A A T C C A G C T A C C A T T C T G C T T T T A T T T T A T G G region Reverse (-) strand Reverse (-) strand 6 3 4 0 1 T T G G G A T A A G G C T G G A T T A T T C T G A G T C C A A G C T A G G C C C T T T T G C T A A T 6 3 4 5 1 C A T G T T C A T A C C T C T T A T C T T C C T C C C A C A G C T C C T G G G C A A C G T G C T G G 6 3 5 0 1 T C T G T G T G C T G G C C C A T C A C T T T G G C A A A G A A T T C A C C C C A C C A G T G C A G 6 3 5 5 1 G C T G C C T A T C A G A A A G T G G T G G C T G G T G T G G C T A A T G C C C T G G C C C A C A A 6 3 6 0 1 G T A T C A C T A A G C T C G C T T T C T T G C T G T C C A A T T T C T A T T A A A G G T T C C T T 6 3 6 5 1 T G T T C C C T A A G T C C A A C T A C T A A A C T G G G G G A T A T T A T G A A G G G C C T T G A 6 3 7 0 1 G C A T C T G G A T T C T G C C T A A T A A A A A A C A T T T A T T T T C A T T G C A A T G A T G T
GeneScan • Currently, a popular and successful gene finder for human DNA sequences is GENSCAN (Burge et al. 1997.) • It is based on a generalization of HMMs, called Semi hidden Markov Models. • The algorithms involved in this model are an order of magnitude more complex than for a regular HMM. • The gene-finding application requires a generalization of the Viterbi algorithm.
GeneScan • Burge (1997) observed that if the lengths of the long intergenic regions can be taken as having geometric distributions, and if these lengths generate sequences in a relatively iid fashion, then the algorithm can be adjusted so that practical running times can be obtained.
HMM Gene Finders:VEIL • A straight HMM Gene Finder • Takes advantage of grammatical structure and modular design • Uses many states that can only emit one symbol to get around state independence
Effectiveness of HMM-based finders • The best gene-finding HMM (GenScan, Burge and Karlin 1997) has ~80% sensitivity and ~80% specificity at the exon level. (That is, roughly 80% of true exons are entirely correctly found, and about 80% of the predicted exons are entirely correct.)
Beyond position-specific distributions • The bases in splice sites exhibit dependence, and not simply of the nearest neighbor kind. • High-order (non-stationary) Markov models would be one option, but the number of parameters in relation to the amount of data rules them out. The class of variable length Markov models (VLMMs) deriving from early research by Rissanen prove to be valuable in this context.
Pfam • Pfam is a web-based resource maintained by the Sanger Center http://www.sanger.ac.uk/Pfam • Pfam uses the basic theory described above to determine protein domains in a query sequence. • Suppose that a new protein is obtained for which no information is available except the raw sequence. • We wish to “annotate” this sequence.
Protein Family Classification • Pfam • large collection of multiple sequence alignments and hidden Markov models • covers many common protein domains and families • Over 73% of all known protein sequences have at least one match • 5,193 different protein families
Pfam Family Types • family – default classification, stating members are related • domain – structural unit found in multiple protein contexts • repeat –domain that in itself is not stable, but when combined with multiple tandem repeats forms a domain or structure • motif – shorter sequence units found outside of domains
Pfam • Initial multiple alignment of seeds using a program such as Clustal • Alignment hand scrutinized and adjusted • additional sequences are added to the family by comparing the HMM against sequence databases • Resulting full alignments with additional family members may look worse than initial seed alignments