David H. Ardell, Forskarassistent. Introduction to Probabilistic Sequence Models: Theory and Applications. Lecture Outline: Intro. to Probabilistic Sequence Models. Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions
David H. Ardell, Forskarassistent
Introduction to Probabilistic Sequence Models: Theory and Applications
Lecture Outline: Intro. to Probabilistic Sequence Models
• Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions
• Probabilistic Sequence Models: profiles, HMMs, SCFGs
Consensus sequences revisited
• Consensus sequences make poor summaries
A motif is a short stretch of protein sequence associated with a particular function (R. Doolittle, 1981)
• The first described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins
• [GA]x(4)GK[ST]
• A variety of databases of such motifs exist, such as BLOCKS, PROSITE, and PRINTS, and there are many tools to search proteins for matches to these motifs and blocks.
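The P-loop pattern above can be turned into an ordinary regular expression and searched for in a protein sequence. The following is a minimal sketch, not the method used by any of the databases named above: the helper name `prosite_to_regex` and the test peptide are hypothetical, and only the small subset of the pattern syntax seen on this slide (`x`, `(n)`, character classes) is handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regex.

    Hypothetical helper: handles only 'x' (any residue), '(n)' or '(n,m)'
    repeat counts, optional '-' separators, and [..] character classes.
    """
    regex = pattern.replace("-", "").replace("x", ".")
    # x(4) has become .(4); rewrite the repeat count as .{4}
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)
    return regex

p_loop = prosite_to_regex("[GA]x(4)GK[ST]")   # "[GA].{4}GK[ST]"

# Illustrative peptide fragment chosen so that it contains one match.
peptide = "LVVVGAGGVGKSALT"
hit = re.search(p_loop, peptide)
print(p_loop, hit.group(), hit.span())
```

The converted pattern is a plain Python regex, so any of the usual `re` functions (`search`, `finditer`, `fullmatch`) can be applied to it.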
Introduction to Regular Expressions (Regexes)
• Regular expressions specify sets of sequences that match a pattern.
• Ex: a[bc]a matches "aba" and "aca"
• In addition to literals like a and b in the last example, regular expressions provide quantifiers like * (0 or more), + (1 or more), ? (0 or 1) and {N,M} (between N and M):
• Ex: a[bc]*a matches "aa", "aba", "acca", "acbcba" etc.
• There are also grouping constructions: character classes like [xy], groups that can be quantified as a unit like (this)+, and alternation with |, which means "or" in (this|that)
• Anchors match the beginning ^ and end $ of strings
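The constructs listed above can be checked directly with Python's standard `re` module. This is a small sketch; the helper name `full` is an assumption introduced for illustration, and it tests whether the whole string matches the pattern.

```python
import re

def full(pattern, text):
    """True if the entire string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Character class: exactly one b or c between two a's
print(full("a[bc]a", "aba"), full("a[bc]a", "aca"), full("a[bc]a", "aa"))

# Quantifier *: zero or more b/c between two a's
print([full("a[bc]*a", s) for s in ("aa", "aba", "acca", "acbcba")])

# Grouping and alternation: one or more repeats of "this" or "that"
print(full("(this|that)+", "thisthat"))

# Anchors: ^ matches at the start of the string, $ at the end
print(bool(re.search("^ab", "abc")), bool(re.search("ab$", "abc")))
```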
IUPAC DNA ambiguity codes as regex classes
• pYrimidines Y = [CT]
• puRines R = [AG]
• Strong S = [CG]
• Weak W = [AT]
• Keto K = [GT]
• aMino M = [AC]
• not A: B = [CGT] (B is the letter after A)
• not C: D = [AGT]
• not G: H = [ACT]
• not T: V = [ACG]
• aNy base N = [ACGT]
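The table above maps directly onto a lookup dictionary. Below is a minimal sketch that expands an IUPAC-coded DNA motif into a regular expression; the function name `iupac_to_regex` and the example strings are assumptions made for illustration.

```python
import re

# One regex character class per IUPAC code, copied from the list above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]", "R": "[AG]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Expand each IUPAC letter of a DNA motif into its regex class."""
    return "".join(IUPAC[base] for base in motif.upper())

pattern = iupac_to_regex("TRWWAT")
print(pattern)                                   # T[AG][AT][AT]AT
print(re.fullmatch(pattern, "TATAAT") is not None)
```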
Regular Expressions are like machines that eat sequences one letter at a time
Ex: a[bc]+a matching "ghstuacbah"
[Figure: state machine running from Begin to End, with forward transitions labelled a, [bc], [bc], a and fallback transitions labelled [^a], [^bc], [^bc]]
Input remaining at each step of the animation: ghstu…, hstua…, stuac…, tuacb…, uacbah, acbah, cbah, bah, ah, h, MATCH!
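The machine on these slides can be sketched as a small hand-written state machine that consumes one letter per step. This is an illustrative reconstruction rather than the exact diagram: the state numbering is an assumption, and one detail is handled explicitly here, namely that an "a" seen right after an "a" restarts the match instead of falling back to Begin.

```python
import re

def find_match(text):
    """Scan text for a[bc]+a one letter at a time.

    States (assumed numbering): 0 = Begin, 1 = saw 'a',
    2 = saw 'a' followed by one or more of b/c.
    Returns (start, end) of the first match, or None.
    """
    state, start = 0, None
    for i, ch in enumerate(text):
        if state == 0:
            if ch == "a":
                state, start = 1, i
        elif state == 1:
            if ch in "bc":
                state = 2
            elif ch == "a":
                start = i          # a new candidate match begins here
            else:
                state = 0
        else:  # state == 2
            if ch == "a":
                return (start, i + 1)   # End state reached: MATCH!
            if ch not in "bc":
                state = 0
    return None

span = find_match("ghstuacbah")
print(span, "ghstuacbah"[span[0]:span[1]])
# Cross-check against Python's own regex engine
print(re.search("a[bc]+a", "ghstuacbah").span())
```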
Motifs almost always lack either specificity or sensitivity
• The first described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins
• [GA]x(4)GK[ST]
• Prob. of this motif ≈ (1/10)(1/20)(1/20)(1/10) = 0.000025
• Expected number of matches in a database with 3.2 × 10^8 residues: about 8000!
• About half of the proteins that match this motif are not NTPases of the P-loop class (lack of specificity).
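The arithmetic on this slide can be reproduced in a few lines. The sketch below assumes, as the slide's numbers imply, that all 20 amino acids are equally likely at every position; the variable names are illustrative.

```python
from fractions import Fraction

# Per-position match probabilities for [GA] x(4) G K [ST],
# assuming a uniform frequency of 1/20 for every amino acid.
per_position = (
    [Fraction(2, 20)]        # [GA]: two allowed residues
    + [Fraction(1)] * 4      # x(4): any residue
    + [Fraction(1, 20)]      # G
    + [Fraction(1, 20)]      # K
    + [Fraction(2, 20)]      # [ST]: two allowed residues
)

p_motif = Fraction(1)
for p in per_position:
    p_motif *= p

database_residues = 320_000_000          # 3.2 x 10^8, from the slide
expected_matches = p_motif * database_residues

print(float(p_motif), int(expected_matches))
```

The expected count treats every residue as a possible start position, which slightly overcounts near sequence ends but matches the rough estimate on the slide.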
Motifs almost always lack either specificity or sensitivity
• [GA]x(4)GK[ST]
• Larger and larger alignments of true members of the class give more and more exceptions to the rule (lack of sensitivity)
• Extending the rule ([GAT]x(4)[GAF][KTL][STG]) leads to loss of specificity
A better way to model motifs
• REGULAR EXPRESSIONS
• “(TTR[ATC]WT) N{15,22} (TRWWAT)”
• Can find alternative members of a class
• Treat alternative character states as equally likely.
• Treat all spacer lengths as equally likely.
• PROFILES (Position-Specific Score Matrices)
A graphical view of the same profile, built from the alignment:
CCGTL…
CGHSV…
GCGSL…
CGGTL…
CCGSS…
[Figure: the residues observed in each alignment column, drawn position by position]
You can also allow for unobserved residues or bases in a profile by giving them small probabilities.
[Figure: DNA profile in which bases not seen in a column are also drawn, with small probabilities]
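One common way to give unobserved residues a small probability is to add a pseudocount to every count before normalising. The sketch below builds a profile from the five aligned fragments shown earlier; the function name `build_profile`, the pseudocount of 1 and the choice of the 20-letter amino-acid alphabet are assumptions, not values taken from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: probability} dict per alignment column.

    Every residue in the alphabet receives `pseudocount` extra
    observations, so unobserved residues get a small nonzero probability.
    """
    n_columns = len(alignment[0])
    total = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for col in range(n_columns):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

alignment = ["CCGTL", "CGHSV", "GCGSL", "CGGTL", "CCGSS"]
profile = build_profile(alignment)

# Column 1 has four C and one G; A was never observed there.
print(profile[0]["C"], profile[0]["G"], profile[0]["A"])
```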
The probability that a sequence matches a profile P is the product of its parts:
[Figure: profile P, with per-position base probabilities A 0.8, A 0.7, T 0.1, T 0.2, A 0.1, G 0.8, T 0.6, C 0.7, G 0.2, C 0.2, G 0.1]
Ex: p(AAGCT | P) = p(A) × p(A) × p(G) × p(C) × p(T) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.188
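The product rule above can be sketched as a short function. Only the five profile entries used in the worked example are included, because the column positions of the remaining values cannot be recovered from the figure; the names `PROFILE` and `profile_probability` are illustrative.

```python
# Per-position probabilities for the bases used in the example.
# ASSUMPTION: the other entries of the profile are omitted because
# their column assignments are not recoverable from the slide.
PROFILE = [
    {"A": 0.8},
    {"A": 0.7},
    {"G": 0.8},
    {"C": 0.7},
    {"T": 0.6},
]

def profile_probability(sequence, profile):
    """Multiply the per-position probabilities of each base in the sequence."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    p = 1.0
    for base, column in zip(sequence, profile):
        p *= column[base]
    return p

p_profile = profile_probability("AAGCT", PROFILE)
print(round(p_profile, 5))
```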
In practice, we compare this probability to that of matching a null model
[Figure: the profile shown beside a null model in which A, G, T and C all appear at every position]
The null model is usually based on overall composition.
Null model: A 0.25, G 0.25, T 0.25, C 0.25
[Figure: profile P shown for comparison]
No positional information need be taken into account.
Example: probabilities of AAGCT with the two models
Profile: p = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.188
Null model (A 0.25, G 0.25, T 0.25, C 0.25): p = 0.25^5 ≈ 0.00098
Example: odds ratio of AAGCT with the two models
Profile: p = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.188
Null model (A 0.25, G 0.25, T 0.25, C 0.25): p = 0.25^5 ≈ 0.00098
The odds ratio is 0.18816 / 0.0009766 ≈ 193. AAGCT is about 193 times more probable under the profile than under the null model!
As with substitution scoring matrices, we prefer the log-odds as a profile score.
A positive log-odds score indicates that the sequence fits the profile better than the null model.
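A log-odds score for the AAGCT example can be computed either from the two sequence probabilities or as a sum of per-position terms. The sketch below uses base-2 logarithms so that the score is in bits; that choice of base and the variable names are assumptions.

```python
import math

profile_probs = [0.8, 0.7, 0.8, 0.7, 0.6]   # profile entries for A, A, G, C, T
background = 0.25                            # uniform null model

p_profile = math.prod(profile_probs)
p_null = background ** len(profile_probs)

odds = p_profile / p_null
score_bits = math.log2(odds)

# The same score as a sum of per-position log-odds, which is how a
# position-specific score matrix is normally used.
score_sum = sum(math.log2(p / background) for p in profile_probs)

print(round(odds), round(score_bits, 2), round(score_sum, 2))
```

Because logarithms turn products into sums, each position contributes an independent additive score, and a sequence that is less probable under the profile than under the null model receives a negative total.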
Digression: interpreting BLAST results
• The bit score is a scaled log-odds of homology versus chance
• The E-value is the expected number of chance hits with scores of at least S
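The two quantities are linked. A relation that is not on the slides but is standard in Karlin-Altschul statistics is E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. The sketch below ignores the edge-effect length corrections that BLAST applies, and the function name and the numbers in the usage line are hypothetical.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    Simplification: uses raw lengths, without BLAST's effective-length
    corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search: a 300-residue query against 3.2 x 10^8 residues
for bits in (30, 40, 50):
    print(bits, expected_chance_hits(bits, 300, 320_000_000))
```

Each additional bit of score halves the expected number of chance hits, which is why bit scores can be compared across searches while E-values depend on the database size.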
A better way to model motifs
• REGULAR EXPRESSIONS
• “(TTR[ATC]WT) N{15,22} (TRWWAT)”
• Can find alternative members of a class
• Treat alternative character states as equally likely.
• Treat all spacer lengths as equally likely.
• PROFILES (Position-Specific Score Matrices)
• Turn a multiple sequence alignment into a multidimensional (by position) multinomial distribution.
• Explicit accounting of observed character states
• Cannot handle gaps (separate models must be made for different spacer lengths; O’Neill and Chiafari 1989)
• Can't be used to make alignments
Hidden Markov Models
• A Hidden Markov Model is a machine that can either parse or emit a family of sequences according to a Markov model
• The same symbols can put the machine in different states (A, C, T, G can be in a promoter, a codon, a terminator, etc.), so we say the states are “hidden”
• Example: The Dice Factory
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Biased die: P(1) = 3/6; P(2) = P(3) = P(4) = P(5) = P(6) = 1/10
Transition probabilities shown in the figure: 0.99, 0.01, 0.70, 0.30
GENERATED: ...11452161621233453261432152211121611112211...
[Figure: BIASED and FAIR state tracks aligned under the rolls, one for the GENERATED path and one for the PREDICTED path]
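The PREDICTED track in the dice factory example is the kind of output produced by Viterbi decoding, which finds the most probable state path for a sequence of rolls. The sketch below is one way to do that under stated assumptions: the emission probabilities come from the slide, but the assignment of the four transition probabilities to states (fair stays fair with 0.99, biased stays biased with 0.70) and the 50/50 start distribution are guesses, since the figure layout is not recoverable.

```python
import math

STATES = ("F", "B")   # F = fair die, B = biased die

# ASSUMPTION: which arrow carries which probability is not recoverable
# from the slide; this pairing is a guess.
TRANS = {"F": {"F": 0.99, "B": 0.01},
         "B": {"F": 0.30, "B": 0.70}}
START = {"F": 0.5, "B": 0.5}          # ASSUMPTION: not given on the slide

EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "B": {"1": 0.5, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.1},
}

def viterbi(rolls):
    """Return the most probable state path (a string of F/B) for the rolls."""
    if not rolls:
        return ""
    # Work in log space to avoid underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for face in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda q: score[q] + math.log(TRANS[q][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][face]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

rolls = "11452161621233453261432152211121611112211"
print(rolls)
print(viterbi(rolls))
```

With different transition assignments the predicted path changes, so the output here illustrates the decoding step and is not a reproduction of the slide's PREDICTED track.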
A Profile HMM is a profile with gaps: insertions and deletions
[Figure: the DNA profile with additional insertion and deletion states between match positions]
The HMMer Null Model
A 0.25, G 0.25, T 0.25, C 0.25
(the composition of insertions may be set by the user, e.g. to match the genome)
The Plan 7 architecture in HMMer
• Permit local matches to sequence
• Permit local matches to model
• Permit repeated matches to sequence
HMMer2 (pronounced 'hammer', as in, “Why BLAST if you can hammer?”)
The HMMer2 design separates models from algorithms
• With the same alignment or model design, you can easily change the search algorithm (encoded in the HMM) to do:
• Multihit global alignments of model to sequence
• Multihit Smith-Waterman (local with respect to both model and sequence, multiple non-overlapping hits to sequence allowed)
• Single (best) hit variants of both of the above.
This separation of model from algorithm provides a ready framework for sequence analysis (programs provided in HMMer)
• hmmalign: Align sequences to an existing model.
• hmmbuild: Build a model from a multiple sequence alignment.
• hmmcalibrate: Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).
• hmmconvert: Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.
• hmmemit: Emit sequences probabilistically from a profile HMM.
• hmmfetch: Get a single model from an HMM database.
• hmmindex: Index an HMM database.
• hmmpfam: Search an HMM database for matches to a query sequence.
• hmmsearch: Search a sequence database for matches to an HMM.
HMMer2 format can be automatically converted for use with SAM