1 / 18

Sequence information, logos and Hidden Markov Models

Sequence information, logos and Hidden Markov Models. Morten Nielsen, CBS, BioCentrum, DTU. Information content. Information and entropy Conserved amino acid regions contain high degree of information (high order == low entropy)

nolcha
Download Presentation

Sequence information, logos and Hidden Markov Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum, DTU

  2. Information content • Information and entropy • Conserved amino acid regions contain high degree of information (high order == low entropy) • Variable amino acid regions contain low degree of information (low order == high entropy) • Shannon information D = log2(N) + S pi log2 pi (for proteins N=20, DNA N=4) • Conserved residue pA=1, pi<>A=0, D = log2(N) ( = 4.3 for proteins) • Variable region pA=0.05, pC=0.05, .., D = 0

  3. Sequence logo MHC class II Logo from 10 sequences • Height of a column equal to D • Relative height of a letter is pA • Highly useful tool to visualize sequence motifs High information positions http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

  4. Description of binding motif Example PA = 6/10 PG = 2/10 PT = PK = 1/10 PC = PD = …PV = 0 Problems Few data Data redundancy/duplication ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Sequence information

  5. ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Sequence information Raw sequence counting

  6. ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Sequence weighting

  7. ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Pseudo counts • Sequence weighting and pseudo count • Motif found on more data

  8. …and now you • cp files from/usr/opt/www/pub/CBS/researchgroups/immunology/intro/HMM/exercise • Make weight matrix and logos using • pep2mat -swt 2 -wlc 0 data > mat • mat2logo mat • ghostview logo.ps • Include sequence weighting • pep2mat -swt 0 -wlc 0 data > mat • make and view logo • Try the other sequence weighting scheme (clustering) -swt 1. What difference does this make? • Include pseudo counts • pep2mat data > mat • make and view logo

  9. Weight matrices • Estimate amino acid frequencies from alignment including sequence weighting and pseudo counts • Construct a weight matrix as Wij = log(pij/qj) • Here i is a position in the motif, and j an amino acid. qj is the prior frequency for amino acid j. • W is a L x 20 matrix, L is motif length • Score sequences to weight matrix by looking up and adding L values from matrix

  10. Weight matrix predictions • Use the program seq2hmm to evaluate the prediction accuracy of your weight matrix • seq2hmm -hmm mat -xs eval.set | grep -v # | args 2,3 | xycorr • What is going on here? • By leaving out the -xs option you can generate the scores at each position in the sequence. This is often useful for Neural Network training

  11. Complexity of problem Peptides of different length Weak motif signal Alignment crucial Gibbs Monte Carlo sampler RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTIE MHC class II prediction DRB1*0401 peptides

  12. Gibbs sample algorithm Alignment by Gibbs sampler E = i,j pij * log( p`ij/qi ) RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI • Maximize E using MC • Random change in offset • Random shift on box position • Accept moves to higher E always • Accept moves to lower E with decreasing probability

  13. Gibbs sampler exercise • The file clasII.fsa is a FASTA file containing 50 classII epitopes • gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat classII.fsa • The options -iw and -w 1,0,0,1,0,1,0,0,1 increase matrix weight on important anchor positions in binding motif • Make and view logo • Use the matrix to predict classII epitopes • cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr • Do you understand what is going on in this command?

  14. Hidden Markov Models • Weight matrices do not deal with insertions and deletions • In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension • HMM is a natural frame work where insertions/deletions are dealt with explicitly

  15. ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC Example from A. Krogh Core region defines the number of states in the HMM (red) Insertion and deletion statistics are derived from the non-core part of the alignment (black) HMM (a simple example) Core of alignment

  16. HMM construction • 5 matches. A, 2xC, T, G • 5 transitions in gap region • C out, G out • A-C, C-T, T out • Out transition 3/5 • Stay transition 2/5 ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC .4 .2 A C G T .4 .2 .2 .6 .6 .8 A C G T A C G T A C G T .8 A C G T 1 A C G T A C G T 1. 1. 1. 1. .4 .8 .2 .8 .2 .2 .2 .8 .2 ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10-2

  17. Align sequence to HMM ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2=3.3x10-2 TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8=0.0075x10-2 ACAC--AGC =1.2x10-2 AGA---ATC =3.3x10-2 ACCG--ATC =0.59x10-2 Consensus: ACAC--ATC =4.7x10-2, ACA---ATC =13.1x10-2 Exceptional: TGCT--AGG =0.0023x10-2

  18. Score depends strongly on length Null model is a random model. For length L the score is 0.25L Log-odds score for sequence S Log( P(S)/0.25L) Positive score means more likely than Null model ACA---ATG = 4.9 TCAACTATC = 3.0 ACAC--AGC = 5.3 AGA---ATC = 4.9 ACCG--ATC = 4.6 Consensus: ACAC--ATC = 6.7 ACA---ATC = 6.3 Exceptional: TGCT--AGG = -0.97 Align sequence to HMM - Null model Note!

More Related