Sequence motifs, information content, logos, and HMM’s

Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU

Outline • Multiple alignments and sequence motifs • Weight matrices and consensus sequence • Sequence weighting • Low (pseudo) counts • Information content • Sequence logos • Mutual information • Example from the real world • HMM’s and profile HMM’s • TMHMM (trans-membrane protein) • Gene finding • Links to HMM packages

Core Consensus sequence Weight matrices Problems Sequence weights Low counts ----------MLEFVVEADLPGIKA-------- ----------MLEFVVEFALPGIKA-------- ----------MLEFVVEFDLPGIAA-------- -------------YLQDSDPDSFQD-------- ---GSDTITLPCRMKQFINMWQE---------- ---RNQEERLLADLMQNYDPNLR---------- -------YDPNLRPAERDSDVVNVSLK------ ----------NVSLKLTLTNLISLNEREEA--- ----EREEALTTNVWIEMQWCDYR--------- ----------WCDYRLRWDPRDYEGLWVLR--- --LWVLRVPSTMVWRPDIVLEN----------- ------------IVLENNVDGVFEVALYCNVL- -------------YCNVLVSPDGCIYWLPPAIF ---------PPAIFRSACSISVTYFPFDW---- ********* FVVEFDLPG Multiple alignment and sequence motifs Consensus

Sequences weighting 1 - Clustering ----------MLEFVVEADLPGIKA-------- ----------MLEFVVEFALPGIKA-------- ----------MLEFVVEFDLPGIAA-------- -------------YLQDSDPDSFQD-------- ---GSDTITLPCRMKQFINMWQE---------- ---RNQEERLLADLMQNYDPNLR---------- -------YDPNLRPAERDSDVVNVSLK------ ----------NVSLKLTLTNLISLNEREEA--- ----EREEALTTNVWIEMQWCDYR--------- ----------WCDYRLRWDPRDYEGLWVLR--- --LWVLRVPSTMVWRPDIVLEN----------- ------------IVLENNVDGVFEVALYCNVL- -------------YCNVLVSPDGCIYWLPPAIF ---------PPAIFRSACSISVTYFPFDW---- ********* } Homologous sequences Weight = 1/n (1/3) Consensus sequence YRQELDPLV Previous FVVEFDLPG

Sequences weighting 2 - (Henikoff & Henikoff) w FVVEADLPG 0.37 FVVEFALPG 0.43 FVVEFDLPG 0.32 YLQDSDPDS 0.59 MKQFINMWQ 0.90 LMQNYDPNL 0.68 PAERDSDVV 0.75 LKLTLTNLI 0.85 VWIEMQWCD 0.84 YRLRWDPRD 0.51 WRPDIVLEN 0.71 VLENNVDGV 0.59 YCNVLVSPD 0.71 FRSACSISV 0.75 • waa’ = 1/rs • r: Number of different aa in a column • s: Number occurrences • Normalize so S waa= 1 for each column • Sequence weight is sum of waa F: r=7 (FYMLPVW), s=4 w’=1/28, w = 0.055 Y: s=3, w`=1/21, w = 0.073 M,P,W: s=1, w’=1/7, w = 0.218 L,V: s=2, w’=1/14, w = 0.109

--------MLEFVVEADLPGIKA-------- --------MLEFVVEFALPGIKA-------- --------MLEFVVEFDLPGIAA-------- -----------YLQDSDPDSFQD-------- -GSDTITLPCRMKQFINMWQE---------- -RNQEERLLADLMQNYDPNLR---------- -----YDPNLRPAERDSDVVNVSLK------ --------NVSLKLTLTNLISLNEREEA--- --EREEALTTNVWIEMQWCDYR--------- --------WCDYRLRWDPRDYEGLWVLR--- LWVLRVPSTMVWRPDIVLEN----------- ----------IVLENNVDGVFEVALYCNVL- -----------YCNVLVSPDGCIYWLPPAIF -------PPAIFRSACSISVTYFPFDW---- ********* Limited number of data Poor sampling of sequence space I is not found at position P1. Does this mean that I is forbidden? No! Use Blosum matrix to estimate pseudo frequency of I Low count correction P1

# I L V L 0.1154 0.3755 0.0962 V 0.1646 0.1303 0.2689 Every time for instance L/V is observed, I is also likely to occur Estimate low (pseudo) count correction using this approach As more data are included the pseudo count correction becomes less important Low count correction using Blosum matrices Blosum62 substitution frequencies NL = 2, NV=2, Neff=12 => fI = (2*0.1154 + 2*0.1646)/12 = 0.05 pI* = (Neff * pI + b * fI)/(Neff+b) = (12*0 + 10*0.05)/(12+10) = 0.02

Information content • Information and entropy • Conserved amino acid regions contain high degree of information (high order == low entropy) • Variable amino acid regions contain low degree of information (low order == high entropy) • Shannon information D = log2(N) + S pi log2 pi (for proteins N=20, DNA N=4) • Conserved residue pA=1, pi<>A=0, D = log2(N) ( = 4.3 for proteins) • Variable region pA=0.05, pC=0.05, .., D = 0

Sequence logo MHC class II Logo from 10 sequences • Height of a column equal to D • Relative height of a letter is pA • Highly useful tool to visualize sequence motifs High information position http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

More on logos Frequency matrix A R N D C Q E G H I L K M F P S T W Y V 2 1 1 1 1 1 1 1 1 4 16 1 6 15 7 1 2 7 18 13 8 19 1 1 7 2 2 2 1 3 15 13 6 2 1 2 2 7 1 8 3 2 7 2 1 17 13 2 1 8 14 3 1 1 7 7 2 0 1 8 8 13 13 14 1 2 13 2 1 2 3 3 1 7 1 3 7 0 1 7 4 1 7 7 7 1 2 2 1 13 15 2 6 6 1 7 2 7 7 4 5 2 8 23 1 6 3 2 1 3 3 2 1 1 1 13 8 0 1 18 2 1 7 13 1 1 2 2 1 8 14 2 6 1 20 7 2 7 1 3 3 7 7 8 7 1 7 8 1 2 8 2 1 1 13 7 2 7 1 7 3 2 7 19 1 6 2 8 1 9 9 2 1 1 1 7 2 0 1 18 • Information content D = S pi log2 (pi/qi) • Shannon, qi = 1/N = 0.05 D = S pi log2 (pi) - S pi log2 (1/N) = log2 N - S pi log2 (pi) • Kullback-Leibler, qi = background frequency • V/L/A more frequent than for instance C/H/W

Mutual information P1 P6 I(i,j) = SaaiSaajP(aai, aaj) * log[P(aai, aaj)/P(aai)*P(aaj)] P(G1) = 2/9 = 0.22, .. P(V6) = 4/9 = 0.44,.. P(G1,V6) = 2/9 = 0.22, P(G1)*P(V6) = 8/81 = 0.10 log(0.22/0.10) > 0 ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS

Mutual information 313 binding peptides 313 random peptides

Mutual information at anchor position is low • Mutual information between anchor positions 2 and 9 and other residues low • At pos 2 we know that L,M,T,V and I are the most frequent amino acids. • At pos 9 V,L,I and A are most frequent • 313 Rammensee + Buus pep • P(L2) = 0.51, P(V9)=0.48, P(L2,V9) = 0.23 • P(L2,V9)/(P(L2)*P(V9) )=0.23/0.24 = 1.0 • Knowing that we have L at position 2 does nottell us which one of V,L or I is placed on position 9 => NO mutual information

Weight matrices • Estimate amino acid frequencies from alignment inc. sequence weighting and pseudo counts • Now a weight matrix is given as Wij = log(pij/qj) • Here i is a position in the motif, and j an amino acid. qj is the background frequency for amino acid j. • W is a L x 20 matrix, L is motif length • Score sequences to weight matrix by looking up and adding L values from matrix

10 peptides from MHCpep database Bind to the MHC complex Relevant for immune system recognition Estimate sequence motif and weight matrix Evaluate on 528 peptides ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Example from real life

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Example (cont.) • Raw sequence counting • No sequence weighting • No pseudo count • Prediction accuracy 0.45 • Sequence weighting • No pseudo count • Prediction accuracy 0.5

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Example (cont.) • Sequence weighting and pseudo count • Prediction accuracy 0.60 • Motif found on all data (485) • Prediction accuracy 0.79

Hidden Markov Models • Weight matrices do not deal with insertions and deletions • In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension • HMM is a natural frame work where insertions/deletions are dealt with explicitly

ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC Example from A. Krogh Core region defines the number of states in the HMM (red) Insertion and deletion statistics is derived from the non-core part of the alignment (blue) HMM (a simple example) Core of alignment

HMM construction • 5 matches. A, 2xC, T, G • 5 transitions in gap region • C out, G out • A-C, C-T, T out • Out transition 3/5 • Stay transition 2/5 ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC .4 .2 A C G T .4 .2 .2 .6 .6 .8 A C G T A C G T A C G T .8 A C G T 1 A C G T A C G T 1. 1. 1. 1. .4 .8 .2 .8 .2 .2 .2 .8 .2 ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10-2

Align sequence to HMM ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2=3.3x10-2 TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8=0.0075x10-2 ACAC--AGC =1.2x10-2 AGA---ATC =3.3x10-2 ACCG--ATC =0.59x10-2 Consensus: ACAC--ATC =4.7x10-2, ACA---ATC =13.1x10-2 Exceptional: TGCT--AGG =0.0023x10-2

Score depends strongly on length Null model is a random model. For length L the score is 0.25L Log-odd score for sequence S Log( P(S)/0.25L) ACA---ATG = 4.9 TCAACTATC = 3.0 ACAC--AGC = 5.3 AGA---ATC = 4.9 ACCG--ATC = 4.6 Consensus: ACAC--ATC = 6.7 ACA---ATC = 6.3 Exceptional: TGCT--AGG = -0.97 Align sequence to HMM - Null model Note!

HMM’s and weight matrices • In the case of un-gapped alignments HMM’s become simple weight matrices • It still might be useful to use a HMM tool package to estimate a weight matrix • Sequence weighting • Pseudo counts

Profile HMM’s • Alignments based on conventional scoring matrices (BLOSUM62) scores all positions in a sequence in an equal manner • Some position are highly conserved, some are highly flexible (more than what is described in the BLOSUM matrix) • Profile HMM’s are ideal suited to describe such position specific variations

ExampleSequence profiles • Alignment of 1PLC._ to 1GYC.A • Blast e-value > 1000 • Profile alignment • Align 1PLC._ against Swiss-prot • Make position specific weight matrix from alignment • Use this matrix to align 1PLC._ against 1GYC.A • E-value > 10-22. Rmsd=3.3

Example continued Score = 97.1 bits (241), Expect = 9e-22 Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%) Query: 3 VLLGADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56 V+ G F + G++ N+ + +G + + Sbjct: 26 VVNG------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79 Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98 A G +F G + ++ G+ G V Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126 Rmsd=3.3

Profile HMM’s Conserved Insertion EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK CSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK KAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG------------------------------ NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT CSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT KAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG NRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL CSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL KAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR NRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVY CSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVY KAPB_MOUSE EKCGKEFCEF--------------------- NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR Deletion

Profile HMM’s All M/D pairs must be visited once

TMHMM (trans-membrane HMM)(Sonnhammer, von Heijne, and Krogh) Model TM length distribution. Power of HMM. Difficult in alignment.

Combination of HMM’s -Gene finding Stop codon Start codon cccTAAxxxxxxxx x ccc xxxxxxxxATGccc Region around start codon Coding region Region around stop codon Inter-genic region

HMM packages • HMMER(http://hmmer.wustl.edu/) • S.R. Eddy, WashU St. Louis. Freely available. • SAM (http://www.cse.ucsc.edu/research/compbio/sam.html) • R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users. • META-MEME (http://metameme.sdsc.edu/) • William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search. • NET-ID, HMMpro(http://www.netid.com/html/hmmpro.html) • Freely available to academia, nominal license fee for commercial users. • Allows HMM architecture construction.

Simple Hmmer command hmmbuild - build a hidden Markov model from an alignment HMMER 2.2g (August 2001) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Alignment file: A2.fsa File format: a2m Search algorithm configuration: Multiple domain (hmmls) Model construction strategy: Fast/ad hoc (gapmax 0.0) Null model used: (default) Sequence weighting method: G/S/C tree weights - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Alignment: #1 Number of sequences: 232 Number of columns: 9 Determining effective sequence number ... done. [192] Weighting sequences heuristically ... done. Constructing model architecture ... done. Converting counts to probabilities ... done. Setting model name, etc. ... done. [A2.fasta] Constructed a profile HMM (length 9) Average score: -6.42 bits Minimum score: -15.47 bits Maximum score: -0.84 bits Std. deviation: 2.72 bits hmmbuild --gapmax 0.0 --fast A2.hmmer A2.fsa >HLA-A.0201 16 Example_for_Ligand SLLPAIVEL >HLA-A.0201 16 Example_for_Ligand YLLPAIVHI >HLA-A.0201 16 Example_for_Ligand TLWVDPYEV >HLA-A.0201 16 Example_for_Ligand SXPSGGXGV >HLA-A.0201 16 Example_for_Ligand GLVPFLVSV

Sequence motifs, information content, logos, and HMM’s