1-month Practical Course Genome Analysis Lecture 5: Multiple Sequence Alignment

Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands. ibi.vu.nl heringa@cs.vu.nl

Multiple sequence alignment

Multiple sequence alignment • Sequences can be conserved across species and perform similar or identical functions. > They hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved; > this helps identify regions or domains that are critical to functionality. • Sequences can be mutated or rearranged to perform an altered function. > The alignment shows which changes in the sequences have caused a change in functionality. Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

What to ask yourself • How do we get a multiple alignment (three or more sequences)? • What is our aim? – Do we go for maximum accuracy, least computational time, or the best compromise? • What do we want to achieve each time?

Sequence-sequence alignment [Figure: two sequences aligned against each other]

Multiple alignment methods • Multi-dimensional dynamic programming > extension of pairwise sequence alignment. • Progressive alignment > incorporates phylogenetic information to guide the alignment process. • Iterative alignment > corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

Simultaneous multiple alignment: multi-dimensional dynamic programming. The combinatorial explosion • 2 sequences of length n: n^2 comparisons • The number of comparisons increases exponentially with the number of sequences, i.e. n^N, where n is the length of the sequences and N is the number of sequences • Impractical for even a small number of short sequences
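To make the growth concrete, here is a minimal sketch that evaluates the slide's n^N approximation for a few values of N. The function name `dp_cells` and the example length of 250 residues are illustrative assumptions, not part of the lecture.

```python
def dp_cells(n: int, num_seqs: int) -> int:
    """Cells in a full DP hypercube, using the slide's n**N approximation."""
    return n ** num_seqs


# Sequences of 250 residues (an illustrative length): each extra sequence
# multiplies the number of cells to fill by another factor of 250.
for num_seqs in range(2, 7):
    print(f"N={num_seqs}: {dp_cells(250, num_seqs):.3e} cells")
```

Even before counting the work per cell, the table size alone rules out the full hypercube for more than a handful of sequences.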

Multi-dimensional dynamic programming (Murata et al., 1985) [Figure: three-dimensional alignment cube with axes Sequence 1, Sequence 2 and Sequence 3]

The MSA approach • MSA (Lipman et al., 1989, PNAS 86, 4412) • MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube. • Calculate all pair-wise alignment scores. • Use the scores to predict a tree. • Calculate pair weights based on the tree (lower bound). • Produce a heuristic alignment based on the tree. • Calculate the maximum weight for each sequence pair (upper bound). • Determine the spatial positions that must be calculated to obtain the optimal alignment. • Perform the optimal alignment. • Report the weight found compared to the maximum weight previously found (measure of divergence). • Extremely slow and memory intensive. • Max 8-9 sequences of ~250 residues.

The DCA approach • DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73) • Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint. • This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences. • This procedure is re-iterated until the sequences are sufficiently short. • The short sequence families are then aligned optimally by MSA. • Finally, the resulting short alignments are concatenated.
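The divide-and-conquer recursion can be sketched as follows. This is a toy under stated assumptions: sequences are cut at their exact midpoint (real DCA searches for a suitable cut position near the midpoint), and `pad_align` is a hypothetical stand-in for the optimal aligner (MSA in the lecture) that merely right-pads with gaps.

```python
def pad_align(seqs):
    """Hypothetical stand-in for an optimal aligner: right-pad with gaps."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def dca(seqs, max_len=4, align=pad_align):
    """Cut every sequence in two until all are short, align, concatenate."""
    if max(len(s) for s in seqs) <= max_len:
        return align(seqs)  # sufficiently short: hand over to the aligner
    # Simplification: cut at the exact midpoint of each sequence.
    left = dca([s[: len(s) // 2] for s in seqs], max_len, align)
    right = dca([s[len(s) // 2:] for s in seqs], max_len, align)
    # Concatenate the two sub-alignments row by row.
    return [l + r for l, r in zip(left, right)]


rows = dca(["ACDEFGHIK", "ACDFGHK"], max_len=4)
for row in rows:
    print(row)
```

The recursion structure is the point here; the quality of the result depends entirely on where the cuts are placed and on the aligner used for the short pieces.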

So in effect … [Figure: three-dimensional alignment cube with axes Sequence 1, Sequence 2 and Sequence 3]

Multiple alignment methods • Multi-dimensional dynamic programming > extension of pairwise sequence alignment. • Progressive alignment > incorporates phylogenetic information to guide the alignment process. • Iterative alignment > corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

The progressive alignment method • Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related. • Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree. • But before going into details, some facts about multiple alignment profiles …

How to represent a block of sequences? • Historically: consensus sequence – a single sequence that best represents the amino acids observed at each alignment position. When consensus sequences are used, the pair-wise DP algorithm can be used without alterations. • Modern methods: alignment profile – a representation that retains the information about frequencies of amino acids observed at each alignment position.

Multiple alignment profiles (Gribskov et al. 1987) • Gribskov created a probe: group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins). • Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe). • The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).

Multiple alignment profiles [Figure: a profile over alignment positions i, spanning a core region, a gapped region and a second core region. Each position holds a frequency for every amino acid (fA, fC, fD, …, fW, fY) plus position-dependent gap penalties (Gapo, gapx).]

Profile building • Example: each amino acid is represented as a frequency; gap penalties as weights.

Position i    A     C     D     …    W     Y     Gap penalty
1             0.5   0     0     …    0     0.5   1.0
2             0.3   0.1   0     …    0.3   0.3   0.5
3             0     0.5   0.2   …    0.1   0.2   1.0

The gap penalties are position dependent.
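A frequency profile of this kind can be derived from an aligned block of sequences, roughly as sketched below. The function name `build_profile` is illustrative, gap characters are ignored when computing frequencies (an assumption), and the position-dependent gap penalties are left out.

```python
from collections import Counter


def build_profile(block):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(block[0])
    if any(len(row) != length for row in block):
        raise ValueError("all rows of an alignment block must have equal length")
    profile = []
    for column in zip(*block):
        residues = [r for r in column if r != "-"]  # assumption: skip gaps
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: c / total for r, c in counts.items()} if total else {})
    return profile


# A one-column block holding the residues A, A, V, V, L.
print(build_profile(["A", "A", "V", "V", "L"]))
```

Unlike a consensus sequence, the profile keeps the minority residues at each position, which is what the profile scoring schemes rely on.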

Profile-sequence alignment [Figure: a profile (rows ACD…VWY) aligned against a sequence]

Sequence to profile alignment. An alignment column containing A, A, V, V, L gives the profile position 0.4 A, 0.2 L, 0.4 V. Score of amino acid L in a sequence that is aligned against this profile position: Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
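As a worked instance of the sequence-to-profile score, the sketch below plugs in exchange values for the three residue pairs involved. The values are taken from BLOSUM62 and the small `EXCHANGE` table is hand-entered for illustration only; the helper names are assumptions.

```python
# Exchange values for the pairs needed here (BLOSUM62), keyed by unordered pair.
EXCHANGE = {
    frozenset("LA"): -1,
    frozenset("L"): 4,   # L against L
    frozenset("LV"): 1,
}


def s(x, y):
    """Look up the exchange value for the amino acid pair (x, y)."""
    return EXCHANGE[frozenset((x, y))]


def score_residue_vs_column(residue, column):
    """Frequency-weighted score of one residue against one profile position."""
    return sum(freq * s(residue, aa) for aa, freq in column.items())


column = {"A": 0.4, "L": 0.2, "V": 0.4}
print(score_residue_vs_column("L", column))  # 0.4*(-1) + 0.2*4 + 0.4*1
```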

Profile-profile alignment [Figure: a profile (rows A, C, D, …, Y) aligned against a second profile (rows ACD…VWY)]

Profile to profile alignment. Column 1: 0.75 G, 0.25 S. Column 2: 0.4 A, 0.2 L, 0.4 V. Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions: Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S), where s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for amino acid pair (x,y).
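The column-against-column score can be sketched in the same way. The exchange values below are the BLOSUM62 entries for the six pairs in the example, hand-entered for illustration; the function names are assumptions.

```python
# BLOSUM62 values for the six residue pairs in the example above.
EXCHANGE = {
    frozenset("AG"): 0, frozenset("LG"): -4, frozenset("VG"): -3,
    frozenset("AS"): 1, frozenset("LS"): -2, frozenset("VS"): -2,
}


def s(x, y):
    """Exchange matrix value for the amino acid pair (x, y)."""
    return EXCHANGE[frozenset((x, y))]


def score_columns(col1, col2):
    """Sum f1(x) * f2(y) * s(x, y) over all residue pairs of two columns."""
    return sum(f1 * f2 * s(x, y)
               for x, f1 in col1.items()
               for y, f2 in col2.items())


col_a = {"G": 0.75, "S": 0.25}
col_b = {"A": 0.4, "L": 0.2, "V": 0.4}
print(round(score_columns(col_a, col_b), 3))
```

Only residues with a non-zero frequency appear in the dictionaries, so absent amino acids contribute nothing to the sum.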

So, for scoring profiles … • Think of sequence-sequence alignment. • Same principles but more information for each position. Reminder: • The sequence pair alignment score S comes from the sum of the positional scores M(aa_i, aa_j) (i.e. the substitution matrix values at each alignment position, minus gap penalties if applicable). • Profile alignment scores are built in exactly the same way, but the positional scores are more complex.

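Scoring a profile position [Figure: Profile 1 and Profile 2, each with rows A, C, D, …, Y] • At each position (column) we have different residue frequencies for each amino acid (rows). So: • Instead of saying S = M(aa1, aa2) (one residue pair), • for every amino acid with frequency f > 0 (the amino acid is actually there) we take the frequency-weighted sum over all residue pairs: S = Σ_i Σ_j f1(aa_i) * f2(aa_j) * M(aa_i, aa_j)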

Progressive alignment • Perform pair-wise alignments of all of the sequences; • Use the alignment scores to produce a dendrogram (guide tree) using neighbour-joining methods; • Align the sequences sequentially, guided by the relationships indicated by the tree. Methods: • Biopat (first integrated method ever) • MULTAL (Taylor 1987) • DIALIGN (1&2, Morgenstern 1996) • PRRP (Gotoh 1996) • ClustalW (Thompson et al 1994) • PRALINE (Heringa 1999) • T-Coffee (Notredame 2000) • POA (Lee 2002) • MAFFT (Katoh 2002) • MUSCLE (Edgar 2004) • PROBCONS (Do 2005)
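The tree-building step can be illustrated with a small sketch that turns pairwise distances into a join order. Note the swap: for brevity it uses plain average-linkage clustering rather than the neighbour-joining method named above, and the sequence names and distances are invented for illustration.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Average-linkage clustering; returns nested tuples giving the join order."""
    members = {name: [name] for name in names}  # subtree -> leaf names

    def avg_dist(a, b):
        pairs = [(x, y) for x in members[a] for y in members[b]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(members) > 1:
        # Join the two closest subtrees; the new node is the tuple (a, b).
        a, b = min(combinations(members, 2), key=lambda pair: avg_dist(*pair))
        members[(a, b)] = members.pop(a) + members.pop(b)
    return next(iter(members))


# Hypothetical pairwise distances (e.g. derived from pairwise alignment scores).
raw = {("A", "B"): 0.1, ("C", "D"): 0.2, ("A", "C"): 0.6,
       ("A", "D"): 0.7, ("B", "C"): 0.5, ("B", "D"): 0.8}
dist = {frozenset(pair): d for pair, d in raw.items()}
print(guide_tree(["A", "B", "C", "D"], dist))
```

A progressive aligner would then walk this nested structure from the leaves upward, aligning sequences or profiles at each join.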

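Progressive multiple alignment [Figure: flow chart. Pairwise alignment scores (Score 1-2, Score 1-3, …, Score 4-5) for sequences 1 to 5 fill a 5×5 similarity matrix; a clustering (tree-building) method turns the scores into a guide tree; aligning along the guide tree gives the multiple alignment, with possibilities for iteration.]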

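General progressive multiple alignment technique (follow generated tree) [Figure: a rooted guide tree with distance axis d over sequences 1, 3, 2 and 5; the alignment is built up step by step in the order given by the tree]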

PRALINE progressive strategy [Figure: distance axis d; the alignment grows step by step over sequences 1, 3, 2, 5 and 4]

There are problems … Accuracy is very important! • Alignment errors made during the construction of the MSA cannot be repaired later: errors made at any alignment step are propagated through subsequent progressive steps. • The comparisons of sequences at early steps during progressive alignment cannot make use of information from other sequences. • It is only later during the alignment progression that more information from other sequences (e.g. through profile representation) becomes available for further alignment steps. “Once a gap, always a gap” (Feng & Doolittle, 1987)

Clustal, ClustalW, ClustalX • CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct a guide tree. • Sequence blocks are represented by a profile, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree. • Further carefully crafted heuristics include: (i) local gap penalties; (ii) automatic selection of the amino acid substitution matrix; (iii) automatic gap penalty adjustment; (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered. • CLUSTAL (W/X) does not allow iteration (for iterative strategies see Hogeweg and Hesper, 1984; Corpet, 1988; Gotoh, 1996; Heringa, 1999, 2002)

Sequence weighting dilemma [Figure: pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)]

Additional strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Homology-extended alignment • Matrix extension Objective: try to avoid (early) errors

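Profile pre-processing [Figure: pairwise scores (Score 1-2, Score 1-3, …, Score 4-5) for sequences 1 to 5. For key sequence 1, a master-slave (N-to-1) pre-alignment of the other sequences against the key sequence is converted into the pre-profile of sequence 1 (rows A, C, D, …, Y). Labels: Pi, Px.]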

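Pre-profile generation [Figure: pairwise scores (Score 1-2, Score 1-3, …, Score 4-5) are filtered with a cut-off. Each key sequence (1, 2, …, 5) gets its own pre-alignment of the remaining sequences, and each pre-alignment is converted into a pre-profile (rows A, C, D, …, Y).]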

Pre-profile alignment [Figure: the pre-profiles of sequences 1 to 5 (each with rows A, C, D, …, Y) are aligned to give the final alignment of sequences 1, 2, 3, 4, 5]

Pre-profile alignment [Figure: the pre-alignments of key sequences 1 to 5 shown next to the final alignment of sequences 1 to 5]

Pre-profile alignment: alignment consistency [Figure: residue Ala131 of sequence 1 traced through the pre-alignments of key sequences 1 to 5; labels shown in the pre-alignments: A131, A131, L133, C126, A131]

PRALINE pre-profile generation • Idea: use the information from all query sequences to make a pre-profile for each query sequence that contains information from other sequences. • You can use all sequences in each pre-profile, or use only those sequences that will probably align ‘correctly’. Incorrectly aligned sequences in the pre-profiles will increase the noise level. • Select using alignment score: only allow sequences in a pre-profile if their alignment with the key sequence scores higher than a given threshold value. In PRALINE, this threshold is given as prepro=1500 (alignment score threshold value is 1500 – see next two slides)
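The selection rule can be sketched as a simple filter on pairwise alignment scores. This is not PRALINE's code: the function name, the sequence names and the scores are hypothetical, and only the threshold idea (prepro) is taken from the slide.

```python
def preprofile_members(key, names, score, prepro=1500):
    """Sequences whose pairwise alignment score with `key` exceeds `prepro`."""
    return [n for n in names
            if n != key and score[frozenset((key, n))] > prepro]


# Hypothetical pairwise alignment scores against key sequence "seq1".
names = ["seq1", "seq2", "seq3", "seq4"]
score = {
    frozenset(("seq1", "seq2")): 2100,
    frozenset(("seq1", "seq3")): 1480,
    frozenset(("seq1", "seq4")): 1750,
}
print(preprofile_members("seq1", names, score, prepro=1500))
print(preprofile_members("seq1", names, score, prepro=0))
```

With prepro=0 every positively scoring sequence enters the pre-profile; raising the threshold trades information for less noise.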

Flavodoxin-cheY consistency scores (PRALINE prepro=0)

1fx1          --7899999999999TEYTAETIARQL8776-6657777777777777553799VL999ST97775599989-435566677798998878AQGRKVACF
FLAV_DESVH    -46788999999999TEYTAETIAREL7777-7757777777777777553799VL999ST97775599989-435566677798998878AQGRKVACF
FLAV_DESDE    -47899999999999999999999988776695658888777777778763YDAVL999SAW9877789877753556666669777776789GRKVAAF
FLAV_DESGI    -46788999999999TEGVAEAIAKTL9997-76678888777777887539DVVL999ST987776--9889546667776697776557777888888
FLAV_DESSA    93677799999999999999999999988759765777888888888876399999999STW77765--9999536666677797998779999999999
4fxn          -878779999999999999999999776666967567788888888888777999999988777776--9889577788888897773237888888888
FLAV_MEGEL    9776779999999999999999997777766-665666677788899976799999999987777669--887362334466695555455778888888
2fcr          --87899999999999TEVADFIGK996541900300000112233355679DLLF99999855312888111224555555407777777888888888
FLAV_ANASP    -47899LFYGTQTGKTESVAEIIR9777653922356677777777897779999999999988843--9998555778777899998879999999999
FLAV_ECOLI    997789999GSDTGNTENIAKMIQ8774222922456678889999995569999999999755553----99262225555495777767778999999
FLAV_AZOVI    --79IGLFFGSNTGKTRKVAKSIK99887759657577888888999777899999999999877761112222222244555-5555555778999999
FLAV_ENTAG    94789999999999999999999998755229223234555555555555688899999998875521111111133477777-7777777999999999
FLAV_CLOAB    -86999ILYSSKTGKTERVAK9997555555057678887888887777765778899998522223--9888342234455597777777777777777
3chy          0122222223333335666665555555222922222222222221112163335555755553222888877674533344493332222222222222
Avrg Consist  8667778888888889999999998776554844455566666666665557888888888766544887666334445566586666556778888888
Conservation  0125538675848969746963946463343045244355446543473516658868567554455000000314365446505575435547747759

1fx1          G888799955555559888888888899777----7777797787787978---555555566776555677777778888799------
FLAV_DESVH    G888799955555559888888888899777----7777797787787978---555555566776555677777778888799------
FLAV_DESDE    A88878685555555999988888889998879--8777788-98777777--8555555554433245667777777777599------
FLAV_DESGI    87775977755555677777777777777778---88888887667778777775555555555542424667888887777--------
FLAV_DESSA    977768777555556777777777777777767887777777778888-978985555555556536556888888888877--------
4fxn          867777555555552666666666555555577887767999877777977777665555555555444466666666555798------
FLAV_MEGEL    8577775666666525556777778888888689977888988776558677885544333222222212233223355557--------
2fcr          877773573333333777766667777765533333333333333322833333333332244444567777777888777633------
FLAV_ANASP    977773775333344777888888777777733334444444444433833333344444444444455577777788777734------
FLAV_ECOLI    977743786444444777788888888888833334444444444444244444555554555775667788888888877734110000
FLAV_AZOVI    97776355333333466666667777777773333444444444444482333355555555555545558888888877772311----
FLAV_ENTAG    977773886555555866666666677666633333333333333322123333344444444455555665566666555582------
FLAV_CLOAB    766627222222212444444444455555587882222222222222111111122222222222344443333333233399------
3chy          222227222222224111355431113324578-87778997666556877776322222222222322222323344444422------
Avrg Consist  866656564444444666666666666666656665555565555555655565444443444443344455666666666666889999
Conservation  73663057433334163464534444*746710000011010011000000010434744645443225474454448434301000000

Iteration 0 SP= 135136.00 AvSP= 10.473 SId= 3838 AvSId= 0.297
Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red)

Flavodoxin-cheY consistency scores (PRALINE prepro=1500)

1fx1          -42444IVYGSTTGNTEYTAETIARQL886666666577777775667888DLVLLGCSTW77766----995476666769-77888788AQGRKVACF
FLAV_DESVH    -34444IVYGSTTGNTEYTAETIAREL776666666577777775667888DLVLLGCSTW77766----995476666769-77888788AQGRKVACF
FLAV_DESSA    -33444IVYGSTTGNTET99999888777655777668888899666686YDIVLFGCSTW77777----996466666779-88SL98ADLKGKKVSVF
FLAV_DESGI    -34444IVYGSTTGNTEGVA9999999999765555677777886666678DVVLLGCSTW77777----995466666779-88887688888KKVGVF
FLAV_DESDE    -44777IVFGSSTGNTE988777666655566777778899999777777YDAVLFGCSAW88877----997587777779-8887766777GRKVAAF
4fxn          -32222IVYWSGTGNTE8888888876666778888888888NI8888586DILILGCSA888888------8-8888886--66665378ISGKKVALF
FLAV_MEGEL    -12222IVYWSGTGNTEAMA8888888888888888555555555555485DVILLGCPAMGSE77------572222288--8888755588GKKVGLF
2fcr          -41456IFFSTSTGNTTEVA999998865432222765554443244779YDLLFLGAPT944411999-111112454441-8DKLPEVDMKDLPVAIF
FLAV_ANASP    -00456LFYGTQTGKTESVAEII987755323322427776666623589YQYLIIGCPTW55532--999843678W988899998888888GKLVAYF
FLAV_AZOVI    -42445LFFGSNTGKTRKVAKSIK87777434333536666665467777YQFLILGTPTLGEG862222222222355558-45666666888KTVALF
FLAV_ENTAG    -266IGIFFGSDTGQTRKVAKLIHQKL6664664424DVRRATR88888SYPVLLLGTPT88888644444444446WQEF8-8NTLSEADLTGKTVALF
FLAV_ECOLI    -51114IFFGSDTGNTENIAKMI987743311111555555588355599YDILLLGIPT954431----88355225544--44666666779KLVALF
FLAV_CLOAB    -63666ILYSSKTGKTERVAKLIE63333333333333333333366LQESEGIIFGTPTY63--6--------66SWE33333333333333GKLGAAF
3chy          ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQ-AGGYGFVI---SDWNMPNM----------DGLEL--LKTIRADGAMSALPVLM
Avrg Consist  9334459999999999999999988776655555555666667756667889999999999767658888775555566668967777677889999999
Conservation  0236428675848969746963946463344354312564565414344366588685675544550000003144654460055575345547747759

1fx1          G98879-89-999877977--7788899999999955--88888-9988887798999777778766553344588776666222266899899
FLAV_DESVH    G98879-89-999877977--7788899999999955--88888-9988887798999777778766553344588776666222266899899
FLAV_DESSA    G98878-688688888-88--88999999999999979988888887788889-89-9787777666756645577776666654466899899
FLAV_DESGI    G98879-898688888987--788888999GATLV7698899-9998789888-8899787878776663122477788888333276899899
FLAV_DESDE    AS8888-68-888888899--9999999999988888-99988888988778897888776668854222212255555555333277999999
4fxn          GS2228-228222222222--2388888888888888888888888888888888888887778866765535577555533221288888888
FLAV_MEGEL    G4888--28-8888882MD--AWKQRTEDTGATVI77---------------------77222--224444222222244222112--------
2fcr          GLGDA5-8Y5DNFC88-88--8877777777777765444555555555544385555777774465333357799999987555333899899
FLAV_ANASP    GTGDQ5-GY5899999-99--99EEKISQRGG99975555544444444433284444466665555555556666676666433333899899
FLAV_AZOVI    GLGDQ5-885777555-55--55555788888888555555555555555554855555555555666555555888855555544442--288
FLAV_ENTAG    GLGDQL-NYSKNFVSA-MR--ILYDLVIARGACVVG8888EGYKFSFSAA6664NEFVGLPLDQEN88888EERIDSWLE88842242688688
FLAV_ECOLI    GC99549784688888987997777777778888855444444444444444114444777774455775567788888887433322100100
FLAV_CLOAB    STANS6366663333333333336666666666666666663333363366336663333336EDENARIFGERIANKVKQI333333666666
3chy          VTAEA---KKENIIAA-----------AQAGAS-------------------------GYVVK-----PFTAATLEEKLNKIFEKLGM------
Avrg Consist  9988779787777777777997788888888888866777777777767766677777676667766655455577776666433355788788
Conservation  746640037154545706300354534444*745753000001010010000000010683760144442335574454448434301000000

Iteration 0 SP= 136702.00 AvSP= 10.654 SId= 3955 AvSId= 0.308
Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red)

Iteration • Alignment iteration: • do an alignment • learn from it • do it better next time • Bootstrapping

Consistency-based iteration [Figure: loop in which the pre-profiles give a multiple alignment, whose positional consistency scores feed back into the pre-profiles]

Pre-profile update iteration [Figure: loop in which the pre-profiles give a multiple alignment, which is used to update the pre-profiles]

Iteration [Figure: three possible outcomes of iteration: convergence, limit cycle, divergence]
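A generic iteration driver that tells these outcomes apart might look like the sketch below. It assumes that an alignment can be represented as a hashable value and that `step` is some realignment function; both are placeholders, and divergence is only approximated by running out of iterations.

```python
def iterate(alignment, step, max_iter=50):
    """Repeat `step` until the alignment stops changing or starts cycling."""
    seen = {alignment}
    for i in range(1, max_iter + 1):
        nxt = step(alignment)
        if nxt == alignment:
            return "convergence", i      # fixed point: nothing changes any more
        if nxt in seen:
            return "limit cycle", i      # an earlier alignment has come back
        seen.add(nxt)
        alignment = nxt
    return "not converged", max_iter     # possibly divergence


# Toy stand-ins for realignment steps, using integers as "alignments".
print(iterate(0, lambda x: min(x + 1, 3)))   # settles at 3
print(iterate(0, lambda x: (x + 1) % 3))     # cycles 0 -> 1 -> 2 -> 0
```

Keeping the set of alignments already seen is what allows a limit cycle to be detected instead of looping until the iteration budget is spent.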

CLUSTAL X (1.64b) multiple sequence alignment Flavodoxin-cheY

1fx1          -PKALIVYGSTTGNTEYTAETIARQLANAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK
FLAV_DESVH    MPKALIVYGSTTGNTEYTAETIARELADAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK
FLAV_DESGI    MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-M-ETTVVNVADVTAPGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPLYE-DLDRAGLKDKK
FLAV_DESSA    MSKSLIVYGSTTGNTETAAEYVAEAFENKE-I-DVELKNVTDVSVADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPLYD-SLENADLKGKK
FLAV_DESDE    MSKVLIVFGSSTGNTESIAQKLEELIAAGG-H-EVTLLNAADASAENLADGYDAVLFGCSAWGMEDLE------MQDDFLSLFE-EFNRFGLAGRK
FLAV_CLOAB    -MKISILYSSKTGKTERVAKLIEEGVKRSGNI-EVKTMNLDAVDKKFLQE-SEGIIFGTPTYYAN---------ISWEMKKWID-ESSEFNLEGKL
FLAV_MEGEL    --MVEIVYWSGTGNTEAMANEIEAAVKAAG-A-DVESVRFEDTNVDDVAS-KDVILLGCPAMGSE--E------LEDSVVEPFF-TDLAPKLKGKK
4fxn          ---MKIVYWSGTGNTEKMAELIAKGIIESG-K-DVNTINVSDVNIDELLN-EDILILGCSAMGDE--V------LEESEFEPFI-EEISTKISGKK
FLAV_ANASP    SKKIGLFYGTQTGKTESVAEIIRDEFGNDVVT----LHDVSQAEVTDLND-YQYLIIGCPTWNIGELQ---SD-----WEGLYS-ELDDVDFNGKL
FLAV_AZOVI    -AKIGLFFGSNTGKTRKVAKSIKKRFDDETMSD---ALNVNRVSAEDFAQ-YQFLILGTPTLGEGELPGLSSDCENESWEEFLP-KIEGLDFSGKT
2fcr          --KIGIFFSTSTGNTTEVADFIGKTLGAKADAP---IDVDDVTDPQALKD-YDLLFLGAPTWNTGADTERSGT----SWDEFLYDKLPEVDMKDLP
FLAV_ENTAG    MATIGIFFGSDTGQTRKVAKLIHQKLDGIADAP---LDVRRATREQFLS--YPVLLLGTPTLGDGELPGVEAGSQYDSWQEFTN-TLSEADLTGKT
FLAV_ECOLI    -AITGIFFGSDTGNTENIAKMIQKQLGKDVAD----VHDIAKSSKEDLEA-YDILLLGIPTWYYGEAQ-CD-------WDDFFP-TLEEIDFNGKL
3chy          --ADKELKFLVVDDFSTMRRIVRNLLKELG----FNNVEEAEDGVDALN------KLQAGGYGFV--I------SDWNMPNMDG-LELLKTIR---

1fx1          VACFGCGDSSYEYF--CGAVDAIEEKLKNLGAEIVQDG----------------LRIDGDPRAARDDIVGWAHDVRGAI---------------
FLAV_DESVH    VACFGCGDSSYEYF--CGAVDAIEEKLKNLGAEIVQDG----------------LRIDGDPRAARDDIVGWAHDVRGAI---------------
FLAV_DESGI    VGVFGCGDSSYTYF--CGAVDVIEKKAEELGATLVASS----------------LKIDGEPDSAE--VLDWAREVLARV---------------
FLAV_DESSA    VSVFGCGDSDYTYF--CGAVDAIEEKLEKMGAVVIGDS----------------LKIDGDPERDE--IVSWGSGIADKI---------------
FLAV_DESDE    VAAFASGDQEYEHF--CGAVPAIEERAKELGATIIAEG----------------LKMEGDASNDPEAVASFAEDVLKQL---------------
FLAV_CLOAB    GAAFSTANSIAGGS--DIALLTILNHLMVKGMLVYSGGVA----FGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------
FLAV_MEGEL    VGLFGSYGWGSGE-----WMDAWKQRTEDTGATVIGTA----------------IVN-EMPDNAPECKE-LGEAAAKA----------------
4fxn          VALFGSYGWGDGK-----WMRDFEERMNGYGCVVVETP----------------LIVQNEPDEAEQDCIEFGKKIANI----------------
FLAV_ANASP    VAYFGTGDQIGYADNFQDAIGILEEKISQRGGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------
FLAV_AZOVI    VALFGLGDQVGYPENYLDALGELYSFFKDRGAKIVGSWSTDGYEFESSEAVV-DGKFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----
2fcr          VAIFGLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVR-DGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_ENTAG    VALFGLGDQLNYSKNFVSAMRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------
FLAV_ECOLI    VALFGCGDQEDYAEYFCDALGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA
3chy          AD--GAMSALPVL-----MVTAEAKKENIIAAAQAGAS----------------GYV-VKPFTAATLEEKLNKIFEKLGM--------------

Flavodoxin-cheY: Pre-processing (prepro1500) 1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf 2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf 4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM T 1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI-------- FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL-------- FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI-------- FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI-------- 
FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV-------- 2fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------ FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------ FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------ FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA 4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI--------- FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA--------- FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF----------- 3chy VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------ G Iteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313

Flavodoxin-cheY: Local Pre-processing(locprepro300) • 1fx1 --PKALIVYGSTTGNTEYTAETIARQLANAGYEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACF • FLAV_DESVH -MPKALIVYGSTTGNTEYTaETIARELADAGYEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACf • FLAV_DESSA -MSKSLIVYGSTTGNTETAaEYVAEAFENKEIDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPL--YDSLENADLKGKKVSVf • FLAV_DESGI -MPKALIVYGSTTGNTEGVaEAIAKTLNSEGMETTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPL--YEDLDRAGLKDKKVGVf • FLAV_DESDE -MSKVLIVFGSSTGNTESIaQKLEELIAAGGHEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSL--FEEFNRFGLAGRKVAAf • 4fxn --MK--IVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLN-EDILILGCSAMGDEVL------E-ESEFEPF--IEEIS-TKISGKKVALF • FLAV_MEGEL -MVE--IVYWSGTGNTEAMaNEIEAAVKAAGADVESVRFEDTNVDDVAS-KDVILLgCPAMGSEEL------E-DSVVEPF--FTDLA-PKLKGKKVGLf • 2fcr ---KIGIFFSTSTGNTTEVADFIGKTLGAKADAPI--DVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFL-YDKLPEVDMKDLPVAIF • FLAV_ANASP -SKKIGLFYGTQTGKTESVaEIIRDEFGNDVVTLH--DVSQAEV-TDLNDYQYLIIgCPTWNIGEL--------QSDWEGL--YSELDDVDFNGKLVAYf • FLAV_AZOVI --AKIGLFFGSNTGKTRKVaKSIKKRFDDETMSDA-LNVNRVSA-EDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEF--LPKIEGLDFSGKTVALf • FLAV_ENTAG -MATIGIFFGSDTGQTRKVaKLIHQKLDG--IADAPLDVRRATR-EQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEF--TNTLSEADLTGKTVALf • FLAV_ECOLI --AITGIFFGSDTGNTENIaKMIQKQLGKDVADVH--DIAKSSK-EDLEAYDILLLgIPTWYYGEA--------QCDWDDF--FPTLEEIDFNGKLVALf • FLAV_CLOAB --MKISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNLDAVDKKFLQESEGIIFgTPTYYA-----------NISWEMKKWIDESSEFNLEGKLGAAf • 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQ-AGGYGFVI---SDWNMPNM----------DGLEL--LKTIRADGAMSALPVLM • 1fx1 GCGDS--SY-EYFCGA-VD--AIEEKLKNLGAEIVQD---------------------GLRID--GDPRAARDDIVGWAHDVRGAI-------- • FLAV_DESVH GCGDS--SY-EYFCGA-VD--AIEEKLKNLgAEIVQD---------------------GLRID--GDPRAARDDIVGwAHDVRGAI-------- • FLAV_DESSA GCGDS--DY-TYFCGA-VD--AIEEKLEKMgAVVIGD---------------------SLKID--GDPE--RDEIVSwGSGIADKI-------- • FLAV_DESGI 
GCGDS--SY-TYFCGA-VD--VIEKKAEELgATLVAS---------------------SLKID--GEPD--SAEVLDwAREVLARV-------- • FLAV_DESDE ASGDQ--EY-EHFCGA-VP--AIEERAKELgATIIAE---------------------GLKME--GDASNDPEAVASfAEDVLKQL-------- • 4fxn GS------Y-GWGDGKWMR--DFEERMNGYGCVVVET---------------------PLIVQ--NEPDEAEQDCIEFGKKIANI--------- • FLAV_MEGEL GS------Y-GWGSGEWMD--AWKQRTEDTgATVIGT---------------------AI-VN--EMPDNA-PECKElGEAAAKA--------- • 2fcr GLGDAE-GYPDNFCDA-IE--EIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------ • FLAV_ANASP GTGDQI-GYADNFQDA-IG--ILEEKISQRgGKTVGYWSTDGYDFNDSKALRN-GKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------ • FLAV_AZOVI GLGDQV-GYPENYLDA-LG--ELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- • FLAV_ENTAG GLGDQL-NYSKNFVSA-MR--ILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------ • FLAV_ECOLI GCGDQE-DYAEYFCDA-LG--TIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA • FLAV_CLOAB STANSIAGGSDIALLTILNHLMVKgMLVYSGGVAFGKPKTHLGYVH----------INEIQENEDENARIfGERiANkVKQIF----------- • 3chy VTAEA---KKENIIAA-----------AQAGAS-------------------------GYVVK-----PFTAATLEEKLNKIFEKLGM------ • G

Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Homology-extended alignment • Matrix extension Objective: integrate secondary structure information to anchor alignments and avoid errors

Protein structure hierarchical levels • PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH • SECONDARY STRUCTURE (helices, strands) • TERTIARY STRUCTURE (fold) • QUATERNARY STRUCTURE (oligomers)

Why use (predicted) structural information • “Structure is more conserved than sequence” • Many structural protein families (e.g. globins) have family members with very low sequence similarities. For example, globin sequence identities can be as low as 10% while the proteins still have an identical fold. • This means that you can still observe equivalent secondary structures in homologous proteins even if sequence similarities are extremely low. • But you are dependent on the quality of prediction methods. For example, secondary structure prediction is currently about 76% correct, so roughly 1 out of 4 amino acids is still predicted incorrectly.

Two superposed protein structures with two well-superposed helices. Red: well superposed. Blue: low match quality. The human (PDB code 1kjs) and pig (1c5a) C5 anaphylatoxin proteins are superposed.
