Sequence Alignment and Phylogenetic Analysis
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2…xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
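The definition can be checked mechanically. The sketch below uses a hypothetical helper, `is_valid_alignment`, to test whether two gapped rows form an alignment of x and y: the rows must have equal length, and removing the gaps must give back the original strings. It also rejects columns that pair a gap with a gap, which is a common convention rather than part of the definition as worded on the slide.

```python
def is_valid_alignment(x, y, row_x, row_y, gap="-"):
    """Return True if (row_x, row_y) is an alignment of x and y."""
    # Both rows must span the same number of columns.
    if len(row_x) != len(row_y):
        return False
    # Stripping the gaps must recover the original sequences.
    if row_x.replace(gap, "") != x or row_y.replace(gap, "") != y:
        return False
    # Assumed convention: no column may contain two gaps.
    return all(not (a == gap and b == gap) for a, b in zip(row_x, row_y))


# The two sequences and the alignment shown on the slide.
x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"

print(is_valid_alignment(x, y, row_x, row_y))  # prints True
```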
ClustalW • Popular multiple alignment tool today • ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently). • Three-step process 1.) Construct pairwise alignments 2.) Build Guide Tree 3.) Progressive Alignment guided by the tree
Step 3: Progressive Alignment • Start by aligning the two most similar sequences • Following the guide tree, add in the next sequences, aligning to the existing alignment • Insert gaps as necessary
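The first two steps can be illustrated with a deliberately simplified sketch. It is not ClustalW's own procedure: ClustalW derives its distances from pairwise alignments and builds the guide tree by neighbor joining, whereas this example substitutes plain edit distance and average-linkage clustering to keep the code short. The function names `edit_distance` and `guide_tree` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in for a pairwise alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """seqs: dict name -> sequence. Returns the tree as nested tuples."""
    names = list(seqs)
    # Step 1 (simplified): all pairwise distances.
    d = {(a, b): edit_distance(seqs[a], seqs[b]) for a in names for b in names}
    # Step 2 (simplified): repeatedly join the two closest clusters.
    clusters = [(n, [n]) for n in names]  # (subtree, member names)
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            mi, mj = clusters[i][1], clusters[j][1]
            avg = sum(d[a, b] for a in mi for b in mj) / (len(mi) * len(mj))
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


example = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
print(guide_tree(example))
```

Reading the nested tuples from the innermost pair outward gives the order in which a progressive aligner would visit the sequences: the most similar pair first, then each remaining sequence or group.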
Gathering Sequences with BLAST • The most convenient way to select your sequences is to use a BLAST server • Some BLAST servers are integrated with multiple-alignment methods: • www.expasy.ch (protein only) • srs.ebi.ac.uk (DNA/protein) • npsa-pbil.ibcp.fr
Selecting a Method • Many alternative methods exist for MSAs • Most of them use the progressive algorithm • They are all approximate methods • None is guaranteed to deliver the best alignment • All existing methods have pros and cons • ClustalW is the most popular (21,000 citations) • T-Coffee and ProbCons are more accurate but slower • MUSCLE is very fast, ideal for very large datasets
ClustalW • www.ebi.ac.uk/clustalw • pir.georgetown.edu/pirwww/search/multialn.shtml • www.ddbj.nig.ac.jp/search/clustalw-e.html
T-Coffee • TCOFFEE: www.tcoffee.org • CORE: evaluate MSA • MCOFFEE: run many and combine • EXPRESSO: with structural information
Running Many Methods at Once • MCOFFEE is a meta-method • It runs all the individual MSA methods • It gathers all the produced MSAs • It combines the MSAs into a single MSA • MCOFFEE is more accurate than any individual method • Its color output lets you estimate the reliability of your MSA • MCOFFEE is available on www.tcoffee.org
Alignments and Formats • Many alternative formats exist for MSAs • One format does not always have a clear advantage over another • Not all formats contain the same information • Changing formats is possible • The annotation may change, and annotation information can sometimes be lost in a format change
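Interleaved and Non-interleaved • The MSF format is interleaved: the alignment is written in blocks, each holding one segment of every sequence • The FASTA format is non-interleaved: each sequence is written in full before the next one begins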
Choosing Your Format • When choosing a format, ask yourself four questions: • Is it supported by the programs I need to use? • Can my collaborators use it? • Can it support all of my annotation? • Is it easy to read and manipulate?
Converting Formats • Don’t re-compute your MSA if it is not in the right format • Convert your file using one of the online conversion tools • The 3 most popular reformatting utilities: • Fmtseq: the most complete • READSEQ: very popular and robust • SeqCheck: can clean FASTA sequences
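To show what a conversion involves, here is a minimal hand-written sketch that turns an interleaved CLUSTAL-style alignment into non-interleaved FASTA. It is not one of the utilities named above, and `clustal_to_fasta` is a hypothetical name. It assumes the first line is the "CLUSTAL ..." header, that sequence names contain no spaces, and that conservation lines start with whitespace.

```python
def clustal_to_fasta(text, width=60):
    """Convert interleaved CLUSTAL-style text to non-interleaved FASTA."""
    seqs = {}  # name -> concatenated aligned sequence, in input order
    for line in text.splitlines()[1:]:       # skip the header line
        if not line.strip() or line[0].isspace():
            continue                          # block separators and conservation lines
        parts = line.split()
        name, chunk = parts[0], parts[1]      # an optional residue count in parts[2] is ignored
        seqs[name] = seqs.get(name, "") + chunk
    out = []
    for name, seq in seqs.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


clustal_text = (
    "CLUSTAL 2.1 multiple sequence alignment\n"
    "\n"
    "seqA      AC-GT 4\n"
    "seqB      ACGGT 5\n"
    "          ** **\n"
    "\n"
    "seqA      TT 6\n"
    "seqB      T- 6\n"
    "          * \n"
)
print(clustal_to_fasta(clustal_text))
```

The conservation line is dropped because FASTA has no place for it, which is one example of annotation that does not survive a format change.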
An Alignment
CLUSTAL 2.1 multiple sequence alignment

sp|P02620|PRVB_MERME  ---------------------------------------------AFAGI 5
sp|P02622|PRVB_GADCA  ---------------------------------------------AFKGI 5
sp|P02619|PRVB_ESOLU  ---------------------------------------------SFAGL 5
sp|Q91482|PRVB1_SALSA --------------------------------------------MACAHL 6
sp|P43305|PRVU_CHICK  --------------------------------------------MSLTDI 6
sp|P20472|PRVA_HUMAN  --------------------------------------------MSMTDL 6
sp|P80079|PRVA_FELCA  --------------------------------------------MSMTDL 6
sp|P02627|PRVA_RANES  ---------------------------------------------PMTDL 5
sp|P02626|PRVA_AMPME  ---------------------------------------------SMTDV 5
sp|P02586|TNNC2_RABIT MTDQQAEARSYLSEEMIAEFKAAFDMFDADGGGDISVKELGTVMRMLGQT 50

sp|P02620|PRVB_MERME  LADADITAALAACKAEGS--FKHGEFFTKIG------LKGKSAADIKKVF 47
sp|P02622|PRVB_GADCA  LSNADIKAAEAACFKEGS--FDEDGFYAKVG------LDAFSADELKKLF 47
sp|P02619|PRVB_ESOLU  -KDADVAAALAACSAADS--FKHKEFFAKVG------LASKSLDDVKKAF 46
sp|Q91482|PRVB1_SALSA CKEADIKTALEACKAADT--FSFKTFFHTIG------FASKSADDVKKAF 48
sp|P43305|PRVU_CHICK  LSPSDIAAALRDCQAPDS--FSPKKFFQISG------MSKKSSSQLKEIF 48
sp|P20472|PRVA_HUMAN  LNAEDIKKAVGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVF 48
sp|P80079|PRVA_FELCA  LGAEDIKKAVEAFTAVDS--FDYKKFFQMVG------LKKKSPDDIKKVF 48
sp|P02627|PRVA_RANES  LAAGDISKAVSAFAAPES--FNHKKFFELCG------LKSKSKEIMQKVF 47
sp|P02626|PRVA_AMPME  IPEADINKAIHAFKAGEA--FDFKKFVHLLG------LNKRSPADVTKAF 47
sp|P02586|TNNC2_RABIT PTKEELDAIIEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECF 100
                          ::           :  :.   *               *   : : *

sp|P02620|PRVB_MERME  GIIDQDKSDFVEEDELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKI 97
sp|P02622|PRVB_GADCA  KIADEDKEGFIEEDELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKI 97
sp|P02619|PRVB_ESOLU  YVIDQDKSGFIEEDELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMI 96
sp|Q91482|PRVB1_SALSA KVIDQDASGFIEVEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMI 98
sp|P43305|PRVU_CHICK  RILDNDQSGFIEEDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKI 98
sp|P20472|PRVA_HUMAN  HMLDKDKSGFIEEDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKI 98
sp|P80079|PRVA_FELCA  HILDKDKSGFIEEDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKI 98
sp|P02627|PRVA_RANES  HVLDQDQSGFIEKEELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKI 97
sp|P02626|PRVA_AMPME  HILDKDRSGYIEEEELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKI 97
sp|P02586|TNNC2_RABIT RIFDRNADGYIDAEELAEIFR---ASGEHVTDEEIESLMKDGDKNNDGRI 147
                       : *.: ..::: :**  ::       . ::  *   ::  .* :.** *

sp|P02620|PRVB_MERME  GVEEFAAMV-----KG 108
sp|P02622|PRVB_GADCA  GVDEFGALVDKWGAKG 113
sp|P02619|PRVB_ESOLU  GVDEFAAMI-----KA 107
sp|Q91482|PRVB1_SALSA GIDEFAVLV-----KQ 109
sp|P43305|PRVU_CHICK  GAEEFQEMV-----QS 109
sp|P20472|PRVA_HUMAN  GVDEFSTLVA----ES 110
sp|P80079|PRVA_FELCA  DVDEFFSLVA----KS 110
sp|P02627|PRVA_RANES  GVDEFVTLVS----ES 109
sp|P02626|PRVA_AMPME  GVDEFTSLVA----ES 109
sp|P02586|TNNC2_RABIT DFDEFLKMMEG---VQ 160
                      . :**  ::
READSEQ • http://www.ebi.ac.uk/cgi-bin/readseq.cgi
Converting Formats Can Be Dangerous • Format conversion can result in data loss • After converting your file, you must make sure your data is still intact • The following slide shows the most common losses that occur during conversion
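One way to make sure the data is still intact is to compare the sequence names and aligned rows before and after conversion. The sketch below assumes both versions are available as FASTA text. The helper names `read_fasta` and `conversion_report` are hypothetical, and the truncated name in the example is invented purely for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict: name (first word of the header) -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def conversion_report(before, after):
    """List the differences between two dicts returned by read_fasta."""
    problems = []
    for name in before:
        if name not in after:
            problems.append(f"missing sequence: {name}")
        elif before[name] != after[name]:
            problems.append(f"changed sequence: {name}")
    for name in after:
        if name not in before:
            problems.append(f"unexpected sequence: {name}")
    return problems


original = ">seq1 first protein\nAC-G\nTT\n>seq2\nACGG\nT-\n"
converted = ">seq1\nAC-GTT\n>seq\nACGGT-\n"   # hypothetical output with one truncated name
print(conversion_report(read_fasta(original), read_fasta(converted)))
```

An empty list means the names and aligned residues survived. Annotation beyond the name, such as the description after the first word of the header, is not compared here and would need its own check.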
Editing your MSA • If your MSA looks bad . . . • Don’t torture the online server • Edit the MSA yourself locally • Never, ever, ever (ever) use a standard word processor • Always use a dedicated MSA editor • The most popular online tool is Jalview • You can get it at www.jalview.org
With Jalview You Can . . . • Modify your MSA • Remove some of the redundant sequences • Insert/remove gaps • Shift portions of the MSA • Modify the alignment of a sub-group of sequences • Recompute some portions of your alignment
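Removing redundant sequences can be pictured as a filter on pairwise percent identity. The sketch below is one simple greedy way to do it, not Jalview's implementation: the function names, the 90% threshold, and the choice to ignore columns containing a gap are all assumptions made for illustration.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity over the columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def remove_redundant(msa, threshold=90.0):
    """msa: dict name -> aligned row.

    Keep a row only if its identity to every row kept so far is below the threshold.
    """
    kept = {}
    for name, row in msa.items():
        if all(percent_identity(row, other) < threshold for other in kept.values()):
            kept[name] = row
    return kept


msa = {"a": "ACGT-A", "b": "ACGTTA", "c": "TTGTCA"}
print(list(remove_redundant(msa)))  # prints ['a', 'c']
```

Because the filter is greedy, the result depends on the input order: the first member of each group of near-identical sequences is the one that is kept.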
Some Special Features of Jalview • Computation of a consensus sequence • Computation of a phylogenetic tree • Removal of the redundancy • Applying any color scheme to your MSA
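A consensus sequence can be computed in more than one way. The sketch below shows the simplest variant, a majority vote per column with gaps ignored. It is an illustration only and may differ from what Jalview reports; `consensus` is a hypothetical name.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Most frequent non-gap residue in each column of an MSA."""
    out = []
    for column in zip(*rows):                 # iterate over alignment columns
        counts = Counter(c for c in column if c != gap)
        # A column made only of gaps stays a gap in the consensus.
        out.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(out)


rows = ["ACG-", "ACGT", "TCGT"]
print(consensus(rows))  # prints ACGT
```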
Preparing Your MSA for Publication • MSAs in publications usually come with shaded colors • You can improve your MSAs using online tools like Boxshade • Boxshade will shade your MSA according to its degree of conservation
MSA => LOGO Graph • A LOGO graph summarizes an MSA • Tall letters indicate highly conserved positions • Short letters indicate poorly conserved positions • LOGO graphs are ideal for identifying conserved patterns • weblogo.berkeley.edu/
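The letter heights in a LOGO graph come from the information content of each column, measured in bits: the maximum possible entropy for the alphabet minus the observed entropy of the column. The sketch below computes that quantity in its uncorrected form. Logo programs typically also apply a small-sample correction, which is omitted here, and `information_content` is a hypothetical name.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4, gap="-"):
    """Information content of one alignment column in bits (no small-sample correction).

    alphabet_size is 4 for DNA and 20 for protein.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


print(information_content("AAAA"))  # fully conserved DNA column: 2.0 bits
print(information_content("ACGT"))  # all four bases equally frequent: 0.0 bits
```

A fully conserved column reaches the maximum height (2 bits for DNA, about 4.32 bits for protein), while a column where every residue is equally likely has height zero.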
Going Farther • Your imagination is the limit when it comes to making MSAs nice-looking and informative • Four very popular and easy-to-install MSA editors: • CINEMA • Seaview • Belvu • Kalignview • Boxshade is the simplest shading tool • If you need heavier capabilities, try Espript • Available at espript.ibpc.fr
Early Evolutionary Studies • From Darwin’s time until the early 1960s, anatomical features were the dominant criteria used to derive evolutionary relationships between species • The evolutionary relationships derived from these relatively subjective observations were often inconclusive, and some of them were later proved incorrect
Evolution and DNA Analysis: the Giant Panda Riddle • For roughly 100 years scientists were unable to figure out which family the giant panda belongs to • Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate • In 1985, Steven O’Brien and colleagues solved the giant panda classification problem using DNA sequences and algorithms