February 4 – 8, 2008

Centers for Disease Control, Atlanta, GA Workshop on Molecular Evolution: Special session on Phylogenetics. February 4 – 8, 2008. Tuesday, February 5, 2008, 2 to 6 PM. Multiple Sequence Alignment & Analysis with SeaView and MAFFT. Steven M. Thompson

Presentation Transcript


  1. Centers for Disease Control, Atlanta, GA Workshop on Molecular Evolution: Special session on Phylogenetics February 4 – 8, 2008

  2. Tuesday, February 5, 2008, 2 to 6 PM Multiple Sequence Alignment & Analysis with SeaView and MAFFT Steven M. Thompson Florida State University School of Computational Science (SCS) More data yields stronger analyses — if done carefully! The patterns of conservation become ever clearer by comparing the conserved portions of sequences amongst a larger and larger dataset. Mosaic ideas and evolutionary ‘importance.’

  3. my lecture’s outline The Why — Applications: molecular phylogenetics; primer design and graphics; homology based inference. The How — Dynamic programming with just two sequences: the recursion and scoring matrices. The When — Significance and the extreme value distribution: the Expectation value and homology. The How again — Multiple sequence dynamic programming; the algorithm and some of the variants: Clustal, Muscle, ProbCons, T-Coffee, and MAFFT. Do it on the Web, your own computer, a server. Issues — Coding DNA versus protein sequences. Reliability and all the complications involved. How to cope — SeaView: editing, visualization, and analysis. Before proceeding I need to remind you that the manuscript that we’ll be using for the tutorial this afternoon after I finish ‘yacking’ has most of this talk in greater detail and with all references (versus the PDF from the slides).

  4. First off, why even bother — Applicability? • Molecular evolutionary analysis; plus • Probe/primer, and motif/profile design; • Graphical illustrations; and • Comparative ‘homology’ inference. • OK — here’s some examples.

  5. Molecular evolution and phylogenetics We all know multiple sequence alignments are necessary for phylogenetic inference, but does everybody here truly realize that the absolute positional homology of every column in a data matrix passed on to these programs is the most critical assumption that all the algorithms make (but see Bayesian coestimation)!

  6. And what about this other stuff? Multiple sequence alignments can be indispensable for primer design when you don’t have data on a particular taxon, yet data are available for related taxa. The conservation and variability within an alignment can help guide the design of universal or species-specific primers.

  7. Here’s an HPV L1 example The ellipses show areas where PCR primers could differentiate the Type 16 clade from its closest relatives — areas of high L1 conservation in the Type 16 clade (red line) that correspond to areas of much weaker conservation in the others (blue line).

  8. HMG box Motif and profile definition An alignment of human SRY/SOX proteins illustrates the conservation of the HMG box. Conserved regions can be visualized with a sliding window approach and appear as peaks. Motifs and (better yet) HMM profiles can be built from the region and used as search tools to find other HMG box proteins.

  9. One picture’s worth . . . The HMG-box domain is strikingly conserved amongst the otherwise nearly unalignable human DNA regulatory paralogous protein family.

  10. Structure/function homology inference A Swiss-Model homology based model of Giardia EF1 superimposed over its eight most similar sequences with solved structure. Amazingly accurate structure/function inferences are often possible using comparative methods.

  11. OK, so alignment is worthwhile. One way to ‘see’ an alignment between two sequences is a dot plot, but how do we calculate the ‘best’ alignment?

  12. So, first, let’s review pairwise alignment Brute force just won’t work: complexity $O(\sim N^{4N})$. Dynamic programming reduces the complexity of this to $O(\sim N^{2})$. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

$$S_{ij} = s_{ij} + \max \begin{cases} S_{i-1,\,j-1} \\ \max\limits_{2 < x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\ \max\limits_{2 < y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right) \end{cases}$$

  where $S_{ij}$ is the score for the alignment ending at i in sequence 1 and j in sequence 2, $s_{ij}$ is the score for aligning i with j, $w_x$ is the score for making an x long gap in sequence 1, and $w_y$ is the score for making a y long gap in sequence 2, allowing gaps to be any length in either sequence. Usually an affine penalty is used: total = ( [ length of gap ] × [ gap extension penalty ] ) + gap opening penalty, i.e. y = mx + b.

  13. An illustration of a simplified DP alignment example total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])

  14. Optimum Alignments There may be more than one best path through the matrix (and optimum doesn’t guarantee biologically correct). Starting at the top and working down, then tracing back, the two best trace-back routes define the following two alignments:

      cTATAtAagg          cTATAtAagg
      |  |||||     and       |||||
      cg.TAtAaT.          .cgTAtAaT.

  With the example’s scoring scheme these alignments have a score of 5, the highest bottom-right score in the trace-back path graph: six matches minus one interior gap for the first, and five matches with no interior gap for the second. This is the number optimized by the algorithm, not any type of a similarity or identity percentage, here 75% and 62% respectively! Software will report only one optimal solution. This was a Needleman Wunsch global solution. Smith Waterman style local solutions use negative numbers in the match matrix and pick the best diagonal within the overall graph.
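The score of 5 in this example can be checked with a few lines of dynamic programming. This is only a minimal sketch, not the software used in the talk. It assumes a match scores 1, a mismatch scores 0, every gap position costs 1 (opening penalty zero, extension penalty one, as on the previous slide), letter case is ignored, and gaps at the sequence ends are free; that last assumption is inferred from the reported score rather than stated on the slide. The function name is hypothetical.

```python
def semiglobal_score(seq_a, seq_b, match=1, mismatch=0, gap=1):
    """Best alignment score with free end gaps (assumed scoring scheme)."""
    a, b = seq_a.upper(), seq_b.upper()
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero: leading end gaps cost nothing.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                S[i - 1][j] - gap,       # gap position in seq_b
                S[i][j - 1] - gap,       # gap position in seq_a
            )
    # Trailing end gaps are free: take the best cell in the last row or column.
    return max(max(S[n]), max(row[m] for row in S))


print(semiglobal_score("cTATAtAagg", "cgTAtAaT"))  # 5
```

Only the score is returned here; recovering the alignments themselves would need a trace-back through the same matrix.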

  15. Independent of optimum, what is a ‘good’ alignment? So, significance: when is any alignment worth anything biologically? An old statistics trick — Monte Carlo simulations:

$$Z = \frac{(\text{actual score}) - (\text{mean of randomized scores})}{\text{standard deviation of randomized score distribution}}$$

  So, the previous solutions only get a Z score of 1.1 in spite of their seemingly high percent identities! And initially ‘we’ thought this was a Normal (Gaussian) distribution. Now we know that it is actually an Extreme Value distribution, the distribution of maximum scores, not the distribution of mean scores.
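The Monte Carlo idea can be sketched in a few lines: shuffle one sequence many times, rescore each shuffle, and compare the real score with that randomized distribution. This is an illustrative sketch only. The function names are hypothetical, and the toy `identity_count` scorer (ungapped identical positions) merely stands in for a real alignment score.

```python
import random
import statistics


def z_score(actual, randomized_scores):
    """(actual - mean of randomized scores) / their standard deviation."""
    mean = statistics.mean(randomized_scores)
    sd = statistics.stdev(randomized_scores)  # sample standard deviation
    return (actual - mean) / sd


def identity_count(a, b):
    """Toy stand-in score: identical positions in an ungapped comparison."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffled_scores(seq_a, seq_b, score_fn, trials=100, seed=0):
    """Score seq_a against shuffled copies of seq_b (composition preserved)."""
    rng = random.Random(seed)
    letters = list(seq_b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(score_fn(seq_a, "".join(letters)))
    return scores


print(z_score(5, [1, 2, 3]))  # 3.0
```

A call such as `z_score(identity_count(a, b), shuffled_scores(a, b, identity_count))` ties the two pieces together. The number of trials and the seed are arbitrary choices.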

  16. Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E Value, can be calculated from database searches. The ‘take-home’ message is . . . ‘Sequence-space’ follows the Extreme Value distribution (http://mathworld.wolfram.com/ExtremeValueDistribution.html).

  17. The Expectation Value! The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. the match is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant the match is and the higher its Z score will be! The E value is the number that really matters. Also see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
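To see why the E value depends on database size, it helps to look at the standard relationship between a normalized bit score S′ and the expected number of chance hits, E = m · n · 2^(−S′), where m is the query length and n is the total database length. The sketch below simply evaluates that formula. The function and parameter names are hypothetical, and real search programs apply further corrections (for example effective lengths) that are ignored here.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same bit score is less significant in a database twice the size.
print(expect_value(10, 1, 1024))  # 1.0
print(expect_value(10, 1, 2048))  # 2.0
```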

  18. And how does this relate to homology? Significant ‘enough’ similarity implies homology, insignificant similarity does not negate homology. And remember what “homology” really means (W. Fitch joke)! The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution). They very roughly correspond to the listed E Values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a database with ~125,000 protein entries.

  19. What about proteins — conservative replacements and similarity as opposed to identity. The nitrogenous bases, A,C, T, G, are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally! BLOSUM62 table: Positive identity values range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

  20. On to multiple sequences — dynamic programming’s complexity increases exponentially with the number of sequences being compared: N-dimensional matrix . . . . complexity $O([\text{sequence length}]^{\text{number of sequences}})$
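A quick calculation shows how fast that N-dimensional matrix grows. This is plain arithmetic; the sequence length of 300 is an arbitrary illustrative choice, and the function name is hypothetical.

```python
def dp_cells(sequence_length, n_sequences):
    """Number of cells in a full N-dimensional dynamic programming matrix."""
    return sequence_length ** n_sequences


for n in (2, 3, 5, 10):
    print(n, "sequences:", dp_cells(300, n), "cells")
```

Two sequences of length 300 need 90,000 cells; three need 27 million; by ten sequences the count is far beyond anything that can be stored.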

  21. A couple ‘global’ solutions using heuristic tricks See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page. Both are available at the Baylor College of Medicine’s Search Launcher — http://searchlauncher.bcm.tmc.edu/ — but with severely limiting restrictions!

  22. Therefore — pairwise, progressive dynamic programming . . . . . . restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners represented as a consensus. Each group of partners is then aligned to finish the complete multiple sequence alignment.

  23. This was pretty much the original ClustalV and GCG PileUp program . . . then . . . Enhancements on the theme First enhancements came from ClustalW — variable sequence weighting, dynamically varying gap penalties and substitution matrices, and a neighbor-joining guide-tree. Since the year 2000 a slew of new programs have tried other heuristic variations, all in attempts to build faster, more accurate multiple sequence alignments. The devil’s in the details: Muscle, ProbCons, T-Coffee, MAFFT and many, many more.

  24. Muscle An iterative method that uses weighted log-expectation profile scoring along with a slew of optimizations. It proceeds in three stages — draft progressive using k-mer counting, improved progressive using a revised guide-tree from the previous iteration, and refinement by sequential deletion of each tree edge with subsequent profile realignment. ProbCons Uses Hidden Markov Model (HMM) techniques and posterior probability matrices that compare random pairwise alignments to expected pairwise alignments. Probability consistency transformation is used to reestimate the scores, and a guide-tree is then constructed, which is used to compute the alignment, which is then iteratively refined. Incredibly accurate.

  25. T-Coffee Uses a preprocessed, weighted library of all pairwise global alignments between your sequences, plus the ten best local alignments associated with each pair. This helps build the NJ guide-tree and from this the alignment. The library is also used to assure consistency and help prevent errors, by allowing ‘forward-thinking’ to see whether the overall alignment will be better one way or another after particular segments are aligned one way or another. The institutional schedule analogy . . . . T-Coffee can even tie together multiple methods as external modules, making consistency libraries from the results of each, as long as all the specified methods are installed on your system. T-Coffee is one of the most accurate methods available because of this consistency based rationale, but it is not the fastest. Regardless, I encourage you to check it out! Also see my manuscript.

  26. MAFFT — today’s example — has many modes, among them: a couple of progressive, approximate modes, using a fast Fourier transformation (FFT); a couple of iteratively refined methods that add in weighted-sum-of-pairs (WSP) scoring; and several iterative methods that use WSP scoring combined with a T-Coffee-like consistency based scoring scheme. Speed and accuracy are inversely proportional for these from fast and rough, to slow and accurate, respectively. MAFFT provides command aliases for all of these, from fast to slow — FFTNS with or without retree, FFTNSI with or without maxiterate, and the three combined approaches EINSI, LINSI, and GINSI. See command line help with “mafft --help” and the complete ‘man’ page style manual at http://align.bmr.kyushu-u.ac.jp/mafft/software/manual/manual.html.

  27. MAFFT’s basic algorithm MAFFT’s fast Fourier transform provides a huge speedup over previous methods. Homologous regions are quickly identified by converting amino acid residues to vectors of volume and polarity, thus changing a twenty-character alphabet to six, rather than by using an amino acid similarity matrix. Similarly, nucleotide bases are converted to vectors of imaginary and complex numbers. The FFT trick then reduces the complexity of the subsequent comparison to O ( N logN ). FFT identifies potential similarities though, without localizing them; a sliding window step using the BLOSUM62 matrix is used for this. Then MAFFT constructs a distance matrix, and hence a progressive guide tree, on the number of shared six-tuples from this Fourier transform, rather than on a ranking based on full-length, pairwise sequence similarity. The user can specify how many times a new guide tree is recalculated from the previous alignment; the alignment is reconstructed using the Needleman Wunsch algorithm each time.
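The quantity the FFT accelerates is a correlation between the two sequences at every possible offset (lag). The sketch below computes that correlation directly, in O(N²) time, so that it stays dependency-free; an FFT would deliver the same lag-by-lag totals in O(N log N). As a simplifying assumption it counts identical residues instead of correlating volume and polarity values, and the function names are hypothetical.

```python
def match_counts_by_lag(seq_a, seq_b):
    """For each lag, count positions where seq_b[j] equals seq_a[j + lag]."""
    a, b = seq_a.upper(), seq_b.upper()
    counts = {}
    for lag in range(-(len(b) - 1), len(a)):
        counts[lag] = sum(
            1
            for j, residue in enumerate(b)
            if 0 <= j + lag < len(a) and a[j + lag] == residue
        )
    return counts


def best_lag(seq_a, seq_b):
    """Lag with the most matches: a peak suggests a homologous region."""
    counts = match_counts_by_lag(seq_a, seq_b)
    return max(counts, key=counts.get)


# The two short sequences from the pairwise example peak at lag 1.
print(best_lag("cTATAtAagg", "cgTAtAaT"))  # 1
```

As the slide notes, a peak only says that similarity exists at that offset; finding where along the sequences it lies takes the separate sliding window step.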

  28. Some of MAFFT’s many modes And each mode has a bunch of additional options! 1) Most basic, fastest modes — just progressive. a) FFTNS1 (fftns --retree 1) b) FFTNS2 (fftns) (same as mafft --retree 2) Suitable for 1,000’s of easily aligned sequences. A rough distance matrix is built from the sequences using FFT and the shared number of six-mers. A modified UPGMA guide tree is built from this matrix. The sequences are aligned according to the rough, initial guide tree (as in ‘traditional’ methods). FFTNS2 adds a recomputation of the guide tree (retree 2) from the original alignment, from which a new progressive alignment is built.
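A rough distance from shared six-mers can be sketched as follows. This is an assumption-laden illustration rather than MAFFT’s code: it counts six-mers on the raw sequence letters (MAFFT first compresses the alphabet), and the normalization, one minus the shared count divided by the smaller self-count, is just one simple choice. All function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=6):
    """Number of k-mers the two sequences have in common (with multiplicity)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def kmer_distance(a, b, k=6):
    """Rough distance: 0.0 for identical k-mer content, 1.0 for none shared."""
    smaller_self = min(shared_kmers(a, a, k), shared_kmers(b, b, k))
    if smaller_self == 0:
        raise ValueError("both sequences must be at least k residues long")
    return 1.0 - shared_kmers(a, b, k) / smaller_self


print(shared_kmers("ACDEFGHI", "ACDEFGHK"))  # 2
```

A matrix of such distances for all sequence pairs is what a UPGMA-style guide tree would then be built from.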

  29. MAFFT’s iterative refinements 2) Intermediate modes — progressive + iterations to maximize the WSP objective function. a) FFTNSI (fftnsi) default two cycles, or e.g. fftnsi --maxiterate 1000 b) NWNSI (nwnsi) same as FFTNSI, but no FFT, Needleman Wunsch only. Progressive alignment and retree as before, with or without FFT, and then . . . . Iterative refinement is cycled twice (default), or repeatedly until there is no further improvement, or until you reach your specified limit number. Suitable for 100’s through 1000’s of sequences.

  30. MAFFT’s most accurate modes 3) Advanced modes — progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated. a) EINSI (einsi) most general of these. Uses a Smith Waterman style local algorithm with generalized affine gap costs for the pairwise step. Most appropriate for sequences with multiple shared, similarly ordered domains, in an otherwise nearly unalignable ‘mess,’ e.g.:

      ooooooXXX------XXXX-----------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
      ------XXXXXXXXXXXXXooo--------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX----------
      --ooooXXXXXX---XXXXooooooooooo------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
      ------XXXXX----XXXXoooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------
      ------XXXXX----XXXX-----------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----

  31. MAFFT’s most accurate modes, cont. 3) Advanced modes — progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated. b) LINSI (linsi) strictly local. Uses a Smith Waterman style local algorithm with affine gap costs for the pairwise step. Most appropriate for sequences with only one single, shared domain, in an otherwise nearly unalignable ‘mess,’ e.g.:

      --------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
      --------------XXXXXXXXXXXXXXXXXX-XXXXXXXX----------
      --------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
      ooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------
      --------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----

  32. MAFFT’s most accurate modes, cont. 3) Advanced modes — progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated. c) GINSI (ginsi) strictly global. Uses a Needleman Wunsch style global algorithm with affine gap costs for the pairwise step. Most appropriate for sequences where one single, shared domain extends the full length of all of the sequences, e.g.:

      XXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXooooXXXooXXX
      -XXXXXXXXXXXXXXXXXX-XXXXXXXX--XXXXXXX---XXX
      XX--XXXXX---XXXXXXXXXXXXXXXXXXXoooooXXoooXX
      ooooXXXXXoooooXXXXX-XXXXXXXXXXXX--XXXXXXXX-
      XXXXX---XXXXXXXXXX--XXXXXXXooooXXXXXXXXXX--
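For scripted runs, the modes above can be mapped onto `mafft` command lines. The option sets below follow my reading of the MAFFT manual’s descriptions of the command aliases and should be treated as assumptions to confirm against `mafft --help` for your installed version. The helper names and the `input.fasta` file name are hypothetical.

```python
import shutil
import subprocess

# Assumed equivalents of the command aliases; confirm with "mafft --help".
MAFFT_MODES = {
    "fftns1": ["--retree", "1", "--maxiterate", "0"],
    "fftns2": ["--retree", "2", "--maxiterate", "0"],
    "fftnsi": ["--retree", "2", "--maxiterate", "2"],
    "einsi": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
    "linsi": ["--localpair", "--maxiterate", "1000"],
    "ginsi": ["--globalpair", "--maxiterate", "1000"],
}


def mafft_command(mode, fasta_path):
    """Build the argument list for one of the modes above."""
    return ["mafft", *MAFFT_MODES[mode], fasta_path]


def run_mafft(mode, fasta_path):
    """Run MAFFT and return the aligned FASTA text it writes to stdout."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on the PATH")
    result = subprocess.run(
        mafft_command(mode, fasta_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(" ".join(mafft_command("linsi", "input.fasta")))
```

Building the argument list separately from running it makes it easy to log exactly which mode produced a given alignment.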

  33. How to know when to use what For MAFFT, see the algorithm and tips, tips3, and tips4 pages. For all of them, the take-home message: For simple cases it doesn’t really matter what program you use. For complicated situations it may, and what you use will depend on the size of your dataset, personal preferences, time allotted, and how much hand editing you want to do. Really nice, recent review: Edgar, R.C. and Batzoglou, S. (2006) Multiple sequence alignment. Current Opinion in Structural Biology 16, 368–373. The rest of my references can be found in my tutorial manuscript for this workshop.

  34. You can do a lot of this stuff on the Web, if you need to — some resources for multiple sequence alignment: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html. http://pbil.univ-lyon1.fr/alignment.html http://www.ebi.ac.uk/Tools/sequence.html http://searchlauncher.bcm.tmc.edu/ However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

  35. If large datasets become intractable for analysis on the Web, what other resources are available? Soapbox detour . . . Desktop software solutions — all of these programs are available in public domain/open source, but . . . they can be complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, lack of most recent programs, underperformance, and Internet and/or CD database access all complicate matters! Therefore, I argue for UNIX server-based solutions . . .

  36. UNIX servers — pros and cons Free/public domain solutions still available, but now a very cooperative systems manager needs to maintain everything for users. If you have such a person, then: You end up with a more powerful, and usually faster computer, with larger storage capabilities. Plus, connections can be made from any networked terminal or workstation anywhere! Operating system: UNIX command line operation hassles; communications software — ssh, and terminal emulation; X graphics; file transfer — scp/sftp; and editors — vi, emacs, pico/nano (or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.

  37. getting off my soapbox . . . Coding DNA issues Work with proteins! If at all possible. Twenty match symbols versus four, plus similarity versus identity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. SeaView can do this for you! Nucleotide sequences will only reliably align if they are very similar to each other. And they will likely require extensive and carefully considered hand editing with an editor like SeaView.
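The ‘translate, then align’ advice implies a final step: threading the original codons back onto the aligned protein so that every protein gap becomes a three-base gap and no indel can land inside a codon. The following is a minimal sketch of that step, not SeaView’s implementation. The function name is hypothetical, it assumes the coding DNA is in frame with any stop codon already removed, and it does not check that the codons actually translate to the given residues.

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Lay codons onto an aligned protein; each protein gap becomes three gaps."""
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(coding_dna) % 3 != 0 or len(codons) != residues:
        raise ValueError("DNA length must be three times the residue count")
    next_codon = iter(codons)
    return "".join(
        gap * 3 if ch == gap else next(next_codon) for ch in aligned_protein
    )


print(thread_codons("M-K", "ATGAAA"))  # ATG---AAA
```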

  38. Reliability and the comparative approach . . . explicit homologous correspondence of residues within every column of your alignment; manual adjustments should be encouraged — based on knowledge, especially structural, regulatory, and functional sites. Therefore, editors like SeaView and databases like the Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp

  39. SeaView SeaView is a really good multiple sequence editor graphical user interface (GUI) with the ability to manually adjust alignments, create dot plots between any two sequences, and run external multiple sequence alignment programs on portions of or all of your data. Some of its very powerful features are its ability to let you work on DNA sequences based on their translations, to create “Sites” and “Species” sets that delineate subsets of your data matrix, and to annotate your data. It is available for all major operating systems.

  40. SeaView’s view The HPV E2/E4 gene reading frame overlap after EINSI refinement.

  41. ‘Mask’ out uncertain areas; SeaView’s “Sites sets” allow you to do this. Annotate known regions; SeaView’s “Footers” do this. X’s delineate sites that will be exported with “Save selection” or specified as a CHARSET by “Save as” “NEXUS.”

  42. Complications: beware of aligning apples and oranges [and grapefruit]! For example: receptors and/or activators with their namesake proteins; paralogous versus orthologous homologs; genomic versus cDNA; mature versus precursor proteins . . . .

  43. Complications, cont. Order dependence. Not that big of a deal, makes biological sense. Substitution matrices and gap penalties. Can be a very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity. SeaView lets you do this!

  44. Complications still, format hassles! Specialized format conversion tools such as GCG’s SeqConv+ program and Don Gilbert’s public domain ReadSeq program. And many editors accept various input formats, such as SeaView (inputs and outputs NEXUS, MSF, Clustal, FastA, PHYLIP, and MASE formats).

  45. Yet more complications Indels and missing data symbols (i.e. gaps) designation discrepancy headaches — ., -, ~, ?, N, or X . . . . . Help!
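One small defensive step before passing an alignment between programs is to normalize the gap characters. The sketch below rewrites `.` and `~` as `-` and deliberately leaves `?`, `N`, and `X` alone, since those may mean missing data or ambiguous residues rather than indels. Which symbols count as gaps is an assumption that depends on the programs involved, and the function name is hypothetical.

```python
GAP_TABLE = str.maketrans({".": "-", "~": "-"})


def normalize_gaps(aligned_row):
    """Rewrite '.' and '~' gap symbols as '-'; leave '?', 'N', and 'X' untouched."""
    return aligned_row.translate(GAP_TABLE)


print(normalize_gaps("AC..GT~~?N"))  # AC--GT--?N
```

Some formats use `.` to mean ‘same as the first sequence’, so it is worth checking the file’s conventions before applying this.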

  46. Conclusions • Gunnar von Heijne in his very old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit, provides a very appropriate conclusion: • “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” • He continues: • “. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.” FOR MORE INFO... Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html. Contact me (stevet@bio.fsu.edu) for specific long-distance bioinformatics assistance and collaboration.

  47. On to a demonstration of some of SeaView’s multiple sequence dataset capabilities — • The HPV L1 gene and its complete genome . . . the tutorial: • How to use SeaView with MAFFT.
