Biology 4900. Biocomputing. Chapter 6. Multiple Sequence Alignments. Relationships between biological sequences. Biological sequences tend to occur in families These may be related genes within an organism (paralogs) or between species (orthologs) Presumably derived from common ancestor
Biology 4900 Biocomputing
Chapter 6 Multiple Sequence Alignments
Relationships between biological sequences • Biological sequences tend to occur in families • These may be related genes within an organism (paralogs) or between species (orthologs) • Presumably derived from a common ancestor • Nucleotide sequences of coding regions are typically less well conserved than the proteins they encode, due to the degeneracy of the genetic code, and are therefore more difficult to align • Sequences evolve faster than structures, but homologous sequences tend to retain similar structure and function (e.g., rat vs. human CaM)
Multiple sequence alignments • Homology can be observed through multiple sequence alignments (MSA) • MSA: 3 or more protein (or nucleic acid) sequences that are partially or completely aligned • Homologous residues are aligned in columns across the length of the sequences
Multiple sequence alignments • MSAs are powerful because they can reveal relationships between 2 sequences that can only be observed by their relationships with a third sequence

Two sequences alone:
Seq 1  AVGYDFGEKMLSGADDW
Seq 2  LVGERADLTGAEIDE

With a third sequence added:
Seq 1  AVGYDFGEKMLSGA--DDW
Seq 3  LVGYDRADK-LTGAE-DD-
Seq 2  LVG-ERAD--LTGAEIDE-
How are MSAs determined? MSAs can be determined based on: • Presence of highly conserved residues such as cysteine • Conserved motifs and domains • Conserved features of protein secondary structure • Regions showing consistent patterns of insertions or deletions [Figure: C-terminal domain of CaM (from 3cln.pdb), showing conserved 2° structure (α-helices)]
Why use MSAs? • If the protein (or gene) you are studying is part of a larger group, you may be able to gain insight into the structure, function and evolution of the sequence. • MSAs are more sensitive than pairwise alignments for detecting homologs. • MSAs can reveal conserved residues, motifs and domains. • Useful for generating phylogenetic trees. • Regulatory regions of many genes contain conserved consensus sequences.
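Benchmarking • Q: How good is a MSA? • A: Compare the sequence alignment against known structure alignments (reference scores). • Measured by an objective scoring system such as the sum-of-pairs score (SPS). • For an alignment with N rows (sequences) and M columns, the score of column $i$ is the sum of scores for all pairs in that column: $S_i = \sum_{j=1}^{N} \sum_{k=j+1}^{N} p_{ijk}$, where $p_{ijk}$ is the score for the pair of rows $j$ and $k$ in column $i$. • SPS is the sum of scores for all your aligned columns divided by the sum of reference scores: $\mathrm{SPS} = \sum_{i=1}^{M} S_i \,/\, \sum_{i} S_{r,i}$, where $S_{r,i}$ is the score of column $i$ in the reference alignment.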
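The sum-of-pairs idea can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark tool's implementation: it assumes that a pair of residues scores 1 when the test alignment places them in the same column as the reference does, and 0 otherwise. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings. Each residue is
    identified as (row, index of the residue within its ungapped sequence).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                residues.append((row, counters[row]))
                counters[row] += 1
        # Every pair of residues in this column counts as one aligned pair.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces.

    Assumes the reference contains at least one aligned pair.
    """
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy data (hypothetical): the test alignment misplaces one residue.
reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
print(sum_of_pairs_score(test, reference))
```

A score of 1.0 means every residue pair in the reference is also paired in the test alignment; lower values mean the program disagreed with the structure-based reference.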
Five MSA Approaches • Exact methods • Progressive alignment (e.g., ClustalW) • Iterative approaches (e.g., PRALINE, IterAlign, MUSCLE) • Consistency-based methods (e.g., MAFFT, ProbCons) • Structure-based methods (e.g., Expresso) Our Focus
Exact Methods • Exact methods, like Needleman and Wunsch, generate optimal alignments but aren’t feasible for alignments of many sequences. • Computational time for this approach is described in Big O notation as O(2^N L^N), where N is the number of sequences and L is the average sequence length. • In other words, the number of steps T the algorithm needs grows on the order of 2^N L^N.
Progressive Sequence Alignment (Feng-Doolittle) How it works: • Calculates pairwise sequence alignment scores between all proteins (or nucleic acid sequences) • Aligns 2 closest sequences using a guide tree • Progressively aligns more sequences to the first 2 • Advantages: Permits rapid alignment of 100s of sequences. • Disadvantages: May not provide most accurate alignment depending on how alignment is started. ClustalW MUSCLE
What these numbers mean… N = 10 sequences, L = 100 residues (avg.)
Needleman & Wunsch: too large to calculate
ClustalW: 20,000
MUSCLE: 110,000
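The figures on this slide can be reproduced with a short calculation. The sketch below reads the exact-method bound as 2^N L^N and assumes step counts of N^4 + L^2 for ClustalW and N^4 + N L^2 for MUSCLE. Those two expressions are not stated in the slides; they are assumed here because they give exactly 20,000 and 110,000 for N = 10 and L = 100.

```python
def exact_steps(n, length):
    """Steps for an exact method, reading the bound as 2^N * L^N."""
    return (2 ** n) * (length ** n)


def clustalw_steps(n, length):
    """Assumed ClustalW step count: N^4 + L^2 (not given in the slides)."""
    return n ** 4 + length ** 2


def muscle_steps(n, length):
    """Assumed MUSCLE step count: N^4 + N * L^2 (not given in the slides)."""
    return n ** 4 + n * length ** 2


n, length = 10, 100
print(f"Exact:    {exact_steps(n, length):.3e}")
print(f"ClustalW: {clustalw_steps(n, length):,}")
print(f"MUSCLE:   {muscle_steps(n, length):,}")
```

The exact-method count is on the order of 10^23 steps for just ten sequences, which is why the slide calls it too large to calculate.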
Progressive MSA stage 1 of 3: generate global pairwise alignments • For n sequences, the number of pairwise alignments is n(n-1)/2 • For 5 sequences, (5)(4)/2 = 10 alignments • First find the two sequences that produce the highest (best) score
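As a quick check of the pair count, the pairs can be enumerated directly. This is only a sketch: the sequence names and the score values are hypothetical placeholders, and a real run would fill the score table with global pairwise alignment scores.

```python
from itertools import combinations


def pairwise_jobs(names):
    """List every unordered pair of sequences that must be aligned."""
    return list(combinations(names, 2))


def closest_pair(scores):
    """Return the pair with the highest pairwise score."""
    return max(scores, key=scores.get)


names = ["seq1", "seq2", "seq3", "seq4", "seq5"]  # hypothetical labels
jobs = pairwise_jobs(names)
print(len(jobs))  # n(n-1)/2 pairs

# Made-up scores, only to show how the first pair would be chosen.
scores = {pair: 0 for pair in jobs}
scores[("seq2", "seq4")] = 97
print(closest_pair(scores))
```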
Tree Views of alignments • Alignments may be evaluated by either similarity or distance measures • A tree shows the distance between objects Closely-related Sequences Distantly-related Sequences
How to read tree views of alignments Closely-related Sequences
5 closely related globins
Feng-Doolittle stage 2: guide tree • Convert similarity scores to distance scores • Build the tree with the unweighted pair group method with arithmetic mean (UPGMA; defined in Chapter 7) • ClustalW output shown below. Use JalView in ClustalW to display tree view.
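The two steps on this slide can be sketched as follows. The similarity-to-distance conversion used here (1 minus fractional identity) is an assumption made for illustration, since the slide does not say which conversion is used. The UPGMA routine is a minimal version that returns the tree as nested tuples; the sequence names and identity values are hypothetical.

```python
from itertools import combinations


def similarity_to_distance(identity):
    """Assumed conversion: distance = 1 - fractional identity."""
    return 1.0 - identity


def upgma(names, dist):
    """Minimal UPGMA sketch.

    `dist` maps frozenset({a, b}) to the distance between leaves a and b.
    Returns (tree, merges): the tree as nested tuples and the list of
    merges as (cluster_a, cluster_b, node_height).
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    tree = names[0]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        tree = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((tree, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[tree] = size_a + size_b
        merges.append((a, b, height))
    return tree, merges


# Hypothetical fractional identities between three sequences.
identity = {("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "C"): 0.4}
dist = {frozenset(k): similarity_to_distance(v) for k, v in identity.items()}
tree, merges = upgma(["A", "B", "C"], dist)
print(tree)
print(merges)
```

The merge order is what the progressive stage follows: the first merge names the two closest sequences, and each later merge says which sequence or group joins next.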
Feng-Doolittle stage 3: progressive alignment • Build MSA based on the order in the guide tree • Start with the two most closely related sequences • Then add the next closest sequence • Continue until all sequences are added to the MSA • Follows Rule: “once a gap, always a gap.” 2 closest alignments
Why “once a gap, always a gap”? • There are many possible ways to make a MSA • Where gaps are added is a critical question • Gaps are often added to the first two (closest) sequences • To change the initial gap choices later on would be to give more weight to distantly related sequences • To maintain the initial gap choices is to trust that those gaps are most believable • Insertions receive higher penalties than deletions, and are propagated throughout alignment Note placement of M and A at end of gap
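A small sketch may help make the rule concrete. When a later sequence needs a new gap, the gap is inserted as a whole column into every row already aligned, and no existing gap is ever removed. The helper names and the toy sequences below are hypothetical, and this shows only the bookkeeping, not how a program decides where the gap should go.

```python
def add_gap_column(aligned_rows, col):
    """Insert one gap column at index `col` in every already-aligned row."""
    return [row[:col] + "-" + row[col:] for row in aligned_rows]


def ungapped(row):
    """Recover the original sequence by dropping gap characters."""
    return row.replace("-", "")


# Toy alignment of the two closest sequences; it already contains one gap.
profile = ["AVGYD-GEK", "LVGYDRADK"]

# A later, more distant sequence needs an extra position at column 3,
# so a gap column is added to every existing row.
widened = add_gap_column(profile, 3)
print(widened)

# The earlier gap is still there and the sequences themselves are unchanged.
print([ungapped(row) for row in widened])
```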
ClustalW Output for CD2 Protein • Color coding indicates amino acid property class • Symbols below each column:
*  100% conserved over entire alignment
:  conservative mutations
.  less conservative mutations
[blank]  gap or least conserved mutations
Alignment Size Can use to build phylogeny tree Medium Medium Small
Clustal W alignment of 5 closely related globins * asterisks indicate identity in a column
Additional features of ClustalW improve its ability to generate accurate MSAs • Individual weights are assigned to sequences: very closely related sequences are given less weight, while distantly related sequences are given more weight • Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.: PAM20 for 80-100% identity, PAM60 for 60-80% identity, PAM120 for 40-60% identity, PAM350 for 0-40% identity • Residue-specific gap penalties are applied
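The matrix choice listed above amounts to a simple lookup by percent identity. The sketch below is illustrative only and is not ClustalW's code. The ranges on the slide share their endpoints (80% appears in two ranges), so the function assumes a boundary value falls in the higher-identity range; the function name is hypothetical.

```python
def pick_pam_matrix(percent_identity):
    """Choose a PAM matrix name from percent identity (0 to 100).

    Assumption: a value on a shared boundary (40, 60, 80) is assigned to
    the higher-identity range.
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"


for pid in (95, 70, 45, 20):
    print(pid, pick_pam_matrix(pid))
```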
In-Class Assignment: Multiple sequence alignments using ClustalW • Example of MSA using ClustalW: two data sets • Five distantly related globins (human to plant) • Five closely related beta globins • Obtain your sequences in the FASTA format! • You can save them in Notepad or another text editor.
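If you want to check a saved file before submitting it to an alignment program, a FASTA file is easy to read by hand or by script: each record starts with a ">" header line, followed by one or more lines of sequence. The sketch below is a minimal reader written for this exercise, not a replacement for a full parser; the record names and sequences in the example are made up.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}.

    The header is everything after '>' on the header line; sequence lines
    are joined together with whitespace removed.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up example records (not real globin sequences).
example = """>toy_seq_1
AVGYDFGEK
MLSGADDW
>toy_seq_2
LVGERADLTGAEIDE
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```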
MSA: Iterative Methods • Compute a sub-optimal solution and keep modifying it intelligently, using dynamic programming or other methods, until the solution converges. • Unlike progressive methods, iterative methods can dynamically correct alignment errors • Examples: • MUSCLE: Multiple Sequence Comparison by Log-Expectation (Edgar, 2004) • IterAlign: (Karlin and Brocchieri, 1998) • Praline: PRofile ALIgNmEnt (Heringa, 1999; Simossis and Heringa, 2005) • MAFFT: Multiple Alignment using Fast Fourier-Transform (Katoh et al., 2005)
Iterative approaches: MAFFT • Available at http://mafft.cbrc.jp/alignment/software/ • Uses Fast Fourier Transform to speed up profile alignment • Uses fast two-stage method for building alignments using k-mer (matching 6-tuples) frequencies • Offers many different scoring and aligning techniques • One of the more accurate programs available • Available as standalone or web interface • Many output formats, including interactive phylogenetic trees
Iterative approaches: MUSCLE • Available at http://www.ebi.ac.uk/Tools/msa/muscle/ • 3 Stage approach • Stage 1: • Algorithm builds initial alignment based on similarities of paired alignments • Calculates distance matrix and generates rooted tree • Stage 2: • Improves tree by recalculating similarities • Stage 3: • Rescores pairs at branches
MSA: Consistency-based algorithms • Use database of both local high-scoring alignments and long-range global alignments to create a final alignment • Incorporates evidence from multiple sequences to guide pairwise alignment • In a sequence, if x is related to y, and y is related to z, then x should be related to z. • Fast and accurate • Examples: T-COFFEE, Prrp, DiAlign, ProbCons
Which methods are best? • Depends on: • Number of sequences to align. • What you are trying to do. • Level of user expertise. • Personal Preference. • Other Considerations: • Does method use benchmarking of multiple structures? • Do you want to evaluate 3D protein structures (e.g., try Expresso at http://www.tcoffee.org)? • You might want to: • Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers). • Compare results.
Example: 5 alignments of 5 globins Let’s look at a multiple sequence alignment (MSA) of five globin proteins. We’ll use five prominent MSA programs: ClustalW, Praline, MUSCLE (used at HomoloGene), ProbCons, and T-COFFEE. Each program offers unique strengths. We’ll focus on a histidine (H) residue that has a critical role in binding oxygen in globins and should therefore be aligned. Often it is not aligned, and the five programs give different answers. Our conclusion will be that there is no single best approach to MSA.
ClustalW Results • Note how the region of a conserved histidine (▼) varies depending on which of five prominent algorithms is used
ClustalW Praline Muscle ProbCons
See Thompson et al. (1994) for an explanation of the three stages of progressive alignment implemented in ClustalW