Multiple Sequence Alignments

Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University

Overview MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Example Multiple sequence alignment of 7 neuroglobins using clustalx MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Example • Searching for domains with RPS-BLAST MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Applications of Multiple Sequence Alignment • Identify conserved domains/elements in sequences • Compare regions of similarity among multiple organisms • Identify probes for similar sequences in other organisms • Develop PCR primers • Phylogenetic analysis MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Definition • A multiple alignment of strings S1, … Sk is a series of strings with spaces S1’, …, Sk’ such that • |S1’| = … = |Sk’| • Sj’ is an extension of Sj by insertion of spaces • Goal: Find an optimal multiple alignment. MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Scoring Alignments • In order to find an optimal alignment, we need to be able to measure how good an alignment is • Sum of pairs (SP) method: in a column, score each pair of letters and total the scores. Pairs of gaps score 0. • Total up scores for each column MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Using BLOSUM62 matrix, gap penalty -8 In column 1, we have pairs -,S -,S S,S k(k-1)/2 pairs per column SP Method Example -8 - 8 + 4 = -12 MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Dynamic Programming • The dynamic programming approach can be adapted to MSA • For simplicity, assume k sequences of length n • The dynamic programming array F is k-dimensional of length n+1 (including initial gaps) • The entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Dynamic Programming • Letting i represent the vector (i1,…,ik) and b represent a nonzero binary vector of length k, we fill in the array with the formula where (selecting a column to score) MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Example s1: MPE s2: MKE s3: MSKE s4: SKE • Let i=(1,1,1,1), b=(1,0,0,0) • Checking F(0,1,1,1) (i-b) • Column(s,i,b) is • SP-score is -24 (assuming gap penalty of -8) M - - - MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Analysis • O(nk) entries to fill • Each entry combines O(2k) other entries • Costs O(k2) to calculate each SP score • Overall cost is O(k2 2k nk), or exponential in the number of sequences! • MSA with SP-score shown NP-complete MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Star Alignments • Heuristic method for multiple sequence alignments • Select a sequence sc as the center of the star • For each sequence s1, …, sk such that index i  c, perform a Needleman-Wunsch global alignment • Aggregate alignments with the principle “once a gap, always a gap.” MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Star Alignments Example MPE | | MKE MSKE - || MKE s1: MPE s2: MKE s3: MSKE s4: SKE s3 s1 s2 SKE || MKE -MPE -MKE MSKE -SKE -MPE -MKE MSKE MPE MKE s4 MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Choosing a center • Try them all and pick the one with the best score • Calculate all O(k2) alignments, and pick the sequence sc that maximizes MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Analysis • Assuming all sequences have length n • O(n2) to calculate global alignment • O(k) global alignments to calculate • Using a reasonable data structure for joining alignments, no worse than O(kl), where l is upper bound on alignment lengths • O(kn2+k2l) overall cost MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Tree Alignments • Model the k sequences with a tree having k leaves (1 to 1 correspondence) • Compute a weight for each edge, which is the similarity score • Sum of all the weights is the score of the tree • Find tree with maximum score MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Tree alignment example • Match +1, gap -1, mismatch 0 • If x=CT and y=CG, score of 6 CTG CAT y x CG GT MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Analysis • The tree alignment problem is NP-complete • Hence, phylogenetic tree generation is NP-complete • Again, likely only exponential time solution available (for optimal answers) MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Progressive Approaches • CLUSTALW • Perform pairwise alignments • Construct a tree, joining most similar sequences first (guide tree) • Align sequences sequentially, using the phylogenetic tree • PILEUP • Similar to CLUSTALW • Uses UPGMA to produce tree (chapter 6) MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Progressive Approaches MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Problems with Progressive Alignments • MSA depends on pairwise alignments • If sequences are very distantly related, much higher likelihood of errors • Care must be made in choosing scoring matrices and penalties • Other approaches using Bayesian methods such as hidden Markov models MSCS 230: Bioinformatics I - Multiple Sequence Alignment

When Craig Talks Next • Introduction to Bayesian Statistics • Profile and Block analysis • Expectation Maximization (MEME) • Introduction to HMMs • Multiple sequence alignments using HMMs MSCS 230: Bioinformatics I - Multiple Sequence Alignment

Multiple Sequence Alignments