Multiple Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct. 4, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign
Outline
• Motivation
• Scoring of multiple sequence alignments
• Algorithms
  • Dynamic programming
  • Progressive alignment (next class)
Why Multiple Alignments?
• Characterize protein families: identify shared regions of homology in a multiple sequence alignment
• Determine the consensus sequence of several aligned sequences
• Help predict the secondary and tertiary structures of new sequences
• Help predict the function of new sequences
• Serve as a preliminary step in molecular evolution analysis using phylogenetic trees
Example of Multiple Alignment (figure: multiple sequence alignment of 7 neuroglobins using ClustalX; slide from Craig A. Struble)
4 Basic Questions in Multiple Alignment
• Input: N sequences X1 = x11,…,x1m1; X2 = x21,…,x2m2; …; XN = xN1,…,xNmN
• Possible alignments of all Xi's: A = {a1,…,ak}
• Model: scoring function s: A → ℝ
• Find the best alignment(s) a*, i.e., those with the highest score; e.g., s(a*) = 21
• Q1: How should we define s?
• Q2: How should we define A?
• Q3: How can we find a* quickly?
• Q4: Is the alignment biologically meaningful?
Defining Multi-Sequence Alignment
• We may generalize our definition of pairwise sequence alignment
• Alignment of 2 sequences is represented as a 2-row matrix
• In a similar way, we represent alignment of 3 sequences as a 3-row matrix
A T _ G C G _
A _ C G T _ A
A T C A C _ A
• A column must have at least one nucleotide
• Question: How many possible global alignments are there for 3 sequences each of length 2?
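The question on this slide can be explored with a short counting sketch. It assumes that two alignments are different whenever their column sequences differ, so that counting alignments means counting paths through the alignment grid in which every step consumes a letter from at least one sequence. The function name `count_alignments` is hypothetical and not part of the lecture.

```python
from functools import lru_cache
from itertools import product


def count_alignments(lengths):
    """Count global alignments of sequences with the given lengths.

    Assumption: an alignment is a sequence of columns, and every column
    holds at least one letter (no all-gap columns).
    """
    k = len(lengths)
    # Each column consumes a letter from a non-empty subset of the sequences.
    moves = [d for d in product((0, 1), repeat=k) if any(d)]

    @lru_cache(maxsize=None)
    def paths(cell):
        if not any(cell):
            return 1  # the origin: exactly one (empty) alignment
        total = 0
        for d in moves:
            prev = tuple(c - step for c, step in zip(cell, d))
            if min(prev) >= 0:
                total += paths(prev)
        return total

    return paths(tuple(lengths))


print(count_alignments((2, 2)))     # two sequences of length 2
print(count_alignments((2, 2, 2)))  # three sequences of length 2
```

Under this counting convention the pairwise case gives 13 and the three-sequence case gives 409, which already hints at how quickly the search space grows.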
Scoring a Multiple Alignment
• Ideally, it should be based on evolutionary models
• In practice,
  • We often assume columns are independent
  • Use “Sum of Pairs” (SP) scores; G is the gap score
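A minimal sketch of sum-of-pairs scoring under the column-independence assumption follows. The numeric values (match +1, mismatch -1, gap -1, gap paired with gap 0) are illustrative assumptions, not the lecture's scoring scheme, and the gap cost is folded into the pairwise scores here rather than kept as a separate term G. The names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations

GAP = "_"


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols taken from the same column (assumed toy values)."""
    if a == GAP and b == GAP:
        return 0  # assumption: a gap aligned with a gap costs nothing
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sp_score(rows):
    """Sum-of-pairs score: add up pair_score over all row pairs in every column."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


# A 3-row alignment with 7 columns.
rows = ["AT_GCG_", "A_CGT_A", "ATCAC_A"]
print(sp_score(rows))
```

Because columns are scored independently, the total is just a sum over columns, which is what makes a dynamic programming formulation possible.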
Minimum Entropy Scoring
• Intuition: A perfectly aligned column has one single symbol (least uncertainty); a poorly aligned column has many distinct symbols (high uncertainty)
• The score is computed from the count of symbol a in column i
• This is related to the HMM formulation of the alignment problem, which we will cover later
Entropy: Example (figure: best-case and worst-case columns)
Entropy of an Alignment: Example
• Column entropy: -(pA log pA + pC log pC + pG log pG + pT log pT), with logarithms base 2 and 0*log 0 taken as 0
• Column 1 = -[1*log(1) + 0*log 0 + 0*log 0 + 0*log 0] = 0
• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log 0 + 0*log 0] = -[(1/4)*(-2) + (3/4)*(-0.415)] = +0.811
• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2
• Alignment entropy = 0 + 0.811 + 2 = +2.811
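The arithmetic above can be checked with a small sketch. The four-row alignment below is an assumption chosen to reproduce the three column profiles in the example (one symbol only, a 1/4 vs 3/4 split, and four distinct symbols); it is not copied from the slide. Gap symbols are not treated specially here, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Entropy (base 2) of the symbol frequencies in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) equals -p * log2(p); absent symbols contribute nothing.
    return sum((c / total) * log2(total / c) for c in counts.values())


def alignment_entropy(rows):
    """Sum of column entropies, assuming columns are independent."""
    return sum(column_entropy(column) for column in zip(*rows))


# Assumed alignment: column 1 = AAAA, column 2 = ACCC, column 3 = ACGT.
rows = ["AAA", "ACC", "ACG", "ACT"]
for index, column in enumerate(zip(*rows), start=1):
    print(index, round(column_entropy(column), 3))
print(round(alignment_entropy(rows), 3))
```

Lower totals indicate better-conserved columns, so an alignment search under this score would minimize the value rather than maximize it.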
How can we find a multiple alignment quickly? Can we generalize the dynamic programming algorithm used for pairwise alignment?
Alignments = Paths in…
• Align 3 sequences: ATGC, AATC, ATGC
Alignment Paths
• Align the following 3 sequences: ATGC, AATC, ATGC
• The x, y, and z coordinates count how many letters of the first, second, and third sequence have been used so far
• Resulting path in (x,y,z) space: (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
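The correspondence between an alignment and its path can be sketched in a few lines. The three aligned rows below are reconstructed from the path listed above (they are an assumption, since the slide's figure is not available), and `alignment_to_path` is a hypothetical helper name.

```python
GAP = "_"


def alignment_to_path(rows):
    """Convert aligned rows into the grid path they trace.

    Each coordinate counts the letters consumed so far from one sequence;
    a gap in a column leaves that sequence's coordinate unchanged.
    """
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for r, symbol in enumerate(column):
            if symbol != GAP:
                position[r] += 1
        path.append(tuple(position))
    return path


# Assumed alignment of ATGC, AATC, ATGC that yields the path above.
rows = ["A_TGC", "AAT_C", "_ATGC"]
print(alignment_to_path(rows))
```

Every step changes each coordinate by 0 or 1 and changes at least one of them, which is why a column with no nucleotide is not allowed.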
2-D vs 3-D Alignment Grid (figure: the 2-D edit graph for sequences V and W, and its 3-D counterpart)
Architecture of 3-D Alignment Grid
• In 2-D, 3 edges in each unit square
• In 3-D, 7 edges in each unit cube
A Cell of 3-D Alignment Grid
• Vertices of the unit cube ending at (i,j,k): (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1), (i,j,k)
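Multiple Alignment: Dynamic Programming
$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$
• δ(x, y, z) is an entry in the 3-D scoring matrix and can be computed using sum of pairs or entropy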
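The recurrence can be sketched directly as a triple loop over the grid. This is a minimal illustration, not the lecture's implementation: the column score `delta` is an assumed sum-of-pairs function with toy values (match +1, mismatch -1, gap -1, gap paired with gap 0), and the names `delta` and `align3` are hypothetical.

```python
from itertools import combinations, product

GAP = "_"


def delta(column, match=1, mismatch=-1, gap=-1):
    """Assumed sum-of-pairs column score with toy values."""
    total = 0
    for a, b in combinations(column, 2):
        if GAP in (a, b):
            total += 0 if a == b else gap
        else:
            total += match if a == b else mismatch
    return total


def align3(v, w, u):
    """Return (best score, aligned rows) for three sequences."""
    n, m, p = len(v), len(w), len(u)
    # The 7 predecessor offsets: every non-zero 0/1 vector (di, dj, dk).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    s = {(0, 0, 0): 0}
    back = {}
    # Lexicographic order guarantees predecessors are filled in first.
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # predecessor lies outside the grid
            column = (
                v[pi] if di else GAP,
                w[pj] if dj else GAP,
                u[pk] if dk else GAP,
            )
            candidate = s[(pi, pj, pk)] + delta(column)
            if best is None or candidate > best:
                best = candidate
                back[(i, j, k)] = ((pi, pj, pk), column)
        s[(i, j, k)] = best
    # Trace back from the far corner to recover the columns.
    columns = []
    cell = (n, m, p)
    while cell != (0, 0, 0):
        cell, column = back[cell]
        columns.append(column)
    columns.reverse()
    rows = ["".join(c[r] for c in columns) for r in range(3)]
    return s[(n, m, p)], rows


score, rows = align3("ATGC", "AATC", "ATGC")
print(score)
print("\n".join(rows))
```

Ties between equally good predecessors are broken by whichever is tried first, so other optimal alignments may exist for the same score.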
Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is 7n^3; O(n^3)
• For k sequences, building a k-dimensional edit graph has run time (2^k - 1)(n^k); O(2^k n^k)
• Conclusion: the dynamic programming approach for aligning two sequences is easily extended to k sequences, but it is impractical because the running time is exponential in k
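To get a feel for these numbers, the sketch below enumerates the incoming edges of a grid cell and multiplies by the n^k cell count used on the slide. The helper names are hypothetical, and the sequence length of 100 is an arbitrary illustrative choice.

```python
from itertools import product


def predecessor_moves(k):
    """All non-zero 0/1 offset vectors: the 2^k - 1 edges entering a cell."""
    return [d for d in product((0, 1), repeat=k) if any(d)]


def dp_edge_evaluations(k, n):
    """Edge evaluations for k sequences of length n, using the slide's (2^k - 1) * n^k estimate."""
    return len(predecessor_moves(k)) * n ** k


for k in (2, 3, 5, 10):
    print(k, len(predecessor_moves(k)), dp_edge_evaluations(k, 100))
```

Even at a modest length of 100, the count grows by a factor of more than 100 with every added sequence, which is the practical obstacle the conclusion refers to.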
In the next class, we will cover a more efficient approach: progressive alignment.
What You Should Know
• How to score a multi-sequence alignment
• How the dynamic programming algorithm works
• Computational complexity of dynamic programming algorithms