Trees, Stars, and Multiple Biological Sequence Alignment. Jesse Wolfgang, CSE 497, February 19, 2004
Importance? • Molecular evolution (Dayhoff) • RNA folding (Trifonov, Bolshoi) • Gene regulation (Galas et al.) • Protein structure-function relationships (Wu, Kabat)
Introduction • Original sequence unknown • Must consider all possible transformations • Including insertions, deletions, and replacements • Choose the most likely set of transformations • With a given model of protein evolution
Sequences and Alignments • K-sequence: a sequence of k characters • An alignment of the sequences is written as a set of aligned sequences, one per input sequence • Each aligned sequence is obtained from its original sequence by inserting blanks • Blanks are inserted in positions where some of the other sequences have a nonblank character • At least one aligned sequence must be nonblank at each position • All aligned sequences share the same length
Alignments • Ex: sequences DQLF, DNVQ, QGL
D - - Q - L F
D N V Q - - -
- - - Q G L -
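The alignment conditions can be checked mechanically. The following is a minimal sketch, assuming blanks are written as `-` and each aligned row is stored as a string; `is_valid_alignment` is a hypothetical helper name, not something from the slides.

```python
GAP = "-"  # assumption: blanks are written as a hyphen

def is_valid_alignment(sequences, rows):
    """Check the alignment conditions: equal length, blanks only, no all-blank column."""
    if len(sequences) != len(rows):
        return False
    # All aligned rows must share one length.
    if len({len(row) for row in rows}) != 1:
        return False
    # Removing the blanks from a row must give back the original sequence.
    for seq, row in zip(sequences, rows):
        if row.replace(GAP, "") != seq:
            return False
    # Every column needs at least one nonblank character.
    return all(any(ch != GAP for ch in col) for col in zip(*rows))

seqs = ["DQLF", "DNVQ", "QGL"]
rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(is_valid_alignment(seqs, rows))  # True
```

Appending an all-blank column to every row makes the check fail, which is the case the "at least one nonblank" condition rules out.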
Lattices and Paths • A lattice is built from n sequences, with one axis per sequence sized by that sequence's length • Consists of n-dimensional hypercubes • Cartesian product of strings of squares • Forms an n-dimensional parallelepiped • A path between the sequences is a set of connected line segments (a connected broken line)
Paths out of one corner of an n-dimensional cube = 2^n - 1 = O(2^n) • 2 dimensions: 3 possible paths • 3 dimensions: 7 possible paths
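The count of 3 and 7 can be reproduced by enumerating the moves out of one lattice vertex. This is a small sketch, assuming each move advances every axis by 0 or 1 and the move that advances nothing is excluded; `lattice_steps` is a hypothetical name.

```python
from itertools import product

def lattice_steps(n):
    """All moves out of a vertex: each axis advances by 0 or 1, but not all by 0."""
    return [step for step in product((0, 1), repeat=n) if any(step)]

for n in (2, 3):
    print(n, len(lattice_steps(n)))  # 2 3, then 3 7
```

A 1 in position i means "consume the next character of sequence i"; a 0 means "insert a blank in row i for this column".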
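Paths • Sequences DQLF, DNVQ, QGL • [Figure: 3-dimensional parallelepiped with axes DQLF, DNVQ, and QGL, one sublattice highlighted, and a path whose steps are the alignment columns (D,D,-), (-,N,-), (-,V,-), (Q,Q,Q), (-,-,G), (L,-,L), (F,-,-)]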
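Paths and Sequence Length • Sequences: ABCD, ABD, BCD • Note: the aligned length is at least the length of the longest sequence and at most the sum of all sequence lengths • Here the aligned length is 4, the length of the longest sequence
A B C D
A B - D
- B C D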
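Paths and Sequence Length • Sequences: ABCD, EFGH, IJK • Note: the aligned length is at least the length of the longest sequence and at most the sum of all sequence lengths • Here the aligned length is 11, the sum of the sequence lengths (4 + 4 + 3)
A B C D - - - - - - -
- - - - E F G H - - -
- - - - - - - - I J K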
Projections • Sequences DQLF, DNVQ, QGL • A projection denotes the alignment of two of the sequences induced by the multiple alignment • Projection onto DQLF and QGL:
D Q - L F
- Q G L -
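A projection can be computed by keeping two rows of the multiple alignment and dropping the columns where both are blank. The sketch below assumes the three-row alignment of DQLF, DNVQ, and QGL shown earlier in the deck and uses `-` for blanks; `project` is a hypothetical helper name.

```python
GAP = "-"  # assumption: blanks are written as a hyphen

def project(rows, i, j):
    """Pairwise alignment of rows i and j induced by a multiple alignment."""
    # Keep only the columns where at least one of the two rows is nonblank.
    kept = [(a, b) for a, b in zip(rows[i], rows[j]) if not (a == GAP and b == GAP)]
    top = "".join(a for a, _ in kept)
    bottom = "".join(b for _, b in kept)
    return top, bottom

rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(project(rows, 0, 2))  # ('DQ-LF', '-QGL-')
```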
Optimal Paths • A measure is assigned to each path • It measures the similarity among the sequences, based upon a particular metric • For each measure there is at least one path at which the measure attains its minimum value: the optimal path
Calculating Optimal Paths • Each vertex in L is an end corner of a sublattice • First: compute the score of each of the possible paths for the cube that has a vertex at the original corner • Next: using this information, compute the minimum score to reach the vertices of the cubes adjacent to the original corner • [Figure: lattice with axes DQLF, DNVQ, and QGL]
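The vertex-by-vertex computation can be sketched as a memoized dynamic program over lattice vertices. This is an illustrative sketch, not the presenter's implementation: it assumes a sum-of-pairs column score with the costs used in the deck's later example (1 for two different letters, 2 for a letter against a null, 0 otherwise), and `min_sp_cost`, `pair_cost`, and `column_cost` are hypothetical names.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"  # assumption: nulls are written as a hyphen

def pair_cost(a, b, mismatch=1, gap=2):
    """Assumed costs: 0 for equal symbols (or two nulls), 2 for letter vs null, else 1."""
    if a == b:
        return 0
    if a == GAP or b == GAP:
        return gap
    return mismatch

def column_cost(col):
    """Sum of pairwise costs over one alignment column."""
    return sum(pair_cost(a, b) for a, b in combinations(col, 2))

def min_sp_cost(seqs):
    """Minimum total column cost over all paths from the origin to the far corner."""
    k = len(seqs)
    steps = [s for s in product((0, 1), repeat=k) if any(s)]  # 2^k - 1 moves

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0  # the original corner
        options = []
        for step in steps:
            prev = tuple(p - d for p, d in zip(pos, step))
            if min(prev) < 0:
                continue  # this move would start outside the lattice
            col = [seqs[i][pos[i] - 1] if step[i] else GAP for i in range(k)]
            options.append(best(prev) + column_cost(col))
        return min(options)

    return best(tuple(len(s) for s in seqs))

print(min_sp_cost(["DQLF", "DNVQ", "QGL"]))
```

Each vertex looks back along all 2^k - 1 moves, so the work grows with the product of the sequence lengths times 2^k; this is only practical for a handful of short sequences.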
Problems with This Algorithm • Scores an alignment as a weighted sum of the costs of its projected pairwise alignments • Called "Sum-of-the-Pairs" (SP) • Other methods fit biological intuition more closely
Tree-Alignment • Treat sequences as leaves of an evolutionary tree • Reconstruct ancestral sequences which minimize the cost of the tree • Must assign sequences to internal nodes • Align the given and reconstructed sequences • Star-alignment: only one internal node
Tree-Alignment • Many different methods for calculating tree alignments • Discuss version used by ClustalX
Tree-Alignment in ClustalX • Three main parts • Perform pairwise alignment on all sequences to calculate a distance matrix • Use distance matrix to calculate a guide tree • Sequences are progressively aligned using the branching order in the guide tree http://bimas.dcrt.nih.gov/clustalw/clustalw.html
Calculating Distance Matrix • Use standard dynamic programming to find the best alignment • Gap penalties for opening a gap and continuing a gap (possibly different) • Divide number of matches by total number of residues compared (excluding gaps) • Convert to distances by dividing by 100 and subtracting from 1 • Gives one entry in the n by n matrix
Calculating Distance Matrix • Ex: sequences ATCG, ATCC, AGGC, AGCC
ATCG vs ATCC: 3/4 = .75; .75/100 = .0075; 1 - .0075 = .9925
ATCG vs AGGC: 1/4 = .25; .25/100 = .0025; 1 - .0025 = .9975
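The matrix construction can be sketched as follows. This sketch makes one assumption that differs from the worked numbers in the example: the identity score is read as a percentage (3 matches out of 4 is 75), so dividing by 100 and subtracting from 1 gives 0.25 rather than .9925. It also assumes the pairwise alignments are already given as equal-length strings; `percent_identity` and `distance_matrix` are hypothetical names.

```python
GAP = "-"  # assumption: gaps are written as a hyphen

def percent_identity(a, b):
    """Matches divided by residue pairs compared (gap positions excluded), times 100."""
    compared = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not compared:
        raise ValueError("no residue pairs to compare")
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)

def distance_matrix(seqs):
    """n by n matrix of distances: 1 - (percent identity / 100)."""
    return [[1.0 - percent_identity(a, b) / 100.0 for b in seqs] for a in seqs]

seqs = ["ATCG", "ATCC", "AGGC", "AGCC"]
for row in distance_matrix(seqs):
    print(row)
```

Under this reading, identical sequences are at distance 0 and sequences with no matching residues are at distance 1, so the values spread over the whole unit interval.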
Calculating a Guide Tree • Uses the Neighbor-Joining method to group sequences • Results in an unrooted tree • Branch lengths proportional to estimated divergence • "Mid-point" method used to determine the root • Means of the branch lengths on each side of the root are equal (or approximately equal)
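One grouping step can be sketched with the neighbor-joining criterion of Saitou and Nei, the method the ClustalW documentation names for its guide tree. This is a sketch of a single step only, not the full tree construction and not ClustalX's code; the 5 by 5 distance matrix is made-up illustrative data, and `nj_pick_pair` is a hypothetical name.

```python
def nj_pick_pair(d):
    """Pick the pair to join and the branch lengths from each member to the new node.

    d is a symmetric distance matrix (list of lists) with at least 3 rows.
    """
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: small when i and j are close to each other
            # and far from everything else.
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    length_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return i, j, length_i, length_j

# Illustrative distances only (not taken from the slides).
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(d))  # (0, 1, 2.0, 3.0)
```

Note that the chosen pair (0, 1) is not the closest pair in the matrix (that is 3 and 4 at distance 3), which is how neighbor joining differs from simply grouping nearest neighbors.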
Calculating a Guide Tree • [Figure: guide tree with sequence labels AGAA, GCAA, AGCC, AGGC, ATCG, ATCT, ATCG and branch labels 1/3, 1, 1.6599, .9975/2, .9975, .9925, .9925. Listed values: ATCG = 1.8245, ATCT = 1.8245, AGGC = 1.3308, GCAA = 1]
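Calculating a Guide Tree • [Figure: guide tree with sequence labels AGCC, ATCG, AGAA, ATCT, ATCG, AGGC, GCAA and branch labels 1, 1, .9975/2, .9975/2, .9925, .9925. Listed values: ATCG = 1.4911, ATCT = 1.4911, AGGC = 1.4986, GCAA = 1.4986; the two sides of the root are labeled 1.4911 and 1.4986]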
Progressive Alignment • Perform a series of pairwise alignments • Slowly align larger and larger groups of sequences • Follow the branching order of the tree • From leaves to root
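The leaves-to-root order can be sketched as a post-order walk of the guide tree. This sketch only produces the order in which groups would be aligned, not the alignments themselves; the tree is written as nested pairs, its topology here is hypothetical (the deck's figure is not recoverable), and `merge_order` is a hypothetical name.

```python
def merge_order(tree):
    """Return the list of (left group, right group) merges, from leaves to root."""
    merges = []

    def visit(node):
        if isinstance(node, str):
            return [node]  # a leaf is a group of one sequence
        left, right = node
        a, b = visit(left), visit(right)
        merges.append((a, b))  # both subtrees are finished, so align them now
        return a + b

    visit(tree)
    return merges

# Hypothetical topology, for illustration only.
tree = (("ATCG", "ATCT"), ("AGGC", "GCAA"))
for left, right in merge_order(tree):
    print(left, "+", right)
```

Each merge would be one pairwise alignment step: sequence against sequence near the leaves, group against group closer to the root.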
Progressive Alignment • [Figure: guide tree with sequence labels ATCT, ATCG, AGGC, GCAA, AGCC, ATCG, AGAA]
Alignment Costs • Input sequences: A, A, A, C, C
Method | Input seq | Reconstructed seq | Mismatches
Traditional (SP) | A, A, A, C, C | -- | 6
Tree-Alignment | A, A, A, C, C | A, A, C | 1
Star-Alignment | A, A, A, C, C | A | 2
Alignment Inconsistencies • Different definitions of multiple alignments can yield different optimal alignments • Optimal tree-alignments minimize number of mutations from theorized common ancestors • SP-alignments maximize number of positions where aligned sequences agree • Sometimes makes more biological sense since certain regions of proteins more likely to mutate
Alignment Inconsistencies • Ex: cost of 1 for aligning two different letters, cost of 2 for aligning a letter with a null • Sequences: ACC, ACC, TCT, ATCT
Traditional (SP), input sequences:
- A C C
- A C C
- T C T
A T C T
Reconstructed sequences: --
Star-Alignment, input sequences:
A C C -
A C C -
T C T -
A T C T
Reconstructed sequences: A C C -
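The two alignments in this example can be scored under both definitions to see them disagree. The sketch below uses the stated costs (1 for two different letters, 2 for a letter against a null) and assumes, for the star cost, that the single reconstructed sequence is chosen column by column from the letters A, C, G, T or a null; `sp_cost` and `star_cost` are hypothetical names.

```python
from itertools import combinations

GAP = "-"  # assumption: nulls are written as a hyphen

def pair_cost(a, b):
    """0 for equal symbols, 2 for a letter against a null, 1 for two different letters."""
    if a == b:
        return 0
    if GAP in (a, b):
        return 2
    return 1

def sp_cost(rows):
    """Sum of pairwise costs over every pair of rows in every column."""
    return sum(pair_cost(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def star_cost(rows, alphabet="ACGT"):
    """Cost against the best single reconstructed sequence, picked per column."""
    return sum(
        min(sum(pair_cost(center, x) for x in col) for center in alphabet + GAP)
        for col in zip(*rows)
    )

first = ["-ACC", "-ACC", "-TCT", "ATCT"]   # listed under Traditional (SP)
second = ["ACC-", "ACC-", "TCT-", "ATCT"]  # listed under Star-Alignment

print(sp_cost(first), sp_cost(second))      # 14 15
print(star_cost(first), star_cost(second))  # 6 5
```

The SP score prefers the first alignment (14 against 15) while the star score prefers the second (5 against 6), so the ranking of these two alignments depends on which definition is used.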
ClustalX Demo • Multiple sequence alignment program • For more information on ClustalX • http://www.at.embnet.org/embnet/progs/clustal/clustalx.htm