Bioinformatics and Molecular evolution Paul G. Higgs and Teresa K. Attwood Chapter 6. Sequence alignment algorithms Jeong Da Geum
Chapter preview
█ Dynamic programming algorithms
• Gap cost functions
• Global and local alignments
• Sensitivity of the results to the scoring parameters used
• Heuristic algorithms for multiple sequence alignment
6.1 What is an Algorithm?
█ Alignments
• show us which parts of a sequence are variable and which are conserved
• indicate the positions of insertions and deletions
• allow us to identify important functional motifs
• are the starting point for evolutionary studies using phylogenetic methods
█ An algorithm is a series of instructions that explains how to solve a particular problem
• Example: see Figure 6.1
• Important property: the trade-off between time and memory
█ Insertion sort
• Example: the first element is 71 and the second is 22; since 22 (second) < 71 (first), the two are swapped
• N: the number of elements to sort
• Time taken: proportional to N^2
• An inefficient algorithm for sorting large arrays
█ Counting sort
• N: the number of elements
• Memory: 2N for two storage arrays of size N, plus N to store the data
• Total: 3N, i.e. three times the memory needed for the data alone
Fig. 6.1 (a) Sorting a list of numbers into ascending order. (b) The traveling salesman problem.
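The time/memory trade-off between the two sorting methods can be sketched in Python. This is a minimal illustration, not the book's code: `counting_sort` is the standard textbook variant for non-negative integers, which may differ in detail from the one on the slide, and the sample list beyond 71 and 22 is made up.

```python
def insertion_sort(values):
    """Return a sorted copy; the nested loops make the time grow as N^2."""
    out = list(values)
    for i in range(1, len(out)):
        current = out[i]
        j = i - 1
        # Shift larger elements one place right until current fits.
        while j >= 0 and out[j] > current:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = current
    return out


def counting_sort(values):
    """Return a sorted copy of non-negative integers using extra storage."""
    if not values:
        return []
    counts = [0] * (max(values) + 1)  # extra memory traded for speed
    for v in values:
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)
    return out


data = [71, 22, 5, 22, 90, 13]  # hypothetical input; 71 and 22 are from the slide
print(insertion_sort(data))
print(counting_sort(data))
```

Insertion sort needs almost no extra memory but many comparisons; counting sort makes a single pass over the data at the cost of additional arrays.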
█ The traveling salesman problem (TSP), Fig. 6.1(b)
• Exact enumeration: the number of cases to be considered is N!, so the time taken scales as N!
• N! grows factorially; α^N grows exponentially
• Polynomial-time algorithms scale as N^β; the smaller β, the better
• Best algorithm: β = 2
• TSP is NP-complete (NP stands for nondeterministic polynomial)
• To get a good, approximate answer, heuristic algorithms are used
• Greedy algorithm: short-sighted; its result can be improved by better heuristic algorithms
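To make the contrast between exact enumeration and a greedy heuristic concrete, here is a small Python sketch. The function names and the city coordinates are illustrative assumptions; the greedy rule shown is nearest-neighbour, one common choice of greedy heuristic.

```python
import itertools
import math


def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(math.dist(points[order[k]], points[order[(k + 1) % n]])
               for k in range(n))


def exact_tsp(points):
    """Try every tour starting at city 0: (N-1)! cases, feasible only for small N."""
    best_order = min(((0,) + perm
                      for perm in itertools.permutations(range(1, len(points)))),
                     key=lambda order: tour_length(points, order))
    return tour_length(points, best_order), best_order


def greedy_tsp(points):
    """Nearest-neighbour heuristic: always move to the closest unvisited city."""
    order = [0]
    unvisited = set(range(1, len(points)))
    while unvisited:
        here = points[order[-1]]
        nxt = min(sorted(unvisited), key=lambda c: math.dist(here, points[c]))
        order.append(nxt)
        unvisited.remove(nxt)
    return tour_length(points, order), tuple(order)


cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # hypothetical coordinates
print(exact_tsp(cities))
print(greedy_tsp(cities))
```

The greedy tour is never shorter than the exact one; for larger, less regular sets of cities it is usually somewhat longer, which is the price of avoiding the factorial search.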
6.2 Pairwise Sequence Alignment: The Problem
█ Pairwise alignment of two DNA sequences
• Sequence a: CAGT-AGATATTTACGGCAGTATC----
• Sequence b: CAATCAGGATTTT--GGCAGACTGGTTG
█ Scoring system
• 1) Define a score S(α,β) for aligning character α with character β
  - DNA: same nucleotide: S(α,β) = 1; different nucleotides: S(α,β) = 0
  - Protein: high positive score for identical amino acids; slightly positive score for similar amino acids (D and E, or I and L); slightly negative score for dissimilar amino acids (D and I)
  - Scoring matrices: PAM, BLOSUM
• 2) Penalty for gaps
  - W(l): the penalty for a gap of length l characters
  - W(l) = gl: each individual gap character has a cost g, so the penalty is a linear function of l
• Mutational process behind the alignment
  - Errors in DNA replication, followed by selection
  - e.g. an indel often makes a protein non-functional, so surviving indels tend to lie in loop regions
  - In the example above, a mismatch may be a change from G to A or from A to G; a gap may be an insertion of C or a deletion
█ A score for any pairwise alignment
$$\text{Score} = \sum_{\text{aligned pairs}} S(\alpha,\beta) \; - \sum_{\text{gaps}} W(l) \qquad (6.1)$$
█ Affine gap penalty function
• g_open: the penalty for opening a new gap
• g_ext: the penalty for extending the gap for each subsequent step
$$W(l) = g_{\mathrm{open}} + g_{\mathrm{ext}}\,(l-1)$$
Fig. 6.2 Gap cost functions W(l): linear, affine, and general gap functions are shown.
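As a worked instance of Eq. (6.1), the sketch below scores the example alignment from Section 6.2 under a linear and an affine gap cost. The function names and the parameter values (g = 1, g_open = 2, g_ext = 0.5) are assumptions chosen for illustration, and columns with a gap in both rows are assumed not to occur.

```python
def dna_score(x, y):
    """S(alpha, beta) for DNA: 1 for the same nucleotide, 0 otherwise."""
    return 1 if x == y else 0


def linear_gap(length, g=1):
    """W(l) = g * l."""
    return g * length


def affine_gap(length, g_open=2, g_ext=0.5):
    """W(l) = g_open + g_ext * (l - 1)."""
    return g_open + g_ext * (length - 1)


def score_alignment(row_a, row_b, pair_score, gap_cost):
    """Sum of pair scores minus W(l) for every run of '-' (Eq. 6.1)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    run_a = run_b = 0  # lengths of the gap runs currently open in each row
    for x, y in zip(row_a, row_b):
        if x != "-" and run_a:
            total -= gap_cost(run_a)
            run_a = 0
        if y != "-" and run_b:
            total -= gap_cost(run_b)
            run_b = 0
        if x == "-":
            run_a += 1
        elif y == "-":
            run_b += 1
        else:
            total += pair_score(x, y)
    for run in (run_a, run_b):  # close gaps that reach the end
        if run:
            total -= gap_cost(run)
    return total


row_a = "CAGT-AGATATTTACGGCAGTATC----"
row_b = "CAATCAGGATTTT--GGCAGACTGGTTG"
print(score_alignment(row_a, row_b, dna_score, linear_gap))
print(score_alignment(row_a, row_b, dna_score, affine_gap))
```

The alignment has 14 identical pairs and three gaps of lengths 1, 2, and 4, so only the gap term differs between the two cost functions.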
6.3 Pairwise sequence alignment: Dynamic programming methods
6.3.1 Algorithm 1: Global alignment with linear gap penalty
█ Calculate the scores
• The first sequence has length N1, with characters a_i (1 ≤ i ≤ N1)
• The second sequence has length N2, with characters b_j (1 ≤ j ≤ N2)
• H(i,j) is the score of the best alignment of the first i characters of a with the first j characters of b; the score of the full alignment is H(N1, N2)
█ Needleman-Wunsch algorithm: recursion relation
• Case 1: a_i and b_j are aligned with each other
• Case 2: a_i is aligned with a gap
• Case 3: b_j is aligned with a gap
█ Maximum scores
$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\ H(i-1,j) - g & \text{(vertical)} \\ H(i,j-1) - g & \text{(horizontal)} \end{cases} \qquad (6.2)$$
█ Initial conditions
$$H(i,0) = -gi, \qquad H(0,j) = -gj, \qquad H(0,0) = 0$$
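The recursion and initial conditions of Algorithm 1 translate fairly directly into code. The following Python sketch is one possible implementation under simple assumptions: the function name, the DNA-style scoring (1 for identity, 0 otherwise) and g = 1 are illustrative choices, and ties in the backtracking step are broken in favour of the diagonal move.

```python
def needleman_wunsch(a, b, score, g):
    """Global alignment with linear gap penalty g (recursion of Eq. 6.2)."""
    n1, n2 = len(a), len(b)
    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        H[i][0] = -g * i  # a_1..a_i aligned with gaps
    for j in range(1, n2 + 1):
        H[0][j] = -g * j  # b_1..b_j aligned with gaps
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            H[i][j] = max(H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # diagonal
                          H[i - 1][j] - g,                              # vertical
                          H[i][j - 1] - g)                              # horizontal
    # Backtrack from H(N1, N2) to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n1, n2
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return H[n1][n2], "".join(reversed(row_a)), "".join(reversed(row_b))


def identity(x, y):
    return 1 if x == y else 0


print(needleman_wunsch("ACGT", "AGT", identity, 1))
```

For the short example, the single gap is placed opposite the C, giving three identical pairs and one gap character.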
Fig. 6.3 Pairwise alignment of SHAKE and SPEARE
• (a) Pairwise amino acid scores taken from the PAM250 matrix.
• (b) Alignment scores H(i,j) using Algorithm 1 and g = 6.
• (c) Alignment scores H(i,j) using Algorithm 2 and g = 6.
• (d) The pathways through the matrix corresponding to the optimal alignments in (b) and (c) are indicated by the thick arrows.
█ Worked example with g = 6
• Initial conditions: aligning SHA with three gaps gives H(3,0) = -18; aligning two gaps with SP gives H(0,2) = -12
• Cell H(1,1), aligning S with S, has three options:
  - S over S (diagonal): 0 + 2 = 2
  - -S over S- : -6 - 6 = -12
  - S- over -S : -6 - 6 = -12
• The maximum is taken, so H(1,1) = 2
• Cell H(1,2), aligning S with SP, has three options:
  - -S over SP (diagonal, S aligned with P): -6 + 1 = -5
  - --S over SP- (vertical): -12 - 6 = -18
  - S- over SP (horizontal): 2 - 6 = -4
• The maximum is taken, so H(1,2) = -4
• The score of the optimal global alignment is the final cell, H(5,6)
• Backtracking result:
  S-HAKE
  SPEARE
• Algorithm 1 is a practical way of aligning long sequences. Memory: 3N^2; time: proportional to N^2
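The first two cells of the SHAKE/SPEARE matrix can be checked with a few lines of Python. This sketch uses only the two PAM250 values quoted on the slides, S(S,S) = 2 and S(S,P) = 1, together with g = 6; the dictionary layout and the helper name `fill_cell` are illustrative assumptions.

```python
g = 6
# Only the PAM250 entries quoted on the slides; not the full matrix.
S = {("S", "S"): 2, ("S", "P"): 1}

# Initial conditions H(i,0) = -g*i and H(0,j) = -g*j for the cells needed here.
H = {(0, 0): 0, (1, 0): -g, (0, 1): -g, (0, 2): -2 * g}


def fill_cell(i, j, a_i, b_j):
    """Apply the Algorithm 1 recursion to a single cell."""
    diagonal = H[(i - 1, j - 1)] + S[(a_i, b_j)]
    vertical = H[(i - 1, j)] - g
    horizontal = H[(i, j - 1)] - g
    return max(diagonal, vertical, horizontal)


H[(1, 1)] = fill_cell(1, 1, "S", "S")  # options: 2, -12, -12
H[(1, 2)] = fill_cell(1, 2, "S", "P")  # options: -5, -18, -4
print(H[(1, 1)], H[(1, 2)])
```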
6.3.2 Algorithm 2: Local alignment with linear gap penalty
█ When local alignment is needed
• Case 1: two sequences that share only one common domain
• Case 2: a fragment of a gene from one species versus the complete gene from another species
█ The simplest local alignment algorithm
• It just adds a fourth option to Algorithm 1
• If the best of the three scores is negative, we assign 0 to that cell
$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\ H(i-1,j) - g & \text{(vertical)} \\ H(i,j-1) - g & \text{(horizontal)} \\ 0 & \text{(start again)} \end{cases}$$
█ Local alignments found in the resulting matrix for SHAKE and SPEARE
• SHAKE aligned with PEARE scores 11
• SHA aligned with ARE scores 4
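A minimal Python sketch of the local recursion (the form usually attributed to Smith and Waterman) is given below. The scoring values (+1 match, -1 mismatch, g = 2) and the example sequences are assumptions for illustration; a negative mismatch score is used so that the "start again" option actually matters.

```python
def local_alignment(a, b, score, g):
    """Best local alignment with linear gap penalty g; cells never drop below 0."""
    n1, n2 = len(a), len(b)
    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]  # local version: borders are 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            H[i][j] = max(H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i - 1][j] - g,
                          H[i][j - 1] - g,
                          0)  # fourth option: start again
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the highest-scoring cell until a zero is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


def match_mismatch(x, y):
    return 1 if x == y else -1


print(local_alignment("ACGT", "TTACGTTT", match_mismatch, 2))
```

Unlike the global version, the alignment starts and ends wherever the score is highest, so unrelated flanking regions are simply left out.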
6.3.3 Algorithm 3: General gap penalty
█ Maximum scores for a general gap function W(l)
$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\ \max\limits_{1 \le l \le i} \big( H(i-l,j) - W(l) \big) & \text{(vertical)} \\ \max\limits_{1 \le l \le j} \big( H(i,j-l) - W(l) \big) & \text{(horizontal)} \end{cases} \qquad (6.4)$$
• With W(l) = gl this reduces to Algorithm 1
• Each cell needs of order N calculations and the matrix has N^2 cells, so the time scales as N^3
• In the global version, the initial conditions are H(i,0) = -W(i), H(0,j) = -W(j)
• In the local version, the initial conditions are H(i,0) = H(0,j) = 0
• The gap function is assumed to satisfy W(l1 + l2) ≤ W(l1) + W(l2)
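The extra inner loop over gap lengths is what raises the cost to N^3. The sketch below is a score-only global version in Python; the function name, the identity scoring, and the two example gap functions are illustrative assumptions.

```python
def global_general_gap(a, b, score, W):
    """Global alignment score for an arbitrary gap function W(l) (Eq. 6.4)."""
    n1, n2 = len(a), len(b)
    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        H[i][0] = -W(i)
    for j in range(1, n2 + 1):
        H[0][j] = -W(j)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])  # diagonal
            for l in range(1, i + 1):  # vertical gap of length l
                best = max(best, H[i - l][j] - W(l))
            for l in range(1, j + 1):  # horizontal gap of length l
                best = max(best, H[i][j - l] - W(l))
            H[i][j] = best
    return H[n1][n2]


def identity(x, y):
    return 1 if x == y else 0


def linear(l):
    return 1 * l  # W(l) = g*l with g = 1


def affine(l):
    return 2 + 0.5 * (l - 1)  # g_open = 2, g_ext = 0.5


print(global_general_gap("ACGT", "AGT", identity, linear))
print(global_general_gap("ACGGT", "AT", identity, affine))
```

With the linear function the result matches Algorithm 1; with the affine function a single gap of length 3 costs 3 rather than 6 under a per-character cost of 2.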
6.3.4 Algorithm 4: Affine gap penalty
█ Smith-Waterman alignment
$$W(l) = g_{\mathrm{open}} + g_{\mathrm{ext}}\,(l-1)$$
• M(i,j): the score of the best alignment up to point i on the first sequence and point j on the second, given that a_i and b_j are aligned with each other
• I(i,j): the score of the best alignment up to this point, given that a_i is aligned with a gap
• J(i,j): the score of the best alignment up to this point, given that b_j is aligned with a gap
█ Maximum scores
$$H(i,j) = \max\big( M(i,j),\; I(i,j),\; J(i,j) \big)$$
$$M(i,j) = \max \begin{cases} M(i-1,j-1) + S(a_i,b_j) \\ I(i-1,j-1) + S(a_i,b_j) \\ J(i-1,j-1) + S(a_i,b_j) \end{cases} \qquad (6.7)$$
$$I(i,j) = \max \begin{cases} M(i-1,j) - g_{\mathrm{open}} \\ I(i-1,j) - g_{\mathrm{ext}} \end{cases} \qquad (6.8)$$
$$J(i,j) = \max \begin{cases} M(i,j-1) - g_{\mathrm{open}} \\ J(i,j-1) - g_{\mathrm{ext}} \end{cases} \qquad (6.9)$$
• Local and global versions of the algorithm exist (Durbin et al., 1998)
█ Example alignments of SHAKE and SPEARE
• (i)
  S--HAKE
  SPE-ARE
• (ii)
  S-HAKE
  SPEARE
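The three-matrix recursion of Eqs. (6.7) to (6.9) can be sketched in Python as follows. This is a score-only global version; the function name, the boundary handling (unreachable states set to minus infinity, leading gaps charged W(l)), and the parameter values are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def global_affine(a, b, score, g_open, g_ext):
    """Global alignment score with W(l) = g_open + g_ext*(l-1), using M, I, J."""
    n1, n2 = len(a), len(b)
    M = [[NEG_INF] * (n2 + 1) for _ in range(n1 + 1)]
    I = [[NEG_INF] * (n2 + 1) for _ in range(n1 + 1)]
    J = [[NEG_INF] * (n2 + 1) for _ in range(n1 + 1)]
    M[0][0] = 0
    for i in range(1, n1 + 1):  # a_1..a_i aligned with one leading gap
        I[i][0] = -(g_open + g_ext * (i - 1))
    for j in range(1, n2 + 1):  # b_1..b_j aligned with one leading gap
        J[0][j] = -(g_open + g_ext * (j - 1))
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            s = score(a[i - 1], b[j - 1])
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1]) + s
            I[i][j] = max(M[i - 1][j] - g_open, I[i - 1][j] - g_ext)
            J[i][j] = max(M[i][j - 1] - g_open, J[i][j - 1] - g_ext)
    return max(M[n1][n2], I[n1][n2], J[n1][n2])


def identity(x, y):
    return 1 if x == y else 0


print(global_affine("ACGT", "AGT", identity, 2, 0.5))
print(global_affine("ACGGT", "AT", identity, 2, 0.5))
```

Because each cell looks only at its immediate neighbours, the running time stays proportional to N^2, in contrast to the N^3 of the general gap function.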
6.4 The Effect of Scoring Parameters on the Alignment
█ Exact algorithms and correct alignments
• 1. The highest-scoring alignment is not necessarily the correct alignment; what we want is an evolutionarily meaningful alignment
• 2. Different sets of scoring matrices exist for amino acids: PAM and BLOSUM
• Q: Which one is the best?
• A: Look at the alignments and use some biological intuition
█ The value of the gap penalties
• 1. Gap penalties play an important role in the alignment
  - Example: hexokinase, the enzyme in the glycolytic pathway that converts glucose to glucose-6-phosphate
• 2. Sequences: Table 6.1; ClustalX program, pairwise alignment, Gonnet 250 matrices, g_open = 10, g_ext = 0.1
  - Result: Fig. 6.4(a)
Fig. 6.4 Global pairwise alignments of hexokinase proteins from human and Schistosoma mansoni using an affine gap penalty function. The parameter sets used for the three alignments differ only in the value of the gap opening parameter. Regions of alignments (b) and (c) that differ from alignment (a) are written in bold.
• (a) g_open = 10
• (b) g_open = 50: loss of a large number of identical pairs
• (c) g_open = 2: gain of a number of additional short gaps
█ Q: Which one is the best alignment?
• A: It seems rather subjective
• Compared with (a), alignment (b) has more non-identical pairs and alignment (c) has more gaps, so the best may be (a)
• Solutions: try a different scoring system; check the alignment by eye (e.g. with CINEMA)
6.5 Multiple Sequence Alignment
6.5.1 The progressive alignment method
■ Dynamic programming algorithm
• Has been proposed for aligning a few sequences
• The recursion relation is more complicated and the running time is long (N^S for S sequences of length N)
• Very slow, not practical
■ Progressive alignment
• A method for multiple sequence alignment
• Used for aligning families of sequences that are evolutionarily related
• Principle:
  1) Construct an approximate phylogenetic tree
  2) Build up the alignment by progressively adding sequences in the order specified by the tree
■ Fig. 6.5 Phylogenetic tree of hexokinase sequences from human, rat, Schistosoma mansoni, Drosophila melanogaster, Saccharomyces cerevisiae, and Plasmodium falciparum. This tree is produced by Clustal and used as a guide tree during progressive multiple alignment.
• In mammals, Drosophila, and yeast, the sequences within each of the three groups are more closely related to each other than to those of the other groups, which points to independent gene-duplication events
• In mammals, the duplication occurred before the divergence of human and rat
■ Forming clusters
• Example: yeast HXKG is aligned with the cluster (HXKA, HXKB)
• Clusters are aligned with the same algorithm as in pairwise alignment
■ The score
• The score for aligning a column containing P and A with a column containing I and R is the average of the pairwise scores:
$$\frac{S(P,I) + S(P,R) + S(A,I) + S(A,R)}{4}$$
■ Example
• Pairwise alignment:
  S-HAKE
  SPEARE
• Two ways of adding THEATRE:
  S-HA-KE
  SPEA-RE
  THEATRE

  SH-A-KE
  SPEA-RE
  THEATRE
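The averaging rule for aligning two columns can be written in a few lines of Python. In this sketch the function name is hypothetical, a simple identity score stands in for a real amino acid matrix such as PAM250, and gap characters inside columns are not handled.

```python
from itertools import product


def column_score(column_x, column_y, score):
    """Average of S(p, q) over every residue p in one column and q in the other."""
    pairs = list(product(column_x, column_y))
    return sum(score(p, q) for p, q in pairs) / len(pairs)


def identity(p, q):
    # Placeholder scoring: 1 for identical residues, 0 otherwise.
    return 1 if p == q else 0


print(column_score("PA", "IR", identity))  # (S(P,I)+S(P,R)+S(A,I)+S(A,R))/4
print(column_score("AA", "AR", identity))
```

Once every pair of columns has such a score, two clusters can be aligned with the same dynamic programming recursion used for two single sequences.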
■ Progressive alignment
1) Construct a guide tree
• Carry out pairwise alignments
• Calculate the distances: the fraction D of non-gap sites that differ between the two sequences
• Alternative distance (Feng & Doolittle, 1987):
$$d = -100 \ln\!\left( \frac{S - S_{\mathrm{rand}}}{S_{\mathrm{ident}} - S_{\mathrm{rand}}} \right)$$
  - S_rand: the average score for alignment of two random sequences
  - S_ident: the average score for alignment of two identical sequences
• Obtain a matrix of distances (Chapter 8)
• Use the neighbor-joining method and the midpoint method
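Both distance measures used for the guide tree are easy to compute once a pairwise alignment and its score are available. The Python sketch below uses hypothetical function names and made-up example values; it only evaluates the two formulas and does not build the tree.

```python
import math


def fractional_difference(row_a, row_b):
    """Fraction D of non-gap sites at which two aligned sequences differ."""
    sites = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no non-gap sites to compare")
    return sum(x != y for x, y in sites) / len(sites)


def feng_doolittle_distance(s, s_rand, s_ident):
    """d = -100 ln((S - S_rand) / (S_ident - S_rand))."""
    ratio = (s - s_rand) / (s_ident - s_rand)
    if ratio <= 0:
        raise ValueError("score is no better than random; distance undefined")
    return -100.0 * math.log(ratio)


print(fractional_difference("AC-GT", "ACTGA"))   # 1 of 4 non-gap sites differs
print(feng_doolittle_distance(60, 20, 100))      # hypothetical scores
```

The logarithmic form gives a distance of zero for identical sequences and grows without bound as the alignment score approaches that of random sequences.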
■ Fig. 6.6 Multiple alignment of hexokinase sequences constructed by Clustal using the guide tree in Fig. 6.5. Bold sections illustrate points discussed in the text.
■ Progressive alignment
• Compare the gap positions: the gap between Q and V, and the two gaps between the Ds
• The alignment in Fig. 6.6 differs from that in Fig. 6.4(a): the multiple alignment gives a more reliable alignment with the Schistosoma sequence
• For the two Drosophila and the two yeast sequences, the result depends on the alignment procedure
6.5.2 Improving progressive alignments
■ Clustal software
• Improves the accuracy of the progressive alignment procedure
• Problem with the score: it overemphasizes the influence of closely related sequences
  - Closely related sequences share a lot of evolutionary history
  - Example: 8 mammal sequences versus one Schistosoma sequence
• Weighting scheme: reduces the weight of closely related sequences
• Position-specific gap penalties: in proteins, gaps are favored in loop regions
  - More accurate for closely related sequences
• Progressive alignment time scale: SN^2 for S sequences of length N, which is easily dealt with
■ Two warnings for the Clustal software
• 1. Clustal uses a global alignment algorithm, so it is not appropriate for sequences with very different lengths
• 2. The guide tree is a rough phylogeny; do not treat it as a true phylogeny
  - Clustal has an option to recalculate the neighbor-joining tree from the alignment
  - There are many more sophisticated ways of calculating molecular phylogenies (Chapter 8)
  - Using specialized programs is recommended
■ Solution for progressive multiple alignment
• Recognize that automated multiple sequence alignments are never completely reliable, and that effort must be put into adjusting alignments manually in order to get meaningful results
• Check that active sites and secondary structure elements are correctly aligned
• The complication is the difficulty of constructing a probabilistic model of indels
6.5.3 Recent developments in multiple sequence alignment
■ The divide-and-conquer method
• An alternative heuristic method
• Long sequences -> cut into short segments (segment 1, segment 2, ...) -> align the segments -> recombine into one long alignment
• Choosing where to cut the sequences is important
■ T-Coffee (a library of pairwise alignments)
• Two different pairwise alignment programs are used: one local and one global
• A weight is attached to each pairwise alignment to reflect its reliability
• 1) W_ij: the initial value of the weight for two sequences i and j is set to the percentage identity of i and j
• 2) Library extension: calculate a weight associated with the alignment of each pair of residues in two different sequences
  - Initially W(X,Y) = W_ij for residue X in sequence i aligned with residue Y in sequence j
  - If a residue Z in a third sequence k is aligned with both X and Y, the weight W(X,Y) is increased by the minimum of W_ik and W_jk
• Advantages:
  - No need to introduce any extra gap penalty parameters
  - The weights reflect the information in all the sequences in the set
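The library extension step can be illustrated with a small Python sketch. The data layout is an assumption made here, not T-Coffee's actual implementation: each residue is a `(sequence_name, position)` tuple, the library maps a pair of residues to its weight, only pairs already present in the library are extended, and the weights 80, 60, and 90 are invented percentage identities.

```python
from collections import defaultdict


def extend_library(library):
    """Add min(W(X,Z), W(Z,Y)) to W(X,Y) for every residue Z of a third
    sequence that is aligned with both X and Y in the primary library."""
    neighbours = defaultdict(dict)
    for pair, weight in library.items():
        x, y = tuple(pair)
        neighbours[x][y] = weight
        neighbours[y][x] = weight
    extended = {}
    for pair, weight in library.items():
        x, y = tuple(pair)
        total = weight
        for z, w_xz in neighbours[x].items():
            # z must belong to a third sequence and also be aligned with y.
            if z in neighbours[y] and z[0] not in (x[0], y[0]):
                total += min(w_xz, neighbours[y][z])
        extended[pair] = total
    return extended


X, Y, Z = ("i", 0), ("j", 0), ("k", 0)   # one residue from each of three sequences
primary = {
    frozenset({X, Y}): 80,  # W_ij
    frozenset({X, Z}): 60,  # W_ik
    frozenset({Y, Z}): 90,  # W_jk
}
extended = extend_library(primary)
print(extended[frozenset({X, Y})])  # 80 + min(60, 90)
```

A residue pair that is supported by alignments through other sequences ends up with a higher weight, which is how information from the whole set enters each pairwise comparison.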