510 likes | 614 Views
LSM2104 Part II Lectures 3 and 4. Tan Tin Wee Dept of Biochemistry NUS. Topics. Derivation of PAM and BLOSUM and Scoring Matrices Basic Introduction to Dynamic Programming Techniques Details of Needleman-Wunsch and Smith-Waterman Algoriths FASTA and BLAST Heuristics
E N D
LSM2104Part IILectures 3 and 4 Tan Tin Wee Dept of Biochemistry NUS
Topics • Derivation of PAM and BLOSUM and Scoring Matrices • Basic Introduction to Dynamic Programming Techniques • Details of Needleman-Wunsch and Smith-Waterman Algoriths • FASTA and BLAST Heuristics • Database Similarity Searches and Statistics of the E Value • Multiple Sequence Alignments
PAM • PAM (Percent Accepted Mutations) similarity matrix • PAM matrices pertaining to protein sequences are constructed using amino acid similarities in evolutionarily related sequences. They show probability scores of replacement of amino acids by each other based on natural mutation rates in related protein families. Hence, these matrices are sometimes called "substitution matrices". • A score above zero assigned to two amino acids indicates that these two replace each other more often than expected by chance alone. ie., they are functionally exchangeable. • A negative score (below zero) indicates that the two amino acids are rarely interchangeable. eg., a basic amino acid for an acidic one or one with an aromatic side chain for one with aliphatic side chain. • The number 250 in PAM250 corresponds to an average of 250 amino acid replacements per 100 residues from a data set of 71 aligned sequences [Ref: Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.]. The higher the matrix number, the farther the evolutionary distance between the compared sequences
PAM • Margaret Dayhoff et al (1970s) • First to assemble sequences into protein seq atlas – families and superfamilies based on seq similarity • Tables of frequency of changes/mutations observed in the sequences of a family derived. • Percent amino acid mutations accepted by evolutionary selection or PAM Tables derived. • Shows probability that one amino acid change into any other in these families • Shows which amino acids are most conserved at the corresponding position in two sequences. • Assess the probabilities of residue replacement in related proteins • Model of molecular evolution • 1 PAM – average change in 1% of all amino acid possibilities • 100 PAMs does not mean every residue is changed • Time is not correlated with PAM ie. Different families of proteins evolve at different rates
BLOSUM • BLOSUM (Blocks Substitution Matrix) matrix • These are substitution matrices derived from the observed frequencies of amino acid replacements in highly conserved regions of ungapped local alignments. The data for the substitution scores in these matrices come from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins [Ref: Henikoff, S., and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89: 10915-10919] • Dayhoff-like matrices derive their initial substitution frequencies from global alignments of very similar sequences. An alternative approach has been developed by Henikoff and Henikoff using local multiple alignments of more distantly related sequences First a database of multiple alignments without gaps for short regions of related sequences was derived. Within each alignment in the database, the sequences were clustered into groups where the sequences are similar at some threshold value of percentage identity. Substitution frequencies for all pairs of amino acids were then calculated between the groups and this used to calculate a log odds BLOSUM (blocks substitution matrix) matrix. Different matrices are obtained by varying the clustering threshold. • Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to sum to one.
BLOSUM Scoring Matrices • Block Substitution Matrix • Number indicate percent identity within set eg. BLOSUM62 means 62% identity • Finds short highly similar sequences • If the BLOSUM62 matrix is compared to PAM160 (its closest equivalent) then it is found that the BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches e.g. see http://helix.biology.mcmaster.ca/721/distance/node10.html
Use of these matrices • PAM, BLOSUM, and other matrices • Used as scoring matrices for sequence alignment programs • The BLAST server from NCBI and the search servers from EBI use different versions of the PAM and BLOSUM matrix for protein similarity searches and alignments.
Elements of Dynamic Programming • Dynamic programming is a computational method that is used in bioinformatics and computational biology to align two protein or nucleic acid sequences • Every pair of characters in the two sequences are compared pairwise • O(MN) • Dynamic programming algorithms proven mathematically to generate the best or OPTIMAL alignment between two sequences under a given set of conditions • http://bioinformatics.weizmann.ac.il/courses/BCG/lectures/02_pairwise/2.4tools/index.html • http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
Example of Dynamic programming algorithm • Needleman-Wunsch techniques. For this example, the two sequences to be globally aligned are G A A T T C A G T T A (sequence #1) G G A T C G A (sequence #2) • So M = 11 and N = 7 (the length of sequence #1 and sequence #2, respectively) • A simple scoring scheme is assumed where • Si,j = 1 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise • Si,j = 0 (mismatch score) • w = 0 (gap penalty)
Three steps in dynamic programming • Initialization • Matrix fill (scoring) • Traceback (alignment) • Good detailed description • http://www.sbc.su.se/~per/molbioinfo2001/dynprog/adv_dynamic.html
Needleman-Wunsch • S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443--453, 1970 • General Algorithm for Sequence Comparison • Compuationally expensive O(M,N) where M and N are lengths of sequences being pairwise compared • To Calculate the alignment score S(i,j), you only need to enumerate and score all ways in which one aligned pair can be added to a shorter alignment to produce an alignment of the first I residues of sequence A and the first j residues of sequence B
All possible pairs are calculated in a two-D array • All possible alignments are represented by pathways through this array, the shortest being the most optimal • Global alignments – every residue of the two sequences A and B is involved in the calculation. • Misses out on specific motifs or active sites homology, or domains within a longer sequence which match each other, where other parts don’t match
Step 1 – assign similarity values • Assign Similarity Scores • A numerical value/score is assigned to every cell in the 2D array, based on the scoring matrix. • The similarity scores e.g. unitary soring matrix, PAM, BLOSUM etc.
Step 2 – for each cell trace back all possible pathways back to the start allowing for indels, giving cell the value of the pathway with the max score • Score pathways through the 2D Array • For each cell, calculate the maximum score for an alignment ending at that point. • Compute Cumulative score by adding scores in a path through the matrix • Search subrow and subcolumn for the highest score • Include Gap penalty and Gap extension penalty • The best optimal alignment is the pathway with the highest score • Max match is the largest number of amino acids of one protein that can be matched with those of another protein while allowing for gaps
Step 3 traceback • Construct the alignment or pathway back from the highest scoring cell to give the highest scoring alignment • Examples of implementations: GAP (in GCG package)
N-W statistical significance: • Probability of obtaining the same result of the maximum score from a pair of random sequences • Estimated using: Form pairs of random sequences by randomly drawing one member from each set e.g. with the same amino acid composition as the real sequences If the value calculated for real proteins is significantly different from that of the random proteins, then it is likely that the alignment is due to the specific sequences rather than accidentally due to their composition.
Smith Waterman Algorithm • Smith, T. F., Waterman, M. S., J. Mol. Biol. (1981) 147:195-197 • Based on Needleman Wunsch Algorithm • Local Alignment instead of N-W global alignment • Examples of implementations – BESTFIT (in GCG package) • S-M Algo compares segments of all possible lengths (local alignments) rather than entire length (global alignments) in the NW algo and chooses whichever optimises the similarity score of these local alignments
Comparing two sequences A (=a1 a2 a3 ... an) and B (=b1 b2 b3 ... bm) • Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j -Wk}, max{Hi, j-l -Wl}, 0} • Hij is the maximum similarity of two segments ending in ai and bj respectively • Including the alternative that the similarity score be zero in the expression for Hij allows the local alignment to restart at any pair of aligned residues. In addition, it makes the calculation much more sensitive to the precise match and mismatch scores and gap penalties.
For every cell, the SW algo computes all possible paths leadng to it, including paths of any length, and not necessarily leading from the beginning of the sequence (unlike the global alignment). • So paths can be of any length, starting anywhere in the sequences being compared, and can contain insertions or deletions. • Relies on gap penalties to inhibit extensions
Forming the optimal alignment diagonal path • For every residue in the query sequence • Align with next residue of the db target sequence – Previous score plus similarity score for the two residues • Deletion score is previous score minus gap penalty and gap size penalty – query seq with gap • Insertion score is previous score minus gap penalty and gap size penalty – db target with gap • Stop when score is zero. • Choose whichever above is highest
SW alignment • Score in each cell is max possible score for an alignment of any length (not necessarily from the beginning unlike the NW) ending at that cell • Trace pathway back form the highest scoring cell • This cell can be anywhere in the 2D array • Align the highest scoring segment • http://www.maths.tcd.ie/~lily/pres2/
NW vs SW • Global vs local alignment • NW Alignment score for a pair of residues must be >= 0 while SW can be + or – • No gap penalty required for NW to work properly, but SW needs gap penalty • Score cannot decrease between two cells of a pathway for NW, but for SW, in getting to an alignment, score can increase, decrease or stay level between two cells of a pathway
BLAST and FASTA • Heuristic approximations of N-W and S-M algorithms • Reduce computation and increase speed • NW and SM exhaustive, guarantee optimal • BLAST is statistically sound. But it is not mathematically guaranteed to produce the best optimal alignment
FASTA • Pearson, W. R., Lipman, D. J., P.N.A.S. (1988) 85:2444-2448 • 1. Lookup Table • http://acer.gen.tcd.ie/~amclysag/fasta.html • FastA is a family of heuristic algorithms developed by William Pearson of the University of Virginia. FastA lies between BLAST and Smith-Waterman in both accuracy and speed. An optimized FastA option makes use of Smith-Waterman for part of the alignment process. The FastA family includes DNA to DNA, protein to protein, and translation searches.
FASTA • http://workbench.sdsc.edu • Rapid Global Alignment • Allows alignments to shift frames • LALIGN – a FASTA derivative for local alignments • Works for internal repeats missed by FASTA because of gaps (now there is gapped BLAST – BLAST 2.0)
Basic Local Alignment Search Tool BLAST • Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410 • Strategy is to Optimise Maximal Segment Pair (MSP) score • Heuristic modification of the SW dynamic programming algo • Ungapped alignments only • BLAST1 uses PAM120 matrix for protein alignments • DNA alignment score matrix +5 for match and –4 for mismatch
BLAST is heuristic • BLAST (Basic Local Alignment Search Tool) is a heuristic algorithm first released in 1990 by the National Center for Biotechnology Information (NCBI). The original BLAST compares very short segments of the query and database sequences in search of alignments that exceed a particular score. No insertions or deletions are considered during this process. If the required score is exceeded, BLAST investigates whether the aligned segments can be lengthened to produce a higher-scoring alignment. In the last two years, new and improved versions of BLAST have been released. These newer releases consider insertions and deletions to some extent producing better results. BLAST offers several search types: DNA to DNA, protein to protein, and translation searches.
MSP max segment pair • Similarity Score for a segment is the Sum of Scores for individual pairs of residues. • MSP is highest scoring segment • MSP is locally max if score cannot be increased by extending or shortening both segments • Search for all locally maximal MSPs above some cutoff/threshold. • Word pair of fixed length w • Look for scores above threshold T • Scan target db sequence for word of length w with score geT - Find word hits • Extend word hits to see if it is part of a longer segment with the segment pair with score at least S where S is highest MSP score at which chance similarities are likely to appear
Algorithm • Compile a list of high scoring words of length w (w=4 for proteins and 12 for nt) for query sequence • Scan for word hits of score greater than threshold T against db target sequence • Extend word hit in both directions to find High Scoring Pairs with scores greater than S
Step 1 – compile list of high scoring words • Generate All words of w length – w-mers in query sequence • Look for w-mers with score at least T when matched with target sequence • Speed of program is dependent on list generated in a time proportional to the length of the list unlike SW or NW which is proportional to the product of the lengths.
Step 2. Scanning Phase • Scan DB • Search long sequences for occurrences of words in the list • Use deterministic finite automaton or finite state machine at approx 0.5M residues a sec
Step 3. Extending Hits • Extend w-mers • Stop when score falls a certain distance below highest so far.ie the highest score for shorter or no extensions. • http://acer.gen.tcd.ie/~amclysag/blast.html for DNA
Other more advanced programs • See http://www.paracel.com/faq/faq_algorithm_primer.html • Gapped Blast (BLAST 2.0) extend words from no-gap to gapped, can generate gapped alignments • PSI Blast – Position specific iterated BLASTie. Use gapped BLAST, generate a profile from multiple iterations. This profile is used instead of the input and distance matrix
Limitations to BLAST for sequence database searching • Needs islands of strong similarity • Limits on scoring and penalty values • Blastx, tblastn and tblastx use 6 frame translation but miss sequences with frameshifts • Finds and reports only local alignments
BLAST rules of thumb • For short aas 20-40, 50% identity happens by chance • If A homologus to B, B to C then A and C are homologous • Similarity can be present even if there is absence of homology e.g. low complexity transmembrane and coiled coil regions.
BLAST significance • S’ = (lambdaS-lnK)/ln2 • Lambda and K are associated with the scoring system S’ with a given E, is sigificant if it is greater than N/E where N is the size of the search space.
Significance • A common and simple test to determine if the alignment of two sequences is statistically significant is to carry out a simple permutation test. This consists of • Randomly rearrange the order of one or both sequences • Align the permuted sequences • Record the score for this alignment • Repeat steps 1-3 a large number of times.
http://helix.biology.mcmaster.ca/721/outline2/node40.html#SECTION00630000000000000000http://helix.biology.mcmaster.ca/721/outline2/node40.html#SECTION00630000000000000000 • For protein sequences Doolittle's rule of thumb is that greater than 25% identity will suggest homology, less than 15% is doubtful and for those cases between 15-25% identity a strong statistical argument is required. Personally, I would prefer the statistical test in all cases, since they are easy to do and things such as internal repeats and unusual amino acid compositions can sometimes confuse the picture.
Gumbel Extreme Value Distribution – • Make many random protein sequences of varying lengths • Locally align two sequences of a given length with the same scoring matrix and gap penalty • Repeat for many pairs of sequences of approx same length • Plot distribution of scores • See variation from normal distribution curve – skew – the Gumbel extreme value distribution
Probability that a score S between two unrelated sequence is equal to or greater than a value x, P(S>=x) is given by P= 1- exp (-Kmn e -lambda*x) • Where m and n are sequence lengths, and lambda is scaling factor for the scoring matrix used, • K is a constant that depends on the scoring matrix and the gap penalty combination used. • For significant matches, we are looking for v. low value of P, less than 0.01 to 0.05. • Calculate another value E, the expect value for the alignment score that depends on how we found P. E is used by BLAST. • For BLOSUM62, gap –11/-1, lambda =0.3 and K=0.1 approx
Sequence scrambling method shuffling • Align two sequences and obtain optimal score S • Scramble one of the sequences thousands of times (N) and align it with the first sequence and obtain a distribution of scores, scrambled sequences should preserve same composition • Fit the scores into an extreme value distribution and calculate lambda and K • Calculate P for probability that one of the scrambled pairs will exceed optimal score S • Calculate E(expect value), which is usually P x the number of sequence pairs compared. Eg. If P = 10-6, and N =10,000 then E=10-2 • E is the number we want to be <0.01 to 0.05
Tips on how to assess statistical significance • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Blast_output.html
Extreme-Value Distribution: Probability density function for distribution of local alignment score. • Probability density function • Several illustrative visual examples may be found in: • Karlin, Samuel and Altschul, Stephen, “Applications and statistics for multiple high-scoring segments in molecular sequences” Proc. Natl. Acad. Sci. USA Vol. 90, pp. 5873-5877, June 1993.
Take an example (ChenXin): • Suppose the genome of B.pseudomallei has some base pairs that could code for 10,000,000 amino acids. Now, we want to know if this genome could code for a very short sequence "MF". What would be the possibility of finding a sequence "MF" in the sequence coded by the B.Pse... genome? • Math have it: 1 - ( (1- (1/20)*(1/20) ) ^ 9,999,999 ) • Since we have 20 amino acids, the possibility of getting a sequence of "MF" is (1/20)*(1/20). The possibility of the sequence is not "MF" is 1-(1/20)*(1/20). In the genome, there are 9,999,999 possible starting point of a length-two sequence. So, the possibility that they are all not "MF" is (1- (1/20)*(1/20) ) ^ 9,999,999. So the possibility of finding a "MF" is 1 - ( (1- (1/20)*(1/20) ) ^ 9,999,999 ). If you know well about math, you will know this value is nearly 1(100%) which means you are very, very likely to find the sequence. • The sequence "MF" is arbitrarily dictated by me which have no biological significance. That means the existence of "MF" is because of chance (by distribution), not because of the evolutionary pressure. The sequence we are interested are those developed and fixed through the evolutionary process by the force of natural selection (survival of the fittest). So, we wish to calculate how possible the match is because of chance (by distribution). • This is the E value. So the smaller the E value is (The less possible the match can occur by distribution), the more biological significance it have (survived through selection).
Multiple Sequence Alignments • NW algorithm can be extended from pairwise to > 2 sequences. Matrix become n dimensional. • O(MN) to O(Nm) • just 100 nucleotides from 5 species, this is • 100e5 = 10,000,000,000 operations • Eg. Like alignment of 2 seqs 100,000 nt long • Most popular is Higgins and Sharp (1989) Clustal
4 Steps in CLustal • All pairwise similarity scores calculated using Rapid alignment methods • Create a similarity matrix and to cluster the sequences based on this similarity using a cluster algorithm • Create an alignment of clusters via a consensus method. • Create a progressive multiple alignment by sequentially aligning groups of sequences according to their branching order in the clustering. • ClustalW – a local alignment algorithm • ClustalV – the older global alignment method
MSA ( Gupta, Kececioglu & Schaffer, 1995) • MACAW ( Schuler, Altschul & Lipman, 1991). • a multiple dimensional dot plot and then look for dots that are common to each group ( Vingron & Argos 1991).)
Need to construct alignment and phylogeny at the same time. • Eg TREEALIGN • When all is said and done, people will still find that the alignments produced by the programs can be improved by a judicious and critical examination by eye. Spending time to slowly and carefully re-examine your alignments by hand is recommended • http://helix.biology.mcmaster.ca/721/outline2/node37.html
FastDNAML Phylogenetic analysis Construction of a Phylogenetic tree http://www.bic.nus.edu.sg/ biogrid/phylo/cascon.html