Learn about the important operation of sequence comparison in computational biology, including similarity, alignment, scoring, dynamic programming, and optimal alignments. Explore local and semi-global comparisons, gap penalty functions, multiple sequence alignment, and the use of heuristics for efficient computation.
Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu
Sequence Comparison • Sequence comparison is the most important operation in computational biology • Consists of finding which parts of the sequences are alike and which parts differ
Similarity and Alignment • Similarity • Gives a measure of how similar sequences are • Alignment • A way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings
Sequence Comparison • Want best alignment between two or more sequences • Global Comparison • Alignment involving entire sequences • Local Comparison • Alignment involving substrings • Semi-Global Comparison • Aligning prefixes and suffixes of the sequences • All can be solved by Dynamic Programming
Global Comparison • Consider the following DNA sequences GACGGATTAG GATCGGAATAG • Are they similar? • After alignment, similarities are more obvious GA-CGGATTAG GATCGGAATAG
Alignment and Score • Alignment, more precise definition • Insertion of spaces in arbitrary locations along the sequences so that they end up with the same size • No column can be entirely composed of spaces • Score • Measure of similarity • Each column receives +1 for a match, -1 for a mismatch, or -2 for a space • Sum the column values to get the score
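The column scoring rule can be sketched in Python. This is a minimal illustration, not code from the chapter: the function name `alignment_score` and its keyword parameters are assumptions, and only the +1, -1, -2 values come from the slide.

```python
def alignment_score(row_s, row_t, match=1, mismatch=-1, space=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist only of spaces")
        if a == "-" or b == "-":
            total += space      # one character facing a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The alignment from the Global Comparison slide:
# 9 matches, 1 mismatch, 1 space -> 9 - 1 - 2 = 6
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))
```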
Dynamic Programming • Solving an instance of a problem by taking advantage of already computed solutions for smaller instances of the problem • Main algorithmic approach used in sequence alignment • Figure 3.1, 3.2
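A hedged sketch of how the dynamic programming table might be filled for global comparison follows. Figures 3.1 and 3.2 are not reproduced here, so this is the standard formulation rather than a transcription of the book's pseudocode; the name `similarity_table` and the parameter names are assumptions.

```python
def similarity_table(s, t, match=1, mismatch=-1, space=-2):
    """Return table a where a[i][j] is the best score for s[:i] versus t[:j]."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix with the empty string costs one space per character.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,   # s[i] against a space
                a[i - 1][j - 1] + p,   # s[i] against t[j]
                a[i][j - 1] + space,   # a space against t[j]
            )
    return a


table = similarity_table("GACGGATTAG", "GATCGGAATAG")
print(table[-1][-1])  # similarity of the two full sequences
```

Each entry depends only on three previously computed neighbours, which is what makes the smaller-instance reuse described above work.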
Optimal Alignments • From Figure 3.1, start at (m,n) and follow arrows to (0,0) • Each arrow gives one column of the alignment • If arrow is horizontal, it corresponds to a column with a space in s matched with t[j] • If arrow is vertical, it corresponds to s[i] matched with a space in t • If arrow is diagonal, s[i] is matched with t[j]
Optimal Alignments • Many alignments are possible, depending on which arrow is given priority
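The traceback from (m,n) to (0,0) can be sketched as below. This is an illustrative version under assumptions: the function name `align` is hypothetical, and the priority order tried here (vertical, then diagonal, then horizontal) is just one choice. A different order can return a different alignment with the same optimal score.

```python
def align(s, t, match=1, mismatch=-1, space=-2):
    """Return (row_s, row_t, score) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)

    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + space:
            # Vertical arrow: s[i] matched with a space in t.
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + (
            match if s[i - 1] == t[j - 1] else mismatch
        ):
            # Diagonal arrow: s[i] matched with t[j].
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        else:
            # Horizontal arrow: a space in s matched with t[j].
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    # Columns were collected from the end, so reverse them.
    return "".join(reversed(row_s)), "".join(reversed(row_t)), a[m][n]


top, bottom, score = align("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
print(score)
```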
Local Comparison • A local alignment between s and t is an alignment between a substring of s and a substring of t • Goal: find the highest scoring local alignment between two sequences • Variation of basic algorithm (Figure 3.2) • Each entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55)
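A minimal sketch of the local variation, assuming the usual formulation in which the empty alignment (score 0) is always a candidate and the answer is the largest entry anywhere in the table. The name `local_similarity` is an assumption.

```python
def local_similarity(s, t, match=1, mismatch=-1, space=-2):
    """Highest score of an alignment between a substring of s and a substring of t."""
    m, n = len(s), len(t)
    # First row and column stay 0: an empty suffix can always be chosen.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                0,                     # start a fresh local alignment here
                a[i - 1][j] + space,
                a[i - 1][j - 1] + p,
                a[i][j - 1] + space,
            )
            best = max(best, a[i][j])
    return best


print(local_similarity("AAAGGG", "TTGGGTT"))  # the shared substring GGG scores 3
```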
Semi-Global Comparison • Score alignments ignoring some of the end spaces in the sequences • End spaces are those that appear before the first or after the last character in a sequence • For example, CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- • If we aligned the sequences in the usual way, then CAGCACTTGGATTCTCGG CAGC-----G-T----GG • With the usual scoring the second alignment wins (-12 versus -19), but ignoring end spaces the first scores 3 and is the more meaningful alignment
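One semi-global variant can be sketched as follows. It assumes that only the end spaces of the second sequence t are free, as in the example where the shorter sequence sits inside the longer one; other combinations of free ends are handled analogously. The name `semiglobal_similarity` is hypothetical.

```python
def semiglobal_similarity(s, t, match=1, mismatch=-1, space=-2):
    """Best score aligning t against s when spaces before and after t are free."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # First column stays 0: spaces placed before the first character of t are free.
    for j in range(1, n + 1):
        a[0][j] = j * space  # spaces before the first character of s are still charged
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)
    # Spaces after the last character of t are free: take the best entry in the last column.
    return max(a[i][n] for i in range(m + 1))


print(semiglobal_similarity("AAACGTAAA", "CGT"))  # CGT occurs exactly inside s
```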
Extensions to Basic Algorithm • Basic algorithm has O(mn) time complexity and uses O(mn) space • Possible to reduce the space requirement from quadratic to linear at the expense of roughly doubling processing time • Can be accomplished by using a Divide and Conquer strategy • Divide the problem into small subproblems and later combine the solutions to obtain a solution for the whole problem
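The observation behind the linear-space version is that each row of the table depends only on the row above it. The sketch below computes just the similarity score with two rows; it is not the full divide-and-conquer algorithm, which is still needed to recover the alignment itself. The function name is an assumption.

```python
def similarity_linear_space(s, t, match=1, mismatch=-1, space=-2):
    """Global similarity of s and t using O(len(t)) memory instead of O(mn)."""
    n = len(t)
    prev = [j * space for j in range(n + 1)]      # row 0 of the table
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * n               # column 0 of row i
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j] + space, prev[j - 1] + p, cur[j - 1] + space)
        prev = cur                                # the older row is no longer needed
    return prev[n]


print(similarity_linear_space("GACGGATTAG", "GATCGGAATAG"))
```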
Gap Penalty Functions • A gap is a run of consecutive spaces • When mutations occur, a single block of consecutive spaces is more likely than a series of isolated spaces • The previously discussed scoring method, which charges every space independently, is not appropriate in this case
Gap Penalty Functions • For example, A------ATTCCTTCCTTCC AAAGAGAATTCCTTCCTTCC • Scoring is done at a block level, not a column level A ------ ATTCCTTCCTTCC A AAGAGA ATTCCTTCCTTCC
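Block-level scoring can be illustrated with an affine gap penalty. The slides do not state a specific penalty function, so the form w(k) = gap_open + gap_extend * k and the values 2 and 1 below are assumptions for illustration only, as is the name `affine_score`.

```python
def affine_score(row_s, row_t, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score an alignment, charging each gap of length k a penalty gap_open + gap_extend * k."""
    score = 0
    prev = None  # row holding the space in the previous column: "s", "t", or None
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            cur = "s" if a == "-" else "t"
            if cur != prev:
                score -= gap_open   # a new gap starts in this column
            score -= gap_extend     # every space in the gap pays the extension cost
            prev = cur
        else:
            score += match if a == b else mismatch
            prev = None
    return score


# 14 matches and one gap of length 6: 14 - (2 + 6) = 6
print(affine_score("A------ATTCCTTCCTTCC", "AAAGAGAATTCCTTCCTTCC"))
```

Under these assumed values two isolated spaces cost 6 in total, while one gap of length two costs 4, so the scheme favours blocks over scattered spaces.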
Multiple Sequences • Multiple sequence alignment is a generalization of the two-sequence case • A multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same size • No column is made entirely of spaces • Figure 3.10
Scoring Multiple Sequences • Need a function that takes the symbols of a column (amino acids or spaces) and returns a score • The function must have two properties • Must be independent of the order of its arguments. For example, if a column has I,V,- the same score should be produced if the order is -,V,I • Should reward the presence of many equal residues and penalize unequal residues and spaces
Sum-of-Pairs (SP) • Sum-of-Pairs (SP) satisfies the properties • Sum of pairwise scores of all pairs of symbols in a column • SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) where p(a,b) is pairwise score of a and b
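A short sketch of the SP score follows. The slide leaves the pairwise function p unspecified, so the default used here (+1 match, -1 mismatch, -2 against a space, and 0 for two spaces) is an assumption for illustration; the names `pair_score`, `sp_column`, and `sp_alignment` are likewise hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, space=-2):
    """Assumed pairwise score p(a, b); two spaces facing each other score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return space
    return match if a == b else mismatch


def sp_column(column, p=pair_score):
    """Sum p over every unordered pair of symbols in one column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows, p=pair_score):
    """SP score of a multiple alignment given as equal-length rows."""
    return sum(sp_column(col, p) for col in zip(*rows))


# p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) = -2 + 1 - 1 - 2 - 2 - 1
print(sp_column("I-IV"))
```

Because the sum runs over unordered pairs and the assumed p is symmetric, reordering the symbols in a column does not change the result.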
Algorithm Paradigm • Dynamic programming is used again • Basic algorithm can be used, but there will be problems • In the two-sequence case, complexity is O(n^2) • For the k-sequence case, complexity is O(n^k) • Can take a really long time if k is large
Algorithm Paradigm • Must reduce the amount or number of cells to compute • Apply a heuristic to reduce the number of computed cells
Star Alignments • Building a multiple alignment based on pairwise alignments between a fixed sequence and all others • Fixed sequence is the center of the star
Star Alignments • Example a = ATTGCCATT b = ATGGCCATT c = ATCCAATTTT d = ATCTTCTT e = ACTGACC Select a as the center of the star
Star Alignments • Align a with b, a with c, a with d, and a with e
Star Alignments • a with b: ATTGCCATT / ATGGCCATT • a with c: ATTGCCATT-- / ATC-CAATTTT • a with d: ATTGCCATT / ATCTTC-TT • a with e: ATTGCCATT / ACTGACC--
Star Alignments • Combine results • ATTGCCATT-- • ATGGCCATT-- • ATC-CAATTTT • ATCTTC-TT-- • ACTGACC----
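The combining step can be sketched as below, assuming the usual rule that a space inserted into the center by any pairwise alignment is kept in the final result. The function name `merge_star` and the choice of where padding goes inside a gap are assumptions, not details given on the slides.

```python
def merge_star(center, pairs):
    """Merge pairwise alignments (center_row, other_row) into one multiple alignment."""
    n = len(center)
    # gaps[i] = most spaces any pairwise alignment puts before center[i];
    # gaps[n] counts spaces after the last character of the center.
    gaps = [0] * (n + 1)
    for c_row, _ in pairs:
        if c_row.replace("-", "") != center:
            raise ValueError("each first row must be the center with spaces added")
        i = run = 0
        for ch in c_row:
            if ch == "-":
                run += 1
            else:
                gaps[i] = max(gaps[i], run)
                i, run = i + 1, 0
        gaps[n] = max(gaps[n], run)

    rows = ["".join("-" * gaps[i] + center[i] for i in range(n)) + "-" * gaps[n]]
    for c_row, o_row in pairs:
        out, pending, i = [], [], 0
        for c, o in zip(c_row, o_row):
            if c == "-":
                pending.append(o)   # character facing a space in the center
            else:
                # Pad so this row uses as many columns as the widest gap here.
                out.append("-" * (gaps[i] - len(pending)) + "".join(pending) + o)
                pending, i = [], i + 1
        out.append("".join(pending) + "-" * (gaps[n] - len(pending)))
        rows.append("".join(out))
    return rows


pairs = [
    ("ATTGCCATT", "ATGGCCATT"),
    ("ATTGCCATT--", "ATC-CAATTTT"),
    ("ATTGCCATT", "ATCTTC-TT"),
    ("ATTGCCATT", "ACTGACC--"),
]
for row in merge_star("ATTGCCATT", pairs):
    print(row)
```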
Database Search • Databases exist for searching and comparing protein and DNA sequences • The methods described work, but may take too long and be impractical for searching large databases • Novel and faster methods have been developed
PAM Matrix • When scoring protein sequences, the +1, -1, -2 scheme may not be sufficient • Amino acids have properties that influence the likelihood that they will be substituted in an evolutionary scenario
PAM Matrix • Point Accepted Mutations • A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart • A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart
PAM Matrix • Markovian in nature • Need the probability of occurrence of each amino acid • Probability transition matrix • Score matrix
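The Markovian idea can be illustrated with a toy example: the transition matrix for k units of evolution is the 1-unit matrix multiplied by itself k times. Everything numeric below is hypothetical (a two-letter alphabet with made-up probabilities, not real PAM data), and the log-odds form 10 * log10(M^k[a][b] / p[b]) is a common convention assumed here rather than quoted from the slides.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def k_pam(m1, k):
    """Transition matrix for k units of evolution: m1 raised to the k-th power (k >= 1)."""
    result = m1
    for _ in range(k - 1):
        result = mat_mul(result, m1)
    return result


def score_matrix(mk, freqs):
    """Assumed log-odds score: 10 * log10(mk[a][b] / freqs[b])."""
    n = len(mk)
    return [[10 * math.log10(mk[a][b] / freqs[b]) for b in range(n)] for a in range(n)]


# Hypothetical 1-step transition matrix and occurrence probabilities (toy data).
TOY_M1 = [[0.99, 0.01], [0.02, 0.98]]
TOY_FREQS = [2 / 3, 1 / 3]

for row in score_matrix(k_pam(TOY_M1, 250), TOY_FREQS):
    print([round(v, 2) for v in row])
```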
BLAST • The most frequently used programs for searching sequence databases • Acronym for Basic Local Alignment Search Tool • Returns a list of high scoring segment pairs between the query sequence and sequences in the database • http://www.ncbi.nlm.nih.gov
FAST • Another family of programs for sequence database search • http://www.rcsb.org/pdb/index.html • BLAST and FAST use PAM matrices