200 likes | 516 Views
Sequence Alignment III. CIS 667 February 10, 2004. Extensions to the Basic Algorithm. We have seen that the basic dynamic programming algorithm can be used to solve Global alignment Semi-global alignment Local alignment
E N D
Sequence Alignment III CIS 667 February 10, 2004
Extensions to the Basic Algorithm • We have seen that the basic dynamic programming algorithm can be used to solve • Global alignment • Semi-global alignment • Local alignment • We can extend the algorithm to more accurately reflect the cost of gap penalties
General Gap Penalty Functions • Gaps are caused by mutations • It is more likely to have a single large gap than several smaller ones • Gap penalties should reflect that • Let w(k) denote a gap penalty function (the cost of a gap of k spaces) • We have been using w(k) = bk - a linear function
General Gap Penalty Functions • We can modify the basic algorithm to compute a score with a general gap penalty w (i.e. any function) • The modified algorithm is slower, however - O(n3) • The new algorithm scoring scheme is no longer additive • The space required is also larger
Affine Gap Penalty Functions • Can we do better than O(n3) and still have a reasonable function? • Yes. We need to have w(k) kw(1) • An affine function - w(k) = h + gk with w(0) = 0 and h, g > 0 works • Think of h as the cost of opening a gap and g as the cost of extending a gap • We can develop an algorithm with time complexity O(n2)
-12 -16 -9 -16 Gap Penalties - Overview CAGT CCAAGGTTCAGT Imagine we want to align: C-A-G-T----- CCAAGGTTCAGT Bad alignment: --------CAGT CCAAGGTTCAGT Better alignment: Gap cost with linear gap penalty (-2) Gap cost with affine gap penalty (h = -2, k = -1)
Multiple Sequence Alignment • Once a protein sequence is newly determined, an important goal is to assign possible functions to it • First search for similar sequences in the DNA and protein sequence databases • If more than one similar sequence is found, the next step is to multiply align all of the sequences
Multiple Sequence Alignment • Multiple alignments are key starting point for • Protein secondary structure prediction • Residue accessibility • Function • Also provide the basis for the most sensitive sequence searching algorithms
Multiple Sequence Alignment • A multiple sequence alignment is simply an alignment that contains more than two sequences MPQILLL MLR-LL- MK-ILLL MPPVLIL
Multiple Sequence Alignment • We must decide how to score a multiple alignment • One possibility is the sum-of-pairs function • Simply add up the pairwise scores of all pairs in a column to get the score of the column • Note that in multiple sequence alignment we may have two spaces in a column - the score of (-,-) then is usually set to 0
Multiple Sequence Alignment • A straightforward dynamic programming approach to multiple sequence alignment results in an exponential algorithm • Heuristics can be used to reduce the complexity in most cases
Multiple Sequence Alignment • Automatic alignment programs such as CLUSTAL W can be used to produce multiple alignments • The PSI-BLAST program uses multiple sequence alignments to make more sensitive searches of protein sequence databases than is possible with a single sequence
PAM Matrices • When comparing protein sequences, we need a more complex scoring scheme • A mismatch with two amino acids with similar biochemical properties should score higher than one with two dissimilar ones • Evolution is more likely to result in a similar amino acid (e.g. same size, both hydrophobic, etc.) replacing another
PAM Matrices • PAM - Point Accepted Mutations or Percent of Accepted Mutations • 1-PAM matrix reflects an amount of evolution producing on average one mutation per hundred amino acids • 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart • Works well for long, weakly similar sequences • Small values good for short, similar sequences
BLOSUM Matrices • Another widely used set of matrices is BLOSUM - Blocks Substitution Matrix • BLOSUM is often better for highly divergent sequences • PAM better for more highly similar sequences
BLAST • BLAST - Basic Local Alignment Search Tool is a family of sequence similarity tools • Can be used to search sequence databases worldwide • Can be run locally, or via web-based interface on a server • Given a query sequence, BLAST returns all matches above a user-defined threshold
BLAST • BLAST uses a heuristic technique • Compile list of high-scoring words (use PAM matrix to score words w characters long) • Search for matches in the database (use a hash table to speed up search) - call a match a seed • Extend the seeds in both directions until the score of the extension falls below a limit