1 / 18

Sequence Alignment III

Sequence Alignment III. CIS 667 February 10, 2004. Extensions to the Basic Algorithm. We have seen that the basic dynamic programming algorithm can be used to solve Global alignment Semi-global alignment Local alignment

analu
Download Presentation

Sequence Alignment III

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence Alignment III CIS 667 February 10, 2004

  2. Extensions to the Basic Algorithm • We have seen that the basic dynamic programming algorithm can be used to solve • Global alignment • Semi-global alignment • Local alignment • We can extend the algorithm to more accurately reflect the cost of gap penalties

  3. General Gap Penalty Functions • Gaps are caused by mutations • It is more likely to have a single large gap than several smaller ones • Gap penalties should reflect that • Let w(k) denote a gap penalty function (the cost of a gap of k spaces) • We have been using w(k) = bk - a linear function

  4. General Gap Penalty Functions • We can modify the basic algorithm to compute a score with a general gap penalty w (i.e. any function) • The modified algorithm is slower, however - O(n3) • The new algorithm scoring scheme is no longer additive • The space required is also larger

  5. Affine Gap Penalty Functions • Can we do better than O(n3) and still have a reasonable function? • Yes. We need to have w(k)  kw(1) • An affine function - w(k) = h + gk with w(0) = 0 and h, g > 0 works • Think of h as the cost of opening a gap and g as the cost of extending a gap • We can develop an algorithm with time complexity O(n2)

  6. -12 -16 -9 -16 Gap Penalties - Overview CAGT CCAAGGTTCAGT Imagine we want to align: C-A-G-T----- CCAAGGTTCAGT Bad alignment: --------CAGT CCAAGGTTCAGT Better alignment: Gap cost with linear gap penalty (-2) Gap cost with affine gap penalty (h = -2, k = -1)

  7. Multiple Sequence Alignment • Once a protein sequence is newly determined, an important goal is to assign possible functions to it • First search for similar sequences in the DNA and protein sequence databases • If more than one similar sequence is found, the next step is to multiply align all of the sequences

  8. Multiple Sequence Alignment • Multiple alignments are key starting point for • Protein secondary structure prediction • Residue accessibility • Function • Also provide the basis for the most sensitive sequence searching algorithms

  9. Multiple Sequence Alignment • A multiple sequence alignment is simply an alignment that contains more than two sequences MPQILLL MLR-LL- MK-ILLL MPPVLIL

  10. Multiple Sequence Alignment • We must decide how to score a multiple alignment • One possibility is the sum-of-pairs function • Simply add up the pairwise scores of all pairs in a column to get the score of the column • Note that in multiple sequence alignment we may have two spaces in a column - the score of (-,-) then is usually set to 0

  11. Multiple Sequence Alignment • A straightforward dynamic programming approach to multiple sequence alignment results in an exponential algorithm • Heuristics can be used to reduce the complexity in most cases

  12. Multiple Sequence Alignment • Automatic alignment programs such as CLUSTAL W can be used to produce multiple alignments • The PSI-BLAST program uses multiple sequence alignments to make more sensitive searches of protein sequence databases than is possible with a single sequence

  13. PAM Matrices • When comparing protein sequences, we need a more complex scoring scheme • A mismatch with two amino acids with similar biochemical properties should score higher than one with two dissimilar ones • Evolution is more likely to result in a similar amino acid (e.g. same size, both hydrophobic, etc.) replacing another

  14. PAM Matrices • PAM - Point Accepted Mutations or Percent of Accepted Mutations • 1-PAM matrix reflects an amount of evolution producing on average one mutation per hundred amino acids • 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart • Works well for long, weakly similar sequences • Small values good for short, similar sequences

  15. PAM-250 Matrix

  16. BLOSUM Matrices • Another widely used set of matrices is BLOSUM - Blocks Substitution Matrix • BLOSUM is often better for highly divergent sequences • PAM better for more highly similar sequences

  17. BLAST • BLAST - Basic Local Alignment Search Tool is a family of sequence similarity tools • Can be used to search sequence databases worldwide • Can be run locally, or via web-based interface on a server • Given a query sequence, BLAST returns all matches above a user-defined threshold

  18. BLAST • BLAST uses a heuristic technique • Compile list of high-scoring words (use PAM matrix to score words w characters long) • Search for matches in the database (use a hash table to speed up search) - call a match a seed • Extend the seeds in both directions until the score of the extension falls below a limit

More Related