
Computational Sequence Comparison in Molecular Biology

Learn about the important operation of sequence comparison in computational biology, including similarity, alignment, scoring, dynamic programming, and optimal alignments. Explore local and semi-global comparisons, gap penalty functions, multiple sequence alignment, and the use of heuristics for efficient computation.

noreent

Presentation Transcript


  1. Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

  2. Sequence Comparison • Sequence comparison is the most important operation in computational biology • Consists of finding which parts of the sequences are alike and which parts differ

  3. Similarity and Alignment • Similarity • Gives a measure of how similar sequences are • Alignment • A way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings

  4. Sequence Comparison • Want best alignment between two or more sequences • Global Comparison • Alignment involving entire sequences • Local Comparison • Alignment involving substrings • Semi-Global Comparison • Aligning prefixes and suffixes of the sequences • All can be solved by Dynamic Programming

  5. Global Comparison • Consider the following DNA sequences GACGGATTAG GATCGGAATAG • Are they similar? • After alignment, similarities are more obvious GA-CGGATTAG GATCGGAATAG

  6. Alignment and Score • Alignment, more precise definition • Insertion of spaces in arbitrary locations along the sequences so that they end up with the same size • No column can be entirely composed of spaces • Score • Measure of similarity • Each column receives +1 for a match, -1 for a mismatch, or -2 for a space • Sum the column values to get the score
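
The column scoring rule on slide 6 can be sketched in a few lines of Python. The function name `alignment_score` and its parameter names are illustrative, not from the slides; the values +1, -1, -2 are the ones given above, and the example rows are the alignment from slide 5.

```python
def alignment_score(row_s, row_t, match=1, mismatch=-1, space=-2):
    """Sum the column scores of two aligned rows ('-' marks a space)."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist only of spaces")
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# 9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # prints 6
```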

  7. Dynamic Programming • Solving an instance of a problem by taking advantage of already computed solutions for smaller instances of the problem • Main algorithmic approach used in sequence alignment • Figure 3.1, 3.2
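
As a rough sketch of the dynamic programming idea on slide 7: entry `a[i][j]` holds the best score for aligning the first `i` characters of `s` with the first `j` characters of `t`, and each entry is computed from three smaller, already solved instances. The function name and the 0-based Python indexing are assumptions of this sketch; Figures 3.1 and 3.2 are not reproduced here.

```python
def similarity_matrix(s, t, match=1, mismatch=-1, space=-2):
    """Return the DP matrix a, where a[i][j] scores s[:i] against t[:j]."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space          # s[:i] aligned against spaces only
    for j in range(1, n + 1):
        a[0][j] = j * space          # t[:j] aligned against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,   # s[i] matched with a space
                a[i - 1][j - 1] + p,   # s[i] matched with t[j]
                a[i][j - 1] + space,   # a space matched with t[j]
            )
    return a


matrix = similarity_matrix("GACGGATTAG", "GATCGGAATAG")
print(matrix[-1][-1])  # prints 6, the score of the alignment on slide 5
```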

  8. Optimal Alignments • From Figure 3.1, start at (m,n) and follow arrows to (0,0) • Each arrow gives one column of the alignment • If arrow is horizontal, it corresponds to a column with a space in s matched with t[j] • If arrow is vertical, it corresponds to s[i] matched with a space in t • If arrow is diagonal, s[i] is matched with t[j]

  9. Optimal Alignments • Many alignments are possible, depending on which arrow is given priority
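
The traceback described on slides 8 and 9 might look like the sketch below. The `priority` parameter is a hypothetical knob added for illustration: it decides which arrow is followed first when several are valid, so different orders can return different optimal alignments with the same score.

```python
def global_align(s, t, match=1, mismatch=-1, space=-2,
                 priority=("up", "diag", "left")):
    """Return (row_s, row_t, score) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)

    # Walk back from (m, n) to (0, 0); each step yields one column.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        for move in priority:
            if move == "up" and i > 0 and a[i][j] == a[i - 1][j] + space:
                row_s.append(s[i - 1])      # vertical arrow: space in t
                row_t.append("-")
                i -= 1
                break
            if move == "diag" and i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                if a[i][j] == a[i - 1][j - 1] + p:
                    row_s.append(s[i - 1])  # diagonal arrow: s[i] with t[j]
                    row_t.append(t[j - 1])
                    i -= 1
                    j -= 1
                    break
            if move == "left" and j > 0 and a[i][j] == a[i][j - 1] + space:
                row_s.append("-")           # horizontal arrow: space in s
                row_t.append(t[j - 1])
                j -= 1
                break
        else:
            raise ValueError("priority must list 'up', 'diag' and 'left'")
    return "".join(reversed(row_s)), "".join(reversed(row_t)), a[m][n]


for order in (("up", "diag", "left"), ("left", "diag", "up")):
    print(order, global_align("GACGGATTAG", "GATCGGAATAG", priority=order))
```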

  10. Local Comparison • A local alignment between s and t is an alignment between a substring of s and a substring of t • Goal: find the highest-scoring local alignment between two sequences • Variation of basic algorithm (Figure 3.2) • Entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55)
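
A minimal sketch of the local variation, assuming the same +1/-1/-2 scoring: every entry may also take the value 0 (the empty alignment), the first row and column start at 0, and the answer is the largest entry anywhere in the matrix. The function name `local_score` is illustrative.

```python
def local_score(s, t, match=1, mismatch=-1, space=-2):
    """Return the score of the best local alignment between s and t."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                0,                     # start a fresh (empty) alignment here
                a[i - 1][j] + space,
                a[i - 1][j - 1] + p,
                a[i][j - 1] + space,
            )
            best = max(best, a[i][j])
    return best


print(local_score("AAAGGG", "TTGGGTT"))  # prints 3: the shared substring GGG
```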

  11. SemiGlobal Comparison • Score alignments ignoring some of the end spaces in the sequences • End spaces are those that appear before the first or after the last character in a sequence • For example, CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- • Aligning the sequences in the usual way, with end spaces charged, would instead favor CAGCACTTGGATTCTCGG CAGC-----G-T----GG
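
One way to sketch semi-global scoring is shown below. It assumes that all end spaces are free, in both sequences and at both ends; the slide says only "some of the end spaces", so treat this as one possible variant. Free leading spaces come from initializing the first row and column to 0, and free trailing spaces from taking the maximum over the last row and last column.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, space=-2):
    """Best alignment score when end spaces in s and t are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # leading end spaces are free
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)
    # Trailing end spaces are free: stop anywhere on the last row or column.
    return max(max(a[m]), max(row[n] for row in a))


print(semiglobal_score("AAACGTAAA", "CGT"))  # prints 3: CGT sits inside s
print(semiglobal_score("CAGCACTTGGATTCTCGG", "CAGCGTGG"))
```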

  12. Extensions to Basic Algorithm • Basic algorithm has O(mn) time complexity and uses space on the order of O(mn) • Possible to improve space complexity from quadratic to linear at the expense of doubling processing time • Can be accomplished by using a Divide and Conquer strategy • Divide the problem into small subproblems and later combine the solutions to obtain a solution for the whole problem
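
The sketch below shows only the building block of the linear-space idea: the similarity score needs just the previous row of the matrix, so two rows suffice. It does not recover the alignment itself; the divide and conquer step mentioned on the slide, which does, is not implemented here. The function name is illustrative.

```python
def similarity_linear_space(s, t, match=1, mismatch=-1, space=-2):
    """Global similarity score using two rows instead of the full matrix."""
    if len(t) > len(s):
        s, t = t, s                      # scoring is symmetric; keep rows short
    prev = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * len(t)
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j] + space, prev[j - 1] + p, cur[j - 1] + space)
        prev = cur                       # the older row is no longer needed
    return prev[-1]


print(similarity_linear_space("GACGGATTAG", "GATCGGAATAG"))  # prints 6
```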

  13. Gap Penalty Functions • A gap is a run of consecutive spaces • When mutations occur, a single gap of k spaces is more likely than k isolated spaces • The previously discussed scoring method is not appropriate in this case

  14. Gap Penalty Functions • For example, A------ATTCCTTCCTTCC AAAGAGAATTCCTTCCTTCC • Scoring is done at a block level, not a column level A ------ ATTCCTTCCTTCC A AAGAGA ATTCCTTCCTTCC
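
Block-level scoring can be sketched as follows. The penalty function is an assumption: the slides do not give one, so this sketch uses a hypothetical affine form w(k) = 2 + k (a fixed cost to open a gap plus a cost per space). Columns without spaces are scored as before, and each maximal run of spaces in a row is charged once through `gap_penalty`.

```python
from itertools import groupby


def score_with_gap_blocks(row_s, row_t, match=1, mismatch=-1,
                          gap_penalty=lambda k: 2 + k):
    """Score an alignment, charging each run of k spaces gap_penalty(k)."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a != "-" and b != "-":
            total += match if a == b else mismatch
    for row in (row_s, row_t):
        for is_space, run in groupby(row, key=lambda c: c == "-"):
            if is_space:
                total -= gap_penalty(len(list(run)))
    return total


s_row = "A------ATTCCTTCCTTCC"
t_row = "AAAGAGAATTCCTTCCTTCC"
print(score_with_gap_blocks(s_row, t_row))                              # 14 - (2 + 6) = 6
print(score_with_gap_blocks(s_row, t_row, gap_penalty=lambda k: 2 * k)) # 14 - 12 = 2
```

With the column-level rule (2 per space) the block of six spaces costs 12; under the assumed affine penalty it costs 8, so one long gap is cheaper than six isolated ones.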

  15. Multiple Sequences • Multiple sequence alignment is a generalization of the two-sequence case • Multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same size • No column is made entirely of spaces • Figure 3.10

  16. Scoring Multiple Sequences • Need a function that inputs amino acid sequences and returns a score • The function must have two properties • It must be independent of the order of its arguments. For example, if a column has I,V,- the same score should be produced if the order is -,V,I • It should reward the presence of many equal residues and penalize unequal residues and spaces

  17. Sum-of-Pairs (SP) • Sum-of-Pairs (SP) satisfies the properties • Sum of pairwise scores of all pairs of symbols in a column • SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) where p(a,b) is pairwise score of a and b
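
A short sketch of the SP score for one column and for a whole alignment. The pairwise function p is an assumption: it reuses the +1/-1/-2 values from the two-sequence case and sets p(-,-) = 0, which the slide does not state.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, space=-2):
    """Hypothetical pairwise score p(a, b); p('-', '-') is taken as 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return space
    return match if a == b else mismatch


def sp_column(column):
    """Sum of pairwise scores over all unordered pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """SP score of a multiple alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*rows))


# p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) = -2 + 1 - 1 - 2 - 2 - 1
print(sp_column("I-IV"))                      # prints -7
print(sp_column("IV-") == sp_column("-VI"))   # prints True: order does not matter
```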

  18. Algorithm Paradigm • Dynamic programming is used again • Basic algorithm can be used, but there will be problems • In two sequence case, complexity is O(n^2) • For k sequence case, complexity is O(n^k) • Can take a really long time if k is large
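
To make the growth concrete, here is a small calculation of how many cells the dynamic programming array has for k sequences of length n, assuming one extra index per dimension for the empty prefix. The function name `dp_cells` is illustrative.

```python
def dp_cells(n, k):
    """Number of entries in a k-dimensional DP array for length-n sequences."""
    return (n + 1) ** k


for k in (2, 3, 5):
    print(k, dp_cells(100, k))
# 2 10201
# 3 1030301
# 5 10510100501
```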

  19. Algorithm Paradigm • Must reduce the amount or number of cells to compute • Apply a heuristic to reduce the number of computed cells

  20. Star Alignments • Building a multiple alignment based on pairwise alignments between a fixed sequence and all others • Fixed sequence is the center of the star

  21. Star Alignments • Example a = ATTGCCATT b = ATGGCCATT c = ATCCAATTTT d = ATCTTCTT e = ACTGACC Select a as the center of the star

  22. Star Alignments • Align a with b a with c a with d a with e

  23. Star Alignments • a with b: ATTGCCATT / ATGGCCATT • a with c: ATTGCCATT-- / ATC-CAATTTT • a with d: ATTGCCATT / ATCTTC-TT • a with e: ATTGCCATT / ACTGACC--

  24. Star Alignments • Combine results • ATTGCCATT-- • ATGGCCATT-- • ATC-CAATTTT • ATCTTC-TT-- • ACTGACC----
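
The star procedure on slides 20 to 24 could be sketched as below: align the center with each other sequence, then merge the pairwise alignments so that any space placed in the center is kept in every row. The merge rule and all function names are assumptions of this sketch. Because ties in the pairwise step may be broken differently, the rows it prints need not match the slide exactly.

```python
def global_align(s, t, match=1, mismatch=-1, space=-2):
    """Return one optimal global alignment of s and t as two gapped rows."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return "".join(reversed(row_s)), "".join(reversed(row_t))


def merge(msa, cen_row, oth_row):
    """Add oth_row to msa (row 0 is the center), keeping every center space."""
    cur = msa[0]
    new_rows = [[] for _ in msa]
    new_oth = []
    i = j = 0
    while i < len(cur) or j < len(cen_row):
        gap_cur = i < len(cur) and cur[i] == "-"
        gap_new = j < len(cen_row) and cen_row[j] == "-"
        if gap_cur and not gap_new:
            # Space already in the center: the new sequence gets a space too.
            for r, row in enumerate(msa):
                new_rows[r].append(row[i])
            new_oth.append("-")
            i += 1
        elif gap_new and not gap_cur:
            # New space in the center: all existing rows get a space.
            for r in range(len(msa)):
                new_rows[r].append("-")
            new_oth.append(oth_row[j])
            j += 1
        else:
            # Same center character (or a space in both): copy the column.
            for r, row in enumerate(msa):
                new_rows[r].append(row[i])
            new_oth.append(oth_row[j])
            i += 1
            j += 1
    return ["".join(r) for r in new_rows] + ["".join(new_oth)]


def star_alignment(center, others):
    """Build a multiple alignment from pairwise alignments with the center."""
    msa = [center]
    for seq in others:
        cen_row, oth_row = global_align(center, seq)
        msa = merge(msa, cen_row, oth_row)
    return msa


seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
for row in star_alignment(seqs[0], seqs[1:]):
    print(row)
```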

  25. Database Search • Databases exist for searching and comparing protein and DNA sequences • The methods described work, but may take too long and be impractical for searching large databases • Novel and faster methods have been developed

  26. PAM Matrix • When scoring protein sequences, the +1,-1,-2 may not be sufficient • Amino acids have properties that influence the likelihood that they will be substituted in an evolutionary scenario

  27. PAM Matrix • Point Accepted Mutations • A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart • A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart

  28. PAM Matrix • Markovian in nature • Need the probability of occurrence for each amino acid • Probability transition matrix • Score matrix
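
The Markovian construction can be illustrated on a toy scale. Everything numeric below is made up: a 3-letter alphabet stands in for the 20 amino acids, `M1` is a hypothetical 1-PAM transition matrix, and `FREQS` are hypothetical occurrence probabilities. The sketch assumes the k-PAM transition matrix is the k-th power of the 1-PAM matrix and that scores are log-odds of the form 10 * log10(M^k[a][b] / p_b).

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, k):
    """Return m raised to the k-th power (k >= 0) by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def score_matrix(m1, freqs, k):
    """Log-odds scores for sequences k units of evolution apart."""
    mk = mat_pow(m1, k)
    size = len(freqs)
    return [[10 * math.log10(mk[a][b] / freqs[b]) for b in range(size)]
            for a in range(size)]


# Toy data only: each letter stays unchanged with probability 0.99 per unit.
M1 = [[0.99, 0.005, 0.005],
      [0.005, 0.99, 0.005],
      [0.005, 0.005, 0.99]]
FREQS = [1 / 3, 1 / 3, 1 / 3]

for k in (1, 250):
    print(k, [round(v, 2) for v in score_matrix(M1, FREQS, k)[0]])
```

In this toy model the diagonal scores shrink and the off-diagonal scores rise toward 0 as k grows, reflecting that substitutions become less surprising between distant sequences.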

  29. BLAST • Most frequently used programs to search sequence databases • Acronym for Basic Local Alignment Search Tool • Returns a list of high scoring segment pairs between the query sequence and sequences in the database • http://www.ncbi.nlm.nih.gov

  30. FAST • Another family of programs for sequence database search • http://www.rcsb.org/pdb/index.html • BLAST and FAST use PAM matrices
