
Class 3: Sequence similarity



  1. Class 3: Sequence similarity

  2. Motivation • Same gene, or similar gene • Suffix of A similar to prefix of B? • Suffix of A similar to prefix of B..Z? • Longest similar substring of A, B • Longest similar substring of A, B..Z • For each, How big? How similar?

  3. Define alignment • Align these two sequences optimally: GACGGATT and GATCGGTT • Define precisely what an alignment is

  4. Definition of alignment • Insert spaces so that the letters line up, or letters align with spaces, e.g. GA-CGGATT over GATCGG-TT • Don’t allow two spaces to line up in the same column • Allow spaces even at beginning and end, e.g. GCAT- over -CATG

  5. Define similarity • Given an alignment, compute a similarity score • Three possibilities for each column: letter-letter match, letter-letter mismatch, letter-space mismatch

  6. Optimal alignment • Create score function • Conventionally: +1 bonus for match, -1 penalty for letter-letter mismatch, -2 penalty for letter-space mismatch
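The column scores above can be applied directly to a given alignment. The following is a minimal sketch, assuming an alignment is written as two equal-length strings with '-' standing for a space; the function name score_alignment is hypothetical and not from the slides.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2  # conventional scores from the slide


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of an alignment written as two rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("two spaces may not line up")
        if a == "-" or b == "-":
            total += SPACE      # letter-space column
        elif a == b:
            total += MATCH      # letter-letter match
        else:
            total += MISMATCH   # letter-letter mismatch
    return total


# Alignment from slide 4: seven matches and two spaces.
print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 3
```

The end-space example from slide 4, GCAT- over -CATG, scores 3 - 4 = -1 under the same scheme.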

  7. Dynamic programming solution • Given sequences s,t of length m,n • Strategy: build up optimal alignment of prefixes • Base case? • Recurrence relation?

  8. Recurrence • Given opt alignment of prefixes of s,t shorter than i,j, find opt of s[1..i], t[1..j] • Three possibilities: • extend s by a letter, t by a space • extend s by a letter, t by a letter • extend s by a space, t by a letter

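  9. Tiny instance -- s = AGC, t = AAAC • First row (empty prefix of s against prefixes of t): 0 -2 -4 -6 -8 • First column (prefixes of s against empty prefix of t): -2 -4 -6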
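The recurrence from slide 8 can be sketched as a table fill. This is an illustrative sketch, assuming the +1/-1/-2 scores from slide 6 and a row-by-row fill order; the name similarity_table is hypothetical. Entry a[i][j] holds the best score for aligning the first i letters of s with the first j letters of t.

```python
def similarity_table(s, t, match=1, mismatch=-1, gap=-2):
    """Return the (len(s)+1) x (len(t)+1) table of prefix similarities."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned against the empty string is all spaces.
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + gap,    # extend s by a letter, t by a space
                a[i - 1][j - 1] + p,  # extend s by a letter, t by a letter
                a[i][j - 1] + gap,    # extend s by a space, t by a letter
            )
    return a


for row in similarity_table("AGC", "AAAC"):
    print(row)
# [0, -2, -4, -6, -8]
# [-2, 1, -1, -3, -5]
# [-4, -1, 0, -2, -4]
# [-6, -3, -2, -1, -1]
```

The bottom-right entry, -1, is the similarity of AGC and AAAC under this scoring.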

  10. Some dp details • What is a good order to fill the array? • How do you recover the opt alignment? • What do you do about ties? • What is the space complexity of this algorithm? • What is the time complexity of this algorithm?
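One way to recover an optimal alignment is to walk back from the bottom-right cell, at each step choosing a neighbour that explains the current value. The sketch below assumes the +1/-1/-2 scores from slide 6 and breaks ties by preferring the diagonal, then a space in t, then a space in s; that tie rule and the name align are assumptions made for illustration. Filling the full table takes time and space proportional to m * n.

```python
def align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, row_for_s, row_for_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + gap, a[i - 1][j - 1] + p, a[i][j - 1] + gap)

    # Traceback: rebuild the alignment from the last column to the first.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:  # letter over letter
                top.append(s[i - 1])
                bottom.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + gap:  # letter of s over a space
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:  # space over a letter of t
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = align("AGC", "AAAC")
print(score)
print(row_s)
print(row_t)
```

A different tie rule may return a different alignment with the same score.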

  11. The gap penalty • Model above assumes two gaps of size 1 are equivalent to one gap of size 2 • Is this realistic? Why or why not?

  12. General gap penalties • Alignments can no longer be scored as the sum of their columns • They are still the sum of the scores of blocks, each holding one matched letter pair or one gap • Blocks are: matched letters, s-gap, t-gap, e.g. A|A|C|---|A|GAT|A|A|C over A|C|T|CGG|T|---|A|A|T

  13. DP for general gaps • Requires three arrays, one for each block type • Time complexity is cubic • This is expensive at best, prohibitive for large problems • See Setubal/Meidanis 3.3.2 for details

  14. Affine gap penalty • Charge h for opening each gap, plus g * len(gap) • This still has quadratic complexity! • See Setubal/Meidanis
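A quadratic-time DP for affine gaps can be sketched with three tables, one per block type, in the spirit of the three-array scheme mentioned on slide 13. This is a sketch under stated assumptions, not the textbook's code: the name affine_similarity, the default values h=3 and g=1, and the +1/-1 letter scores are chosen only for illustration.

```python
NEG_INF = float("-inf")


def affine_similarity(s, t, h=3, g=1, match=1, mismatch=-1):
    """Best global score when a gap of length k costs h + g*k."""
    m, n = len(s), len(t)
    # a: alignment ends with a letter pair
    # b: ends with a space in s over a letter of t
    # c: ends with a letter of s over a space in t
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Opening a gap costs h + g; extending the same kind of gap costs g.
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])


print(affine_similarity("AAAA", "AA"))            # one gap of length 2
print(affine_similarity("AGC", "AAAC", h=0, g=2))  # h = 0 reduces to the linear model
```

Each cell does constant work across the three tables, which is why the running time stays proportional to m * n.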

  15. Point accepted mutations • Some mutations are more likely than others • In proteins, some amino acids are more similar than others (size, charge, hydrophobicity) • A point accepted mutation matrix is a table with the probability of each transition in a fixed time

  16. PAM matrices • The entire matrix sums to 1 • A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change

  17. Scoring matrix • Consider aligned letters a,b • Pr(b is a mutation of a) = M_ab • Pr(b is a random occurrence) = p_b • Score(a,b) = 10 log10(M_ab / p_b)
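The log-odds score above is simple to compute once M_ab and p_b are known. The sketch below uses made-up probabilities purely for illustration (they are not taken from any real PAM matrix), and the name log_odds_score is hypothetical.

```python
import math


def log_odds_score(m_ab: float, p_b: float) -> float:
    """10 * log10 of (mutation probability / background frequency)."""
    return 10 * math.log10(m_ab / p_b)


# Hypothetical values: b follows a ten times more often than chance predicts.
print(log_odds_score(0.10, 0.01))   # positive: pairing is favoured
# Hypothetical values: b follows a less often than chance predicts.
print(log_odds_score(0.02, 0.05))   # negative: pairing is penalised
```

Because the score is a logarithm of a ratio, adding the scores of aligned letter pairs corresponds to multiplying the underlying ratios.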

  18. Blast • Basic Local Alignment Search Tool • Def: ‘segment’ is a subsequence (without gaps) • Def: ‘segment pair’ is two segments of equal length • Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs

  19. What Blast does • Input: • a PAM matrix • a database of sequences B • a query sequence A • a threshold S • Output: • all segment pairs (A,B) with score > S

  20. How Blast works • Compile short, high-scoring strings (words) • Search for hits -- each hit gives a seed • Extend seeds

  21. Blast on proteins • Words are w-mers which score at least T against A • Use hashing or a DFA to search for hits • Extend seed until heuristically determined limit is reached
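The word list for proteins can be sketched by brute force: for every w-mer of the query, keep every candidate word whose score against it is at least T. Everything specific here is an assumption made for illustration: the function names, the four-letter alphabet, and the toy scoring rule (3 for identity, 1 for two acidic or two basic residues, -1 otherwise) stand in for a real PAM matrix.

```python
from itertools import product


def toy_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a PAM entry; not real PAM values."""
    if a == b:
        return 3
    groups = [{"D", "E"}, {"K", "R"}]  # acidic pair, basic pair
    return 1 if any(a in grp and b in grp for grp in groups) else -1


def neighborhood_words(query, w, T, alphabet, score=toy_score):
    """Map each word scoring >= T against some query w-mer to its query offsets."""
    words = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(score(a, b) for a, b in zip(qword, cand)) >= T:
                words.setdefault("".join(cand), []).append(i)
    return words


print(sorted(neighborhood_words("KDE", w=3, T=7, alphabet="DEKR")))
# ['KDD', 'KDE', 'KEE', 'RDE']
```

Enumerating all candidates grows as len(alphabet) ** w, so this only suits small w; the resulting dictionary plays the role of the hash table used to find hits.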

  22. Blast on nucleic acids • Words are w-mers in query A • Letters compressed, four to a byte • Filter database B for very common words to avoid false positives • Extend seeds as in proteins
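The seed-and-extend idea for nucleic acids can be sketched in a few lines: index the query's w-mers, scan a database sequence for exact hits, then extend each hit in both directions without gaps. This is a toy illustration, not Blast's actual implementation. The +1/-1 letter scores, the drop parameter (stop extending once the running score falls that far below the best seen, standing in for the heuristically determined limit), and all function names are assumptions.

```python
from collections import defaultdict


def word_index(query, w):
    """Map every w-mer of the query to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def _grow(pairs, match, mismatch, drop):
    """Best score gain and length when extending along the given letter pairs."""
    best = run = best_len = 0
    for k, (a, b) in enumerate(pairs, start=1):
        run += match if a == b else mismatch
        if run > best:
            best, best_len = run, k
        elif best - run >= drop:
            break
    return best, best_len


def search(query, subject, w=4, threshold=6, match=1, mismatch=-1, drop=3):
    """Return sorted (score, query_start, subject_start, length) with score > threshold."""
    index = word_index(query, w)
    found = set()
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], ()):
            gain_r, right = _grow(zip(query[qi + w:], subject[si + w:]),
                                  match, mismatch, drop)
            gain_l, left = _grow(zip(reversed(query[:qi]), reversed(subject[:si])),
                                 match, mismatch, drop)
            score = w * match + gain_l + gain_r
            if score > threshold:
                found.add((score, qi - left, si - left, left + w + right))
    return sorted(found)


print(search("ACGTACGT", "TTACGTACGTTT"))  # [(8, 0, 2, 8)]
```

Several seeds here extend to the same segment pair, which is why the results are collected in a set before sorting.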

  23. What does Blast give you? • Efficiency • A rigorous statistical theory which gives the probability of a segment pair occurring by chance

  24. Homework • Given sequences s,t of length m,n, how many alignments do they have? • Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.
