Class 3: Sequence similarity
Motivation • Same gene, or similar gene • Suffix of A similar to prefix of B? • Suffix of A similar to prefix of B..Z? • Longest similar substring of A, B • Longest similar substring of A, B..Z • For each, How big? How similar?
Define alignment • Align these two sequences optimally:
GACGGATT
GATCGGTT
• Define precisely what an alignment is
Definition of alignment • Insert spaces so that each letter lines up with a letter or with a space:
GA-CGGATT
GATCGG-TT
• Don’t allow a space to line up with a space • Allow spaces even at the beginning and end:
GCAT-
-CATG
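The definition above can be checked mechanically. The following is a minimal Python sketch, assuming "-" stands for a space; the function name `is_alignment` is hypothetical and not from the slides.

```python
def is_alignment(row_s, row_t, s, t, space="-"):
    """Return True if (row_s, row_t) is a valid alignment of s and t."""
    # Both rows must have the same number of columns.
    if len(row_s) != len(row_t):
        return False
    # No column may pair a space with a space.
    if any(x == space and y == space for x, y in zip(row_s, row_t)):
        return False
    # Deleting the spaces must give back the original sequences.
    return row_s.replace(space, "") == s and row_t.replace(space, "") == t


print(is_alignment("GA-CGGATT", "GATCGG-TT", "GACGGATT", "GATCGGTT"))  # True
print(is_alignment("GCAT-", "-CATG", "GCAT", "CATG"))                  # True
```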
Define similarity • Given an alignment, compute a similarity score • Three possibilities for each column: letter-letter match, letter-letter mismatch, letter-space mismatch
Optimal alignment • Create a score function • Conventionally: +1 bonus for a match, -1 penalty for a letter-letter mismatch, -2 penalty for a letter-space mismatch • The optimal alignment is the one with the highest total score
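As an illustration of the score function, here is a small Python sketch that sums the column scores of a given alignment. It assumes "-" stands for a space and uses the conventional +1/-1/-2 values as defaults; the name `score_alignment` is hypothetical.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, space=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    assert len(row_s) == len(row_t), "rows must have the same number of columns"
    total = 0
    for x, y in zip(row_s, row_t):
        if x == "-" or y == "-":
            total += space      # letter-space column
        elif x == y:
            total += match      # letter-letter match
        else:
            total += mismatch   # letter-letter mismatch
    return total


print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 7 matches, 2 spaces: 3
print(score_alignment("GCAT-", "-CATG"))          # 3 matches, 2 spaces: -1
```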
Dynamic programming solution • Given sequences s,t of length m,n • Strategy: build up optimal alignment of prefixes • Base case? • Recurrence relation?
Recurrence • Given optimal alignments of all shorter pairs of prefixes of s, t, find the optimal alignment of s[1..i], t[1..j] • Three possibilities for the last column: • extend s by a letter, t by a space • extend s by a letter, t by a letter • extend s by a space, t by a letter
Tiny instance -- AGC, AAAC (first row and first column initialized, remaining cells still to be filled)
         A    A    A    C
    0   -2   -4   -6   -8
A  -2
G  -4
C  -6
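The base case and recurrence can be sketched in Python as follows. This is a minimal sketch assuming the conventional +1/-1/-2 scores; the name `similarity_table` is hypothetical. Entry a[i][j] holds the best score for aligning the first i letters of s with the first j letters of t.

```python
def similarity_table(s, t, match=1, mismatch=-1, space=-2):
    """Fill the DP table of optimal prefix-alignment scores, row by row."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base case: a prefix aligned against the empty string is all spaces.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,    # s letter over a space
                a[i - 1][j - 1] + p,    # s letter over t letter
                a[i][j - 1] + space,    # space over t letter
            )
    return a


for row in similarity_table("AGC", "AAAC"):
    print(row)
# [0, -2, -4, -6, -8]
# [-2, 1, -1, -3, -5]
# [-4, -1, 0, -2, -4]
# [-6, -3, -2, -1, -1]
```

The bottom-right entry, -1, is the score of an optimal alignment of AGC and AAAC.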
Some dp details • What is a good order to fill the array? • How do you recover the opt alignment? • What do you do about ties? • What is the space complexity of this algorithm? • What is the time complexity of this algorithm?
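One way to explore these questions is a traceback sketch in Python. It assumes the +1/-1/-2 scores, fills the table row by row, and breaks ties with a fixed preference (letter-letter first, then a space in t, then a space in s); that tie rule is an arbitrary choice made here, and the name `align` is hypothetical. This version stores the whole table, so it uses O(mn) time and O(mn) space.

```python
def align(s, t, match=1, mismatch=-1, space=-2):
    """Return (score, row_s, row_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)

    # Traceback: walk from a[m][n] back to a[0][0], re-deriving each choice.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + space:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


print(align("AGC", "AAAC"))  # (-1, '-AGC', 'AAAC')
```

A different tie-breaking order may return a different alignment with the same score.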
The gap penalty • Model above assumes two gaps of size 1 are equivalent to one gap of size 2 • Is this realistic? Why or why not?
General gap penalties • Alignments can no longer be scored as the sum of their parts • They still are the sum of blocks with one matched letter or one gap each • Blocks are: matched letters, s-gap, t-gap
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
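The block view can be illustrated with a short Python sketch that scores an alignment block by block, charging each run of spaces through a gap function w(k). The names `score_blocks` and `column_kind`, and the two example gap functions, are hypothetical; letter columns use +1/-1 as before.

```python
from itertools import groupby


def column_kind(x, y):
    """Classify a column as an s-gap, a t-gap, or a pair of letters."""
    if x == "-":
        return "s-gap"
    if y == "-":
        return "t-gap"
    return "letters"


def score_blocks(row_s, row_t, gap_penalty, match=1, mismatch=-1):
    """Score an alignment where a gap of length k costs gap_penalty(k)."""
    total = 0
    columns = zip(row_s, row_t)
    for kind, group in groupby(columns, key=lambda c: column_kind(*c)):
        group = list(group)
        if kind == "letters":
            total += sum(match if x == y else mismatch for x, y in group)
        else:
            total -= gap_penalty(len(group))  # one charge per whole gap
    return total


row_s = "AAC---AGATAAC"
row_t = "ACTCGGT---AAT"
print(score_blocks(row_s, row_t, lambda k: 2 * k))  # linear gaps: -13
print(score_blocks(row_s, row_t, lambda k: 4 + k))  # gap opening cost 4: -15
```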
DP for general gaps • Requires three arrays, one for each block type • Time complexity is cubic • This is expensive at best, prohibitive for large problems • See Setubal/Meidanis 3.3.2 for details
Affine gap penalty • Charge h for opening each gap, plus g * len(gap) for its length, so a gap of length k costs h + g*k • The DP still has quadratic time complexity! • See Setubal/Meidanis
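A minimal Python sketch of a quadratic-time DP for affine gaps follows. It is an illustration under assumptions, not necessarily the textbook's exact formulation: it keeps three arrays, one per block type that can end an alignment (a: letter over letter, b: space in s, c: space in t), and charges h + g the first time a gap is entered and g for each further space. The name `affine_similarity` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_similarity(s, t, h, g, match=1, mismatch=-1):
    """Best global alignment score when a gap of length k costs h + g*k."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends letter over letter
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a space in s
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a space in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)   # t prefix against one long gap in s
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)   # s prefix against one long gap in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (h + g),   # open a new gap in s
                          b[i][j - 1] - g,         # extend the current gap
                          c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g),   # open a new gap in t
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)         # extend the current gap
    return max(a[m][n], b[m][n], c[m][n])


print(affine_similarity("AGC", "AAAC", h=0, g=2))  # same as linear gaps: -1
print(affine_similarity("AAAA", "AA", h=3, g=1))   # one gap of length 2: -3
```

With h = 0 the penalty reduces to the linear model, which gives a quick sanity check against the simple DP.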
Point accepted mutations • Some mutations are more likely than others • In proteins, some amino acids are more similar than others (size, charge, hydrophobicity) • A point accepted mutation matrix is a table with the probability of each transition in a fixed time
PAM matrices • Each row of the matrix sums to 1 (an amino acid a either stays the same or changes into some b) • A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change
Scoring matrix • Consider aligned letters a, b • Pr(b is a mutation of a) = M_ab • Pr(b is a random occurrence) = p_b • Score(a,b) = 10 log(M_ab / p_b)
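The score formula can be tried out with a tiny Python sketch. It assumes the logarithm is base 10, which the slide does not state, and the probabilities in the example are made-up numbers for illustration, not entries of a real PAM matrix. The name `log_odds_score` is hypothetical.

```python
import math


def log_odds_score(m_ab, p_b):
    """Score(a,b) = 10 * log10(M_ab / p_b); base 10 is an assumption here."""
    return 10 * math.log10(m_ab / p_b)


# Hypothetical values: b follows a twice as often as chance would predict.
print(round(log_odds_score(0.10, 0.05), 2))  # 3.01
# A pair that is exactly as likely as chance scores zero.
print(log_odds_score(0.05, 0.05))            # 0.0
```

Pairs that occur more often by mutation than by chance get positive scores, pairs that occur less often get negative scores, and because logarithms turn products into sums, adding column scores corresponds to multiplying the per-column ratios.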
Blast • Basic Local Alignment Search Tool • Def: a ‘segment’ is a substring (contiguous, without gaps) • Def: a ‘segment pair’ is two segments of equal length • Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs
What Blast does • Input: a PAM matrix, a database of sequences B, a query sequence A, a threshold S • Output: all segment pairs (A, B) with score > S
How Blast works • Compile short, high-scoring strings (words) • Search for hits -- each hit gives a seed • Extend seeds
Blast on proteins • Words are w-mers which score at least T against some w-mer of A • Use hashing or a DFA (deterministic finite automaton) to search for hits • Extend each seed until a heuristically determined limit is reached
Blast on nucleic acids • Words are w-mers in query A • Letters compressed, four to a byte • Filter database B for very common words to avoid false positives • Extend seeds as in proteins
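The seed-and-extend idea for nucleic acids can be sketched in Python as below. This is a simplified illustration, not the actual Blast implementation: it hashes the w-mers of the query in a dictionary, reports exact word hits in one database sequence, and extends each hit without gaps until the running score falls `drop` below the best seen. The names `find_seeds` and `extend_seed`, the +1/-1 scores, and the `drop` rule are all assumptions made for this example.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Yield (i, j) with query[i:i+w] == subject[j:j+w], using a hash index."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            yield i, j


def extend_seed(query, subject, i, j, w, match=1, mismatch=-1, drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_segment, subject_segment) for the best extension.
    """
    score = best = w * match
    lo, hi = i, i + w                      # best query segment is [lo, hi)
    qi, sj = i + w, j + w                  # extend to the right
    while qi < len(query) and sj < len(subject) and best - score < drop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, hi = score, qi
    score = best                           # restart from the best right end
    qi, sj = i - 1, j - 1                  # extend to the left
    while qi >= 0 and sj >= 0 and best - score < drop:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, lo = score, qi
        qi -= 1
        sj -= 1
    start = j - (i - lo)                   # matching start in the subject
    return best, query[lo:hi], subject[start:start + (hi - lo)]


query, subject = "AAAACCCC", "AAAAGCCC"
print(list(find_seeds(query, subject, 4)))      # [(0, 0)]
print(extend_seed(query, subject, 0, 0, 4))     # (6, 'AAAACCCC', 'AAAAGCCC')
```

A caller would keep only the extended segment pairs whose score exceeds the threshold S.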
What does Blast give you? • Efficiency • A rigorous statistical theory which gives the probability of a segment pair occurring by chance
Homework • Given sequences s,t of length m,n, how many alignments do they have? • Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.