BIC I, Week 4 lectures

BIC I, Week 4 lectures Rhys Price Jones and Anne Haake Rochester Institute of Technology rpjavp@rit.edu, arh@it.rit.edu

Overview of the need for Dynamic Programming • Consider Fibonacci • The obvious algorithm is elegant, easily derived from the definition, and clearly correct.(define fib (lambda (n) (if (<= n 1) 1 (+ (fib (- n 2)) (fib (- n 1)))))) • But it’s hopelessly inefficient • Why? • Because it makes repeated recursive calls with the same argument

The Traditional Solution • Change the order in which the computations are performed • Change the logic of the program • So that it works “bottom up” instead of “top down” • Fill an array with calculated values starting with (fib 0), then (fib 1) then (fib 2), etc. • You can do it manually, as in fib.ss • That is dynamic programming! • The main problem is that it requires thought and programming and hence may introduce error.

It’s not just Fibonacci • Many programs “write themselves” from the specification of the problem. • When that happens, we are extremely pleased • Sadly, the resulting program is often inefficient • But dynamic programming is a technique to make it efficient again.

Memo-izing • Redefine the function calling mechanism so that: • We first check to see if we’ve made that calculation before • If no, go ahead and compute it but store the result in a hash table • If yes, look up the previously computed value in the hash table • Do it once • Inefficient code becomes efficient automatically with no re-programming

Another Example • Pascal’s triangle • Each entry is the sum of its parents • Cn,k = Cn-1,k-1 + Cn-1,k • C0,k = Cn,0 = 1 • Leading to program • Runs really slowly • Replace lambda by memolambda

Review of Pattern Matching • Does CGGA appear within the sequence ATCGCGTAACGGAGATAGGCTTA ? • More generally, where does pattern p (length n) appear within text t (length m) • Boyer-Moore, or Knuth-Morris-Pratt give O(m+n) search • If p is going to change a lot and t stay the same, suffix tree can be built in O(m), each search is then O(n) • If p is stable and there are lots of different t, virtual machine can be built in O(n) and then each search is O(m)

Build a Virtual Machine • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

First Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Second Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Third Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Fourth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Fifth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Sixth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Seventh Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Eighth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Ninth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Tenth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Eleventh Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Twelfth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Thirteenth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Fourteenth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Fifteenth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Sixteenth Step • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

17th – 23rd Steps • CGGA ACGT C AGT C G G A C C AT AT GT ATCGCGTAACGGAGATAGGCTTA

Pattern Matching – Conclusion • Exact pattern matching is easy • Often the naive algorithm is good enough • Fast algorithms are readily available • Sadly, not much use for biological tasks

Why not? • What’s the difference? • Mutation • Insertion/deletion gaps • We need an inexact way to compare two (or more) biological sequences

Pattern Matching vs. Sequence Alignment • In the CS world, we talk of comparing strings, or matching patterns of characters within strings • For biological applications, we talk of comparing sequences, or aligning sequences of nucleotides (or amino acids) to each other

Evolutionary Relatedness • Consider ACCGT and CACGT • How likely is it that they are “related”? • Possible alignments: • ACCGT AC-CGTXX||| -|-|||CACGT -CACGT • Which is better?

It Depends • ACCGT AC-CGTXX||| -|-|||CACGT -CACGT • Scoring 2 for a match, -2 for a mismatch, and –1 for a gap, 2 versus 6 • Scoring 2 for a match, 0 for a mismatch and –2 for a gap, 6 versus 4 • And we haven’t even begun to consider experimental evidence that might cause us to rank some mutations better than others!

Distance measure • Score 0 for a match • 1 for a mismatch or gap • Low score best! • ACCGT AC-CGTXX||| -|-|||CACGT -CACGT • Now it’s 2 versus 2

Global alignment • For two sequences • - A C C A C C-ACACC • Use the scoring scheme to fill in the table, starting with first row and first column

First entries • Using the distance measure • - A C C A C C- 0 1 2 3 4 5 6A 1 C 2A 3C 4A 5 • Each nucleotide<->gap costs 1 point

Extending inwards • Extending the distance measure • - A C C A C C- 0 1 2 3 4 5 6A 1 0 1 2 3 4 5 C 2 1 A 3 2 C 4 3 A 5 4 • Extending from North or West costs 1 point, from NW costs 0 (match) or 1 (mismatch) • Pick cheapest of the three

More extension • - A C C A C C- 0 1 2 3 4 5 6A 1 0 1 2 3 4 5 C 2 1 0 1 2 3 4 A 3 2 1 1 C 4 3 2 1 A 5 4 3 2 • mi,j = min (mi,j-1+gmi-1,j+gmi-1,j-1+cij) • where cij = 0 for a match, 1 for a mismatch

Getting there... • - A C C A C C- 0 1 2 3 4 5 6A 1 0 1 2 3 4 5 C 2 1 0 1 2 3 4A 3 2 1 1 1 2 3C 4 3 2 1 2 A 5 4 3 2 1 • mi,j = min (mi,j-1+1 mi-1,j+1 mi-1,j-1+cij) • where cij = 0 for a match, 1 for a mismatch

Almost done... • - A C C A C C- 0 1 2 3 4 5 6A 1 0 1 2 3 4 5 C 2 1 0 1 2 3 4A 3 2 1 1 1 2 3C 4 3 2 1 2 1 2A 5 4 3 2 1 2 • mi,j = min (mi,j-1+1 mi-1,j+1 mi-1,j-1+cij) • where cij = 0 for a match, 1 for a mismatch

Finally, we can get a Global alignment • One of the least-cost routes • - A C C A C C- 0 1 2 3 4 5 6A 1 0 1 2 3 4 5 C 2 1 01 2 3 4A 3 2 1 1 1 2 3C 4 3 2 1 2 1 2A 5 4 3 2 1 2 2 • Can you see how this path leads to the alignment • ACCACCAC-ACA

Global alignment program • Distance measure • Runnable program • Dynamic Programming version

Global vs Local Alignment • Global alignment seeks the best alignment between the complete sequence and the complete sequenceA global alignment between GATCCACCA and GTAACACA might be • G-ATCCACCA|-|X|-||-|GTAAC-AC-A • A local alignment is the best alignment between subsequences. A local alignment between GATCCACCA and GTAACACA might be • gATCCACca |X|-||gtAAC-ACa • Best local alignment depends on scoring scheme

Local Alignment • For this demo, we will use a different measure • 2 for a match • -1 for a mismatch, -2 for a gap • Find best match withinG C T C T G C G A A T A GC G T T G A G A T A C T C

The solution • - G C T C T G C G A A T A G - 0 0 0 0 0 0 0 0 0 0 0 0 0 0C 0 0 2 0 2 0 0 2 0 0 0 0 0 0G 0 2 0 1 0 1 2 0 4 2 0 0 0 2T 0 0 1 2 0 2 0 1 2 3 1 2 0 0T 0 0 0 3 1 2 1 0 0 1 2 3 1 0G 0 2 0 1 2 0 4 2 2 0 0 1 2 3A 0 0 1 0 0 1 2 3 1 4 2 0 3 1G 0 2 0 0 0 0 3 1 5 3 3 1 1 5A 0 0 1 0 0 0 1 2 3 75 3 3 3T 0 0 0 3 1 2 0 0 1 5 6 7 5 3A 0 0 0 1 2 0 1 0 0 3 7 5 9 7C 0 0 2 0 3 1 0 3 1 1 5 6 7 8T 0 0 0 4 2 5 3 1 2 0 3 7 5 6C 0 0 2 2 6 4 4 5 3 1 1 5 6 4 • G C T C T G C G A A T A G | | x | | X | | C G T T G A G A - T A C T C

The Program • Has dynamic programming to make it fast! • This is basically Smith-Waterman • Work has been done on different scoring schemes, gap penalties, etc. • Runs in time O(mn)

Exercises • that we will attempt in class: • amend global alignment program to do the “backtracking” needed for the alignment • that will be homework • amend local alignment program to do the “backtracking” needed for the alignment

BIC I, Week 4 lectures