300 likes | 588 Views
Programming for Engineers in Python. Recitation 12. Plan. Dynamic Programming Coin Change problem Longest Common Subsequence Application to Bioinformatics. Teaching Survey . Please answer the teaching survey: https://www.ims.tau.ac.il/Tal / This will help us to improve the course
E N D
Programming for Engineers in Python Recitation 12
Plan • Dynamic Programming • Coin Change problem • Longest Common Subsequence • Application to Bioinformatics
Teaching Survey • Please answer the teaching survey: https://www.ims.tau.ac.il/Tal/ • This will help us to improve the course • Deadline: 4.2.12
Coin Change Problem • What is the smallest number of coins I can use to make exact change? • Greedy solution: pick the largest coin first, until you reach the change needed • In the US currency this works well: • Give change for 30 cents if you’ve got 1, 5, 10, and 25 cent coins: • 25 + 5 → 2 coins http://jeremykun.files.wordpress.com/2012/01/coins.jpg
The Sin of Greediness • What if you don’t have 5 cent coins? • You got 1, 10, and 25 • Greedy solution: 25+1+1+1+1+1 → 6 coins • But a better solution is: 10+10+10 → 3 coins! • So the greedy approach isn’t optimal The Seven Deadly Sins and the Four Last Things by Hieronymus Bosch http://en.wikipedia.org/wiki/File:Boschsevendeadlysins.jpg
Recursive Solution • Reminder – find the minimal # of coins needed to give exact change with coins of specified values • Assume that we can use 1 cent coins so there is always some solution • Denote our coin list by c1, c2, …, ck(c1=1) • k is the # of coins values we can use • Denote the change required by n • In the previous example: • n=30, k=3, c1=1, c2=10, c3=25
Recursive Solution • Recursion Base: • If n=0 then we need 0 coins • If k=1, c1=1, so we need n coins • Recursion Step: • If n<ck we can’t use ck→ We solve for n and c1,…,ck-1 • Otherwise, we can either use ckor not use ck • If we use ck → we solve for n-ckand c1,…,ck • If we don’t use ck→ we solve for n and c1,…,ck-1
Recursion Solution defcoins_change_rec( cents_needed, coin_values): ifcents_needed <= 0: # base 1 return 0 eliflen(coin_values) == 1: # base 2 returncents_needed# assume that coin_values[0]==1 elifcoin_values[-1] > cents_needed: # step 1 returncoins_change_rec( cents_needed, coin_values[:-1]) else: # step 2 s1 = coins_change_rec( cents_needed, coin_values[:-1] ) s2 = coins_change_rec( cents_needed-coin_values[-1], coin_values) returnmin(s1, s2+1) coins_rec.py
Repeated calls • We count how many times we call the recursive function for each set of arguments: calls = {} defcoins_change_rec(cents_needed, coin_values): global calls calls[(cents_needed, coin_values)] = calls.get( (cents_needed, coin_values) , 0) + 1 … >>> print'result', coins_change_rec(30, (1,5,10,25)) result 2 >>> print'max calls',max(calls.values()) max calls 4
Dynamic Programing - Memoization • We want to store the values of calculation so we don’t repeat them • We create a table called mem • # of columns: # of cents needed + 1 • # of rows: # of coin values + 1 • The table is initialized with some illegal value – for example -1: mem = [ [-1 fory inrange(cents_needed+1)] for x inrange(len(coin_values)) ]
Dynamic Programing - Memoization • For each call of the recursive function, we check if mem already has the answer: ifmem[len(coin_values)][cents_needed] == -1: • In case that it doesn’t (the above is True) we calculate it as before, and we store the result, for example: ifcents_needed <= 0: mem[len(coin_values)][cents_needed]= 0 • Eventually we return the value returnmem[len(coin_values)][cents_needed] coins_mem.py
Dynamic Programing - Iteration • Another approach is to first build the entire matrix • This matrix holds the minimal number of coins we need to get change for j cents using the first i coins (c1, c2, …, ci) • The solution will be min_coins[k,n] – the last element in the matrix • This will save us the recursive calls, but will enforce us to calculate all the values apriori • Bottom-up approach vs. the top-down approach of memoization
Dynamic Programming approach • The point of this approach is that we have a recursive formula to break apart a problem to sub problems • Then we can use different approaches to minimize the number of calculations by storing the sub solutions in memory
Bottom up - example matrix • Set n=4 and k=3 (coins are 1, 2, and 3 cents) • Base cases: • how many coins do I need to make change for zero cents? Zero! • So min_coins[i,0]=0 • And how many pennies do I need to make j cents? Exactly j (we assumed we can use pennies) • So min_coins[0,j]=j • So the base cases give us: • Next – the recursion step
Bottom up - example matrix • For particular choice of i,j (but not i=0 or j=0) • To determine min_coins[i,j] – the minimum # of coins to get exact change of j using the first i coins • We can use the coin ci and add +1 to min_coins[i,j-ci] (only valid if j>ci) • We can decide not to use ci, therefore to use only c0 ,..,ci-1, and therefore min_coins[i-1,j] . • So which way do we choose? • The one with the least coins! min_coins[i,j] = min(min_coins[i,j-ci] +1, min_coins[i-1,j])
Example matrix – recursion step • Set n=4 and k=3 (coins are 1, 2, and 3 cents) • So the base cases give us: • M(1,1)=1 • M(1,2)=1 • M(1,3)=min(M(1,1)+1,M(0,3))=min(2,2)=2 • M(1,4)=min(M(1,2)+1, M(0,4))=min(2,4)=2 • etc… coins_matrix.py The code for the matrix solution and the idea is from http://jeremykun.wordpress.com/2012/01/12/a-spoonful-of-python/
Longest Common Subsequence • Given two sequences (strings/lists) we want to find the longest common subsequence • Definition – subsequence: B is a subsequence of A if B can be derived from A by removing elements from A • Examples • [2,4,6] is a subsequence of [1,2,3,4,5,6] • [6,4,2] is NOT a subsequence of [1,2,3,4,5,6] • ‘is’ is a subsequence of ‘distance’ • ‘nice’ is NOT a subsequence of ‘distance’
Longest Common Subsequence • Given two subsequences (strings or lists) we want to find the longest common subsequence: • Example for a LCS: • Sequence 1: HUMAN • Sequence 2: CHIMPANZEE • Applications include: • BioInformatics (next up) • Version Control http://wordaligned.org/articles/longest-common-subsequence
The DNA • Our biological blue-print • A sequence made of four bases – A, G, C, T • Double strand: • A connects to T • G connects to C • Every triplet encodes for an amino-acid • Example: GAG→Glutamate • A chain of amino-acids is a protein – the biological machine! http://sips.inesc-id.pt/~nfvr/msc_theses/msc09b/
Longest common subsequence • The DNA changes: • Mutation: A→G, C→T, etc. • Insertion: AGC→ ATGC • Deletion: AGC → A‒C • Given two non-identical sequences, we want to find the parts that are common • So we can say how different they are • Which DNA is more similar to ours? The cat’s or the dog’s? http://palscience.com/wp-content/uploads/2010/09/DNA_with_mutation.jpg
Recursion • An LCS of two sequences can be built from the LCSes of prefixes of these sequences • Denote the sequences seq1 and seq2 • Base – check if either sequence is empty: Iflen(seq1) == 0 orlen(seq2) == 0: return [ ] • Step – build solution from shorter sequences: If seq1[-1] == seq2[-1]: returnlcs (seq1[:-1],seq2[:-1]) + [ seq1[-1] ] else: returnmax(lcs(seq1[:-1],seq2), lcs(seq1,seq2[:-1]), key = len) lcs_rec.py
Wasteful Recursion • For the inputs “MAN” and “PIG”, the calls are: (1, ('', 'PIG'))(1, ('M', 'PIG'))(1, ('MA', 'PIG'))(1, ('MAN', ''))(1, ('MAN', 'P'))(1, ('MAN', 'PI'))(1, ('MAN', 'PIG'))(2, ('MA', 'PI'))(3, ('', 'PI'))(3, ('M', 'PI'))(3, ('MA', ''))(3, ('MA', 'P'))(6, ('', 'P'))(6, ('M', ''))(6, ('M', 'P')) • 24 redundant calls! http://wordaligned.org/articles/longest-common-subsequence
Wasteful Recursion • When comparing longer sequences with a small number of letters the problem is worse • For example, DNA sequences are composed of A, G, T and C, and are long • For lcs('ACCGGTCGAGTGCGCGGAAGCCGGCCGAA', 'GTCGTTCGGAATGCCGTTGCTCTGTAAA') we get an absurd: (('', 'GT'), 13,182,769)(('A', 'GT'), 13,182,769)(('A', 'G'), 24,853,152)(('', 'G'), 24,853,152)(('A', ''), 24,853,152) http://blog.oncofertility.northwestern.edu/wp-content/uploads/2010/07/DNA-sequence.jpg
DP Saves the Day • We saw the overlapping sub problems emerge – comparing the same sequences over and over again • We saw how we can find the solution from solution of sub problems – a property we called optimal substructure • Therefore we will apply a dynamic programming approach • Start with top-down approach - memoization
Memoization • We save results of function calls to refrain from calculating them again deflcs_mem( seq1, seq2, mem=None ): ifnotmem: mem = { } key = (len(seq1), len(seq2)) # tuples are immutable if key notinmem: # result not saved yet iflen(seq1) == 0 or len(seq2) == 0: mem[key] = [ ] else: if seq1[-1] == seq2[-1]: mem[key]= lcs_mem(seq1[:-1], seq2[:-1], mem) + [ seq1[-1] ] else: mem[key]= max(lcs_mem(seq1[:-1], seq2 ,mem), lcs_mem(seq1, seq2[:-1], mem), key=len) returnmem[key]
“maximum recursion depth exceeded” • We want to use our memoized LCS algorithm on two long DNA sequences: >>> fromrandom import choice >>> defbase(): … return choice('AGCT') >>> seq1 = str([base() for x inrange(10000)]) >>> seq2 = str([base() for x inrange(10000)]) >>>printlcs(seq1, seq2) RuntimeError: maximum recursion depth exceeded in cmp • We need a different algorithm…
DNA Sequence Alignment • Needleman-Wunsch DP Algorithm: • Python package: http://pypi.python.org/pypi/nwalign • On-line example: http://alggen.lsi.upc.es/docencia/ember/frame-ember.html • Code: needleman_wunsch_algorithm.py • Lecture videos from TAU: • http://video.tau.ac.il/index.php?option=com_videos&view=video&id=4168&Itemid=53