Computer Science Background for Biologists
What is an algorithm • A well-defined computational procedure that takes some values as input and produces some values as output. • We are interested in the correctness and efficiency of computer algorithms. • We seek to extract clean, well-defined problems from the typically messy "real" problem to gain insight into it.
Example of an algorithm (sorting) • Input: A sequence of n numbers (a1, a2, …, an). • Output: A permutation (a'1, a'2, …, a'n) of the input sequence such that a'1 ≤ a'2 ≤ … ≤ a'n.
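One procedure that meets this input/output specification is insertion sort. The following is a minimal sketch, not a method taken from the slides; the function name `insertion_sort` is chosen here for illustration.

```python
def insertion_sort(values):
    """Return a new list with the same elements in nondecreasing order."""
    result = list(values)  # copy so the caller's input is left untouched
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot to the right to make room for current.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```

The output is a permutation of the input because elements are only moved, never created or dropped.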
Exact String Matching • Input: A text string T, where |T| = n, and a pattern string P, where |P| = m. • Output: An index i such that T[i+k-1] = P[k] for all 1 ≤ k ≤ m, i.e. a position at which P occurs as a substring of T.
Exact String Matching • Brute force search algorithm
for i = 1 to n-m+1 do
    j = 1;
    while (j <= m) and (T[i+j-1] == P[j]) do
        j = j+1;
    if (j > m) then print "pattern at position ", i;
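A runnable version of the brute force search might look like the sketch below. Python strings are 0-based, so the loop indices are shifted by one and the result is converted back to the 1-based positions used on the slide. The name `brute_force_match` is an assumption made for this example.

```python
def brute_force_match(text, pattern):
    """Return every 1-based position i at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):          # 0-based alignment of the pattern
        j = 0
        # Test the bound first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            positions.append(i + 1)     # report 1-based, as on the slide
    return positions


print(brute_force_match("abracadabra", "abra"))  # [1, 8]
```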
Algorithm Efficiency • Time efficiency of algorithms • Space efficiency of algorithms
Machine Independent Analysis • We assume that every basic operation takes constant time. • Example basic operations: addition, subtraction, multiplication, memory access. • The time efficiency of an algorithm is the number of basic operations it performs. • We do not distinguish between the basic operations.
Time efficiency • In fact, we will not worry about the exact values, but will look at "broad classes" of values. • Let there be n inputs. • If one algorithm needs n basic operations and another needs 2n basic operations, we consider them to be in the same efficiency category. • However, we do distinguish between exp(n), n, and log(n).
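Example: Time Complexity • The brute force string matching algorithm might use only about n steps if we are lucky (every alignment fails on its first comparison). • We might need about n*m steps if we are unlucky (every alignment fails only on its last comparison).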
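The gap between the lucky and unlucky cases can be seen by counting character comparisons directly. The sketch below instruments a brute force matcher; `count_comparisons` and the test strings are illustrative choices, not part of the original slides.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by brute force matching."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break               # mismatch: move on to the next alignment
            j += 1
    return comparisons


n, m = 20, 4
# Lucky: the first character never matches, so each alignment costs 1 step.
lucky = count_comparisons("a" * n, "b" * m)
# Unlucky: each alignment matches m-1 characters and fails on the last one.
unlucky = count_comparisons("a" * n, "a" * (m - 1) + "b")
print(lucky, unlucky)  # 17 68
```

With n = 20 and m = 4 there are n-m+1 = 17 alignments, giving 17 comparisons in the lucky case and 17 * 4 = 68 in the unlucky one.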
Order of Increase • [Figure: growth curves of exp(n), n, and log n] • We care about how quickly the running time of our algorithms grows as the input size increases.
Function Orders • Informally, a function f(n) is O(g(n)) if f(n) grows no faster than g(n). • Formally, f(n) is O(g(n)) if there exist a number n0 and a positive constant c such that for all n ≥ n0, 0 ≤ f(n) ≤ c·g(n). • If lim n→∞ f(n)/g(n) exists and is finite, then f(n) is O(g(n)).
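As a small worked instance of the definition, take f(n) = 3n + 5 and g(n) = n. These functions and the constants c = 4, n0 = 5 are chosen here only for illustration; other choices work equally well.

```latex
\[
  0 \le 3n + 5 \le 4n \quad \text{for all } n \ge 5
  \;\Longrightarrow\; 3n + 5 = O(n) \quad (c = 4,\; n_0 = 5),
\]
\[
  \lim_{n \to \infty} \frac{3n + 5}{n} = 3 < \infty .
\]
```

Both the explicit constants and the limit test lead to the same conclusion.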
Implication of Big-oh notation • Big-oh notation: used here as an upper bound on the number of steps that an algorithm takes in the worst case. • Suppose we know that our algorithm uses O(f(n)) basic steps on any input of size n. Then, for all sufficiently large n, it terminates after executing at most a constant times f(n) basic steps. • We know that a basic step takes constant time on a machine. • Hence, our algorithm terminates within a constant times f(n) units of time, for all large n.
Algorithm Complexity • Thus the brute force string matching algorithm is O(mn), which is quadratic time when m is proportional to n. • A quadratic-time algorithm is usually fast enough for small problems, but not for big ones. • An exponential-time algorithm can only be fast enough for tiny problems.
Any improvement on brute force search? • Some of the comparisons made by brute force search are wasted work! • By being more clever, we can reduce the worst-case running time to O(n+m). • Example: Knuth-Morris-Pratt string matching.
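The idea behind Knuth-Morris-Pratt is to precompute, for each prefix of the pattern, how far the pattern can be shifted after a mismatch without re-reading text characters. The following is a compact sketch of the standard failure-function formulation; the names `build_failure` and `kmp_match` are chosen for this example, and positions are reported 1-based to match the earlier slides.

```python
def build_failure(pattern):
    """failure[j] is the length of the longest proper prefix of
    pattern[:j+1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = failure[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        failure[j] = k
    return failure


def kmp_match(text, pattern):
    """Return every 1-based position at which pattern occurs in text."""
    if not pattern:
        return []                      # assumption: empty pattern reports nothing
    failure = build_failure(pattern)
    positions = []
    k = 0                              # pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]         # fall back without moving i backwards
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2)  # 1-based start of this occurrence
            k = failure[k - 1]
    return positions


print(kmp_match("abracadabra", "abra"))  # [1, 8]
```

The text index i never moves backwards, which is where the O(n+m) bound comes from: O(m) to build the table and O(n) to scan the text.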
NP, NP-hard, NP-complete Problems • A problem is in the class NP if a proposed solution to it can be verified in polynomial time. • A problem is NP-hard if an algorithm for solving it can be translated, in polynomial time, into one for solving any problem in NP. • NP-hard therefore means "at least as hard as any problem in NP". • NP-complete: the problem is both in NP and NP-hard.
NP-Completeness • Unfortunately, for many problems, there is no known polynomial-time algorithm. • Even worse, many of these problems can be proven NP-complete, meaning that no such algorithm can exist unless P = NP. • In practice we fall back on heuristics and approximation algorithms.
Shortest Common Superstring • Input: A set S = {s1, s2, …, sm} of text strings on some alphabet Σ. • Output: The shortest possible string T such that each si is a substring of T. • This problem arises in DNA sequencing.
Shortest common superstring • The problem is NP-hard (its decision version is NP-complete). • Can you suggest an algorithm to find the shortest common superstring? • Greedy heuristic: approximates the optimal solution.
Greedy Heuristic • Always merge the two strings with the longest overlap. • Put the combined string back into the set. • Repeat until only one string remains. • GREEDY is conjectured to find a superstring of length at most twice the optimal; the guarantees proven so far are weaker constant factors.
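The greedy steps above can be sketched as follows. This is a naive illustration rather than an efficient implementation; the helper names `overlap` and `greedy_superstring` are hypothetical, and removing strings that are already contained in another string is a preprocessing assumption added here so that merging by overlap is enough.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(strings):
    """Merge the pair with the longest overlap until one string remains."""
    # Assumption: drop duplicates and strings contained in another string.
    unique = list(dict.fromkeys(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(pool) > 1:
        best_len, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        # Glue the two strings together, writing the shared part only once.
        merged = pool[best_i] + pool[best_j][best_len:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)            # put the combined string back
    return pool[0] if pool else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # abcde
```

Every input string stays inside some string in the pool after each merge, so the result is always a valid superstring, although not necessarily the shortest one.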
Time complexity of the greedy heuristic • We assume n strings, each of length k. • About n rounds (n-1 merges). • O(n^2) string comparisons per round in a naive implementation. • Each string comparison takes about k^2 steps with a naive overlap check.