Data Structures and Algorithms
Outline • Analyzing algorithms • Designing algorithms • Profiling • Heuristics • Example: hash-based sequence alignment
Insertion Sort Pseudocode (review conventions: arrays, indentation, loops, logic, etc.)
“for” loop convention: iterative or counting
“while” loop convention: execute the body while the expression is true

Input: array A[0..n-1]
Insertion-Sort(A)
1  for j = 1 to length[A]-1
2      key = A[j]
3      // insert A[j] into the sorted sequence A[0..j-1]
4      i = j-1
5      while i > -1 and A[i] > key
6          A[i+1] = A[i]
7          i = i-1
8      A[i+1] = key

Cormen, Intro to Algs.
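The pseudocode above translates almost line for line into a runnable sketch. Here it is in Python (0-indexed, matching the A[0..n-1] convention):

```python
def insertion_sort(a):
    """Sort the list a in place, mirroring the pseudocode above."""
    for j in range(1, len(a)):          # 1  for j = 1 to length[A]-1
        key = a[j]                      # 2  key = A[j]
        # insert a[j] into the sorted sequence a[0..j-1]
        i = j - 1                       # 4  i = j-1
        while i > -1 and a[i] > key:    # 5  shift larger elements right
            a[i + 1] = a[i]             # 6
            i = i - 1                   # 7
        a[i + 1] = key                  # 8  drop key into its slot
    return a
```

Everything to the left of `key` is already sorted when each pass begins, exactly as the slide describes.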
Insertion Sort
1) Design algorithm (as opposed to Bubble Sort)
2) Implement algorithm
Left of key is sorted; right of key is unsorted.
1  for j = 1 to length[A]-1
2      key = A[j]
3      // insert A[j] into the sorted sequence A[0..j-1]
4      i = j-1
5      while i > -1 and A[i] > key
6          A[i+1] = A[i]
7          i = i-1
8      A[i+1] = key
Analyzing Algorithms • predicting resources that an algorithm requires • memory • communication bandwidth • logic gates • computational time (most often measured) • In other words, how many “steps” does Insertion Sort take to complete???
Analyzing Insertion Sort • time taken by Insertion Sort depends on input: 1000 takes longer than 3 numbers • can take different amounts of time to sort 2 input sequences of the same size -- Why? • in general, the time taken by an algorithm grows with the size of the input • describe the running time of a program as function of the size of input
Analyzing Insertion Sort • running time • function of number of steps executed • assume a constant amount of time is required to execute each line of pseudocode
Analyzing Insertion Sort
Insertion-Sort(A)                                     cost   times
1  for j = 1 to length[A]-1                           c1     n
2      key = A[j]                                     c2     n-1
3      // insert A[j] into the sorted A[0..j-1]       0      n-1
4      i = j-1                                        c4     n-1
5      while i > -1 and A[i] > key                    c5     Σ_{j=1}^{n-1} t_j
6          A[i+1] = A[i]                              c6     Σ_{j=1}^{n-1} (t_j - 1)
7          i = i-1                                    c7     Σ_{j=1}^{n-1} (t_j - 1)
8      A[i+1] = key                                   c8     n-1

Worst case: t_j = j+1, so Σ_{j=1}^{n-1} t_j = n(n+1)/2 - 1 and Σ_{j=1}^{n-1} (t_j - 1) = n(n-1)/2.
Algorithms, Cormen
Analyzing Insertion Sort
T(n) = c1·n + c2(n-1) + c4(n-1) + c5(n(n+1)/2 - 1) + c6(n(n-1)/2) + c7(n(n-1)/2) + c8(n-1)
     = (c5/2 + c6/2 + c7/2)n² + (c1 + c2 + c4 + c5/2 - c6/2 - c7/2 + c8)n - (c2 + c4 + c5 + c8)
     = k1·n² + k2·n - k3
• Θ(n²) = asymptotic upper and lower (tight) bound
• O(n²) = asymptotic upper bound
• Typical complexities (O(n) is linear time):
  O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(n³) < O(2ⁿ)
Insertion Sort Observations
• numbers are sorted “in place”
  • numbers are rearranged within the array, with at most a constant number of them stored outside of the array at any time
• run time depends on input
  • the number of operations for the following 3 inputs will vary greatly due to their level of “pre-sortedness”:
    3 5 6 7 9 8 10 15 20 30 69
    3 5 6 7 9 8 10 15 1 12 20
    20 12 15 10 9 8 7 6 5 3 1
  • a descending sorted order will actually take more operations than a random ordering
• algorithm complexity analysis allows us to place upper and lower asymptotic bounds for comparison
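The effect of pre-sortedness can be measured directly. This Python sketch (an illustration, not course code) counts how many key comparisons insertion sort performs on a given input:

```python
def insertion_sort_comparisons(a):
    """Return the number of A[i] > key comparisons insertion sort
    makes while sorting a copy of the input."""
    a = list(a)
    comparisons = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i > -1:
            comparisons += 1          # one A[i] > key test
            if a[i] > key:
                a[i + 1] = a[i]       # shift and keep scanning left
                i -= 1
            else:
                break                 # found the insertion point
        a[i + 1] = key
    return comparisons
```

An already-sorted input of n elements needs only n-1 comparisons (best case, linear), while a descending input needs n(n-1)/2 (worst case, quadratic), matching the analysis above.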
Insertion Sort Profile
A = 3 2 1 (trace of the first pass, j = 1)

Line | j | key | i  | A[i]  | A[i+1] | Result
1    | 1 |     |    |       |        |
2    | 1 | 2   |    |       |        |
3    | 1 | 2   | 0  | 3     | 2      |
4    | 1 | 2   | 0  |       |        | 0 > -1 and 3 > 2 → true
5    | 1 | 2   | 0  | 3     | 3      | A = 3 3 1
6    | 1 | 2   | -1 | undef | 3      |
4    | 1 | 2   | -1 |       |        | -1 > -1 and undef > 2 → false
7    | 1 | 2   | -1 | undef | 2      | A = 2 3 1

1  for j = 1 to length[A]-1
2      key = A[j]
3      i = j-1
4      while i > -1 and A[i] > key
5          A[i+1] = A[i]
6          i = i-1
7      A[i+1] = key
What about naïve.pl?                                   cost
snt[] = array of subject nucleotides                   c1
qnt[] = array of query nucleotides                     c2
for i = 0 to length(subject) - length(query)           c3 · n
    j = 0                                              c4 · n
    while (snt[i+j] == qnt[j])                         c5 · n · n
        j = j + 1                                      c6 · n · n
    if (j == length(query))                            c7 · n · n
        found sequence at position i                   c8 · n · n
end

Total: O(n²)
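A Python sketch of the same naive scan (with an explicit bounds check added so the inner loop cannot read past the end of the query; the pseudocode above has no such guard, which is worth keeping in mind):

```python
def naive_find(subject, query):
    """Return all start positions where query occurs in subject.
    Naive O(n*m) scan: try every alignment, compare character by character."""
    hits = []
    for i in range(len(subject) - len(query) + 1):
        j = 0
        while j < len(query) and subject[i + j] == query[j]:
            j += 1
        if j == len(query):          # matched the whole query
            hits.append(i)           # found sequence at position i
    return hits
```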
Definitions
procedural programming languages – tend to be action oriented (as opposed to Object-Oriented Programming – OOP)
• subroutine: a collection of high-level programming language operations
• procedure: a subroutine that does not return a value (Pascal terminology)
• function: a subroutine that returns a value (Pascal terminology)
Machine Instructions: At the lowest level, every program consists of primitive machine instructions.
    move.L D0, 20004
Language Statements: High-level languages consist of statements, each performing one or more machine instructions.
    $i = $k + 9;
Subroutines: Subroutines consist of groups of language statements.
    $sequence = &print_formated_sequence(@qnts, $i);
Programs: Programs consist of groups of subroutines.
C: a s/w engineering approach, Darnell
Subroutines • programs are developed with layers of functions • lower-level functions perform simple operations • higher-level functions are created from lower-level functions • analogous to abbreviations for long and complicated sets of commands • defined once, but invoked many times • ease of change • modular and re-usable • enhanced reliability (complicated tasks broken into simpler ones) • improved readability • with low-level details of algorithm compartmentalized, an algorithm may be easier to read, understand, and modify • good rule of thumb – if your subroutine spans more than 1 printed page, I would expect at least 1 bug
Bioinformatics example • Optimal sequence alignment (allowing for gaps and substitutions in either query or subject sequence)
Heuristics
• What do you do when faced with an NP-complete problem, or a problem size where the algorithm takes too long?
• Example – comparing 2 genomes (brute force)
  • naïve.pl is O(n²)
  • 3×10⁹ × 3×10⁹ comparisons × 1×10⁻⁹ s per comparison = 9×10⁹ s ≈ 285 years
• Alternative
  • hash k-tuples of nucleotides to a number, and compare numbers
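The back-of-the-envelope estimate above is easy to check (the 1 ns per comparison figure is the slide's assumption):

```python
comparisons = 3e9 * 3e9              # every subject position vs. every query position
seconds = comparisons * 1e-9         # assume one comparison per nanosecond
years = seconds / (3600 * 24 * 365)  # convert seconds to years
print(round(years))                  # roughly 285 years
```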
Hash-Based Alignment
Base-10 numbers: 5805 = 5·10³ + 8·10² + 0·10¹ + 5·10⁰
k = 8, sequence ATGCCTGGGCT, with A=0, C=1, G=2, T=3 (a base-4 number):
ATGCCTGG = 0·4⁷ + 3·4⁶ + 2·4⁵ + 1·4⁴ + 1·4³ + 3·4² + 2·4¹ + 2·4⁰ = 14714
Now we can compare “chunks” of sequence much faster – a speed increase by a factor of 8.
We can pre-compute hashes for the entire genome and compare only the hashes.
This is the premise for popular alignment tools – BLAST, BLAT, and UIcluster.
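The k-tuple hash is easy to verify in code. A Python sketch (the function name is illustrative):

```python
def hash_kmer(kmer):
    """Base-4 hash of a nucleotide k-tuple: A=0, C=1, G=2, T=3,
    most significant digit first, exactly as in the slide."""
    digits = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for nt in kmer:
        value = value * 4 + digits[nt]   # shift left one base-4 digit, add next
    return value

print(hash_kmer("ATGCCTGG"))             # 14714, matching the expansion above
```

One integer comparison now replaces 8 character comparisons, which is where the factor-of-8 speedup comes from.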
Heuristics • Usually a trade-off • In the sequence-hashing example • accuracy is traded for speed • you cannot match/find sequences shorter than 8 nucleotides • How do you find the optimal k-tuple size? • depends on the question • empirically
Overly simple example of compartmentalizing
Count the number of nucleotides in a file.

open file
while there is more sequence
    read a nucleotide
    increment count
print nt count
close file
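The pseudocode maps almost line for line onto a Python sketch (any plain-text sequence file will do; here every non-whitespace character is counted as a nucleotide):

```python
def count_nucleotides(path):
    """Count the nucleotide characters in a plain-text sequence file."""
    count = 0
    with open(path) as fh:            # open file
        for line in fh:               # while there is more sequence
            for ch in line:           # read a nucleotide
                if not ch.isspace():
                    count += 1        # increment count
    print(count)                      # print nt count
    return count                      # file is closed automatically on exit
```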
Another Example (divide and conquer) • Find the average intron size for all human genes • Get human genome • Get genes • Find indices of exons/introns • Size = index2 - index1 • Tabulate and average
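Step 4 of the divide-and-conquer plan above might look like this, assuming exon coordinates for one gene have already been extracted as ordered (start, end) index pairs (a hypothetical representation chosen for illustration, not a real annotation format):

```python
def average_intron_size(exons):
    """exons: ordered list of (start, end) index pairs for one gene.
    Each intron is the gap between consecutive exons: size = index2 - index1."""
    introns = [exons[k + 1][0] - exons[k][1] for k in range(len(exons) - 1)]
    return sum(introns) / len(introns)   # tabulate and average
```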
Recursion
• recursion – something that partially consists of, or is defined in terms of, itself
• examples
  • mirrors facing each other
  • a video camera pointed at the television showing its own output
  • the factorial function for non-negative integers:
    a) 0! = 1
    b) if n > 0, then n! = n·(n-1)!
    3! = 3·(2!) = 3·(2·(1!)) = 3·(2·(1·(0!))) = 3·(2·(1·(1))) = 6
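The two-case definition translates directly into a recursive function (Python sketch; the Perl version appears on a later slide):

```python
def factorial(n):
    """n! for non-negative integers, straight from the definition."""
    if n == 0:
        return 1                      # case a): 0! = 1
    return n * factorial(n - 1)       # case b): n! = n * (n-1)!
```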
Recursion • power is in ability to define an infinite set of objects by a finite statement • tool for expressing a program recursively is the subroutine (procedure/function) • directly recursive: subroutine P contains reference to itself • indirectly recursive: P contains reference to another subroutine Q, which contains a (direct or indirect) reference to P
#!/usr/bin/perl
#
# simple example perl program to calculate
# the factorial of a number using (gasp) "Recursion"
#
use strict;
use warnings;

print "Enter integer number to determine factorial:";
my $i = <STDIN>;   # get number
chomp($i);         # remove "newline"
$i = int $i;       # removes any decimals
if ($i < 0) {
    die("Error: input must be positive integer");
}
my $j = &Fact($i);
print "($i)!=";
print "$j\n";
# end of program

######################
sub Fact {
    my $num = shift;
    # How can $num be N, and then N-1, then N-2, etc.????
    print "num = $num\n";
    if ($num <= 1) {   # base case: 0! = 1! = 1 (guards against infinite recursion on 0)
        return 1;
    }
    return $num * &Fact($num - 1);
}
Program Iterations or Profile (n = 5)
[tabraun@texas fact]$ ./fact-test.pl
Enter integer number to determine factorial:5
num = 5
num = 4
num = 3
num = 2
num = 1
(5)!=120
Recursion • Cut and paste (*) here • Cut and paste (Cut and paste (*) here) here • Cut and paste (Cut and paste (Cut and paste (*) here) here) here • … Etc.
Variable Scope
• global variables: variables that are accessible/visible from any part of a program
• local variables: accessible from only a limited portion of the program
  • ensures that variables are not unintentionally manipulated
• perl
  • variables are always global unless you specify otherwise
  • my $variable_name declares a “local” (lexically scoped) variable
• scope usually refers to “blocks of code”
  • for loop
  • while loop from insertion sort
  • example – scope.pl
#!/usr/bin/perl
$i = 5;
print "i=$i\n";
{
    print "i=$i\n";
    $i = 3;
    print "i=$i\n";
}
print "i=$i\n";
Can we do better than Insertion Sort O(n²)?

Merge-Sort(A,p,r)
1  if p < r
2      q = ⌊(p+r)/2⌋
3      Merge-Sort(A,p,q)
4      Merge-Sort(A,q+1,r)
5      Merge(A,p,q,r)
6  return

Divide and conquer example – a sorted array of length 1 is already sorted.
Merge-Sort Split Steps

            5 2 4 6 1 3 2 6
           /               \
       5 2 4 6           1 3 2 6
       /     \           /     \
     5 2     4 6       1 3     2 6
     / \     / \       / \     / \
    5   2   4   6     1   3   2   6

Merging back up: 2 5 | 4 6 | 1 3 | 2 6 → 2 4 5 6 | 1 2 3 6 → 1 2 2 3 4 5 6 6

Merge-sort changes the problem from one of sorting numbers to one of simply combining stacks of numbers that are already sorted. At the leaf level of this tree, individual numbers are already sorted (because a single number by itself is sorted).
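The split-then-merge picture as a Python sketch (this version returns a sorted copy rather than sorting in place like the Merge-Sort(A,p,r) pseudocode):

```python
def merge_sort(a):
    """Divide and conquer: split in half, sort each half, merge."""
    if len(a) <= 1:
        return list(a)                # a length-1 (or empty) array is already sorted
    mid = len(a) // 2                 # q = floor((p+r)/2)
    left = merge_sort(a[:mid])
    right = merge_sort(a[mid:])
    # merge two already-sorted halves
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])           # one of these is empty
    merged.extend(right[j:])
    return merged
```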
Merge-Sort Analysis: T(n) = 2T(n/2) + Θ(n), which solves to O(n log₂ n) – can we do better??? (Not with comparison-based sorts: Ω(n log n) is a lower bound.)
Algorithmic Concepts • Greedy algorithms • always makes the choice that looks “best” at the moment. • makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution • do not always yield optimal solutions
Knapsack Problem • A thief finds n items • item i is worth vi dollars and weighs wi pounds, where vi and wi are integers • the thief wants to maximize the value carried, but is limited to W pounds • Which items should the thief take?
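A greedy choice by value-per-pound is NOT optimal for this 0/1 knapsack problem, but it is optimal for the fractional variant, where the thief may take part of an item. That variant makes a clean greedy sketch (Python; an illustration, not course code):

```python
def fractional_knapsack(items, capacity):
    """items: list of (value, weight) pairs; capacity: weight limit W.
    Greedy by value/weight ratio -- optimal for the FRACTIONAL knapsack,
    but can fail for the 0/1 version described on the slide."""
    total = 0.0
    # take items in decreasing order of dollars per pound
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        take = min(weight, capacity)          # take all of it, or what still fits
        total += value * (take / weight)
        capacity -= take
        if capacity == 0:
            break
    return total
```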
NP-Completeness and NP-Hard • polynomial-time algorithms • naïve, insertion-sort, merge-sort, fact • O(nᵏ) for some constant k • Can all problems be solved in polynomial time? • apparently not • one such class of problems is called “NP-Complete” • no polynomial-time algorithm is known for these problems; they are believed to be intractable • valuable to know when a problem is NP-Complete so that you do not waste time attempting to develop an efficient exact solution • the usual approach is to look for an approximation of the solution
NP-Complete example • Traveling-salesman problem • a salesman must visit N cities • wants to visit every city exactly once and return to the start • wants to minimize total travel distance
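For tiny instances the brute-force approach is easy to write, and it shows why the problem blows up: the number of tours from a fixed start city grows as (N-1)!. A Python sketch:

```python
from itertools import permutations

def tsp_brute_force(dist):
    """dist: symmetric N x N matrix of inter-city distances.
    Tries every tour starting and ending at city 0 -- (N-1)! tours,
    so this is only feasible for very small N."""
    n = len(dist)
    best = float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(dist[tour[k]][tour[k + 1]] for k in range(n))
        best = min(best, length)
    return best
```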
Dynamic Programming • like divide-and-conquer, DP solves problems by combining the solutions of subproblems (“programming” refers to a tabular method, NOT writing computer code) • divide-and-conquer generally has independent subproblems • DP is most applicable when subproblems are NOT independent – i.e., divide-and-conquer would do more work than necessary, repeatedly solving the common subsubproblems • DP solves each subsubproblem once and saves the result in a table • work is avoided since the answer does not have to be recomputed every time the subsubproblem is encountered • Example • sequence alignment – Needleman-Wunsch (global), Smith-Waterman (local)
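A small concrete DP example in the same spirit as sequence alignment is edit distance: each table cell is filled exactly once from its three neighbors, instead of re-solving the same subsubproblems over and over as a naive recursion would (Python sketch of the standard Levenshtein recurrence):

```python
def edit_distance(s, t):
    """Minimum number of unit-cost insertions, deletions, and
    substitutions needed to turn string s into string t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                   # delete all of s[0..i-1]
    for j in range(n + 1):
        d[0][j] = j                   # insert all of t[0..j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n]
```

Alignment algorithms like Needleman-Wunsch fill an (m+1)×(n+1) table in exactly this way, with match/mismatch scores and gap penalties in place of the unit costs.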
Assignment: Debugging naïve.pl
Due:
• a bug exists in the algorithm
• find an input scenario where the algorithm breaks
• Assignment 1
  • Obtain naïve.pl (web)
  • Execute it (with your perl)
  • Alter the input to determine the bug
  • Submit a version of the program with the 2 input lines that illustrate the bug
    @snt = (A, A, …. );
    @qnt = (A, T, ….);
  • Describe a solution by inserting comments into the program
  • Submit your altered and commented program to icon