Programming, Data Structures and Algorithms (Algorithm Analysis)

Programming, Data Structures and Algorithms (Algorithm Analysis) Anton Biasizzo

Algorithm analysis • Goal: design “good” data structures and algorithms • This requires precise ways of analyzing them • Measures of “goodness”: • Running time of the algorithm • Space usage of the data structure • In general, running time increases with input size • Running time also varies with: • Different inputs of the same size • Hardware environment • Software environment • We characterize running time as a function of input size

Experimental studies • Study running time by executing the algorithm on various test inputs. • Run several experiments on many different test inputs of various sizes. • Visualize the results of the experiments. • Perform statistical analysis on the gathered results. • Limitations of experimental studies: • Limited set of test inputs (inputs left out of the experiments might be important). • Dependent on the hardware and software environment. • The algorithm has to be fully implemented and executed in order to study it.
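
As an illustration of such an experiment, the following is a minimal sketch of a timing harness built on the standard `clock()` function. The workload `sum_array` and the helper name `time_sum_array` are assumptions made for this example, not part of the slides; any algorithm under study could take the workload's place.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical workload: sums the array; stands in for the algorithm under study. */
static long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Returns the average CPU time in seconds of one call, measured over `repeats` runs. */
double time_sum_array(const int *a, size_t n, int repeats)
{
    volatile long sink = 0;   /* keeps the compiler from discarding the work */
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        sink += sum_array(a, n);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}
```

Calling `time_sum_array` for a range of sizes n and plotting the results gives the kind of experimental curve described above. The measured values depend on the machine and compiler, which is exactly the limitation the slide points out.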

Analysis Tool • An analytical method allows us to avoid experiments. • A general way of analyzing the running time: • Takes all possible inputs into account • Allows us to evaluate the relative efficiency of any two algorithms • Is performed on a high-level description of the algorithm • Associate a function f(n) with the algorithm that characterizes its running time, where n is the input size (the number of input items).

Analysis Tool • Typical functions are • Constant function f(n) = 1 • Logarithm function f(n) = log(n) • Linear function f(n) = n • N-log-N function f(n) = n log(n) • Quadratic function f(n) = n^2 • Cubic and polynomial functions f(n) = n^3, f(n) = n^d • Exponential function f(n) = 2^n

Primitive operations • Define a set of primitive operations • A primitive operation corresponds to a low-level CPU instruction, or a set of instructions, with constant execution time • Overall execution time is determined by counting how many primitive operations are executed • Assumption: different primitive operations have fairly similar execution times

Primitive operations • Examples of primitive operations are • Assigning a value to a variable • Arithmetic operation (addition, multiplication, …) • Comparison • Following a pointer (object reference) • Indexing into an array • Calling a function (or method) • Returning from a function (or method)
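
The counting idea can be sketched on a small example. The function below finds the largest array element and tallies primitive operations as it goes. The function name `array_max_counted` and the exact tally convention (one unit per assignment, comparison, array index, and addition) are assumptions chosen for illustration; a different convention only changes the constants.

```c
#include <stddef.h>

/* Finds the largest element of a[0..n-1] (requires n >= 1) and stores the
   number of primitive operations performed in *ops.
   Tally convention (an assumption): one unit per assignment, comparison,
   array index, and addition. */
int array_max_counted(const int *a, size_t n, unsigned long *ops)
{
    unsigned long count = 0;
    int current_max = a[0];          count += 2;   /* index + assign */
    size_t i = 1;                    count += 1;   /* assign */
    while (1) {
        count += 1;                                /* comparison i < n */
        if (!(i < n))
            break;
        count += 2;                                /* index + comparison */
        if (a[i] > current_max) {
            current_max = a[i];      count += 2;   /* index + assign */
        }
        i++;                         count += 2;   /* add + assign */
    }
    *ops = count;
    return current_max;
}
```

Under this convention the count is 5n - 1 when the first element is already the largest and 7n - 3 when the array is strictly increasing. Both are linear in n, which is why the exact constants do not matter for the growth rate.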

Simplify the analysis • Each basic step in pseudo-code corresponds to a small number of primitive operations. • Estimate the number of primitive operations by counting pseudo-code steps, up to a constant factor. • Focus on the growth rate of the running time. • Asymptotic notation • The dominant term of the running time function determines the growth rate of the running time in terms of n. • Characterize the running time of algorithms by using functions that map the size of the input, n, to values that correspond to the dominant term of the running time function.

“Big Oh” notation • An asymptotic way of saying that a function f(n) grows no faster than a function g(n), up to a constant factor. • Let f(n) and g(n) be functions mapping nonnegative integers to real numbers. • f(n) is O(g(n)) if there are positive constants c and n0 such that f(n) ≤ c g(n) for all n ≥ n0.

Big Oh notation • Example: the function 8n - 2 is O(n), with c = 8 and n0 = 1, since 8n - 2 ≤ 8n for n ≥ 1. • Properties: • If f(n) is a polynomial of degree d, then f(n) is O(n^d). The highest-degree term determines the asymptotic growth rate. • a n + b log(n) + c is O(n). • a log(n) + c is O(log(n)).
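
The role of the constants c and n0 can be explored numerically. The sketch below checks a proposed pair (c, n0) over a finite range. The helper names `big_oh_witness_holds`, `f_example`, and `g_linear` are hypothetical. A finite check can refute a proposed pair, but it cannot prove the bound for all n.

```c
/* Returns 1 if f(n) <= c * g(n) for every n in [n0, n_max], otherwise 0.
   This only tests a finite range; it is not a proof of the O() bound. */
int big_oh_witness_holds(double (*f)(double), double (*g)(double),
                         double c, unsigned n0, unsigned n_max)
{
    for (unsigned n = n0; n <= n_max; n++)
        if (f((double)n) > c * g((double)n))
            return 0;
    return 1;
}

/* The example from the slide: f(n) = 8n - 2 and g(n) = n. */
double f_example(double n) { return 8.0 * n - 2.0; }
double g_linear(double n)  { return n; }
```

With c = 8 and n0 = 1 the check passes on any range, matching the example. With c = 7 it fails at n = 3, where 8n - 2 = 22 exceeds 7n = 21, so a smaller constant would need a different argument or would not work at all.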

“Big Omega” • An asymptotic way of saying that a function f(n) grows at least as fast as a function g(n), up to a constant factor. • Let f(n) and g(n) be functions mapping nonnegative integers to real numbers. • f(n) is Ω(g(n)) if there are positive constants c and n0 such that f(n) ≥ c g(n) for all n ≥ n0. • Example: the function 3n log n + 2n is Ω(n log n), with c = 3 and n0 = 1, since 3n log n + 2n ≥ 3n log n for n ≥ 1.

“Big Theta” • An asymptotic way of saying that two functions grow at the same rate, up to constant factors. • Let f(n) and g(n) be functions mapping nonnegative integers to real numbers. • f(n) is Θ(g(n)) if there are positive constants c’, c’’, and n0 such that c’ g(n) ≤ f(n) ≤ c’’ g(n) for all n ≥ n0. • Example: the function 3n log n + 4n is Θ(n log n), with c’ = 3, c’’ = 7, and n0 = 2, since 3n log n ≤ 3n log n + 4n ≤ (3 + 4) n log n for n ≥ 2 (taking log to base 2, so that log n ≥ 1).

Running time calculations • Maximum subsequence sum: given integers A_1, A_2, …, A_n, find the maximum value of A_i + A_(i+1) + … + A_j over all 1 ≤ i ≤ j ≤ n. • The maximum subsequence sum is 0 if all the integers are negative. • Brute-force algorithm (algorithm 1): • For every starting and ending position, sum the elements between them. • Starting position i goes from 0 to n-1. • Ending position j goes from i to n-1; it is never to the left of the starting position i.

Algorithm 1

    int max_sub_sum(int a[], int n)
    {
        int this_sum, max_sum, best_i, best_j, i, j, k;

        /*1*/ max_sum = 0; best_i = best_j = -1;
        /*2*/ for (i = 0; i < n; i++)
        /*3*/     for (j = i; j < n; j++) {
        /*4*/         this_sum = 0;
        /*5*/         for (k = i; k <= j; k++)
        /*6*/             this_sum += a[k];
        /*7*/         if (this_sum > max_sum) {
                          /* update max_sum, best_i, best_j */
        /*8*/             max_sum = this_sum; best_i = i; best_j = j;
                      }
                  }
        /*9*/ return max_sum;
    }

Algorithm 1 • Statements at lines {1} and {9} take O(1). • Loop at line {2} is of size n. • Loop at line {3} is of size n - i, which is at most n. • Statements {4}, {7}, and {8} take O(n^2) in total. • Statement {6} is the most critical, since it is inside the third nested loop and is executed n(n+1)(n+2)/6 times. • Algorithm 1 has O(n^3) growth rate.
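
The cubic count can be checked directly. The sketch below reproduces only the loop structure of Algorithm 1 and counts how often the innermost statement runs, then compares that with the closed form n(n+1)(n+2)/6. The function names `count_innermost` and `closed_form_cubic` are hypothetical, and the closed form is worked out here from the loop bounds rather than quoted from the slides.

```c
/* Counts how often the innermost statement of the triple loop
   (statement {6} in Algorithm 1) runs for input size n. */
unsigned long count_innermost(unsigned n)
{
    unsigned long count = 0;
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = i; j < n; j++)
            for (unsigned k = i; k <= j; k++)
                count++;
    return count;
}

/* Closed form n(n+1)(n+2)/6 for the same count. */
unsigned long closed_form_cubic(unsigned n)
{
    unsigned long m = n;
    return m * (m + 1) * (m + 2) / 6;
}
```

For n = 10 both functions give 220. The leading term of the closed form is n^3/6, which is where the O(n^3) growth rate comes from.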

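Algorithm 2 • Property: the sum A_i + … + A_j equals A_j plus the sum A_i + … + A_(j-1), so each sum can be obtained from the previous one in constant time. • The cubic running time can be avoided by removing the innermost loop.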

Algorithm 2

    int max_sub_sum(int a[], int n)
    {
        int this_sum, max_sum, best_i, best_j, i, j;

        /*1*/ max_sum = 0; best_i = best_j = -1;
        /*2*/ for (i = 0; i < n; i++) {
        /*3*/     this_sum = 0;
        /*4*/     for (j = i; j < n; j++) {
        /*5*/         this_sum += a[j];
        /*6*/         if (this_sum > max_sum) {
                          /* update max_sum, best_i, best_j */
        /*7*/             max_sum = this_sum; best_i = i; best_j = j;
                      }
                  }
              }
        /*8*/ return max_sum;
    }

Algorithm 2 • Statements {1} and {8} take O(1). • Loop at line {2} is of size n. • Statement {3} takes O(n) in total. • Statements {4}, {5}, {6}, and {7} are the most critical; the inner loop body is executed n(n+1)/2 times. • Algorithm 2 has O(n^2) growth rate.

Algorithm 3 • A divide and conquer strategy results in a recursive algorithm. • Split the problem into two roughly equal sub-problems (divide) and solve each recursively: • Pass the left and right borders of the sub-problem. • Base case: there is only one element (the left and right borders are the same). • Division stops. • If the element is positive, it is also the maximum subsequence sum. • If the element is negative, the maximum subsequence sum is 0 (empty subsequence).

Algorithm 3 • Each of the two halves yields the solution of its sub-problem. • The overall solution may also span the center. • Determine the maximum sum that crosses the center. • Patching together the solutions of the sub-problems (conquer): • Select the maximum of three subsequence sums: • The left part, • The right part, • The sum that spans both parts over the center.

Algorithm 3

    /* Largest of three integers. */
    static int max3(int a, int b, int c)
    {
        int m = (a > b) ? a : b;
        return (m > c) ? m : c;
    }

    int max_sub_sum(int a[], int left, int right)
    {
        int max_left_sum, max_right_sum, center, i;
        int max_center_sum, center_sum;

        /*1*/  if (left == right) return (a[left] > 0) ? a[left] : 0;
        /*2*/  center = (left + right) / 2;
        /*3*/  max_left_sum = max_sub_sum(a, left, center);
        /*4*/  max_right_sum = max_sub_sum(a, center + 1, right);
        /*5*/  max_center_sum = center_sum = 0;
               /* best sum that ends at the center, extending to the left */
        /*6*/  for (i = center; i >= left; i--) {
        /*7*/      center_sum += a[i];
        /*8*/      if (center_sum > max_center_sum) max_center_sum = center_sum;
               }
               /* extend the best left part across the center to the right */
        /*9*/  for (center_sum = max_center_sum, i = center + 1; i <= right; i++) {
        /*10*/     center_sum += a[i];
        /*11*/     if (center_sum > max_center_sum) max_center_sum = center_sum;
               }
        /*12*/ return max3(max_left_sum, max_right_sum, max_center_sum);
    }

Algorithm 3 running time analysis • Let T(n) be the time required to solve a problem of size n. • If n = 1, the algorithm takes a constant amount of time to execute line 1, which we shall call one unit (T(1) = 1). • Otherwise, the program must perform: • two recursive calls on lines 3 and 4 (T(n/2) each, assuming n is even), • two for loops between lines 6 and 11 (n pseudo steps in total), • a small amount of bookkeeping on lines 2, 5, and 12 (constant time C). • This gives T(n) = 2·T(n/2) + O(n) + C. • To simplify the calculation we replace O(n) with n and neglect C: T(n) = 2·T(n/2) + n

Algorithm 3 running time analysis • T(n) = 2·T(n/2) + n • Results for T(n) where n is one of the first few powers of 2: • T(1) = 1 = 1·1 • T(2) = 2·1 + 2 = 4 = 2·2 • T(4) = 2·4 + 4 = 12 = 4·3 • T(8) = 2·12 + 8 = 32 = 8·4 • T(16) = 2·32 + 16 = 80 = 16·5 • If n = 2^k, then T(n) = n·(k+1) = n log n + n (log to base 2). • Algorithm 3 has O(n log n) growth rate.
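
The table above can be reproduced mechanically. The sketch below evaluates the recurrence directly and compares it with the closed form n·(k+1) for n = 2^k. The function names `recurrence_T` and `closed_form_T` are hypothetical, and both functions assume n is a power of two.

```c
/* Evaluates T(1) = 1, T(n) = 2*T(n/2) + n, for n a power of two. */
unsigned long recurrence_T(unsigned long n)
{
    if (n <= 1)
        return 1;
    return 2 * recurrence_T(n / 2) + n;
}

/* Closed form n * (k + 1), where n = 2^k. */
unsigned long closed_form_T(unsigned long n)
{
    unsigned long k = 0;
    for (unsigned long m = n; m > 1; m /= 2)
        k++;                      /* k = log2(n) for a power of two */
    return n * (k + 1);
}
```

Both functions return 80 for n = 16, in agreement with the last row of the table.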

Algorithm 4 • Property: if the maximum subsequence starts at position i, then the sum of any subsequence ending at position i-1 cannot be positive; otherwise prepending it would give a larger sum. • Algorithm: • Accumulate the sum from the beginning and record the current maximum. • When the sum becomes smaller than 0, reset the sum and move the current starting location to the next position.

Algorithm 4 • Linear growth rate solution

    int max_sub_sum(int a[], int n)
    {
        int this_sum, max_sum, best_i, best_j, i, j;

        i = this_sum = max_sum = 0; best_i = best_j = -1;
        for (j = 0; j < n; j++) {
            this_sum += a[j];
            if (this_sum > max_sum) {
                /* update max_sum, best_i, best_j */
                max_sum = this_sum; best_i = i; best_j = j;
            }
            if (this_sum < 0) {
                /* a negative running sum cannot start the best subsequence */
                i = j + 1;
                this_sum = 0;
            }
        }
        return max_sum;
    }

Algorithm 4 running time analysis • Algorithm 4 has only a single for loop with n iterations. • Algorithm 4 has O(n) growth rate (linear). • It makes only one pass through the data, and once a[j] is processed, it does not need to be remembered. • There is no need to store the elements in an array. • At any point, the algorithm can give an answer to the subsequence problem for the data already processed. • Such algorithms are called on-line algorithms. • An on-line algorithm that requires only constant space and runs in linear time is about as good as a solution can be.
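
The on-line property can be made explicit by dropping the array altogether. The following is a minimal sketch, assuming the items arrive one at a time; the names `max_sub_state`, `max_sub_init`, and `max_sub_feed` are hypothetical and do not appear in the slides. It applies the same update rule as Algorithm 4 but keeps only two running values.

```c
/* Running state of the on-line computation: constant space, no input array. */
struct max_sub_state {
    long this_sum;  /* best sum of a subsequence ending at the latest item */
    long max_sum;   /* best sum seen so far (0 for the empty subsequence) */
};

void max_sub_init(struct max_sub_state *s)
{
    s->this_sum = 0;
    s->max_sum = 0;
}

/* Consumes one item and returns the answer for everything read so far. */
long max_sub_feed(struct max_sub_state *s, int x)
{
    s->this_sum += x;
    if (s->this_sum > s->max_sum)
        s->max_sum = s->this_sum;
    if (s->this_sum < 0)
        s->this_sum = 0;          /* a negative prefix is never worth keeping */
    return s->max_sum;
}
```

Feeding the items -2, 11, -4, 13, -5, -2 one by one returns 0, 11, 11, 20, 20, 20, so a valid answer is available after every item.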

Algorithm comparison • Plot: running times of the developed algorithms for small input sizes

Algorithm comparison • Plot: performance of the algorithms for larger data sets

Logarithmic order of growth • Most divide and conquer algorithms run in O(n log n) time. • An algorithm is O(log n) if it takes constant time (O(1)) to cut the problem size by a constant fraction (typically ½). • If constant time is required to reduce the problem by a constant amount (e.g. by 1), then the algorithm is O(n). • If the input is a list of n numbers, a program takes Ω(n) time just to read them; when we talk about O(log n) algorithms, we presume that the input has already been read. • Examples of algorithms with logarithmic behaviour: • Binary search, • Euclid’s Algorithm, • Exponentiation.
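
Of the three examples, exponentiation is the only one not shown in code in these slides. A minimal sketch of one common approach, repeated squaring, is given below. The function name `int_power` is hypothetical, and overflow is not handled: unsigned arithmetic simply wraps around.

```c
/* Computes x^n by repeated squaring. Each iteration halves n, so the loop
   runs O(log n) times. Overflow is not detected: unsigned values wrap. */
unsigned long long int_power(unsigned long long x, unsigned n)
{
    unsigned long long result = 1;
    while (n > 0) {
        if (n & 1u)          /* lowest bit set: this power of x contributes */
            result *= x;
        x *= x;              /* x, x^2, x^4, x^8, ... */
        n >>= 1;             /* constant work to halve the problem size */
    }
    return result;
}
```

For example, `int_power(2, 10)` returns 1024 after four loop iterations, where the naive method would use ten multiplications.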

Binary search • In a pre-sorted array a, find i such that a[i] = x; otherwise return NOT_FOUND (-1, since 0 is a valid index).

    #define NOT_FOUND (-1)

    int binary_search(int a[], int x, int n)
    {
        int low, mid, high;

        low = 0; high = n - 1;
        while (low <= high) {
            mid = low + (high - low) / 2;   /* avoids overflow of low + high */
            if (a[mid] < x)
                low = mid + 1;
            else if (a[mid] > x)
                high = mid - 1;
            else
                return mid;                 /* found */
        }
        return NOT_FOUND;
    }
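
To see the logarithmic behaviour, the loop iterations can be counted. The sketch below is a variant of the search that reports the number of iterations. The name `binary_search_counted`, the `steps` parameter, and the use of -1 as the not-found value are assumptions of this example.

```c
/* Binary search that also reports the number of loop iterations in *steps.
   Returns the index of x in the sorted array a[0..n-1], or -1 if absent. */
int binary_search_counted(const int a[], int x, int n, int *steps)
{
    int low = 0, high = n - 1;
    *steps = 0;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        (*steps)++;
        if (a[mid] < x)
            low = mid + 1;          /* discard the lower half */
        else if (a[mid] > x)
            high = mid - 1;         /* discard the upper half */
        else
            return mid;
    }
    return -1;
}
```

Each iteration at least halves the remaining range, so for n = 1024 the loop runs at most 11 times, whether or not x is present.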

Euclid’s Algorithm • Compute the greatest common divisor (GCD). • Euclid’s Algorithm: • GCD(a, 0) = a • GCD(a, b) = GCD(b, a mod b)

    unsigned int gcd(unsigned int m, unsigned int n)
    {
        unsigned int rem;

        while (n > 0) {
            rem = m % n;
            m = n;
            n = rem;
        }
        return m;
    }
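
The number of iterations can be observed with a small variant of the loop. In the sketch below, the name `gcd_counted` and the `steps` parameter are assumptions of this example. Consecutive Fibonacci numbers such as 89 and 55 are a well-known slow case for the algorithm.

```c
/* Euclid's algorithm that also reports the number of loop iterations. */
unsigned int gcd_counted(unsigned int m, unsigned int n, unsigned int *steps)
{
    *steps = 0;
    while (n > 0) {
        unsigned int rem = m % n;
        m = n;
        n = rem;
        (*steps)++;
    }
    return m;
}
```

For m = 89 and n = 55 the loop runs 9 times and returns 1; for m = 1989 and n = 1590 it returns 3. In both cases the count stays far below the size of the inputs, consistent with logarithmic behaviour.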

Input variations • Algorithm running time for the same input data size may vary depending on input data values • Average case running time analysis: • Requires probability distribution of set of inputs • Difficult probability theory • Worst case running time analysis • Determine upper bound of the running time • If algorithm performs well in worst case it performs well on every input

Checking the analysis • It is desirable to verify that the estimate is correct and as tight as possible. • One way is to code the program and see whether the empirically observed running time matches the estimate. • When n doubles, the running time goes up by a factor of 2 if the program is linear, by a factor of 4 if the program is quadratic, and so on. • If the running time grows only by an additive constant when n doubles, then the program is logarithmic. • If the running time takes slightly more than twice as long when n doubles, then the program is log-linear (O(n log n)). These increases are hard to spot if n is not large enough. • Sometimes the analysis is shown empirically to be an over-estimate: • The analysis needs to be tightened, or • The average running time might be significantly smaller than the worst-case running time.
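
The doubling rule can be sketched as a small helper. To keep the example deterministic, the cost functions below are idealized models (hypothetical names `linear_cost` and `quadratic_cost`) standing in for measured running times; in a real check, `work` would return a timing obtained from an experiment.

```c
/* Ratio work(2n) / work(n), where `work` returns a cost measure for size n. */
double doubling_ratio(double (*work)(unsigned), unsigned n)
{
    return work(2 * n) / work(n);
}

/* Idealized cost models standing in for measured running times. */
double linear_cost(unsigned n)    { return (double)n; }
double quadratic_cost(unsigned n) { return (double)n * (double)n; }
```

The ratio is 2 for the linear model and 4 for the quadratic one. With real measurements the ratios are noisy, so they should be taken at several large values of n before drawing a conclusion.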
