Algorithms and Data Structures

Outline
• Data Structures
• Space Complexity
• Case Study: string matching
  • Array implementation (e.g. KMP algorithm)
  • Tree implementation (e.g. suffix tree)
/course/eleg67701-f/Topic-1b

Algorithm in action: data structure transformation
[Diagram: Input data structure → Algorithm (with an intermediate data structure) → Output data structure]

Basic Data Structures
• Scalar or “Atomic” data structures
  • Building blocks for other data structures
  • Cannot be divided into sub-elements
  • Integer, floating-point, character, access (pointer) types
• Composite data structures
  • Arrays, records
• Data Abstraction
  • Abstract Data Types: a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree, etc.

Scalar Data Structure
Conceptual View: a variable name (var1) bound to a value.
Physical Layout in the Computer Memory: the value occupies one cell among memory addresses 0238-0245.
Assignment operation: var1 ← value; var2 ← var1; var1 ← var3;

Composite Data Structure: Array
Conceptual View: variable name A, an array of five elements v1 v2 v3 v4 v5 at indices 0 1 2 3 4.
Physical Layout in the Computer Memory: cells holding v1, v2, v3, v4, v5, nil among memory addresses 0238-0245.
Accessing array elements: A[0] ← 5; k ← 1; A[k] ← 11; A[k+1] ← A[k] + 3

Data Abstraction: Tree
Conceptual View: a tree T whose nodes hold the values v1, v2, v3, v4.
Physical Layout in the Computer Memory: each node stores its value together with the memory addresses of its children (e.g. 0241, 0244, 0247), or nil where there is no child.
Accessing the elements: T.value ← 12; T.left ← new(T); T.right ← new(T)
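A minimal Python sketch of the three access operations on this slide. The `Node` class is a hypothetical stand-in (it is not from the slides): `None` plays the role of nil, and calling `Node()` plays the role of new(T).

```python
class Node:
    """One tree node: a value plus references to a left and a right child."""

    def __init__(self, value=None):
        self.value = value
        self.left = None    # nil: no left child yet
        self.right = None   # nil: no right child yet


T = Node()
T.value = 12        # T.value <- 12
T.left = Node()     # T.left  <- new(T)
T.right = Node()    # T.right <- new(T)
```

Each reference corresponds to a stored memory address in the physical layout; an absent child is simply `None`.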

Space Analysis
• Storage space, like time, is another limited resource that is important to programmers
• Space requirements are also expressed as a function of the input size
• Space functions are classified in the same manner as running times

Complexity Analysis: Sorting
Algorithm       Time-Complexity             Space-Complexity
Insertionsort   O(n^2)                      O(n)
Quicksort       O(n log n) (average case)   O(n)

Space-Time Tradeoff
• Reductions in running time are often possible if we increase storage requirements
• Decreasing the amount of storage used by an algorithm usually results in longer running times
• Using an array to look up previously computed values can drastically increase the speed of a function
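As an illustration of the last point, here is a small sketch using Fibonacci numbers. The example and the names `fib_naive` and `fib_table` are assumptions chosen for this note, not taken from the slides. The table version spends O(n) extra space to avoid recomputing values, while the naive version uses no table but takes exponential time.

```python
def fib_naive(n):
    """No lookup table: the same subproblems are recomputed many times."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


def fib_table(n):
    """O(n) extra space: each value is computed once and then looked up."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table[n]


print(fib_table(30))  # 832040
```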

Case Study: Searching for Patterns
Problem: find the first occurrence of pattern P of length m inside the text S of length n.
→ String matching problem

String Matching - Applications
• Text editing
• Term rewriting
• Lexical analysis
• Information retrieval
• Bioinformatics

Model for Pattern-Matching Problem
[Diagram: Pattern P → Pattern Matcher generator → Pattern Matcher; Input string S → Pattern Matcher → Yes / No]

Array Implementation
Text S represented as an array of characters: S[1..n]
Pattern P represented as an array of characters: P[1..m]
[Figure: pattern P slid one position at a time along the text S = a g c a g a a g a g t a]
Time complexity = O(m·n)
Space complexity = O(m + n)
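The O(m·n) bound corresponds to the straightforward matcher that retries the pattern at every text position. A minimal sketch follows. The function name `naive_find` and the pattern `agagta` are assumptions made for illustration, and only the text string is taken from the slide.

```python
def naive_find(p, s):
    """Return the 1-based position of the first occurrence of p in s, or 0.

    Worst case: about m character comparisons at each of n - m + 1 starts.
    """
    m, n = len(p), len(s)
    for start in range(n - m + 1):
        k = 0
        # Compare the pattern against the text character by character.
        while k < m and s[start + k] == p[k]:
            k += 1
        if k == m:            # all m characters matched
            return start + 1  # report in the slides' 1-based convention
    return 0


print(naive_find("agagta", "agcagaagagta"))  # 7
```

Only the two input arrays are stored, which gives the O(m + n) space bound.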

Can we be more clever?
• When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.
• We try to take advantage of this to decide where to restart matching.
[Figure: pattern P realigned against the text S = a g c a g a a g a g t a]

Problem of Matching Keyword
PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s = xpy for some x and y; “no” otherwise. For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.

The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position).
What to do: construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process.
Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350

The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.

Note that the inner while loop iterates as long as p_i and s_j do not match. Once they match (or i drops to 0), the inner while loop terminates, both i and j advance by one, and the process repeats.
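The matching loop that this note describes is not reproduced in the text, so here is a hedged Python sketch of the usual formulation. The function name `kmp_match` is an assumption, and the next function h is taken as an already computed input. The usage example borrows the nine-character pattern a b a a b a b a a and its next-function values 0 1 0 2 1 0 4 0 2 from a later slide.

```python
def kmp_match(p, s, h):
    """Return True if pattern p occurs as a substring of text s.

    h is the next function as a list with h[i] valid for 1 <= i <= len(p);
    h[0] is unused padding so that indices follow the slides' 1-based style.
    """
    m, n = len(p), len(s)
    i, j = 1, 1
    while i <= m and j <= n:
        # Inner loop: runs as long as p_i and s_j do not match.
        while i > 0 and p[i - 1] != s[j - 1]:
            i = h[i]   # slide the pattern; j is never moved backwards
        # Here p_i == s_j, or i == 0: advance both indices by one.
        i += 1
        j += 1
    return i > m


p = "abaababaa"
h = [None, 0, 1, 0, 2, 1, 0, 4, 0, 2]    # h[1..9] from the slides
print(kmp_match(p, "abaabaababaab", h))  # True
```

Only i is ever reset, which is why the text index j advances monotonically.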

An Important Property of the Next Function in KMP Algorithm
h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0.

Backtrack or Not Backtrack?
Assume P(i) ≠ S(j) for some i and j. What should we do?
• The KMP algorithm chooses not to backtrack on the text S (i.e. j never decreases), for a good reason
• The choice is how to shift the pattern P (i.e. how to reset i), that is, by how much
• If, for each j, the work spent shifting P is bounded by a small constant on average, then the total time complexity is linear in n

An Example
Given: [pattern figure not recovered]
Next function: 0 1 0 2 1 0 4 0 2 1 0 7 1
Input string: [figure not recovered]
Scenario 1: mismatch at i = 12, j = 12. What is h_i = h_12? h_12 = 7.
Scenario 2: h_12 = 7, so i = 7.

An Example (Cont'd)
Scenario 3: h_7 = 4, so i = 4.
Subsequently i = 2, 1, 0.
Finally, a match is found. [Figure showing the positions of i and j not recovered.]

Question: when P(i) ≠ S(j), how much should we shift?
Observations:
• We should shift P to the right
• But by how much?
• One answer is: do not backtrack S(j)
[Figure: pattern P with index i (starting at i = 1) aligned under input S with index j (starting at j = 1)]

Observation: Never backtrack on the input string S.

How to Compute the Next Function?
[Pseudocode figure not recovered; its highlighted statements are j := h_j, h_i := h_j, and h_i := j.]
Note: once p_i does not match p_j, we know that j is the index sought: the prefix p_1 … p_{j-1} matches the suffix of p_1 … p_{i-1} that ends at position i-1, and p_i ≠ p_j, so h_i := j.
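Only three statements of the construction survive on this slide (j := h_j, h_i := h_j, h_i := j). The sketch below places them in the standard loop structure, which is an assumption about what the original pseudocode looked like. The function name `compute_next` is likewise hypothetical.

```python
def compute_next(p):
    """Return the next function [h_1, ..., h_m] for pattern p (1-based h)."""
    m = len(p)
    h = [0] * (m + 1)   # h[0] is unused padding; h[1] = 0
    i, j = 1, 0
    while i < m:
        # Fall back through shorter matching prefixes of the pattern.
        while j > 0 and p[i - 1] != p[j - 1]:
            j = h[j]                    # j := h_j
        i += 1
        j += 1
        if p[i - 1] == p[j - 1]:
            h[i] = h[j]                 # h_i := h_j (sliding to j would fail again)
        else:
            h[i] = j                    # h_i := j
    return h[1:]


print(compute_next("abaababaa"))  # [0, 1, 0, 2, 1, 0, 4, 0, 2]
```

The printed values agree with the next-function table given in the slides for the pattern a b a a b a b a a.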

Interpretation of the Next Function
i:    1 2 3 4 5 6 7 8 9
p_i:  a b a a b a b a a
h_i:  0 1 0 2 1 0 4 0 2
Note: P2 = P5, so h_5 = h_2 = 1; P4 = P9, so h_9 = h_4 = 2.
• Interpretation
• Question: how to compute the next function?

KMP - Analysis
• The KMP algorithm never needs to backtrack on the text string.
Time complexity = O(m + n): O(m) preprocessing plus O(n) searching
Space complexity = O(m + n)

KMP Algorithm Complexity Analysis Hints
• What is the cost of building the next function? (hint: in the code for the next function, the operation j := h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop)
• What is the cost of the matching itself? (hint: similar to the above)

Other String Matching Algorithms
• The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772]
• The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260]

Matching of a Set of Keywords?
• Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise.
• How to solve this?

How about repeatedly applying KMP?
What time complexity will the KMP algorithm have when matching k patterns?
- Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time
- Searching takes O(n) time per pattern
So, total time = k · O(m + n)

Question: Can we improve the time complexity when k is large?
Answer: Yes, by preprocessing the input string (tree implementation).

Model for Pattern-Matching Problem
[Diagram: Pattern P → Pattern Matcher generator → Pattern Matcher; Input string S → Preprocessing → Pattern Matcher → Yes / No]

Tree Implementation: suffix tree
• Instead of preprocessing the pattern P, preprocess the text T!
• Use a tree structure in which all suffixes of the text are represented;
• Search for the pattern by looking for substrings of the text;
• You can easily test whether P is a substring of T because any substring of T is a prefix of some suffix of T.
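The "prefix of some suffix" idea can be sketched with an uncompressed suffix trie built from nested dictionaries. This is a simplification assumed for illustration: it takes quadratic time and space in the text length, so it is not the compact suffix tree or a linear-time construction. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Follow the edge for ch, creating it if it does not exist yet.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern is a prefix of some suffix, i.e. a substring."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac")
print(contains(trie, "bxa"), contains(trie, "xb"))  # True False
```

Once the text has been preprocessed, each query walks at most len(pattern) edges, independent of the text length.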

Suffix Tree (Cont'd)
A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m].
[Figure: Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used.]

Note on Suffix Tree
• Not all strings are guaranteed to have a corresponding suffix tree
• For example, xabxa does not have a suffix tree: xa is both a prefix and a suffix of the string, so the path spelling the suffix xa does not end at a leaf
• How to fix the problem: add $, a special “termination” character, to the alphabet.
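The xabxa example can be checked by counting leaves in an uncompressed suffix trie, a simplified stand-in for the suffix tree assumed here for illustration. The helper names are hypothetical. Without the terminator the five suffixes yield only three leaves, and with $ appended every suffix ends at its own leaf.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count nodes with no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


print(count_leaves(build_suffix_trie("xabxa")))   # 3: suffixes xa and a end inside the trie
print(count_leaves(build_suffix_trie("xabxa$")))  # 6: one leaf per suffix
```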

Algorithm for Constructing a Suffix Tree
• A suffix tree can be constructed in linear time [Weiner73, McCreight76, Ukkonen95]

Suffix Tree
Time complexity = O(n + m): O(n) preprocessing plus O(m) searching
Space complexity = O(m + n)

Question
• How can a suffix tree be used to help solve the string matching problem?

Other Tree-Based Methods
• The suffix tree is not the only one ...
