Algorithms and Data Structures

Presentation Transcript


  1. Algorithms and Data Structures

  2. Outline • Data Structures • Space Complexity • Case Study: string matching • Array implementation (e.g. KMP alg.) • Tree implementation (e.g. suffix tree) /course/eleg67701-f/Topic-1b

  3. Algorithm in action: data structure transformation • Input data structure → Algorithm (working on an intermediate data structure) → Output data structure

  4. Basic Data Structures • Scalar or “Atomic” data structures • Building blocks for other data structures • Cannot be divided into sub-elements • Integer, floating-point, character, access (pointer) types • Composite data structures • arrays, records • Data Abstraction • Abstract Data Types: A collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, trees etc.

  5. Scalar Data Structure • Conceptual view: a variable name (var1) bound to a value • Physical layout in the computer memory: the value is stored at a memory address (addresses 0238 to 0245 shown) • Assignment operation: var1 ← value; var2 ← var1; var1 ← var3;

  6. Composite Data Structure: Array • Conceptual view: array A[1..5] holding v1 v2 v3 v4 v5, shown with indices 0 1 2 3 4 • Physical layout in the computer memory: v1 to v5 stored at consecutive memory addresses, then nil (addresses 0238 to 0245 shown) • Accessing array elements: A[0] ← 5; k ← 1; A[k] ← 11; A[k+1] ← A[k] + 3

  7. Data Abstraction: Tree • Conceptual view: a tree T whose nodes hold the values v1, v2, v3, v4 • Physical layout in the computer memory: each node stores its value together with the addresses of its children (e.g. 0241, 0244, 0247) or nil (memory addresses 0238 to 0245 shown) • Accessing the elements: T.value ← 12; T.left ← new(T); T.right ← new(T)

  8. Space Analysis • Storage space, like time, is another limited resource that is important to programmers • Space requirements are also expressed as a function of the input size • Space functions are classified in the same manner as running times

  9. Complexity Analysis: Sorting
     Algorithm        Time complexity   Space complexity
     Insertion sort   O(n^2)            O(n)
     Quicksort        O(n log n)        O(n)

  10. Space-Time Tradeoff • Reductions in running time are often possible if we increase storage requirements • Decreasing the amount of storage used by an algorithm usually results in longer running times • Using an array to look up previously computed values can drastically increase the speed of a function
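As a small illustration of the last bullet, the sketch below contrasts a plain recursive function with one that stores previously computed values in an array. The choice of Fibonacci numbers and the names `fib_recursive` and `fib_table` are assumptions made for this example; they do not come from the slides.

```python
def fib_recursive(n: int) -> int:
    """No lookup table, but exponential time: sub-results are recomputed."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_table(n: int) -> int:
    """Spends O(n) extra space on an array to bring the time down to O(n)."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        # Each entry is computed once and then looked up.
        table[k] = table[k - 1] + table[k - 2]
    return table[n]
```

Both functions return the same values; only the balance between storage and running time differs.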

  11. Case Study: Searching for Patterns • Problem: find the first occurrence of pattern P of length m inside the text S of length n. → String matching problem

  12. String Matching - Applications • Text editing • Term rewriting • Lexical analysis • Information retrieval • And, bioinformatics

  13. Model for Pattern-Matching Problem • Pattern P → Pattern Matcher generator → Pattern Matcher • Input string S → Pattern Matcher → Yes / No

  14. Array Implementation • Text S represented as an array of characters: S[1..n] • Pattern P represented as an array of characters: P[1..m] • Example text: S = a g c a g a a g a g t a • [Figure: the pattern P aligned against S at successive positions] • Time complexity = O(m·n) • Space complexity = O(m + n)
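A minimal sketch of the array-based matcher described on this slide: try every alignment of P against S and compare character by character. The function name `naive_match` and the pattern `agag` are assumptions for illustration (the slide's own pattern did not survive in the transcript); the text is the one shown on the slide. Positions are reported 1-based to follow the S[1..n] convention, with 0 meaning "no occurrence".

```python
def naive_match(p: str, s: str) -> int:
    """Return the 1-based position of the first occurrence of p in s, or 0."""
    m, n = len(p), len(s)
    for start in range(n - m + 1):      # every possible alignment of P
        k = 0
        while k < m and s[start + k] == p[k]:
            k += 1                      # extend the match one character
        if k == m:                      # all m characters matched
            return start + 1
    return 0


text = "agcagaagagta"                   # S from the slide
position = naive_match("agag", text)    # hypothetical pattern
```

There are up to n - m + 1 alignments and each costs up to m comparisons, which gives the O(m·n) time bound; the only storage is the two arrays, hence O(m + n) space.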

  15. Can we be more clever? • When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters. • We try to take advantage of this to decide where to restart matching • Example text: S = a g c a g a a g a g t a • [Figure: the pattern P shifted against S after each mismatch]

  16. Problem of Matching Keyword • PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s = xpy for some x and y; “no” otherwise. For convenience, we will assume p = p1p2…pm and s = s1s2…sn, where pi represents the ith character of the pattern and sj the jth character of the input string.

  17. The Knuth-Morris-Pratt Algorithm • Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position). • What to do: construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process. • Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350

  18. The key idea is that if we have successfully matched the prefix p1p2…pi-1 of the keyword with the substring sj-i+1 sj-i+2 … sj-1 of the input string and pi ≠ sj, then we do not need to reprocess any of the suffix sj-i+1 sj-i+2 … sj-1, since we know this portion of the text string is the prefix of the keyword that we have just matched.

  19. Note that the inner while loop iterates as long as p_i and s_j do not match each other. Once they match, the inner while loop terminates, both i and j advance by one, and the inner loop repeats ...
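The code this note refers to is not part of the transcript. The following is a sketch of the matching loop in its usual textbook form, assuming the next function h is already available as a 1-indexed table (slot 0 unused). The function name `kmp_match` and the test text are assumptions for illustration; the table is the one listed later in the transcript for the pattern abaababaa.

```python
def kmp_match(p: str, s: str, h: list[int]) -> int:
    """Return the 1-based start of the first occurrence of p in s, or 0.

    h is the next function of p, stored 1-indexed: h[i] for i = 1..m.
    """
    m, n = len(p), len(s)
    P, S = " " + p, " " + s          # pad so that P[1] and S[1] are the first characters
    i, j = 1, 1
    while i <= m and j <= n:
        # Inner loop: on a mismatch, slide the pattern by following h;
        # j is never decreased, so the text is not re-read.
        while i > 0 and P[i] != S[j]:
            i = h[i]
        i += 1
        j += 1
    return j - m if i > m else 0


# Next function 0 1 0 2 1 0 4 0 2 for the pattern abaababaa (slot 0 is padding).
h_table = [0, 0, 1, 0, 2, 1, 0, 4, 0, 2]
start = kmp_match("abaababaa", "ababaababaa", h_table)   # hypothetical text
```

When i drops to 0 the inner loop stops, and the increments restart the pattern at its first character against the next text character.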

  20. An Important Property of the Next Function in KMP Algorithm • hi is the largest k less than i such that p1p2…pk-1 is a suffix of p1p2…pi-1 (i.e., p1…pk-1 = pi-k+1…pi-1) and pi ≠ pk. If there is no such k, then hi = 0
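To make the property concrete, the sketch below evaluates it directly, trying every k from i-1 down to 1. It assumes the condition on pi and pk is an inequality (pi ≠ pk), which is the reading that reproduces the table 0 1 0 2 1 0 4 0 2 given later in the transcript for the pattern abaababaa. The name `next_by_definition` is hypothetical, and this brute-force version is meant for checking small cases, not as the efficient construction.

```python
def next_by_definition(p: str) -> list[int]:
    """Next function computed straight from its defining property.

    Returns a 1-indexed table h (slot 0 unused). Cubic time in the worst case.
    """
    m = len(p)
    P = " " + p                       # pad so that P[1] is the first character
    h = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):             # largest k first
            prefix = P[1:k]                       # p1 .. p(k-1)
            suffix = P[i - k + 1:i]               # p(i-k+1) .. p(i-1)
            if prefix == suffix and P[i] != P[k]:
                h[i] = k
                break                             # otherwise h[i] stays 0
    return h


table = next_by_definition("abaababaa")[1:]
```

For example, h7 = 4 because p1p2p3 = aba is a suffix of p1…p6 = abaaba and p7 = b differs from p4 = a.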

  21. Backtrack or Not Backtrack? • Assume that for some i and j, P(i) ≠ S(j): what should we do? • The KMP algorithm chooses not to backtrack on the text S (i.e. j) for a good reason • The choice is how to shift the pattern P (i.e. i), that is, by how much • If for each j the shift of P costs a small constant, then the total time complexity is clearly linear in n

  22. An Example • Given: [pattern shown in figure] • Next function: 0 1 0 2 1 0 4 0 2 1 0 7 1 • Input string: [shown in figure] • Scenario 1: i = 12, j = 12. What is hi = h12? hi = 7 • Scenario 2: h12 = 7, i = 7

  23. An Example (Cont’d) • Scenario 3: h7 = 4, i = 4 • Subsequently i = 2, 1, 0 • Finally, a match is found [figure: final positions of i and j]

  24. Question: when P(i) ≠ S(j), how much should we shift? • [Figure: pattern P with index i (starting at i = 1) aligned against input S with index j (starting at j = 1)] • Observations: • We should shift P to the right • But by how much? • One answer is: do not backtrack S(j)

  25. Observation: Never backtrack on the input string S.

  26. How to Compute the Next Function? • [Code figure not in transcript; the surviving annotations are the assignments j := hj, hi := hj and hi := j]

  27. How to Compute the Next Function? • [Code figure not in transcript; the surviving annotations are the assignments j := hj, hi := hj and hi := j] • Note: once p_i does not match p_j, we know that j is the index we are looking for: the prefix of the pattern ending just before j matches the suffix ending just before i
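The slide's code is missing from the transcript, but the three surviving assignments (j := hj, hi := hj, hi := j) fit the standard linear-time construction of the next function, sketched below under that assumption. The name `next_function` is hypothetical, and the table is stored 1-indexed with slot 0 unused.

```python
def next_function(p: str) -> list[int]:
    """Linear-time construction of the KMP next function (1-indexed table)."""
    m = len(p)
    P = " " + p                      # pad so that P[1] is the first character
    h = [0] * (m + 1)                # h[1] = 0 by definition
    i, j = 1, 0
    while i < m:
        # Match the pattern against itself; on a mismatch fall back via h.
        while j > 0 and P[i] != P[j]:
            j = h[j]                 # j := hj
        i += 1
        j += 1
        if P[i] == P[j]:
            h[i] = h[j]              # hi := hj (shifting to j would fail again)
        else:
            h[i] = j                 # hi := j
    return h


h = next_function("abaababaa")[1:]
```

The structure mirrors the matching loop itself: the pattern plays the role of the text, and j only moves backwards through values that are already in the table.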

  28. Interpretation of the Next Function
      i:   1 2 3 4 5 6 7 8 9
      pi:  a b a a b a b a a
      hi:  0 1 0 2 1 0 4 0 2
      • Note: P2 = P5, P4 = P9 • Question: how to compute the next function?

  29. Interpretation of the Next Function • Next function: 0 1 0 2 1 0 4 0 2 • Note: P1 ≠ P5, P4 = P9 • Question: how to compute the next function?

  30. Interpretation of the Next Function • Note: P1 ≠ P5, P4 = P9 • Question: how to compute the next function?

  31. KMP - Analysis • The KMP algorithm never needs to backtrack on the text string. • Time complexity = O(m + n): O(m) for preprocessing plus O(n) for searching • Space complexity = O(m + n)

  32. KMP Algorithm Complexity Analysis Hints • What is the cost in the building of the next function? (hint: in the code for the next function, the operation j=h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop) • What is the cost of the matching itself? (hint: similar to the above)

  33. Other String Matching Algorithms • The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772] • The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260].

  34. Matching of a Set of Keywords? • Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise. • How to solve this?

  35. How about repeatedly applying KMP? • What time complexity will the KMP algorithm have when matching k patterns? • Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time • Searching takes O(n) time per pattern • So, total time = k · O(m + n) = O(k(m + n))

  36. Question: Can we improve the time complexity when k is large? Answer: Yes, by preprocessing the input string (tree implementation).

  37. Model for Pattern-Matching Problem • Pattern P → Pattern Matcher generator → Pattern Matcher • Input string S → Preprocessing → Pattern Matcher → Yes / No

  38. Tree Implementation: suffix tree • Instead of preprocessing the pattern (P), preprocess the text T! • Use a tree structure where all suffixes of the text are represented; • Search for the pattern by looking for substrings of the text; • You can easily test whether P is a substring of T because any substring of T is the prefix of some suffix.

  39. Suffix Tree (Cont’d) • A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m]. • [Figure: suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used.]

  40. Note on Suffix Tree • Not all strings are guaranteed to have a corresponding suffix tree • For example, xabxa does not have a suffix tree, because xa is both a prefix and a suffix of the string (i.e. the path for the suffix xa does not necessarily end at a leaf) • How to fix the problem: add $, a special “termination” character, to the alphabet.
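The problem with xabxa can be checked mechanically: a suffix cannot end at a leaf if it is a proper prefix of another suffix. The helper below, whose name `suffixes_hidden_in_others` is an assumption for this sketch, lists such pairs and shows that appending $ removes them for this example.

```python
def suffixes_hidden_in_others(s: str) -> list[tuple[str, str]]:
    """Pairs (a, b) of suffixes of s where a is a proper prefix of b."""
    suffixes = [s[i:] for i in range(len(s))]
    return [(a, b) for a in suffixes for b in suffixes
            if a != b and b.startswith(a)]


without_terminator = suffixes_hidden_in_others("xabxa")
with_terminator = suffixes_hidden_in_others("xabxa$")
```

Since $ occurs only at the very end of the string, no suffix ending in $ can be a proper prefix of a longer one, so every suffix gets its own leaf.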

  41. Algorithm for Constructing a Suffix Tree • A suffix tree can be constructed in linear time [Weiner73, McCreight76, Ukkonen95]

  42. Suffix Tree • Time complexity = O(n + m): O(n) for preprocessing plus O(m) for searching • Space complexity = O(m + n)

  43. Question • How to use a suffix tree to help solve the string matching problem?
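One way to approach this question is sketched below with a simplified structure: an uncompressed suffix trie built by inserting every suffix one character at a time. This is an assumption-laden stand-in, not the suffix tree of the slides: it takes quadratic time and space to build, whereas the cited algorithms build a compressed suffix tree in linear time. The search step is the same idea in both cases: P occurs in the text exactly when P can be spelled out along a path from the root. The names `build_suffix_trie` and `occurs` are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive suffix trie as nested dicts, one node per character."""
    text += "$"                              # terminator, assumed absent from text
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:              # insert the suffix text[start:]
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1             # 1-based start position of this suffix
    return root


def occurs(root: dict, pattern: str) -> bool:
    """True if pattern is a substring of the indexed text; O(m) steps."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False                     # path breaks: not a substring
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac")           # string from the suffix tree figure
found = occurs(trie, "bxa")
missing = occurs(trie, "xb")
```

Once the text is preprocessed, each additional pattern costs time proportional only to its own length, which is what makes this approach attractive when there are many patterns.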

  44. Other Tree-based Methods • The suffix tree is not the only one ...
