Suffix Trees • Suffix trees • Linearized suffix trees • Virtual suffix trees • Suffix arrays • Enhanced suffix arrays • Suffix cactus, suffix vectors, …
Suffix Trees • String … any sequence of characters. • Substring of string S … string composed of characters i through j, i <= j, of S. • S = cater => ate is a substring. • car is not a substring. • The empty string is a substring of S.
Subsequence • Subsequence of string S … string composed of characters i1 < i2 < … < ik of S. • S = cater => ate is a subsequence. • car is a subsequence. • The empty string is a subsequence.
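The difference between the two definitions can be checked directly. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(s: str, t: str) -> bool:
    """True if t occupies consecutive positions of s (the empty string qualifies)."""
    return t in s


def is_subsequence(s: str, t: str) -> bool:
    """True if the characters of t appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so matches must occur left to right.
    return all(ch in remaining for ch in t)


print(is_substring("cater", "ate"), is_substring("cater", "car"))
print(is_subsequence("cater", "ate"), is_subsequence("cater", "car"))
```

Every substring is also a subsequence, but not the other way around, which is what the car example shows.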
String/Pattern Matching • You are given a source string S. • Answer queries of the form: is the string pi a substring of S? • Knuth-Morris-Pratt (KMP) string matching. • O(|S| + |pi|) time per query. • O(n|S| + Σi |pi|) time for n queries. • Suffix tree solution. • O(|S| + Σi |pi|) time for n queries.
String/Pattern Matching • KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S. • An application of string matching. • Genome project. • Databank of strings (gene sequences). • Character set is ATGC. • Determine if a “new” sequence is a substring of a databank sequence.
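To make the "KMP preprocesses the query string" point concrete, here is a short sketch of KMP matching. The names `failure_table` and `kmp_contains` are illustrative only; the preprocessing touches pi alone, which is why S must be rescanned for every query.

```python
def failure_table(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Report whether p is a substring of s in O(|s| + |p|) time."""
    if not p:
        return True
    fail = failure_table(p)  # preprocessing depends only on the pattern
    k = 0                    # number of pattern characters currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


print(kmp_contains("sleeper", "leep"), kmp_contains("sleeper", "leap"))
```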
Definition Of Suffix Tree • Compressed trie with edge information. • Keys are the nonempty suffixes of a given string S. • Nonempty suffixes of S = sleeper are: • sleeper • leeper • eeper • eper • per, er, and r.
String Matching & Suffixes • pi is a substring of S iff pi is a prefix of some suffix of S. • Nonempty suffixes of S = sleeper are: • sleeper • leeper • eeper • eper • per, er, and r. • Which of these are substrings of S? • leep, eepe, pe, leap, peel
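The substring/suffix equivalence can be tried on the slide's candidate strings with a brute-force sketch. It enumerates every suffix explicitly, so it is far slower than a suffix tree and is meant only to illustrate the statement; the function names are made up for this example.

```python
def nonempty_suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(s: str, p: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for candidate in ["leep", "eepe", "pe", "leap", "peel"]:
    print(candidate, is_substring_via_suffixes("sleeper", candidate))
```

Of the five candidates, leep, eepe, and pe are substrings of sleeper; leap and peel are not.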
Last Character Of S Repeats • When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix. • S = creeper • creeper, reeper, eeper, eper, per, er, r • When the last character of S appears more than once in S, use an end of string character # to overcome this problem. • S = creeper# • creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
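A small check illustrates why the end-of-string character helps. The helper name below is hypothetical; it tests whether any suffix is a proper prefix of another suffix, which is the situation the # character removes.

```python
def has_suffix_that_prefixes_another(s: str) -> bool:
    """True if some nonempty suffix of s is a proper prefix of another suffix."""
    suffixes = [s[i:] for i in range(len(s))]
    return any(
        a != b and b.startswith(a)
        for a in suffixes
        for b in suffixes
    )


print(has_suffix_that_prefixes_another("creeper"))   # r is a prefix of reeper
print(has_suffix_that_prefixes_another("creeper#"))
```

Because # occurs only at the very end, no suffix of creeper# can end in the middle of another suffix, so every suffix gets its own element node.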
[Figure: Suffix Tree For S = abbbabbbb#. Edges are labeled with substrings such as abbb, b, #, abbbb#, and b#; element nodes are labeled with suffix start positions 1 through 10.]
Suffix Tree Construction • See Web write up for algorithm. • Time complexity • |S| = n, alphabet size = r. • O(nr) using array nodes. • This is O(n) for r a constant (or r <= c). • O(n) expected time using a hash table. • O(n) time algorithm for large r in reference cited in Web write up.
Suffix Array • Array that contains the start position of suffixes in lexicographic order. • abbbabbbb# • Assume # < a < b • # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb# • SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6] • LCP = length of longest common prefix between adjacent entries of SA. • LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
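The SA and LCP values on this slide can be reproduced with a naive sketch. It sorts the suffixes directly, so it does not run in linear time, and it relies on Python's default character ordering, in which # < a < b already holds. The function names are illustrative.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Longest-common-prefix length for each pair of adjacent suffix array entries."""
    def lcp(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    return [lcp(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


sa = suffix_array("abbbabbbb#")
print(sa)
print(lcp_array("abbbabbbb#", sa))
```

The sketch returns nine LCP values; the slide's tenth entry is the "-" placeholder for the last suffix, which has no successor.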
Suffix Array • Less space than a suffix tree. • Linear time construction. • Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity. • Substring matching: binary search for p using SA. • O(|p| log |S|) time.
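A sketch of the binary search follows. Each probe compares p with the first |p| characters of one suffix, giving O(|p| log |S|) for the search itself; the suffix array here is built by a naive sort rather than a linear-time algorithm, and both function names are assumptions made for this example.

```python
def build_suffix_array(s: str) -> list[int]:
    """1-based suffix start positions in lexicographic order (naive sort, not linear time)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search for the first suffix whose leading |p| characters are >= p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(sa):
        return False
    start = sa[lo] - 1
    return s[start:start + len(p)] == p


source = "abbbabbbb#"
sa = build_suffix_array(source)
for pattern in ["babb", "abbba", "baba"]:
    print(pattern, sa_contains(source, sa, pattern))
```

The search works because truncating every suffix to |p| characters keeps the suffixes in nondecreasing order.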
[Figure: O(|pi|) Time Substring Matching in the suffix tree for abbbabbbb#. Example patterns: babb, abbba, baba.]
Find All Occurrences Of pi • Search suffix tree for pi. • Suppose the search for pi is successful. • When search terminates at an element node, pi appears exactly once in the source string S.
[Figure: Search Terminates At Element Node, in the suffix tree for abbbabbbb#. Example pattern: abbbb#.]
Search Terminates At Branch Node • When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
[Figure: Search Terminates At Branch Node, in the suffix tree for abbbabbbb#. Example pattern: ab.]
Find All Occurrences Of pi • To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree: • Link all element nodes into a chain in inorder. • Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
[Figure: Augmented Suffix Tree for abbbabbbb#. Example pattern: b.]
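The idea that every element node below the point where the search ends gives one occurrence can be sketched with an uncompressed suffix trie built from nested dictionaries. This is a simplification and not the augmented structure described above: it uses O(|S|^2) space, and it walks the subtree instead of following a chain of element nodes, so it does not achieve the stated time bound. The names `build_suffix_trie` and `occurrences` are illustrative.

```python
def build_suffix_trie(s: str) -> dict:
    """Uncompressed trie of all suffixes of s; s should end in a unique marker such as '#'."""
    root: dict = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        # The multi-character key "start" cannot collide with single-character edge keys.
        node["start"] = start + 1  # 1-based suffix start position, as in the slides
    return root


def occurrences(root: dict, p: str) -> list[int]:
    """Start positions of every occurrence of p, in increasing order."""
    node = root
    for ch in p:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:  # collect every element node in the subtree
        current = stack.pop()
        for key, child in current.items():
            if key == "start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abbbabbbb#")
print(occurrences(trie, "ab"), occurrences(trie, "b"), occurrences(trie, "abbbb#"))
```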
Longest Repeating Substring • Find the longest substring of S that occurs at least m > 1 times in S. • Label branch nodes with the number of element nodes in their subtree. • Find the branch node with label >= m and max char# field.
[Figure: Longest Repeating Substring in the suffix tree for abbbabbbb#, with branch nodes labeled 10, 5, 7, 2, 3 by element-node count. Examples: m = 2 and m = 5.]
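The same question can be answered without a tree by swapping in the suffix array and LCP values: a substring that occurs at least m times is a common prefix of m consecutive suffixes in sorted order, so its length is the minimum of m - 1 adjacent LCP values. The sketch below uses that substitute technique with a naive sort; it is not the branch-node labeling method from the slides, and the function name is an assumption.

```python
def longest_repeated(s: str, m: int) -> str:
    """Longest substring of s occurring at least m (m > 1) times; '' if there is none."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # 0-based positions, naive sort

    def lcp(i: int, j: int) -> int:
        n = 0
        while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
            n += 1
        return n

    lcps = [lcp(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]
    best = ""
    # m consecutive suffixes share a prefix whose length is the min of m - 1 LCP values.
    for k in range(len(lcps) - (m - 1) + 1):
        length = min(lcps[k:k + m - 1])
        if length > len(best):
            best = s[sa[k]:sa[k] + length]
    return best


print(longest_repeated("abbbabbbb#", 2))
print(longest_repeated("abbbabbbb#", 5))
```

For S = abbbabbbb# this gives abbb for m = 2 (occurring at positions 1 and 5) and bb for m = 5 (positions 2, 3, 6, 7, and 8).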
Longest Common Substring • Given two strings S and T. • Find the longest common substring. • S = carport, T = airports • Longest common substring = rport • Longest common subsequence = arport • Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. • Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
Longest Common Substring • Let $ be a new symbol. • Construct the suffix tree for the string U = S$T#. • U = carport$airports# • No repeating substring includes $. • Find longest repeating substring that is both to left and right of $. • Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
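The U = S$T# construction can be tried out with a suffix array standing in for the suffix tree. In sorted order, the longest common substring appears as the longest common prefix of two adjacent suffixes that begin on opposite sides of $. The sketch below sorts suffixes naively, so it does not achieve the O(|S|+|T|) bound, and it assumes that neither $ nor # occurs in S or T; the function name is illustrative.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Longest string that is a substring of both s and t ('' if they share no character)."""
    u = s + "$" + t + "#"  # assumption: '$' and '#' do not occur in s or t
    sa = sorted(range(len(u)), key=lambda i: u[i:])  # naive construction
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Skip pairs of suffixes that begin on the same side of '$'.
        if (a < len(s)) == (b < len(s)):
            continue
        n = 0
        while a + n < len(u) and b + n < len(u) and u[a + n] == u[b + n]:
            n += 1
        if n > len(best):
            best = u[a:a + n]
    return best


print(longest_common_substring("carport", "airports"))
```

A match can never run across $, because a suffix that begins in T contains no $ at all.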