1. Pattern Matching: Suffix Tree Applications
2. Applications Exact string and substring matching
Longest common substrings
Finding and representing repeated substrings efficiently
Applications that lead to alternative, more space-efficient implementations
Matching statistics
Suffix Arrays
3. Exact Set matching Input
A set of patterns P = {P1, P2, ..., Pk} of total length n
Text T of length m
Output
Positions of all occurrences of each pattern Pi in T
Solution method
Preprocess to create suffix tree for T
O(m) time, O(m) space
Maximally match each Pi in suffix tree
O(|P1|) + O(|P2|) + ... + O(|Pk|) = O(n)
Output all leaf positions below match point
O(k) time where k is number of total matches
4. Exact set matching using Aho-Corasick Aho-Corasick algorithm is a classical solution to exact set matching
build keyword tree of set of patterns P
A keyword tree for a pattern set P is a rooted tree T such that:
Each edge e is labeled by a character
Any two edges out of a node have different labels
Define L(v) for a node v as the concatenation of the edge labels on the path from the root to v
For each Pi in P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi such that L(v) = Pi
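The keyword tree definition can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the nested-dict node layout and the helper names `build_keyword_tree` and `labels` are assumptions made for the example, and the pattern set is the one from the first Aho-Corasick example slide.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree: one edge per character, distinct labels per node.

    Each node is a dict with 'children' (char -> node) and 'pattern'
    (index of the pattern Pi with L(v) = Pi, or None).
    """
    root = {"children": {}, "pattern": None}
    for idx, pat in enumerate(patterns):
        node = root
        for ch in pat:
            # Reuse the existing edge for ch, or create a new child node.
            node = node["children"].setdefault(ch, {"children": {}, "pattern": None})
        node["pattern"] = idx
    return root


def labels(node, prefix=""):
    """Yield L(v) for every node v that spells a complete pattern."""
    if node["pattern"] is not None:
        yield prefix
    for ch, child in sorted(node["children"].items()):
        yield from labels(child, prefix + ch)


tree = build_keyword_tree(["abce", "abe", "dce", "ac"])
print(list(labels(tree)))
```

Building the tree takes time proportional to the total pattern length n, since each pattern character creates or follows exactly one edge.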
5. Example of Aho-Corasick Example P = {abce,abe,dce,ac}
6. Example of Aho-Corasick Example P = {abce,ababc,abac}
7. Aho-Corasick vs. Suffix tree Aho-Corasick Approach
O(n) preprocess time and space
to build keyword tree of set of patterns P
O(m+k) search time
Linear time by using the resume link
Suffix Tree Approach
O(m) preprocess time and space
to build suffix tree of T
O(n+k) search time
Using matching statistics (defined later), this tradeoff can be made similar to that of Aho-Corasick
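To make the Aho-Corasick side of the comparison concrete, here is a hedged sketch of the standard automaton with failure links (presumably what the slide calls the resume link). The function names, the list-based state layout, and the 1-based reporting of start positions are choices made for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links and output lists, built in O(n)."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Patterns ending at the failure target also end here.
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def search(text, patterns):
    """Return (1-based start, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]  # resume from the longest proper suffix that is a prefix
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return hits


print(sorted(search("xabcedce", ["abce", "abe", "dce", "ac"])))
```

The text is scanned once and never re-read, which is where the O(m + k) search bound comes from.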
8. Substring problem Input
Pattern P of length n
A set of texts {Ti} of total length m
Output
Positions of all occurrences of P in each text Ti
Solution method
Preprocess to create generalized suffix tree for {Ti}
O(m) time, O(m) space
Maximally match P in generalized suffix tree
Output all leaf positions below match point
O(n+k) time where k is number of total matches
9. Generalized suffix tree T1 = ababc#
T2 = abd$
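The substring problem on the two example texts can be illustrated with a naive stand-in for the generalized suffix tree. Note the assumptions: this builds an uncompressed suffix trie in quadratic time and space (not the O(m) suffix tree of the slides), it stores occurrence positions at every node instead of collecting leaves below the match point, and it drops the `#` and `$` terminators because positions are stored explicitly. All names are hypothetical.

```python
def build_generalized_suffix_trie(texts):
    """Naive trie of all suffixes of all texts.

    Every node records the (text index, 1-based start) pairs of the
    suffixes whose path passes through it.
    """
    root = {"children": {}, "starts": []}
    for t_idx, text in enumerate(texts, start=1):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "starts": []})
                node["starts"].append((t_idx, start + 1))
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root; report all positions at the match point."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # pattern fell off the trie: no occurrence
    return sorted(node["starts"])


trie = build_generalized_suffix_trie(["ababc", "abd"])
print(find_occurrences(trie, "ab"))   # [(1, 1), (1, 3), (2, 1)]
print(find_occurrences(trie, "abc"))  # [(1, 3)]
```

Matching the pattern still costs O(n) steps from the root, as in the suffix tree solution; only the preprocessing cost differs.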
10. Longest Common Substring problem Input
Strings S and T
Output
The longest common substring of S and T (and its position in S and T)
Solution method
Preprocess to create generalized suffix tree for {S,T}
Mark each node by whether or not its subtree contains a leaf node of S, T, or both
A simple post-order tree traversal does this
The path label of the node marked "both" with the greatest string depth is the longest common substring of S and T
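A small sketch of the marking idea follows. It is an illustration under stated assumptions: an uncompressed suffix trie replaces the generalized suffix tree (quadratic rather than linear), and the S/T marks are set while inserting suffixes rather than in a separate post-order pass, which yields the same marks. The function name and node layout are hypothetical.

```python
def longest_common_substring(s, t):
    """Deepest trie node whose subtree contains suffixes of both s and t."""
    root = {"children": {}, "mask": 0}
    for bit, text in ((1, s), (2, t)):  # bit 1 marks S, bit 2 marks T
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "mask": 0})
                node["mask"] |= bit
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if child["mask"] == 3:  # marked "both"
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best


print(longest_common_substring("ababc", "abd"))  # ab
```

Only nodes marked "both" are explored, because any descendant of a node missing one mark is missing it too.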
11. Common substrings of length k problem Input
Strings S and T
Integer k
Output
All common substrings of S and T (and their positions in S and T) of length at least k
Solution method
Same as previous problem
Look for all nodes marked with both leaf labels whose string depth is at least k
12. Longest Common Substrings of more than two Strings Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings
Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}
13. Longest Common Substrings of more than two Strings Input
Strings S1, ..., SK, of total length n
Output
l(j) (and positions in Si) for 2 <= j <= K
Solution method
Build a generalized suffix tree for the K strings
each string has a unique end character, so each leaf shows up only once
14. Longest Common Substrings of more than two Strings Build a generalized suffix tree for the K strings
each string has a unique end character, so each leaf shows up only once
Define c(v): number of distinct leaf labels in subtree rooted at node v and d(v): string-depth from root to node v
Given c(v) and d(v), do a simple traversal of the tree to find l(j) for j = 2, ..., K, and pointers to locations in the strings
Computing c(v) efficiently
Counting leaves is not correct, as several leaves may carry the same string label
Use a bit vector of length K, 1 bit per string in the set
OR the vectors on the way up the tree
Each OR operation takes O(K) time, which gives O(Kn) running time
This can be improved to O(n) later
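The bit-vector computation of c(v) can be sketched as follows, using the example set from the definition slide. Assumptions: an uncompressed suffix trie stands in for the generalized suffix tree, Python integers serve as the length-K bit vectors, and the name `longest_common_lengths` is made up for this example.

```python
def longest_common_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K via bit vectors OR-ed up the trie."""
    K = len(strings)
    root = {"children": {}, "bits": 0}
    for idx, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "bits": 0})
            node["bits"] |= 1 << idx  # mark only where the suffix ends

    best = [0] * (K + 1)  # best[c]: greatest depth d(v) among nodes with c(v) = c

    def or_up(node, depth):
        for child in node["children"].values():
            node["bits"] |= or_up(child, depth + 1)  # OR the children's vectors
        c = bin(node["bits"]).count("1")  # c(v): number of distinct strings below v
        best[c] = max(best[c], depth)
        return node["bits"]

    or_up(root, 0)
    result, running = {}, 0
    for j in range(K, 1, -1):  # "at least j" means a maximum over all c >= j
        running = max(running, best[j])
        result[j] = running
    return result


print(longest_common_lengths(["abcedfg", "cbcedfa", "dbcedg", "cbceg", "acea"]))
```

For this set the result is l(2) = 5 (bcedf), l(3) = 4 (bced), l(4) = 3 (bce), and l(5) = 2 (ce).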
15. Repeated Substrings Definition:
A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a differs from the character to the immediate left (right) of b, so extending the pair in either direction destroys the match.
Add unique characters to front and end of S to include prefixes and suffixes.
Representation: (p1, p2, n)
starting positions and length of the maximal pair
R(S) is the set of all triples representing maximal pairs in S
16. Example of Repeated substrings S =
(2, 7, 3) is a maximal pair
(7, 14, 3) is a maximal pair
(2, 14, 3) is not a maximal pair
(2, 14, 4) is a maximal pair
17. Repeated Substrings A maximal repeat a is a substring in S that is the substring defined by a maximal pair of S
R'(S) is the set of maximal repeats, and |R'(S)| <= |R(S)|
Previous example
xyz and xyzv are maximal repeats of S; however, xyz is represented only once in R'(S), whereas R(S) contains both (2, 7, 3) and (7, 14, 3)
|R'(S)| is smaller than |R(S)|, as xyz shows up twice in R(S) but only once in R'(S)
18. Maximal Repeated Substrings Maximal repeats
Input
String S (length n)
Output
R'(S), the set of maximal repeats of S
Lemma
If a is a maximal repeat in S, then a is the path-label of an internal node v in the suffix tree T of S
a does not end in the middle of an edge
19. Maximal Repeated substrings Definition: the left character of position i is S[i-1]
The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
A node v of T is called left diverse if at least 2 leaves in v's subtree have different left characters
Theorem
String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
Left diversity captures the requirement that the characters preceding two occurrences of a are different
20. Example of left diverse S = ababc
21. Maximal Repeated substrings Solution method
Construct suffix tree for S
There are at most n maximal repeats, because:
The suffix tree has n leaves
All internal nodes except the root have at least two children
Therefore, there are at most n internal nodes, and each maximal repeat is the path-label of one of them
22. Maximal Repeated substrings Find all left diverse nodes in linear time
All nodes will have a left character label
Leaf node:
Label leaves with their left character
Internal node v:
If any child is left diverse, so is v
If two children have different left character labels, v is left diverse
Otherwise, v takes on the common left character of its children
Compact representation
Node v in T is a frontier node if:
v is left diverse
none of v's children are left diverse
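The bottom-up left-character propagation can be sketched on the example string S = ababc. This is an illustration under assumptions: an uncompressed suffix trie stands in for the suffix tree (so only branching trie nodes count as internal nodes), `$` is appended as a unique end character, `^` plays the role of the left character of position 1, and neither symbol is assumed to occur in S.

```python
DIVERSE = object()  # marker for "left diverse"


def maximal_repeats(s):
    """Path labels of branching, left-diverse nodes of the suffix trie of s."""
    text = s + "$"  # unique end marker: every suffix ends at its own leaf
    root = {"children": {}, "left": None}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "left": None})
        # Label the leaf with the left character of its suffix.
        node["left"] = text[start - 1] if start > 0 else "^"

    repeats = []

    def visit(node, label):
        if not node["children"]:
            return node["left"]
        lefts = {visit(child, label + ch) for ch, child in node["children"].items()}
        # One shared value is passed up unchanged; two or more values mean diverse.
        left = lefts.pop() if len(lefts) == 1 else DIVERSE
        is_internal = len(node["children"]) >= 2 and label != ""
        if left is DIVERSE and is_internal:
            repeats.append(label)
        return left

    visit(root, "")
    return sorted(repeats)


print(maximal_repeats("ababc"))  # ['ab']
```

In ababc the node for ab has leaves with left characters `^` and b, so it is left diverse, while the node for b sees only a on the left and is not.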
23. Maximal Repeated substrings Time complexity
Construct suffix tree for S → O(n)
Find all left diverse nodes in linear time → O(n)
Compact representation → O(k), where k is the number of maximal pairs
24. Supermaximal repeated substrings A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
Previous example
xyzv is a supermaximal repeat of S
xyz is NOT a supermaximal repeat of S
25. Supermaximal repeated substrings Supermaximal repeats
Input
String S (length n)
Output
The set of supermaximal repeats of S
Theorem:
A left diverse node v represents a supermaximal repeat if and only if
all of v's children are leaves
and each has a distinct left character
26. Matching Statistics Input:
Pattern P of length n
Text T of length m
Output
Compute ms(i) for 1 <= i <= m
Definition of ms(i)
For 1 <= i <= m, the matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P.
27. Matching Statistics With matching statistics, one can solve several problems with less space than a suffix tree
Exact matching example
We'll show an O(n) preprocessing time and O(m) search time solution matching the traditional methods
P matches substring starting at i in T if and only if ms(i) = |P|
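The definition of ms(i) and the exact-matching criterion can be checked with a deliberately naive sketch. It computes each value straight from the definition using Python's substring test, so it does not achieve the O(n) preprocessing and O(m) search bounds stated above. The function names and sample strings are illustrative assumptions.

```python
def matching_statistics(text, pattern):
    """ms[i-1] = length of the longest prefix of text[i-1:] occurring in pattern."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend the match one character at a time while it still occurs in P.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def exact_matches(text, pattern):
    """1-based positions i with ms(i) = |P|, i.e. the occurrences of P in T."""
    ms = matching_statistics(text, pattern)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]


print(matching_statistics("abcxabcdex", "wyabcwzqabcdw"))
print(exact_matches("abababc", "ab"))  # [1, 3, 5]
```

For the first call, ms(1) = 3 (abc) and ms(5) = 4 (abcd), while positions holding x or e get 0 because those characters never occur in the pattern.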
28. Example of Matching Statistics
29. Matching Statistics Solution method
Compute suffix tree of P retaining suffix links
Adding location of substring in P
p(i): a location in P such that the substring at p(i) matches substring starting at T(i) for exactly ms(i) positions
Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of its leaves
Simply output this value when outputting ms(i) values
30. Matching Statistics Compute ms(1): match T against the tree
Get ms(i+1) from ms(i)
Assume we are at some node v in the tree
If it is internal, follow suffix link to s(v)
Else if it is a leaf, go up one level to its parent w
If w is an internal node, follow suffix link to s(w)
Traverse downwards using skip/count trick until we have matched all the characters in edge label (w,v)
Now match against T character by character till we have a mismatch and can output ms(i+1)
31. Applying matching statistics to LCS problem Input
strings S and T
Output
longest common substring of S and T
Solution method
Compute suffix tree for shorter string, say S
Compute ms(i) values for T
Maximal ms(i) value identifies LCS
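A linear-time sketch of the same idea is given below, with one substitution that should be stated plainly: it uses a suffix automaton of S instead of a suffix tree with suffix links, and it tracks the longest match ending at each position of T rather than starting there. The maximum over all positions is the same, and the index is still built only for S. All names are assumptions for this example.

```python
def build_suffix_automaton(s):
    """Standard online suffix automaton: transitions, suffix links, lengths."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q: the clone keeps the shorter strings.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, length


def longest_common_substring(s, t):
    """Scan t through the automaton of s, following suffix links on mismatch."""
    nxt, link, length = build_suffix_automaton(s)
    state = cur_len = best_len = best_end = 0
    for i, ch in enumerate(t):
        while state and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0  # at the initial state with no transition on ch
        if cur_len > best_len:
            best_len, best_end = cur_len, i + 1
    return t[best_end - best_len:best_end]


print(longest_common_substring("superiorcalifornialives", "sealiver"))  # alive
```

Following a suffix link on a mismatch plays the same role as the suffix-link step in the ms(i+1) computation: the current match is shortened just enough to continue, without rescanning T.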
32. Suffix Arrays Input
Text T of length m
Output
Pos array
Definition of Pos array
A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
Pos[k] = i iff the suffix of T starting at position i is the kth smallest of the m suffixes
Add terminating character $ which is lexically smallest
33. Example of Suffix Arrays T = axfcaxgx#
Suffixes = 1. axfcaxgx#
2. xfcaxgx#
3. fcaxgx#
4. caxgx#
5. axgx#
6. xgx#
7. gx#
8. x#
9. #
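The Pos array for this example can be computed directly by sorting the suffixes. This is a naive sketch (sorting full suffixes rather than traversing a suffix tree), and it assumes the terminator `#` compares smaller than every letter, which holds for ASCII ordering. The function name is hypothetical.

```python
def suffix_array(text):
    """Return Pos as a list of 1-based suffix start positions in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


pos = suffix_array("axfcaxgx#")
print(pos)  # [9, 1, 5, 4, 3, 7, 8, 2, 6]
```

So Pos[1] = 9 (the suffix `#`), Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#), and so on.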
34. Suffix Arrays Solution method
Compute suffix tree of T
Do a lexical depth-first traversal of the suffix tree, filling Pos(k) with the leaf labels in the order the leaves are encountered
Edge (v,u) is lexically smaller than edge (v,w) iff first character of (v,u) is lexically smaller than first character of (v,w)
35. Applying Suffix Arrays to exact pattern matching Input
Pattern P of length n
Text T of length m
Output
All occurrences of P in T
Solution method
Compute suffix array Pos for T
If P occurs in T, then all of its occurrences are grouped consecutively in Pos
36. Applying Suffix Arrays to exact pattern matching Using binary search, find the smallest index i such that P exactly matches the first n characters of suffix Pos(i)
Similarly, find the largest index i' such that P exactly matches the first n characters of suffix Pos(i')
Time complexity O(n log m), since each of the O(log m) comparisons inspects at most n characters
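The two binary searches can be sketched as follows. Assumptions for this example: Pos is built by naive sorting, positions are 1-based as in the slides, and the function name is made up. Each comparison looks at no more than n characters of a suffix.

```python
def find_occurrences(text, pos, pattern):
    """All 1-based occurrences of pattern in text, via two binary searches on pos."""
    n = len(pattern)

    def prefix(k):
        start = pos[k] - 1
        return text[start:start + n]  # first n characters of suffix Pos[k+1]

    lo, hi = 0, len(pos)
    while lo < hi:  # smallest index whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:  # smallest index whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])


text = "axfcaxgx#"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(find_occurrences(text, pos, "ax"))  # [1, 5]
print(find_occurrences(text, pos, "x"))   # [2, 6, 8]
```

The matching suffixes occupy one contiguous block of Pos, so the two boundaries are enough to report every occurrence.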
37. Longest common prefixes Input
Text T of length m
Output
Max(Lcp(i,j)), for 1 <= i, j <= m and i ≠ j
Definition of Lcp(i,j): Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at Pos[i] and Pos[j].
Example from Suffix Arrays
T = axfcaxgx#, Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#)
Lcp(2,3) = 2
38. Longest common prefixes Solution method
We want to get Lcp in O(m) time
However, there are potentially O(m^2) different possible pairs of Lcp values
Crucial point
Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure
39. Longest common prefixes Lcp(i,i+1): string depth of lowest common ancestor encountered during lexical depth-first traversal of suffix tree from Pos(i) leaf to Pos(i+1) leaf
Other Lcp values
Lcp(i,j) for i < j: the minimum of Lcp(k,k+1) over k = i to j-1
Take min of Lcp values of children in the binary tree of needed Lcp values (not the suffix tree)
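The relation between adjacent and general Lcp values can be verified on the running example. This is a naive sketch: adjacent values are computed by direct character comparison instead of during the suffix tree traversal, and the binary tree of precomputed values is not built. The helper names are assumptions.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcps(text, pos):
    """adj[i] = Lcp(i, i+1) for 1 <= i <= m-1 (adj[0] is unused padding)."""
    return [0] + [
        common_prefix_length(text[pos[i - 1] - 1:], text[pos[i] - 1:])
        for i in range(1, len(pos))
    ]


def lcp(adj, i, j):
    """Lcp(i, j) for 1 <= i < j <= m as the minimum of the adjacent values."""
    return min(adj[i:j])


text = "axfcaxgx#"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
adj = adjacent_lcps(text, pos)
print(lcp(adj, 2, 3))  # 2, the common prefix "ax" of axfcaxgx# and axgx#
```

Only the m - 1 adjacent values need to be stored; any other Lcp(i,j) follows from them by taking a minimum over the range.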