
Pattern Matching: Suffix Tree Applications


elinor

Presentation Transcript


    1. Pattern Matching: Suffix Tree Applications

    2. Applications
       Exact string and substring matching
       Longest common substrings
       Finding and representing repeated substrings efficiently
       Applications that lead to alternative implementations trading space against efficiency:
         Matching statistics
         Suffix arrays

    3. Exact Set Matching
       Input: a set of patterns P = {P1, P2, ..., Pk} of total length |P| = n, and a text T of length m.
       Output: the positions of all occurrences of each pattern Pi in T.
       Solution method:
         Preprocess T to build its suffix tree: O(m) time, O(m) space.
         Maximally match each Pi in the suffix tree: O(|P1|) + O(|P2|) + ... + O(|Pk|) = O(n).
         Output all leaf positions below each match point: O(k) time, where k here denotes the total number of matches (not the number of patterns).

    4. Exact Set Matching Using Aho-Corasick
       The Aho-Corasick algorithm is a classical solution to exact set matching. It builds a keyword tree for the set of patterns P.
       A keyword tree for a pattern set P is a rooted tree T such that:
         Each edge e is labeled by a character.
         Any two edges out of the same node have different labels.
         L(v), the label of a node v, is defined as the concatenation of the edge labels on the path from the root to v.
         For each Pi in P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi such that Pi = L(v).
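A keyword tree of this kind can be sketched in a few lines of Python. The nested-dictionary node layout and the names `build_keyword_tree` and `labels` are illustrative assumptions, not something taken from the slides; the pattern set is the one used in the first Aho-Corasick example.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree; each node is {"children": {...}, "pattern": str or None}."""
    root = {"children": {}, "pattern": None}
    for p in patterns:
        node = root
        for ch in p:
            # One edge per character; edges out of a node have distinct labels
            # because they are keys of the same dictionary.
            node = node["children"].setdefault(ch, {"children": {}, "pattern": None})
        node["pattern"] = p  # L(node) == p
    return root


def labels(node, prefix=""):
    """Yield L(v) for every node v whose path label is a complete pattern."""
    if node["pattern"] is not None:
        yield prefix
    for ch, child in sorted(node["children"].items()):
        yield from labels(child, prefix + ch)


tree = build_keyword_tree(["abce", "abe", "dce", "ac"])
print(sorted(tree["children"]))  # → ['a', 'd']
print(list(labels(tree)))        # → ['abce', 'abe', 'ac', 'dce']
```

Shared prefixes such as "ab" are stored once, which is why the tree takes O(n) space for patterns of total length n.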

    5. Example of Aho-Corasick: P = {abce, abe, dce, ac}

    6. Example of Aho-Corasick: P = {abce, ababc, abac}

    7. Aho-Corasick vs. Suffix Tree
       Aho-Corasick approach:
         O(n) preprocessing time and space to build the keyword tree of the pattern set P.
         O(m + k) search time; the search is linear because it uses the resume (failure) links.
       Suffix tree approach:
         O(m) preprocessing time and space to build the suffix tree of T.
         O(n + k) search time.
       Using matching statistics (defined later), this tradeoff can be made similar to that of Aho-Corasick.
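The Aho-Corasick side of this comparison can be sketched as follows. This is a minimal version under stated assumptions: the list-based state tables `goto`, `fail` and `out` and the function name `aho_corasick` are illustrative choices, and positions are reported 1-based to match the slides.

```python
from collections import deque


def aho_corasick(patterns, text):
    """Return sorted (start, pattern) pairs (1-based start) for every occurrence."""
    # goto[s] maps a character to the next state; state 0 is the root.
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                      # build the keyword tree: O(n)
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)

    # Breadth-first pass: fail[t] is the state for the longest proper suffix
    # of L(t) that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # patterns that end here as suffixes

    hits, s = [], 0
    for i, ch in enumerate(text, start=1):  # scan the text once: O(m + k)
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)


print(aho_corasick(["abce", "abe", "dce", "ac"], "xabcedceac"))
# → [(2, 'abce'), (6, 'dce'), (9, 'ac')]
```

The text pointer never moves backwards; a mismatch only follows failure links, which is what keeps the scan linear in m plus the number of matches.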

    8. Substring Problem
       Input: a pattern P of length n, and a set of texts {Ti} of total length m.
       Output: the positions of all occurrences of P in each text Ti.
       Solution method:
         Preprocess to build a generalized suffix tree for {Ti}: O(m) time, O(m) space.
         Maximally match P in the generalized suffix tree.
         Output all leaf positions below the match point: O(n + k) time, where k is the total number of matches.

    9. Generalized Suffix Tree: T1 = ababc#, T2 = abd$

    10. Longest Common Substring Problem
        Input: strings S and T.
        Output: the longest common substring of S and T (and its positions in S and T).
        Solution method:
          Preprocess to build a generalized suffix tree for {S, T}.
          Mark each node by whether its subtree contains a leaf of S, a leaf of T, or both. A simple postorder tree traversal does this.
          Among the nodes marked with both, the path label of the node with the greatest string depth is the longest common substring of S and T.
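The marking idea can be tried out on a naive, uncompressed generalized suffix trie. This is only a sketch: it takes O(n^2) time and space instead of the linear bounds of a real suffix tree, it marks nodes while inserting suffixes (which yields the same marks as OR-ing them up from the leaves), and the names `longest_common_substring` and `MARK` are assumptions made for illustration.

```python
def longest_common_substring(s, t):
    """Return one longest common substring of s and t ("" if there is none)."""
    MARK = "mark"  # multi-character key, so it cannot clash with a one-character edge
    root = {}
    for bit, string in ((1, s), (2, t)):
        for i in range(len(string)):
            node = root
            for ch in string[i:]:
                node = node.setdefault(ch, {MARK: 0})
                node[MARK] |= bit  # bit 1: below a suffix of s, bit 2: of t

    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node.items():
            if ch != MARK and child[MARK] == 3:  # reachable from both strings
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best


print(longest_common_substring("ababc", "abd"))  # → ab
```

Only nodes marked with both strings are expanded, since every ancestor of a common node is itself common.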

    11. Common Substrings of Length k Problem
        Input: strings S and T, and an integer k.
        Output: all common substrings of S and T of length at least k (and their positions in S and T).
        Solution method:
          Same as the previous problem.
          Look for all nodes of string depth at least k whose subtrees contain leaves of both strings.

    12. Longest Common Substrings of More Than Two Strings
        Definition: for a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings.
        Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}

    13. Longest Common Substrings of More Than Two Strings
        Input: strings S1, ..., SK of total length n.
        Output: l(j) (and positions in the Si) for 2 <= j <= K.
        Solution method:
          Build a generalized suffix tree for the K strings.
          Each string has a unique end character, so each leaf shows up only once.

    14. Longest Common Substrings of More Than Two Strings
        Build a generalized suffix tree for the K strings. Each string has a unique end character, so each leaf shows up only once.
        Define c(v) as the number of distinct leaf labels (string identifiers) in the subtree rooted at node v, and d(v) as the string depth of node v.
        Given c(v) and d(v), a simple traversal of the tree finds l(j) for j = 2, ..., K, together with pointers to the locations of the substrings.
        Computing c(v) efficiently:
          The number of leaves below v is not correct, because several leaves may carry the same label.
          Use a bit vector of length K, one bit per string in the set, and OR your way up the tree.
          Each OR operation takes O(K) time, which gives O(Kn) running time.
          This can be improved to O(n) (shown later).
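A sketch of the bit-vector computation of c(v), again on a naive uncompressed suffix trie rather than a true generalized suffix tree, so the construction is quadratic rather than linear. Python integers stand in for the length-K bit vectors, and the names `lcs_lengths`, `kids` and `bits` are assumptions made for this illustration. The input is the example set from the definition slide.

```python
def lcs_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K, where K = len(strings)."""
    K = len(strings)
    root = {"kids": {}, "bits": 0}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "bits": 0})
            node["bits"] |= 1 << idx  # label only the end of the suffix (its "leaf")

    best = [0] * (K + 1)  # best[c]: greatest depth d(v) over nodes with c(v) == c

    def visit(node, depth):
        # Postorder: OR the children's bit vectors into the parent.
        for child in node["kids"].values():
            visit(child, depth + 1)
            node["bits"] |= child["bits"]
        c = bin(node["bits"]).count("1")  # c(v) = number of distinct strings below v
        best[c] = max(best[c], depth)

    visit(root, 0)

    # l(j) is the greatest depth over all nodes with c(v) >= j.
    result, running = {}, 0
    for j in range(K, 1, -1):
        running = max(running, best[j])
        result[j] = running
    return result


print(lcs_lengths(["abcedfg", "cbcedfa", "dbcedg", "cbceg", "acea"]))
# → {5: 2, 4: 3, 3: 4, 2: 5}
```

For this set the values correspond to "ce" (all five strings), "bce" (four), "bced" (three) and "bcedf" (two).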

    15. Repeated Substrings
        Definition: a maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b.
        Add unique characters to the front and end of S so that prefixes and suffixes are included.
        Representation: (p1, p2, n), the two starting positions and the length of the maximal pair.
        R(S) is the set of all triples representing maximal pairs in S.

    16. Example of Repeated Substrings
        For the example string S (shown on the slide):
        (2, 7, 3) is a maximal pair.
        (7, 14, 3) is a maximal pair.
        (2, 14, 3) is not a maximal pair.
        (2, 14, 4) is a maximal pair.

    17. Repeated Substrings
        A maximal repeat a is a substring of S that is the substring defined by a maximal pair of S.
        R'(S) is the set of maximal repeats, and |R'(S)| <= |R(S)|.
        In the previous example, xyz and xyzv are maximal repeats of S. However, xyz is represented only once in R'(S), while both (2, 7, 3) and (7, 14, 3) are in R(S).
        So |R'(S)| is smaller than |R(S)|, because xyz shows up twice in R(S) but only once in R'(S).

    18. Maximal Repeated Substrings
        Maximal repeats problem:
          Input: a string S of length n.
          Output: R'(S), the set of maximal repeats of S.
        Lemma: if a is a maximal repeat in S, then a is the path label of an internal node v in the suffix tree T of S; that is, a does not end in the middle of an edge.

    19. Maximal Repeated Substrings
        Definitions:
          The left character of position i is S[i-1].
          The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf.
          A node v of T is called left diverse if at least two leaves in v's subtree have different left characters.
        Theorem: the string a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse.
        Left diversity captures the requirement that the characters immediately before two occurrences of a differ.

    20. Example of Left Diverse: S = ababc

    21. Maximal Repeated Substrings
        Solution method: construct the suffix tree for S.
        There are at most n maximal repeats: the tree has n leaves, and every internal node except the root has at least two children, so there are at most n internal nodes.

    22. Maximal Repeated Substrings
        Find all left diverse nodes in linear time. Every node gets a left character label:
          Leaf node: label the leaf with its left character.
          Internal node v:
            If any child is left diverse, so is v.
            If two children have different left character labels, v is left diverse.
            Otherwise, v takes on the common left character value of its children.
        Compact representation: a node v in T is a frontier node if v is left diverse and none of v's children are left diverse.
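The labeling rule can be sketched on a naive suffix trie. Several things here are assumptions for illustration: the trie is uncompressed (quadratic, unlike the linear-time suffix tree), branching trie nodes play the role of the suffix tree's internal nodes, the terminator `$` is assumed not to occur in S, and the first suffix gets the empty string as its unique left character. The names `maximal_repeats` and `DIVERSE` are hypothetical.

```python
DIVERSE = object()  # marker meaning "left diverse"


def maximal_repeats(s, end="$"):
    """Return the sorted maximal repeats of s (assumes `end` does not occur in s)."""
    text = s + end
    root = {"kids": {}, "left": None}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["kids"].setdefault(ch, {"kids": {}, "left": None})
        # `node` is the leaf of suffix i; the first suffix gets a unique
        # left character ("" never equals a real character).
        node["left"] = text[i - 1] if i > 0 else ""

    found = []

    def visit(node, label):
        """Return the node's left character, or DIVERSE if it is left diverse."""
        if not node["kids"]:
            return node["left"]
        values = {visit(child, label + ch) for ch, child in node["kids"].items()}
        # One shared value is inherited; two or more (or a diverse child) means diverse.
        status = values.pop() if len(values) == 1 else DIVERSE
        if status is DIVERSE and len(node["kids"]) >= 2 and label:
            found.append(label)  # branching, left diverse, and not the root
        return status

    visit(root, "")
    return sorted(found)


print(maximal_repeats("ababc"))  # → ['ab']
```

In "ababc" the node for "b" is branching but both of its leaves have left character a, so only "ab" is reported.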

    23. Maximal Repeated Substrings
        Time complexity:
          Construct the suffix tree for S: O(n)
          Find all left diverse nodes in linear time: O(n)
          Compact representation: O(k), where k is the number of maximal pairs

    24. Supermaximal Repeated Substrings
        A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S.
        In the previous example, xyzv is a supermaximal repeat of S; xyz is NOT a supermaximal repeat of S.

    25. Supermaximal Repeated Substrings
        Supermaximal repeats problem:
          Input: a string S of length n.
          Output: the set of supermaximal repeats of S.
        Theorem: a left diverse node v represents a supermaximal repeat if and only if all of v's children are leaves and each has a distinct left character.

    26. Matching Statistics
        Input: a pattern P of length n and a text T of length m.
        Output: ms(i) for 1 <= i <= m.
        Definition: for 1 <= i <= m, the matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P.

    27. Matching Statistics
        With matching statistics, one can solve several problems with less space than a suffix tree.
        Exact matching example: we'll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods.
        P matches the substring starting at position i in T if and only if ms(i) = |P|.
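The definition of ms(i) and the exact-matching criterion can be checked with a naive sketch. It restarts from the root of an uncompressed suffix trie of P for every position of T, so it runs in O(mn) worst-case time rather than the O(m) achieved with suffix links; the function name `matching_statistics` and the example strings are illustrative assumptions.

```python
def matching_statistics(pattern, text):
    """Return ms as a list with ms[i] for 1 <= i <= len(text); ms[0] is unused."""
    root = {}
    for i in range(len(pattern)):      # naive suffix trie of the pattern
        node = root
        for ch in pattern[i:]:
            node = node.setdefault(ch, {})

    ms = [0] * (len(text) + 1)
    for i in range(len(text)):
        node, depth = root, 0
        # Walk down from the root for as long as the text keeps matching.
        while i + depth < len(text) and text[i + depth] in node:
            node = node[text[i + depth]]
            depth += 1
        ms[i + 1] = depth
    return ms


ms = matching_statistics("wyabcwzqabcdw", "abcxabcdex")
print(ms[1:])  # → [3, 2, 1, 0, 4, 3, 2, 1, 0, 0]

# Exact matching: P occurs at position i exactly when ms(i) == |P|.
p, t = "ab", "cabab"
ms = matching_statistics(p, t)
print([i for i in range(1, len(t) + 1) if ms[i] == len(p)])  # → [2, 4]
```

Only the trie of P is kept in memory, which is the space saving the slide refers to when T is much longer than P.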

    28. Example of Matching Statistics

    29. Matching Statistics
        Solution method: compute the suffix tree of P, retaining the suffix links.
        Adding the location of the substring in P:
          p(i) is a location in P such that the substring of P starting at p(i) matches the substring of T starting at position i for exactly ms(i) positions.
          Before computing the ms(i) values, mark each node of the suffix tree with the leaf number of one of its leaves.
          Simply output this value when outputting the ms(i) values.

    30. Matching Statistics
        Compute ms(1): match T against the tree, starting from the root.
        Get ms(i+1) from ms(i):
          Assume we are at some node v in the tree.
          If v is internal, follow its suffix link to s(v).
          Otherwise, if v is a leaf, go up one level to its parent w. If w is an internal node, follow its suffix link to s(w).
          Traverse downwards using the skip/count trick until all the characters of the edge label (w, v) have been matched.
          Now match against T character by character until there is a mismatch; then output ms(i+1).

    31. Applying Matching Statistics to the LCS Problem
        Input: strings S and T.
        Output: the longest common substring of S and T.
        Solution method:
          Compute the suffix tree for the shorter string, say S.
          Compute the ms(i) values for T.
          The maximal ms(i) value identifies the LCS.

    32. Suffix Arrays
        Input: a text T of length m.
        Output: the Pos array.
        Definition: a suffix array for T, called Pos, is an array of the integers in the range 1 to m specifying the lexicographic order of the m suffixes of T.
        Pos[k] = i if and only if Ti, the suffix of T starting at position i, is the kth smallest of the m suffixes.
        Append a terminating character $ that is lexically smaller than every other character.

    33. Example of Suffix Arrays
        T = axfcaxgx#
        Suffixes:
          1. axfcaxgx#
          2. xfcaxgx#
          3. fcaxgx#
          4. caxgx#
          5. axgx#
          6. xgx#
          7. gx#
          8. x#
          9. #

    34. Suffix Arrays
        Solution method:
          Compute the suffix tree of T.
          Do a lexical depth-first traversal of the suffix tree, filling Pos(k) with the leaves in the order in which they are encountered.
          Edge (v, u) is lexically smaller than edge (v, w) if and only if the first character of (v, u) is lexically smaller than the first character of (v, w).
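For small inputs the Pos array can be produced without a suffix tree at all. The sketch below swaps in a plain comparison sort of the suffixes for the lexical depth-first traversal, which is simpler but not linear time; the name `suffix_array` and the unused padding at index 0 (to keep Pos 1-based) are assumptions for illustration.

```python
def suffix_array(text):
    """Return Pos as a list where pos[k] is the start (1-based) of the kth smallest suffix."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [0] + [i + 1 for i in order]  # pos[0] is unused padding


pos = suffix_array("axfcaxgx#")
print(pos[1:])  # → [9, 1, 5, 4, 3, 7, 8, 2, 6]
```

This works for the example string because `#` sorts before the lowercase letters in ASCII, matching the assumption that the terminator is lexically smallest.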

    35. Applying Suffix Arrays to Exact Pattern Matching
        Input: a pattern P of length n and a text T of length m.
        Output: all occurrences of P in T.
        Solution method:
          Compute the suffix array Pos for T.
          If P occurs in T, then all of its locations are grouped consecutively in Pos.

    36. Applying Suffix Arrays to Exact Pattern Matching
        Using binary search, find the smallest index i such that P exactly matches the first n characters of suffix Pos(i).
        Similarly, find the largest index i' such that P exactly matches the first n characters of suffix Pos(i').
        The occurrences of P are Pos(i), ..., Pos(i').
        Time complexity: O(n log m)
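The two binary searches can be sketched as below. The suffix array is built by sorting for brevity, the helper name `find_occurrences` is hypothetical, and the second search looks for the first index past the matching block rather than the last index inside it, which is an equivalent formulation.

```python
def find_occurrences(pattern, text):
    """Return the sorted 1-based start positions of pattern in text."""
    n = len(pattern)
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array

    def prefix(k):
        # First n characters of the kth smallest suffix.
        return text[pos[k]:pos[k] + n]

    # Smallest index whose n-character prefix is >= pattern.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Smallest index whose n-character prefix is > pattern.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes first .. lo-1 all start with the pattern.
    return sorted(p + 1 for p in pos[first:lo])


print(find_occurrences("ax", "axfcaxgx#"))  # → [1, 5]
print(find_occurrences("x", "axfcaxgx#"))   # → [2, 6, 8]
```

Each comparison inspects at most n characters and each search takes O(log m) steps, which gives the O(n log m) bound once Pos is available.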

    37. Longest Common Prefixes
        Input: a text T of length m.
        Output: max Lcp(i, j) for 1 <= i, j <= m and i != j.
        Definition: Lcp(i, j) is the length of the longest common prefix of the suffixes of T beginning at Pos[i] and Pos[j].
        Example from Suffix Arrays: T = axfcaxgx#, Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#), so Lcp(2, 3) = 2.

    38. Longest Common Prefixes
        Solution method:
          We want to obtain the Lcp values in O(m) time.
          However, there are potentially O(m^2) different pairs (i, j) with Lcp values.
        Crucial point: since the search is a binary search, only O(m) Lcp values are ever needed, and these have a lot of structure.

    39. Longest Common Prefixes
        Lcp(i, i+1): the string depth of the lowest common ancestor of the Pos(i) leaf and the Pos(i+1) leaf, encountered during the lexical depth-first traversal of the suffix tree from one leaf to the next.
        Other Lcp values: Lcp(i, j) = min of Lcp(k, k+1) over k from i to j-1.
        Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree).
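The relation between adjacent and general Lcp values, taking Lcp(i, j) as the minimum of Lcp(k, k+1) for k from i to j-1, can be checked with a small sketch. It compares suffixes character by character instead of reading string depths off the suffix tree, so it is not the O(m) method; the names `lcp_adjacent` and `lcp` are assumptions made for illustration.

```python
def lcp_adjacent(text):
    """Return adj where adj[i] = Lcp(i, i+1) for 1 <= i < m; adj[0] is unused."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array

    def common(a, b):
        # Length of the longest common prefix of the suffixes starting at a and b.
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [common(pos[i], pos[i + 1]) for i in range(len(pos) - 1)]


def lcp(adj, i, j):
    """Lcp(i, j) for 1 <= i < j <= m as the minimum of the adjacent values between them."""
    return min(adj[i:j])


adj = lcp_adjacent("axfcaxgx#")
print(adj[1:])         # → [0, 2, 0, 0, 0, 0, 1, 1]
print(lcp(adj, 2, 3))  # → 2
print(lcp(adj, 7, 9))  # → 1
```

The second query compares the suffixes x# and xgx#, whose common prefix is the single character x.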
