Pattern Matching: Suffix Tree Applications

1. Pattern Matching: Suffix Tree Applications

2. Applications
- Exact string and substring matching
- Longest common substrings
- Finding and representing repeated substrings efficiently
- Applications that lead to alternative, space-efficient implementations: matching statistics and suffix arrays

3. Exact Set Matching
Input: a set of patterns P = {P1, P2, ..., Pk} with total length |P| = n; a text T of length m.
Output: the positions of all occurrences of each pattern Pi in T.
Solution method:
- Preprocess T to build its suffix tree: O(m) time, O(m) space.
- Maximally match each Pi in the suffix tree: O(|P1|) + O(|P2|) + ... + O(|Pk|) = O(n).
- Output all leaf positions below each match point: O(k) time, where k here is the total number of matches (not the number of patterns).
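
The "match, then report the leaves below the match point" idea can be sketched with a naive suffix trie. This is only an illustration under stated assumptions: the trie below takes O(m^2) time and space rather than the O(m) of a real suffix tree, it stores the start positions at every node instead of collecting leaves, and the names `build_suffix_trie`, `occurrences`, and `exact_set_match` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie (assumption: a stand-in for a linear-size suffix tree).

    Each node records the 1-based start positions of all suffixes that pass
    through it, which plays the role of "the leaves below this node".
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def occurrences(trie, pattern):
    """Walk the pattern down from the root; a full match yields all its positions."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # the pattern falls off the tree: no occurrence
    return sorted(node["starts"])


def exact_set_match(text, patterns):
    """Preprocess the text once, then match every pattern of the set against it."""
    trie = build_suffix_trie(text)
    return {p: occurrences(trie, p) for p in patterns}
```

For example, `exact_set_match("abcabxabcd", ["abc", "ab", "bx", "xyz"])` reports positions for the first three patterns and an empty list for `xyz`.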

4. Exact set matching using Aho-Corasick
The Aho-Corasick algorithm is a classical solution to exact set matching. It builds a keyword tree for the set of patterns P.
A keyword tree for a pattern set P is a rooted tree such that:
- each edge e is labeled by a character;
- any two edges leaving the same node have different labels;
- L(v), the label of a node v, is the concatenation of the edge labels on the path from the root to v;
- for each Pi in P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi such that L(v) = Pi.

5. Example of Aho-Corasick: P = {abce, abe, dce, ac}

6. Example of Aho-Corasick: P = {abce, ababc, abac}
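
Since the figures for these two examples are not part of the transcript, the keyword trees can be rebuilt with a short sketch. The nested-dictionary representation and the names `build_keyword_tree` and `path_labels` are assumptions made for illustration only.

```python
def build_keyword_tree(patterns):
    """Insert each pattern character by character; shared prefixes share nodes."""
    root = {"children": {}, "pattern": None}
    for p in patterns:
        node = root
        for ch in p:
            node = node["children"].setdefault(ch, {"children": {}, "pattern": None})
        node["pattern"] = p    # L(v) = p for the node where the pattern ends
    return root


def path_labels(node, prefix=""):
    """Yield L(v) for every node v, visiting children in alphabetical order."""
    yield prefix
    for ch in sorted(node["children"]):
        yield from path_labels(node["children"][ch], prefix + ch)


tree = build_keyword_tree(["abce", "abe", "dce", "ac"])
print(list(path_labels(tree)))
```

Each printed string is the label L(v) of one node, with the empty string standing for the root.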

7. Aho-Corasick vs. Suffix tree
Aho-Corasick approach:
- O(n) preprocessing time and space to build the keyword tree of the pattern set P.
- O(m + k) search time; the search stays linear because resume (failure) links avoid rescanning the text.
Suffix tree approach:
- O(m) preprocessing time and space to build the suffix tree of T.
- O(n + k) search time.
Using matching statistics (to be defined), the suffix tree approach can achieve a tradeoff similar to that of Aho-Corasick.
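
A compact sketch of the Aho-Corasick side of this comparison follows. It is a standard textbook formulation written for illustration, not code from the slides; the list-based state tables and the names `build_automaton` and `search` are assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links, stored as parallel lists indexed by state."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: a node's failure link is derived from its parent's.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]   # inherit patterns ending here
            queue.append(nxt)
    return goto, fail, out


def search(text, patterns):
    """Return sorted (1-based start position, pattern) pairs for all occurrences."""
    goto, fail, out = build_automaton(patterns)
    hits, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                            # resume from the failure link
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 2, p))
    return sorted(hits)
```

The text is scanned once from left to right, which is where the O(m + k) search bound comes from.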

8. Substring problem
Input: a pattern P of length n; a set of texts {Ti} of total length m.
Output: the positions of all occurrences of P in each text Ti.
Solution method:
- Preprocess to build a generalized suffix tree for {Ti}: O(m) time, O(m) space.
- Maximally match P in the generalized suffix tree.
- Output all leaf positions below the match point: O(n + k) time, where k is the total number of matches.

9. Generalized suffix tree: T1 = ababc#, T2 = abd$

10. Longest Common Substring problem
Input: strings S and T.
Output: the longest common substring of S and T (and its positions in S and T).
Solution method:
- Preprocess to build a generalized suffix tree for {S, T}.
- Mark each node by whether its subtree contains a leaf of S, of T, or of both; a simple postorder traversal does this.
- Among the nodes marked with both, the path label of the one with the greatest string depth is the longest common substring of S and T.

11. Common substrings of length k problem
Input: strings S and T; an integer k.
Output: all common substrings of S and T of length at least k (and their positions in S and T).
Solution method: same as the previous problem, but report every node of string depth at least k whose subtree contains leaves of both S and T.

12. Longest Common Substrings of more than two Strings
Definition: for a given set of K strings, l(j), for 2 <= j <= K, is the length of the longest substring common to at least j of the K strings.
Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}

13. Longest Common Substrings of more than two Strings
Input: strings S1, ..., SK of total length n.
Output: l(j) (and positions in the Si) for 2 <= j <= K.
Solution method: build a generalized suffix tree for the K strings. Each string has a unique end character, so each leaf is labeled with exactly one string.

14. Longest Common Substrings of more than two Strings
Build a generalized suffix tree for the K strings. Each string has a unique end character, so each leaf is labeled with exactly one string.
Define c(v) as the number of distinct leaf labels (string identifiers) in the subtree rooted at node v, and d(v) as the string depth of v.
Given c(v) and d(v), a simple traversal of the tree finds l(j) for j = 2, ..., K, along with pointers to the locations in the strings.
Computing c(v) efficiently:
- Counting the leaves below v is not correct, because several leaves may carry the same string label.
- Keep a K-bit vector at each node, one bit per string in the set, and OR the vectors on the way up the tree.
- Each OR operation takes O(K) time, which gives O(Kn) running time overall; this can be improved to O(n) later.
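
The bit-vector computation of c(v) and the resulting l(j) values can be sketched on a naive generalized suffix trie. Assumptions: the trie is quadratic in size (a real generalized suffix tree is linear), a Python integer serves as the K-bit vector, suffix ends are marked directly instead of using unique end characters, and the name `longest_common_lengths` is hypothetical. The recursion is only suitable for short strings.

```python
def longest_common_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K."""
    root = {"children": {}, "mask": 0}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "mask": 0})
            node["mask"] |= 1 << idx        # a suffix of string idx ends at this node

    k = len(strings)
    best = {j: 0 for j in range(2, k + 1)}

    def visit(node, depth):
        """Postorder pass: OR the children's bit vectors into the parent's."""
        mask = node["mask"]
        for child in node["children"].values():
            mask |= visit(child, depth + 1)
        c = bin(mask).count("1")            # c(v): distinct strings below v
        for j in range(2, c + 1):
            best[j] = max(best[j], depth)   # d(v) is the depth in a character trie
        return mask

    visit(root, 0)
    return best


print(longest_common_lengths(["abcedfg", "cbcedfa", "dbcedg", "cbceg", "acea"]))
# {2: 5, 3: 4, 4: 3, 5: 2}
```

On the example set from the definition, the substrings behind these lengths are bcedf, bced, bce, and ce.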

15. Repeated Substrings
Definition: a maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left of a differs from the character to the immediate left of b, and likewise for the characters to their immediate right.
Add unique characters to the front and end of S so that prefixes and suffixes are included.
Representation: a triple (p1, p2, n') gives the two starting positions and the length of the maximal pair.
R(S) is the set of all triples representing maximal pairs in S.

16. Example of Repeated substrings
S = (string omitted in the source)
- (2, 7, 3) is a maximal pair
- (7, 14, 3) is a maximal pair
- (2, 14, 3) is not a maximal pair
- (2, 14, 4) is a maximal pair

17. Repeated Substrings
A maximal repeat a is a substring of S that is defined by a maximal pair of S.
R'(S) is the set of maximal repeats, and |R'(S)| <= |R(S)|.
In the previous example, xyz and xyzv are maximal repeats of S. However, xyz is represented only once in R'(S), whereas R(S) contains both (2, 7, 3) and (7, 14, 3), so here |R'(S)| is smaller than |R(S)|.

18. Maximal Repeated Substrings
Input: a string S of length n.
Output: R'(S).
Lemma: if a is a maximal repeat in S, then a is the path label of an internal node v in the suffix tree T of S; that is, a does not end in the middle of an edge.

19. Maximal Repeated substrings
Definitions:
- The left character of position i is S[i-1].
- The left character of a leaf of the suffix tree T is the left character of the suffix position represented by that leaf.
- A node v of T is left diverse if at least two leaves in v's subtree have different left characters.
Theorem: the string a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse.
Left diversity captures the requirement that the characters preceding two occurrences of a differ.

20. Example of left diverse: S = ababc
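
The theorem can be checked by brute force on this example. The sketch below is not the linear-time suffix tree method: it enumerates every substring, treats "followed by at least two different characters" as the stand-in for "is the path label of an internal node", and uses the sentinels `^` and `$` (assumed not to occur in S) for the unique characters added to the front and end. The name `maximal_repeats` is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Substrings that branch to the right (internal node) and are left diverse."""
    left, right = defaultdict(set), defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            left[sub].add(s[i - 1] if i > 0 else "^")    # left character
            right[sub].add(s[j] if j < n else "$")       # next character
    return sorted(sub for sub in left
                  if len(left[sub]) > 1 and len(right[sub]) > 1)


print(maximal_repeats("ababc"))   # ['ab']
```

In ababc, the substring b labels an internal node (it is followed by a and by c) but is always preceded by a, so it is not left diverse; ab is preceded by the start of the string and by b, so it is.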

21. Maximal Repeated substrings
Solution method: construct the suffix tree for S.
There are at most n maximal repeats: the tree has n leaves and every internal node other than the root has at least two children, so there are at most n internal nodes, and each maximal repeat is the path label of one of them.

22. Maximal Repeated substrings
Find all left diverse nodes in linear time. Every node is either given a left character label or marked left diverse:
- Leaf: label it with its left character.
- Internal node v: if any child is left diverse, so is v; if two children have different left character labels, v is left diverse; otherwise v takes on the common left character of its children.
Compact representation: a node v in T is a frontier node if v is left diverse and none of v's children are left diverse.

23. Maximal Repeated substrings
Time complexity:
- Construct the suffix tree for S: O(n)
- Find all left diverse nodes: O(n)
- Compact representation: O(k), where k is the number of maximal pairs

24. Supermaximal repeated substrings
A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S.
In the previous example, xyzv is a supermaximal repeat of S, but xyz is not.

25. Supermaximal repeated substrings
Input: a string S of length n.
Output: the set of supermaximal repeats of S.
Theorem: a left diverse node v represents a supermaximal repeat if and only if all of v's children are leaves and each has a distinct left character.

26. Matching Statistics
Input: a pattern P of length n; a text T of length m.
Output: ms(i) for 1 <= i <= m.
Definition: for 1 <= i <= m, the matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P.

27. Matching Statistics
With matching statistics, one can solve several problems with less space than a suffix tree.
Exact matching example: we will show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods.
P matches the substring of T starting at position i if and only if ms(i) = |P|.

28. Example of Matching Statistics
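
The example figure is not part of the transcript, so here is a small sketch that computes ms(i) directly from the definition and then applies the exact matching rule ms(i) = |P|. The strings are illustrative choices rather than the slide's own example, Python's substring test stands in for walking a suffix tree of P (so this is far from the stated time bounds), and the function names are hypothetical.

```python
def matching_statistics(pattern, text):
    """ms[i-1] = length of the longest substring of text starting at position i
    (1-based) that occurs somewhere in pattern."""
    ms = []
    for i in range(len(text)):
        length = 0
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


def occurrences(pattern, text):
    """P occurs at position i of T exactly when ms(i) = |P|."""
    ms = matching_statistics(pattern, text)
    return [i + 1 for i, value in enumerate(ms) if value == len(pattern)]


print(matching_statistics("wyabcwzqabcdw", "abcxabcdex"))
# [3, 2, 1, 0, 4, 3, 2, 1, 0, 0]
print(occurrences("abc", "abcxabcdex"))
# [1, 5]
```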

29. Matching Statistics
Solution method: compute the suffix tree of P, retaining its suffix links.
Adding the location of the substring in P:
- p(i) is a location in P such that the substring starting at p(i) matches the substring of T starting at position i for exactly ms(i) positions.
- Before computing the ms(i) values, mark each node of the suffix tree with the leaf number of one of its leaves.
- Simply output this value as p(i) when outputting ms(i).

30. Matching Statistics
Compute ms(1): match T against the tree, starting from the root.
Get ms(i+1) from ms(i). Assume the match for ms(i) ended at some node v in the tree:
- If v is internal, follow its suffix link to s(v).
- Else, if v is a leaf, go up one level to its parent w. If w is an internal node, follow its suffix link to s(w), then traverse downwards using the skip/count trick until all the characters of edge label (w, v) have been matched.
- Now match against T character by character until a mismatch occurs, and output ms(i+1).

31. Applying matching statistics to LCS problem
Input: strings S and T.
Output: the longest common substring of S and T.
Solution method:
- Compute the suffix tree for the shorter string, say S.
- Compute the ms(i) values for T.
- The maximal ms(i) value identifies the LCS.
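
The reason ms(i+1) can be obtained from ms(i) is that ms(i+1) >= ms(i) - 1: dropping the first character of a match leaves a match. The sketch below uses only that property and then takes the maximum for the LCS. It does not implement suffix links or the skip/count trick; Python's substring test again stands in for the suffix tree of S, and the function names are hypothetical.

```python
def matching_statistics(s, t):
    """Matching statistics of t against s, reusing the previous match length."""
    ms, length = [], 0
    for i in range(len(t)):
        length = max(length - 1, 0)          # ms(i+1) >= ms(i) - 1
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1                      # extend character by character
        ms.append(length)
    return ms


def longest_common_substring(s, t):
    """The position of t with the largest matching statistic starts an LCS."""
    ms = matching_statistics(s, t)
    if not ms:
        return ""
    i = max(range(len(ms)), key=ms.__getitem__)
    return t[i:i + ms[i]]


print(longest_common_substring("abcedfg", "cbcedfa"))   # bcedf
```

Because the match length drops by at most one per step, the total number of extension steps is bounded by the length of t, which mirrors the amortized argument behind the linear-time method.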

32. Suffix Arrays
Input: a text T of length m.
Output: the Pos array.
Definition: a suffix array for T, called Pos, is an array of the integers 1 to m specifying the lexicographic order of the m suffixes of T. Pos[k] = i if and only if Ti, the suffix of T starting at position i, is the kth smallest of the m suffixes.
Add a terminating character $ that is lexically smallest.

33. Example of Suffix Arrays
T = axfcaxgx#
Suffixes:
  1. axfcaxgx#
  2. xfcaxgx#
  3. fcaxgx#
  4. caxgx#
  5. axgx#
  6. xgx#
  7. gx#
  8. x#
  9. #
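
For this example, Pos can be obtained by sorting the suffix start positions directly. This is a naive sketch, not the suffix tree traversal described in the slides: comparing whole suffixes makes it much slower than linear time, and it relies on `#` sorting before the letters in ASCII. The name `suffix_array` is hypothetical.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("axfcaxgx#"))
# [9, 1, 5, 4, 3, 7, 8, 2, 6]
```

Read left to right, the result says that suffix 9 (#) is smallest, followed by suffix 1 (axfcaxgx#), suffix 5 (axgx#), and so on.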

34. Suffix Arrays
Solution method:
- Compute the suffix tree of T.
- Do a lexical depth-first traversal of the suffix tree, filling in Pos[k] with the leaves in the order they are encountered.
- Edge (v, u) is lexically smaller than edge (v, w) if and only if the first character of (v, u) is lexically smaller than the first character of (v, w).

35. Applying Suffix Arrays to exact pattern matching
Input: a pattern P of length n; a text T of length m.
Output: all occurrences of P in T.
Solution method: compute the suffix array Pos for T. If P occurs in T, then all of its occurrences are grouped consecutively in Pos.

36. Applying Suffix Arrays to exact pattern matching
Using binary search, find the smallest index i' such that P exactly matches the first n characters of suffix Pos[i'].
Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos[i].
Time complexity: O(n log m).
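
The two binary searches can be sketched as follows. Assumptions: the suffix array is built by naive sorting, the helper names `suffix_array` and `find_occurrences` are hypothetical, and each comparison looks at no more than n characters of a suffix, which is where the O(n log m) bound comes from.

```python
def suffix_array(text):
    """Naive construction by sorting (assumption: fine for short texts only)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    pos = suffix_array(text)
    n = len(pattern)

    def prefix(k):
        # First n characters of the k-th smallest suffix (k is 0-based here).
        return text[pos[k] - 1:pos[k] - 1 + n]

    lo, hi = 0, len(pos)
    while lo < hi:                      # leftmost suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                      # leftmost suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])        # the consecutive block of matches


print(find_occurrences("axfcaxgx#", "ax"))   # [1, 5]
```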

37. Longest common prefixes
Input: a text T of length m.
Output: Max(Lcp(i,j)) for 1 <= i, j <= m and i != j.
Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at positions Pos[i] and Pos[j].
Example from Suffix Arrays: T = axfcaxgx#, Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#), so Lcp(2,3) = 2.

38. Longest common prefixes
Solution method: we want to obtain the needed Lcp values in O(m) time, yet there are potentially O(m^2) different pairs (i, j).
Crucial point: because the search is a binary search, only O(m) of the Lcp values are ever needed, and these have a lot of structure.

39. Longest common prefixes
Lcp(i,i+1): the string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos[i] leaf to the Pos[i+1] leaf.
Other Lcp values: Lcp(i,j) = min over k from i to j-1 of Lcp(k,k+1). Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree).
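
The min rule can be verified on the running example with a short sketch. Here the adjacent values Lcp(k,k+1) are computed by direct character comparison rather than from a suffix tree traversal, and `suffix_array`, `lcp`, `adjacent_lcps`, and `lcp_range` are hypothetical helper names.

```python
def suffix_array(text):
    """Naive construction by sorting (assumption: fine for short texts only)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcps(text, pos):
    """adj[k-1] = Lcp(k, k+1) for 1 <= k < m."""
    return [lcp(text[pos[k] - 1:], text[pos[k + 1] - 1:])
            for k in range(len(pos) - 1)]


def lcp_range(adj, i, j):
    """Lcp(i, j) for 1 <= i < j <= m as the min of Lcp(k, k+1), k = i..j-1."""
    return min(adj[i - 1:j - 1])


text = "axfcaxgx#"
pos = suffix_array(text)
adj = adjacent_lcps(text, pos)
print(adj)                    # [0, 2, 0, 0, 0, 0, 1, 1]
print(lcp_range(adj, 2, 3))   # 2
```

The second printed value reproduces Lcp(2,3) = 2 from the suffix array example.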
