
String Data Structures and Algorithms

This book explores the use of string data structures and algorithms in solving biological problems, including suffix trees, common substrings, lowest common ancestors (LCA), longest common extensions (LCE), and finding palindromes.



Presentation Transcript


  1. String Data Structures and Algorithms David Fernández-Baca UNAM (Mexico) (based on notes by Srinivas Aluru) slightly modified by Benny Chor

  2. Why Strings? • Biological sequences can be viewed as strings, or finite series of characters, over an alphabet Σ. • There is a wealth of algorithmic theory developed for general strings that we can apply to specific biological problems. BBSI Summer School - Iowa State University

  3. Suffix Trees S = M A L A Y A L A M $, positions 1 2 3 4 5 6 7 8 9 10. [Figure: suffix tree of S, with one leaf per suffix, each leaf labeled with the starting position (1 to 10) of its suffix.]

  4. Suffix tree properties • For a string S of length n, there are n leaves and at most n internal nodes. • Therefore the tree requires only linear space. • Each leaf represents a unique suffix. • Concatenation of edge labels from root to a leaf spells out the suffix. • Each internal node represents a distinct common prefix to at least two suffixes.

  5. Edge Encoding S = M A L A Y A L A M $, positions 1 2 3 4 5 6 7 8 9 10. [Figure: the same suffix tree with each edge label replaced by a pair (i, j) of start and end positions in S, e.g. (5, 10) for YALAM$ and (3, 4) for LA.]

  6. Naïve Suffix Tree Construction Before starting: Why exactly do we need this $, which is not part of the alphabet?

  7. Naïve Suffix Tree Construction [Figure: the partial tree after inserting suffixes 1 to 4 of MALAYALAM$ one at a time.]

  8. Finding a (short) Pattern in a (long) String • Build a suffix tree of the string. • Starting from the root, traverse a path matching characters of the pattern. • If stuck, the pattern is not present in the string. Otherwise, each leaf below gives a position of the pattern in the string.

  9. Finding a Pattern in a String Find “ALA”. [Figure: suffix tree of MALAYALAM$; the path spelling ALA ends at a node with leaves 6 and 2 below it.] Two matches, at positions 6 and 2.
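The search on slides 8 and 9 can be sketched in a few lines. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie (one node per character, quadratic space) rather than the linear-space suffix tree of the slides, and the names `build_suffix_trie` and `find_pattern` are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The multi-character key "leaf" cannot collide with a single character.
        node["leaf"] = start + 1  # 1-based suffix position, as on the slides
    return root


def find_pattern(root, pattern):
    """Walk down the trie along pattern, then collect every leaf below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # stuck: pattern does not occur
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))  # positions where ALA starts
```

The terminator $ guarantees that every suffix ends at its own leaf, which is why each leaf below the matched node corresponds to exactly one occurrence.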

  10. Finding Common Substrings • Construct a generalized suffix tree for two strings (each suffix of each string is represented). • Label each leaf with the suffix number and string label. • Each internal node with a leaf from both strings in its subtree gives a common substring.

  11. Generalized Suffix Tree WINDOW$ and INDIGO$, positions 1 to 7 each. [Figure: generalized suffix tree of the two strings; each leaf carries a pair (string, suffix), e.g. (1, 2) for suffix 2 of WINDOW$ and (2, 1) for suffix 1 of INDIGO$.]
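As a rough check of the common-substring idea on this example, the sketch below finds the longest common substring of WINDOW and INDIGO. It deliberately swaps the generalized suffix tree for a simpler stand-in: it sorts the suffixes of both strings together and compares neighbours that come from different strings, which gives the same answer but not the linear running time. The function name is hypothetical.

```python
def longest_common_substring(a, b):
    """Longest common substring via a merged, sorted suffix list (naive)."""
    # Tag each suffix with the string it came from (1 or 2).
    suffixes = sorted(
        [(a[i:], 1) for i in range(len(a))] + [(b[j:], 2) for j in range(len(b))]
    )
    best = ""
    for (s, s_id), (t, t_id) in zip(suffixes, suffixes[1:]):
        if s_id == t_id:
            continue  # both suffixes come from the same string
        h = 0
        while h < min(len(s), len(t)) and s[h] == t[h]:
            h += 1
        if h > len(best):
            best = s[:h]
    return best


print(longest_common_substring("WINDOW", "INDIGO"))
```

Any two suffixes from different strings with a common prefix of length L have, somewhere between them in sorted order, an adjacent pair from different strings sharing at least L characters, so checking neighbours is enough.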

  12. Lowest Common Ancestors • The lowest common ancestor (lca) of two nodes x and y in a rooted tree is the deepest node (farthest away from root) that is an ancestor of both x and y. • Concatenation of edge labels from root to the lca of two leaves spells out the longest common prefix (lcp) of the two corresponding suffixes. • lca(x, y) can be found in constant time after linear preprocessing [Bender00].

  13. A Useful Property String depth(lca(i, j)) = lcp(suffix i, suffix j). [Figure: suffix tree of MALAYALAM$ with an lca node of string depth 3 marked.]

  14. Longest Common Extension RAILWAY$ (positions 1 to 8) and GRAINY$ (positions 1 to 7) share RAI. lce(1,1) = 0; lce(2,1) = 3, pairing suffix 2 of GRAINY$ with suffix 1 of RAILWAY$. We’ll soon find lce’s useful in reconstructing phylogenetic trees based on whole genome/proteome sequences.

  15. lce’s and lca’s To compute lce’s for two strings S1 and S2: • Build the generalized suffix tree, T, of S1 and S2. • Compute the string depth of each node in T. • Preprocess T for lca queries. • lce(i, j) = string depth of the lca of suffix i of S1 and suffix j of S2.
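To pin down what an lce query returns, here is a reference sketch that compares characters directly. It is not the constant-time method of slide 15 (no suffix tree, no lca preprocessing); each call costs time proportional to the answer. The function name `lce` and its argument order are assumptions made for illustration.

```python
def lce(s1, s2, i, j):
    """Length of the longest common prefix of suffix i of s1 and suffix j of s2.

    Positions are 1-based, as on the slides. Naive character comparison;
    a shared terminator such as $ is counted as a match in this sketch.
    """
    a, b = i - 1, j - 1
    h = 0
    while a + h < len(s1) and b + h < len(s2) and s1[a + h] == s2[b + h]:
        h += 1
    return h


print(lce("RAILWAY$", "GRAINY$", 1, 1))  # R vs G: no common extension
print(lce("RAILWAY$", "GRAINY$", 1, 2))  # suffix 1 of RAILWAY$ and suffix 2 of GRAINY$ share RAI
```

A suffix-tree implementation would return the same values, but after linear preprocessing each query would take constant time.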

  16. Example [Figure: the generalized suffix tree of WINDOW$ and INDIGO$ from slide 11, shown again.]

  17. lce’s, revisited Given two strings S1 and S2, we are now interested in finding, for each i, the index j such that lce(i, j) is maximal. • What is the meaning of this task? • How do we accomplish it efficiently? • Notice that computing the values lce(i, j) for all j is inefficient!

  18. Palindromes • A palindrome is a string that reads the same in both directions. • E.g., CATGTAC • red rum, sir, is murder • Palindrome problem: Find all maximal palindromes in a string S.

  19. Finding Palindromes in S • Construct the reverse S’ of S. • Build the generalized suffix tree T of S and S’. • Preprocess T for lce queries. • Now what? Left as homework. Requirement: linear time (constant per query). [Figure: string S with position q + 1 marked.]

  20. Palindromes in DNA sequences • We sometimes need to deal with complemented palindromes, where A ↔ T and C ↔ G. • E.g., ATCATGAT is a complemented palindrome. • All complemented palindromes in S can be found using a GST of S and the complement of S’.
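The definition of a complemented palindrome can be checked directly: a DNA string qualifies when it equals the complement of its own reverse. The short sketch below tests only that definition on a whole string; it does not find all complemented palindromes inside S, and the helper names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Reverse the string and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def is_complemented_palindrome(dna):
    """True when dna reads the same as the complement of its reverse."""
    return dna == reverse_complement(dna)


print(is_complemented_palindrome("ATCATGAT"))  # the slide's example
print(is_complemented_palindrome("CATGTAC"))   # an ordinary palindrome
```

Note that an ordinary palindrome such as CATGTAC is generally not a complemented palindrome, since complementing changes every base.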

  21. Suffix Array – Reducing Space M A L A Y A L A M $, positions 1 2 3 4 5 6 7 8 9 10. [Figure: suffix array and longest common prefix (lcp) array of MALAYALAM$.] Suffixes 6 and 2 share “ALA”. Suffixes 2 and 8 share just “A”. The lcp is always taken between suffixes that are adjacent in the suffix array.

  22. Pattern Search in Suffix Array • All suffixes that share a common prefix appear in consecutive positions in the array. • Pattern P can be located in the string using a binary search on the suffix array. Naïve run-time = O(|P| log n). Improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
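The naïve O(|P| log n) search can be sketched as two binary searches that bracket the block of suffixes starting with P. The suffix array here is built by plain sorting, which is an assumption made for brevity and is far from linear time; the function names are hypothetical. Python's default ordering puts $ before the letters, so the array order may differ from the one drawn on the slides, but the set of matches does not.

```python
def suffix_array(text):
    """0-based suffix array built by sorting whole suffixes (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return 1-based start positions of pattern using binary search on sa."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at this rank in the array.
        return text[sa[rank]:sa[rank] + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(start + 1 for start in sa[first:lo])


text = "MALAYALAM$"
print(find_occurrences(text, suffix_array(text), "ALA"))
```

Each of the O(log n) comparisons looks at no more than |P| characters, which is where the naïve bound comes from.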

  23. Computing longest common prefix Values • Find where suffix 1 is in the suffix array. • Compute the lcp value of suffix 1. • Find suffix 2 in the suffix array. • Compute the lcp value of suffix 2. • Repeat for all suffixes. Run-time is linear (why?)

  24. Example Text: M A L A Y A L A M $. Position: 1 2 3 4 5 6 7 8 9 10. Suffix Array: 6 2 8 4 7 3 1 9 5 10. lcp Array: 3 1 1 0 2 0 1 0 0.
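The procedure on slide 23 (visit suffixes in text order, look each one up in the suffix array, and reuse the previous lcp value) resembles what is commonly known as Kasai's algorithm. The sketch below follows that scheme and reproduces the arrays of slide 24. Two assumptions are made for illustration: the suffix array is built by naive sorting, and $ is ordered after all letters (by mapping it to "~" in the sort key), since that is the order the slide's array implies.

```python
def suffix_array(text):
    """0-based suffix array; $ is treated as larger than every letter."""
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", "~"))


def lcp_array(text, sa):
    """lcp[r] = common prefix length of the suffixes at ranks r and r + 1."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * (n - 1)
    h = 0  # lcp carried over from the previous suffix in text order
    for i in range(n):
        r = rank[i]
        if r == n - 1:  # last suffix in the array has no successor
            h = 0
            continue
        j = sa[r + 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


text = "MALAYALAM$"
sa = suffix_array(text)
print([s + 1 for s in sa])  # 1-based, as on the slide
print(lcp_array(text, sa))
```

The running time is linear because h drops by at most one per suffix and can never exceed n, so the inner while loop performs O(n) character comparisons in total.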

  25. Suffix Trees vs. Suffix Arrays Suffix Array = Lexicographic order of the leaves of the Suffix Tree. Suffix Tree = Suffix Array + lcp Array (why? Wait for next slide).

  26. Building a ST from a SA and lcp [Figure: part of the suffix tree of MALAYALAM$ rebuilt from the SA and lcp arrays, with nodes at string depths D = 0, 1, 2 and 3.]

  27. Some Results • Suffix trees can be constructed in O(n) time and O(n |Σ|) space [Weiner73, McCreight76, Ukkonen92]. • Suffix arrays can be constructed without using suffix trees in O(n) time [Ko&Aluru03].

  28. More Applications • Suffix-prefix overlaps in fragment assembly • Maximal and tandem repeats • Shortest unique substrings • Maximal unique matches [MUMmer] • Approximate matching

  29. Dealing with errors • The basic string data structures can only extract information in the absence of errors. • To deal with errors, decompose into parts that do not involve errors.

  30. The k-mismatch problem • Given a pattern P, a text T, and a number k, find all occurrences of P in T with at most k mismatches. Example: P = bend, T = abentbananaend, k = 2. Match 1: bent. Match 2: bana. Match 3: aend.

  31. Solution • Build the GST of P and T and preprocess it for lce queries. • For each starting index i in T, do at most k + 1 lce queries to determine if there is a k-mismatch beginning at i. Time = O(k |T|).
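The lce-jumping idea behind this solution can be sketched as follows. At each starting index the code jumps over a run of matching characters with one lce query, charges one mismatch, steps past it, and repeats, so a start position needs at most k + 1 queries. The lce used here is a naive character-comparison stand-in, an assumption made to keep the example short; with constant-time lce queries from a preprocessed GST the same loop would give the O(k |T|) bound. Function names are hypothetical.

```python
def lce(s1, s2, a, b):
    """Naive 0-based lce: length of the common prefix of s1[a:] and s2[b:]."""
    h = 0
    while a + h < len(s1) and b + h < len(s2) and s1[a + h] == s2[b + h]:
        h += 1
    return h


def k_mismatch(pattern, text, k):
    """1-based start positions where pattern matches text with <= k mismatches."""
    m, n = len(pattern), len(text)
    hits = []
    for start in range(n - m + 1):
        p = 0           # how much of the pattern has been accounted for
        mismatches = 0
        while True:
            p += lce(pattern, text, p, start + p)  # jump over a matching run
            if p >= m:
                break                              # reached the end of the pattern
            mismatches += 1
            if mismatches > k:
                break                              # too many mismatches, give up
            p += 1                                 # step past the mismatch
        if mismatches <= k:
            hits.append(start + 1)
    return hits


print(k_mismatch("bend", "abentbananaend", 2))  # starts of bent, bana, aend
```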

  32. References • M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics, pp. 449-463, 2002. • M. A. Bender and M. Farach-Colton, The LCA Problem Revisited, LATIN, pp. 88-94, 2000. • P. Ko and S. Aluru, Linear time suffix sorting, CPM, pp. 200-210, 2003. • U. Manber and G. Myers, Suffix arrays: a new method for on-line search, SIAM J. Comput., 22:935-948, 1993. • E. M. McCreight, A space-economical suffix tree construction algorithm, J. ACM, 23(2):262-272, 1976. • E. Ukkonen, Constructing suffix trees on-line in linear time, Intern. Federation of Information Processing, pp. 484-492, 1992. Also in Algorithmica, 14(3):249-260, 1995. • P. Weiner, Linear pattern matching algorithms, Proc. of the 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1-11, 1973.
