Suffix Trees Come of Age in Bioinformatics

Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis

1 2 3 4 5 6 AABAAB S = Suffix Tree B A A A B $ B A 3 $ $ B A 5 6 A $ 4 B A $ A 2 B Every suffix of S is encoded by a root to leaf walk. Every substring encoded by a walk from the root. $ 1

1 2 3 4 5 6 AABAAB SUFFIX ARRAY B A A A B $ B A 3 $ $ B A 5 6 A $ 4 B A 4 1 5 2 3 6 $ A 2 The suffixes listed in lexicographic order. Found by lexicographic DFS B $ 1

Basic Facts about Suffix Trees • Suffix tree for a string of length n can be built in O(n) time. • Basic implementations need about 25 bytes per character. • Numerous applications in string algorithms achieving significant (sometimes amazing) speedups over naïve algorithms – O(m) time substring search for a string of length m in a string of length n, regardless of n.

Suffix Tree/Array Construction • Weiner 1973 • McCreight ~ 1974 • Ukkonen ~ 1996 • Farach-Colton ~ 1999 • Manber-Myers ~ 1991 (Suffix-Arrays) O(n log n time) with small space. O(n) time possible as on prior slide.

Bioinformatics Applications pre 1997 • A small number of applications • Limited by space requirements of the tree, small computer memories, complication of the algorithms, poor locality of reference

Post 1997 • Much has changed, and Suffix trees and their relatives (particularly suffix arrays) are more widely applied in bioinformatics • 20- 40 new publications with substantial connection to bioinformatics, post 1997

New Results since 1997 • Fundamental Algorithms: Farach-Colton construction; simpler algorithm for least-common-ancestor. • Implementation improvements (space) for suffix trees and arrays • New variants of suffix trees – affix trees, virtual suffix trees • New algorithms for tandem repeats

New Results since 1997 • Approximate repeats and large-scale repeat structure • Fast lookups in databases • Substring frequencies • Motif and pattern discovery • Hybrid dynamic programming • Oligo and probe construction

New Results since 1997 • Genome fragment assembly and resequencing • (multiple) whole genome comparison • Large-scale sequence comparison in resequencing projects • Misc. clever applications • Related less-used data structures

Suffix trees collect together repeated substring prefixes Example: Speeding up the use of Position Specific Scoring Matrices. Accelerating protein classification ISMB 2000 position 1 2 3 4 5 A 3 1 4 2 1 T 2 4 4 3 2 C 1 3 5 6 2 G 4 2 4 6 2 …AACTG…AACTG….AACTG… PSSM AACTG Walk around the tree to depth 5 Starting locations of AACTG

Whole Genome Alignment with MUM’s: • Delcher …Salzberg NAR ~ 1999 • A MUM is a Maximal Unique Matching substring in two strings S and S’ • MUMMER finds all MUM’s; selects and aligns non-overlapping pairs of MUM’s; and then recurses in the regions between adjacent selected MUMs. • Key issue: finding MUMs efficiently

Spotting MUMs with a suffix tree 7 Example: S = …AATCCGTG…. S’ = ………GATCCGTA… 20 So ATCCGT is a MUM if it only occurs in these two places root Path spelling ATCCGT Exactly two sibling leaves S7 S’20

Finding MUM’s in a suffix tree • Build a suffix tree for the concatenation of S and S’, noting at each leaf whether the suffix is from the S part or the S’ part. Also note the prior character in S or S’ for that position. • Do a DFS traversal of the S.T. to note at each node how many leaves below it are from S and how many from S’. • A node corresponds to a MUM if and only if the count there is 1 and 1, and the two prior characters are different. • For total string length n, the time used is O(n).

Reducing the space needed for MUM findingMUMMER II - NAR spring 2002 root Path to a node v, spelling xB, where B is a string Path to a node v’, spelling B v’ v Suffix link

Space reduction when finding MUMs • Given strings S and S’, build a suffix tree T, with suffix links , but for only the smaller of S and S’, say S. • Using T, find for each position i in S’, the longest substring starting at i which occurs in S exactly once. Mark the end location in T of that match. Each such match is a candidate MUM.

Finding MUMs in less space • To find the matches, walk S’ through T, using suffix links to reduce search time. • After S’ has been completely processed, every marked position in T that is visited only once and has no marked position below it, corresponds to a MUM. • So all MUMs can be found in O(n) time, but in much reduced space.

Finding Tandem Repeats Stoye and Gusfield, Theoretical Computer Science, 2001 …TATAACTAACTAAGATT.. For a string of length n, a naïve algorithm might use (on the order of) n^3 operations to find all T.R.s

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat. To extend a chain, examine the first character of the T.R. and the character after the end of the T.R. If they are the same, a new T.R. exists on place to the right.

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Branching Tandem Repeats Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Branching Tandem Repeat Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat. So to find all T.R.s – find all Branching T.R’s Use a suffix tree to efficiently find Branching T.R.’s

Finding Branching T.R.’s • Every Branching T.R. ends at a node of the suffix tree – so examine each node. root String to here is length 8 leaves 5 , 13 and 21 are below v v 5 13 = 5 + 8 21

Basic Algorithm • Build the S.T. for string S; process it to find depth of each node, and do a DFS numbering of the nodes so that ancestry of two nodes can be determined in constant time. (Classic property of DFS) • Check each internal node to see if it is at the first or second part of a tandem repeat.

How to check • From a node v of depth d(v), walk the subtree to collect the leaf numbers below v. • If leaf k is collected, check if v is an ancestor of leaf k+d(v), and the next characters k+1 and k+d(v) + 1 differ. If so, then the string to v is the first part of a Branching T.R. • Alternatively, focus on k-d(v) to see if the string to v is the second part of a Branching T.R.

1 2 3 4 5 6 AABAAB S = Suffix Tree B A A A B $ B A 3 $ $ B A 5 6 A $ 4 B A $ A 2 B $ 1

Finding Tandem Repeats • This approach gives O(n^2) time to find all branching tandem repeats.The time can be reduced to O(n log n) with a classic trick. • Related more complex result: In O(n) time, one can mark in the suffix tree, the endpoints of all tandem repeat “species”. Gusfield and Stoye 2000 UCD techreport Uses the fact (Frankel et al 1999) that there are only 2n tandem repeat species – complex proof. Example: aaaaaaaaaa has only five tandem repeat species, but 25 tandem repeat occurrences

Space issues again • Use “virtual suffix tree” or “extended suffix array” to reduce space • Def: for positions i and j in string S, LCP(i,j) is the length of the longest matching substring starting at positions i and j in S.

1 2 3 4 5 6 AABAAB SUFFIX ARRAY With LCP or string depth information B A A A B $ B A 3 $ $ B A 5 6 A $ 4 B A $ 4 1 5 2 3 6 3 1 2 0 1 A 2 Node depths at time of descent after a backup B $ LCP of neighbors 1

Extended Suffix Array • String S, the suffix array A for S, and the LCP array for A, completely determine the suffix tree T for S. This is the “extended suffix array” for S. • The suffix array and the LCP array can be found in O(n log n) time and small space, without an explicit suffix tree. • For many efficient algorithms using an explicit suffix tree, it is possible to simulate the algorithm using only the extended suffix array, greatly reducing space usage without increasing time. • S. Kurtz WABI 2002

String Barcoding – Oligo Construction Uncovering Optimal Virus Signatures Sam Rash and Dan Gusfield RECOMB April 2002

Motivation • Need for rapid virus detection • Given • unknown virus • database known viruses • Problem • identify unknown virus quickly • Ideal solution • have sequence of • viruses in database • unknown virus • Solution • use BLAST (or any sequence similarity program/algorithm)

Motivation • Real World • only have sequence for pathogens in database • not possible to quickly sequence an unknown virus • can test for presence small (<= 50 bp) strings in unknown virus • substring tests • Another Idea • String Barcoding • use substring tests to uniquely identify each virus in the database • acquire unique barcode for each virus in database

Problem Definition • Formal Definition • given • set of strings S • goal • find set of strings S’, the testing set • wlog, for each s1,s2in S, there exists at least one u in S’ where u is a substring of only s1 • u is a signature substring • minimize |S’| • result • barcode for each element on S

v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3} v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3} v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3} v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3} v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1} Idea • strings: 1. cagtgc 2. cagttc 3. catgga • Each node in the suffix tree has a corresponding set of string IDs below it Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga

Practical Implementation: • If two substrings occur in exactly the same set of original strings, only one need be considered • Use strings from suffix tree for each uniquely labeled node • Build ILP as discussed • Solve ILP using CPLEX • Acquire barcode and signatures for each original string • signature is the set of substring tests occurring in a string

Implementation: Example minimize V18 + V22 + V11 + V17 + V8 #objective function st V18 + V22 + V11 + V17 + V8 >= 2 #this is the theoretical minimum V18 + V17 + V8 >= 1 #constraint to cover pair 1,2 V22 + V11 + V8 >= 1 #constraint to cover pair 1,3 V18 + V22 + V11 + V17 >= 1 #constraint to cover pair 2,3 binaries #all variables are 0/1 V18 V22 V11 V17 V8 end Figure 1.4 - barcodes Figure 1.5 - signatures

Implementation: Extensions • minimum and maximum lengths on signature substrings • acquire barcodes/signatures for only a subset of input strings (wrt to whole set) • minimum string edit distance between chosen signature substrings • redundancy • require r signature substrings to differentiate each pair • adds a higher level of confidence that signatures remain valid even with mutations

Results • Works quickly on most moderately sized datasets (especially when redundancy >= 2) • dataset properties • ~50k virus genomes taken from NCBI (Genbank) • 50-150 virus genomes • average length of each sequence ~1000 characters • total input size ranged from approximately 50,000 – 150,000 characters • increasing dataset size scaled approximately linearly • reach 25% gap (at most 1/3 more than optimum) in just a few minutes • reach small gap (often < 1%) in 4 hours

Summary • Numerous applications of Suffix Trees and relatives in Bioinformatics • More yet to be found • Suffix tree and array software at my UCD website.

Suffix Trees Come of Age in Bioinformatics

Suffix Trees Come of Age in Bioinformatics

Presentation Transcript

Suffix trees and suffix arrays

Selected Applications of Suffix Trees

Suffix Trees

Applications of Suffix Trees

Applications of Suffix Trees

Suffix Trees

Suffix Trees and Suffix Arrays

Suffix Trees, Suffix Arrays and Suffix Trays

Words with the Suffix of -AGE

Suffix trees

Suffix Trees and Suffix Arrays

Suffix Trees

Suffix Trees

Suffix Trees

Compressed Suffix Arrays and Suffix Trees

SUFFIX TREES

Suffix Trees

Suffix Trees and Suffix Arrays

Probabilistic Suffix Trees

Suffix Trees

Suffix Trees

Suffix Trees Come of Age in Bioinformatics