Genome-scale Disk-based Suffix Tree Indexing

Genome-scale Disk-based Suffix Tree Indexing Phoophakdee and Zaki

Outline • Suffix Tree introduction • Application in Bioinformatics • Trellis • Trellis performance • Conclusion

Example Suffix Tree • Sequence • ACGACG$ • What are Suffix Links

Suffix tree runtime • Time complexity • Construction of suffix tree: • O(n) time and space where n is the size of the text being searched • Substring Search: • O(m) time where m is size of substring/search pattern • Knuth-Morris-Pratt and Boyer-Moore algorithm comparison

Application in Bioinformatics • Database search • Exact matching • Approximate matching* • Longest common substring • Genome alignment* • Structural motifs* • Tandem repeats* • Sequence comparison

Problems with Genome-scale suffix trees • Efficient O(n) suffix tree generating algorithms • Tree must fit entirely in main memory • e.g. Ukkonen’s algorithm • Genomes are very large • Human genome is 3 Gbp (0.75 GB) • Data structure no longer able to fit in memory

What Trellis solves • Prevents data skew in prefix partitioning • Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory. • From non-uniform distribution of alphabit/DNA • Efficient disk-base implementation • Function under low memory constraints • Efficient disk IO usage • Able to recover suffix links

Trellis Steps • Prefix Creation Phase • Partitioning Phase • Merging Phase • Suffix Link Recovery Phase (Optional)

Trellis Overview

Merging Phase

Threshold (t) • Determines partition of sequence • Suffix subtree fits into memory during partitioning phase. • Determines cutoff for prefix set inclusion • Recombined prefixed suffix subtree will fit entirely into memory during merging phase. • Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase

Trellis Overview

Performance • O(n2) time and O(n) space (where n is sequence length) • Comparison to TDD • Currently only other algorithm that scales up to genome level • Same time complexity • Does not calculate suffix links

Suffix Tree Construction

Query Times

Conclusion • Efficient disk-based suffix tree generation that works well with limited memory • Suffix links are recoverable • Future work • Extend to larger alphabets • Buffer input sequence • Parallelize partitioning and merging

Genome-scale Disk-based Suffix Tree Indexing

Genome-scale Disk-based Suffix Tree Indexing

Presentation Transcript

Large-scale genome projects

Pattern Matching: Suffix Tree Applications

Genome-Scale Mutagenesis

Genome-scale disk-based suffix tree indexing

Knowledge-based Analysis of Genome-scale Data

Genome-scale phylogenomics

Tree Indexing (1)

Tree-based Indexing

iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Suffix Tree Based Prediction for Pervasive Computing Environments

Tree-based Indexing

Suffix Tree

Suffix tree and suffix array techniques for pattern analysis in strings

Genome-scale constraint-based metabolic model of Clostridium thermocellum

Faster Suffix Tree Construction With Missing Suffix Links

Genome Scale Family Based Association Testing using Condor

Indexing Genome Sequences

Trie/Suffix Trie/Suffix Tree

B-Tree Indexing

Disk Based Storage

Suffix Tree and Suffix Array

Genome Scale Family Based Association Testing using Condor