This presentation discusses the importance of suffix trees and arrays in bioinformatics, the main memory bottleneck, and improvements in compressed suffix tree implementation. It also highlights the use of LCP arrays for pattern searching and the author's current work on sequence processing and pattern finding.
Pattern Processing and Searching In RAM
Michael Robinson, Ph.D. candidate
Advisor: Dr. Giri Narasimhan
School of Computing and Information Sciences
BioRG, Bioinformatics Research Group
Florida International University, 11200 SW 8th Street, Miami, FL 33199
{mrobi002, giri}@cs.fiu.edu
Presented by Michael Robinson, January 15, 2008, at Florida International University
Global CyberBridges, National Science Foundation Program, Award Id: OCI-0636031, October 1, 2006-December 31, 2009
Agenda
- What are Suffix Trees – Suffix Arrays
- Suffix Trees – Importance in Bioinformatics
- Main Memory Bottleneck
- Sadakane’s Compressed Suffix Tree Implementation
- Compressed Suffix Tree Problem
- Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
- Example Required File: LCP Array Solution
- Implementation Design – Software
- My Current Work
- Experimental Results
- My Future Work
- References
Suffix Tree
Suffixes of BANANAS: BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S
[Figure: suffix tree of BANANAS, with leaves numbered 1 to 7]
Suffix Trees inventor: Peter Weiner, 1973.
Suffix Array Implementation
Simplified version of Suffix Array. Lexicographically ordered text.
Sequence = ABRACADABRA

Suffix        Index   Sorted Index   Sorted Suffix
ABRACADABRA   0       10             A
BRACADABRA    1       7              ABRA
RACADABRA     2       0              ABRACADABRA
ACADABRA      3       3              ACADABRA
CADABRA       4       5              ADABRA
ADABRA        5       8              BRA
DABRA         6       1              BRACADABRA
ABRA          7       4              CADABRA
BRA           8       6              DABRA
RA            9       9              RA
A             10      2              RACADABRA

[1] Suffix Arrays inventors: Udi Manber, Gene Myers, 1989
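The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the construction used in the talk: the function name `build_suffix_array` is an assumption, and the naive sort it uses is only practical for short sequences.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: sort the positions by the suffix that starts there.
    # Each comparison may scan a whole suffix, so this is for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


sequence = "ABRACADABRA"
suffix_array = build_suffix_array(sequence)

# Print one row per sorted suffix: rank, start position, suffix text.
for rank, start in enumerate(suffix_array):
    print(rank, start, sequence[start:])
```

Only the integer positions need to be kept; each sorted suffix can be read back from the sequence when required.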
Suffix Trees: Importance in Bioinformatics
Biological Data Type (A C G T) vs. Search Engines Data (inverted)
Applying Suffix Trees to Real Genomic Sequences is Impractical
Main Memory Bottleneck
Storing every suffix explicitly:

Suffix        Storage (characters)
ABRACADABRA   11
BRACADABRA    10
RACADABRA     9
ACADABRA      8
CADABRA       7
ADABRA        6
DABRA         5
ABRA          4
BRA           3
RA            2
A             1
Total         66

Total = n(n+1)/2 = 11(12)/2 = 66
PA01, about 6 million bases: ~18 terabytes
Human Genome: 3,164,700,000 nucleotides
(3,164,700,000 * 3,164,700,001)/2 = 5,007,663,046,582,350,000 characters = 5,007,663 terabytes
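The arithmetic on this slide can be checked directly. The sketch below assumes one byte per character and 10^12 bytes per terabyte; the function name `explicit_suffix_storage` is hypothetical.

```python
def explicit_suffix_storage(n):
    """Characters needed to store every suffix of a length-n sequence explicitly."""
    # Suffix lengths are n, n-1, ..., 1, which sum to n(n+1)/2.
    return n * (n + 1) // 2


print(explicit_suffix_storage(11))  # ABRACADABRA

human_genome = 3_164_700_000
total = explicit_suffix_storage(human_genome)
print(total)                 # characters
print(total / 10**12)        # terabytes, assuming one byte per character
```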
Sadakane’s Compressed Suffix Implementation
A = 00, C = 01, G = 10, T = 11
Storage = n log2 |Σ| bits = 2n bits for a 4-letter alphabet, ~20% of original space

Suffix        Index   Compressed Storage (bits)
ABRACADABRA   0       22  (= 2n = n log2 |Σ|)
BRACADABRA    1       20
RACADABRA     2       18
ACADABRA      3       16
CADABRA       4       14
ADABRA        5       12
DABRA         6       10
ABRA          7       8
BRA           8       6
RA            9       4
A             10      2

Uncompressed total: 66 characters = 528 bits
Compressed total: (n(n+1)/2)*2 bits = 132 bits
Unfortunately it is not linear: 100 million bases ~ 5 gigabytes
[2] Kunihiko Sadakane
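The 2-bit code on this slide (A = 00, C = 01, G = 10, T = 11) can be illustrated with a small packing routine. This is only a sketch of the encoding step, not Sadakane's data structure; the names `pack` and `unpack` are assumptions.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # BASES[code] is the base for a 2-bit code


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def unpack(value, n):
    """Recover the n bases stored in value by pack()."""
    bases = []
    for shift in range(2 * (n - 1), -1, -2):
        bases.append(BASES[(value >> shift) & 0b11])
    return "".join(bases)


packed = pack("GTCAAGTC")
print(bin(packed), unpack(packed, 8))
```

The length n has to be stored alongside the packed value, because leading A bases are encoded as zero bits.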
Compressed Suffix Tree Problem
Unfortunately the Suffix Tree is not linear: 100 million bases ~ 5 gigabytes
The Sequence is linear (Σ = {A, C, G, T}, so log2 |Σ| = 2):
ACGT = 4 bases = n log2 |Σ| = 2n bits = 8 bits
GTCAAGTC = 8 bases = n log2 |Σ| = 2n bits = 16 bits
But the Suffix Array is not:
ACGT = 4 bases = (n(n+1)/2)*2 = (4(5)/2)*2 = 20 bits
GTCAAGTC = 8 bases = (n(n+1)/2)*2 = (8(9)/2)*2 = 72 bits
In the first data structure, the second sequence takes twice the space of the first one; in the second data structure, it takes more than three times as much (72/20 = 3.6).
The compressed suffix tree is 30% slower than non-compressed trees.
Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
Algorithm results (space):

Sequence                 Authors: |Σ| log2 n    Sadakane: n log2 |Σ| = 2n bits
GTCA                     4*2 = 8 bits           2*4 = 8 bits
GTCAAGTC                 4*3 = 12 bits          2*8 = 16 bits
TACAAGTAGTCAAGTC         4*4 = 16 bits          2*16 = 32 bits
A 2048 base sequence     4*11 = 44 bits         2*2048 = 4096 bits

Space needed during construction: 1.4 times final space
The authors created an Abstract Suffix Array using a Succinct Suffix Array, based on a Wavelet Tree, built on the Burrows-Wheeler transform
[3] Niko Välimäki et al.
Example Required File: LCP Array Solution
A useful additional data structure: an array of the lengths of the Longest Common Prefixes between each suffix and its predecessor in the Suffix Array.
Lexicographically ordered text. Sequence = ABRACADABRA

Suffix        Index   Sorted Index   LCP   Sorted Suffix
ABRACADABRA   0       10             0     A
BRACADABRA    1       7              1     ABRA
RACADABRA     2       0              4     ABRACADABRA
ACADABRA      3       3              1     ACADABRA
CADABRA       4       5              1     ADABRA
ADABRA        5       8              0     BRA
DABRA         6       1              3     BRACADABRA
ABRA          7       4              0     CADABRA
BRA           8       6              0     DABRA
RA            9       9              0     RA
A             10      2              2     RACADABRA
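The LCP column can be computed by comparing each sorted suffix with the one before it. The sketch below is a direct, unoptimized illustration of that definition; the function names are assumptions and the suffix array is built with a naive sort.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_lcp_array(text, suffix_array):
    """lcp[r] = longest common prefix of the suffixes ranked r-1 and r; lcp[0] = 0."""
    lcp = [0] * len(suffix_array)
    for rank in range(1, len(suffix_array)):
        previous = text[suffix_array[rank - 1]:]
        current = text[suffix_array[rank]:]
        lcp[rank] = common_prefix_length(previous, current)
    return lcp


sequence = "ABRACADABRA"
sa = build_suffix_array(sequence)
print(build_lcp_array(sequence, sa))
```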
Implementation Design – Software
- C++, object oriented.
- Each Data Structure is its own class.
- Generic Code (e.g., from Sadakane) is used to retrieve short sequences.
- New code is used for construction and for retrieving long sequences.
- Tailored code is as time/space efficient as generic code.
My Current Work - Approach
- Dissertation, Not Published Yet.
- Suffix Arrays Approach.
- Google’s Construction Approach.
- Construction Time and Space Problems:
  Sequences From 11 Bases To PA01 with 6.2 Million Bases.
  PA01: Run on 7 different computers. Fastest Time 5 days.
- All Files Contain Uncompressed Information.
- Space Required: Sequence File = n; One Index File = from n to ~8 Times the PA01 Sequence Size.
- Loading Time and RAM Space Problems.
- Solution: Break the Index File into 64 or 1,024 Sub Indexes, Improving Loading and Processing Times and Allowing Processing of Larger Sequences.
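The slides do not say how the index file is split. One possible reading, offered here purely as an assumption, is that suffixes are bucketed by their first 3 or 5 bases, since 4^3 = 64 and 4^5 = 1,024. The sketch below illustrates that idea; `build_sub_indexes` and `lookup` are hypothetical names, and the buckets are kept in memory rather than in separate files.

```python
from collections import defaultdict


def build_sub_indexes(text, prefix_len):
    """Group suffix start positions by their first prefix_len characters.

    ASSUMPTION: bucketing by prefix is one way to get 4**prefix_len sub-indexes;
    it is not confirmed as the method used in the talk.
    """
    buckets = defaultdict(list)
    for start in range(len(text)):
        buckets[text[start:start + prefix_len]].append(start)
    # Keep each bucket in suffix order, like a small suffix array.
    for positions in buckets.values():
        positions.sort(key=lambda i: text[i:])
    return dict(buckets)


def lookup(text, sub_indexes, prefix_len, probe):
    """Find all positions of probe by consulting only the matching sub-index."""
    if len(probe) < prefix_len:
        raise ValueError("probe must be at least prefix_len characters long")
    bucket = sub_indexes.get(probe[:prefix_len], [])
    return sorted(i for i in bucket if text.startswith(probe, i))


text = "acgttgacgttg"
sub_indexes = build_sub_indexes(text, 3)
print(lookup(text, sub_indexes, 3, "acgttg"))
```

Under this reading, a search only needs to load the one sub-index whose key matches the start of the probe, which is where the loading-time and RAM savings would come from.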
My Current Work - Applications
- Finding Patterns: How Many Times, and Where, a Probe Appears in a Given Sequence
  acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg
- Finding Inverted Patterns: Same as Finding Patterns, plus the inverted (reversed) probe
  acgttg ….. gttgca ….. acgttg ….. gttgca ….. acgttg ….. gttgca
- Finding Inverted Reciprocal Patterns (reversed and complemented probe):
  acgttg ….. caacgt ….. acgttg ….. caacgt ….. acgttg ….. caacgt
- Above Programs Generate a Text File Report for Further Processing
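The three searches can be sketched with a suffix array and binary search. This is an illustrative outline under stated assumptions, not the dissertation code: the function names are hypothetical, the suffix array is built naively, and "inverted reciprocal" is taken to mean the reverse complement, which matches the acgttg / caacgt example.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def build_suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, probe):
    """Return every start position of probe in text, in increasing order."""
    m = len(probe)
    # First rank whose suffix, cut to m characters, is >= probe.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < probe:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First rank whose suffix, cut to m characters, is > probe.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= probe:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def inverted(probe):
    """The probe read backwards."""
    return probe[::-1]


def inverted_reciprocal(probe):
    """The probe complemented and read backwards (reverse complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(probe))


text = "acgttgttgttgcaaacaacgt"
sa = build_suffix_array(text)
probe = "acgttg"
for label, pattern in [("pattern", probe),
                       ("inverted", inverted(probe)),
                       ("inverted reciprocal", inverted_reciprocal(probe))]:
    positions = find_occurrences(text, sa, pattern)
    print(label, pattern, len(positions), positions)
```

The number of hits answers "how many times" and the position list answers "where"; writing these lines to a file would give a report of the kind the slide mentions.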
Sadakane’s Experimental Results
Using: one 2.4 GHz Pentium 4 computer with 1 GB RAM, Red Hat OS; programs compiled using g++ (GCC)
Engineering a Compressed Suffix Tree Implementation Experimental Results
My Future Work
Do Construction for: Human Genome and all Pseudomonas aeruginosa bacteria
Consensus Pattern Search:
• To solve the Bioinformatics Consensus Problem to n-1 of a given probe. At present there are applications that solve this problem up to value 3.
• For a probe with 50 bases, with alphabet A C G T, if we check for 3 mutations we need to do 4 * 4^(mutations-1) = 4 * 4^2 = 64 pattern searches, for each group of 3 bases.
• For a sequence of 3.6 billion bases, a probe of 1,000 bases, and a mutation rank of n-1, we need to do 4 * 4^999 pattern searches on that sequence.
• Excel calculates up to 4 * 4^510 = 4^511 ≈ 4.4942E+307. The only way to do this work is with a Distributed System.
• Solving this problem for proteins will require more time because proteins have an alphabet of size 20.
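The growth of the search count can be checked with arbitrary-precision integers. The sketch below takes the slide's count as alphabet size times alphabet size raised to (mutations - 1); the function name `pattern_searches` is an assumption.

```python
def pattern_searches(mutations, alphabet_size=4):
    """Number of pattern searches: alphabet_size * alphabet_size**(mutations - 1)."""
    return alphabet_size * alphabet_size ** (mutations - 1)


print(pattern_searches(3))                     # DNA, 3 mutations
print(len(str(pattern_searches(1000))))        # digits in the count for 1,000 mutations
print(pattern_searches(3, alphabet_size=20))   # proteins, 3 mutations

# 4**511 still fits in a double-precision float, as used by spreadsheets.
print(f"{float(4**511):.4E}")

# 4**512 does not: converting it to a float overflows.
try:
    float(4**512)
except OverflowError:
    print("4**512 is too large for a double-precision float")
```

Python integers have no fixed upper limit, so the count itself can be computed exactly; the difficulty the slide points to is the number of searches, not representing the number.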
Conclusions
• Due to advances in computer hardware and falling prices, today a 1.5 terabyte hard disk costs around 400 US dollars.
• Recent implementations of Suffix Trees and Suffix Arrays concentrate on compressing the data, causing large delays in user processing.
• We believe the previous hard disk space bottleneck has been resolved; therefore compressing data on hard disk is no longer necessary, especially when user applications slow down by factors of 30 for Suffix Trees, and by an additional log n for Suffix Arrays, compared to uncompressed data.
• With advances in Operating Systems that can access 128 gigabytes of RAM in workstations, and with advances in Distributed (Grid) Computing, we believe that by using uncompressed data with new methods like our implementation, we can produce applications that were not possible before.
References
[1] Udi Manber and Gene Myers. "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing, Volume 22, Issue 5 (October 1993), pp. 935-948.
[2] Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan. sada@csce.kyushu-u.ac.jp
[3] Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, and Veli Mäkinen. "Engineering a Compressed Suffix Tree Implementation". Department of Computer Science, University of Helsinki, Finland, {nvalimak,vmakinen}@cs.helsinki.fi; Technische Fakultät, Universität Bielefeld, Germany, wgerlach@cebitec.uni-bielefeld.de; Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India, kdixit@iitk.ac.in
Questions Thank you!! Presented by: Michael Robinson Florida International University mrobi002@cs.fiu.edu January 15, 2007