150 likes | 290 Views
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler. Ben Langmead Based on work by Juha K ä rkk ä inen. Motivation. Burrows-Wheeler Transformation (BWT) of a large text allows: Fast exact matching Compact representation (compared to suffix tree/array)
E N D
Blockwise Suffix Sorting forSpace-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen
Motivation • Burrows-Wheeler Transformation (BWT) of a large text allows: • Fast exact matching • Compact representation (compared to suffix tree/array) • More readily compressible (basis of bzip) • The FM Index exploits an indexed and compressed BWT to allow: • Exact matching in time linear in the size of the pattern • Memory footprint as much as 50% smaller than original string • FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk
Background • Recall that BWT is derived from the Burrows-Wheeler matrix, which is related to the Suffix array a c a a c g $ g c $ a a a c BWT Text Burrows Wheeler Matrix Suffix array Last column
Problem • Memory footprint of building and storing suffix array is much larger than the BWT itself • Human genome: SA: ~12 GB, BWT: ~0.8 GB • Attempt to build BWT over whole human genome on a 32 GB server exhausts memory and crashes (I tried)
Solution • Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting” • Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007 • Observation: • BWT[i] depends only on SA[i], not on any other element of SA • Corollary: • No need to keep all of SA in memory at once! • Solution: • Build SA and BWT a small “chunk” or “block” at a time • Greatly reduces the memory overhead • By something like a factor of B, where B = # of blocks
Solution • Typical suffix sort:
Solution • Blockwise suffix sort:
Solution • Calculate and sort a random sample of the suffixes
Solution • Samples are used as “bookends” for “buckets” ? $ B1 B2 B3 B4
Solution • In B linear-time passes over the text (B = # buckets), sort all suffixes into buckets, one bucket at a time, then sort the bucket $ Pass 1 B1 B2 B3 B4
Solution • After a bucket has been sorted and turned into a BWT segment, it is discarded $ Pass B B1 B2 B3 B4
Solution • Good time bounds in the presence of long repeats require use of a difference cover sample • Acts like an oracle that determines relative lexicographical order of two suffixes that share a prefix of some length v
Project Goals • Basic goal: • Write a correct, usable library implementing blockwise SA sort and BWT building • Characterize performance and time/space tradeoffs • Stretch goals: • Fine-tune for performance and memory usage • Implement difference cover sample • Question: is this necessary for good performance on real-life inputs?
Concluding Remarks • BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way