Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Ben Langmead
Based on work by Juha Kärkkäinen

Motivation
• The Burrows-Wheeler Transformation (BWT) of a large text offers:
  • Fast exact matching
  • Compact representation (compared to suffix tree/array)
  • Better compressibility (basis of bzip)
• The FM Index exploits an indexed and compressed BWT to allow:
  • Exact matching in time linear in the size of the pattern
  • Memory footprint as much as 50% smaller than the original string
• FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk

Background
• Recall that the BWT is derived from the Burrows-Wheeler matrix, which is related to the suffix array
• Figure: for the text acaacg$, the rows of the Burrows-Wheeler matrix follow suffix array order, and the last column is the BWT, gc$aaac
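The relationship on this slide can be checked with a minimal sketch. It uses a naive suffix sort (fine for a 7-character example, not for a genome), and the function names `suffix_array` and `bwt_from_sa` are illustrative, not taken from the slides.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix itself."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character just before suffix SA[i] (wrapping at position 0)."""
    return "".join(text[i - 1] for i in sa)


text = "acaacg$"           # the example text from the slide; '$' sorts first
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(sa)                  # [6, 2, 0, 3, 1, 4, 5]
print(bwt)                 # gc$aaac
```

Each BWT character is read off from a single suffix array entry, which is the property the blockwise approach relies on.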

Problem
• The memory footprint of building and storing the suffix array (SA) is much larger than the BWT itself
  • Human genome: SA: ~12 GB, BWT: ~0.8 GB
• An attempt to build the BWT over the whole human genome on a 32 GB server exhausts memory and crashes (I tried)
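One plausible accounting for these figures is sketched below. The slide does not say how they were derived, so the genome length (~3.1 billion bases), 4-byte suffix array entries, and 2 bits per base for the BWT are all assumptions.

```python
GENOME_LENGTH = 3_100_000_000  # assumption: approximate human genome length in bases
SA_BYTES_PER_ENTRY = 4         # assumption: one 32-bit offset per suffix
BWT_BITS_PER_BASE = 2          # assumption: A, C, G, T packed into 2 bits each


def gigabytes(num_bytes):
    """Convert a byte count to decimal gigabytes."""
    return num_bytes / 1e9


sa_gb = gigabytes(GENOME_LENGTH * SA_BYTES_PER_ENTRY)
bwt_gb = gigabytes(GENOME_LENGTH * BWT_BITS_PER_BASE / 8)
print(f"SA:  ~{sa_gb:.1f} GB")
print(f"BWT: ~{bwt_gb:.1f} GB")
print(f"ratio: ~{sa_gb / bwt_gb:.0f}x")
```

Under these assumptions the suffix array is about 16 times larger than the packed BWT, before counting any working memory the sort itself needs.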

Solution
• Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting”
  • Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007
• Observation:
  • BWT[i] depends only on SA[i], not on any other element of SA
• Corollary:
  • No need to keep all of SA in memory at once!
• Solution:
  • Build SA and BWT a small “chunk” or “block” at a time
  • Greatly reduces the memory overhead
    • By something like a factor of B, where B = # of blocks

Solution
• Typical suffix sort: [figure not recoverable]

Solution
• Blockwise suffix sort: [figure not recoverable]

Solution
• Calculate and sort a random sample of the suffixes

Solution
• Samples are used as “bookends” for “buckets”
• Figure: the sorted samples divide the suffixes into buckets B1, B2, B3, B4
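The sampling and bookend idea can be sketched as follows. The helper names `sample_splitters` and `bucket_of` are hypothetical, and materializing the splitter suffixes as strings is only for readability; a real implementation would compare suffixes in place.

```python
import bisect
import random


def sample_splitters(text, num_buckets, seed=0):
    """Pick num_buckets - 1 random suffix positions and sort them by suffix."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(text)), num_buckets - 1)
    return sorted(positions, key=lambda i: text[i:])


def bucket_of(text, splitters, i):
    """Index of the bucket (0 .. len(splitters)) that suffix i falls into."""
    splitter_suffixes = [text[s:] for s in splitters]
    # Binary search among the sorted "bookends".
    return bisect.bisect_right(splitter_suffixes, text[i:])


text = "acaacg$"
splitters = sample_splitters(text, num_buckets=3)
buckets = [bucket_of(text, splitters, i) for i in range(len(text))]
print(splitters, buckets)
```

Because the splitters are sorted, every suffix in bucket k is lexicographically smaller than every suffix in bucket k + 1, so the buckets can be sorted independently and concatenated.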

Solution
• In B linear-time passes over the text (B = # of buckets), collect the suffixes that fall into one bucket per pass, then sort that bucket
• Figure: Pass 1 over buckets B1, B2, B3, B4

Solution
• After a bucket has been sorted and turned into a BWT segment, it is discarded
• Figure: Pass B over buckets B1, B2, B3, B4
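Putting the passes together, a minimal sketch of the blockwise scheme might look like this. It is an illustration under simplifying assumptions, not Kärkkäinen's algorithm as published: the function name `blockwise_bwt` is hypothetical, suffixes are compared as Python string slices (so a pass is not truly linear time), and the sample is drawn uniformly at random.

```python
import random


def blockwise_bwt(text, num_buckets, seed=0):
    """Build the BWT one bucket at a time; text must end with a unique '$'."""
    rng = random.Random(seed)
    # Sorted random sample of suffixes: the "bookends" between buckets.
    splitters = sorted(
        text[i:] for i in rng.sample(range(len(text)), num_buckets - 1)
    )
    bounds = [None] + splitters + [None]  # None means "unbounded on this side"
    segments = []
    for b in range(num_buckets):
        lo, hi = bounds[b], bounds[b + 1]
        # One pass over the text: keep only suffixes with lo <= suffix < hi.
        bucket = [
            i for i in range(len(text))
            if (lo is None or text[i:] >= lo) and (hi is None or text[i:] < hi)
        ]
        bucket.sort(key=lambda i: text[i:])  # this is one block of the SA
        segments.append("".join(text[i - 1] for i in bucket))
        # The SA block is dropped here; only the BWT segment is kept.
    return "".join(segments)


print(blockwise_bwt("acaacg$", num_buckets=3))  # gc$aaac
```

Only one bucket's worth of suffix array entries is alive at any time, which is where the roughly factor-of-B saving comes from; the price is B passes over the text instead of one.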

Solution
• Good time bounds in the presence of long repeats require use of a difference cover sample
  • Acts like an oracle that determines the relative lexicographical order of two suffixes that share a prefix of some length v
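The combinatorial property behind a difference cover can be sketched briefly. A set D is a difference cover modulo v if every residue 0..v-1 can be written as (a - b) mod v with a, b in D; then for any two positions i and j there is a shift d < v that lands both on sampled residues, so after comparing at most v characters the order can be read from pre-sorted sample ranks. The cover {1, 2, 4} modulo 7 and the helper names below are illustrative choices, not taken from the slides.

```python
def is_difference_cover(cover, v):
    """True if every residue 0..v-1 equals (a - b) mod v for some a, b in cover."""
    diffs = {(a - b) % v for a in cover for b in cover}
    return diffs == set(range(v))


def common_shift(i, j, cover, v):
    """Smallest d in [0, v) with (i + d) % v and (j + d) % v both in cover."""
    members = set(cover)
    for d in range(v):
        if (i + d) % v in members and (j + d) % v in members:
            return d
    raise ValueError("cover is not a difference cover modulo v")


cover, v = {1, 2, 4}, 7
print(is_difference_cover(cover, v))   # True
print(common_shift(3, 10, cover, v))   # 1
```

The shift always exists for a valid cover: pick a, b in D with a - b congruent to i - j modulo v and take d = (a - i) mod v.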

Project Goals
• Basic goal:
  • Write a correct, usable library implementing blockwise SA sort and BWT building
  • Characterize performance and time/space tradeoffs
• Stretch goals:
  • Fine-tune for performance and memory usage
  • Implement difference cover sample
    • Question: is this necessary for good performance on real-life inputs?

Concluding Remarks
• BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way
