Compressed Suffix Arrays

Compression of Suffix Arrays to linear size
Fabian Pache
Joint Advanced Student School 2004

Motivation
• Problem: Suffix arrays are data structures that support fast pattern searching in long texts, but they require large amounts of memory.
• Goal: Find a compression that reduces the size of the suffix array while still allowing fast access.

Outline
• Trivial Compression
• Grossi & Vitter
  • Outline
  • Algorithms and Analysis
• Sadakane
  • Outline
  • Algorithms and Analysis (Sketch)

Conventions used
• Text T[1…n]
  • Binary text in {a,b}^n, terminated with #
• Pattern P[1…m]
  • Binary text in {a,b}^m
• Suffix Array SA[1…n]
  • Each entry points to a position i in T, the start of the suffix T[i…n]
  • Uses n log n bits

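Trivial Compression
• Construct and store SA, using the character order a < # < b

Trivial Compression
• Recover T from SA: locate the entry that points to the suffix "#"
  • Every entry before it points to a suffix starting with a (a < #)
  • Every entry after it points to a suffix starting with b (# < b)

Trivial Compression
• Therefore each suffix array can be compressed to about n bits, namely the text T itself
• Drawback: decompression means rebuilding the entire suffix array from T, which takes at least linear time even if only a single entry is needed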
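
The recovery step can be tried out on a small text. The following is a minimal sketch, assuming a naive sorting-based construction; the function names `suffix_array` and `recover_text` are illustrative and not taken from the slides.

```python
def suffix_array(text, order="a#b"):
    """Naive suffix array under the order a < # < b.

    Returns the 1-based start positions of the sorted suffixes.
    """
    key = {c: r for r, c in enumerate(order)}
    n = len(text)
    return sorted(range(1, n + 1), key=lambda p: [key[c] for c in text[p - 1:]])


def recover_text(sa):
    """Rebuild T from SA alone, using only the position of the suffix '#'."""
    n = len(sa)
    pivot = sa.index(n)  # the suffix "#" starts at text position n
    text = [None] * n
    for idx, pos in enumerate(sa):
        if idx < pivot:
            text[pos - 1] = "a"  # sorted before "#", so it starts with a
        elif idx == pivot:
            text[pos - 1] = "#"
        else:
            text[pos - 1] = "b"  # sorted after "#", so it starts with b
    return "".join(text)


sa = suffix_array("abbab#")
print(sa, recover_text(sa))
```

Because T can be rebuilt from SA, storing T is enough to represent SA, at the cost of a full reconstruction whenever SA is needed.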

Grossi & Vitter
• Outline:
  • Recursive "divide and conquer"-type algorithm
  • Stores SA implicitly (for all but the last level)
• Supported operations
  • lookup(i)
  • compress

G&V – compress
• Structural outline: SA_0 → SA_1 → SA_2 → … → SA_l

G&V – compress
• Given a text T, create SA

G&V – compress
Create array B[1…n] with n = |SA|
• B[i] = 1, if SA[i] is even
• B[i] = 0, if SA[i] is odd

G&V – compress
• Create an array rank[1…n] where rank[i] contains the number of 1s in B[1…i]

G&V – compress
Define a mapping Ψ[1…n] so that
• If B[i] = 0: Ψ[i] = j, where j is the index with SA[j] = SA[i] + 1
• If B[i] = 1: Ψ[i] = i

G&V – compress
Compressing from SA_k to SA_{k+1}
• Store only the even values of SA_k in SA_{k+1}, keeping their order
• Divide each entry in SA_{k+1} by 2
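
One compression step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the name `compress_level` is made up, all arrays are stored explicitly as Python lists with an unused slot at index 0 so that the 1-based indices of the slides carry over, and the number of entries is assumed to be even so that every odd value has a successor.

```python
def compress_level(sa):
    """One step SA_k -> SA_{k+1}.

    sa holds the 1-based values SA_k[1..n] (Python index i-1 holds SA_k[i]).
    Returns B, rank, psi (each with a dummy entry at index 0) and SA_{k+1}.
    """
    n = len(sa)
    assert n % 2 == 0, "sketch assumes an even number of entries"
    pos_of = {v: i for i, v in enumerate(sa, start=1)}  # inverse permutation
    B = [0] * (n + 1)
    rank = [0] * (n + 1)
    psi = [0] * (n + 1)
    for i in range(1, n + 1):
        v = sa[i - 1]
        B[i] = 1 if v % 2 == 0 else 0
        rank[i] = rank[i - 1] + B[i]              # number of 1s in B[1..i]
        psi[i] = i if B[i] == 1 else pos_of[v + 1]
    next_sa = [v // 2 for v in sa if v % 2 == 0]  # even values, halved
    return B, rank, psi, next_sa


B, rank, psi, sa1 = compress_level([4, 1, 6, 3, 5, 2])
print(B[1:], rank[1:], psi[1:], sa1)
```

The input `[4, 1, 6, 3, 5, 2]` is the suffix array of `abbab#` under the order a < # < b.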

G&V – lookup
Reconstruction of SA_k[i] using B_k, rank_k, Ψ_k and SA_{k+1}:
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)

G&V – lookup
• Proof / Example part 1: B_k[i] = 1, so Ψ_k(i) = i
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))]
SA_k[i] = 2 · SA_{k+1}[rank_k(i)]

G&V – lookup
• Proof / Example part 2: B_k[i] = 0, so Ψ_k(i) = j with SA_k[j] = SA_k[i] + 1, which is even
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] − 1

G&V – lookup: Stored information
• For each level k = 0…l−1:
  • B_k is stored explicitly
  • rank_k and Ψ_k are stored implicitly
  • SA_k is reconstructible by recursion
• For level l, SA_l is stored explicitly
  • No further information is needed

G&V – lookup
• Pseudocode for the lookup function

lookup(i) = rlookup(i, 0)

rlookup(i, k)
    if (k == l)
        return SA_l[i];
    else
        return 2 * rlookup(rank_k[psi_k[i]], k+1) + (B_k[i] - 1);
    end
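
The pseudocode can be turned into a small runnable sketch. The assumptions here are loud ones: B_k, rank_k and Ψ_k are kept as plain Python lists instead of the compact encodings the slides aim for, the text length is a power of two so that every level has an even number of entries, and the helper names `suffix_array`, `compress` and `lookup` are illustrative.

```python
def suffix_array(text, order="a#b"):
    """Naive suffix array (1-based start positions) under a < # < b."""
    key = {c: r for r, c in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [key[c] for c in text[p - 1:]])


def compress(sa, levels):
    """Return ([(B_k, rank_k, psi_k) for k < levels], SA_levels)."""
    stored = []
    for _ in range(levels):
        n = len(sa)
        assert n % 2 == 0, "sketch assumes an even number of entries per level"
        pos_of = {v: i for i, v in enumerate(sa, start=1)}
        B, rank, psi = [0] * (n + 1), [0] * (n + 1), [0] * (n + 1)
        for i in range(1, n + 1):
            v = sa[i - 1]
            B[i] = 1 - v % 2                      # 1 for even, 0 for odd values
            rank[i] = rank[i - 1] + B[i]
            psi[i] = i if B[i] else pos_of[v + 1]
        stored.append((B, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]   # SA_{k+1}
    return stored, sa


def lookup(i, stored, sa_last, k=0):
    """Recursive lookup of SA_k[i]; call with k = 0 to obtain SA[i]."""
    if k == len(stored):
        return sa_last[i - 1]                     # last level is explicit
    B, rank, psi = stored[k]
    return 2 * lookup(rank[psi[i]], stored, sa_last, k + 1) + (B[i] - 1)


sa = suffix_array("abbabba#")
stored, sa_last = compress(sa, levels=2)
print([lookup(i, stored, sa_last) for i in range(1, len(sa) + 1)] == sa)
```

Only `sa_last` holds suffix array values; every other entry is recomputed from the per-level tables on demand.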

G&V – details
Space versus Time, for a constant ε > 0

G&V – Quick and Large
Storing rank space-efficiently and quickly accessible:
• Explicit storage of rank takes n log n bits
• Jacobson's method uses o(n) bits
• Both allow for constant-time access
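
The idea behind storing rank in less than n log n bits can be illustrated with sampling. The sketch below is a deliberately simplified, single-level variant and not Jacobson's actual structure (which uses two levels of counters plus a lookup table to reach o(n) extra bits and constant time); the class name `SampledRank` and the block size are assumptions.

```python
class SampledRank:
    """rank over a bit vector with one stored counter per block."""

    def __init__(self, bits, block=8):
        self.bits = bits
        self.block = block
        self.samples = [0]  # samples[q] = number of 1s in the first q blocks
        for start in range(0, len(bits), block):
            self.samples.append(self.samples[-1] + sum(bits[start:start + block]))

    def rank(self, i):
        """Number of 1s among the first i bits (B[1..i] in slide notation)."""
        q, r = divmod(i, self.block)
        start = q * self.block
        # stored count for the full blocks, plus a scan of the partial block
        return self.samples[q] + sum(self.bits[start:start + r])


bits = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
r = SampledRank(bits, block=4)
print([r.rank(i) for i in range(len(bits) + 1)])
```

Here the scan inside a block costs time proportional to the block size; replacing that scan with a table lookup is what makes the real structure constant time.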

G&V – Quick and Large
Storing Ψ_k efficiently (outline):
• Create 2^(2^k) arrays, one for each possible substring in {a,b}^(2^k), using the substring as label
Example using k = 2

G&V – Quick and Large
• For each B_k[j] = 1, find the 2^k literals preceding the suffix referenced by SA_k[j] in T
• Store j in the array labelled with these literals
Example using k = 2

G&V – Quick and Large
In other words: for each i with B_k[i] = 0, let t be the first 2^k literals of the suffix referenced by SA_k[i], and insert Ψ_k[i] in array t
Example using k = 2

G&V – Quick and Large
For each j with B_k[j] = 1:
• t = T[2^k · SA_k[j] − 2^k, …, 2^k · SA_k[j] − 1]
• add j to the array with label t
Example using k = 2

G&V – Quick and Large
To calculate Ψ_k(i) for B_k[i] = 0, use i − rank_k(i) as index into the concatenated arrays
Example using k = 2: Ψ_2(8) = 6
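
The labelled arrays can be built and checked on a small text. This is a sketch under assumptions: the text length is a power of two, the arrays are plain Python lists rather than a compressed encoding, and the helper names (`level_sa`, `psi_lists`, `psi`) are illustrative.

```python
from itertools import product


def suffix_array(text, order="a#b"):
    """Naive suffix array (1-based start positions) under a < # < b."""
    key = {c: r for r, c in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [key[c] for c in text[p - 1:]])


def level_sa(sa0, k):
    """SA_k: the values of SA_0 divisible by 2^k, divided by 2^k, in order."""
    step = 2 ** k
    return [v // step for v in sa0 if v % step == 0]


def psi_lists(text, sa_k, k):
    """One list per label in {a,b}^(2^k), labels in lexicographic order."""
    width = 2 ** k
    lists = {"".join(t): [] for t in product("ab", repeat=width)}
    for j, v in enumerate(sa_k, start=1):
        if v % 2 == 0:  # B_k[j] = 1
            # T[2^k*v - 2^k .. 2^k*v - 1] in 1-based slide notation
            label = text[width * v - width - 1: width * v - 1]
            lists[label].append(j)
    return lists


def psi(i, sa_k, concat):
    """Psi_k(i), read from the concatenated lists."""
    if sa_k[i - 1] % 2 == 0:
        return i
    zeros = sum(1 for v in sa_k[:i] if v % 2 == 1)  # equals i - rank_k(i)
    return concat[zeros - 1]


text = "abbabbabbabaaba#"
sa_2 = level_sa(suffix_array(text), 2)
lists = psi_lists(text, sa_2, 2)
concat = [j for lst in lists.values() for j in lst]
print(sa_2, concat, [psi(i, sa_2, concat) for i in range(1, len(sa_2) + 1)])
```

The lookup works because an entry with B_k[i] = 0 is sorted first by its leading 2^k literals (the label) and then by the suffix that follows (the position j), which is exactly the order of the concatenated lists.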

G&V – Quick and Large
l = log log n levels of compression
• Space occupation: l levels, each occupying O(n) bits
  • O(n log log n) bits
• Time requirement for lookup(i): l levels, each requiring O(1)
  • O(log log n) steps
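
As a short worked check of why l = log log n levels are a natural stopping point (logarithms taken to base 2, n assumed to be a power of two): each level halves the number of entries, so the explicitly stored last level fits into n bits, and the total is dominated by the l implicit levels.

```latex
\[
|SA_l| = \frac{n}{2^{l}} = \frac{n}{2^{\log\log n}} = \frac{n}{\log n},
\qquad
|SA_l| \cdot \log n = n \text{ bits},
\]
\[
\underbrace{\sum_{k=0}^{l-1} O(n)}_{\text{levels } 0,\dots,l-1} \; + \; n
\;=\; O(n \log\log n) \text{ bits}.
\]
```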

G&V – Small and Slow
Reduction of size by allowing for higher time usage, for a constant ε > 0

G&V – Small and Slow
Instead of storing all l = log log n levels, store only levels 0, l´ = ½ log log n and l
Example: n = 32, store levels 0, 2, 3

G&V – Small and Slow
Example using |T| = 32: levels SA_0, SA_1, SA_2, SA_3

G&V – Small and Slow
Keep only 3 levels: SA_0, SA_1, SA_3

G&V – Small and Slow
On levels 0 and l´, mark entries that are still present in the next stored level (SA_0, SA_1, SA_3)

G&V – Small and Slow
Before the modification:
• B_k[i] = 1 ⇒ SA_k[i] is stored in SA_{k+1}
• Ψ_k is used for each B_k[i] = 0 to find SA_k[Ψ_k[i]] = SA_k[i] + 1
Modifications added:
• B´_0[i] = 1 ⇒ SA_0[i] is stored in SA_l´
• B´_l´[i] = 1 ⇒ SA_l´[i] is stored in SA_l
• Ψ´_k is used for each B_k[i] = 1 to find SA_k[Ψ´_k[i]] = SA_k[i] + 1

G&V – Small and Slow
Construction of Ψ´ and Ψ, and the B´ markings of indices

G&V – Small and Slow
Ψ´ and Ψ in combination can be used to traverse the entire SA in ascending order of its values, stepping from the entry holding SA[i] to the entry holding SA[i] + 1

G&V – Small and Slow
Length of the traversal determines required time for lookup
• Level 0 contains n entries
• Level l´ was divided ½ log log n times, so it contains n / 2^(½ log log n) = n / √(log n) entries
• Along the traversal, the 0s in B´ are evenly spaced between the marked entries
• Longest sequence of 0s: fewer than 2^(½ log log n) = √(log n)

G&V – Small and Slow
Generalized for more than 2 additional levels (the number must be constant!):
• Let L be the number of levels, ε = L^(−1)
• The longest sequence of 0s has length log^ε n
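
The traversal-based lookup can be sketched as follows. Assumptions: the combined effect of the two traversal functions is modelled by one plain successor table `succ` (entry holding v to entry holding v + 1), only values that are multiples of a spacing `d` are kept explicitly (playing the role of the marked entries in B´), and n is a multiple of `d`. The names `build` and `lookup` are illustrative.

```python
def suffix_array(text, order="a#b"):
    """Naive suffix array (1-based start positions) under a < # < b."""
    key = {c: r for r, c in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [key[c] for c in text[p - 1:]])


def build(sa, d):
    """Keep SA values only at marked entries (value divisible by d)."""
    n = len(sa)
    assert n % d == 0, "sketch assumes n is a multiple of the spacing d"
    pos_of = {v: i for i, v in enumerate(sa, start=1)}
    succ = [0] * (n + 1)
    marked = {}
    for i, v in enumerate(sa, start=1):
        succ[i] = pos_of.get(v + 1, 0)  # 0 means "no successor" (v == n)
        if v % d == 0:
            marked[i] = v               # explicitly stored value
    return succ, marked


def lookup(i, succ, marked):
    """Walk towards larger SA values until a marked entry is reached.

    Returns (SA[i], number of steps walked).
    """
    steps = 0
    while i not in marked:
        i = succ[i]
        steps += 1
    return marked[i] - steps, steps


sa = suffix_array("abbabbabbabaaba#")
succ, marked = build(sa, d=4)
results = [lookup(i, succ, marked) for i in range(1, len(sa) + 1)]
print(max(steps for _, steps in results))
```

With spacing d the walk never exceeds d − 1 steps, which is the role played by the longest sequence of 0s in the time bound.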

G&V – Small and Slow
Reconstruction of levels < l requires
• a vector describing which entries of level k´ can be found in level k´+1 ⇒ O(n) bits
• a function Ψ´ that combined with Ψ allows for complete traversal of SA ⇒ O(n) bits

Sadakane
Improvements on the data structure and algorithms proposed by Grossi & Vitter
• More operations
  • inverse(j): return i so that SA[i] = j
  • search(P): return the interval [l, r] of SA in which P matches T
  • decompress(s, e): return T[s…e]
• Allow for alphabets with |Σ| > 2

Sadakane – inverse(j)
• Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes
• Assuming: j = SA[i]
• Create SA^(−1) so that: i = SA^(−1)[j]

Sadakane – inverse(j)
Proposition: inverse(j) can be computed in O(log^ε n) time with explicit storage of SA^(−1) at the last level and a recursion for all levels above.
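
One way such a recursion could look is sketched below. This is an assumption-laden illustration rather than Sadakane's actual data structure: the positions of the 1s in B_k (a select table) and the inverse of Ψ_k are kept as plain Python containers, the text length is a power of two, and the names `build_inverse` and `inverse` are made up.

```python
def suffix_array(text, order="a#b"):
    """Naive suffix array (1-based start positions) under a < # < b."""
    key = {c: r for r, c in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [key[c] for c in text[p - 1:]])


def build_inverse(sa, levels):
    """Per level: positions of the 1s in B_k and the inverse of Psi_k."""
    stored = []
    for _ in range(levels):
        assert len(sa) % 2 == 0, "sketch assumes an even number of entries"
        pos_of = {v: i for i, v in enumerate(sa, start=1)}
        ones = [i for i, v in enumerate(sa, start=1) if v % 2 == 0]
        psi_inv = {pos_of[v + 1]: i
                   for i, v in enumerate(sa, start=1) if v % 2 == 1}
        stored.append((ones, psi_inv))
        sa = [v // 2 for v in sa if v % 2 == 0]
    inv_last = {v: i for i, v in enumerate(sa, start=1)}  # explicit last level
    return stored, inv_last


def inverse(j, stored, inv_last, k=0):
    """Index i with SA_k[i] = j; call with k = 0 for SA^(-1)[j]."""
    if k == len(stored):
        return inv_last[j]
    ones, psi_inv = stored[k]
    if j % 2 == 0:
        # even value: it survives as j/2 on the next level
        r = inverse(j // 2, stored, inv_last, k + 1)
        return ones[r - 1]              # position of the r-th 1 in B_k
    # odd value: find the entry holding j + 1, then step back through Psi_k
    return psi_inv[inverse(j + 1, stored, inv_last, k)]


sa = suffix_array("abbabba#")
stored, inv_last = build_inverse(sa, levels=2)
print([inverse(j, stored, inv_last) for j in range(1, len(sa) + 1)])
```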

Sadakane – search(P)
• Goal: Find the interval [l…r] in SA so that P matches each of the suffixes pointed to by SA[l…r]
• Do so without using T

Sadakane – search(P)
Proposition: By augmenting the data structure with a function C^(−1) (the "inverse of the array of cumulative frequencies"), it is possible to obtain the first |P| characters of any suffix in O(|P|) time without T, so that P can be compared against entries of SA during a binary search.
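
A search that never touches T after construction can be sketched like this. Assumptions: Ψ is defined for every index as the entry holding SA[i] + 1 (wrapping around for the last suffix) and is stored as a plain list, C^(−1) is evaluated by a binary search over the cumulative character counts, and the names `build_index` and `search` are illustrative. Only n, Ψ and the cumulative counts are passed to `search`.

```python
from bisect import bisect_left
from itertools import accumulate

ORDER = "a#b"                       # character order used on the slides
KEY = {c: r for r, c in enumerate(ORDER)}


def build_index(text):
    """Return (SA, psi, ends); ends are the cumulative character counts."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda p: [KEY[c] for c in text[p - 1:]])
    pos_of = {v: i for i, v in enumerate(sa, start=1)}
    # psi[i]: entry holding SA[i] + 1 (wraps to the entry holding 1)
    psi = [0] + [pos_of.get(v + 1, pos_of[1]) for v in sa]
    ends = list(accumulate(text.count(c) for c in ORDER))
    return sa, psi, ends


def search(pattern, n, psi, ends):
    """Interval (l, r) of SA whose suffixes start with pattern, or None."""
    target = [KEY[c] for c in pattern]

    def prefix(i):
        out = []
        for _ in range(len(pattern)):
            out.append(bisect_left(ends, i))  # C^-1: rank of first character
            i = psi[i]
        return out

    def first_index(pred):
        lo, hi = 1, n + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(prefix(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    l = first_index(lambda p: p >= target)
    r = first_index(lambda p: p > target) - 1
    return (l, r) if l <= r else None


text = "abbabbabbabaaba#"
sa, psi, ends = build_index(text)
print(search("abb", len(text), psi, ends))
```

Each comparison costs O(|P|) steps through Ψ, and the binary search performs O(log n) comparisons.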

Sadakane – decompress(I)
• Goal: Using only SA and its functions, return the substring of T pointed to by I = [s, e].

Sadakane – decompress(I)
Proposition: A substring of length l = e − s + 1 can be decompressed using only SA, SA^(−1) and C^(−1) in O(l + log^ε n) time, where n is the length of the original text.
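
Decompression follows the same pattern: one inverse lookup to find the starting entry, then one step per output character. In this sketch SA^(−1) and Ψ are plain Python containers (an assumption; the compressed structure would compute them), C^(−1) is a binary search over cumulative counts, and the names `build_index` and `decompress` are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

ORDER = "a#b"                       # character order used on the slides
KEY = {c: r for r, c in enumerate(ORDER)}


def build_index(text):
    """Return (inv, psi, ends) for the text; T itself is not kept."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda p: [KEY[c] for c in text[p - 1:]])
    inv = {v: i for i, v in enumerate(sa, start=1)}   # SA^-1
    psi = [0] + [inv.get(v + 1, inv[1]) for v in sa]  # entry holding SA[i] + 1
    ends = list(accumulate(text.count(c) for c in ORDER))
    return inv, psi, ends


def decompress(s, e, inv, psi, ends):
    """Return T[s..e] (1-based, inclusive) without access to T."""
    i = inv[s]                      # entry of the suffix starting at s
    out = []
    for _ in range(e - s + 1):
        out.append(ORDER[bisect_left(ends, i)])  # C^-1: first character
        i = psi[i]                  # move on to the next text position
    return "".join(out)


text = "abbabbabbabaaba#"
inv, psi, ends = build_index(text)
print(decompress(4, 9, inv, psi, ends))
```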

Sadakane – Complexity
• Using inverse, search and decompress it is possible to store T implicitly. Therefore the O(n) words for the text are no longer required.
• The space complexity of the Sadakane-improved suffix array is only 37% of a Grossi & Vitter suffix array including the text.
