Compressed Suffix Arrays
Compression of Suffix Arrays to linear size
Fabian Pache
Joint Advanced Student School 2004
Motivation
• Problem: Suffix Arrays are data structures that support fast searching of patterns in long texts, but require large amounts of memory
• Goal: Find a compression that reduces the size of the Suffix Array while still allowing fast access
Outline
• Trivial Compression
• Grossi & Vitter
  • Outline
  • Algorithms and Analysis
• Sadakane
  • Outline
  • Algorithms and Analysis (Sketch)
Conventions used
• Text T [1…n]
  • Binary text {a,b}^n, terminated with #
• Pattern P [1…m]
  • Binary text {a,b}^m
• Suffix Array SA [1…n]
  • Each entry points to a position i in T
  • Uses n log n bits
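
A minimal sketch of these conventions, assuming a naive sort-based construction (not an efficient one) and the character order a < # < b that the following slides use. The names `ORDER` and `suffix_array` are illustrative, and index 0 of the list is unused padding so that indices match the 1-based notation of the slides.

```python
# Character order taken from the slides: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}

def suffix_array(text):
    """Return SA as a 1-based list (SA[0] is unused padding).

    SA[i] is the 1-based start position of the i-th smallest suffix.
    """
    n = len(text)

    def key(start):
        # Translate the suffix into a list of ranks so that '#' sorts between a and b.
        return [ORDER[c] for c in text[start - 1:]]

    return [None] + sorted(range(1, n + 1), key=key)

T = "abbabab#"
SA = suffix_array(T)
for i in range(1, len(SA)):
    print(i, SA[i], T[SA[i] - 1:])
```

Each printed row shows the rank i, the entry SA[i] and the suffix it points to.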
Trivial Compression
• Construct and store SA; a < # < b
Trivial Compression
• Recover T from SA: suffixes ranked before # start with a (a < #), suffixes ranked after # start with b (# < b)
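
The recovery step can be sketched as follows, assuming a 1-based SA (index 0 unused) over a text from {a,b} that ends in a single #. The function name `recover_text` is illustrative. Because a < # < b, the rank of the suffix "#" splits SA into the suffixes that start with a and those that start with b.

```python
def recover_text(SA):
    """Rebuild T from its suffix array alone (SA is 1-based, SA[0] unused)."""
    n = len(SA) - 1
    p = SA.index(n)                # rank of the suffix "#", which starts at position n
    T = [None] * (n + 1)
    for i in range(1, n + 1):
        if i < p:
            T[SA[i]] = "a"         # ranked before "#": starts with a
        elif i == p:
            T[SA[i]] = "#"
        else:
            T[SA[i]] = "b"         # ranked after "#": starts with b
    return "".join(T[1:])

print(recover_text([None, 4, 6, 1, 8, 3, 5, 7, 2]))
```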
Trivial Compression
• Therefore each Suffix Array can be compressed to
• Drawback: decompression takes
Grossi & Vitter
• Outline:
  • Recursive "Divide and Conquer"-type algorithm
  • Stores SA implicitly (for all but the last level)
• Supported operations
  • lookup( i )
  • compress
G&V – compress
• Structural Outline: SA_0, SA_1, SA_2, …, SA_l
G&V – compress
Given a Text T; create SA
G&V – compress
Create array B [1...n] with n = |SA|
• B [ i ] = 1, if SA [ i ] is even
• B [ i ] = 0, if SA [ i ] is odd
G&V – compress
• Create an array rank [ 1...n ] where rank [ i ] contains the number of 1s in B [ 1...i ]
G&V – compress
Define a mapping Ψ [ 1..n ] so that
• If B [ i ] = 0: Ψ [ i ] = j such that SA [ j ] = SA [ i ] + 1
• If B [ i ] = 1: Ψ [ i ] = i
G&V – compress
Compressing from SA_k to SA_{k+1}
• Store only even values of SA_k in SA_{k+1}
• Divide each entry in SA_{k+1} by 2
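
One compression step can be sketched as below. This is an illustration under assumptions, not the authors' implementation: all lists are 1-based with index 0 unused, the length of SA_k is assumed to be even (so that every odd value has a successor), and B, rank and Ψ are stored as plain lists rather than in compressed form. The name `compress_level` is illustrative.

```python
def compress_level(SA):
    """One step SA_k -> (B_k, rank_k, Psi_k, SA_{k+1}); all lists are 1-based."""
    n = len(SA) - 1
    pos = [None] * (n + 2)                 # pos[v] = index j with SA[j] = v
    for j in range(1, n + 1):
        pos[SA[j]] = j

    B, rank, psi = [None], [None], [None]
    ones = 0
    for i in range(1, n + 1):
        bit = 1 if SA[i] % 2 == 0 else 0   # B[i] = 1 for even entries
        ones += bit
        B.append(bit)
        rank.append(ones)                  # number of 1s in B[1..i]
        # Even entries map to themselves, odd entries to the index of SA[i] + 1.
        psi.append(i if bit else pos[SA[i] + 1])

    # Keep the even values only and halve them.
    next_SA = [None] + [SA[i] // 2 for i in range(1, n + 1) if B[i]]
    return B, rank, psi, next_SA

SA0 = [None, 4, 6, 1, 8, 3, 5, 7, 2]       # suffix array of "abbabab#" with a < # < b
B0, rank0, psi0, SA1 = compress_level(SA0)
print(B0[1:], rank0[1:], psi0[1:], SA1[1:])
```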
G&V – lookup
Reconstruction of SA_k [ i ] using B_k, rank_k, Ψ_k and SA_{k+1}
SA_k [ i ] = 2 · SA_{k+1} [ rank_k ( Ψ_k ( i ) ) ] + ( B_k [ i ] − 1 )
G&V – lookup
• Proof / Example part 1: B_k [ i ] = 1, so Ψ_k ( i ) = i
SA_k [ i ] = 2 · SA_{k+1} [ rank_k ( Ψ_k ( i ) ) ] + ( B_k [ i ] − 1 )
SA_k [ i ] = 2 · SA_{k+1} [ rank_k ( Ψ_k ( i ) ) ]
SA_k [ i ] = 2 · SA_{k+1} [ rank_k ( i ) ]
G&V – lookup
• Proof / Example part 2: B_k [ i ] = 0, so Ψ_k ( i ) is the index of the even value SA_k [ i ] + 1
SA_k [ i ] = 2 · SA_{k+1} [ rank_k ( Ψ_k ( i ) ) ] + ( B_k [ i ] − 1 )
SA_k [ i ] = 2 · SA_{k+1} [ rank_k ( Ψ_k ( i ) ) ] − 1
G&V – lookup
Stored information
• For each level k = 0...l-1,
  • explicitly store B_k
  • rank_k and Ψ_k are stored implicitly
  • SA is reconstructible by recursion
• For level l, store SA_l explicitly
  • No further information needed
G&V – lookup
• Pseudocode for the lookup function
lookup( i ) = rlookup( i, 0 )
rlookup( i, k )
    if (k == l)
        return SA_l[ i ];
    else
        return 2 * rlookup( rank_k[ psi_k[ i ] ], k+1 ) + (B_k[ i ] - 1);
end
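
A runnable version of this recursion might look as follows. It is a sketch under assumptions: the text length is a power of two, lists are 1-based with index 0 unused, and B_k, rank_k and Ψ_k are kept as plain lists instead of the succinct encodings discussed on the next slides. The helper `build_levels` and its `depth` parameter are illustrative names.

```python
def build_levels(SA, depth):
    """Compress SA_0 `depth` times; return the per-level (B, rank, psi) and the last SA."""
    levels = []
    for _ in range(depth):
        n = len(SA) - 1
        pos = {SA[j]: j for j in range(1, n + 1)}
        B = [None] + [1 - SA[i] % 2 for i in range(1, n + 1)]   # 1 for even entries
        rank = [0]
        for i in range(1, n + 1):
            rank.append(rank[-1] + B[i])
        psi = [None] + [i if B[i] else pos[SA[i] + 1] for i in range(1, n + 1)]
        levels.append((B, rank, psi))
        SA = [None] + [SA[i] // 2 for i in range(1, n + 1) if B[i]]
    return levels, SA

def lookup(i, levels, last_SA, k=0):
    """Return SA_k[i] using only the level data and the explicitly stored last level."""
    if k == len(levels):
        return last_SA[i]
    B, rank, psi = levels[k]
    return 2 * lookup(rank[psi[i]], levels, last_SA, k + 1) + (B[i] - 1)

SA0 = [None, 4, 6, 1, 8, 3, 5, 7, 2]       # suffix array of "abbabab#" with a < # < b
levels, last = build_levels(SA0, depth=2)
print([lookup(i, levels, last) for i in range(1, 9)])
```

With `depth=2` only a two-entry array SA_2 is stored explicitly, and every entry of SA_0 is rebuilt through two recursive steps.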
G&V – details
Speed versus Time
ε > 0
G&V – Quick and Large
Storing rank space-efficiently and quickly accessible:
• Explicit storage of rank takes n log n bits
• Jacobson's method uses o( n ) bits
• Both allow for constant time access
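
The idea behind a rank directory can be illustrated with a simplified, single-level sketch. This is not Jacobson's method itself: his structure uses two directory levels plus a lookup table to reach o( n ) extra bits and constant time, whereas the sketch below stores one prefix count per block and scans inside the block. The class name `BlockRank` and the block size are assumptions made for the illustration.

```python
class BlockRank:
    """rank1(i) = number of 1s among the first i bits, using per-block prefix counts."""

    def __init__(self, bits, block=4):
        self.bits = bits                    # 0-based list of 0/1 values
        self.block = block
        self.prefix = [0]                   # prefix[b] = number of 1s in the first b blocks
        for start in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + block]))

    def rank1(self, i):
        full, rest = divmod(i, self.block)
        start = full * self.block
        # Stored count for the complete blocks plus a short scan inside the last block.
        return self.prefix[full] + sum(self.bits[start:start + rest])

B = [1, 1, 0, 1, 0, 0, 0, 1]
r = BlockRank(B)
print([r.rank1(i) for i in range(1, len(B) + 1)])
```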
G&V – Quick and Large
Storing Ψ_k efficiently (outline):
• Create 2^(2^k) arrays; one for each possible substring over {a,b}^(2^k), using the substring as label
Example using k=2
G&V – Quick and Large
• For each B_k [ j ] = 1 find the 2^k literals preceding the suffix referenced by SA_k [ j ] in T
• Store j in the array according to T
Example using k=2
G&V – Quick and Large
In other words: For each i with B_k [ i ] = 0 and t the first 2^k literals of the suffix referenced by SA_k [ i ], insert Ψ_k [ i ] in array t
Example using k=2
G&V – Quick and Large
For each j with B_k [ j ] = 1
• t = T [ 2^k · SA_k [ j ] − 2^k, ..., 2^k · SA_k [ j ] − 1 ]
• add j to the array with label t
Example using k=2
G&V – Quick and Large
To calculate Ψ_k ( i ), use i − rank_k ( i ) as index to the concatenated arrays
Example using k=2: Ψ_2 ( 8 ) = 6
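
The list construction can be tried out on a small case. The sketch below is restricted to level k = 0, so every label is a single character (2^0 = 1 literal), and it assumes the order a < # < b and 1-based SA with index 0 unused. The names `psi_lists` and `psi` are illustrative, and the number of 0s up to i is counted by a scan here instead of via rank.

```python
ORDER = {"a": 0, "#": 1, "b": 2}

def psi_lists(T, SA):
    """Level 0: file every j with even SA[j] under the character preceding that suffix,
    then concatenate the lists in label order."""
    n = len(SA) - 1
    lists = {c: [] for c in ORDER}
    for j in range(1, n + 1):
        if SA[j] % 2 == 0:
            # T is a 0-based Python string, so T[SA[j] - 2] is text position SA[j] - 1.
            lists[T[SA[j] - 2]].append(j)
    concat = []
    for c in sorted(ORDER, key=ORDER.get):
        concat.extend(lists[c])
    return concat

def psi(i, SA, concat):
    if SA[i] % 2 == 0:
        return i                                                   # B[i] = 1
    zeros = sum(1 for x in range(1, i + 1) if SA[x] % 2 == 1)      # i - rank(i)
    return concat[zeros - 1]

T = "abbabab#"
SA = [None, 4, 6, 1, 8, 3, 5, 7, 2]
concat = psi_lists(T, SA)
print(concat, [psi(i, SA, concat) for i in range(1, 9)])
```

Within one label the stored indices are increasing, which is what makes the concatenated lists compressible.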
G&V – Quick and Large
l = log log n levels of compression
• Space occupation: l levels, each occupying O ( n ) bits
  • O ( n log log n ) bits
• Time requirement for lookup ( i ): l levels, each requiring O ( 1 )
  • O ( log log n ) steps
G&V – Small and Slow
Reduction of size by allowing for higher time usage
ε > 0
G&V – Small and Slow
Instead of storing all l = log log n levels, store only levels 0, l′ = ½ log log n and l
Example n = 32: store levels 0, 2, 3
G&V – Small and Slow
Example using | T | = 32: SA_0, SA_1, SA_2, SA_3
G&V – Small and Slow
Keep only 3 levels: SA_0, SA_1, SA_3
G&V – Small and Slow
On levels 0 and l′, mark entries that are still present in the next level (SA_0, SA_1, SA_3)
G&V – Small and Slow
Before the modification:
• B_k [ i ] = 1: SA_k [ i ] is stored in SA_{k+1}
• Ψ_k used for each B_k [ i ] = 0 to find SA_k [ Ψ_k [ i ] ] = SA_k [ i ] + 1
Modifications added:
• B′_0 [ i ] = 1: SA_0 [ i ] is stored in SA_l′
• B′_l′ [ i ] = 1: SA_l′ [ i ] is stored in SA_l
• Ψ′_k used for each B_k [ i ] = 1 to find SA_k [ Ψ′_k [ i ] ] = SA_k [ i ] + 1
G&V – Small and Slow
Construction of Ψ′ and B′ (markings of indices)
G&V – Small and Slow
Ψ′ and Ψ in combination can be used to traverse the entire SA (ascending)
G&V – Small and Slow
Length of the traversal determines required time for lookup
• Level 0 contains n entries
• Level l′ was divided ½ log log n times
G&V – Small and Slow
Length of the traversal determines required time for lookup
• 0s in B′ are evenly spaced
• Longest sequence of 0s: √( log n ), since 2^(½ log log n) = √( log n )
G&V – Small and Slow
Generalized for more than 2 additional levels (the number must be constant!):
• Let L be the number of levels, ε = 1 / ( L − 1 )
• The longest sequence of 0s has length log^ε n
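
The bound can be checked with a short calculation. It assumes that the L stored levels are evenly spaced between level 0 and level l = log log n, which is an assumption made here for the derivation; the slides only state the result.

```latex
\[
\text{halvings between two stored levels} \;=\; \frac{\log\log n}{L-1}
\quad\Longrightarrow\quad
\text{longest run of 0s} \;\le\; 2^{\frac{\log\log n}{L-1}}
\;=\; (\log n)^{\frac{1}{L-1}}
\;=\; \log^{\varepsilon} n ,
\qquad \varepsilon = \frac{1}{L-1}.
\]
```

For L = 3 this gives ε = ½ and a longest run of √( log n ), which matches the three-level case with level l′ = ½ log log n.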
G&V – Small and Slow
Reconstruction of levels < l requires
• a vector describing which entries of level k′ can be found in k′+1 ⇒ O ( n ) bits
• a function Ψ′ that combined with Ψ allows for complete traversal of SA ⇒ O ( n ) bits
Sadakane
Improvements on the data structure and algorithms proposed by Grossi & Vitter
• More operations
  • inverse( j ): return i so that SA[ i ] = j
  • search( P ): return l, r where P matches T
  • decompress( s, e ): return T[s...e]
• Allow for alphabets |Σ| > 2
Sadakane – inverse( j )
Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes
Assuming: j = SA[ i ]
Create SA^-1 so that: i = SA^-1[ j ]
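
As a plain illustration of the definition, the inverse array can be built explicitly as below. This only shows what SA^-1 contains; it stores all n entries and is therefore not the space-saving scheme that the slides attribute to Sadakane. Lists are 1-based with index 0 unused, and the name `inverse_array` is illustrative.

```python
def inverse_array(SA):
    """Return inv with inv[SA[i]] = i for every rank i (1-based, index 0 unused)."""
    n = len(SA) - 1
    inv = [None] * (n + 1)
    for i in range(1, n + 1):
        inv[SA[i]] = i
    return inv

SA = [None, 4, 6, 1, 8, 3, 5, 7, 2]    # suffix array of "abbabab#" with a < # < b
inv = inverse_array(SA)
print(inv[1:])
```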
Sadakane – inverse( j )
Proposition: inverse( j ) can be computed in O( log^ε n ) with explicit storage of SA^-1 at the last level and a recursion for all levels above.
Sadakane – search( P )
Goal: Find the interval [ l...r ] in SA so that P matches each of the suffixes pointed to by SA. Do so without using T
Sadakane – search( P )
Proposition: By augmenting the data structure with a function C^-1 (the "inverse of the array of cumulative frequencies") it is possible to obtain the substring in O( |P| ) time
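
A sketch of how such a search could work without touching T after construction. Two assumptions that the slides do not spell out: Ψ is taken to be defined for every index as Ψ[ i ] = SA^-1[ SA[ i ] + 1 ], and C^-1 is modelled by the helper `first_char`, which maps a rank to the first character of that suffix using cumulative character counts. The interval is found by binary search, so the whole search costs O( |P| log n ) here; only each substring extraction is O( |P| ). All names are illustrative.

```python
ORDER = {"a": 0, "#": 1, "b": 2}
ALPHABET = sorted(ORDER, key=ORDER.get)

def build_index(T):
    n = len(T)
    SA = [None] + sorted(range(1, n + 1), key=lambda s: [ORDER[c] for c in T[s - 1:]])
    inv = [None] * (n + 2)
    for i in range(1, n + 1):
        inv[SA[i]] = i
    psi = [None] + [inv[SA[i] + 1] for i in range(1, n + 1)]   # None for the suffix "#"
    cum, total = [], 0
    for c in ALPHABET:                     # cum[x] = number of suffixes starting with a char <= x
        total += T.count(c)
        cum.append(total)
    return SA, psi, cum

def first_char(i, cum):
    for c, bound in zip(ALPHABET, cum):
        if i <= bound:
            return c

def prefix_of_suffix(i, m, psi, cum):
    """First m characters of the suffix of rank i, read via psi and first_char only."""
    out = []
    while i is not None and len(out) < m:
        out.append(first_char(i, cum))
        i = psi[i]
    return "".join(out)

def search(P, psi, cum):
    n = len(psi) - 1
    key = [ORDER[c] for c in P]
    code = lambda i: [ORDER[c] for c in prefix_of_suffix(i, len(P), psi, cum)]
    lo, hi = 1, n + 1
    while lo < hi:                         # first rank whose prefix is >= P
        mid = (lo + hi) // 2
        if code(mid) < key:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, n + 1
    while lo < hi:                         # first rank whose prefix is > P
        mid = (lo + hi) // 2
        if code(mid) <= key:
            lo = mid + 1
        else:
            hi = mid
    return left, lo - 1                    # empty interval if left > right

SA, psi, cum = build_index("abbabab#")
print(search("ab", psi, cum), search("ba", psi, cum))
```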
Sadakane – decompress( I )
Goal: Using only SA and its functions, return the substring of T pointed to by I = [ s, e ].
Sadakane – decompress( I )
Proposition: A substring of length l = e − s + 1 can be decompressed using only SA, SA^-1 and C^-1 in O( l + log^ε n ) time, where n is the length of the original text.
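
The decompression walk can be sketched as follows, assuming the same full successor function Ψ[ i ] = SA^-1[ SA[ i ] + 1 ] as is usual for compressed suffix arrays (the slides do not define it for this part). Here SA^-1 is a plain array, so the start rank is found in one step rather than by the recursion the proposition refers to, and the list `first` stands in for C^-1. All names are illustrative.

```python
ORDER = {"a": 0, "#": 1, "b": 2}

def build(T):
    n = len(T)
    SA = [None] + sorted(range(1, n + 1), key=lambda s: [ORDER[c] for c in T[s - 1:]])
    inv = [None] * (n + 2)
    for i in range(1, n + 1):
        inv[SA[i]] = i
    psi = [None] + [inv[SA[i] + 1] for i in range(1, n + 1)]
    first = [None] + [T[SA[i] - 1] for i in range(1, n + 1)]   # stand-in for C^-1
    return inv, psi, first

def decompress(s, e, inv, psi, first):
    """Return T[s..e] (1-based, inclusive) without reading T."""
    i = inv[s]                      # rank of the suffix starting at position s
    out = []
    for _ in range(e - s + 1):
        out.append(first[i])        # first character of the current suffix
        i = psi[i]                  # move to the suffix that starts one position later
    return "".join(out)

inv, psi, first = build("abbabab#")
print(decompress(2, 5, inv, psi, first))
```

After the start rank is known, each further character costs one step of Ψ, which is where the l term in the bound comes from.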
Sadakane – Complexity
Using inverse, search and decompress it is possible to store T implicitly. Therefore the O( n ) words for the text are no longer required.
The space complexity of the Sadakane-improved Suffix Array is only 37% of a Grossi & Vitter Suffix Array including the text