
Compressed Suffix Arrays

Compressed Suffix Arrays: compression of suffix arrays to linear size. Fabian Pache. Motivation: suffix arrays are data structures that support fast searching of patterns in long texts, but they require large amounts of memory.


Presentation Transcript


  1. Compressed Suffix Arrays: Compression of Suffix Arrays to linear size. Fabian Pache. Joint Advanced Student School 2004

  2. Motivation • Problem: Suffix arrays are data structures that support fast searching of patterns in long texts, but they require large amounts of memory • Goal: Find a compression that reduces the size of the suffix array while still allowing fast access

  3. Outline • Trivial Compression • Grossi & Vitter: Outline; Algorithms and Analysis • Sadakane: Outline; Algorithms and Analysis (Sketch)

  4. Conventions used • Text T[1…n]: a binary text in {a,b}^n, terminated with # • Pattern P[1…m]: a binary text in {a,b}^m • Suffix Array SA[1…n]: each entry points to a position T[i]; uses n log n bits

  5. Trivial Compression • Construct and store SA, using the symbol order a < # < b

  6. Trivial Compression • Recover T from SA: suffixes sorted before the suffix # start with a (a < #), suffixes sorted after it start with b (# < b)
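
The recovery idea on slides 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names and the example text are assumptions, and the suffix array is built by naive sorting rather than by an efficient construction algorithm.

```python
def suffix_array(text, order="a#b"):
    """1-based suffix array of `text`, comparing symbols in the given order."""
    rank_of = {ch: r for r, ch in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [rank_of[c] for c in text[p - 1:]])


def recover_text(sa):
    """Rebuild T from SA alone; relies on a < # < b and on # ending T."""
    n = len(sa)
    hash_index = sa.index(n)          # the suffix "#" starts at position n
    text = [""] * n
    for idx, pos in enumerate(sa):
        if idx < hash_index:
            text[pos - 1] = "a"       # sorted before "#": starts with a
        elif idx > hash_index:
            text[pos - 1] = "b"       # sorted after "#": starts with b
        else:
            text[pos - 1] = "#"
    return "".join(text)


example = "abbabab#"                  # illustrative text, not from the slides
sa = suffix_array(example)
print(sa)
print(recover_text(sa) == example)
```

Because # sits between a and b in the symbol order, the position of the suffix # in SA splits the array into the a-suffixes and the b-suffixes, which is all the information needed to rebuild a binary text.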

  7. Trivial Compression • Therefore each Suffix Array can be compressed to • Drawback: decompression takes

  8. Grossi & Vitter • Outline: • Recursive "divide and conquer"-type algorithm • Stores SA implicitly (for all but the last level) • Supported operations • lookup( i ) • compress

  9. G&V – compress • Structural outline: SA_0, SA_1, SA_2, …, SA_l

  10. G&V – compress Given a text T, create SA

  11. G&V – compress Create array B[1...n] with n = |SA| • B[i] = 1, if SA[i] is even

  12. G&V – compress Create array B[1...n] with n = |SA| • B[i] = 1, if SA[i] is even • B[i] = 0, if SA[i] is odd

  13. G&V – compress • Create an array rank[1...n] where rank[i] contains the number of 1s in B[1...i]

  14. G&V – compress Define a mapping Ψ[1..n] so that • If B[i] = 0: Ψ[i] = j such that SA[j] = SA[i] + 1

  15. G&V – compress Define a mapping Ψ[1..n] so that • If B[i] = 0: Ψ[i] = j such that SA[j] = SA[i] + 1 • If B[i] = 1: Ψ[i] = i

  16. G&V – compress Compressing from SA_k to SA_{k+1} • Store only the even values of SA_k in SA_{k+1} • Divide each entry in SA_{k+1} by 2
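
One compression step, as described on slides 11 to 16, might look as follows in Python. This is a sketch under stated assumptions: the function name `compress_level`, the 1-based lists with an unused slot 0, and the example permutation are illustrative and not taken from the slides, and the arrays are stored explicitly rather than in compressed form.

```python
def compress_level(sa):
    """One compression step; `sa` is 1-based (sa[0] is an unused placeholder)."""
    n = len(sa) - 1
    assert n % 2 == 0, "every odd value needs an even successor"
    where = {sa[i]: i for i in range(1, n + 1)}    # value -> index in sa
    B = [0] * (n + 1)
    rank = [0] * (n + 1)
    psi = [0] * (n + 1)
    next_sa = [0]                                  # slot 0 unused here as well
    for i in range(1, n + 1):
        B[i] = 1 if sa[i] % 2 == 0 else 0          # 1 marks an even value
        rank[i] = rank[i - 1] + B[i]               # number of 1s in B[1..i]
        psi[i] = i if B[i] else where[sa[i] + 1]   # odd value: index of its successor
        if B[i]:
            next_sa.append(sa[i] // 2)             # keep even values, halved
    return B, rank, psi, next_sa


sa0 = [0, 4, 6, 1, 8, 3, 5, 7, 2]                  # illustrative suffix array, n = 8
B, rank, psi, sa1 = compress_level(sa0)
print(B[1:], rank[1:], psi[1:], sa1[1:])
```

Each SA_0[i] remains recoverable from these four arrays, since an odd value is always one less than the even value that Ψ points to.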

  17. G&V – lookup Reconstruction of SA_k[i] using B_k, rank_k, Ψ_k and SA_{k+1}: SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)

  18. G&V – lookup • Proof / Example part 1: B_k[i] = 1, so Ψ_k(i) = i and SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1) = 2 · SA_{k+1}[rank_k(Ψ_k(i))] = 2 · SA_{k+1}[rank_k(i)]

  19. G&V – lookup • Proof / Example part 2: B_k[i] = 0, so SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1) = 2 · SA_{k+1}[rank_k(Ψ_k(i))] − 1

  20. G&V – lookup Stored information • For each level k = 0...l−1: • explicitly store B_k • rank_k and Ψ_k are stored implicitly • SA is reconstructible by recursion • For level l, store SA_l explicitly • No further information needed

  21. G&V – lookup • Pseudocode for the lookup function: lookup(i) = rlookup(i, 0); rlookup(i, k): if (k == l) return SA_l[i]; else return 2 * rlookup(rank_k[Ψ_k[i]], k+1) + (B_k[i] − 1); end
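
The pseudocode above can be turned into a runnable sketch. The helper names (`suffix_array`, `build_levels`, `lookup`) and the example text are assumptions made for illustration; B_k, rank_k and Ψ_k are kept as plain Python lists here, which is not the compressed representation the later slides describe.

```python
from math import ceil, log2


def suffix_array(text, order="a#b"):
    """1-based suffix array of `text` under the symbol order a < # < b."""
    rank_of = {ch: r for r, ch in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [rank_of[c] for c in text[p - 1:]])


def build_levels(sa0, num_levels):
    """Return ([(B_k, rank_k, psi_k) for k < l], SA_l); all lists are 1-based."""
    levels = []
    sa = [0] + list(sa0)
    for _ in range(num_levels):
        n = len(sa) - 1
        where = {v: i for i, v in enumerate(sa) if i > 0}
        B = [0] + [1 - sa[i] % 2 for i in range(1, n + 1)]   # 1 for even values
        rank = [0] * (n + 1)
        for i in range(1, n + 1):
            rank[i] = rank[i - 1] + B[i]
        psi = [0] + [i if B[i] else where[sa[i] + 1] for i in range(1, n + 1)]
        levels.append((B, rank, psi))
        sa = [0] + [sa[i] // 2 for i in range(1, n + 1) if B[i]]
    return levels, sa


def lookup(i, levels, last_sa, k=0):
    """Recursive reconstruction of SA_k[i], following the slide's rlookup."""
    if k == len(levels):
        return last_sa[i]                  # SA_l is stored explicitly
    B, rank, psi = levels[k]
    return 2 * lookup(rank[psi[i]], levels, last_sa, k + 1) + (B[i] - 1)


text = "ab" * 15 + "b#"                    # illustrative text of length 32
sa = suffix_array(text)
l = ceil(log2(log2(len(text))))            # number of levels, here 3
levels, last_sa = build_levels(sa, l)
print(all(lookup(i, levels, last_sa) == sa[i - 1] for i in range(1, len(sa) + 1)))
```

The sketch assumes the text length is divisible by 2^l, so that every level has an even number of entries.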

  22. G&V – details Space versus time, with ε > 0

  23. G&V – Quick and Large Storing rank space-efficiently and quickly accessible: • Explicit storage of rank takes n log n bits • Jacobson's method uses o(n) additional bits • Both allow for constant-time access
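
To give a feel for how a rank directory trades space for time, here is a simplified two-level sketch in the spirit of Jacobson's method. It is not the exact construction: the class name, block sizes and parameters are assumptions, and the scan inside a block stands in for the precomputed table lookup that makes the real method constant-time.

```python
class RankDirectory:
    """Two-level rank directory over a list of bits (simplified sketch)."""

    def __init__(self, bits, block=4, blocks_per_super=4):
        self.bits = list(bits)
        self.block = block
        self.super_size = block * blocks_per_super
        self.super_counts = []    # 1s before each superblock (absolute)
        self.block_counts = []    # 1s before each block, relative to its superblock
        total = in_super = 0
        # The "+ 1" adds a final, possibly empty, block so that rank(len) works.
        for start in range(0, len(self.bits) + 1, block):
            if start % self.super_size == 0:
                self.super_counts.append(total)
                in_super = 0
            self.block_counts.append(in_super)
            ones = sum(self.bits[start:start + block])
            total += ones
            in_super += ones

    def rank(self, i):
        """Number of 1s among the first i bits (B[1..i] in the slides' notation)."""
        q, r = divmod(i, self.block)
        start = q * self.block
        inside = sum(self.bits[start:start + r])   # table lookup in the real method
        return (self.super_counts[start // self.super_size]
                + self.block_counts[q] + inside)


bits = [1, 0, 1, 1, 0] * 7                         # illustrative bit vector
directory = RankDirectory(bits)
print(directory.rank(10), sum(bits[:10]))
```

Only the superblock counters need full-width integers; the per-block counters are bounded by the superblock size, which is where the space saving over an explicit rank array comes from.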

  24. G&V – Quick and Large Storing Ψ_k efficiently (outline): • Create one array for each possible substring over {a,b}^(2^k), that is 2^(2^k) arrays, using the substring as label (Example using k = 2)

  25. G&V – Quick and Large • For each B_k[j] = 1, find the 2^k literals preceding the suffix referenced by SA_k[j] in T • Store j in the array labelled with these literals (Example using k = 2)

  26. G&V – Quick and Large In other words: for each i with B_k[i] = 0, let t be the first 2^k literals of the suffix referenced by SA_k[i]; insert Ψ_k[i] into the array labelled t (Example using k = 2)

  27. G&V – Quick and Large For each j with B_k[j] = 1: t = T[2^k · SA_k[j] − 2^k, ..., 2^k · SA_k[j] − 1]; add j to the array with label t (Example using k = 2)

  28. G&V – Quick and Large To calculate Ψ_k(i), use i − rank_k(i) as index into the concatenated arrays (Example using k = 2: Ψ_2(8) = 6)
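
The list construction of slides 24 to 28 can be illustrated for the simplest case, level k = 0, where each label is a single symbol. The sketch below is an assumption-laden illustration rather than the presentation's example (which uses k = 2): the function names and the text are made up, and the text length is assumed to be even with # only at the end.

```python
def suffix_array(text, order="a#b"):
    """1-based suffix array of `text` under the symbol order a < # < b."""
    rank_of = {ch: r for r, ch in enumerate(order)}
    return sorted(range(1, len(text) + 1),
                  key=lambda p: [rank_of[c] for c in text[p - 1:]])


def concatenated_lists(text, sa):
    """Level 0: group every j with B[j] = 1 by the symbol preceding suffix SA[j]."""
    lists = {"a": [], "b": []}
    for j, pos in enumerate(sa, start=1):      # ascending j keeps each list sorted
        if pos % 2 == 0:                       # B[j] = 1
            lists[text[pos - 2]].append(j)     # T[pos - 1] in 1-based terms
    return lists["a"] + lists["b"]             # concatenate in label order a < b


def psi(i, sa, rank, concat):
    """Psi_0(i): identity on even entries, list lookup on odd entries."""
    if sa[i - 1] % 2 == 0:
        return i
    return concat[i - rank[i] - 1]             # i - rank(i) counts the 0s in B[1..i]


text = "abbabab#"                              # illustrative text, even length
sa = suffix_array(text)
rank = [0]
for value in sa:
    rank.append(rank[-1] + (1 if value % 2 == 0 else 0))
concat = concatenated_lists(text, sa)
print([psi(i, sa, rank, concat) for i in range(1, len(sa) + 1)])
```

The lookup works because the odd-position suffixes appear in SA sorted first by their leading symbol and then by the rank of the even-position suffix that follows, which is exactly the order of the concatenated lists.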

  29. G&V – Quick and Large l = log log n levels of compression • Space occupation: l levels, each occupying O(n) bits ⇒ O(n log log n) bits • Time requirement for lookup(i): l levels, each requiring O(1) ⇒ O(log log n) steps

  30. G&V – Small and Slow Reduction of size by allowing for higher time usage, with ε > 0

  31. G&V – Small and Slow Instead of storing all l = log log n levels, store only levels 0, l′ and l, where l′ = ½ log log n. Example: for n = 32, store levels 0, 2, 3

  32. G&V – Small and Slow Example using |T| = 32: SA_0, SA_1, SA_2, SA_3

  33. G&V – Small and Slow Keep only 3 levels: SA_0, SA_1, SA_3

  34. G&V – Small and Slow On levels 0 and l′, mark the entries that are still present in the next stored level: SA_0, SA_1, SA_3

  35. G&V – Small and Slow Before the modification: • B_k[i] = 1 ⇒ SA_k[i] is stored in SA_{k+1} • Ψ_k is used for each B_k[i] = 0 to find SA_k[Ψ_k[i]] = SA_k[i] + 1 Modifications added: • B′_0[i] = 1 ⇒ SA_0[i] is stored in SA_{l′} • B′_{l′}[i] = 1 ⇒ SA_{l′}[i] is stored in SA_l • Ψ′_k is used for each B_k[i] = 1 to find SA_k[Ψ′_k[i]] = SA_k[i] + 1

  36. G&V – Small and Slow Construction of Ψ′ and Ψ; B′ markings of indices

  37. G&V – Small and Slow Ψ′ and Ψ in combination can be used to traverse the entire SA (ascending)

  38. G&V – Small and Slow Length of the traversal determines required time for lookup • Level 0 contains n entries • Level l′ was divided ½ log log n times

  39. G&V – Small and Slow Length of the traversal determines required time for lookup • 0s in B′ are evenly spaced • Longest sequence of 0s

  40. G&V – Small and Slow Generalized for more than 2 additional levels (the number must be constant!): Let L be the number of levels and ε = 1/(L − 1). The longest sequence of 0s has length log^ε n
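
The traversal idea of slides 37 to 40 can be sketched as follows: only a subset of the SA values is kept, and a lookup walks along the successor function until it reaches a marked entry. This is a simplification under stated assumptions. The names `build_sparse` and `lookup`, the parameter `step` and the example array are hypothetical, a single sparse level stands in for the levels l′ and l, and the successor function (Ψ and Ψ′ combined) is stored as a plain list rather than in O(n) bits.

```python
def build_sparse(sa, step):
    """Keep only the SA values divisible by `step`; `sa` is a 0-based Python list."""
    n = len(sa)
    assert n % step == 0, "the largest value n must itself be marked"
    where = {v: i for i, v in enumerate(sa, start=1)}
    succ = [0] + [where.get(v + 1, 0) for v in sa]     # index j with SA[j] = SA[i] + 1
    marked = [0] + [1 if v % step == 0 else 0 for v in sa]
    rank = [0] * (n + 1)
    for i in range(1, n + 1):
        rank[i] = rank[i - 1] + marked[i]
    sampled = [0] + [v // step for v in sa if v % step == 0]
    return succ, marked, rank, sampled


def lookup(i, succ, marked, rank, sampled, step):
    """Return (SA[i], number of traversal steps) for a 1-based index i."""
    steps = 0
    while not marked[i]:           # walk to the next text position until marked
        i = succ[i]
        steps += 1
    return sampled[rank[i]] * step - steps, steps


sa = [4, 6, 1, 8, 3, 5, 7, 2]      # illustrative suffix array, n = 8
parts = build_sparse(sa, step=4)
print([lookup(i, *parts, 4) for i in range(1, len(sa) + 1)])
```

Since marked values are evenly spaced in the text, no lookup walks more than `step − 1` positions, which mirrors the bound on the longest sequence of 0s.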

  41. G&V – Small and Slow Reconstruction of levels < l requires • a vector describing which entries of level k′ can be found in level k′+1 ⇒ O(n) bits • a function Ψ′ that, combined with Ψ, allows for complete traversal of SA ⇒ O(n) bits

  42. Sadakane Improvements on the data structure and algorithms proposed by Grossi & Vitter • More operations • inverse( j ): return i so that SA[i] = j • search( P ): return the interval [l, r] of SA whose suffixes match P • decompress( s, e ): return T[s...e] • Allow for alphabets with |Σ| > 2

  43. Sadakane – inverse( j ) Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes. Assuming: j = SA[i]. Create SA^-1 so that: i = SA^-1[j]

  44. Sadakane – inverse( j ) Proposition: inverse( j ) can be computed in O(log^ε n) time with explicit storage of SA^-1 at the last level and a recursion for all levels above.

  45. Sadakane – search( P ) Goal: Find the interval [l...r] in SA so that P matches each of the suffixes pointed to by SA[l...r]. Do so without using T

  46. Sadakane – search( P ) Proposition: By augmenting the data structure with a function C^-1 (the "inverse of the array of cumulative frequencies") it is possible to obtain the substring in O(|P|) time
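
A text-free search along these lines might be sketched as below. The assumptions are spelled out: Ψ is taken here as the full successor function over all indices (with the last text position wrapping to the first), C^-1 is realised with a binary search over cumulative symbol counts, and the interval is found by an ordinary binary search over SA indices. The function names and the example text are illustrative; T and SA are used only while building.

```python
from bisect import bisect_left

ORDER = "a#b"                                   # symbol order a < # < b


def build(text):
    """Build psi and the cumulative counts; T and SA are needed only here."""
    rank_of = {c: r for r, c in enumerate(ORDER)}
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda p: [rank_of[c] for c in text[p - 1:]])
    where = {v: i for i, v in enumerate(sa, start=1)}
    psi = [0] + [where[v % n + 1] for v in sa]  # value n wraps around to 1
    ends, total = [], 0
    for ch in ORDER:                            # ends[r]: suffixes starting with a symbol of rank <= r
        total += text.count(ch)
        ends.append(total)
    return psi, ends


def read_prefix(i, m, psi, ends):
    """First m symbols of the suffix with rank i, read via C^-1 and psi only."""
    out = []
    for _ in range(m):
        ch = ORDER[bisect_left(ends, i)]        # C^-1(i): first symbol of suffix i
        out.append(ch)
        if ch == "#":
            break
        i = psi[i]
    return "".join(out)


def search(pattern, n, psi, ends):
    """Return (l, r) with SA[l..r] the suffixes starting with pattern, or None."""
    target = [ORDER.index(c) for c in pattern]

    def first(strict):
        lo, hi = 1, n + 1
        while lo < hi:
            mid = (lo + hi) // 2
            keys = [ORDER.index(c) for c in read_prefix(mid, len(pattern), psi, ends)]
            if (keys <= target) if strict else (keys < target):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left, right = first(False), first(True) - 1
    return (left, right) if left <= right else None


text = "abbabab#"                               # illustrative text
psi, ends = build(text)
print(search("ab", len(text), psi, ends), search("aa", len(text), psi, ends))
```

In this sketch each comparison reads at most |P| symbols, so the whole search performs O(log n) substring extractions.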

  47. Sadakane – decompress( I ) Goal: Using only SA and its functions, return the substring of T pointed to by I = [s, e].

  48. Sadakane – decompress( I ) Proposition: A substring of length l = e − s + 1 can be decompressed using only SA, SA^-1 and C^-1 in O(l + log^ε n) time, where n is the length of the original text.
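
The interplay of inverse and decompress can be sketched as follows. This is an illustration under assumptions, not Sadakane's actual layout: SA^-1 is sampled every `step` text positions instead of being stored at the last level of a recursion, Ψ is the full successor function held in a plain list, and all names (`build`, `inverse`, `decompress`, `step`) are hypothetical.

```python
from bisect import bisect_left

ORDER = "a#b"                                   # symbol order a < # < b


def build(text, step):
    """Build psi, cumulative counts and a sample of SA^-1; T is needed only here."""
    rank_of = {c: r for r, c in enumerate(ORDER)}
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda p: [rank_of[c] for c in text[p - 1:]])
    where = {v: i for i, v in enumerate(sa, start=1)}        # full SA^-1, build time only
    psi = [0] + [where[v % n + 1] for v in sa]               # value n wraps around to 1
    ends, total = [], 0
    for ch in ORDER:
        total += text.count(ch)
        ends.append(total)
    inv_samples = [where[p] for p in range(1, n + 1, step)]  # SA^-1 at 1, 1 + step, ...
    return psi, ends, inv_samples


def inverse(j, psi, inv_samples, step):
    """Index i with SA[i] = j: start at the nearest sample at or before j, then follow psi."""
    i = inv_samples[(j - 1) // step]
    for _ in range((j - 1) % step):
        i = psi[i]
    return i


def decompress(s, e, psi, ends, inv_samples, step):
    """Return T[s..e] (1-based, inclusive) without access to T."""
    i = inverse(s, psi, inv_samples, step)
    out = []
    for _ in range(e - s + 1):
        out.append(ORDER[bisect_left(ends, i)])   # C^-1(i): first symbol of suffix i
        i = psi[i]                                # move to the next text position
    return "".join(out)


text = "abbabab#"                                 # illustrative text
parts = build(text, step=4)
print(decompress(2, 5, *parts, 4))
```

After the initial inverse lookup, every further symbol costs one C^-1 evaluation and one Ψ step, which matches the shape of the O(l + log^ε n) bound.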

  49. Sadakane – Complexity Using inverse, search and decompress it is possible to store T implicitly. Therefore O( n ) words are no longer required. The space complexity of the Sadakane-improved suffix array is only 37% of that of a Grossi & Vitter suffix array including the text
