Optimal Partitions of Strings: A new class of Burrows-Wheeler Compression Algorithms

Raffaele Giancarlo (raffaele@math.unipa.it) and Marinella Sciortino (mari@math.unipa.it), University of Palermo, Italy.

Presentation Transcript


  1. Optimal Partitions of Strings: A new class of Burrows-Wheeler Compression Algorithms • Raffaele Giancarlo (raffaele@math.unipa.it) • Marinella Sciortino (mari@math.unipa.it) • University of Palermo, Italy

  2. Outline of the Talk • The Burrows-Wheeler Transform [BW94]: abraca → bacraa, 1 • BWT Compression Algorithms: I → BWT → MTF → H/AC → O • A New Class of Algorithms • Combinatorial Dependency [BCCFM00, BFG02] • Lower Bound on Compression Performance • Conjecture by Manzini [M01] • Universal Encoding of Integers [L68, E75]

  3. Burrows-Wheeler Transform • TRANSFORM: abraca → bacraa, 1 • ANTI-TRANSFORM: bacraa, 1 → abraca

  4. The Transform • INPUT: w = abraca • Sort the cyclic rotations of w lexicographically, comparing them from right to left • Sorted rotations (F is the first column, L the last): row 0: bracaa; row 1: abraca (row I); row 2: caabra; row 3: racaab; row 4: aabrac; row 5: acaabr • OUTPUT: BWT(w) = bacraa and the index I = 1

  5. The Transform • Sorted rotations (F is the first column, L the last): row 0: bracaa; row 1: abraca (row I); row 2: caabra; row 3: racaab; row 4: aabrac; row 5: acaabr • Essential Properties: • For all i ≠ I, the character L[i] is followed in w by F[i]; • For each character x, the i-th occurrence of x in L corresponds to the i-th occurrence of x in F.
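The transform on slides 4 and 5 can be sketched in a few lines of Python. This is only an illustration: the function name `bwt_right_to_left` and the tie-breaking by start position are assumptions, and the quadratic-space construction is chosen for readability rather than efficiency.

```python
def bwt_right_to_left(w):
    """Return (F, I) for the right-to-left sorted rotation matrix of w.

    F is the first column of the sorted matrix and I is the row holding w.
    """
    n = len(w)
    rotations = [w[i:] + w[:i] for i in range(n)]
    # Comparing rotations from right to left is the same as comparing
    # their reversals from left to right. Ties (equal rotations) are
    # broken by start position, which is an assumption of this sketch.
    order = sorted(range(n), key=lambda i: (rotations[i][::-1], i))
    first_column = "".join(rotations[i][0] for i in order)
    index = order.index(0)  # row that contains w itself
    return first_column, index


print(bwt_right_to_left("abraca"))  # ('bacraa', 1)
```

With this ordering the last column L is simply the sorted multiset of characters of w, which is why only F and I need to be output.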

  6. The Anti-Transform • Given F = BWT(w) = bacraa and I = 1: • Construct L by lexicographically sorting the elements of F: F = b a c r a a and L = a a a b c r (positions 0 to 5) • Permutation τ: positions 0 1 2 3 4 5 are mapped to 1 3 4 5 0 2 • Starting from row I, w is recovered one character at a time: w = a b r a c a
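A possible way to invert the transform uses only the second essential property (the i-th occurrence of x in L corresponds to the i-th occurrence of x in F). The sketch below is not necessarily the permutation τ shown on the slide; the function name `inverse_bwt_right_to_left` and the helper arrays are hypothetical.

```python
def inverse_bwt_right_to_left(f, index):
    """Recover w from F = BWT(w) and the row index I."""
    n = len(f)
    # L is F sorted. A stable sort of the positions of F pairs the i-th
    # occurrence of each character in L with its i-th occurrence in F.
    f_positions_in_l_order = sorted(range(n), key=lambda i: f[i])
    to_l = [0] * n
    for l_pos, f_pos in enumerate(f_positions_in_l_order):
        to_l[f_pos] = l_pos
    # Row to_l[r] is row r rotated left by one character, so reading F
    # along this chain spells w from left to right.
    out = []
    row = index
    for _ in range(n):
        out.append(f[row])
        row = to_l[row]
    return "".join(out)


print(inverse_bwt_right_to_left("bacraa", 1))  # abraca
```

Sorting dominates the running time here; with a counting sort over the alphabet the same idea runs in linear time.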

  7. Why Useful • INTUITION: Consider the effect on a single letter in a common word in a block of English text: w = …The…the…The…the…those…the…the…that…the… • The characters following th are grouped together inside BWT(w). • Figure: sorted rotations whose L side ends in th or Th, with F entries e, a, e, e, e, e, o, e. • Extensive experimental work confirms this “clustering effect” [BW94, F96].

  8. Why Useful • “Clustering” of Symbols and MTF • Move-to-Front Coding (MTF) [BeSTaWe86]: encodes an instance of a character x by an integer that counts the number of distinct symbols seen since the latest occurrence of x. • EXAMPLE: abaaaabbbbbcccccaaaaa → 011000100002000020000 • BWT + MTF = many runs of zeroes, which is good for order-0 encoders • Relation between the compressibility of files and a high percentage of zeroes [F96]
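Move-to-front coding as described on this slide can be sketched as follows. The slide does not say how the symbol list is initialised; the sketch assumes it starts as the sorted alphabet of the input, and the name `move_to_front` is chosen for illustration.

```python
def move_to_front(s, alphabet=None):
    """Encode s as a list of ranks in a self-organising symbol list."""
    # Assumption: the list starts as the sorted alphabet of s.
    symbols = sorted(set(s)) if alphabet is None else list(alphabet)
    ranks = []
    for ch in s:
        rank = symbols.index(ch)  # distinct symbols seen since last ch
        ranks.append(rank)
        symbols.insert(0, symbols.pop(rank))  # move ch to the front
    return ranks


codes = move_to_front("abaaaabbbbbcccccaaaaa")
print("".join(map(str, codes)))  # 011000100002000020000
```

Each run of a repeated symbol becomes a run of zeroes after its first position, which is the effect an order-0 coder exploits after the BWT.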

  9. Two Main Research Questions • Is MTF an essential step for the successful use of BWT [F96]? • Experiments [AM97, BK98, WM01]; • Theory? • Analysis of the compression performance of BWT-based algorithms. • Experiments (see DCC); • Information Theory [Ef99, Sa98]; • Worst Case Setting • Empirical Entropy of Strings [M01] - No Assumptions

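  10. Zero-th Order Empirical Entropy • $s$ is a string over the alphabet $\Sigma = \{a_1, a_2, \ldots, a_h\}$ • $n_i$ is the number of occurrences of $a_i$ in $s$. Assume that $n_i \ge n_{i+1}$ • The zero-th order empirical entropy of $s$: $H_0(s) = -\sum_{i=1}^{h} \frac{n_i}{|s|} \log \frac{n_i}{|s|}$ • The zero-th order modified empirical entropy [M01]: $H_0^*(s) = 0$ if $|s| = 0$; $H_0^*(s) = (1 + \lfloor \log |s| \rfloor)/|s|$ if $|s| \ne 0$ and $H_0(s) = 0$; $H_0^*(s) = H_0(s)$ otherwise.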
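For concreteness, the two zero-th order quantities can be computed as below. This is a sketch under the assumption that the definitions are the standard ones from [M01] with logarithms in base 2; the function names `h0` and `h0_star` are illustrative.

```python
import math
from collections import Counter


def h0(s):
    """Zero-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # sum of (n_i/n) * log2(n/n_i), i.e. -sum (n_i/n) * log2(n_i/n)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


def h0_star(s):
    """Modified zero-th order empirical entropy (assumed as in [M01])."""
    n = len(s)
    if n == 0:
        return 0.0
    h = h0(s)
    if h == 0:
        # (1 + floor(log2 n)) / n; n.bit_length() equals 1 + floor(log2 n)
        return n.bit_length() / n
    return h


print(h0("abab"), h0_star("aaaa"))  # 1.0 0.75
```

The modified entropy differs from $H_0$ only on strings with a single distinct symbol, where $H_0$ is zero but the length of the string still has to be encoded.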

  11. k-th Order Empirical Entropy • $\Sigma^k$ is the set of all strings of length $k$ over $\Sigma$ • $\Sigma^{\le k}$ is the set of all strings of length at most $k$ • Fix an integer $k \ge 0$; for any string $y$ in $\Sigma^{\le k}$, $y_s$ is the string consisting of the characters following $y$ in $s$. • The k-th order empirical entropy of $s$ is $H_k(s) = \frac{1}{|s|} \sum_{y \in \Sigma^k} |y_s| H_0(y_s)$ • The k-th order modified empirical entropy: $H_k^*(s) = \frac{1}{|s|} \sum_{y \in T_k} |y_s| H_0^*(y_s)$, where $T_k$ denotes a set of strings in $\Sigma^{\le k}$ such that each string in $\Sigma^k$ has a unique suffix in $T_k$ and such that, among the sets having this property, $T_k$ is the one minimizing the right-hand side.
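The k-th order empirical entropy can be computed directly from its definition by grouping the characters that follow each length-k context. The sketch assumes the standard form $H_k(s) = \frac{1}{|s|} \sum_y |y_s| H_0(y_s)$ with base-2 logarithms; the names `h0` and `h_k` are illustrative, and the modified version $H_k^*$ (which requires a minimisation over the sets $T_k$) is not covered.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Zero-th order empirical entropy of a sequence, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


def h_k(s, k):
    """k-th order empirical entropy of s (assumed standard definition)."""
    if not s:
        return 0.0
    followers = defaultdict(list)
    for i in range(len(s) - k):
        context = s[i:i + k]            # a string y of length k
        followers[context].append(s[i + k])  # next character of y_s
    total = sum(len(ys) * h0(ys) for ys in followers.values())
    return total / len(s)


print(h_k("abababab", 0), h_k("abababab", 1))  # 1.0 0.0
```

The example shows why higher-order entropy matters: the string looks random to an order-0 model but is fully predictable once one character of context is known.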

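  12. Results by Manzini • Let BW0 be a BWT-based algorithm with an arithmetic coder as zero-th order compressor. Then, for all $k \ge 0$, [bound missing from the transcript] • Let BW0RL be a BWT-based algorithm using run-length encoding with an arithmetic coder as zero-th order compressor. Then, for all $k \ge 0$, there exists $g_k' \ge 0$ such that [bound missing from the transcript] • where [symbol missing from the transcript] $= 10^{-2}$.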

  13. Insights by Manzini • THEOREM (Manzini): Let $s$ be a string. For each $k \ge 0$, there exist an $f \le h^k$ and a partition $s'_1, s'_2, \ldots, s'_f$ of BWT($s$) such that $|s| H_k(s) = \sum_{i=1}^{f} |s'_i| H_0(s'_i)$. An analogous result holds for $H_k^*(s)$. • REMARK: If there existed an ideal compressor A such that, for any partition $s_1, s_2, \ldots, s_p$ of a string $s$, $A(s) \le \sum_{i=1}^{p} |s_i| H_0(s_i)$, then $A(\mathrm{BWT}(s)) \le |s| H_k(s)$. Analogously for $H_k^*(s)$. • We show that A does not exist. Fortunately, we can approximate it.

  14. Open Problems by Manzini • Conjectures by Manzini: • No BWT-compression method can get to a bound of the form $|s| H_k^*(s) + g_k$ for $k \ge 0$ and $g_k \ge 0$ constant. • The ideal algorithm A does not exist. • We prove that both conjectures are true.

  15. Our Contributions • We provide a new class of BWT-based algorithms, based on partitions of strings, that do not use MTF as part of the compression process. • We analyze two of those new methods in the worst-case setting. We obtain better theoretical bounds than Manzini. • Under a natural hypothesis on the inner workings of the algorithm, no BWT-compressor using that type of algorithm can achieve $|w| H_k^*(w) + g_k$.

  16. Algorithms That Use Optimal Partitions of Strings (rather than MTF) • Compute BWT(s); • Optimally partition the transformed string; • Compress each piece separately.

  17. Combinatorial Dependency • Techniques by Buchsbaum et al. [BCCFM00, BFG02] for Table Compression. • Surprisingly, it specializes to strings. • Fix a data compressor C that adds a special end-of-string marker # before compressing the string. • DEFINITION: Two strings x and y are combinatorially dependent with respect to the data compressor C if |C(xy#)| < |C(x#)| + |C(y#)|. • OPTIMAL PARTITION IN TERMS OF THE BASE COMPRESSOR C: by Dynamic Programming.
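The dynamic program for an optimal partition can be sketched as follows. The base compressor is abstracted as a cost function returning |C(x#)| for a piece x; `toy_cost` is a made-up stand-in used only to make the example deterministic, not a compressor from the talk, and the names `optimal_partition`, `best` and `cut` are illustrative.

```python
def optimal_partition(s, cost):
    """Split s into contiguous pieces minimising the sum of cost(piece)."""
    n = len(s)
    best = [0] + [None] * n   # best[i]: minimal cost of the prefix s[:i]
    cut = [0] * (n + 1)       # cut[i]: start of the last piece in s[:i]
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(s[j:i])
            if best[i] is None or candidate < best[i]:
                best[i], cut[i] = candidate, j
    pieces = []
    i = n
    while i > 0:              # walk the cut points backwards
        pieces.append(s[cut[i]:i])
        i = cut[i]
    pieces.reverse()
    return best[n], pieces


def toy_cost(piece):
    # Hypothetical cost: a fixed overhead of 2 plus a per-symbol price
    # that grows with the number of distinct symbols in the piece.
    return 2 + len(piece) * len(set(piece))


print(optimal_partition("aaabbb", toy_cost))  # (10, ['aaa', 'bbb'])
```

The double loop evaluates the cost function on a quadratic number of substrings, so the running time is dominated by how expensive the base compressor is on each candidate piece.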

  18. The new class BWTOPT • ASSUMPTIONS: Let C be a data compressor such that: • given an input string x, it adds a special end-of-string marker # and compresses x#; • either # is really appended at the end of the string, or the length of x is explicitly stored as a prefix of the compressed string (universal encoding of integers). • Given the input string s: 1. Compute BWT(s); 2. Optimally partition BWT(s) using C as the base compressor; 3. Compress each piece of the partition separately. • TIME COMPLEXITY of BWTOPT: It depends critically on that of C and it is $\Omega(n^2)$. Fortunately, if C has a linear-time decompression algorithm, then BWTOPT also admits a linear-time decompression algorithm.
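Storing the length of a piece as a prefix requires a self-delimiting code for integers. The slides do not say which universal code is used; as one well-known example, the sketch below implements the Elias gamma code, with bit strings represented as Python strings of '0' and '1' for readability. The function names are illustrative.

```python
def elias_gamma_encode(n):
    """Encode a positive integer n as a self-delimiting bit string."""
    if n < 1:
        raise ValueError("Elias gamma encodes integers >= 1")
    binary = bin(n)[2:]                  # binary digits, leading bit is 1
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode one integer from the front of bits; return (n, rest)."""
    zeros = 0
    while bits[zeros] == "0":            # unary part gives the bit length
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2), bits[end:]


print(elias_gamma_encode(5))             # 00101
print(elias_gamma_decode("0010111"))     # (5, '11')
```

Because the code is defined only for integers of at least 1, a length field that may be zero would have to be shifted by one before encoding; that choice is an assumption of this sketch.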

  19. A prefix code compressor HC • # is an end-of-string marker. • The base compressor C is a modification of Huffman encoding so that we can encode # basically for free. • THEOREM: Consider a string $s$. Let $p_1, p_2, \ldots, p_h$ be the empirical probability distribution of $s$. Then [bound missing from the transcript]

  20. A compressor RHC based on Prefix and Run Length Encoding • It combines Huffman encoding with run-length encoding. • It uses knowledge about the symbol frequencies in a string. For low-entropy strings it is essential to use RLE. • The RLE scheme we use depends critically on a variable-length encoding of a sequence of integers. • PROBLEM: Given two positive integers $t$ and $w$, $t < w$, and the increasing sequence of integers $d_1, d_2, \ldots, d_t$ in $[1, w]$, find an algorithm to produce a binary encoding of $d_1, d_2, \ldots, d_t$ and $w$. The solution we propose works well in conjunction with CD, where the strings we need to compress may consist of only a few symbols. • THEOREM: Consider a string $s$. Let $p_1, p_2, \ldots, p_h$ be the empirical probability distribution of $s$. Then [bound missing from the transcript]

  21. Lower Bound • ASSUMPTIONS: • Given a compressor C, we assume that $\{C(a^n) \mid n > 0\}$ is a codeword set for the integers. • For technical reasons we also assume that $|C(a^n)|$ is a non-decreasing function of $n$. • The lower bound comes from a theorem in [Levenshtein, 1968], which we restate in our notation: • THEOREM: There exist countably many strings $s$ such that $|C(s)| \ge |s| H_k^*(s) + f(|s|)$, where $f(n)$ is a diverging function of $n$. • COROLLARY: No compression algorithm satisfying the previous assumptions can achieve the bound formulated in the conjecture by Manzini, i.e. $|s| H_k^*(s) + g_k$ for $k \ge 0$ and $g_k \ge 0$ constant. Such a result holds independently of whether or not BWT is applied as a preprocessing step.
