Data Compression

Lempel-Ziv algorithms and Burrows-Wheeler

Lempel-Ziv Algorithms

LZ77 (Sliding Window)
• Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
• Applications: gzip, Squeeze, LHA, PKZIP

LZ78 (Dictionary Based)
• Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress)
• Applications: compress, GIF, CCITT (modems), ARC, PAK

Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

LZ overview

a a c a a c a b c a b a b a c
(everything to the left of the cursor has been previously coded)

With the cursor at position i: find the longest prefix of S[i,m] that appears in S[1,i-1], and write its position and length.

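a a c a a c a b c a b a b a c

a (1,1) c (1,3) (1,1) b (6,3) (7,2) (2,2)

Each pair is (position, length) with 1-based positions; a character with no earlier occurrence is written literally.

Implement by maintaining a suffix tree for the coded part. This gives linear time and space (e.g. via Ukkonen).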
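The parsing rule above can be sketched naively as follows. This is only an illustration under stated assumptions: the function name `lz_parse` is hypothetical, positions are 1-based, the leftmost earlier occurrence is reported, and the match must lie entirely inside the already coded part. It uses a plain substring search rather than the suffix tree mentioned above, so it is not linear time.

```python
def lz_parse(s):
    """Greedy LZ parse: literal chars and (position, length) pairs (1-based)."""
    out = []
    i = 0
    while i < len(s):
        best_pos, best_len = -1, 0
        length = 1
        # Grow the candidate prefix while it still occurs inside s[0:i].
        while i + length <= len(s):
            pos = s.find(s[i:i + length], 0, i)
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            out.append(s[i])              # no earlier occurrence: literal
            i += 1
        else:
            out.append((best_pos + 1, best_len))
            i += best_len
    return out


print(lz_parse("aacaacabcababac"))
```

A suffix tree over the coded part would replace the repeated `find` calls; the output format stays the same.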

LZ77: Sliding Window

a a c a a c a b c a b a b a c
Dictionary (previously coded) | Cursor | Lookahead Buffer

Dictionary and buffer “windows” are fixed length and slide with the cursor.
On each step:
• Output (p,l,c)
  p = relative position of the longest match in the dictionary
  l = length of longest match
  c = next char in buffer beyond longest match
• Advance window by l + 1

LZ77: Example

Input: a a c a a c a b c a b a a a c
Dictionary (size = 6)

Step  Output   Longest match  Next character
1     (0,0,a)  (none)         a
2     (1,1,c)  a              c
3     (3,4,b)  a a c a        b
4     (3,3,a)  c a b          a
5     (1,2,c)  a a            c
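A minimal sketch of the sliding-window step, assuming a dictionary of 6 characters as in the example and no separate limit on the lookahead buffer (the slides do not state one). The names `lz77_encode` and `lz77_decode` are illustrative. A match may run past the cursor into the text being coded, which is how step 3 reaches length 4 from distance 3.

```python
def lz77_encode(s, window=6):
    """Return a list of (p, l, c) triples; p is the distance back from the cursor."""
    out = []
    i = 0
    while i < len(s):
        best_dist, best_len = 0, 0
        for dist in range(1, min(window, i) + 1):
            length = 0
            # Stop one short of the end so that a "next char" always exists.
            while i + length < len(s) - 1 and s[i - dist + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_dist, best_len = dist, length
        out.append((best_dist, best_len, s[i + best_len]))
        i += best_len + 1                 # advance window by l + 1
    return out


def lz77_decode(triples):
    out = []
    for dist, length, ch in triples:
        for _ in range(length):
            out.append(out[-dist])        # char-by-char copy handles overlap
        out.append(ch)
    return "".join(out)


triples = lz77_encode("aacaacabcabaaac")
print(triples)
print(lz77_decode(triples))
```

On the example string this reproduces the five triples listed above, and decoding returns the original input.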

LZ77 Optimizations used by gzip

LZSS: Output one of the following formats:
(0, position, length) or (1, char)
Typically use the second format if length < 3.

Input: a a c a a c a b c a b a a a c
(1,a)
(1,a)
(1,c)
(0,3,4)
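The two output formats can be sketched like this, assuming the same dictionary size of 6 as in the earlier example and a minimum match length of 3. The function names are hypothetical, and real gzip adds Huffman coding on top, which is not shown.

```python
def lzss_encode(s, window=6, min_match=3):
    """Emit (1, char) literals or (0, distance, length) matches."""
    out = []
    i = 0
    while i < len(s):
        best_dist, best_len = 0, 0
        for dist in range(1, min(window, i) + 1):
            length = 0
            while i + length < len(s) and s[i - dist + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_dist, best_len = dist, length
        if best_len < min_match:
            out.append((1, s[i]))         # short match: cheaper as a literal
            i += 1
        else:
            out.append((0, best_dist, best_len))
            i += best_len
    return out


def lzss_decode(tokens):
    out = []
    for tok in tokens:
        if tok[0] == 1:
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return "".join(out)


tokens = lzss_encode("aacaacabcabaaac")
print(tokens[:4])
```

The first four tokens for the example string are the ones shown above; the rest of the token stream depends on the assumed window and threshold.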

Optimizations used by gzip (cont.)

• Huffman code the positions, lengths and chars
• Non greedy: possibly use a shorter match so that the next match is better
• Use a hash table to store the dictionary.
  • Hash is based on strings of length 3.
  • Find the longest match within the correct hash bucket.
  • Limit on length of search.
  • Store within bucket in order of position
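The hash-table idea can be sketched as follows. This is not gzip's actual code: the names `insert_position` and `longest_match`, the use of the 3-character string itself as the dictionary key, and the `max_chain` value are all assumptions made for illustration.

```python
from collections import defaultdict

MIN_MATCH = 3  # buckets are keyed on strings of length 3


def insert_position(s, j, table):
    """Record position j under the 3-char string that starts there."""
    if j + MIN_MATCH <= len(s):
        table[s[j:j + MIN_MATCH]].append(j)   # appended in order of position


def longest_match(s, i, table, window=32768, max_chain=8):
    """Search only the bucket for s[i:i+3], newest first, at most max_chain entries."""
    best_dist, best_len = 0, 0
    candidates = table.get(s[i:i + MIN_MATCH], [])
    for pos in reversed(candidates[-max_chain:]):
        if i - pos > window:
            break                              # older entries are out of the window
        length = 0
        while i + length < len(s) and s[pos + length] == s[i + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = i - pos, length
    return best_dist, best_len


table = defaultdict(list)
text = "abcabcabc"
for j in range(3):
    insert_position(text, j, table)
print(longest_match(text, 3, table))
```

Capping the number of bucket entries examined trades compression ratio for speed, which is the point of the "limit on length of search" bullet.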

LZ78: Dictionary Lempel-Ziv

Basic algorithm:
• Keep a dictionary of words with an integer id for each entry (e.g. keep it as a trie).
• Coding loop
  • Find the longest match S in the dictionary
  • Output the entry id of the match and the next character past the match from the input: (id,c)
  • Add the string Sc to the dictionary
• Decoding keeps the same dictionary and looks up ids

LZ78: Coding Example

Input: a a b a a c a b c a b c b

Output  Dict.
(0,a)   1 = a
(1,b)   2 = ab
(1,a)   3 = aa
(0,c)   4 = c
(2,c)   5 = abc
(5,b)   6 = abcb
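A minimal sketch of the LZ78 coding loop, using a Python dict in place of a trie. The name `lz78_encode` and the handling of input that ends in the middle of a match are assumptions; the slides do not cover that case.

```python
def lz78_encode(s):
    """Return (id, char) pairs; id 0 stands for the empty string."""
    dictionary = {"": 0}
    out = []
    cur = ""
    for ch in s:
        if cur + ch in dictionary:
            cur += ch                          # keep extending the match
        else:
            out.append((dictionary[cur], ch))  # (id of S, next char c)
            dictionary[cur + ch] = len(dictionary)  # add Sc
            cur = ""
    if cur:
        # Input ended inside a match: emit its prefix id plus its last char.
        out.append((dictionary[cur[:-1]], cur[-1]))
    return out


print(lz78_encode("aabaacabcabcb"))
```

For the example input this yields the six pairs in the table above.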

LZ78: Decoding Example

Input  Dict.     Decoded so far
(0,a)  1 = a     a
(1,b)  2 = ab    a a b
(1,a)  3 = aa    a a b a a
(0,c)  4 = c     a a b a a c
(2,c)  5 = abc   a a b a a c a b c
(5,b)  6 = abcb  a a b a a c a b c a b c b
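The decoder rebuilds the same dictionary as it reads the pairs. A short sketch, with the hypothetical name `lz78_decode` and a list indexed by entry id:

```python
def lz78_decode(pairs):
    """Invert LZ78: each (id, char) pair yields entries[id] + char."""
    entries = [""]                 # id 0 is the empty string
    out = []
    for idx, ch in pairs:
        word = entries[idx] + ch
        entries.append(word)       # same dictionary growth as the coder
        out.append(word)
    return "".join(out)


pairs = [(0, "a"), (1, "b"), (1, "a"), (0, "c"), (2, "c"), (5, "b")]
print(lz78_decode(pairs))
```

Because every pair carries its own next character, the decoder never has to guess an entry, unlike LZW.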

LZW (Lempel-Ziv-Welch) [Welch84]

Don’t send the extra character c, but still add Sc to the dictionary.
The dictionary is initialized with the byte values as its first 256 entries; otherwise there is no way to start it up. (The examples here use a = 112, b = 113, c = 114; in actual ASCII, a = 97.)
The decoder is one step behind the coder since it does not know c.
• Only an issue for strings of the form SSc where S[0] = c, and these are handled specially

LZW: Encoding Example

Input: a a b a a c a b c a b c b

Output  Dict.
112     256 = aa
112     257 = ab
113     258 = ba
256     259 = aac
114     260 = ca
257     261 = abc
260     262 = cab
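A minimal sketch of LZW encoding over bytes, with the hypothetical name `lzw_encode`. It initializes the dictionary with real byte values, so the letters a, b, c come out as 97, 98, 99 rather than the 112, 113, 114 used in the table; the codes from 256 upward line up with the table. Dictionary size limits are ignored here.

```python
def lzw_encode(data):
    """Return a list of integer codes for a bytes input."""
    dictionary = {bytes([i]): i for i in range(256)}
    out = []
    cur = b""
    for b in data:
        nxt = cur + bytes([b])
        if nxt in dictionary:
            cur = nxt                         # keep extending the match S
        else:
            out.append(dictionary[cur])       # send only the id of S
            dictionary[nxt] = len(dictionary) # still add Sc
            cur = bytes([b])                  # c starts the next match
    if cur:
        out.append(dictionary[cur])
    return out


print(lzw_encode(b"aabaacabcabcb"))
```

The first seven codes correspond to the seven rows above; the sketch then continues to the end of the input.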

LZW: Decoding Example

String being decoded: a a b a a c a b c a b c b

Input  Dict
112
112    256 = aa
113    257 = ab
256    258 = ba
114    259 = aac
257    260 = ca
260    261 = abc
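A sketch of the decoder, including the special case where a code arrives before the decoder has added it to its dictionary. The name `lzw_decode` is hypothetical, and codes below 256 are real byte values (a = 97), not the 112-based values of the table.

```python
def lzw_decode(codes):
    """Invert LZW; the decoder's dictionary lags the coder's by one entry."""
    if not codes:
        return b""
    entries = [bytes([i]) for i in range(256)]
    prev = entries[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(entries):
            word = entries[code]
        else:
            # Code not known yet: the coder used the entry it had just
            # created, which must be prev followed by prev's first byte.
            word = prev + prev[:1]
        entries.append(prev + word[:1])   # the entry the coder added one step ago
        out.append(word)
        prev = word
    return b"".join(out)


print(lzw_decode([97, 97, 98, 256, 99, 257, 260, 98, 99, 98]))
print(lzw_decode([97, 256, 97]))          # exercises the special case
```

The second call decodes to four a's: code 256 is received before the decoder has defined it, and the special-case branch reconstructs it.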

LZ78 and LZW issues

What happens when the dictionary gets too large?
• Throw the dictionary away when it reaches a certain size (used in GIF)
• Throw the dictionary away when it is no longer effective at compressing (used in unix compress)
• Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

Burrows-Wheeler

Currently the best “balanced” algorithm for text.
Basic idea:
• Sort the characters by their full context (typically done in blocks). This is called the block sorting transform.
• Use move-to-front encoding to encode the sorted characters.
The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

Step I: build the matrix M consisting of all cyclic shifts of S (with the end marker # appended), for S = abraca:

M =
a b r a c a #
b r a c a # a
r a c a # a b
a c a # a b r
c a # a b r a
a # a b r a c
# a b r a c a

Step II: sort the rows in lexicographic order:

F           L
# a b r a c a
a # a b r a c
a b r a c a #
a c a # a b r
b r a c a # a
c a # a b r a
r a c a # a b

L, the last column, is the Burrows-Wheeler Transform.
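The two steps can be sketched directly by building and sorting all rotations. This is a quadratic illustration only; the function name `bwt_by_rotations` is hypothetical, and it assumes the end marker # does not occur in S and sorts before every character of S (true for lowercase letters in ASCII).

```python
def bwt_by_rotations(s, end="#"):
    """Return the first column F and last column L of the sorted rotation matrix."""
    t = s + end
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


first, last = bwt_by_rotations("abraca")
print(first)
print(last)
```

For S = abraca this gives F = #aaabcr and L = ac#raab, matching the sorted matrix above.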

Claim: Every column contains all chars.

Unsorted M       Sorted M
                 F           L
a b r a c a #    # a b r a c a
b r a c a # a    a # a b r a c
r a c a # a b    a b r a c a #
a c a # a b r    a c a # a b r
c a # a b r a    b r a c a # a
a # a b r a c    c a # a b r a
# a b r a c a    r a c a # a b

You can obtain F from L by sorting.

F           L
# a b r a c a
a # a b r a c
a b r a c a #
a c a # a b r
b r a c a # a
c a # a b r a
r a c a # a b

The a's are in the same order in L and in F, and similarly for every other char.

From L you can reconstruct the string

F  L
#  a
a  c
a  #
a  r
b  a
c  a
r  b

What is the first char of S? It is a. Continuing step by step recovers a, then ab, then abr, and so on.
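The reconstruction can be sketched using the two facts stated earlier: F is L sorted, and equal characters appear in the same relative order in F and in L. The name `inverse_bwt` is hypothetical, and the sketch assumes # is a unique end marker that sorts before all other characters.

```python
def inverse_bwt(last, end="#"):
    """Recover S from the last column L of the sorted rotation matrix."""
    n = len(last)
    # Stable sort of the rows of L by character: order[j] is the row whose
    # L entry is the same character occurrence as F[j].
    order = sorted(range(n), key=lambda j: last[j])
    first = [last[j] for j in order]
    row = last.index(end)        # this row is S# itself, so F[row] = S[0]
    out = []
    for _ in range(n - 1):
        out.append(first[row])
        row = order[row]         # jump to the row that ends with that char
    return "".join(out)


print(inverse_bwt("ac#raab"))
```

Each jump moves to the row whose last character is the one just emitted, so that row's first character is the next character of S.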

Compression?

L = a c # r a a b

Compress the transform to a string of integers using move to front:
0 2 3 2 0 3
Then use Huffman to code the integers.
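A small sketch of move-to-front coding. The function names are hypothetical, and the initial symbol order is an assumption (here the sorted alphabet #, a, b, c, r); the integers produced depend on that initial order, so they need not match the ones printed on the slide.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its current index, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out


def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


codes = mtf_encode("ac#raab", "#abcr")
print(codes)
print(mtf_decode(codes, "#abcr"))
```

Runs of equal characters in L turn into runs of zeros, which is what makes the subsequent Huffman coding effective.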

Why is it good?

F           L
# a b r a c a
a # a b r a c
a b r a c a #
a c a # a b r
b r a c a # a
c a # a b r a
r a c a # a b

Characters with the same (right) context appear together.

Sorting is equivalent to computing the suffix array.

Can encode and decode in linear time.
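The link to the suffix array can be sketched as follows. With a unique end marker that sorts first, sorting the suffixes of S# gives the same row order as sorting the rotations, and each entry of L is the character just before the corresponding suffix. The name `bwt_from_suffix_array` is hypothetical, and the suffix array here is built by naive sorting, which is not linear time; a linear-time construction would be substituted in practice.

```python
def bwt_from_suffix_array(s, end="#"):
    """Compute L from the suffix array of s + end (naive construction)."""
    t = s + end
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # The char preceding suffix i is t[i - 1]; for i = 0 this wraps to the end marker.
    return "".join(t[i - 1] for i in suffix_array)


print(bwt_from_suffix_array("abraca"))
```

For S = abraca the suffix array is 6, 5, 0, 3, 1, 4, 2 (0-based), and the result is the same L as obtained from the sorted rotation matrix.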
