
15-211 Fundamental Data Structures and Algorithms

15-211 Fundamental Data Structures and Algorithms. LZW Compression. Aleks Nanevski February 10, 2004 based on a lecture by Peter Lee. Last Time…. Problem: data compression. Convert a string into a shorter string. Lossless – represents exactly the same information.


Presentation Transcript


  1. 15-211 Fundamental Data Structures and Algorithms • LZW Compression • Aleks Nanevski • February 10, 2004 • based on a lecture by Peter Lee

  2. Last Time…

  3. Problem: data compression • Convert a string into a shorter string. • Lossless – represents exactly the same information. • Lossy – approximates the original information. • Uses of compression: • Images over the web: JPEG • Music: MP3 • General-purpose: ZIP, GZIP, JAR, …

  4. Huffman trees

  5. Huffman’s algorithm • Huffman’s algorithm gives the optimal prefix code. • For a nice online demo, see • http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/huffman.html

  6. Huffman compression • Huffman trees provide a straightforward method for file compression. • 1. Scan the file and compute frequencies • 2. Build the code tree • 3. Write code tree to the output file as a header • 4. Scan input, encode, and write into the output file

  7. Huffman decompression • Read the header in the compressed file, and build the code tree • Read the rest of the file, decode using the tree • Write to output
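As a recap of the Huffman steps on the two slides above, here is a minimal Python sketch. It is not the course's implementation: the function names `huffman_code`, `encode`, and `decode` are made up for illustration, and the code table is kept in memory rather than written to the output file as a header.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each symbol of `text` to its bit string."""
    freq = Counter(text)                      # step 1: compute frequencies
    if not freq:
        return {}
    if len(freq) == 1:                        # one distinct symbol: give it one bit
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                      # step 2: merge the two lightest trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, code):
    """Step 4: replace every symbol by its code word."""
    return "".join(code[s] for s in text)

def decode(bits, code):
    """Decode by matching code words; this works because the code is prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

For example, `decode(encode("abracadabra", code), code)` with `code = huffman_code("abracadabra")` gives back the original string. A real compressor would also have to serialize the code tree, which this sketch skips.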

  8. Beating Huffman • How about doing better than Huffman! • Impossible! • Huffman’s algorithm gives the optimal prefix code! • Right. • But who says we have to use a prefix code?

  9. Example • Suppose we have a file containing • abcdabcdabcdabcdabcdabcd… abcdabcd • This could be expressed very compactly as • abcd^1000

  10. Dictionary-Based Compression

  11. Dictionary-based methods • Here is a simple idea: • Keep track of “words” that we have seen, and replace them with a code number when we see them again. • The code is typically shorter than the word • We can maintain dictionary entries • (word, code) • and make additions to the dictionary as we read the input file.

  12. Lempel & Ziv (1977/78)

  13. Fred Hacker’s algorithm… • Fred now knows what to do… • Create the dictionary: ( <the-whole-file>, 1 ) • Transmit 1, done.

  14. Right? • Fred’s algorithm provides excellent compression, but…

  15. Right? • Fred’s algorithm provides excellent compression, but… • …the receiver does not know what is in the dictionary! • And sending the dictionary is the same as sending the entire uncompressed file • Thus, we can’t decompress the “1”.

  16. Hence… • …we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

  17. LZW Compression: The Binary Version • LZW = variant of Lempel-Ziv compression, by Terry Welch (1984)

  18. Maintaining a dictionary • We need a way of incrementally building up a dictionary during compression in such a way that… • …someone who wants to uncompress can “rediscover” the very same dictionary • And we already know that a convenient way to build a dictionary incrementally is to use a trie

  19. Binary LZW • In this method, we build up binary tries • In a binary trie, each node has two children • In addition, we will add the following: • each left edge is marked 0 • each right edge is marked 1 • each leaf has a label from the set {0,…,n}

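  20. A binary trie • (Figure: a binary trie; left edges are marked 0, right edges 1.) • Leaves, as path from the root → label: 00 → 0, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2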

  21. Binary LZW: Compression • We start with a binary trie consisting of a root node and two children • left child labeled 0, and right labeled 1 • We read the bits of the input file, and follow the trie • When a leaf is reached, we emit the label at the leaf • Then, add two new children to that leaf (converting it into an internal node)

  22. Binary LZW: Compression, pt.2 • The new left child takes the old label • The new right child takes a new label value that is one greater than the current maximum label value
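The compression rules on the two slides above can be sketched in Python as follows. This is an illustrative sketch, not the lecture's code: the name `binary_lzw_compress` and the representation of trie nodes as dicts (a leaf holds `"label"`, an internal node holds children under `"0"` and `"1"`) are assumptions made here.

```python
def binary_lzw_compress(bits):
    """Compress a string of '0'/'1' characters; return the list of emitted labels."""
    # Initial trie: a root with two leaf children labeled 0 and 1.
    root = {"0": {"label": 0}, "1": {"label": 1}}
    max_label = 1
    node = root
    out = []
    for b in bits:
        node = node[b]                        # follow the edge marked with this bit
        if "label" in node:                   # reached a leaf
            label = node.pop("label")         # the leaf becomes an internal node
            out.append(label)                 # emit the label at the leaf
            max_label += 1
            node["0"] = {"label": label}      # new left child takes the old label
            node["1"] = {"label": max_label}  # new right child takes max + 1
            node = root                       # continue from the root
    if node is not root:
        # The input ran out while we were at an internal node.
        raise ValueError("input ended in the middle of the trie")
    return out
```

Calling `binary_lzw_compress("10010110011")` returns `[1, 0, 3, 4, 0, 2]`. The sketch raises an error when the input ends at an internal node, since the slides do not define an output for that case.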

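  23. Binary LZW: Compression example • Input: 10010110011 (no bits read yet) • Dictionary (path → label): 0 → 0, 1 → 1 • Output: (empty)

  24. Binary LZW: Compression example • Input: 10010110011 (read so far: 1) • Dictionary (path → label): 0 → 0, 10 → 1, 11 → 2 • Output: 1

  25. Binary LZW: Compression example • Input: 10010110011 (read so far: 10) • Dictionary (path → label): 00 → 0, 01 → 3, 10 → 1, 11 → 2 • Output: 10

  26. Binary LZW: Compression example • Input: 10010110011 (read so far: 1001) • Dictionary (path → label): 00 → 0, 010 → 3, 011 → 4, 10 → 1, 11 → 2 • Output: 103

  27. Binary LZW: Compression example • Input: 10010110011 (read so far: 1001011) • Dictionary (path → label): 00 → 0, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 1034

  28. Binary LZW: Compression example • Input: 10010110011 (read so far: 100101100) • Dictionary (path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 10340

  29. Binary LZW: Compression example • Input: 10010110011 (all bits read) • Dictionary (path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 110 → 2, 111 → 7 • Output: 103402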

  30. Binary LZW output • So from the input • 10010110011 • we get output • 103402 • To represent this output, we can keep track of the number of labels n each time we emit a code • and use ⌈log₂ n⌉ bits for that code

  31. Binary LZW output • We started with input 10010110011 • Encoded it as 103402, for which, at 3 bits per code, we get the bit sequence 001 000 011 100 000 010 • That is 18 bits of output for 11 bits of input: an expansion instead of a compression • But what if we have a larger input, with more repeating sequences? • Try it!
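To make the bit counts concrete, here is a small Python sketch of two ways to write the codes as bits. Both helper names are hypothetical. `fixed_width` reproduces the 3-bit sequence on the slide. `variable_width` is one reading of the "number of labels n" idea: it assumes the trie starts with 2 labels, gains one label per emitted code, and that each code is written with ⌈log₂ n⌉ bits.

```python
def fixed_width(codes, width):
    """Write every code with the same number of bits."""
    return " ".join(format(c, f"0{width}b") for c in codes)

def variable_width(codes, initial_labels=2):
    """Write each code with ceil(log2(n)) bits, n = labels in use when it is emitted."""
    n = initial_labels
    parts = []
    for c in codes:
        width = (n - 1).bit_length()   # equals ceil(log2(n)) for n >= 2
        parts.append(format(c, f"0{width}b"))
        n += 1                         # emitting a code adds exactly one new label
    return " ".join(parts)

codes = [1, 0, 3, 4, 0, 2]
print(fixed_width(codes, 3))     # 001 000 011 100 000 010  (18 bits)
print(variable_width(codes))     # 1 00 11 100 000 010      (14 bits)
```

Under these assumptions the variable-width form needs 14 bits, still more than the 11 input bits. A decoder can follow the same widths because it also adds one label per code it reads.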

  32. Binary LZW output • One can also use Huffman compression on the output…

  33. Binary LZW termination • Note that binary LZW has a serious problem, in that the input might end while we are in the middle of the trie (instead of at a leaf node) • This is a nasty problem • which is why we won’t use this binary method • But this is still good for illustration purposes…

  34. Binary LZW: Uncompress • To uncompress, we need to read the compressed file and rebuild the same trie as we go along • To do this, we need to maintain the trie and also the maximum label value
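A minimal Python sketch of this uncompress step is shown below. Instead of an explicit trie it keeps a table from each leaf label to that leaf's path from the root, which holds the same information as the trie's leaves. The function name `binary_lzw_decompress` and this representation are choices made here, not taken from the lecture.

```python
def binary_lzw_decompress(codes):
    """Rebuild the bit string from a list of emitted labels."""
    # Path from the root to the leaf carrying each label.
    path = {0: "0", 1: "1"}
    max_label = 1
    out = []
    for c in codes:
        p = path[c]
        out.append(p)                 # the bits the compressor read to reach this leaf
        max_label += 1
        path[c] = p + "0"             # new left child keeps the old label
        path[max_label] = p + "1"     # new right child gets the next label
    return "".join(out)
```

Calling `binary_lzw_decompress([1, 0, 3, 4, 0, 2])` returns `"10010110011"`. The table grows exactly as the compressor's trie does, which is why no dictionary has to be transmitted.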

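  35. Binary LZW: Uncompress example • Input: 103402 (no codes read yet) • Dictionary (path → label): 0 → 0, 1 → 1 • Output: (empty)

  36. Binary LZW: Uncompress example • Input: 103402 (read so far: 1) • Dictionary (path → label): 0 → 0, 10 → 1, 11 → 2 • Output: 1

  37. Binary LZW: Uncompress example • Input: 103402 (read so far: 10) • Dictionary (path → label): 00 → 0, 01 → 3, 10 → 1, 11 → 2 • Output: 10

  38. Binary LZW: Uncompress example • Input: 103402 (read so far: 103) • Dictionary (path → label): 00 → 0, 010 → 3, 011 → 4, 10 → 1, 11 → 2 • Output: 1001

  39. Binary LZW: Uncompress example • Input: 103402 (read so far: 1034) • Dictionary (path → label): 00 → 0, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 1001011

  40. Binary LZW: Uncompress example • Input: 103402 (read so far: 10340) • Dictionary (path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 100101100

  41. Binary LZW: Uncompress example • Input: 103402 (all codes read) • Dictionary (path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 110 → 2, 111 → 7 • Output: 10010110011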

  42. LZW Compression: The Byte Version

  43. Byte method • The binary LZW method doesn’t really work • we show it for illustrative purposes • Instead, we use a slightly more complicated version that works on bytes or characters • We can think of each byte as a “character” in the range {0…255}

  44. Byte method trie • Instead of a binary trie, we use a more general trie in which • each node can have up to n children (where n is the size of the alphabet), one for each byte/character • every node (not just the leaves) has an integer label from the set {0…m}, for some m • except the root node, which has no label

  45. Byte method LZW • We start with a trie that contains a root and n children • one child for each possible character • the children are labeled 0…n−1 • We compress as before, by walking down the trie • but, after emitting a code and growing the trie, we must start from the root’s child labeled c, where c is the character that caused us to grow the trie
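The byte method described above can be sketched in Python as follows. This sketch uses a dict keyed by strings in place of the trie (each key is the path from the root to a node), which is a simplification chosen here. The name `lzw_compress` and its `alphabet` parameter are hypothetical, and every input character is assumed to belong to the alphabet.

```python
def lzw_compress(text, alphabet):
    """Return the list of codes LZW emits for `text` over the given alphabet."""
    # One entry per character, labeled 0 .. n-1 (the root's children).
    table = {ch: i for i, ch in enumerate(alphabet)}
    out = []
    w = ""                                 # path walked so far from the root
    for ch in text:
        if w + ch in table:
            w += ch                        # keep walking down the trie
        else:
            out.append(table[w])           # emit the code of the node we stopped at
            table[w + ch] = len(table)     # grow the trie with the next free label
            w = ch                         # restart at the root's child for ch
    if w:
        out.append(table[w])               # flush the node reached at end of input
    return out
```

With a four-letter alphabet, `lzw_compress("baddad", "abcd")` returns `[1, 0, 3, 3, 5]`. Because every node has a label, the input ending at an internal node is no longer a problem: the sketch simply emits that node's code.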

  46. LZW: Byte method example • Suppose our entire character set consists only of the four letters: • {a, b, c, d} • Let’s consider the compression of the string • baddad

  47. Byte LZW: Compress example • Input: baddad • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3 • Output: (empty)

  48. Byte LZW: Compress example • Input: baddad • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3, ba → 4 • Output: 1

  49. Byte LZW: Compress example • Input: baddad • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3, ba → 4, ad → 5 • Output: 10

  50. Byte LZW: Compress example • Input: baddad • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3, ba → 4, ad → 5, dd → 6 • Output: 103
