Cosc 2150: Computer Organization

Cosc 2150:Computer Organization Chapter 2a Data compression

Data Compression • As name implies, makes your data smaller, saving space • Looks for repetitive sequences or patterns in data • e.g. the the quick the brown fox the • We are more repetitive than we think – text often compresses over 50% • Lossless vs. lossy • example: image formats: gif is lossless and jpg is lossy

Data Compression (2) • History: • Not a new concept, dates back to infancy ofcomputers • Perhaps first mainstream use was for transmitting data over phone lines • saved time and money • Also used for archiving data • Cramming as much as possible onto disks,lowered distribution costs

Data Compression (3) • Years of research have gone into data compression, but findings can be compressed into a short list: • LZW compression - Lempel-Ziv-Welch – builds substring dictionary (GIF, TIFF, ZIP, gzip, many others) • Huffman Encoding - uses binary tree and symbols (TIFF, bzip2) • Burrows-Wheeler block sort - transform data to make more compressible (bzip2) • RLE - run length encoded (PCX) • others are lossy - JPG, MPG, MP3, WAV - many use FFT

Data Compression (4) • Disk space cheap now • But more drive space is now needed as well. • Examples: • Bioinformatics regularly deals with databases of information gigabytes in size • 1 DNA string can be 650MB of data • Images: • Uncompressed pictures in raw or TIFF format • say 800x600 can be 50MBs!!!! • But time is money, so want fast decompression too.

Text compression • A test document can be compress from 50% to 5% of the original size. • This is true for ASCII, EBCDIC, even UTC. • Think about the ASCII table, 8-bit used to represent each “character” • Next slide has the ascii table (127 codes). • How many “numbers” can be represented in 8-bit. • 256 • So we can already remove first bit during compression. About %87 compression in a lossless format.

ASCII table

Text compression (2) • With a only text file • We can remove all the non-printable characters from our list as well, dropping 32 more from the list. • But they are the first 32 numbers, so it’s harder. • But First question need to answer some questions then come back to text compression. • Streaming compression • Data is compressed between two points, say the internet • Fixed compression • We have the time/cpu power to analysis and compress a file.

Streaming compression • If the type of data is unknown, then the compression is normally very simple • Run length encoding • Count the bits. 00011000111000111000111111 • Don’t send every bit, instead send: 32333335 • Meaning 3 0s, 2 1s, 3 0s, 3 1s, 3 0s, 3 1s, 3 0s, 5 1s • Very good for data with longer runs of ones or zeros • Really bad for data like this: 1010101010101010101010101 • Will actually cause the data sent to be BIGGER! • If the data type is known, then use a standard compression algorithm. • Example: Audio then use mp3.

Fixed compression • Use a known bit reduction algorithm instead • where very common letters use fewer bits and very uncommon characters use more. Such a space, e, etc use 3 bits, while punctuation and z uses 6 bits. • based on a known symbol trees • In the file “header”, note which symbol tree is used, then compress the file. • This works great for large documents, but is not the best compression.

Fixed Compression (2) • We the cpu has the time to analysis the data • We build our own symbol tree • Add that symbol tree as the header to the file • Then compress • To decompress, it loads the symbol tree from the file and then can decompress file. • A note, for small files, the compressed file may actually be bigger! • Why?

Creating a symbol tree • Analysis of the file • Go through the whole file and count number of times each “character” appears • Build a tree based on the frequency each character appears. • There are more then a few variations on this • Huffman encoding, n-ary Huffman coding, adaptive huffman coding, etc… • Example on the next slide

Example • Using this sentence: • “this is an example of a huffman tree” To encode/decode using the tree: left is a 0 and right is a 1

Example (2) • To encode the 36 characters at 8 bits • 288 bits are needed. • To encode with the symbol tree • 135 bits • Except: it doesn’t account for adding the tree to the file. • Say we have a file containing 10,000 Gs and 5,000 end of line markers. 15,000 characters. • 120,000 bits or about 15KB • Gs could be encoded as 1, end of line markers as 0 • Assume 32 bits for the symbol table • Reduces to 15,032 bits or a little less then 2KB (1,879 bytes) • about 88% compression

Image compression • Like file compression • many different formats • common image formats: TIFF, JPG, GIF, and PNG • There are dozens of other formats. • We'll look at 2 different formats • lossless format: GIF • lossy format: JPG

Bitmaps • In a bitmap image • the image file has to define the exact color of every pixel in the image. • For example, imagine a typical bitmap on the web that is 400 by 400 pixels. • It would needs 24 bits per pixel for 160,000 pixels, or 480,000 bytes (480 MB). • That would be a huge image file • both the PNG/GIF and JPG formats compress the image in different ways.

GIF format • By default, converting to a GIF format, reduces the number of colors to 256 or 8 bits per pixel • and "runs" of same-color pixels are encoded in a color+numberOfPixels format. • For example, if there are 100 pixels on a line with the color 41, the image file stores the color (41) and the length of the run (100). • This makes a GIF file great for storing drawings that have lots of same-color pixels • A GIF format is an perfect reproduction of the original (when used at the same pixel depth).

Example with GIF • This image is 575x194 with 8 bits per pixel • It's about 48KB in a GIF file. • running the number: 575*194*1byte (or 24 bits) = 111550 bytes or 104KB • With the color compression, saved 56KB of space. • Mostly in the white

JPG format • JPEG/JPG is designed for compressing either full-color or gray-scale images of natural, real-world scenes. It works well on photographs, naturalistic artwork, and similar material; not so well on lettering, simple cartoons, or line drawings. • PEG is "lossy," and is designed to exploit known limitations of the human eye, notably the fact that small color changes are perceived less accurately than small changes in brightness.

JPG format (2) • Skipping the complexity of JPG format, you can control the quality of the picture, between 100 being the best (worst compression) and 1 being the worst (but best compression) • Somewhere between 65 to 70 being the "best quality and compression" • You can also choose progressive scan, which speed up rendering or Optimize Huffman codes

Example with JPG • This image is 575x194 with 24 bits per pixel • The file size is 23KB

MP3 • 'official' definition of MP3. • 'MP3' is an abbreviation of MPEG 1 audio layer 3. 'MPEG' is itself an abbreviation of 'Moving Picture Experts Group' • the official title of a committee formed under the auspices of the International Standards Organisation (ISO) and the International Electrotechnical Commission (IEC), • The plan was to find a method for compressing audio and video data streams sufficiently that it would be possible to store them on, and retrieve them from, a device delivering around 1.5 million bits per second — such as a conventional CD-ROM. • The 1993 standard for coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/S, more commonly known as the MPEG 1 standard.

MP3 (2) • The requirement that encoded video and audio must be retrievable from a medium delivering only 1.5 million bits per second dictated that one of the principal concerns was data compression, since the transmission of uncompressed audio and video streams would require many more bits per second. • Music is sampled 44,100 times per second. The samples are 2 bytes (16 bits) long. • Separate samples are taken for the left and right speakers in a stereo system. • So a CD stores a huge number of bits for each second of music: • 44,100 samples/second * 16 bits/sample * 2 channels = 1,411,200 bits per second • An average song last 3 minutes, that's about 32MBs per song.

MP3 (3) • The goal of the MP3 format is to compress a CD-quality song by a factor of 10 to 14 without noticably affecting the CD-quality sound. • With MP3, a 32MB song on a CD compresses down to about 3 MB. • Lossless vs. lossy compression?

MP3 (4) • To make a good compression algorithm for sound, a technique called perceptual noise shaping is used. It is "perceptual" partly because the MP3 format uses characteristics of the human ear to design the compression algorithm. For example: • There are certain sounds that the human ear cannot hear. • There are certain sounds that the human ear hears much better than others. • If there are two sounds playing simultaneously, we hear the louder one but cannot hear the softer one. • With this in mind we can make a "near CD quality" file. • An a very good reduction in size of the file, about 14..10:1 in reduction.

Cosc 2150: Computer Organization